Ok, a while back TekTonic (A’tuin’s ill-favored hosting service) crashed and had all kinds of problems in general, which caused a reboot of my system. Quite understandable, that. Unfortunately, the box had been up for a while (probably 3 months or so of uptime) and they must have applied some kernel patches or something to their system in the meantime, because it has been unstable ever since.

Most notable among these problems is a heinous memory management problem – new processes segfault instead of making something else swap out. Only slightly less notable is the fact that TeamSpeak ceased to work at the same time.

ammon@atuin:/usr/local/tss$ ./teamspeak2-server_startscript start
starting the teamspeak2 server
Runtime error   0 at BFFFFBB8
./teamspeak2-server_startscript: line 92: 27874 Segmentation fault      ./server_linux -PID=tsserver2.pid

Yeah. Fun stuff. Over time, the only difference in the output tends to be the memory address at which it borks out on me ;) Didn’t really bother me much since I wasn’t actually hosting any chat services – I had actually only installed it on a whim in the first place anyways.

I’m also not the only one with this problem. Before they mysteriously vanished, the TekTonic support forums had at least 3 or 4 threads complaining about this exact problem.

Well, earlier this week, I had a discussion with a guild leader on CoV where we talked about possibilities for voice chat. I brought up A’tuin and its currently functional Ventrilo setup – but the Vent license only allows 8 clients to connect to a non-pro account… and they’re very picky about with whom they will do business (pro accounts supporting a minimum of 1000 clients at once spread across multiple servers and stuff).

Our SG has more than 8 members now, and it would be a shame if people were getting rejected just because Vent refuses to sell me a license. So, I started looking back into TS again today. It still crashed.

Using the trick taught to me back during my first Google tech interview, I applied strace to the binary. Lo and behold, the trace reveals that the crash occurs whilst trying to open /usr/lib/locale/en/LC_CTYPE. I checked, and sure enough… my machine seemed to have a broken installation of the locales package… and along with it, a rather old version of libc.
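For anyone who hasn't used strace this way, the idea looks roughly like the sketch below. It is not the exact command from that session: the trace_crash helper and the ts.trace filename are made up for illustration. The -f (follow child processes) and -o (write the trace to a file) options are standard strace flags.

```shell
# Hypothetical helper: run a command under strace, then show the
# system calls logged just before a SIGSEGV.
trace_crash() {
    log=$1
    shift
    # -f follows child processes and threads, -o writes the trace to a file.
    # strace exits non-zero when the traced program dies, so ignore that.
    strace -f -o "$log" "$@" || true
    # Print the SIGSEGV line plus the five lines leading up to it.
    grep -B 5 'SIGSEGV' "$log"
}

# Usage, from the directory holding the server binary:
#   trace_crash ts.trace ./server_linux -PID=tsserver2.pid
```

The last file the process touched before the signal is usually a decent first suspect, though it is only a hint and not proof of the cause.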

So… patching these up gives me progress. I’m going to see what else needs to be done to bring things up to speed, but I might actually get TS running on this monster tonight.

Update – 12:30am, Feb 22

After updating glibc and locales (and practically every other system library), I am still getting the following output from strace:

...snip...
open("/usr/lib/locale/locale-archive", O_RDONLY|O_LARGEFILE) = 4
fstat64(4, {st_mode=S_IFREG|0644, st_size=1583760, ...}) = 0
mmap2(NULL, 1583760, PROT_READ, MAP_PRIVATE, 4, 0) = 0x40170000
close(4)                                = 0
pipe([4, 5])                            = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
rt_sigreturn(0)                         = 136425499
...snip...

Blech. So… it’s closing the locale archive document and then trying to pipe it somewhere else? Don’t know if this is actually the problem though. I really need to learn how to use this utility better. Mumble, grumble. I’m gonna dig through ML and forums and stuff to see if I can come up with anything further.

Update – 7:30am, Feb 23

Well, fooey. I looked at man pages online (since I can’t seem to locate which Debian package actually installs the syscall man pages for me…) and aside from discovering that there is apparently nothing fishy about the pipe call, noticed something else in the sigreturn man page:

sigreturn never returns.

Well, not terribly interesting. If the machine is segfaulting, the kernel is well within its rights to make whatever syscalls are necessary to recover and report error messages to me and stuff, no? So that might be a dead end.

Looking further up through the entire trace, I noticed that attempts to read from this /etc/ld.so.nohwcap file that I’d never heard of were all failing. Google turns up this blog entry. So, I created the file and nothing really changed except that the program doesn’t fail on those particular reads any more. I’m assuming that the access to /etc/ld.so.preload is also harmless, so I will ignore this one.

So, I figured I’d try to force a segfault of my own to compare the two straces for any similarities. Not much there, eh? But when I do an ltrace on my file – it clearly specifies what’s crashing.
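If you want to try the same comparison yourself, one cheap way to get a process that dies of SIGSEGV is to have a shell send the signal to itself. This is a stand-in for illustration, not the test program used here, and a self-inflicted signal will not look the same under ltrace as a genuine bad memory access. It does give you a baseline trace to hold up against the real one. The segv.trace filename is made up.

```shell
# Spawn a child shell that sends SIGSEGV to itself. No memory is actually
# misused; the kernel just delivers the signal on request.
status=0
sh -c 'kill -s SEGV $$' || status=$?

# A process killed by a signal is reported by the shell as 128 plus the
# signal number, and SIGSEGV is signal 11 on Linux.
echo "exit status: $status"

# To capture a trace for comparison:
#   strace -o segv.trace sh -c 'kill -s SEGV $$'
```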

The ltrace of the TS binary is way too verbose to put online anywhere, so yeah. It looks like it’s dying on some assortment of pthreads mutex calls, which I guess is entirely possible and realistic.

Update – 8:50am, Feb 23

The TeamSpeak forums seem to have a handful of threads dealing with this exact sort of problem – which seems to have affected people on multiple distros (including Debian). A German thread is the longest discussion of the problem and seems to have an answer.

They use setarch to fool the machine into thinking that it is something it is not… Erm, ok, I guess. It sounds like problems with architecture-specific behaviors failing when run on the wrong kind of box. That makes some sort of sense.

It especially makes sense on my machine – a VPS on dual Opterons. Not the most normal architecture out there. When you run uname -m, it just says i686… so shrug.

Of course, setarch isn’t available for Debian. The solution given in the German thread is to use alien to turn an RPM into a DPKG for installation ;) I’ll try that out later and see what it does to me.
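As a rough sketch of what that workaround might look like once setarch is installed: the run_as_i386 wrapper below is a hypothetical helper, and choosing i386 as the architecture is an assumption on my part, since the exact invocation from the thread isn't reproduced here. The package filenames in the comments are placeholders too.

```shell
# Hypothetical wrapper: run a command under the i386 personality if
# setarch is installed, otherwise run it unchanged.
run_as_i386() {
    if command -v setarch >/dev/null 2>&1; then
        setarch i386 "$@"
    else
        "$@"
    fi
}

# Getting setarch onto Debian via alien (filenames are placeholders):
#   alien --to-deb setarch.rpm    # writes a .deb into the current directory
#   dpkg -i setarch*.deb
#
# Then, from the TeamSpeak directory:
#   run_as_i386 ./server_linux -PID=tsserver2.pid
```

Falling back to running the command directly keeps the wrapper harmless on a box where setarch never got installed.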

Ok, some time around 11:30ish local time, I noticed that A’tuin was no longer black holing connection attempts. After a bit of difficulty, I managed to actually connect to the Virtuozzo console – but couldn’t log in. After a few more minutes, it actually let me log in to VZPP, but I noticed that I did not have a functional SSH daemon running.

Much toil and one hour later, I discovered a weird problem in the system startup scripts and repaired them to actually boot our services correctly again.

The database server is running, the mud is running, the web server is running. Life is, in general, good again.

I am wondering what the official response by TekTonic will be on this issue – since after something well over 30 hours, they still haven’t made a public statement on the outages forum, and haven’t given satisfactory answers to my trouble tickets.

Oh well, the problem is over for now. I am looking for alternative hosting, but am not very hopeful of finding anything with this set of server options anywhere near this price bracket. I wonder if an entire day of unplanned and unexplained downtime is worth the thousand dollars or so a year that I am saving by going with them as opposed to my next best option? Dunno.

At least I’ll be able to sleep tonight w/o waking up to 50 user complaints :P

As some of you may know, I have stopped hosting my own servers recently – instead paying for space on other people’s machines. This site (as well as a few others I run) is located on pair Networks space.

However, several of my projects (the mud for one) require more than mere web space. Thus, I investigated alternative options and have been enjoying TekTonic’s VPS service for the past several months (the hostname being A’tuin).

Yesterday afternoon, they had a serious networking problem. I found out about it around an hour after things went down (my IM started ringing off the hook as users from all over checked in to see what was wrong). I submitted a trouble ticket and got a prompt answer – that things were down and they were fixing them.

An hour later, a sales rep closed my ticket, saying that the problem had been resolved and the affected servers were booting back up.

An hour later, we still didn’t have service. So I started sending follow-ups to the first ticket in hopes of receiving news of the problem. No such luck. They didn’t respond to any of my queries, and they didn’t post any news of the problem to their support forums.

This morning, I sent a 911 ticket to the sales department, and after about 20 minutes, got another short response. This is the last I have heard from them. They’ve still not posted a formal announcement of the details and I am definitely not the only user being affected by this.

The TAMS Alumni site is probably located on the same machine as mine (since they’re successive IPs), and some other users have complained on the forums (this thread, not about our problem but it was the newest thread in the outage category). Another good thread on the subject on their forums is here. There are at least 5 more threads all related to this same problem – and still no official announcement on the subject :P

My saga of tickets so far goes like this (timestamps are east coast):

#33610: atuin.simud.org unreachable

me, 01/16/2006 6:24:09PM

For at least an hour it seems, my VPS has been inaccessable via any means. An SSH session I had open to the machine was just hung and any attempts to connect to any services running on the account or log in to the web interface at [address] have failed.

Since there is no message posted in the forums, I am assuming this is a new issue. Thank you for getting this back online as soon as possible.

them, 01/16/2006 6:25:49PM

Hi
There is a network issue at present that is affecting a large number of our servers.
We are working on the issue as fast as we can.

Rob
Support

them, 01/16/2006 7:29:42PM

We had some power issues on some the racks, it has been taken care of, the
servers are coming back up and may already be up.


-Ryan M. Adzima
Tektonic Network Solutions | sales@tektonic.net

[ticket closed]

me, 01/16/2006 8:19:06PM

It has been 50 minutes since you declared the problem solved, yet, my machine is still down.

me, 01/16/2006 9:22:31PM

Hello? Any response would be nice.

My VPS is still not up, it has now been about two hours since you said that devices were coming back up. I’m guessing that it doesn’t take this long to boot a machine.

[two more posts of this nature snipped because they're not that interesting]

#34155: poor customer support

me, 01/17/2006 12:32:46PM

Howdy.

Yesterday afternoon, there seems to have been a fairly big networking/hardware problem that affected multiple servers, including the one that my services are operating off of.

I submitted a trouble ticket and got an almost immediate response – they were actually working on the problem. Then, an hour later, Ryan Adzima closed my ticket saying that problems had been resolved.

They have not. My machine is still inaccessable, and despite my submitting multiple follow-up requests to the ticket, I have not heard back from the support department since they closed my issue in the first place.

As I am rapidly approaching 24 hours of downtime and since that ticket (#33610) is apparently being ignored, I am attempting submission of a new one in the hopes that I will actually get a response this time.

<Insert angry words here>

them, 01/17/2006 12:53:42PM

The support department is quite busy right now dealing with an outage on a particular server that was affected by the power issue. They are troubleshooting the problems but it looks like it may be a hardware issue. I understand that this is an unacceptabel amount of time, but the team has been up all night dealing with it.


-Ryan M. Adzima
Tektonic Network Solutions | sales@tektonic.net

[ticket closed]


I love how he can’t spell the word unacceptable :)