Server crashing randomly + segfaults

Hello, recently we’ve been trying out various builds on our Linux server - we’ve tried 2744 and 2788 - and we’ve been encountering an issue where FXServer will randomly just close itself with no output on either stderr or stdout.

The only indication I’ve found that anything has been going wrong is in /var/log/syslog.

Here is an excerpt of various segfaults:

#running 2744, nothing happened
Jul 30 17:02:14 kernel: ld-musl-x86_64.[26522]: segfault at 7f1097d0e4e0 ip 00007f10c871b289 sp 00007f10b8c20028 error 4 in libcitizen-scripting-mono.so[7f10c85da000+281000]
#running 2744, crashed at the same time
Aug  1 22:07:30 kernel: luv_svMain[24975]: segfault at 7f0800000009 ip 00007f085d4e04ff sp 00007f0851951cd0 error 6 in libnet-http-server.so[7f085d4ce000+67000]
#running 2744, nothing happened
Aug  3 10:46:28 kernel: luv_svMain[2675]: segfault at 79 ip 00007f030926977c sp 00007f02fe3a9c10 error 4 in libcitizen-resources-core.so[7f030924a000+9c000]
#running 2788, crashed at the same time
Aug  4 04:56:13 kernel: luv_svMain[10892]: segfault at 0 ip 0000000000000000 sp 00007f9db55f4198 error 14

It’s also crashed a few more times with nothing in syslog at all, so I have no idea what could be causing this.

If there’s any extra information I can provide, just let me know (and how).


How do you even have so many different crash origins?

Anyway, I don’t think this is even debuggable that way, given that core dumps from a non-system libc (if you can even enable them; every distro has its own mechanism for this) are entirely useless and not even loadable (failing with errors that have no meaningful Google results) due to flaws in… gdb? glibc? musl? Linux’s dynamic loader design? the core concept of POSIX signals?

If you’re having any sort of load at all, please, just run Windows, even if it’s on a CPU-pinned dedicated-IO VM on your Linux machine with netfilter doing forwarding of packets.

The facts that ETW ‘just works’ unlike perf, SEH is vastly superior to POSIX signals, there’s no weird libc-mismatch debacle when trying to debug anything, unwind info is included in every binary rather than in some scarce -debug package that’s often missing, WER dumps are 100% reliable compared to systemd-coredump or whatever, and all those other tiny things make it… much, much easier to diagnose and tune for load on Windows at all.

And that’s not to mention non-diagnostic-related concerns such as the memory fragmentation seen with common Linux libc allocators, the trouble with interposing any other allocator, and so on.

If that isn’t an option, try enabling core dumps (very distro-specific…) and recreating a debug tree using, uh, the .NET Core SymClient utility like this:

# build and install it somehow beforehand (https://github.com/dotnet/symstore/tree/master/src/SymClient)
./SymClient -s https://runtime.fivem.net/client/symbols/ -oi --symbols -d $FXROOT/alpine/opt/cfx-server/*.so $FXROOT/alpine/opt/cfx-server/FXServer

… and then chroot’ing (chroot .../alpine /bin/sh) into the FX alpine/ dir, running apk add gdb, and loading the core dump in GDB to show a backtrace.
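
Roughly, the whole sequence looks something like this - the core-dump setup is illustrative and distro-specific (yours may route cores through systemd-coredump or a custom core_pattern instead), and the core file path is just an example:

# on the host: make sure cores can be written at all
ulimit -c unlimited
sysctl kernel.core_pattern          # check where cores actually end up

# drop the core into the tree and debug from inside the chroot
cp /path/to/core $FXROOT/alpine/root/core
chroot $FXROOT/alpine /bin/sh
apk add gdb
cd /opt/cfx-server
gdb ./FXServer /root/core
# ... then at the (gdb) prompt:
#   bt full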

How do you even have so many different crash origins?

:man_shrugging:

If you’re having any sort of load at all, please, just run Windows, even if it’s on a CPU-pinned dedicated-IO VM on your Linux machine with netfilter doing forwarding of packets.

Yeah, we were running a Windows VM for some time using qemu/VirtIO, but it started shitting itself at ~120 players due to single-core CPU load (loads of hitch warnings), so we moved back in May.

If that isn’t an option, try enabling core dumps (very distro-specific…) and recreating a debug tree using, uh, the .NET Core SymClient utility like this:
… and then chroot’ing (chroot .../alpine /bin/sh) into the FX alpine/ dir, running apk add gdb, and loading the core dump in GDB to show a backtrace.

I’ll try that and see if I can get it working

There’s a reason I constantly mention CPU thread pinning when I recommend qemu-kvm. Guides for setting up ‘gaming VMs’ with GPUs generally have good guidelines for handling CPU pinning correctly, since that’s also a single-core-heavy, latency-sensitive workload.
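
For example, with a libvirt-managed guest it can be done from virsh; the domain name (‘fxwin’) and the host thread layout below are just assumptions, so check lscpu for your actual topology:

# pin 4 vCPUs onto two dedicated physical cores (threads 2/6 and 3/7 here),
# and keep the emulator threads off those cores
virsh vcpupin fxwin 0 2
virsh vcpupin fxwin 1 6
virsh vcpupin fxwin 2 3
virsh vcpupin fxwin 3 7
virsh emulatorpin fxwin 0-1,4-5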

Ah, I’ll try that if/when we go back to the VM; it’s a shame, as we hadn’t had any issues with Linux FXServer up until the last week or so - we’d been running very stable for weeks at a time between restarts, even :frowning:

Do you happen to know a way to reduce host CPU load during high network usage as well? At 800+ Mbit it would pin 4 cores to almost 100% (not that that’s a problem, it’s just high). We were using the virtio network driver with e1000 in the qemu setup, as that’s what I generally found recommended online.

I suspect the recent uptick in load issues might be related to some new threat actors finding a few new silly points of denial-of-service.

As to I/O, there’s no such thing as using virtio together with e1000 at once - while e1000 might be fine at lower loads, netkvm with plain virtio seems to handle even higher loads without too much guest CPU time. Plus, you might have to pin the I/O thread anyway.
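
With plain qemu, something along these lines is the usual multiqueue virtio + vhost-net setup (the flags are illustrative; the libvirt equivalent is <model type='virtio'/> plus <driver name='vhost' queues='4'/> on the interface). vhost-net does the packet processing in kernel vhost threads instead of the qemu process, which is usually more efficient on the host side:

qemu-system-x86_64 \
  ... \
  -netdev tap,id=net0,vhost=on,queues=4 \
  -device virtio-net-pci,netdev=net0,mq=on,vectors=10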

If your server runs on a higher-end platform, you might actually have the luxury of SR-IOV NICs and can directly pass through the host NIC, but I don’t know how common this is at normal server providers.

As to I/O, there’s no such thing as using virtio together with e1000 at once

Yeah, my bad - we started with e1000 but didn’t notice any major difference moving to <model type='virtio'/>, so I got it confused.

without too much guest CPU time

I was specifically asking about it using a lot of host CPU time, not guest; guest CPU time was fine apart from 1 or 2 cores getting pinned by FXServer (this was before the new sync thread was introduced).

If your server runs on a higher-end platform, you might actually have the luxury of SR-IOV NICs

Our NIC is an I219-LM (i7-7700), which sadly doesn’t seem to support SR-IOV from what I can find online.
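
For reference, one way to double-check is looking for the SR-IOV capability in lspci; the PCI address below is illustrative, take the real one from lspci | grep -i ethernet:

lspci -vv -s 00:1f.6 | grep -i 'single root'
# no "Single Root I/O Virtualization (SR-IOV)" line = the NIC simply doesn't expose it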

Alright, so I built SymClient and downloaded the various CFX symbols using the stated command. I then simulated a server crash using _crash and moved the 800 or so MB core dump from server-data to $FXROOT/alpine/root. I then chrooted into $FXROOT/alpine and cd’d into /opt/cfx-server.
When I run

gdb /opt/cfx-server/FXServer /root/core

and then do a bt full, it still prints:
[screenshot: gdb bt full output with no symbol information]

What am I doing wrong? :confused:


It might be that gdb has a few weird search paths that aren’t typically expected and that I forgot to note down, since the steps I took with SymClient were meant for perf (which, FYI, doesn’t work until Alpine ships libstdc++ symbols :frowning:).

Generally, I believe, the default search path is along the lines of, uh

/a/b.so -> /usr/lib/debug/a/b.so.debug

(inside the chroot, in this case)

I don’t know how easily this can be changed in gdb, but a quick move command for the .dbg files may help as well.
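
Something along these lines inside the chroot might do it; whether gdb then actually picks the files up depends on the debuglink names embedded in the .so’s matching the .dbg file names:

# mirror the binaries' directory layout under /usr/lib/debug
mkdir -p /usr/lib/debug/opt/cfx-server
cp /opt/cfx-server/*.dbg /usr/lib/debug/opt/cfx-server/
# or point gdb at extra directories explicitly
gdb -ex 'set debug-file-directory /usr/lib/debug:/opt/cfx-server' /opt/cfx-server/FXServer /root/core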

[screenshot: gdb startup output]
It does say this initially when it’s executed, but it still says “No symbol table info available” later in bt.

If I move the .dbg files to /usr/lib/debug/opt/cfx-server, it prints
[screenshot: gdb output after moving the .dbg files]
But again, “No symbol table info available” in bt full :frowning:

Hm. “info sharedlib” output?

Here you go
[screenshot: info sharedlib output]

Here is the full startup print
[screenshot: full gdb startup output]

Yeah… that stupid gdb issue. If gdb was indeed running inside the chroot, that’s fucked, and it’s one of the silly behaviors that make me suggest Windows, as a core dump written against a different libc is apparently already inherently useless.

Ah yes, I forgot to mention: when it crashes we get a bunch of these in syslog as well:

Aug  5 03:19:29 kernel: ptrace attach of "/home/gameserver/FiveM/server/alpine/opt/cfx-server/ld-musl-x86_64.so.1 --library-path 
	/home/gameserver/FiveM/server/alpine/usr/lib/v8/:/home/gameserver/FiveM/server/alpine/lib/:/home/gameserver/FiveM/server/alpine/usr/lib/ --
	/home/gameserver/FiveM/server/alpine/opt/cfx-server/FXServer +set citizen_dir 
	/home/gameserver/FiveM/server/alpine/opt/cfx-server/citizen/ +exec server.cfg +set onesync_enabled 1 +set onesync_enableInfinity 1 +set onesync_enableBeyond 1"
[14422] was attempted by "/home/gameserver/FiveM/server/alpine/opt/cfx-server/ld-musl-x86_64.so.1 
	--library-path /home/gameserver/FiveM/server/alpine/usr/lib/v8/:/home/gameserver/FiveM/server/alpine/lib/:/home/gameserver/FiveM/server/alpine/usr/lib/ -- 
	/home/gameserver/FiveM/server/alpine/opt/cfx-server/FXServer +set citizen_dir /home/gameserver/FiveM/server/alpine/opt/cfx-server/citizen/ -dumpserver:4 -parentppe:6"

There are 10 of these in a row, separated by a few ms, followed by a segfault - in this case:

Aug  5 03:19:29 kernel: luv_svMain[14443]: segfault at 0 ip 0000000000000000 sp 00007ff2f43a9198 error 14

Does this mean anything useful :smile:? We’re working towards Windows soon; I’m just chasing any last leads…


Huh, so the dumpserver did indeed try to ptrace - but why did it fail?

Is this running in a container, an odd SELinux context, or some other setup where it can’t really ptrace?
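
A few quick things worth checking on the host, assuming a stock-ish Ubuntu setup (Yama’s ptrace_scope in particular restricts attaching to non-child processes by default there):

cat /proc/sys/kernel/yama/ptrace_scope   # 0 = unrestricted, 1 = only descendants or explicitly allowed tracers
sudo aa-status                           # any AppArmor profile that might cover FXServer?
sestatus 2>/dev/null || echo "no SELinux tooling installed"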

No, it’s running on native Ubuntu 18.04. I don’t remember touching SELinux, so… whatever the default is?


AppArmor? Would still be odd :confused:

Nope, I don’t have any AppArmor profiles set up at all for FXServer in aa-status. In fact I haven’t touched AppArmor at all :sweat_smile: