Mass client drops / Network events breaking

We have been having issues recently with our FiveM server dropping all connected clients. It happens every 3-6 hours during peak time, and causes all players to suddenly drop from the server.

Description of Issue
Just before everyone disconnects, players can still see each other moving around and can interact with each other; however, net events stop working entirely. In the server console there is a large amount of “network thread hitch warning” spam, usually with each hitch being < 500ms.

After everyone drops, the server is unrecoverable. People can still join, however they never make it off the loading screen. They drop shortly after they finish loading (according to the F8 console), and a server restart is needed to allow people to connect again.

During this time, I set the server capacity to 5 slots and was able to connect (for no more than a few minutes). It would seem that broadcast events (client events which target -1) worked flawlessly; however, events directed at a specific player only worked sporadically.

Once the player is unable to receive directed net events, they disconnect shortly after.
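
For clarity, by broadcast vs. directed I mean the two ways the server can trigger a client event; the event name and payload in this sketch are just placeholders:

```lua
-- server-side Lua; 'myResource:ping' and the payload are placeholders
local payload = { sentAt = os.time() }

-- broadcast: every connected client receives this (target -1)
TriggerClientEvent('myResource:ping', -1, payload)

-- directed: only the client with the given server ID receives it
for _, playerId in ipairs(GetPlayers()) do
    TriggerClientEvent('myResource:ping', tonumber(playerId), payload)
end
```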

Server Details
Currently running on v3961. At one point I had a modified scheduler.lua to log network events sent by the server. We used this to try to find problematic patterns, but have not had much luck yet. Crashes occur without the modified file as well.
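
The scheduler.lua change was only extra logging. A rough, hypothetical sketch of the same idea done from a normal resource instead (note it only sees events triggered by that one resource, since each resource has its own globals) would be:

```lua
-- server-side Lua, placed in the resource whose events you want to log
local origTrigger = TriggerClientEvent
local origLatent = TriggerLatentClientEvent

TriggerClientEvent = function(eventName, target, ...)
    print(('[netlog] %s -> %s'):format(eventName, tostring(target)))
    return origTrigger(eventName, target, ...)
end

TriggerLatentClientEvent = function(eventName, target, bps, ...)
    print(('[netlog latent] %s -> %s @ %d bps'):format(eventName, tostring(target), bps))
    return origLatent(eventName, target, bps, ...)
end
```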

ETW Trace
Here are a couple of ETW traces from when the server is in the broken state; one is a trace to file, and the other uses circular trace buffers (not sure which is more useful).

etwtraces.zip (76.4 MB)

Client Error Message
The message the clients get is shown below (just a standard timeout message).

One thing of note is “syncTrackedVehicles”, which is a net event that fires fairly frequently, about every 2.5 seconds. However, other net events used to show up here too; we removed them to test whether it would improve things. They seem to appear in the list simply because they are frequent, so they are potentially not the cause.

I’ll keep updating this post with more information as I discover it, though some of the info here might not be entirely accurate because it is quite challenging to decipher things.


That’s curious, as both are implemented the exact same way:

Right, does this happen one version below as well? Merged a PR for a svgui player list at that point and I don’t really trust the mutex usage there.

By that measure, a dump may be more helpful than a trace while broken, since it looks as if something is hung despite the watchdog not caring.

… though from the trace it appears that some latent event stuff is hanging the network thread here, which implies the bug fix from earlier, while fixing the original repro, may be leading to long loops in some unknown cases. :confused:

Downdated to v3874, which another server that isn’t having issues uses. Will see if that helps…

I will get a dump next time it happens; right now I have downdated and deleted all the existing crash dumps like a spud. That said, it typically doesn’t end up crashing, it just drops everyone. It only very rarely generated a dump, back when we had significantly more net events. We have removed / disabled a lot of them to see if that would mitigate the issue, which it did not.

If crashes persist, I can try removing all latent events and see if it improves. Is there anything else I can provide here to help? Because I’d like to keep using the latent ones haha.

“A dump” in the case of hangs would usually mean one made from Task Manager.

Server crashed on 3874; the dump is uploading. It is about 2 GB compressed, 10 GB uncompressed.

(If there are issues with the download link, I can reupload it wherever is better.)

I have removed the latent events, of which we only had 2. Both of these targeted specific players and did not target -1. This has completely resolved the issue, which confirms what you said earlier.

Let me know if you need anything further, or if there is anything I can do to provide more information about the latent events. I did see this related post which could be linked.

Does that mean latent events were often using ‘a lot’ of CPU time even before the attempted fix?

What is the event size and bps rate you’ve been sending with?

The dump does seem to show that disconnecting clients aren’t correctly getting dropped from the send list. Curious.

One likely cause could be that you’re trying to trigger a latent event on a client that’s already gone, which doesn’t seem to be validated for, and that entry will then linger in the send list indefinitely.

This issue sounds a lot like an issue I get when using a proxy server in some cases.

People connect properly but get dropped with a last-seen of around 100 ms, so they are effectively still connected yet get dropped anyway. They can still see everybody moving when it happens.

I don’t really have any more info than this; I was looking in the direction of network buffer overflows, but have no clue whether that would be related to this at all. Just throwing my own experience in here.

Newer server versions have a few more diagnostics for this specific kind of timeout.

I didn’t notice anything specifically, but I also wasn’t looking for it. Right now I have left the latent events removed; however, I will try to do a controlled test within the next week.

One of them would loop through a list of players in a table (a rough sketch of both patterns follows the list):

  • Event size of no more than 500B
  • Probably about 1-3 seconds between events being triggered, but depends on usage of the people using the script (it is for radios, when people toggle them on / off)
  • BPS of 25k

The other is a 2.5 second loop:

  • Event size of up to 2500-3000B
  • BPS of 25k
  • Triggers on all players using -1
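
Roughly, the two triggers look like the sketch below. The event names, payloads, and the listener table are made up; only the sizes, rates, and targets match what’s listed above:

```lua
-- server-side Lua; names and data structures here are placeholders
local radioListeners = {}  -- set of server IDs the radio script keeps up to date

-- pattern 1: radio toggles, directed at specific players, <= ~500 B payload, 25k bps
local function notifyRadioToggle(state)
    for playerId in pairs(radioListeners) do
        TriggerLatentClientEvent('radio:toggle', playerId, 25000, state)
    end
end

-- pattern 2: 2.5 s broadcast loop, ~2500-3000 B payload, 25k bps, target -1
CreateThread(function()
    while true do
        Wait(2500)
        local payload = {}  -- stand-in for whatever the real script gathers
        TriggerLatentClientEvent('mainSync:update', -1, 25000, payload)
    end
end)
```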

This is definitely possible. Whilst the script does handle removing players from the list of people that an event would trigger on, it is very possible that the event is sent before this occurs. This will be very difficult for me to resolve, since any delay would cause this to break (I’ve sketched the kind of guard I could add below).

Edit: I have just noticed a few potential fixes in the commit history; I will try them within the next week.
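
The kind of guard I mean: check that the player is still known to the server right before the trigger, and prune our own list on playerDropped. Whether a nil GetPlayerEndpoint is sufficient validation here is an assumption on my part, and it only narrows the race window rather than closing it:

```lua
-- server-side Lua; radioListeners is the hypothetical set of target server IDs
local radioListeners = {}

-- only trigger if the target still looks connected
local function safeLatentTrigger(eventName, playerId, bps, ...)
    -- GetPlayerEndpoint returns nil for players the server no longer knows about
    if GetPlayerEndpoint(tostring(playerId)) then
        TriggerLatentClientEvent(eventName, playerId, bps, ...)
        return true
    end
    return false
end

-- prune dropped players from our own list as soon as the server drops them
AddEventHandler('playerDropped', function()
    radioListeners[tostring(source)] = nil
end)
```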

Have tried the latest artifacts, and the issue appears to be resolved. No more crashes even when using the original set of latent events!

Yes, after a lot of research I figured out this was caused by network events flooding the server and triggering some kind of security measure at the hosting company, which caused the same packet to be dropped constantly; the server then gives up after a while since it can’t deliver the packet to the client.

So not really a FiveM related issue after all.

Basically, using latent events for events which send a lot of data fixed it; and speaking of which, latent events appear to work perfectly as of the fix from a while ago.
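
For anyone finding this later, the change was essentially swapping the ordinary trigger for a latent one on the large payloads, so the data is trickled out at a fixed rate instead of being sent in one burst (names and values below are placeholders):

```lua
-- placeholders for illustration
local playerId = 1
local bigPayload = { lots = 'of data' }

-- before: ordinary event, the whole payload goes out at once
-- TriggerClientEvent('myResource:bigSync', playerId, bigPayload)

-- after: latent event, the payload is streamed to the client at ~25,000 bps
TriggerLatentClientEvent('myResource:bigSync', playerId, 25000, bigPayload)
```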