Sync thread hitching above 2935

Hello,
after updating from 2935 to 2949 or 2960, our server starts spamming sync thread hitch warnings after ~6 hours of uptime. We had 0 hitches on 2935 after 2 days of uptime.

I assume it’s related to https://github.com/citizenfx/fivem/commit/f9da34d62f4f224facbdcc7359e235d16fdf3494 but I am only speculating…

etw:
https://impulse99.com/files/etw_2020-09-22_15-08-45.7z

Server CPU is an i7-7700, and although fxserver is running in a windows VM, it starts hitching on native linux as well.

Yeah so at 150 players with 6 cores allocated to the VM, 2935 was using 10-15% total CPU usage vs using 22-35% now on 2960 :confused:

The sync thread is also getting slower and slower as time goes on, it was hitching at 100-110ms when I posted this thread two hours ago, now it’s up to 250ms hitches. Here are some very rough hitch readouts I wrote down manually

Players RAM (GB) approx hitches (ms)
100 1.04 105
145 1.38 190
151 1.49 210
166 1.49 225
164 1.54 240
152 1.56 240
160 1.56 250

etw from 240ms hitches:
https://impulse99.com/files/etw_2020-09-22_17-46-50.7z

Seems to rather be something regarding https://github.com/citizenfx/fivem/commit/851acff3fdd50b4238a53e44e7406564a59fd664 though I’m not sure what conditions would lead to this… extreme packet loss? High amounts of entities within scope? Hard to analyze without it happening ‘live’.

It seems to have ceased towards the evening (there were still 160 players online at the time)

high amounts of entities

We have all empty cars + traffic being cleared every 5 minutes so this shouldn’t be the case

extreme packet loss

I don’t think it’s packet loss as I don’t see any logged issues scrolling back with Netdata, so our end seems to be fine.

On further investigation there was a guy who joined at 13:54, only loaded in 2.5 hours later at 16:28, then was detected as being AFK again after 10 minutes of not moving at 16:40.

The hitching started exceeding 100ms and being printed at 15:10, while it stopped at 19:22:21… and what happened exactly 60 seconds later? The same guy got “Timed out after 60 seconds”.
Maybe the guy was alt tabbed and it was messing up the packets?

So to sum it up, timeline of events:

Time Event
13:54 guy starts connecting
15:10 sync thread starts exceeding 100ms threshold and being printed as hitches
16:28 guy finishes loading in and spawns
16:40 guy is detected by our script as being AFK due to not moving for 10 mins
19:22:21 hitches stop
19.23:21 guy times out after 60 seconds

If or when this happens again, is there any more data I could provide you?

Right, so perhaps this player had packet lo-

oh.

Yeah that might be a bug in the client actually and I guess not moving or controlling any entity which moves leads to the server getting infinite delta path processing… at least if logs early on made any sense.

As a workaround you can start setting some wander task on ‘AFK’ players or something that’ll make them update a sync netobj.

Odd how someone can be getting delta state sent to them before whatever logic you use spawns them though, might need a server-side cap as well for delta states as jeez that’s bad. Would’ve led to infinite memory growth even before but perhaps worse now (aka CPU hangs too) since every delta is evaluated.

whatever logic you use spawns them

we just use spawnmanager to autospawn as soon as they finish loading

server-side cap

by this you mean pcap right?

infinite memory growth

we do have quite bad stints of memory quite rapidly going up occasionally :sweat_smile:


Maybe this could be the reason behind them?

perhaps worse now (aka CPU hangs too) since every delta is evaluated.

Well I did pm you about some sync thread hang crashes we were having a week or so ago but they went away on their own recently, at its worst we had like 10 sync thread hang crashes in 18 hours

mm, i expected longer waits but might be someone with massive streaming contention

nah, i mean logic on the server to limit delta states stored, heh, if some players can just not ack any for hours upon hours it seems

unbounded growth could be such, especially sudden drops might be correlated to such ‘hung’ users getting dropped and their client data freed

mm, i expected longer waits but might be someone with massive streaming contention

possibly, their download time during the playerConnecting->playerJoining phase was 16s for ~50 MB of rpf’s so their internet doesn’t seem too bad (can some sort of on-the-fly gzip be added to the rpfs? vehicle resource has 25 MB worth of metas alone and can be reduced to <2 MB with gzip -1). They might have spawned in a congested area with a few people in add-on cars nearby, but that shouldn’t be more than 100-200 MB or so at the very worst though, definitely not 2 hours worth of streaming

especially sudden drops

the very large drops to <1GB is just the server being restarted as this graph’s range is over the past 7 days

weird dualcores while running stuff in the background might always be worse than connectivity problems

ah. unbounded growth of this type should not be continuous at least so there’s likely something else growing still sometimes

as to the compression suggestion, sounds like a great way to get cpu load issues and caching compressed content might also lead to extra oddities, but will consider again (fiPackfile e.g. used to have support for inline compression but that might ruin seekability, so needs some evaluation)

weird dualcores

their CPU had 4 threads so yeah it could be a shitty 2c/4t CPU, the game itself was running in borderlesss windowed at 1080p

cpu load issues

could be compressed on server start-up and then uncompressed on each client when it’s downloaded on initial connect, that way it would only load the CPU on initial join and not during gameplay, wouldn’t it?

edit: oh yeah it also fetches rpfs when you restart a resource mid game :confused:

*server, it actually doesn’t send any frames that are empty so the client doesn’t get a chance to ack

good, so it’s just one side needing changing

Not really able to test stuff, but:

Cool, we’ll switch to it as soon as the build is done and report back if any issues arise

Given that you do seem to be running it currently at least it’s not worse than past versions, I’m hoping?

Seems to work fine so far, we’ve had 4 people dropped due to not acknowledging 1201/1501 sync packets in the past 2 hours.
The much higher CPU usage on 2935+ I mentioned in a previous post calmed way down after the hitching stopped yesterday, it was then only 2-4% higher than 2935. I’ll let you know with more certainty in the coming 8 hours or so as our player count picks up

Curious - out of how many unique connections?

5 now, out of 343 total connections from 258 unique licenses

So far 3.5% of unique players have been dropped from it or about 2.2% of all sessions.
So yeah it seems to work fine! CPU usage is fine as well, barely any higher than 2935
Memory usage is staying lower than usual as well :slight_smile:

The only odd thing I’ve seen were 4 network thread hitches as one guy was being dropped for not acknowledging, but it only happened once so I’m not worried about it, could just be a coincidence

Sep 23 19:42:46: network thread hitch warning: timer interval of 153 milliseconds
Sep 23 19:42:46: network thread hitch warning: timer interval of 185 milliseconds
Sep 23 19:42:47: network thread hitch warning: timer interval of 185 milliseconds
Sep 23 19:42:47: Player xxxxx dropped (Reason: Didn't acknowledge 1201 sync packets.)
Sep 23 19:42:47: network thread hitch warning: timer interval of 185 milliseconds

yesterday i saw same thing on mine!
Reason: Didn’t acknowledge 1201 sync packets.
Reason: Didn’t acknowledge 1309 sync packets
Didn’t acknowledge 1444 sync packets

network thread hitch warning: timer interval of 223 milliseconds
network thread hitch warning: timer interval of 228 milliseconds
network thread hitch warning: timer interval of 214 milliseconds
network thread hitch warning: timer interval of 276 milliseconds
network thread hitch warning: timer interval of 317 milliseconds
network thread hitch warning: timer interval of 232 milliseconds
network thread hitch warning: timer interval of 172 milliseconds
network thread hitch warning: timer interval of 170 milliseconds
network thread hitch warning: timer interval of 266 milliseconds
network thread hitch warning: timer interval of 269 milliseconds
network thread hitch warning: timer interval of 254 milliseconds
network thread hitch warning: timer interval of 257 milliseconds
network thread hitch warning: timer interval of 316 milliseconds
network thread hitch warning: timer interval of 253 milliseconds
network thread hitch warning: timer interval of 180 milliseconds
sync thread hitch warning: timer interval of 120 milliseconds
sync thread hitch warning: timer interval of 129 milliseconds

Every 10 minutes I get someone kicked for Didn’t acknowledge X sync packets