[FXServer] Server desync, instancing, hitching

We finished up porting our server code from CFX to FXServer (artifact 300), running on Linux, after the “threadsafe” fix was implemented as per [FXServer] Random error(s)

Testing with ~10 people doing random tasks on the server worked well, so we pushed to production today.
It did not go well.

Trying to run a production server with 32 slots failed with laggy commands, in-game world state, and instancing.

Our dedicated (non-VPS) hardware is practically asleep, and players were reporting extremely high lag. Many players were timing out, the server never quite reached capacity (32 of 32 players), and many players were getting randomly instanced. I was on the server, but but did not experience too much interruption and did not get instanced myself.

The MySQL database server has extremely low load as well.

Running CFX server has none of these problems on the same hardware.

Our players come from all over the world, and some have very poor, low-bandwidth connections.

One item in the logs caught my attention: the number of vouches. This is really suspect to me.

Extracted from the logs:

Received a vouch for Prion, they have 2 vouches and need 87810.
Received a vouch for Bella, they have 2 vouches and need 87812.
Received a vouch for Ricky, they have 2 vouches and need 26.
Received a vouch for SirSleepyEyes, they have 2 vouches and need 87812.
Received a vouch for Mindless, they have 2 vouches and need 87809.
Received a vouch for KTGgaming6666, they have 2 vouches and need 87809.
Received a vouch for UglierGoblino, they have 2 vouches and need 87808.
Received a vouch for Rekan Martinez, they have 2 vouches and need 27.
Received a vouch for Anando, they have 2 vouches and need 87807.
Received a vouch for Bubis, they have 2 vouches and need 87807.
Received a vouch for M_tt, they have 2 vouches and need 87806.
Received a vouch for Mindless, they have 2 vouches and need 43.
Received a vouch for gnosis., they have 2 vouches and need 87808.
Received a vouch for FarmTheGamer, they have 2 vouches and need 22.
Received a vouch for UglierGoblino, they have 2 vouches and need 87810.
Received a vouch for Fortimbraz, they have 2 vouches and need 20.
Received a vouch for AndyyCampbell, they have 2 vouches and need 87808.
Received a vouch for Addylad™, they have 2 vouches and need 87810.
Received a vouch for redwolfmoon99, they have 2 vouches and need 87809.
Received a vouch for FarmTheGamer, they have 2 vouches and need 87809.
Received a vouch for Ricky, they have 2 vouches and need 87809.
Received a vouch for Ricky, they have 2 vouches and need 87809.
Received a vouch for redwolfmoon99, they have 2 vouches and need 87810.
Received a vouch for Dishes, they have 2 vouches and need 87810.
Received a vouch for Zephahniah, they have 2 vouches and need 87809.
Received a vouch for Dishes, they have 2 vouches and need 87809.
Received a vouch for Prion, they have 2 vouches and need 87808.
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Got a late vouch for acidshadow - they're the current arbitrator!
Received a vouch for redwolfmoon99, they have 2 vouches and need 87808.
Received a vouch for FarmTheGamer, they have 2 vouches and need 87808.
Received a vouch for redwolfmoon99, they have 2 vouches and need 87806.
Received a vouch for redwolfmoon99, they have 2 vouches and need 87806.

87806 seems like an extreme number of vouches required. Ricky (near the top) only needed 26, which seems reasonable and that was approximately the maximum number of players the server got up to.

Everyone in the list above got instanced at some point during the test.

What also bothers me is the extremely high values of “hitch” times, both in frequency and number of milliseconds.

We are using fivem-mysql-async, with 27 Sync and 128 Async statements (82% async).

There are a few minor tracebacks in the server log, but I don’t think they are absolutely critical. We had some minor functionality degredation (some custom chat-based commands wouldn’t work, some would after trying again, and some would simply lag in processing).

On the client, we had many and frequent instances of crashes when calling GetSoundId().

Starting the server took several tries in the beginning, which printed huge tracebacks, which I think happened in native code, but the server stayed running on the 4th try. I can try to reproduce that, but the log for those rounds were lost. No netgraphs are available. I would hate to push the server live to get the native crashes and netgraphs, but we might have to, depending on feedback received here.

Full (anonymized) server log here:

Issues: lag, desync, instancing
Additional text for forum searches: instance, instanced, timeout, time out, disconnect, disconnection, disconnected

3 Likes

I also did not observe “out of order” or “dropped packets” in the client console.

That’s probably the main concern - most probably these end up being caused by scripts but no more detailed instrumentation is implemented to drill down into what script is causing this.

Given how the hitch times increase over time, starting at ~100msec and continuing up to ~1-2sec, I’m guessing there’s some script (or worse, internal server function) that doesn’t clear some kind of linear-scan list?

That’s part of legacy networking code and isn’t used anymore.

2 Likes

We’re investigating on our end as well given this information.

Any idea why this would be different between CFX Server and FXServer? Perhaps CFX simply masked the problems a little better?

@promethium When you say some scripts that doesn’t clear some kind of linear-scan list could you clarify the Linear-scan list

O(n) complexity.

For each element added to the list, it takes an additional amount of time to go through the list, and each element adds the same amount of time, so the effect is cumulative.

If it takes 1ms for each item in a list to be processed, a list of 2 items will take 2ms. A list of 3 items will take 3 ms, and so on.

Perhaps mysql-async’s rewrite, perhaps CitizenTaskScheduler, something at least changed. :\

1 Like

I found a MySQL.Sync call in a growing list. I think that’s our problem.

I am assuming that it wasn’t a problem for us before because we were able to use MySQL pooling with the old driver. Lag and hitching is a problem now because that growing number of queries, in synchronous mode, with (TCP + MySQL authentication + query time) was adding significant blocking delays.

fivem-mysql-async does not support pooling as of the time of this writing (even though the underlying MySqlConnector appears to), but I did create a hack to support pools which I need to re-implement and attempt to contribute to that project.

I think that between hacking in pooling and switching this particular loop to be a MySql.Async call we’ll be fine… in addition to a rewrite of our own code after we’re all on FXServer in production.

I wasn’t able to recover my pool hack for fivem-mysql-async, but I did use a combination of Citizen.CreateThread on the server, SetTimeout, and MySQL.Async queries to prevent scheduling too much on a single tick. In tests, this allows us to continue to use our terrible list without any hitching. We’ll try another live test using this optimization and report back.

Hitching is gone with the optimizations mentioned above, but we’re still getting huge numbers of instances, disconnects, and timeouts of players. We reverted again.

Vouches look great now for artifact 301:

Received a vouch for KatieRouu, they have 2 vouches and need 14.
Received a vouch for Rekan Martinez, they have 2 vouches and need 15.
Received a vouch for Phalanx™, they have 2 vouches and need 15.
Received a vouch for Mindless, they have 2 vouches and need 16.
Received a vouch for KennyH, they have 2 vouches and need 17.
Received a vouch for Marth, they have 2 vouches and need 18.
Received a vouch for Atomic_Wolf_, they have 2 vouches and need 17.
Received a vouch for Da True MVP, they have 2 vouches and need 17.
Received a vouch for Mindless, they have 2 vouches and need 20.
Received a vouch for Talbot, they have 2 vouches and need 19.
Received a vouch for Marth, they have 2 vouches and need 19.
Received a vouch for iamboogsy, they have 2 vouches and need 20.
Received a vouch for Krom_, they have 2 vouches and need 20.
Received a vouch for slacker_1980, they have 2 vouches and need 20.
Received a vouch for Atomic_Wolf_, they have 2 vouches and need 20.
Received a vouch for Mindless, they have 2 vouches and need 20.
Received a vouch for Talbot, they have 2 vouches and need 18.

We’re leaking MySQL connections with fivem-mysql-async beta2. I think it has to do with the MySqlConnector.dll update to 0.23.0 and will try to revert the blob to beta1’s version which we were running yesterday.

When we can confirm that we’re not leaking connections and are still getting high rates of instancing I’ll report back once again.

We need to let the game run so players can have time to play. Our next scheduled restart is in a few hours.

I am experiencing the same exact issue with the new FX server!

In fact, here is the post I literally just made before reading this one:

But the thing is,

I am using Couch DB, no SQL at all.

@promethium @aurum does that information help at all?

Older server builds will still hang on PerformHttpRequest due to an oversight. In addition, this is still a known issue (and is being discussed in server testing channels on Discord), however literally no common factor to the clients that are getting disconnected has been identified so far, even with the use of helper scripts to log extra information.

Weird thing is that there have been pretty much no design changes that could cause this to occur between CitizenMP.Server and FXServer, so this either fails due to a design oversight, or some other fluke is causing certain players to stop receiving packets at times.

A pcap dump of server port traffic with correlated real-world times to a log of player connections/disconnections could be a potential next step to figuring out what the hell happens that could cause these players to end up timing out.

I’ll tag a few people on the Discord too so they can see this idea.

2 Likes

I’ll discuss this with my team. As you can imagine, capturing IPs and passwords and whatnot could be a problem to share, especially trying to remove just those packets from a pcap, but I’ll see what I can get authorized since we’re extremely anxious to get off of CFX.

Is the Discord / dev area joinable? I’d like to pop in if it’s public, though I assume it’s private.

Edit: I joined the Discord as CFX.

Hmm, okay. Well let me know if I can help in some way!

Somewhat related cross-post:

3 Likes

Excuse me , it’s maybe a stupid question but what is the vouch?

Thank’s

Vouching is a way of keeping track of which client is the authoritative state of the game (i.e., the host). Clients vote (or “vouch”) for who should be the leader/host. If voting is messed up, clients ultimately get confused.

Instancing has been much improved in Artifact 306, but doesn’t quite solve the problem, particularly in the “higher numbered” ID slots; the last 2-3 slots before the server becomes full seem to have the most problems, but some people crash/instance constantly no matter what their ID.

1 Like

@justcfx2u

Thank you very much…

I don’t know if i can help the comunoty , but i tried a system on my server and it work!

I’m runing with VRP framework … on cfx server , i had a ton of time out , on FX no time out but ton on instance.

On Vrp , i deleted the update position ( \resources\vrp\modules\player_state.lua)

function tvRP.updatePos(x,y,z)
  local user_id = vRP.getUserId(source)
  if user_id ~= nil then
    local data = vRP.getUserDataTable(user_id)
    if data ~= nil then
      data.position = {x = tonumber(x), y = tonumber(y), z = tonumber(z)}
    end
  end
end

And finish the time out on cfx and a ton less instance on FX !

I’m sorry , english it’s not my first language , but in french i can explain to you how to solve the instance problem.

For anyone know arma 3 server and sql system for arma 3 , try to make the same like extdb2 sql with you server and you’ll undertsand :wink:

1 Like

I have already done it and I think no crash so far: P, but the update of the location is essential???

1 Like