Server network thread hitches and rapid memory rise

Our server, running fxserver artifact 2155 (win32), has been having weird incidents involving its memory usage and network thread hitches. They occurred last Monday (March 9) and again today (March 17).

Each incident is seemingly divided into two phases:

  1. The memory usage begins to rise very quickly, around 20-30 MB per second. This continues for an undetermined amount of time (at least several minutes), so it ends up many GB in size (the max I’ve seen before stopping it manually is 22).
  2. Once it stops rising, the server begins to spam network thread hitches, each 150-1500 ms long, one after another, while the memory slowly goes down at about the same rate as before. At the same time there is a very small number of server thread hitches sprinkled throughout, but nothing major.
    The memory usage never drops back to the starting amount, so if these incidents repeat after a while (we’ve had them recur anywhere from 20-30 minutes to several hours later), the server ends up using more and more RAM. The most I’ve seen it climb to was 20 GB before I manually restarted it. While the network hitches are happening the server obviously becomes very laggy for the players and a lot of them end up timing out.
    The network hitches eventually cease once the memory stops dropping and then the server works okay, but there are still occasional server thread hitches, until the whole incident repeats again.
    Our other server (usually empty as it’s a test server), running on the same machine, is unaffected during these incidents.
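
For anyone who wants to put numbers on phase 1, here is a minimal Python sketch of the arithmetic. It is not something fxserver provides: the function names and the 20 MB/s default threshold are assumptions made for illustration (the threshold is taken from the low end of the 20-30 MB/s figure above), and the samples are assumed to be (seconds, bytes) pairs collected by whatever monitoring tool you already use.

```python
from typing import List, Tuple

MB = 1024 * 1024  # bytes per megabyte


def growth_rate_mb_per_s(samples: List[Tuple[float, int]]) -> float:
    """Average growth in MB/s between the first and last (seconds, bytes) sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must be increasing")
    return (m1 - m0) / MB / (t1 - t0)


def seconds_to_reach(current_bytes: int, target_bytes: int, rate_mb_per_s: float) -> float:
    """Seconds until target_bytes is reached if memory keeps growing at the given rate."""
    if rate_mb_per_s <= 0:
        return float("inf")  # not growing, so the target is never reached
    return max(0.0, (target_bytes - current_bytes) / MB / rate_mb_per_s)


def is_runaway(samples: List[Tuple[float, int]], threshold_mb_per_s: float = 20.0) -> bool:
    """True when the average growth rate is at or above the (assumed) threshold."""
    return growth_rate_mb_per_s(samples) >= threshold_mb_per_s


if __name__ == "__main__":
    # Hypothetical samples: 1024 MB at t=0 s and 1274 MB at t=10 s.
    samples = [(0.0, 1024 * MB), (10.0, 1274 * MB)]
    rate = growth_rate_mb_per_s(samples)
    eta = seconds_to_reach(2 * 1024 * MB, 22 * 1024 * MB, rate)
    print(f"rate: {rate:.1f} MB/s, runaway: {is_runaway(samples)}, "
          f"about {eta / 60:.1f} min from 2 GB to 22 GB")
```

At a steady 25 MB/s, going from 2 GB to 22 GB takes roughly 14 minutes, which fits the "at least several minutes" observation.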

We have captured several pcaps using Wireshark, as well as ETW traces with heap profiling; however, I’m not sure whether posting them publicly in this thread is the best idea. I am happy to provide them in DMs.

As I wrote above, this first occurred last Monday and kept repeating throughout the day until we restarted the server. We were then able to keep the server up continuously for over 7 days without it repeating or leaking any memory. We suspect it’s some sort of attack, but we cannot prove it.

We restarted yesterday morning for a very minor script update and were fine until tonight when the incidents started happening again. The third time the memory started going skywards I just killed it and updated to artifact 2207 as a precaution. It appears to be stable for now…
Edit: still happening on 2207

Any help would be greatly appreciated.

… yeah uh do provide a heap trace?

also, hitches are normal when having crazy heap working sets and are not a separate thing of note.

Here you go, 3 different traces one for each part of the problem. If you need pcaps lemme know separately since those are quite large files

ouch, these seem to be cpu traces and not heap traces? :frowning:

uhh is there any extra shit I need to do to get heap traces except for this?

there’s the ‘tracing to file/…’ select box which has a special option for heap traces

however this issue (checking graph correlation) reminds me slightly of an issue which was fixed in 2116… maybe there’s something else going on there

Not sure if this is relevant but I noticed one time when the hitches stopped this printed

[22:26:09] network thread hitch warning: timer interval of 153 milliseconds
[22:26:10] network thread hitch warning: timer interval of 418 milliseconds
[22:26:11] network thread hitch warning: timer interval of 463 milliseconds
[22:26:11] network thread hitch warning: timer interval of 226 milliseconds
[22:26:11] network thread hitch warning: timer interval of 191 milliseconds
[22:26:11] network thread hitch warning: timer interval of 308 milliseconds
[22:26:12] network thread hitch warning: timer interval of 154 milliseconds
[22:26:12] INFO: User [377] xxxx authenticated - [1] [377] xxxx@
[22:26:13] INFO: User [393] xxxx authenticated - [2] [393] xxxx@
[22:26:13] INFO: User [388] xxxx authenticated - [3] [388] xxxx@
[22:26:13] INFO: User [326] xxxx authenticated - [5] [326] xxxx@
[22:26:13] INFO: User [382] xxxx authenticated - [7] [382] xxxx@
[22:26:13] INFO: User [391] xxxx authenticated - [6] [391] xxxx@
[22:26:13] INFO: User [395] xxxx authenticated - [8] [395] xxxx@
[22:26:13] INFO: User [383] xxxx authenticated - [9] [383] xxxx@
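
In case it helps anyone compare incidents, here is a small Python sketch that pulls the hitch warnings out of a log like the one above and summarizes them per thread. The function name is made up for illustration, and it assumes that server thread hitches are logged in the same format as the network thread lines shown here; lines that don't match (like the INFO ones) are simply skipped.

```python
import re
from statistics import mean
from typing import Dict, Iterable, List

# Matches e.g. "[22:26:09] network thread hitch warning: timer interval of 153 milliseconds"
HITCH_RE = re.compile(
    r"^\[(?P<time>\d{2}:\d{2}:\d{2})\] (?P<thread>\w+) thread hitch warning: "
    r"timer interval of (?P<ms>\d+) milliseconds$"
)


def summarize_hitches(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Return count, max and mean hitch duration (ms) per thread name."""
    per_thread: Dict[str, List[int]] = {}
    for line in lines:
        match = HITCH_RE.match(line.strip())
        if match:  # ignore everything that is not a hitch warning
            per_thread.setdefault(match["thread"], []).append(int(match["ms"]))
    return {
        thread: {"count": len(ms), "max_ms": max(ms), "mean_ms": mean(ms)}
        for thread, ms in per_thread.items()
    }


if __name__ == "__main__":
    sample = [
        "[22:26:09] network thread hitch warning: timer interval of 153 milliseconds",
        "[22:26:10] network thread hitch warning: timer interval of 418 milliseconds",
        "[22:26:12] INFO: User [377] xxxx authenticated - [1] [377] xxxx@",
    ]
    for thread, stats in summarize_hitches(sample).items():
        print(thread, stats)
```

To run it over a whole log, pass an open file object instead of the sample list, since the function only needs an iterable of lines.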

not directly, but a likely side effect of a starved tcp thread

just noticed the other traces were even more odd, the pcaps might help here (send the link in a forum DM though!) as the pattern in the ‘memory dropping’ trace has been seen before.

just received pcaps… sadly they don’t help much at all at first glance as most affected traffic was HTTP/2 over TLS :frowning:

got some potential leads anyway however!

right - there’s actually a server bug in asset downloading, not sure if it is directly related to the issue seen here but this is a pretty critical issue and can be potentially abused on purpose (but I don’t see any reason to assume it has been)

there’ll hopefully be a server push later today!


Keep up the diligent work!

OK, the updated server build shipped with pipeline ID 2249.

Here are some heap traces that I just captured, still on 2207. The memory rises to like 6-7 GB and then goes down to 1.5 GB, but much more slowly than last time, so there weren’t any network thread hitches.
(sorry in advance for how large the rising trace is)

Got some more (& smaller!) heap traces where the memory was rapidly going up. Memory usage went from 17 to 20 GB in a few minutes.

@SloPro have you updated your server artifacts since the update was pushed? If so, do you still have the issues?

No, I stayed on 2155 so I could provide the heap traces that @nta requested. I will be updating later today though

How do you make heap traces? Trying to debug a memory leak too

  • Stop the server
  • Open UIforETW
  • In Settings, set “Heap-profiled processes” to “FXServer.exe” & press OK
  • In the main window, enable “Fast sampling” and change “Circular buffer tracing” to “Heap tracing to file”
  • Start the server
  • When you are ready, press “Start Tracing” to begin and then “Save Trace Buffers” after like a minute

Then where can I check which resources are using the most memory?

@nta I can confirm this still happens, at least on 2283. Sadly I don’t have a heap trace right now as UIforETW crashed mid-save and I would need to restart the server for the heap tracing to start working again :expressionless:

Also not sure how helpful this is, but when the memory is going up the CPU usage rises to like double the usual %, and usually webadmin ticks are the slowest in the CPU trace (example CPU trace from just now, and a pcap).