Chat server gets frozen sometimes #37
Reference: apps/BareRTC#37
Sometimes (after about a week of uptime), the chat server becomes frozen and unresponsive and needs to be rebooted.
Symptoms
When the server gets stuck, the following symptoms seem to manifest:
For users currently on chat
For users trying to join the chat
(Including users who try to reload the page to get back in)
For users on the main website
The main website shows the number and list of usernames currently logged in to chat. Normally, we have between 20 and 50 users on chat at any given time.
During an especially long downtime event (~8 hours), the main website was showing an extremely high number of chatters online (150+) complete with 150 unique usernames.
This would seem to indicate that, during the downtime event:
For the chatbot
The chatbot has a deadlock detector (described below) where it tests for the above symptoms (not getting its own messages echoed back). However: the chatbot does see its own messages while the server is frozen!
There is an operator command to force a deadlock (based on mutexes) and it puts the server into the state where the above symptoms manifest, and the chatbot does detect that and reboot the server. But when the server locks up organically (after about a week of uptime): the symptoms appear but the chatbot does not detect it because its connection to the chat server still works OK. 🤔
Additionally: if a user online sends a DM to the chatbot while the server is frozen, the user does not see their echoed message but the chatbot does see it!
The HTTP server otherwise works OK
While the server is frozen, the rest of the HTTP server responds OK:
Hypothesis
From the behaviors seen by real users on the chat, I would suspect that the freeze is happening in the chat server / WebSockets layer.
For example: when sending a message in a chat room:
However: the chatbot has a deadlock detector and it seems to contradict this theory.
Deadlocks?
My first hunch is that a deadlock was blocking the server. For example: there once was a bug where the server got hung up while a user was sharing a fat GIF image, because it locked the subscriber list (WebSockets) for the entire duration that it was sending the fat GIF to everybody before unlocking it (fixed in 84da298c12).
In Go, deadlocks could occur when using mutexes (mutually exclusive locking) or channels. BareRTC doesn't use channels and the places it uses mutexes have been checked many times and seem OK.
Deadlock Detector (Chatbot)
To test the deadlock theory there is an operator command /debug-dangerous-force-deadlock which forces a deadlock by locking the subscriber list and not releasing it. This causes all of the symptoms above to manifest for users where they can't log in or send any messages.
On the chatbot: it has a deadlock detector where it will send a DM to itself and verify it gets the echoed response back from the chat server.
When I deliberately deadlock the server with /debug-dangerous-force-deadlock, the chatbot does successfully notice it doesn't get its DMs echoed back and it will reboot the server.
However: when the server locks up unpredictably, the chatbot doesn't notice because it does get its message echoed back to itself twice (once as the local echo, and once as the delivered DM to itself).
The resolution to this turned out to be:
Users on slow connections were getting their WebSocket message queues filled up, and the logic to disconnect them wasn't implemented correctly. Eventually, all the server resources were consumed and the chat server locked up for everybody.
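The issue does not show the actual fix, so the following is only a sketch of one common pattern for this kind of problem: give each client a bounded outgoing queue, enqueue without blocking, and disconnect any client whose queue is full. Modelling the queue as a buffered channel is an assumption made for this example, and `Client`, `Enqueue` and `Broadcast` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// Client is a hypothetical subscriber with a bounded outgoing queue.
type Client struct {
	Name   string
	mu     sync.Mutex
	queue  chan []byte // assumed representation of the message queue
	closed bool
}

func NewClient(name string, size int) *Client {
	return &Client{Name: name, queue: make(chan []byte, size)}
}

// Enqueue tries to queue msg without blocking. If the queue is full the
// client is marked closed and false is returned so the caller can drop it.
func (c *Client) Enqueue(msg []byte) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		return false
	}
	select {
	case c.queue <- msg:
		return true
	default:
		c.closed = true
		close(c.queue) // lets a writer goroutine drain the queue and exit
		return false
	}
}

// Broadcast queues msg for every client, removes the ones that could not
// keep up, and returns the names it disconnected.
func Broadcast(clients map[string]*Client, msg []byte) []string {
	var dropped []string
	for name, c := range clients {
		if !c.Enqueue(msg) {
			delete(clients, name) // deleting during range is safe in Go
			dropped = append(dropped, name)
		}
	}
	return dropped
}

func main() {
	clients := map[string]*Client{
		"fast": NewClient("fast", 8),
		"slow": NewClient("slow", 2), // never drained in this demo
	}
	for i := 0; i < 3; i++ {
		dropped := Broadcast(clients, []byte("hello"))
		fmt.Println("round", i, "dropped:", dropped)
	}
}
```

The key property is that a full queue never makes the sender wait: one slow client costs at most its own queue's worth of memory before it is dropped, instead of tying up resources shared by everybody.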