Chat server gets frozen sometimes #37

Open
opened 2023-09-27 20:54:00 +00:00 by kirsle · 0 comments

Sometimes (after about a week of uptime), the chat server becomes frozen and unresponsive and needs to be rebooted.

Symptoms

When the server gets stuck, the following symptoms seem to manifest:

For users currently on chat

  • If they send a message to a chat room or DM, nothing appears to happen (they don't see their own message echo back in the channel)
  • They stop seeing any further updates or changes to the Who's Online list.
  • They no longer see any new messages or presence updates in the chat rooms.
  • If they are connected to webcams, those stay connected OK because the cam connections are peer-to-peer and do not go through the chat server.

For users trying to join the chat

(Including users who try to reload the page to get back in)

  • They see ChatClient messages that go up until the "WebSocket connected!" notice.
  • They do NOT get any ChatServer welcome messages, or presence messages (including their own).
  • They do NOT see the Who's Online list at all (the list is empty, with not even their own username listed)
  • Trying to send a message in chat: nothing happens and they don't see their message echoed back in the chat room either.

For users on the main website

The main website shows the number and list of usernames currently logged in to chat. Normally, we have between 20 and 50 users on chat at any given time.

During an especially long downtime event (~8 hours), the main website was showing an extremely high number of chatters online (150+) complete with 150 unique usernames.

This would seem to indicate that, during the downtime event:

  • For every person who tried to connect to chat, the server marked them as online and logged their username.
  • The server was not removing them from the online list after they left/closed the tab/disconnected from the WebSocket.

For the chatbot

The chatbot has a deadlock detector (described below) where it tests for the above symptoms (not getting its own messages echoed back). However: the chatbot does see its own messages while the server is frozen!

There is an operator command to force a mutex-based deadlock. It puts the server into the state where the above symptoms manifest, and the chatbot does detect that and reboots the server. But when the server locks up organically (after about a week of uptime), the symptoms appear but the chatbot does not detect them, because its own connection to the chat server still works OK. 🤔

Additionally: if a user online sends a DM to the chatbot while the server is frozen, the user does not see their echoed message but the chatbot does see it!

The HTTP server otherwise works OK

While the server is frozen, the rest of the HTTP server responds OK:

  • The statistics endpoint responds and returns the count and list of users online.
  • The shutdown endpoint still functions so I can trigger a server reboot over HTTP.
  • The index page still functions (users can load the chat front-end OK)
  • Static file routes still function (javascript/css/images can still load OK).

Hypothesis

From the behaviors seen by real users on the chat, I would suspect that the freeze is happening in the chat server / WebSockets layer.

For example: when sending a message in a chat room:

  1. The client page posts the message to the chat server over WebSocket.
  2. The chat server echoes the message back to the user over WebSocket.
    • If the user doesn't see their echo, either the message never reached the server or the server couldn't send it back.
  3. The chat server sends the message to the other recipients (e.g. over DMs)

However: the chatbot has a deadlock detector and it seems to contradict this theory.

Deadlocks?

My first hunch was that a deadlock was blocking the server. For example: there once was a bug where the server hung while a user was sharing a fat GIF image, because it kept the subscriber list (WebSockets) locked for the entire time it was sending the GIF to everybody (fixed in 84da298c12).
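To make that failure mode concrete, here is a minimal Go sketch of one common way to avoid holding a lock across slow sends: copy the subscriber list under the lock, release it, then send. The `Server` and `Subscriber` types and the `Send` callback are hypothetical stand-ins, not BareRTC's actual code, and this has not been checked against what 84da298c12 really changed.

```go
package main

import (
	"fmt"
	"sync"
)

// Subscriber is a hypothetical stand-in for one connected WebSocket client.
type Subscriber struct {
	Username string
	Send     func(msg string) error // may be slow, e.g. for a large image payload
}

// Server guards its subscriber list with a mutex (illustrative names only).
type Server struct {
	mu          sync.RWMutex
	subscribers map[string]*Subscriber
}

func NewServer() *Server {
	return &Server{subscribers: make(map[string]*Subscriber)}
}

func (s *Server) Add(sub *Subscriber) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.subscribers[sub.Username] = sub
}

func (s *Server) Remove(username string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.subscribers, username)
}

func (s *Server) Count() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.subscribers)
}

// snapshot copies the subscriber list while holding the lock only briefly.
func (s *Server) snapshot() []*Subscriber {
	s.mu.RLock()
	defer s.mu.RUnlock()
	subs := make([]*Subscriber, 0, len(s.subscribers))
	for _, sub := range s.subscribers {
		subs = append(subs, sub)
	}
	return subs
}

// Broadcast sends msg to every subscriber in the snapshot and returns the
// number of successful sends. The lock is not held while Send runs, so a slow
// send cannot block joins, leaves, or other broadcasts.
func (s *Server) Broadcast(msg string) int {
	delivered := 0
	for _, sub := range s.snapshot() {
		if err := sub.Send(msg); err == nil {
			delivered++
		}
	}
	return delivered
}

func main() {
	srv := NewServer()
	srv.Add(&Subscriber{Username: "alice", Send: func(string) error { return nil }})
	// bob disconnects in the middle of the broadcast, which needs the write lock.
	srv.Add(&Subscriber{Username: "bob", Send: func(string) error {
		srv.Remove("bob")
		return nil
	}})
	fmt.Println("delivered:", srv.Broadcast("a large GIF"))
	fmt.Println("still online:", srv.Count())
}
```

If `Broadcast` held the read lock for the whole loop instead, bob's `Remove` call would wait forever for the write lock, which is the kind of hang described above.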

In Go, deadlocks can occur when using mutexes (mutual exclusion locks) or channels. BareRTC doesn't use channels, and the places where it uses mutexes have been checked many times and seem OK.
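One way to check the mutex theory the next time the server freezes would be to dump all goroutine stacks and look for goroutines parked in a `Lock` call. The sketch below uses the standard `runtime/pprof` package; the `GoroutineDump` helper is a hypothetical name, and whether BareRTC exposes anything like it is not stated in this issue. Since the HTTP side keeps responding during a freeze, the standard `net/http/pprof` handlers would be another possible route to the same information.

```go
package main

import (
	"fmt"
	"runtime/pprof"
	"strings"
	"sync"
	"time"
)

// GoroutineDump returns the stack traces of all goroutines as text.
// debug=2 asks pprof for the same per-goroutine format as an unrecovered panic.
func GoroutineDump() string {
	var b strings.Builder
	if p := pprof.Lookup("goroutine"); p != nil {
		_ = p.WriteTo(&b, 2)
	}
	return b.String()
}

func main() {
	var mu sync.Mutex
	mu.Lock() // never released, to simulate a stuck lock

	go func() {
		mu.Lock() // blocks forever; this goroutine shows up in the dump
		defer mu.Unlock()
	}()
	time.Sleep(50 * time.Millisecond) // let the goroutine reach the Lock call

	fmt.Println(GoroutineDump())
}
```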

Deadlock Detector (Chatbot)

To test the deadlock theory, there is an operator command /debug-dangerous-force-deadlock which forces a deadlock by locking the subscriber list and never releasing it. This causes all of the symptoms above to manifest: users can't log in or send any messages.
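As a rough illustration of what such a command does, the Go sketch below locks a subscriber-list mutex and never unlocks it, then shows that a later join blocks. All names here (`Server`, `ForceDeadlock`, `Join`, `JoinWithin`, `probeTimeout`) are hypothetical and not taken from BareRTC. The channel is used only by the demo's timeout probe.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// probeTimeout is how long the demo waits before declaring a join stuck.
const probeTimeout = 100 * time.Millisecond

// Server is a hypothetical chat server with a mutex-guarded subscriber list.
type Server struct {
	mu          sync.Mutex
	subscribers []string
}

// ForceDeadlock takes the subscriber lock and intentionally never releases it.
func (s *Server) ForceDeadlock() {
	s.mu.Lock()
}

// Join adds a user to the subscriber list; it needs the same lock.
func (s *Server) Join(username string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.subscribers = append(s.subscribers, username)
}

// JoinWithin reports whether Join finished within d. If the lock is stuck,
// the helper goroutine stays blocked and this returns false.
func JoinWithin(s *Server, username string, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		s.Join(username)
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

func main() {
	srv := &Server{}
	fmt.Println("join before deadlock:", JoinWithin(srv, "alice", probeTimeout)) // true
	srv.ForceDeadlock()
	fmt.Println("join after deadlock:", JoinWithin(srv, "bob", probeTimeout)) // false
}
```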

The chatbot's deadlock detector sends a DM to itself and verifies that it gets the echoed response back from the chat server.

When I deliberately deadlock the server with /debug-dangerous-force-deadlock, the chatbot does successfully notice it doesn't get its DMs echoed back and it will reboot the server.

However: when the server locks up unpredictably, the chatbot doesn't notice because it does get its message echoed back to itself twice (once for the local echo, and once for delivering the DM to itself).
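The detector logic described above can be sketched in Go as follows. This is only an illustration under assumptions: the `Conn` interface, `Message` type, `Probe` function, and the fake connection are hypothetical, and the real chatbot may be built quite differently. The probe sends a tagged DM to itself and counts the copies that come back before a timeout.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const probeTimeout = 100 * time.Millisecond

// ErrNoEcho means no copy of the probe message came back in time.
var ErrNoEcho = errors.New("no echo received before timeout")

// Message is a hypothetical chat message as seen by the bot.
type Message struct {
	From, To, Body string
}

// Conn is a hypothetical view of the bot's connection to the chat server.
type Conn interface {
	Send(m Message) error
	Inbox() <-chan Message
}

// Probe sends a tagged DM from the bot to itself, waits for the timeout, and
// returns how many matching copies arrived. Zero copies yields ErrNoEcho.
func Probe(c Conn, self, token string, timeout time.Duration) (int, error) {
	if err := c.Send(Message{From: self, To: self, Body: token}); err != nil {
		return 0, err
	}
	deadline := time.After(timeout)
	echoes := 0
	for {
		select {
		case m, ok := <-c.Inbox():
			if ok && m.From == self && m.To == self && m.Body == token {
				echoes++
			}
			if !ok { // connection closed: report what was seen so far
				if echoes == 0 {
					return 0, ErrNoEcho
				}
				return echoes, nil
			}
		case <-deadline:
			if echoes == 0 {
				return 0, ErrNoEcho
			}
			return echoes, nil
		}
	}
}

// fakeConn simulates a server that delivers a fixed number of copies per DM.
type fakeConn struct {
	inbox  chan Message
	copies int
}

func newFakeConn(copies int) *fakeConn {
	return &fakeConn{inbox: make(chan Message, 8), copies: copies}
}

func (f *fakeConn) Send(m Message) error {
	for i := 0; i < f.copies; i++ {
		f.inbox <- m
	}
	return nil
}

func (f *fakeConn) Inbox() <-chan Message { return f.inbox }

func main() {
	n, err := Probe(newFakeConn(2), "bot", "ping-1", probeTimeout)
	fmt.Println("healthy:", n, err) // two copies: local echo plus DM delivery
	n, err = Probe(newFakeConn(0), "bot", "ping-2", probeTimeout)
	fmt.Println("frozen:", n, err)
}
```

A probe like this only exercises the bot's own connection, which fits the observation above: it catches the forced deadlock but not a freeze in which the bot's connection keeps working.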

kirsle added the bug label 2023-09-27 20:54:00 +00:00
Reference: apps/BareRTC#37