Cogmasters / concord

A Discord API wrapper library made in C

Home Page:https://cogmasters.github.io/concord/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Segfault due to race conditions in discord_cleanup when called from another thread

InterLinked1 opened this issue · comments

Describe the bug

I'm using libdiscord in a shared library module that I built for a large multithreaded application, e.g. Concord is only a very small part of the entire program, and there are lots of other threads. Everything is pretty much working, but I'm having segmentation faults on shut down, and it appears it may be because I'm shutting down the client in one thread, and perhaps this isn't legal. I looked at some of the code and it appears if have_sigint is set, then discord_run will exit. I'm not using SIGINT because that only makes sense for a standalone program, but perhaps this suggests that there's no way to terminate a session "out of band" (apart from this mechanism)? Is this the right understanding?

Essentially, I have this (some stuff removed for conciseness):

static void *discord_relay(void *varg)
{
	struct discord *client = varg;
	discord_run(client);
	return NULL;
}

static int load_module(void)
{
	ccord_global_init();
    discord_client = discord_init(token);
	if (!discord_client) {
		return -1;
	}

    discord_add_intents(discord_client, DISCORD_GATEWAY_MESSAGE_CONTENT | DISCORD_GATEWAY_GUILD_MESSAGES | DISCORD_GATEWAY_GUILD_MEMBERS);

    discord_set_on_ready(discord_client, &on_ready);
    discord_set_on_message_create(discord_client, &on_message_create);
    discord_set_on_message_update(discord_client, &on_message_update);
	discord_set_on_guild_members_chunk(discord_client, &on_guild_members_chunk);

	if (pthread_create(&discord_thread, NULL, discord_relay, discord_client)) {
		discord_shutdown(discord_client);
		discord_cleanup(discord_client);
		ccord_global_cleanup();
		return -1;
	}
	return 0;
}

static int unload_module(void)
{
	discord_shutdown(discord_client);
	pthread_join(discord_thread, NULL);
	discord_cleanup(discord_client);

    ccord_global_cleanup();
	return 0;
}

I initially had this ordering:

discord_shutdown(discord_client);
discord_cleanup(discord_client);
pthread_join(discord_thread, NULL);

When I did this, I got segmentation faults in the thread with discord_run, suggesting that perhaps it is not legal to call discord_shutdown in another thread. However, I need to be able to do that... how else can I tell the client to exit?

With the ordering shown in the larger snippet, pthread_join never returns, so the unload blocks forever (though no crash).

Is there an example or suggested approaching for doing an out of band shutdown as I'm trying to do here? I'd think this would be a common use case, but I didn't find any examples of it. It seems like there is a race condition here with the way I'm currently doing it that leads to the discord_run thread trying to read invalid memory... and if I don't call discord_shutdown, then the discord_run thread won't exit, so I'm not sure if there's a way around that.

Perhaps something like the following could be added to src/concord-once.c? (Although I'm not positive if this would be the right fix or not.)

static void ccord_shutdown_notify(int signum)
{
    ccord_has_sigint = 1;
}

My thinking is this would allow the discord_thread to gracefully exit on its own, and then I could call pthread_join directly, since the discord_run thread itself appears to call discord_shutdown on its own. So really, maybe what's need is just a way to asynchronously shut down using a public function, generically, not just the internal SIGINT handler which doesn't apply to every use case. (Correct me if I'm wrong, but I don't think this currently exists from what I could find.)

This happens consistently if I shut down my program (which unloads all dynamic modules), but not always (only sometimes) if I only attempt to unload the Discord module:

Stack Trace for full shutdown


==534690== 1 errors in context 16 of 29:
==534690== Invalid read of size 8
==534690==    at 0x4F273D3: ??? (in /usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0)
==534690==    by 0x4EF747F: ??? (in /usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0)
==534690==    by 0x4F0A32F: ??? (in /usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0)
==534690==    by 0x4F286AF: ??? (in /usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0)
==534690==    by 0x4F298D6: ??? (in /usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0)
==534690==    by 0x4F2A225: curl_multi_perform (in /usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0)
==534690==    by 0x4F2A562: ??? (in /usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0)
==534690==    by 0x4F2A679: curl_multi_socket_all (in /usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0)
==534690==    by 0x9253F2C: ws_multi_socket_run (in /usr/local/lib/libdiscord.so)
==534690==    by 0x91CAA39: discord_gateway_perform (in /usr/local/lib/libdiscord.so)
==534690==    by 0x91C9EFF: _discord_on_gateway_perform (in /usr/local/lib/libdiscord.so)
==534690==    by 0x924E86F: io_poller_perform (in /usr/local/lib/libdiscord.so)
==534690==  Address 0xb485e50 is 192 bytes inside a block of size 6,440 free'd
==534690==    at 0x48399AB: free (vg_replace_malloc.c:538)
==534690==    by 0x4F0A89A: ??? (in /usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0)
==534690==    by 0x4F00B78: curl_easy_cleanup (in /usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0)
==534690==    by 0x92550AE: _cws_cleanup (in /usr/local/lib/libdiscord.so)
==534690==    by 0x9256F10: cws_free (in /usr/local/lib/libdiscord.so)
==534690==    by 0x9253325: ws_cleanup (in /usr/local/lib/libdiscord.so)
==534690==    by 0x91CA291: discord_gateway_cleanup (in /usr/local/lib/libdiscord.so)
==534690==    by 0x91C56FB: discord_cleanup (in /usr/local/lib/libdiscord.so)
==534690==    by 0x9183699: unload_module (mod_discord.c:409)

Stack Trace for explicit unload

==534944== 2 errors in context 17 of 34:
==534944== Invalid read of size 4
==534944==    at 0x4FE09F9: pthread_mutex_trylock (pthread_mutex_trylock.c:42)
==534944==    by 0xFB6EEBB: discord_requestor_dispatch_responses (in /usr/local/lib/libdiscord.so)
==534944==    by 0xFB761AB: discord_run (in /usr/local/lib/libdiscord.so)
==534944==    by 0x9183207: discord_relay (mod_discord.c:322)
==534944==    by 0x134D81: thread_run (thread.c:252)
==534944==    by 0x4FDDEA6: start_thread (pthread_create.c:477)
==534944==    by 0x50F4A2E: clone (clone.S:95)
==534944==  Address 0xfac8be0 is 96 bytes inside a block of size 120 free'd
==534944==    at 0x48399AB: free (vg_replace_malloc.c:538)
==534944==    by 0xFB6E213: discord_requestor_cleanup (in /usr/local/lib/libdiscord.so)
==534944==    by 0xFB6D93F: discord_rest_cleanup (in /usr/local/lib/libdiscord.so)
==534944==    by 0xFB736E9: discord_cleanup (in /usr/local/lib/libdiscord.so)
==534944==    by 0x9183699: unload_module (mod_discord.c:409)

Your suggested solution makes sense, ccord_has_sigint is checked from the main thread and when true it will trigger a discord_shutdown(). This should ensure proper shutdown in a thread-safe manner. I hope to look into a good solution for this in the following days.

I initially had this ordering:

Yeah, this ordering will lead to issues as we need to ensure discord_run() has been returned from, before doing a discord_cleanup()

The workflow is:

  1. discord_shutdown() will notify concord that it should cease the connection with Discord, this is done asynchronously, meaning that even if discord_shutdown() returns we'll still be in the process of shutting down
  2. concord waits until we have confirmation from Discord that the connection has been closed
  3. discord_run() returns
  4. we may now call discord_cleanup()

So discord_shutdown() is asynchronous, all it does is notify concord that it SHOULD shut down sometime in the near future... We'll only know for sure the bot has shutdown once discord_run() returns

Thanks for confirming!

A couple other clarifications from trying to figure these out the past couple days, and I can certainly open a new issue if this is a legitimate issue, not just user error, but I wonder if these are other issues that may exist:

2 things I'm struggling to do:

  • obtain user presences on startup
  • determine what users are in each channel in a guild

Luckily, both of these only need to be done once.

I can get presence updates just fine, but not actually the initial presences when the module starts, so I only accumulate presences over time, but don't have the initial presences. It seems this should be possible, but all the APIs either return NULL or don't have this implemented. Am I doing this wrong?

discord_request_guild_members doesn't seem to work as the returned members' presences are all NULL, even though the event includes it:

struct discord_presence_updates *presences = event->presences;
for (i = 0; i < members->size; i++) {
		struct discord_guild_member *member = &members->array[i];
		struct discord_user *user = member->user;
		struct discord_presence_update *presence = presences ? &presences->array[i] : NULL; /* Initial (current) presence status */
	}
}

I did notice some other events seem to return NULL for certain fields that other functions do include, so I thought maybe this was for efficiency reasons (don't want to fetch more than we need to).

This suggests guild member would be the right API to use, since it has presence: https://discord.js.org/#/docs/main/stable/class/GuildMember

But, according to the documentation for this, there is no presence field, even though it would seem this would be the logical place for it: https://cogmasters.github.io/concord/structdiscord__guild__member.html

There is also this one:

https://discord.com/developers/docs/resources/guild#guild-widget-object
https://cogmasters.github.io/concord/structdiscord__guild__widget.html

But, looking at what the response contains, I can't find any fields that actually contain any presence data. Certainly, struct discord_user * does not.

I've scoured both the code and the documentation, but have come up short on other ways to get a guild member's presence on demand, which seems necessary if there's no way to get it from discord_request_guild_members (which would be most ideal, if possible, to avoid making a separate request for every user).

  1. I'm running into a similar issue with discord_get_channel:
struct discord_ret_channel ret2 = { .sync = &dchannel };
code = discord_get_channel(client, channel->id, &ret2);
assert(dchannel.member);

The assertion fails, because the member field is always NULL.
I admit I don't understand why certain APIs are always returning fields that are always NULL, but maybe I'm not understanding some usage aspect of this.

Yes, the relevant intents should be set:
discord_add_intents(discord_client, DISCORD_GATEWAY_MESSAGE_CONTENT | DISCORD_GATEWAY_GUILD_MESSAGES | DISCORD_GATEWAY_GUILD_PRESENCES | DISCORD_GATEWAY_GUILDS | DISCORD_GATEWAY_GUILD_MEMBERS | DISCORD_GATEWAY_DIRECT_MESSAGES | DISCORD_GATEWAY_PRESENCE_UPDATE);

And all of those are toggled on the bot side as well, so I don't think it's permissions.

Any thoughts on this? Are these even the right APIs to use?

@InterLinked1 I can take a look at those separate issues later on, in the meantime can you confirm that the JSON payloads sent by Discord do in fact contain the field you're looking for? Please ensure that:

  1. Concord's logging is enabled, for that you'll need to start your bot with discord_config_init()
  2. Locate the JSON payload sent by Discord at your generated http.log file.

This suggests guild member would be the right API to use, since it has presence: https://discord.js.org/#/docs/main/stable/class/GuildMember

On another note, I see that you are using djs as a reference as to what to expect from the Concord's API, instead may I suggest you take a look at the official Discord API? Our goal is to keep concord a 1:1 mapping to the official API, so if you spot something there and seems to be missing from Concord chances are we need to have it be updated.

According to the Discord API, there is no presence field for guild members, which means djs must fetch that internally in some other way.

From what I can see, yes, it does appear the presences are getting sent, but either I'm not using the library right or the library isn't exposing these:

This is the beginning of a line that contains a whole JSON array of presence data, which appears after a request to https://discord.com/api/v10/channels/<CHAN>

WS_RCV_TEXT [WEBSOCKETS] - 2023-02-5T19:36:23.687Z - wss://gateway.discord.gg/?v=10&encoding=json
{"t":"GUILD_CREATE","s":2,"op":0,"d":{"application_id":null,"mfa_level":0,"guild_hashes":{

The fact that a whole array of presences is coming through leads me to believe my first snippet is probably the right idea, but for some reason it's just NULL for me.

There was indeed an issue with 'struct discord_request_guild_members', you can get the latest fix from the dev branch

I used the following snippet while working on a fix:

#include <stdio.h>
#include <stdlib.h>

#include <concord/discord.h>
#include <concord/log.h>

void
on_members_chunk(struct discord *client, const struct discord_guild_members_chunk *event)
{
    const int n = event->presences ? event->presences->size : 0;
    char text[8192];
    
    for (int i = 0; i < n; ++i) {
        discord_presence_update_to_json(text, sizeof(text), event->presences->array + i);
        log_info("%s\n", text);
    }
}

void
on_members_get(struct discord *client, const struct discord_message *event)
{
    if (event->author->bot) return;

    struct discord_request_guild_members request = {
        .guild_id = event->guild_id,
        .presences = true,
    };
    discord_request_guild_members(client, &request);
}

int main(void)
{
    ccord_global_init();
    struct discord *client = discord_config_init("config.json");

    discord_add_intents(client, DISCORD_GATEWAY_GUILD_PRESENCES
                                    | DISCORD_GATEWAY_GUILD_MEMBERS);
                                    
    discord_set_on_guild_members_chunk(client, &on_members_chunk);
    discord_set_on_command(client, "members_get", &on_members_get);

    discord_run(client);

    discord_cleanup(client);
    ccord_global_cleanup();
}

I'm running into a similar issue with discord_get_channel:

Now this solves the first issue you encountered, but it doesn't seem discord_get_channel() returns guild members from my tests... perhaps there is some other way the Discord API provides for fetching members via a channel?

Okay, so I'm getting some presence info now, but it seems to fail halfway through, for example starting around member 34 of 77 in one test, event->presences->array + i starts pointing to invalid memory, and I get a segfault.

#4  0x00007f84903fd306 in on_guild_members_chunk (client=0x555a8c44c5c0, event=0x7f84800d25f0) at mod_discord.c:314
        member = 0x7f84800cbb20
        user = 0x7f84800c6210
        presence = 0x7f84800c9540
        presencestatus = 0x656c6469 <error: Cannot access memory at address 0x656c6469>
        i = 36
        members = 0x7f84800c5f40
        presences = 0x7f84800c8d40
        __PRETTY_FUNCTION__ = "on_guild_members_chunk"
        __FUNCTION__ = "on_guild_members_chunk"

Printing out the sizes, I get an interesting result: 77 members, 34 presences

Is it expected that presences->size might be different from members->size? That seems odd to me. Obviously, I'm reading invalid memory, but I wonder why the number of presences would be less than the total number of members, and the rest is missing.

Do you mind opening this as a separate issue? Also, can you verify by the http.log payload that the presences and members array are indeed of the same size?