sogou / workflow

C++ Parallel Computing and Asynchronous Networking Framework

kafka client issue

zongjiangU opened this issue · comments

An application process uses the kafka client. The process's cgroup is limited to 1 CPU core and 2 GB of memory (1c2g). When the client's Kafka send rate reaches 980 pps (about 40M/s), kafka client get_error returns error code 11. Memory then grows rapidly until the process is OOM-killed.
The function that sends Kafka messages inside the application is as follows:

void KafkaClient::sendMsg(const std::string& topic, const std::string& msg, bool mustAlive /*= true*/) {
    // Circuit breaker: skip sending while the client is marked unhealthy.
    if (mustAlive && !m_bAlive) return;

    std::stringstream oTask;
    oTask << "api=produce&topic=" << topic;

    // The second argument (3) is the task's retry count.
    auto task = mustAlive ? m_pClient->create_kafka_task(oTask.str(), 3, KafkaClient::sendMsgCallback) :
                            m_pClient->create_kafka_task(oTask.str(), 3, m_hbCallback);

    protocol::KafkaRecord record;
    record.set_key("key", strlen("key"));
    record.set_value(msg.c_str(), msg.length());
    task->add_produce_record(topic, -1, std::move(record));  // -1: unspecified partition
    task->start();
}

Under the same load, if the function returns on its first line and the client sends nothing at all, no OOM occurs.

m_bAlive is set from the callback sendMsgCallback: whenever get_state returns a nonzero result, it is set to false, acting as a circuit breaker. But it is not very effective; under high load the process still OOMs.
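
For reference, a minimal sketch of the breaker callback described above. It assumes m_bAlive is a static, atomically updated flag so the static callback can reach it; the real member layout is not shown in this thread:

void KafkaClient::sendMsgCallback(WFKafkaTask *task) {
    // Nonzero state means the task did not complete successfully.
    if (task->get_state() != WFT_STATE_SUCCESS) {
        fprintf(stderr, "kafka produce failed: state=%d, error=%d\n",
                task->get_state(), task->get_error());
        m_bAlive = false;  // trip the breaker; sendMsg() will skip sending
    }
}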

I'd like to ask a couple of questions:

  • Is the retry count of the client's retry mechanism configurable?
  • When the server runs into trouble, can the client's memory keep growing without limit?

Please check which workflow version you are on; the retry mechanism only took effect in the last few releases. Unbounded memory growth is definitely wrong, but we have never seen it on our side; we'll need to take a look.

The version is 0.11.2; I pulled the latest master code at the time:

[root@localhost workflow]# git log
commit b876f856403f84098dc2833739aac44233d39432 (HEAD -> master, origin/master, origin/HEAD, 0a)
Author: Xie Han <63350856@qq.com>
Date:   Thu Dec 7 15:48:36 2023 +0800

    Remove 'struct'.

commit 21c71754029bc0ec48a50ad476781f87b3fd79f1
Merge: d73f353 908a2de
Author: xiehan <52160700+Barenboim@users.noreply.github.com>
Date:   Wed Dec 6 20:42:14 2023 +0800

    Merge pull request #1437 from Barenboim/master
    
    Add 'guard' task wrapper.

Back then I had run into a coredump problem, so I upgraded on my side.
This afternoon the environment began coredumping continuously:

#0  0x00007ffff6811a55 in poller_add (data=data@entry=0x7fff03ffdf80, timeout=10000, poller=0x7ffff7e21010) at /home/workflow/src/kernel/poller.c:1353
#1  0x00007ffff6816652 in mpoller_add (mpoller=<optimized out>, timeout=<optimized out>, data=0x7fff03ffdf80) at /home/workflow/src/kernel/mpoller.h:52
#2  Communicator::request_new_conn (this=this@entry=0x7ffff6a846c8 <__CommManager::get_instance()::kInstance+8>, session=session@entry=0x7fff6800fc48,
    target=target@entry=0x7fffe0006bf8) at /home/workflow/src/kernel/Communicator.cc:1611
#3  0x00007ffff681670e in Communicator::request (this=this@entry=0x7ffff6a846c8 <__CommManager::get_instance()::kInstance+8>,
    session=session@entry=0x7fff6800fc48, target=0x7fffe0006bf8) at /home/workflow/src/kernel/Communicator.cc:1636
#4  0x00007ffff65c1361 in CommScheduler::request (target=0x7fff6800fc98, wait_timeout=<optimized out>, object=<optimized out>, session=0x7fff6800fc48,
    this=0x7ffff6a846c0 <__CommManager::get_instance()::kInstance>) at /home/workflow/_include/workflow/CommScheduler.h:133
#5  CommRequest::dispatch (this=0x7fff6800fc30) at /home/workflow/_include/workflow/CommRequest.h:45
#6  WFComplexClientTask<protocol::KafkaRequest, protocol::KafkaResponse, int>::dispatch (this=0x7fff6800fc30)
    at /home/workflow/src/factory/WFTaskFactory.inl:362
#7  0x00007ffff6850719 in WFResolverTask::dispatch (this=0x7fff68011d00) at /home/workflow/src/nameservice/WFDnsResolver.cc:390
#8  0x000000000051f962 in Workflow::start_series_work(SubTask*, std::function<void (SeriesWork const*)>) (first=0x7fff6800d5c0, callback=...)
    at /home/workspace/Daily_HiStorage_dra-agent_CentOS7.2_docker_Debug/DBdoctor/dra-agent/src/./workflow/Workflow.h:188
#9  0x000000000051f9e3 in WFGenericTask::start (this=0x7fff6800d5c0)
    at /home/workspace/Daily_HiStorage_dra-agent_CentOS7.2_docker_Debug/DBdoctor/dra-agent/src/./workflow/WFTask.h:379

Checked against my local code, this corresponds to the line in frame #0 (poller.c:1353).

Did the coredump appear after the upgrade? Please make sure the header files and the library are consistent.

The coredump is a newly appeared problem; the headers and the lib are definitely consistent. The runtime environment is packaged as a container, and after moving to 0.11.2 it went through a long period of validation. Unfortunately, it coredumped at a customer site this afternoon…

The coredump is unrelated to the latest master code, right? Then could you upgrade to the latest master…
I'll look into the coredump problem first; a coredump at that spot seems very unlikely.

Is the latest coredump the OOM problem?

No, they are two separate problems:

  • The OOM happened last week during a POC: the kafka-client's send rate was pushed to 40M/s, the kafka-server misbehaved and returned error code 11, and the client side OOMed almost instantly.
  • The coredump is a new problem from this afternoon. It is also rather strange: on one machine, sending the very first packet triggers the coredump, 100% reproducibly, while two other machines run fine. The kernel version of the crashing machine differs from the healthy ones.

We'll keep looking. The location of your latest coredump really shouldn't be able to crash; we'll see whether something else could be causing it.

If it's error 11, it means the maximum connection count is insufficient. You need to increase max_connections in the global settings (or the upstream settings).
As for the OOM afterwards: could it be a problem in how your own code handles things?

We believe the OOM happens because your concurrency is too high: the server stops responding for a short while, yet you keep issuing requests. After all, you only have 2 GB of memory, and every broker's connections reached the 200 cap at the same time (which is exactly what produces errno 11), so memory may genuinely have been used up.
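
A minimal sketch of raising the global connection limit (the value 2000 is illustrative; this uses workflow's standard global-settings initialization, which must run before any task is created):

#include "workflow/WFGlobal.h"

int main() {
    struct WFGlobalSettings settings = GLOBAL_SETTINGS_DEFAULT;
    settings.endpoint_params.max_connections = 2000;  // default is 200
    WORKFLOW_library_init(&settings);
    // ... initialize the WFKafkaClient and create tasks as usual ...
}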

> If it's error 11, it means the maximum connection count is insufficient. You need to increase max_connections in the global settings (or the upstream settings). As for the OOM afterwards: could it be a problem in how your own code handles things?

At first I suspected the problem was on my side, so I deliberately misconfigured the Kafka address: the whole flow stayed identical except that the Kafka send returned immediately. No OOM occurred and memory usage held steady at 150 MB. My async queue stores data in a ring buffer, so in theory memory cannot grow without bound.
Just now I raised the client's endpoint_params.max_connections from 200 to 2000; error code 11 no longer appeared, but memory still grew to 2 GB. I then set the container's memory limit to 8 GB; with the 2000-connection setting, memory kept growing until the OOM.

Try running it under valgrind: exit normally before the OOM and check whether anything has leaked.
I don't know how big each record you produce is; at this level of concurrency, using 8 GB of memory is not impossible.

You may have a lot of brokers; with 1000 connections to each, I really can't say how much memory that would take.

> We believe the OOM happens because your concurrency is too high: the server stops responding for a short while, yet you keep issuing requests. After all, you only have 2 GB of memory, and every broker's connections reached the 200 cap at the same time (which is exactly what produces errno 11), so memory may genuinely have been used up.

That is indeed the most likely explanation.
Does the framework support configuring the producer's acks and its retry count? In this scenario I don't really want to care about the server's acks.

static constexpr struct EndpointParams ENDPOINT_PARAMS_DEFAULT =
{
	.max_connections		=	200,
	.connect_timeout		=	10 * 1000,
	.response_timeout		=	10 * 1000,
	.ssl_connect_timeout	=	10 * 1000,
	.use_tls_sni			=	false,
};

In these endpoint settings, can setting response_timeout to 0 meet the requirement of not waiting for the server's ack?

If you change that directly, the meta requests are affected too, so that definitely won't work.

> In these endpoint settings, can setting response_timeout to 0 meet the requirement of not waiting for the server's ack?

As for retry: it can already be passed when creating the kafka task.

If you want to cut the ack overhead, you can also accumulate a batch of records and produce them in one go, as long as memory doesn't blow up.
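
A sketch of both points, following the sendMsg() code above. The retry count is simply the second argument of create_kafka_task (the code above already passes 3); batching several records into one produce task via repeated add_produce_record is drawn from the task API shown in this thread, and client, msgs, my_callback and "mytopic" are illustrative names:

auto *task = client.create_kafka_task("api=produce&topic=mytopic",
                                      3 /* retry count */, my_callback);
for (const std::string& msg : msgs) {
    protocol::KafkaRecord record;
    record.set_value(msg.c_str(), msg.length());
    task->add_produce_record("mytopic", -1, std::move(record));  // batch up
}
task->start();  // the whole batch goes out as one task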

My boss won't allow me to use any more memory…

typedef struct __kafka_config
{
	int produce_timeout;
	int produce_msg_max_bytes;
	int produce_msgset_cnt;
	int produce_msgset_max_bytes;
	int fetch_timeout;
	int fetch_min_bytes;
	int fetch_max_bytes;
	int fetch_msg_max_bytes;
	long long offset_timestamp;
	long long commit_timestamp;
	int session_timeout;
	int rebalance_timeout;
	long long retention_time_period;
	int produce_acks;
	int allow_auto_topic_creation;
	int api_version_request;
	int api_version_timeout;
	char *broker_version;
	int compress_type;
	int compress_level;
	char *client_id;
	int check_crcs;
	int offset_store;
	char *rack_id;

	char *mechanisms;
	char *username;
	char *password;
	int (*client_new)(void *conf, kafka_sasl_t *sasl);
	int (*recv)(const char *buf, size_t len, void *conf, void *sasl);
} kafka_config_t;

Regarding task->set_config and protocol::KafkaConfig::set_produce_acks: is this produce_acks the acks setting?

There actually is one… Give it a try; it does look like it.

protocol::KafkaConfig conf;
conf.set_produce_acks(0);  // 0: don't wait for any broker ack
task->set_config(conf);
task->add_produce_record(topic, -1, std::move(record));
task->start();

With this setting, it returned status code 110…

110 is ETIMEDOUT, so it timed out. Have a look at this document: https://github.com/sogou/workflow/blob/master/docs/tutorial-13-kafka_cli.md
produce_acks is the number of broker nodes that must have successfully replicated the message before the task returns; wouldn't 1 be more reasonable?
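
Under that reading, the earlier snippet would change by a single line:

conf.set_produce_acks(1);  // 1: wait for a single broker (the leader) to confirm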

The problem is resolved for now: the 1-core CPU limit made processing too slow, which caused the data to back up.
After removing the CPU limit, the problem went away.
With the 1-core limit still in place I ran it under valgrind + massif, but because of the instrumentation overhead the load wouldn't ramp up.
The overall pipeline is:
Kernel => RingBuffer => Consumer ThreadPool => Kafka Client
With the production rate unchanged and not enough CPU for the consumers, the ring buffer's entries kept getting overwritten, so the unbounded memory growth could not be reproduced. I've excerpted part of the report; the absolute amount of memory the kafka client allocates is not large, so I'm not sure whether it is of any reference value.
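
Before the excerpt, a generic sketch of the overwrite-on-full ring-buffer behavior just described (an illustration only, not the actual dra_agent implementation; synchronization between producer and consumer threads is deliberately omitted):

#include <cstddef>
#include <vector>

template <typename T>
class OverwritingRing {
    std::vector<T> slots_;
    size_t head_ = 0;  // total items written
    size_t tail_ = 0;  // total items read

public:
    explicit OverwritingRing(size_t capacity) : slots_(capacity) {}

    void push(T v) {
        slots_[head_ % slots_.size()] = std::move(v);
        ++head_;
        if (head_ - tail_ > slots_.size())
            tail_ = head_ - slots_.size();  // consumer lagged: oldest entries dropped
    }

    bool pop(T& out) {
        if (tail_ == head_)
            return false;  // empty
        out = std::move(slots_[tail_ % slots_.size()]);
        ++tail_;
        return true;
    }
};

Because the buffer drops the oldest entries instead of growing, memory stays bounded on the producer side, which is why the unbounded growth could not be reproduced here.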

bash-4.4# ms_print massif.out.512778 massif_report.txt
--------------------------------------------------------------------------------
Command:            ./dra_agent
Massif arguments:   (none)
ms_print arguments: massif.out.512778 massif_report.txt
--------------------------------------------------------------------------------


    MB
64.78^                        #
     |                        #
     |                        #:
     |                        #:
     |                        #:@
     |                        #:@
     |                        #:@             ::@@:::@:::::::::::::::@:::::::@
     |                      @@#:@:::::::::::::::@ :::@::: :::::::::::@:::::::@
     |                     @@ #:@: :::: ::: ::::@ :::@::: :::::::::::@:::::::@
     |                     @@ #:@: :::: ::: ::::@ :::@::: :::::::::::@:::::::@
     |                     @@ #:@: :::: ::: ::::@ :::@::: :::::::::::@:::::::@
     |                @  @@@@ #:@: :::: ::: ::::@ :::@::: :::::::::::@:::::::@
     |      ::        @  @ @@ #:@: :::: ::: ::::@ :::@::: :::::::::::@:::::::@
     |     :::::      @ :@ @@ #:@: :::: ::: ::::@ :::@::: :::::::::::@:::::::@
     |    :::::       @ :@ @@ #:@: :::: ::: ::::@ :::@::: :::::::::::@:::::::@
     |   ::::::       @ :@ @@ #:@: :::: ::: ::::@ :::@::: :::::::::::@:::::::@
     |   ::::::  :@:::@::@ @@ #:@: :::: ::: ::::@ :::@::: :::::::::::@:::::::@
     |  @:::::: ::@:::@::@ @@ #:@: :::: ::: ::::@ :::@::: :::::::::::@:::::::@
     | :@:::::: ::@:::@::@ @@ #:@: :::: ::: ::::@ :::@::: :::::::::::@:::::::@
     | :@:::::: ::@:::@::@ @@ #:@: :::: ::: ::::@ :::@::: :::::::::::@:::::::@
   0 +----------------------------------------------------------------------->Gi
     0                                                                   87.11

Number of snapshots: 68
 Detailed snapshots: [2, 11, 15, 18, 19, 20, 21 (peak), 23, 36, 40, 56, 66]

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1  1,306,931,611        7,078,976        7,030,878        48,098            0
  2  3,249,387,500       11,478,352       11,406,589        71,763            0
99.37% (11,406,589B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->41.82% (4,799,904B) 0x61D11FA: __mpoller_create (mpoller.c:30)
| ->41.82% (4,799,904B) 0x61D11FA: mpoller_create (mpoller.c:70)
|   ->41.82% (4,799,904B) 0x61D7383: Communicator::create_poller(unsigned long) (Communicator.cc:1593)
|     ->41.82% (4,799,904B) 0x61D73E5: Communicator::init(unsigned long, unsigned long) (Communicator.cc:1616)
|       ->41.82% (4,799,904B) 0x61E0627: init (CommScheduler.h:116)
|         ->41.82% (4,799,904B) 0x61E0627: __CommManager (WFGlobal.cc:348)
|           ->41.82% (4,799,904B) 0x61E0627: get_instance (WFGlobal.cc:333)
|             ->41.82% (4,799,904B) 0x61E0627: WFGlobal::get_scheduler() (WFGlobal.cc:689)
|               ->41.82% (4,799,904B) 0x647A2D4: WFComplexClientTask (WFTaskFactory.inl:75)
|                 ->41.82% (4,799,904B) 0x647A2D4: __ComplexKafkaTask (KafkaTaskImpl.cc:53)
|                   ->41.82% (4,799,904B) 0x647A2D4: __WFKafkaTaskFactory::create_kafka_task(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::function<void (WFNetworkTask<protocol::KafkaRequest, protocol::KafkaResponse>*)>) (KafkaTaskI0)
|                     ->41.82% (4,799,904B) 0x6460395: KafkaClientTask::check_meta() (WFKafkaClient.cc:857)
|                       ->41.82% (4,799,904B) 0x6462467: KafkaClientTask::dispatch_locked() (WFKafkaClient.cc:885)
|                         ->41.82% (4,799,904B) 0x6463404: dispatch (WFKafkaClient.cc:1110)
|                           ->41.82% (4,799,904B) 0x6463404: KafkaClientTask::dispatch() (WFKafkaClient.cc:1079)
|                             ->41.82% (4,799,904B) 0x50BB6F: Workflow::start_series_work(SubTask*, std::function<void (SeriesWork const*)>) (Workflow.h:188)
|                               ->41.82% (4,799,904B) 0x50BBF0: WFGenericTask::start() (WFTask.h:379)
|                                 ->41.82% (4,799,904B) 0x50ABE0: KafkaClient::sendMsg(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) (KafkaClient.cp
|                                   ->41.82% (4,799,904B) 0x50B2D2: KafkaClient::heartBeatLoop() (KafkaClient.cpp:88)
|                                     ->41.82% (4,799,904B) 0x50C921: void std::__invoke_impl<void, void (KafkaClient::*)(), KafkaClient*>(std::__invoke_memfun_deref, void (KafkaClient::*&&)(), KafkaClient*&&) (invoke.h:73)
|                                       ->41.82% (4,799,904B) 0x50C05D: std::__invoke_result<void (KafkaClient::*)(), KafkaClient*>::type std::__invoke<void (KafkaClient::*)(), KafkaClient*>(void (KafkaClient::*&&)(), KafkaClient*&&) (invoke.h:95)
|                                         ->41.82% (4,799,904B) 0x50F61C: decltype (__invoke((_S_declval<0ul>)(), (_S_declval<1ul>)())) std::thread::_Invoker<std::tuple<void (KafkaClient::*)(), KafkaClient*> >::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) (thread:244)
|                                           ->41.82% (4,799,904B) 0x50F581: std::thread::_Invoker<std::tuple<void (KafkaClient::*)(), KafkaClient*> >::operator()() (thread:253)
|                                             ->41.82% (4,799,904B) 0x50F36A: std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (KafkaClient::*)(), KafkaClient*> >, voator()() const (future:1362)
|                                               ->41.82% (4,799,904B) 0x50F0B1: std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>,uture_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (KafkaClient::*)(), KafkaClient*> >, void> >::_M_invoke(std::_Any_data const&) (std_function.h:283)
|                                                 ->41.82% (4,799,904B) 0x47E4D6: std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>::operator()() const (std_function.h:687)
|                                                   ->41.82% (4,799,904B) 0x47BC5E: std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) (future:561)
|                                                     ->41.82% (4,799,904B) 0x4807EC: void std::__invoke_impl<void, void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), stre_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::__invoke_memfun_deref, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, ture_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&) (invoke.h:73)
|                                                       ->41.82% (4,799,904B) 0x47F688: std::__invoke_result<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__fu::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>::type std::__invoke<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_bast_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base:base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&) (invoke.h:95)
|                                                         ->41.82% (4,799,904B) 0x47E29B: std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__futuretate_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Re::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()
|                                                           ->41.82% (4,799,904B) 0x47E2C6: std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__futu_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::ase::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()
|                                                             ->41.82% (4,799,904B) 0x47E2D7: std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__fu::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()
|                                                               ->41.82% (4,799,904B) 0x5C67E96: __pthread_once_slow (in /lib64/libpthread-2.28.so)
|                                                                 ->41.82% (4,799,904B) 0x472BFE: __gthread_once(int*, void (*)()) (gthr-default.h:699)
|                                                                   ->41.82% (4,799,904B) 0x47E369: void std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__se::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&) (mutex:684)
|                                                                     ->41.82% (4,799,904B) 0x47BA8C: std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool) (future:401)
|                                                                       ->41.82% (4,799,904B) 0x50EB14: std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (KafkaClient::*)(), KafkaClient*> >, void>::_Async_state_impl(std::thread::_Invoker<std::tuple<void (Kaf:*)(), KafkaClient*> >&&)::{lambda()
|

Found the problem: the memory growth was an issue on my side. Data collected with bcc's memleak:

747         7378772968 bytes in 230069 allocations from stack
748                 operator new(unsigned long)+0x1c [libstdc++.so.6.0.25]
749                 std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<consume_event, std::allocator<consume_event>, (__gnu_cxx::_Lock_policy)2> > >::allocate(std::allocator<std::_Sp_counted_ptr_inplace<consume_event, std::allocator<consume_event>, (__gnu_cxx::_Lock_po    licy)2> >&, unsigned long)+0x28 [dra_agent]
750                 std::__allocated_ptr<std::allocator<std::_Sp_counted_ptr_inplace<consume_event, std::allocator<consume_event>, (__gnu_cxx::_Lock_policy)2> > > std::__allocate_guarded<std::allocator<std::_Sp_counted_ptr_inplace<consume_event, std::allocator<consume_event>, (__gnu_    cxx::_Lock_policy)2> > >(std::allocator<std::_Sp_counted_ptr_inplace<consume_event, std::allocator<consume_event>, (__gnu_cxx::_Lock_policy)2> >&)+0x21 [dra_agent]
751                 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<consume_event, std::allocator<consume_event>>(consume_event*&, std::_Sp_alloc_shared_tag<std::allocator<consume_event> >)+0x3f [dra_agent]
752                 std::__shared_ptr<consume_event, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<consume_event>>(std::_Sp_alloc_shared_tag<std::allocator<consume_event> >)+0x36 [dra_agent]
753                 std::shared_ptr<consume_event>::shared_ptr<std::allocator<consume_event>>(std::_Sp_alloc_shared_tag<std::allocator<consume_event> >)+0x23 [dra_agent]
754                 std::shared_ptr<consume_event> std::allocate_shared<consume_event, std::allocator<consume_event>>(std::allocator<consume_event> const&)+0x23 [dra_agent]
755                 std::shared_ptr<consume_event> std::make_shared<consume_event>()+0x2c [dra_agent]
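
The signature above (steadily accumulating std::make_shared<consume_event> allocations that are never freed) typically means events are produced faster than they are consumed and pile up in an unbounded container. A hypothetical pattern that yields this kind of stack; consume_event and pending here are illustrative, not the actual dra_agent code:

#include <deque>
#include <memory>

struct consume_event { /* payload */ };

std::deque<std::shared_ptr<consume_event>> pending;  // no size cap

// Called at the producer's rate; if the consumers fall behind (e.g. under a
// 1-core CPU limit), `pending` and therefore heap usage grow without bound.
void on_event() {
    pending.push_back(std::make_shared<consume_event>());
}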

Great, thanks for using workflow. On that later core problem, any further findings?

I couldn't reproduce the core locally, and I can't log into the environment where it cored, so there is no more information.