rancher / opni

Multi Cluster Observability with AIOps

Home Page: https://opni.io

Log anomaly not working due to seaweed failure

dbason opened this issue

After selecting deployments, nothing shows up in the log anomaly section.

Restarting the GPU controller did not solve the issue:

  logs_queue = asyncio.Queue(loop=loop)
2023-08-08T00:48:15.232244361Z 2023-08-08 00:48:15,231 - INFO - Attempting to connect to NATS
2023-08-08T00:48:15.257641485Z 2023-08-08 00:48:15,257 - INFO - Connected to NATS at opni-nats-client.opni-cluster-system.svc:4222... with cid: 29
2023-08-08T00:48:15.257670347Z 2023-08-08 00:48:15,257 - INFO - Current server info: {'server_id': 'ND3LEAUTQXYHFXR2Q56IPHHPUZSWUYK442L45VJA4MOQLRXXHJ24JUYY', 'server_name': 'opni-nats-1', 'version': '2.8.4', 'proto': 1, 'git_commit': '66524ed', 'go': 'go1.17.10', 'host': '0.0.0.0', 'port': 4222, 'headers': True, 'auth_required': True, 'max_payload': 8388608, 'jetstream': True, 'client_id': 29, 'client_ip': '10.0.14.17', 'nonce': 'mSTaRmRE8dnMhiI', 'cluster': 'opni', 'connect_urls': ['10.0.16.91:4222', '10.0.5.69:4222', '10.0.15.157:4222']}
2023-08-08 00:48:15,257 - INFO - NATS stats: {'in_msgs': 0, 'out_msgs': 0, 'in_bytes': 0, 'out_bytes': 0, 'reconnects': 0, 'errors_received': 0}
2023-08-08 00:48:15,257 - INFO - NATS options: {'verbose': True, 'pedantic': False, 'name': None, 'allow_reconnect': True, 'dont_randomize': False, 'reconnect_time_wait': 5, 'max_reconnect_attempts': -1, 'ping_interval': 120, 'max_outstanding_pings': 2, 'no_echo': False, 'user': None, 'password': None, 'token': None, 'connect_timeout': 2, 'drain_timeout': 30}
/app/opni_inference_service/./start_opnilog_inference.py:235: DeprecationWarning: The loop argument is deprecated since Python 3.8, and scheduled for removal in Python 3.10.
2023-08-08T00:48:15.258193805Z   job_queue = asyncio.Queue(loop=loop)
2023-08-08 00:48:15,343 - ERROR - Cannot currently obtain necessary model files. Exiting function
2023-08-08T00:48:15.343440807Z 2023-08-08 00:48:15,343 - INFO - initializing...
2023-08-08 00:48:15,352 - ERROR - No OpniLog model currently [Errno 2] No such file or directory: 'output/vocab.txt'

I also tried selecting new deployments to kick off retraining, but the trained model failed to upload:

2023-08-08 00:49:24,772 - INFO - POST https://opni-opensearch-svc.opni-cluster-system.svc:9200/_search/scroll?scroll=5m [status:200 request:0.320s]
2023-08-08 00:49:43,851 - INFO - n_words : 10000
2023-08-08T00:49:43.852219276Z 2023-08-08 00:49:43,852 - INFO - valid_words : 99
/app/opni_inference_service/models/opnilog/opnilog_parser.py:581: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
2023-08-08T00:49:43.958580653Z   nn.init.xavier_uniform(p)
/app/opni_inference_service/models/opnilog/opnilog_model.py:341: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
2023-08-08T00:49:47.307177425Z   npd = np.asarray(d)
2023-08-08 00:49:47,910 - INFO - #######Training Model within 3 epochs...######
2023-08-08 00:49:48,690 - INFO - | Epoch: 0 | Total Progress: 0.02% | Training Time Taken: 0.78s | ETC: 3185.22s | Epoch Step: 1/1363 | Loss: 8.4375 | Tokens per Sec: 2159.58 
...
2023-08-08 01:16:39,186 - INFO - Model finished training predicting 4844 logs correctly out of 4844 total logs for an accuracy of 1.0 on eval dataset.
2023-08-08 01:16:46,200 - ERROR - OpniLog model was not able to be trained. Failed to upload output/nulog_model_latest.pt to opni-nulog-models/nulog_model_latest.pt: An error occurred (InternalError) when calling the UploadPart operation (reached max retries: 4): We encountered an internal error, please try again.
2023-08-08 01:16:46,204 - WARNING - Disconnected from NATS
2023-08-08 01:16:46,216 - INFO - Connected to NATS at opni-nats-client.opni-cluster-system.svc:4222... with cid: 31
2023-08-08 01:16:46,216 - INFO - Current server info: {'server_id': 'ND3LEAUTQXYHFXR2Q56IPHHPUZSWUYK442L45VJA4MOQLRXXHJ24JUYY', 'server_name': 'opni-nats-1', 'version': '2.8.4', 'proto': 1, 'git_commit': '66524ed', 'go': 'go1.17.10', 'host': '0.0.0.0', 'port': 4222, 'headers': True, 'auth_required': True, 'max_payload': 8388608, 'jetstream': True, 'client_id': 31, 'client_ip': '10.0.14.17', 'nonce': 'Z-Wxnj5-aekJDts', 'cluster': 'opni', 'connect_urls': ['10.0.16.91:4222', '10.0.5.69:4222', '10.0.15.157:4222']}
2023-08-08T01:16:46.217367031Z 2023-08-08 01:16:46,217 - INFO - NATS stats: {'in_msgs': 0, 'out_msgs': 0, 'in_bytes': 0, 'out_bytes': 0, 'reconnects': 0, 'errors_received': 0}
2023-08-08T01:16:46.217375157Z 2023-08-08 01:16:46,217 - INFO - NATS options: {'verbose': True, 'pedantic': False, 'name': None, 'allow_reconnect': True, 'dont_randomize': False, 'reconnect_time_wait': 5, 'max_reconnect_attempts': -1, 'ping_interval': 120, 'max_outstanding_pings': 2, 'no_echo': False, 'user': None, 'password': None, 'token': None, 'connect_timeout': 2, 'drain_timeout': 30}
2023-08-08 01:16:46,217 - INFO - training status : False

It turns out this was caused by SeaweedFS failing. We should provide an option for users to configure their own S3 endpoint.
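
As a rough illustration of what a user-defined S3 option could look like on the Python side, here is a minimal sketch. It assumes the settings are supplied through environment variables; the names `S3_ENDPOINT`, `S3_ACCESS_KEY`, `S3_SECRET_KEY`, `S3_BUCKET` and `S3_REGION` are hypothetical and not existing Opni settings. The only value taken from the logs is the default bucket name `opni-nulog-models`. The keys of the resulting dictionary are chosen to match the `endpoint_url`, `aws_access_key_id`, `aws_secret_access_key` and `region_name` parameters of `boto3.client("s3", ...)`, on the assumption that the service uploads through boto3.

```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping, Optional

# Bucket name as it appears in the upload error in the log above.
DEFAULT_BUCKET = "opni-nulog-models"


@dataclass(frozen=True)
class S3Settings:
    """User-supplied S3 connection settings (all names here are hypothetical)."""
    endpoint_url: Optional[str]
    access_key: Optional[str]
    secret_key: Optional[str]
    bucket: str
    region: Optional[str]


def load_s3_settings(env: Mapping[str, str] = os.environ) -> S3Settings:
    """Read S3 settings from environment-style variables.

    Empty or missing values become None so that the built-in object store
    can remain the default when the user configures nothing.
    """
    return S3Settings(
        endpoint_url=env.get("S3_ENDPOINT") or None,
        access_key=env.get("S3_ACCESS_KEY") or None,
        secret_key=env.get("S3_SECRET_KEY") or None,
        bucket=env.get("S3_BUCKET") or DEFAULT_BUCKET,
        region=env.get("S3_REGION") or None,
    )


def client_kwargs(settings: S3Settings) -> Dict[str, str]:
    """Build keyword arguments for an S3 client, omitting unset values."""
    candidates = {
        "endpoint_url": settings.endpoint_url,
        "aws_access_key_id": settings.access_key,
        "aws_secret_access_key": settings.secret_key,
        "region_name": settings.region,
    }
    return {key: value for key, value in candidates.items() if value is not None}
```

With something like this in place, the training service could create its client with `boto3.client("s3", **client_kwargs(load_s3_settings()))` and upload to `settings.bucket`, so a SeaweedFS outage would no longer block model uploads for users who point at their own S3-compatible store.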