[BUG] Sharepoint ingestion fails with remote end closed connection without response
mawandm opened this issue · comments
Michael Sekamanya commented
Nesis version
0.1.0
Describe the bug
During a long running Sharepoint ingestion process, an error
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] 2024-05-04 01:44:36.695 [WARNING ] nesis.api.core.document_loaders.sharepoint - Error when getting and ingesting file Stock Market Wizards (Jack D. Schwager) (z-lib.org).pdf - ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Generating embeddings: 0%| | 0/14 [00:00<?, ?it/s]Killed
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] 2024-05-04 01:44:37.469 [ERROR ] nesis.api.core.document_loaders.sharepoint - Error fetching and updating documents - Error: (None, None, "401 Client Error: Unauthorized for url: https://site.sharepoint.com/sites/nesis-test/_api/Web/GetFolderById('d5bc341a-8557-4c67-8c40-1cb0e085def9')?$select=Files&$expand=Files")
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] Traceback (most recent call last):
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] File "/app/.venv/lib/python3.11/site-packages/office365/runtime/client_request.py", line 38, in execute_query
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] response.raise_for_status()
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] File "/app/.venv/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] raise HTTPError(http_error_msg, response=self)
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://site.sharepoint.com/sites/nesis-test/_api/Web/GetFolderById('d5bc341a-8557-4c67-8c40-1cb0e085def9')?$select=Files&$expand=Files
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] During handling of the above exception, another exception occurred:
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] Traceback (most recent call last):
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] File "/app/nesis/api/core/document_loaders/sharepoint.py", line 117, in _sync_sharepoint_documents
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] _process_folder_files(
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] File "/app/nesis/api/core/document_loaders/sharepoint.py", line 168, in _process_folder_files
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] _files = folder.get_files(False).execute_query()
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] File "/app/.venv/lib/python3.11/site-packages/office365/runtime/client_object.py", line 52, in execute_query
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] self.context.execute_query()
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] File "/app/.venv/lib/python3.11/site-packages/office365/runtime/client_runtime_context.py", line 183, in execute_query
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] self.pending_request().execute_query(qry)
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] File "/app/.venv/lib/python3.11/site-packages/office365/runtime/client_request.py", line 42, in execute_query
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] raise ClientRequestException(*e.args, response=e.response)
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] office365.runtime.client_request_exception.ClientRequestException: (None, None, "401 Client Error: Unauthorized for url: https://site.sharepoint.com/sites/nesis-test/_api/Web/GetFolderById('d5bc341a-8557-4c67-8c40-1cb0e085def9')?$select=Files&$expand=Files")
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] 2024-05-04 01:44:37.503 [INFO ] apscheduler.executors.default - Job "ingest_datasource (trigger: date[2024-05-04 00:19:28 UTC], next run at: 2024-05-04 00:19:28 UTC)" executed successfully
Shows
To reproduce
- Create a sharepoint datasource
- Add multiple large documents to the Sharepoint
- Run the ingestion... after a while, the API service logs show a
401 Client Error: Unauthorized for url...
Expected behavior
The ingestion should run continuously. It seems like a refresh of the Sharepoint client authentication is needed
Screenshots
No response
Additional context
No response