ametnes / nesis

Your AI-powered enterprise knowledge partner. Designed to be used at scale, ingesting large volumes of documents in formats such as pdf, docx, xlsx, png, jpg, tiff, mp3, mp4 and jpeg. Integrates with S3, Windows Shares, Google Drive and more.

Home Page: https://ametnes.github.io/nesis

[BUG] Sharepoint ingestion fails with remote end closed connection without response

mawandm opened this issue

Nesis version

0.1.0

Describe the bug

During a long-running SharePoint ingestion process, an error

[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] 2024-05-04 01:44:36.695 [WARNING ] nesis.api.core.document_loaders.sharepoint - Error when getting and ingesting file Stock Market Wizards (Jack D. Schwager) (z-lib.org).pdf - ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Generating embeddings:   0%|          | 0/14 [00:00<?, ?it/s]Killed
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] 2024-05-04 01:44:37.469 [ERROR   ] nesis.api.core.document_loaders.sharepoint - Error fetching and updating documents - Error: (None, None, "401 Client Error: Unauthorized for url: https://site.sharepoint.com/sites/nesis-test/_api/Web/GetFolderById('d5bc341a-8557-4c67-8c40-1cb0e085def9')?$select=Files&$expand=Files")
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] Traceback (most recent call last):
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/.venv/lib/python3.11/site-packages/office365/runtime/client_request.py", line 38, in execute_query
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     response.raise_for_status()
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/.venv/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     raise HTTPError(http_error_msg, response=self)
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://site.sharepoint.com/sites/nesis-test/_api/Web/GetFolderById('d5bc341a-8557-4c67-8c40-1cb0e085def9')?$select=Files&$expand=Files
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] During handling of the above exception, another exception occurred:
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] Traceback (most recent call last):
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/nesis/api/core/document_loaders/sharepoint.py", line 117, in _sync_sharepoint_documents
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     _process_folder_files(
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/nesis/api/core/document_loaders/sharepoint.py", line 168, in _process_folder_files
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     _files = folder.get_files(False).execute_query()
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/.venv/lib/python3.11/site-packages/office365/runtime/client_object.py", line 52, in execute_query
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     self.context.execute_query()
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/.venv/lib/python3.11/site-packages/office365/runtime/client_runtime_context.py", line 183, in execute_query
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     self.pending_request().execute_query(qry)
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/.venv/lib/python3.11/site-packages/office365/runtime/client_request.py", line 42, in execute_query
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     raise ClientRequestException(*e.args, response=e.response)
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] office365.runtime.client_request_exception.ClientRequestException: (None, None, "401 Client Error: Unauthorized for url: https://site.sharepoint.com/sites/nesis-test/_api/Web/GetFolderById('d5bc341a-8557-4c67-8c40-1cb0e085def9')?$select=Files&$expand=Files")
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] 2024-05-04 01:44:37.503 [INFO    ] apscheduler.executors.default - Job "ingest_datasource (trigger: date[2024-05-04 00:19:28 UTC], next run at: 2024-05-04 00:19:28 UTC)" executed successfully

is shown in the API service logs.

To reproduce

  1. Create a SharePoint datasource
  2. Add multiple large documents to the SharePoint site
  3. Run the ingestion. After a while, the API service logs show a 401 Client Error: Unauthorized for url...

Expected behavior

The ingestion should run to completion without interruption. It seems the SharePoint client authentication needs to be refreshed during long-running ingestions.
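
One possible shape for such a refresh is sketched below. It is only an illustration, not the Nesis implementation: `run_with_reauth`, `is_unauthorized`, `make_client` and `operation` are hypothetical names, and the sketch assumes the raised exception exposes a `response` attribute with a `status_code`, as `requests.exceptions.HTTPError` does. When a call fails with a 401, the wrapper builds a fresh client and retries the call instead of aborting the whole ingestion.

```python
import time
from typing import Callable, TypeVar

C = TypeVar("C")  # client type (hypothetical, e.g. a SharePoint client context)
T = TypeVar("T")  # result type of the wrapped operation


def is_unauthorized(exc: Exception) -> bool:
    """Return True if the exception carries an HTTP 401 response.

    Assumption: the exception has a `response` attribute with `status_code`.
    """
    response = getattr(exc, "response", None)
    return getattr(response, "status_code", None) == 401


def run_with_reauth(
    make_client: Callable[[], C],
    operation: Callable[[C], T],
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
) -> T:
    """Run operation(client); on a 401, build a new client and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    client = make_client()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(client)
        except Exception as exc:
            # Re-raise anything that is not a 401, and the final 401 as well.
            if not is_unauthorized(exc) or attempt == max_attempts:
                raise
            # Wait a little longer after each failed attempt.
            time.sleep(backoff_seconds * attempt)
            # Re-authenticate by constructing a fresh client.
            client = make_client()
    raise RuntimeError("unreachable")
```

In the loader, `make_client` would wrap whatever currently builds the authenticated SharePoint client, and `operation` would resolve the folder from the client it is given before calling `folder.get_files(False).execute_query()`, so that the retry does not reuse objects bound to the expired session.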

Screenshots

No response

Additional context

No response