GoogleCloudPlatform / professional-services

Common solutions and tools developed by Google Cloud's Professional Services team. This repository and its contents are not an officially supported Google product.

asset-inventory: job stays running for many hours

sdenef-adeo opened this issue

Hello,

We faced an issue yesterday with a Dataflow job using the asset-inventory feature.

The job stayed running for more than 10 hours. It usually takes about 2 hours to run.
We canceled the job and ran it again. The second run finished successfully in 2 hours.

We got many log entries like the following. We don't usually see them, but I can't be sure they are the root cause:

Operation ongoing for over 300.19 seconds in state process-msecs in step s8 . Current Traceback:
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/start.py", line 144, in <module>
    main()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/start.py", line 140, in main
    batchworker.BatchWorker(properties, sdk_pipeline_options).run()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 844, in run
    deferred_exception_details=deferred_exception_details)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
    work_executor.execute()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute
    op.start()
  File "./asset_inventory/import_pipeline.py", line 150, in add_input
    resource_schema = self.element_to_schema(element)
  File "./asset_inventory/import_pipeline.py", line 147, in element_to_schema
    'iam_policy' in element)
  File "/usr/local/lib/python3.7/site-packages/asset_inventory/api_schema.py", line 451, in bigquery_schema_for_resource
    discovery_doc_url)
  File "/usr/local/lib/python3.7/site-packages/asset_inventory/api_schema.py", line 84, in _get_discovery_document_versions
    dd = cls._get_discovery_document(dd_url)
  File "/usr/local/lib/python3.7/site-packages/asset_inventory/api_schema.py", line 44, in _get_discovery_document
    response = requests.get(dd_url)
  File "/usr/local/lib/python3.7/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.7/http/client.py", line 1369, in getresponse
    response.begin()
  File "/usr/local/lib/python3.7/http/client.py", line 310, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.7/http/client.py", line 271, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)

Are you able to confirm that this is the root cause?
If it is, can you explain why it kept our job running for many hours yesterday?

Could you update the asset-inventory feature to avoid this kind of problem? For example, could the job be made to fail after X minutes instead of continuing to run?
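
To illustrate the kind of limit I mean: the traceback ends inside `requests.get(dd_url)`, blocked on a socket read, and `requests` applies no timeout unless one is passed, so a stalled connection could in principle wait indefinitely. Below is a minimal sketch of a bounded fetch. It uses only the standard library so that it is self-contained; the function name `fetch_discovery_document` and the 30-second default are assumptions for illustration, not the actual asset-inventory code. With `requests`, the equivalent would be passing `timeout=` to `requests.get`.

```python
import json
import urllib.request


def fetch_discovery_document(url, timeout_seconds=30.0):
    """Fetch and parse a JSON document, failing fast if the server stalls.

    Hypothetical helper for illustration only. The timeout bounds each
    blocking socket operation (connect, read); it is not a limit on the
    total download time.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return json.loads(response.read().decode("utf-8"))
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        raise RuntimeError(
            "fetching %s failed or timed out after %ss" % (url, timeout_seconds)
        ) from exc
```

With something like this, a stalled request would raise an error after the timeout, and the work item could fail or be retried instead of hanging.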

Best regards,
Sébastien