fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works

NewsPlease.from_urls behaves inconsistently in situations where a url results in 404

loganamcnichols opened this issue · comments

Mandatory

  • I read the documentation (readme and wiki).
  • I searched other issues (including closed issues) and could not find any to be related. If you find related issues post them below or directly add your issue to the most related one.
  • I confirm that this bug report does not report on a specific news site where news-please does not work. Please keep in mind that news-please is a generic crawler so it is expected that it will not work for all sites well or even at all.

Describe the bug
NewsPlease.from_urls behaves inconsistently in situations where a url results in 404, and it does not behave as its docstring suggests.

  1. If passed a single url which results in 404, it returns an empty dictionary.
  2. If passed multiple urls, one of which results in 404, it throws an error.

To Reproduce

from newsplease import NewsPlease

url_1 = "https://channelnomics.com/2018/03/two-realities-truth-and-fact-and-theyre-not-the-same/"
url_2 = "https://www.washingtonpost.com/politics/2020/12/04/trump-spending-properties/"
print(NewsPlease.from_urls([url_1]))
print(NewsPlease.from_urls([url_2]))
print(NewsPlease.from_urls([url_1, url_2]))

Expected behavior

not a 200 response: 404
{"https://channelnomics.com/2018/03/two-realities-truth-and-fact-and-theyre-not-the-same/": None}
{'https://www.washingtonpost.com/politics/2020/12/04/trump-spending-properties/': <NewsArticle.NewsArticle object>}
not a 200 response: 404
{"https://channelnomics.com/2018/03/two-realities-truth-and-fact-and-theyre-not-the-same/": None,
'https://www.washingtonpost.com/politics/2020/12/04/trump-spending-properties/': <NewsArticle.NewsArticle object>}
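Until the library behaves this way itself, the expected output above can be approximated on the caller's side. The following is only a minimal sketch, not news-please's own code: `fetch_all_or_none` and `fake_fetch` are hypothetical names introduced for illustration, and the fetcher is passed in as a plain callable so the example runs without network access.

```python
from typing import Any, Callable, Dict, Iterable, Optional


def fetch_all_or_none(
    urls: Iterable[str],
    fetch_one: Callable[[str], Any],
) -> Dict[str, Optional[Any]]:
    """Map every requested URL to its article, or to None if fetching failed.

    Hypothetical helper: `fetch_one` is any callable that takes a URL and
    returns an article object, returns None / {}, or raises on failure.
    """
    results: Dict[str, Optional[Any]] = {}
    for url in urls:
        try:
            article = fetch_one(url)
        except Exception as exc:
            # Broad on purpose: one failing URL must not abort the whole batch.
            print(f"skipping {url}: {exc}")
            article = None
        # Normalize an empty dict (the single-URL 404 case) to None.
        if isinstance(article, dict) and not article:
            article = None
        results[url] = article
    return results


if __name__ == "__main__":
    # Stand-in fetcher (an assumption for the demo), simulating one 404.
    def fake_fetch(url: str) -> str:
        if "missing" in url:
            raise RuntimeError("not a 200 response: 404")
        return f"article for {url}"

    print(fetch_all_or_none(
        ["https://example.com/ok", "https://example.com/missing"],
        fake_fetch,
    ))
```

With news-please installed, `NewsPlease.from_url` could presumably be passed as `fetch_one`. Whether that call returns None or raises for a 404 in a given version is not verified here, which is why the wrapper handles both. Fetching one URL at a time also gives up any batching that `from_urls` may do internally.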

Log

not a 200 response: 404
{}
{'https://www.washingtonpost.com/politics/2020/12/04/trump-spending-properties/': <NewsArticle.NewsArticle object at 0x7f21e9364c50>}
not a 200 response: 404
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/runpy.py", line 198, in _run_module_as_main
    return _run_code(code, main_globals, None,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/ubuntu/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/code/autocast/autocast_experiments/data/test.py", line 7, in <module>
    print(NewsPlease.from_urls([url_1, url_2]))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newsplease/__init__.py", line 145, in from_urls
    results[url] = NewsPlease.from_html(results[url], url, download_date)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newsplease/__init__.py", line 103, in from_html
    item = extractor.extract(item)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newsplease/pipeline/extractor/article_extractor.py", line 63, in extract
    article_candidate = extractor.extract(item)
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newsplease/pipeline/extractor/extractors/newspaper_extractor.py", line 36, in extract
    article.parse()
  File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newspaper/article.py", line 191, in parse
    self.throw_if_not_downloaded_verbose()
  File "/home/ubuntu/miniconda3/envs/autocast/lib/python3.11/site-packages/newspaper/article.py", line 531, in throw_if_not_downloaded_verbose
    raise ArticleException('Article `download()` failed with %s on URL %s' %
newspaper.article.ArticleException: Article `download()` failed with No connection adapters were found for '://' on URL ://

Versions (please complete the following information):

  • OS: Ubuntu 22.03
  • Python Version: 3.11
  • news-please: 1.5.33

Intent (optional; we'll use this info to prioritize upcoming tasks to work on)

  • personal
  • academic
  • business
  • other
  • Some information on your project:
    I am working on training an LLM to make accurate probabilistic forecasts on forecasting-tournament-style questions.