matthieuvion / lmd_ukr

Custom dataset of Le Monde comments (236 k), covering one year of the war in Ukraine, plus the custom-made API used to collect it (subscriber credentials needed)

Dataset - Le Monde Guerre en Ukraine

License: MIT · Made with Python

Always the same boring English datasets.
Out of curiosity, and as an avid reader of Le Monde, here is a dataset collected from my favorite newspaper: one year of coverage of the Ukraine invasion (Feb 24, 2022 -> 2023), along with the tools used to build it.
You might also want to check the analysis I made of this data in the sibling repo, or access the rendered version.

Important: the data is collected and shared by me for educational & research purposes only; premium (subscriber-only) articles have been truncated to their first 2,500 characters.

Dataset


Download /dataset (compressed Parquet, 40 MB)
236 k comments and their associated articles (2 k unique), with title, content (truncated if premium), description & date
dataset structure

Remarks / limitations:

  • Articles are genuinely about the war in Ukraine, not mere mentions, thanks to a prior filter on article tags.
  • "Live" and blog-type articles are not collected; all other types are (editorials etc.).
  • Article authors (journalists) are purposely not collected.
  • No distinction is made between comments and replies to comments.
  • No comment timestamps; only the associated article's (last) publication date.

Workflow, things you might re-use


1. Data Collection

Custom API (lmd_ukr/api.py)
- To be seen as a good, but not top-tier (i.e. scalable), "one-shot project" API, shared as is
- Le Monde does not offer a public API, unlike the New York Times ;)
- Personal (subscriber) credentials are required, because comments are subscriber-only
- Built using httpx for requests & selectolax for parsing
- API usage examples with caching are available in lmd_ukr/examples; some documentation added in-code (rate limits etc.)

2. Dataset prep

- Check out lmd_ukr/build_sqlite_dataset.py and build_parquet_dataset.ipynb
- Parsed data is loaded into a SQLite db with two tables, articles and comments, sharing the key article_id
- This step was optional, but I wanted to refresh my skills, and it makes it easy to remove duplicates while building the db
- Formatting / cleaning done with Polars; I wanted to benchmark it against Pandas (cf. notebook)
- The final file is a joined, tidy articles-comments Parquet file.
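The two-table layout above can be sketched like this with the stdlib sqlite3 module; column names and the INSERT OR IGNORE dedup trick are my illustration of the idea, not the exact schema or logic of build_sqlite_dataset.py.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE articles (
        article_id TEXT PRIMARY KEY,
        title      TEXT,
        date       TEXT
    );
    CREATE TABLE comments (
        comment_id INTEGER PRIMARY KEY,
        article_id TEXT REFERENCES articles(article_id),
        content    TEXT
    );
""")

con.execute("INSERT INTO articles VALUES ('a1', 'Titre', '2022-02-24')")
# The PRIMARY KEY plus INSERT OR IGNORE drops duplicate articles at load time
con.execute("INSERT OR IGNORE INTO articles VALUES ('a1', 'Titre', '2022-02-24')")
con.executemany(
    "INSERT INTO comments (article_id, content) VALUES (?, ?)",
    [("a1", "premier commentaire"), ("a1", "deuxième commentaire")],
)

# Tidy join: one row per comment, with the article fields repeated
rows = con.execute(
    "SELECT a.article_id, a.title, c.content "
    "FROM articles a JOIN comments c USING (article_id)"
).fetchall()
```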

Data Usage


I created and shared this dataset for educational purposes only. I just wanted a French dataset, if possible from my favorite newspaper, on a topic I follow daily, instead of the same boring English datasets we're used to. It could be used for various natural language processing tasks:

  • Topic modeling
  • Troll detection (though there are not enough fields, in my opinion)
  • Generating summaries or headlines for articles (and comparing them to "desc", for instance)
  • Trend analysis & various generative tasks of your choice
  • (...)
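As a trivial, stdlib-only starting point for topic exploration, a word-frequency pass over the comments; the French stop-word list here is a stub of my own, not part of the repo.

```python
import re
from collections import Counter

# Illustrative stub -- use a real French stop-word list in practice
STOPWORDS = {"le", "la", "les", "de", "des", "et", "en", "un", "une"}


def top_terms(comments: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Most frequent non-stop-word tokens across a list of comments."""
    tokens = []
    for text in comments:
        tokens += [
            t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS
        ]
    return Counter(tokens).most_common(n)


comments = [
    "La guerre en Ukraine",
    "Ukraine et les sanctions",
    "sanctions de la guerre",
]
terms = top_terms(comments)
```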

