matthieuvion / lmd_ukr

Custom dataset of Le Monde comments (236 k), covering one year of the war in Ukraine, plus the custom-made API used to collect it (subscriber credentials needed)

Dataset - Le Monde Guerre en Ukraine

License: MIT · Made with Python

Always the same boring English datasets.
Out of curiosity, and as an avid reader of Le Monde, here is a dataset collected from my favorite newspaper: one year of coverage of the Ukraine invasion (Feb 24, 2022 -> 2023), along with the tools used to build it.
You might also want to check the analysis I made of this data in the sibling repo, or access the rendered version.

Important: the data is collected and shared by me for educational & research purposes only; premium (subscriber-only) articles have been truncated to their first 2,500 characters.

Dataset


Download /dataset (compressed Parquet, 40 MB)
236 k comments and their associated articles (2 k unique), with title, content (truncated if premium), description & date
dataset structure

Remarks / limitations:

  • Articles are genuinely about the war in Ukraine, not mere mentions, thanks to a prior filter on article tags.
  • "Live" and blog-type articles are not collected; all other types are (editorials etc.).
  • Article authors (journalists) are purposely not collected.
  • No distinction is made between comments and replies to comments.
  • No comment timestamps; only the associated article's (last) publication date.

Workflow, things you might re-use


1. Data Collection

Custom API (lmd_ukr/api.py)
- To be seen as a good, but not top-tier (i.e. scalable), "one-shot project" API, shared as is
- Le Monde does not offer a public API, unlike the New York Times ;)
- Personal (subscriber) credentials are required, because comments are subscriber-only
- Built using httpx for requests & selectolax for parsing
- API usage examples with caching are available in lmd_ukr/examples; some documentation added in-code (rate limits etc.)

2. Dataset prep

- Check out lmd_ukr/build_sqlite_dataset.py and build_parquet_dataset.ipynb
- Parsed data is loaded into a SQLite db with two tables, articles and comments, sharing the key article_id
- This step was optional, but I wanted to refresh my skills, and it makes it easy to remove duplicates while building the db
- Formatting / cleaning done with Polars; I wanted to benchmark it against Pandas (cf. notebook)
- The final file is a joined, tidy articles-comments Parquet file.
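The two-table layout above can be sketched like this with the stdlib sqlite3 module; column names and the INSERT OR IGNORE dedup trick are my illustration of the idea, not the exact schema or logic of build_sqlite_dataset.py.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE articles (
        article_id TEXT PRIMARY KEY,
        title      TEXT,
        date       TEXT
    );
    CREATE TABLE comments (
        comment_id INTEGER PRIMARY KEY,
        article_id TEXT REFERENCES articles(article_id),
        content    TEXT
    );
""")

con.execute("INSERT INTO articles VALUES ('a1', 'Titre', '2022-02-24')")
# The PRIMARY KEY plus INSERT OR IGNORE drops duplicate articles at load time
con.execute("INSERT OR IGNORE INTO articles VALUES ('a1', 'Titre', '2022-02-24')")
con.executemany(
    "INSERT INTO comments (article_id, content) VALUES (?, ?)",
    [("a1", "premier commentaire"), ("a1", "deuxième commentaire")],
)

# Tidy join: one row per comment, with the article fields repeated
rows = con.execute(
    "SELECT a.article_id, a.title, c.content "
    "FROM articles a JOIN comments c USING (article_id)"
).fetchall()
```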

Data Usage


I created and shared this dataset for educational purposes only. I just wanted a French dataset, if possible from my favorite newspaper, on a topic I follow daily, instead of the same boring English datasets we're used to. It could be used for various natural language processing tasks:

  • Topic modeling
  • Troll detection (though there are not enough fields, in my opinion)
  • Generating summaries or headlines for articles (and comparing them to "desc", for instance)
  • Trend analysis & various generative tasks of your choice
  • (...)
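As a trivial, stdlib-only starting point for topic exploration, a word-frequency pass over the comments; the French stop-word list here is a stub of my own, not part of the repo.

```python
import re
from collections import Counter

# Illustrative stub -- use a real French stop-word list in practice
STOPWORDS = {"le", "la", "les", "de", "des", "et", "en", "un", "une"}


def top_terms(comments: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Most frequent non-stop-word tokens across a list of comments."""
    tokens = []
    for text in comments:
        tokens += [
            t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS
        ]
    return Counter(tokens).most_common(n)


comments = [
    "La guerre en Ukraine",
    "Ukraine et les sanctions",
    "sanctions de la guerre",
]
terms = top_terms(comments)
```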

