mizvol / Wikipedia

Wikipedia trending topic detection

Python topic detection module for SparkWiki. The module computes graph statistics, clusters the graph, and assigns topics to the clusters of trending Wikipedia pages extracted using the Anomaly Detection Algorithm. The topic classification model is Language-Agnostic Topic Classification (see Credits). The module works with all language editions of Wikipedia.

Features

  • Compute degree, betweenness centrality and modularity for clustering the graph by events
  • Match wikipages with their Qids (unique Wikidata item IDs)
  • Match wikipages with their corresponding topics
  • Match wikipages with their pageviews
  • Save a new graph with these attributes
  • Plot the topic distribution of each cluster
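
The first feature can be illustrated with a minimal sketch. It assumes an undirected graph and uses NetworkX's built-in Louvain implementation (NetworkX 2.8 or later) rather than the python-louvain package listed under Pre-requisites, whose `community.best_partition` plays the same role. The function name `graph_statistics` and the toy graph are hypothetical and are not taken from the module.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity


def graph_statistics(graph, seed=0):
    """Return per-node degree, betweenness and cluster id, plus the modularity score."""
    degree = dict(graph.degree())
    # Betweenness centrality is normalized to [0, 1] by default.
    betweenness = nx.betweenness_centrality(graph)
    # Louvain community detection; each community is a set of nodes.
    communities = louvain_communities(graph, seed=seed)
    cluster_of = {node: idx for idx, nodes in enumerate(communities) for node in nodes}
    # Modularity of the partition that Louvain found.
    score = modularity(graph, communities)
    return degree, betweenness, cluster_of, score


# Toy graph (illustrative only): two triangles joined by the single edge 2-3.
toy = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
degree, betweenness, cluster_of, score = graph_statistics(toy)
print(degree[2], round(betweenness[2], 3), cluster_of, round(score, 3))
```

On real data, the graph would be loaded with `nx.read_gexf` from the peaks graph file instead of being built by hand.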

Pre-requisites

Python libraries
  • numpy, matplotlib, pandas, networkx, requests
  • community (provided by the python-louvain package)
    $ pip install python-louvain

Wikipedia graph

Get the graph from the SparkWiki project using the PeakFinder module.

Put the graph file into a local folder Python/Results/<Language>/<Language>_<date_start>_<date_end>.

Language: EN, FR, RU, etc.

Date format: YYYYMMDD

Graph file name format: peaks_graph_<date_start>_<date_end>.gexf

Example: Python/Results/EN/EN_20200316_20200331/peaks_graph_20200316_20200331.gexf
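
The naming convention above can be sketched as a small helper that builds the expected graph path and rejects malformed dates. The function name `graph_path` and its `root` parameter are hypothetical; only the folder and file name patterns come from this README.

```python
from datetime import datetime
from pathlib import Path


def graph_path(language, date_start, date_end, root="Python/Results"):
    """Build the expected location of the peaks graph for one run."""
    for value in (date_start, date_end):
        # Require exactly eight digits, then check it is a real calendar date.
        if len(value) != 8 or not value.isdigit():
            raise ValueError(f"expected YYYYMMDD, got {value!r}")
        datetime.strptime(value, "%Y%m%d")
    if date_start > date_end:
        raise ValueError("date_start must not be after date_end")
    lang = language.upper()
    period = f"{date_start}_{date_end}"
    return Path(root) / lang / f"{lang}_{period}" / f"peaks_graph_{period}.gexf"


print(graph_path("EN", "20200316", "20200331").as_posix())
```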

Usage

To run the whole pipeline on a graph whose file name and folder path follow the required format (cf. Pre-requisites), run the following command in the terminal:

$ python main.py EN 20200316 20200331

The pipeline can also be run partially. To do that, pass an optional fourth argument from 1 to 6 to run only the part of the pipeline corresponding to one of the features listed in the table below:

$ python main.py EN 20200316 20200331 1

Parameter value   Description
0                 Default: run the whole pipeline
1                 Compute degree, betweenness centrality and modularity
2                 Match Qids
3                 Match topics
4                 Match pageviews
5                 Save graph attributes
6                 Plot the topic distribution per cluster
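
The command-line interface described above could be parsed along the following lines. This is a sketch under stated assumptions and not the actual contents of main.py: the names `STEPS`, `parse_args` and `steps_to_run` are hypothetical, and the sketch assumes that value 0 means "run steps 1 to 6 in order".

```python
import argparse

# Step numbers and descriptions, taken from the table in this README.
STEPS = {
    0: "run the whole pipeline (default)",
    1: "compute degree, betweenness centrality and modularity",
    2: "match Qids",
    3: "match topics",
    4: "match pageviews",
    5: "save graph attributes",
    6: "plot the topic distribution per cluster",
}


def parse_args(argv=None):
    """Parse: <language> <date_start> <date_end> [step]."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("language", help="language edition, e.g. EN, FR, RU")
    parser.add_argument("date_start", help="start date, YYYYMMDD")
    parser.add_argument("date_end", help="end date, YYYYMMDD")
    parser.add_argument("step", nargs="?", type=int, default=0,
                        choices=sorted(STEPS), help="pipeline step to run (0 = all)")
    return parser.parse_args(argv)


def steps_to_run(step):
    """Expand the step argument into the list of step numbers to execute."""
    return list(range(1, 7)) if step == 0 else [step]


args = parse_args(["EN", "20200316", "20200331", "3"])
for number in steps_to_run(args.step):
    print(number, STEPS[number])
```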

Alternatively, one can run the Topics_exctraction.ipynb notebook. The notebook also includes the code that generates the visualisations.

Results

Every stage of the pipeline generates and saves a .csv file with corresponding results.

The final step creates a /Figures folder containing figures of the topic distribution per cluster.

Also, the final stage creates a graph file with all the computed attributes: filled_graph.gexf
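
As a rough illustration of what such a file involves, the sketch below attaches attributes to the nodes of a small graph and writes it to GEXF with NetworkX. The node names, topic labels and attribute names (`degree`, `modularity_class`, `topic`) are hypothetical; the attribute names actually stored in filled_graph.gexf may differ.

```python
import tempfile
from pathlib import Path

import networkx as nx

# Toy graph with made-up page names (illustrative only).
graph = nx.Graph([("Page A", "Page B"), ("Page B", "Page C")])

# Attach per-node attributes; each mapping goes from node to value.
nx.set_node_attributes(graph, dict(graph.degree()), name="degree")
nx.set_node_attributes(
    graph, {"Page A": 0, "Page B": 0, "Page C": 1}, name="modularity_class"
)
nx.set_node_attributes(
    graph, {"Page A": "Culture", "Page B": "Culture", "Page C": "STEM"}, name="topic"
)

# Write the graph to a temporary folder, then read it back to check the attributes.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "filled_graph.gexf"
    nx.write_gexf(graph, str(out))
    reloaded = nx.read_gexf(str(out))

print(reloaded.nodes["Page B"])
```

Attributes saved this way appear as node columns in Gephi, where one of them can be selected as the partitioning attribute.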

In order to explore the detected topics, the graph can be visualized in Gephi. We used Circle Pack Layout with modularity class as a partitioning attribute.

Tests

Wikipedia graphs of trending pages are available in Python/Results for the periods 16/08/2018 to 31/12/2018 and 17/12/2019 to 15/04/2020, for the EN, FR and RU language editions.

The notebook Topic_comparison.ipynb gives a topic comparison between the EN, FR and RU language editions. The figures are saved in Python/Comparison_figures.

Gephi files representing the graphs are also located in the /Gephi folder.

Examples

Here you can see a visual example. The animation shows trending topics for the last four months of 2018. The graph visualization illustrates the graph computed for the period 1-15 March 2020.

[Animation: Topics comparison]
[Figure: Gephi graph example (EN_20200301_20200315)]

Credits

Wikipedia trending topics detection: SparkWiki

Clustering of trending pages: Community detection

Topic classification model: Language-Agnostic Topic Classification

Languages

Jupyter Notebook 98.0%, Python 2.0%