Final project description

Our project suggests Wikipedia pages related to a given page: the input is a Wikipedia page and the output is the names of related pages. We go through the text of the input page and collect every Wikipedia page it mentions, then do the same for each of those pages, and so on until a set number of pages has been visited. Pages that are mentioned many times along the way should be related to the input page. This can be seen as a network in which nodes are Wikipedia pages and there is an edge from node A to node B if page B is mentioned in page A. The goal is then to retrieve the most linked nodes that are reachable from the input node.
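The traversal described above can be sketched roughly as follows. This is only an illustration, not the code in graphConstructor.py: the `crawl` function, the `TOY_LINKS` dictionary and the `max_pages` parameter are hypothetical names, and a small hard-coded link map stands in for real Wikipedia pages so that the example runs without network access.

```python
from collections import Counter, deque

# Hypothetical link map standing in for real Wikipedia pages:
# each key is a page title, each value lists the pages it mentions.
TOY_LINKS = {
    "Graph theory": ["Leonhard Euler", "Network science", "PageRank"],
    "Leonhard Euler": ["Graph theory", "Mathematics"],
    "Network science": ["Graph theory", "PageRank"],
    "PageRank": ["Network science", "Google"],
    "Mathematics": [],
    "Google": ["PageRank"],
}


def crawl(start, get_links, max_pages):
    """Breadth-first crawl from `start`, visiting at most `max_pages` pages.

    Returns the directed edges found and how often each page was mentioned.
    """
    edges = []
    mentions = Counter()
    visited = set()
    queue = deque([start])
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        for target in get_links(page):
            edges.append((page, target))  # edge A -> B: B is mentioned in A
            mentions[target] += 1
            if target not in visited:
                queue.append(target)
    return edges, mentions


edges, mentions = crawl("Graph theory", lambda p: TOY_LINKS.get(p, []), max_pages=4)
print(mentions.most_common(3))
```

In a real run, `get_links` would be replaced by a function that downloads a page and extracts its Wikipedia links.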

Here is a short description of the steps we have to go through:

Be able to extract Wikipedia links from a page. ✅

Define a relevant criterion for when to stop going through pages (experimentally). Option: compute the number of occurrences of pages. ✅

Construct a network where nodes are pages and draw an edge from A to B if B is mentioned in page A. ✅

Compute the PageRank of every node. ✅

Cluster all pages using either K-Means or DBSCAN. ✅

Display the nodes with the highest PageRank in each cluster. ✅
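As a rough illustration of the last steps, the sketch below computes PageRank with a plain power iteration and then picks the highest-ranked page of each cluster. It is an assumption-laden toy, not the project's implementation: the function names, the damping factor of 0.85, the iteration count and the example graph are all chosen for illustration, and the cluster labels are assigned by hand instead of coming from K-Means or DBSCAN.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a list of directed (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def top_page_per_cluster(rank, clusters):
    """Return, for each cluster label, the page with the highest rank."""
    best = {}
    for page, cluster in clusters.items():
        if cluster not in best or rank[page] > rank[best[cluster]]:
            best[cluster] = page
    return best


# Toy graph and hand-assigned cluster labels (hypothetical data).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"),
         ("D", "C"), ("D", "E"), ("E", "D")]
clusters = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}

rank = pagerank(edges)
print(top_page_per_cluster(rank, clusters))
```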

User's guide:

First, make sure that all required packages are installed.
To install ForceAtlas2: pip install fa2
To install NetworkX: pip install networkx

Now you are ready to use our program.

The first step is to create the graph for your desired input and save it as a .gml file. To do so, run the graphConstructor.py file and fill in the inputs with your desired parameters.
The first parameter is the title of the Wikipedia page you want suggestions about. The second one is the depth limit (we recommend not exceeding 3; otherwise the program will take a very long time to run and the suggested pages might drift off topic from the input).
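For readers unfamiliar with the .gml format, here is a minimal sketch of saving and reloading a directed graph with NetworkX. The file name example_graph.gml and the three-node graph are made up for illustration; the files actually produced by graphConstructor.py may be named and structured differently.

```python
import os
import tempfile

import networkx as nx

# Tiny directed graph: an edge A -> B means page B is mentioned in page A.
graph = nx.DiGraph()
graph.add_edge("Graph theory", "PageRank")
graph.add_edge("PageRank", "Google")
graph.add_edge("Google", "PageRank")

# Hypothetical output path, placed in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example_graph.gml")
nx.write_gml(graph, path)

# Reload the graph from disk, as an analysis script would.
loaded = nx.read_gml(path)
print(loaded.number_of_nodes(), loaded.number_of_edges())
```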

Then run the graphAnalyzer.py file. This displays a list of all the suggested pages with their PageRank. It also creates an image called graph_with_layout.png in which you can identify the different clusters.

By default, two graphs are already provided as .gml files, which saves you the time of building them. If you want to use one of them, open graphAnalyzer.py and specify which file you want to analyze on line 127.

Warning: Please keep in mind that the DBSCAN algorithm does not produce very relevant output and that it might take up to 10 minutes to run. Therefore, we recommend using the K-Means algorithm (that is, giving False as input when running graphAnalyzer.py).

JulesBelveze / wikipedia-pages-suggestion
