Readme

Project Description

This assignment is part of the course DATA 512 - Human Centered Data Science.

The goal of this assignment is to explore the concept of bias in data using Wikipedia articles.
We analyze the number of articles and their quality across different countries and regions.

License

Wikimedia API portal can be accessed here.

Content accessed via this API is licensed under the CC-BY-SA 3.0 and GFDL licenses.

The terms and conditions for using this API are here - https://www.mediawiki.org/wiki/REST_API#Terms_and_conditions

The API documentation can be found here - https://wikimedia.org/api/rest_v1/

API:Info is used to get the revision id of each article.

ORES is used to predict the quality of each article.
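
As a rough sketch of how these two lookups might fit together, the snippet below builds the request URLs and pulls the relevant fields out of the JSON responses. The endpoint URLs, the `articlequality` model name, the helper names, and the assumed response layout are illustrative assumptions and may differ from what the notebook does. No network request is made here.

```python
from urllib.parse import urlencode

# Assumed endpoints; check the API documentation before relying on them.
MEDIAWIKI_API = "https://en.wikipedia.org/w/api.php"
ORES_API = "https://ores.wikimedia.org/v3/scores/enwiki"


def info_url(title):
    """Build a URL asking the MediaWiki API for basic info about one page."""
    params = {"action": "query", "prop": "info", "titles": title, "format": "json"}
    return MEDIAWIKI_API + "?" + urlencode(params)


def ores_url(revid):
    """Build a URL asking ORES for an article quality score for one revision."""
    params = {"models": "articlequality", "revids": revid}
    return ORES_API + "?" + urlencode(params)


def extract_revid(info_response):
    """Return the latest revision id from an info response, or None if absent."""
    pages = info_response.get("query", {}).get("pages", {})
    for page in pages.values():
        if "lastrevid" in page:
            return page["lastrevid"]
    return None


def extract_quality(ores_response, revid):
    """Return the predicted quality class for revid, or None if no score exists."""
    scores = ores_response.get("enwiki", {}).get("scores", {})
    score = scores.get(str(revid), {}).get("articlequality", {}).get("score")
    return score["prediction"] if score else None
```

Returning None for a missing revision id or score lets the caller record which articles could not be scored instead of failing partway through the list.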

Input Files

There are two inputs used by the code in this repository.

The list of politicians is present in the file input/politicians_by_country_SEPT_2022.csv
World population data is present in the file input/population_by_country_2022.csv

Files Generated

The following data files are generated by the notebook.

data/wp_politicians_by_country.csv - stores the merged dataframe created from the politicians and population dataframes
data/politicians_with_revid_quality.csv - stores the politicians dataframe with the revision_id and quality data added
result/wp_countries-no_match.txt - lists countries that had no corresponding population data or had zero population
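
The merge and the no-match list can be sketched in plain Python as follows. This is only an illustration of the idea: the notebook works with dataframes, and the function name, the `country` and `population` keys, and the sample rows are hypothetical.

```python
def merge_with_population(politicians, population):
    """Attach a population value to each politician row by country name.

    politicians: list of dicts, each with a "country" key (assumed layout)
    population:  dict mapping country name to population (assumed layout)
    Returns (merged_rows, no_match_countries).
    """
    merged = []
    no_match = set()
    for row in politicians:
        pop = population.get(row["country"])
        if pop:
            merged.append({**row, "population": pop})
        else:
            # Covers both a missing country (None) and a zero population.
            no_match.add(row["country"])
    return merged, sorted(no_match)


# Made-up rows, for illustration only.
politicians = [
    {"name": "A", "country": "Xland"},
    {"name": "B", "country": "Yland"},
    {"name": "C", "country": "Zland"},
]
population = {"Xland": 5.0, "Zland": 0.0}

merged, no_match = merge_with_population(politicians, population)
```

Here `merged` keeps only the Xland row, while `no_match` holds Yland (no population entry) and Zland (zero population).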

Running the code

Clone this repo using

git clone git@github.com:abhishekiitm/data-512-homework_2.git
cd data-512-homework_2

First install the necessary Python libraries in a virtual environment by executing the following steps in the Terminal (assuming you are running Linux):

$ virtualenv hw2_env  
$ source hw2_env/bin/activate

Then install the libraries using

$ pip install -r requirements.txt

Execute the notebook notebooks/analysis.ipynb using your choice of notebook environment (Jupyter Notebook or the VS Code extension).

Research Implications

I learned that the dataset shows large differences in the number of Wikipedia articles about politicians per million population. At the country level, the reason for this variation is not obvious. At the region level, however, English-speaking regions appear to have more articles per million population. This bias could arise because we only look at English-language articles. In addition, poorer countries with lower internet penetration could have fewer Wikipedia articles about their politicians. The variation is larger at the country level, and part of it could be due to countries with small populations.
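
To make the metric concrete, here is a small sketch of the articles-per-million calculation. The function name and the numbers are made up for illustration and are not taken from the dataset.

```python
def articles_per_million(article_count, population):
    """Number of articles per one million people (population is a head count)."""
    return article_count * 1_000_000 / population


# Hypothetical countries, for illustration only.
large_country = articles_per_million(50, 2_000_000)  # 25.0
small_country = articles_per_million(10, 100_000)    # 100.0
```

The second case shows how a country with a small population can score high on this metric with only a handful of articles, which is one way small populations can inflate the country-level variation.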

Q. What (potential) sources of bias did you discover in the course of your data processing and analysis?

One potential source of bias is the language spoken in a country. If a country has more English speakers, English Wikipedia is likely to have more articles about that country's politicians. This could explain why Northern Europe and Oceania have more articles per million population, whereas East Asia has the fewest.
Another potential source of bias is internet penetration. People in countries with higher internet penetration are likely to be more aware of Wikipedia and to care more about having Wikipedia entries for the politicians they know.

Q. What might your results suggest about (English) Wikipedia as a data source?

The results suggest that English Wikipedia is likely to have more data about politicians from English-speaking countries and less about politicians from other countries. Researchers, developers, and policymakers designing downstream applications that use this data should be aware of this limitation.

Q. What might your results suggest about the internet and global society in general?

The results suggest the existence of strong regional bubbles that form naturally due to language barriers. As a result, there would be limited interaction between certain groups of people which could limit the understanding people have of cultures that are different from theirs. This might in turn perpetuate and reinforce biases that people may have about foreign cultures.
