This assignment is part of the course DATA 512 - Human Centered Data Science.
The goal of this assignment is to explore the concept of bias in data using Wikipedia articles.
We perform an analysis of the number of articles and their qualities across different countries and regions.
Wikimedia API portal can be accessed here.
Content accessed via this API is licensed under the CC-BY-SA 3.0 and GFDL licenses.
The terms and conditions for using this API are available at https://www.mediawiki.org/wiki/REST_API#Terms_and_conditions
The API documentation is available at https://wikimedia.org/api/rest_v1/
Here's the link to API:Info that is used to get revision id
Here's the link to ORES that is used to predict the quality of the article.
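As a rough illustration of how these two services fit together, the sketch below builds an API:Info query URL for an article title, reads the latest revision id from the response, and then builds an ORES `articlequality` query for that revision. It is a minimal sketch, not the notebook's actual code: the helper names are hypothetical, the endpoints are assumed to be the public English Wikipedia Action API and the ORES v3 scores endpoint, and the sample responses are hand-written stubs with made-up ids rather than real API output. Sending the HTTP requests is left out.

```python
from urllib.parse import urlencode

# Assumed endpoints (not stated in this README).
WIKI_API = "https://en.wikipedia.org/w/api.php"
ORES_API = "https://ores.wikimedia.org/v3/scores/enwiki"


def info_url(title):
    """Build an API:Info query URL for a single article title."""
    params = {"action": "query", "prop": "info", "titles": title, "format": "json"}
    return WIKI_API + "?" + urlencode(params)


def extract_revid(info_response):
    """Return the latest revision id from an API:Info response, or None if the page is missing."""
    pages = info_response["query"]["pages"]
    page = next(iter(pages.values()))  # one title was requested, so one page entry
    return page.get("lastrevid")


def ores_url(revid):
    """Build an ORES query URL asking the articlequality model about one revision."""
    params = {"models": "articlequality", "revids": revid}
    return ORES_API + "?" + urlencode(params)


def extract_quality(ores_response, revid):
    """Return the predicted quality class for a revision, or None if no score is present."""
    entry = ores_response["enwiki"]["scores"][str(revid)]["articlequality"]
    return entry.get("score", {}).get("prediction")


# Hand-written stub responses with made-up ids, for illustration only.
sample_info = {"query": {"pages": {"111": {"title": "Example", "lastrevid": 222}}}}
sample_ores = {
    "enwiki": {"scores": {"222": {"articlequality": {"score": {"prediction": "Stub"}}}}}
}

revid = extract_revid(sample_info)
quality = extract_quality(sample_ores, revid)
print(info_url("Example"), ores_url(revid), quality)
```

Keeping URL construction and response parsing in separate functions makes it easy to test the parsing offline and to swap in whichever HTTP client the notebook uses.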
There are two inputs used by the code in this repository.
The list of politicians is present in the file input/politicians_by_country_SEPT_2022.csv
World population data is present in the file input/population_by_country_2022.csv
The following data files are generated by the notebook.
- data/wp_politicians_by_country.csv - stores the merged dataframe created from politicians and population dataframe
- data/politicians_with_revid_quality.csv - stores the revision_id and quality data in politicians dataframe
- result/wp_countries-no_match.txt - lists the countries that had no corresponding population data or had a population of zero
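The relationship between the merged file and the no-match file can be sketched roughly as follows. This is a dependency-free illustration, not the notebook's implementation (which works with dataframes): the column names `name`, `country`, and `population` are assumptions, the rows are placeholders, and only politician countries without usable population data are treated as unmatched.

```python
def merge_with_population(politicians, population):
    """Join politician rows to population rows on country.

    Returns (merged_rows, no_match_countries). A country counts as
    unmatched when it has no population entry or a population of zero.
    """
    pop_by_country = {row["country"]: row["population"] for row in population}
    merged = []
    no_match = set()
    for row in politicians:
        pop = pop_by_country.get(row["country"])
        if not pop:  # None (missing) or 0
            no_match.add(row["country"])
            continue
        merged.append({**row, "population": pop})
    return merged, sorted(no_match)


# Placeholder rows; the column names are assumed, not taken from the input files.
politicians = [
    {"name": "Politician 1", "country": "Country A"},
    {"name": "Politician 2", "country": "Country B"},
    {"name": "Politician 3", "country": "Country C"},
]
population = [
    {"country": "Country A", "population": 2_000_000},
    {"country": "Country B", "population": 0},
]

merged, no_match = merge_with_population(politicians, population)
print(merged)
print(no_match)
```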
Clone this repo using
git clone git@github.com:abhishekiitm/data-512-homework_2.git
cd data-512-homework_2
First install the necessary Python libraries in a virtual environment by executing the following steps in the Terminal (assuming you are running Linux):
$ virtualenv hw2_env
$ source hw2_env/bin/activate
Then install the libraries using
$ pip install -r requirements.txt
Execute the notebook notebooks/analysis.ipynb
using your choice of notebook environment (Jupyter Notebook or VS Code extension)
I learned that the dataset shows large differences in the number of Wikipedia articles about politicians per million population. At the country level, the reason for this variation isn't obvious. At the region level, however, regions where English is widely spoken appear to have more articles per capita. This bias could arise because we only look at English-language articles. Second, poorer countries with lower internet penetration could also have fewer Wikipedia articles about their politicians. The variation was larger at the country level, which could be due to the small populations of some countries.
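The effect of small populations on the per-capita figure can be shown with a short worked example. The function name and all numbers below are hypothetical and chosen only to illustrate the arithmetic; they are not results from the dataset.

```python
def articles_per_million(article_count, population):
    """Number of articles per one million inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return article_count * 1_000_000 / population


# Hypothetical countries: one small, one large.
small = articles_per_million(50, 100_000)          # 500.0
large = articles_per_million(5_000, 500_000_000)   # 10.0

# Adding a single article moves the small country's figure far more.
small_delta = articles_per_million(51, 100_000) - small
large_delta = articles_per_million(5_001, 500_000_000) - large
print(small, large, small_delta, large_delta)
```

Because one extra article shifts the small country's rate by 10 per million but the large country's by only 0.002, country-level rates for small populations are much noisier than region-level rates.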
Q. What (potential) sources of bias did you discover in the course of your data processing and analysis?
One potential source of bias is the language spoken in a country. If a country has more English speakers, it is likely to have more articles about its politicians. This could explain why Northern Europe and Oceania have more articles per million, whereas East Asia has the fewest.
Another potential source of bias could be internet penetration. People in countries with higher internet penetration are likely to be more aware of Wikipedia and to care more about having Wikipedia entries for the politicians they know.
Q. What might your results suggest about (English) Wikipedia as a data source?
The results suggest that English Wikipedia is likely to have more data about politicians from English-speaking countries and less about politicians from other countries. Researchers, developers, and policymakers designing downstream applications using this data should be aware of this limitation.
Q. What might your results suggest about the internet and global society in general?
The results suggest the existence of strong regional bubbles that form naturally due to language barriers. As a result, there would be limited interaction between certain groups of people which could limit the understanding people have of cultures that are different from theirs. This might in turn perpetuate and reinforce biases that people may have about foreign cultures.