Overview

This repository contains keyword blacklists and lists of other content, such as URLs and images, used to trigger censorship in apps used in China. With the exception of WeChat, these lists were obtained by reverse engineering and are exhaustive lists of the keywords used to trigger censorship on these platforms.

The full details on data collection and analysis methods and results are available below.

Chat apps

The research below tracks daily changes to censorship in three chat apps used in China: TOM-Skype, Sina UC, and LINE. Overall, our chat app data consists of over 4,000 blacklisted keywords.

Data: TOM-Skype and Sina UC, LINE
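
As a rough illustration of what tracking daily changes involves, the sketch below compares two snapshots of a blacklist and reports which keywords were added and removed. It is a minimal sketch, not the collection code used in this research: the function name `diff_blacklists` and the placeholder keywords are assumptions, and it assumes each snapshot has already been loaded as a list of strings.

```python
def diff_blacklists(previous, current):
    """Return (added, removed) keywords between two blacklist snapshots."""
    prev_set, curr_set = set(previous), set(current)
    added = sorted(curr_set - prev_set)    # in the new snapshot only
    removed = sorted(prev_set - curr_set)  # in the old snapshot only
    return added, removed


# Hypothetical snapshots from two consecutive days (placeholder keywords,
# not taken from the datasets in this repository).
day_1 = ["keyword_a", "keyword_b", "keyword_c"]
day_2 = ["keyword_b", "keyword_c", "keyword_d"]

added, removed = diff_blacklists(day_1, day_2)
print("added:", added)      # added: ['keyword_d']
print("removed:", removed)  # removed: ['keyword_a']
```

Using sets makes the comparison independent of keyword order within a snapshot, so a list that is merely reordered between days would show no changes.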

Live-streaming apps

The research below tracks hourly changes to censorship in three live-streaming apps in China (YY, Sina Show, and 9158) and documents the keywords censored by GuaGua, which does not include a mechanism for downloading updates to its censorship blacklists. Overall, our live-streaming data consists of over 20,000 blacklisted keywords.

Data: Original live-streaming data (2015), Updated live-streaming data (2017)

Mobile games

Our research on mobile games analyzes domestic Chinese games as well as international games that have been altered to comply with Chinese regulations. Overall, we found hundreds of mobile games performing censorship, collectively censoring over 100,000 unique blacklisted keywords.

Measuring Decentralization of Chinese Keyword Censorship via Mobile Games

Data: Mobile games

Open source projects

This research analyzes Chinese censorship in open source projects. We extracted over 1,000 Chinese keyword blacklists from open source projects on GitHub, collectively spanning over 200,000 unique blacklisted keywords.

The effect of information controls on developers in China: An analysis of censorship in Chinese open source projects

Data: Open source blacklists

WeChat

Our research on WeChat censorship uses sample testing to determine what type of content, such as words, URLs, and images, can be communicated over the platform and which content is censored. We have studied what categorical content WeChat generally filters in addition to what content WeChat filters in response to specific events.

Data: Keywords and URLs (November 2016), 709 Crackdown keywords and images (April 2017), Liu Xiaobo keywords and images (July 2017), 19th Party Congress keywords (November 2017), Image filtering test data (May 2018)

Apple engravings

Our research measuring Apple's filtering of product engravings uses sample testing to discover keywords that cannot be engraved in each of six different regions: United States, Canada, Japan, Taiwan, Hong Kong, and mainland China.

Engrave Danger: An Analysis of Apple Engraving Censorship across Six Regions

Data: Keyword filtering rules

QQMail

On Tencent's QQMail, we found that the presence of certain combinations of keywords in an email message triggers censorship of that message. However, the presence of other combinations, which we call extenuating combinations, deactivates the censorship of some censored keywords.
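
The interplay between censored and extenuating combinations can be sketched as follows. This is a simplified model under stated assumptions, not QQMail's actual implementation: it assumes plain substring matching, assumes each censored combination carries its own list of extenuating combinations, and uses placeholder keywords. The names `contains_all`, `is_censored`, and `rules` are hypothetical.

```python
def contains_all(text, combination):
    """True if every keyword in the combination occurs somewhere in the text."""
    return all(keyword in text for keyword in combination)


def is_censored(text, rules):
    """Apply hypothetical rules: a list of (censored_combo, extenuating_combos)."""
    for censored_combo, extenuating_combos in rules:
        if not contains_all(text, censored_combo):
            continue
        # A triggered combination is deactivated if any of its
        # extenuating combinations is also fully present.
        if any(contains_all(text, ext) for ext in extenuating_combos):
            continue
        return True
    return False


# Placeholder rules, not taken from the dataset.
rules = [
    (("alpha", "beta"), [("gamma",)]),  # censored unless "gamma" also appears
    (("delta",), []),                   # censored whenever "delta" appears
]

print(is_censored("alpha then beta", rules))            # True
print(is_censored("alpha then beta and gamma", rules))  # False
print(is_censored("only alpha", rules))                 # False
print(is_censored("gamma with delta", rules))           # True
```

In this model an extenuating combination only deactivates the censored combinations it is attached to, which is why the last message is still censored even though it contains "gamma".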

Measuring QQMail’s automated email censorship in China

Data: Censored and extenuating keyword combinations

Keyword content analysis

Datasets include raw keyword lists collected from the applications. Many also include processed data, such as translations and categorizations of keywords. Keywords were translated to English using a combination of machine and human translation. Interpreting these translations with contextual information, we coded each keyword into content categories grouped under six general themes according to a code book.

License

All data is provided under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0).
