google / wikiloop-analysis

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Wikipedia Cross Edit Pattern Exploration

This repository is created to host exploratory code for the Wikipedia Revision Dataset, as part of the Cross Edit Pattern Detection Project.
This repository is forked from a Google corporate repository and will push changes regularly.
Author: Haoran Fei (haoranfei@google.com)
Host: Zainan Zhou (zzn@google.com)
Date: June 8th, 2020

Open-Source Dependencies and Licensing

Python3: GPL-Compatible License. GPL-compatible doesn’t mean that we’re distributing Python under the GPL. All Python licenses, unlike the GPL, let you distribute a modified version without making your changes open source.
Pandas: New BSD License. Matplotlib: License based on PSF license.

Usage Example

Loading the First json data file and run article-based analysis:
$ python3 article_analytics.py --path ./data/cross_edits_tmp_ttl=72_revisioninfo_20200605_1023_segment-000##-of-00037.json --start 0 --stop 1
Loading all 37 json data files and run article-based analysis:
$ python3 article_analytics.py --path ./data/1023/segment-000##-of-00037.json --start 0 --stop 37
Loading the First json data file and run author-based analysis:
$ python3 author_analytics.py --path ./data/cross_edits_tmp_ttl=72_revisioninfo_20200605_1023_segment-000##-of-00037.json --start 0 --stop 1
Loading all 37 json data files and run author-based analysis:
$ python3 author_analytics.py --path ./data/1023/segment-000##-of-00037.json --start 0 --stop 37

Formula for Sliding Window Anomaly Detection

A window will be flagged as anomaly if it satisfies the following condition:

M: metric considerd. Currently supports mean and median.
W: the window frame under consideration.
S: the complete dataset of the given key. This can be all edits on the same article/by the same author, depending on the key used.
k: value is either 1 or -1. It is 1 if we are concerned with abnormally high values only, and -1 if we are concerned with abnormally low values only.
t: a percentage threshold for flagging anomal. Currently set at 50%.

Log Files and Format

All log files are located in the cross-edits-analysis/log directory. Each directory holds the logs for the corresponding analysis script.
Format of log line: Anomaly of (metric name) of (column name) detected for (key: this can be article/author or article/author pair) during period from (starting time of window) to (ending time of window), with a () percent difference from baseline.

About

License:Apache License 2.0


Languages

Language:Python 100.0%