RoseMatcher is an approach that automatically matches colloquially written user reviews with technically written release notes and identifies the relevant matching pairs. It bridges the language gap between the two styles of text and greatly improves the hit ratio of the matching pairs.
This repository provides public access to the raw data of RoseMatcher.
App | Category | App Release Time | First Update Time | Ranking Within Category |
---|---|---|---|---|
| | News | 2016/04/07 | 2016/04/20 | 2 |
#Spotify | Music | 2011/07/04 | 2015/05/20 | 1 |
#Pandora | Music | 2013/09/18 | 2014/06/12 | 2 |
ZOOM | Business | 2012/08/15 | 2013/12/18 | 1 |
Microsoft Teams | Business | 2016/11/02 | 2016/11/08 | 2 |
#SHEIN | Shopping | 2014/05/20 | 2014/07/21 | 2 |
Google Chrome | Utilities | 2012/06/28 | 2014/10/27 | 2 |
TestFlight | Developer Tools | 2014/07/23 | 2014/07/26 | 1 |
Github | Developer Tools | 2020/03/17 | 2020/03/20 | 2 |
| | Photo & Video | 2010/10/06 | 2015/09/15 | 2 |
Gmail | Productivity | 2011/11/02 | 2012/07/31 | 1 |
Google Drive | Productivity | 2012/06/28 | 2014/01/28 | 2 |
* All data are sourced from the Apple App Store.
* Apps whose names start with # are the evaluation data used in our paper (the same applies below).
- Data collection was performed in February 2022.
- All user reviews and release notes of the apps for the 5-year period from January 1st, 2017 to January 1st, 2022 were crawled.
App Name | Release Num | Release Sentence Num | Review Num | Review Sentence Num |
---|---|---|---|---|
| | 233 | 450 | 62,598 | 129,545 |
#Spotify | 182 | 506 | 368,243 | 808,441 |
#Pandora | 115 | 235 | 107,241 | 241,497 |
ZOOM | 104 | 878 | 47,799 | 95,528 |
Microsoft Teams | 154 | 379 | 9,233 | 21,567 |
#SHEIN | 161 | 361 | 45,516 | 117,529 |
Google Chrome | 77 | 359 | 12,677 | 32,735 |
TestFlight | 22 | 37 | 5,566 | 8,683 |
Github | 68 | 373 | 668 | 1,458 |
| | 253 | 380 | 461,264 | 918,729 |
Gmail | 119 | 175 | 33,985 | 93,748 |
Google Drive | 127 | 167 | 26,139 | 56,823 |
Total | 1,615 | 4,300 | 1,180,929 | 2,526,283 |
All the data are stored in Excel files, in which:
- App release notes are stored in dataset/*app_name*/*app_name*_Release_Origin.xlsx.
- App user reviews are stored in dataset/*app_name*/*app_name*_Reviews_Origin.xlsx.
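The path pattern above can be turned into a small helper. This is a minimal sketch: the function name `dataset_paths` and the `root` parameter are illustrative assumptions, not part of the repository, and the directory name is assumed to match the app name used in the file names.

```python
from pathlib import Path


def dataset_paths(app_name, root="dataset"):
    """Return (release_notes_path, reviews_path) for one app.

    Hypothetical helper: assumes the layout
    dataset/<app_name>/<app_name>_Release_Origin.xlsx and
    dataset/<app_name>/<app_name>_Reviews_Origin.xlsx described above.
    """
    app_dir = Path(root) / app_name
    release_path = app_dir / f"{app_name}_Release_Origin.xlsx"
    reviews_path = app_dir / f"{app_name}_Reviews_Origin.xlsx"
    return release_path, reviews_path


release_path, reviews_path = dataset_paths("Spotify")
print(release_path.as_posix())
print(reviews_path.as_posix())
```

Either path can then be passed to an Excel reader of your choice.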
- version: update version number
- date: update date*
- release: release note content
* For readers who do not read Chinese: "年" means "year", "月" means "month", and "日" means "day". To process the data, you can use the following Python code to convert these dates to a hyphen-separated format.
date = "2019年5月26日"
# Replace the year and month markers with hyphens and drop the trailing day marker
date = date.replace('年', '-').replace('月', '-').replace('日', '')
# expected output: date = "2019-5-26"
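If a real date object is more convenient than a string, the standard library can parse this format directly. A minimal sketch, assuming every value follows the year-month-day pattern shown above and is stored as text; the function name `parse_release_date` is ours, for illustration only.

```python
from datetime import date, datetime


def parse_release_date(text):
    """Parse a date such as "2019年5月26日" into a datetime.date."""
    # The Chinese characters act as literal separators in the format string.
    return datetime.strptime(text.strip(), "%Y年%m月%d日").date()


parsed = parse_release_date("2019年5月26日")
print(parsed.isoformat())  # 2019-05-26
```

Parsed dates sort and compare correctly, which makes it easy to restrict the data to a time window.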
- date: posting date, including the exact time (year-month-day hour:minute:second)
- rating: ratings given by users (from 1 to 5)
- title: user review title
- content: user review content*
* For readers who do not read Chinese: "该条评论已经被删除" means "this review has been deleted".
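Deleted reviews carry no usable text, so it is often convenient to drop them before analysis. The sketch below uses a small in-memory DataFrame instead of the real files; the sample rows and the name `DELETED_MARKER` are made up for illustration, and it assumes the placeholder fills the whole `content` cell.

```python
import pandas as pd

DELETED_MARKER = "该条评论已经被删除"  # "this review has been deleted"

# Toy data standing in for a *_Reviews_Origin.xlsx file (not real reviews).
df = pd.DataFrame({
    "rating": [5, 1, 3],
    "title": ["Great", "Crash", "Okay"],
    "content": ["Love the new playlist view", DELETED_MARKER, "Works fine"],
})

# Keep only rows whose content is not the deletion placeholder.
kept = df[df["content"].str.strip() != DELETED_MARKER].reset_index(drop=True)
print(len(kept))
```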
Python provides good libraries for reading Excel files; the following example reads the data with pandas.
import pandas as pd

# Reading .xlsx files with pandas requires the openpyxl package
df = pd.read_excel("../Spotify/Spotify_Reviews_Origin.xlsx")
# dropna() and reset_index() return new DataFrames, so assign the result
df = df.dropna().reset_index(drop=True)
# If you want the user reviews that combine title and content
reviews = df['title'] + " " + df['content']
# If you only want the user review content rated 1 by users
content_rating1 = df.loc[df['rating'] == 1, 'content']
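Note that `dropna()` discards every review that has an empty cell. If reviews without a title should be kept, one alternative is to fill missing titles with an empty string before concatenating. A minimal sketch on toy data (the rows are invented for illustration):

```python
import pandas as pd

# Toy data standing in for the real review file.
df = pd.DataFrame({
    "rating": [1, 4],
    "title": [None, "Nice update"],
    "content": ["App crashes on launch", "Dark mode finally works"],
})

# Missing titles become empty strings so the review content is not lost;
# strip() removes the leading space left by an empty title.
reviews = (df["title"].fillna("") + " " + df["content"]).str.strip()
print(reviews.tolist())
```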