Leolty / RoseMatcher

RoseMatcher

RoseMatcher is an approach that automatically matches colloquially written user reviews with technically written release notes and identifies the relevant matching pairs. It addresses the language gap between the two writing styles and greatly improves the hit ratio of the matching pairs.

This repository provides public access to the raw data of RoseMatcher.

Data Source

App             | Category        | App Release Time | First Update Time | Ranking Within Category
#Reddit         | News            | 2016/04/07       | 2016/04/20        | 2
#Spotify        | Music           | 2011/07/04       | 2015/05/20        | 1
#Pandora        | Music           | 2013/09/18       | 2014/06/12        | 2
ZOOM            | Business        | 2012/08/15       | 2013/12/18        | 1
Microsoft Teams | Business        | 2016/11/02       | 2016/11/08        | 2
#SHEIN          | Shopping        | 2014/05/20       | 2014/07/21        | 2
Google Chrome   | Utilities       | 2012/06/28       | 2014/10/27        | 2
TestFlight      | Developer Tools | 2014/07/23       | 2014/07/26        | 1
Github          | Developer Tools | 2020/03/17       | 2020/03/20        | 2
#Instagram      | Photo & Video   | 2010/10/06       | 2015/09/15        | 2
Gmail           | Productivity    | 2011/11/02       | 2012/07/31        | 1
Google Drive    | Productivity    | 2012/06/28       | 2014/01/28        | 2

* All data is sourced from the Apple App Store.

* Apps whose names start with # are the evaluation data used in our paper (same below).

Collection Method

  • Data collection was performed in February 2022.
  • All user reviews and release notes of the apps for the 5-year period from January 1st, 2017 to January 1st, 2022 were crawled.

Dataset Composition

App Name        | Release Num | Release Sentence Num | Review Num | Review Sentence Num
#Reddit         | 233         | 450                  | 62,598     | 129,545
#Spotify        | 182         | 506                  | 368,243    | 808,441
#Pandora        | 115         | 235                  | 107,241    | 241,497
ZOOM            | 104         | 878                  | 47,799     | 95,528
Microsoft Teams | 154         | 379                  | 9,233      | 21,567
#SHEIN          | 161         | 361                  | 45,516     | 117,529
Google Chrome   | 77          | 359                  | 12,677     | 32,735
TestFlight      | 22          | 37                   | 5,566      | 8,683
Github          | 68          | 373                  | 668        | 1,458
#Instagram      | 253         | 380                  | 461,264    | 918,729
Gmail           | 119         | 175                  | 33,985     | 93,748
Google Drive    | 127         | 167                  | 26,139     | 56,823
Total           | 1,615       | 4,300                | 1,180,929  | 2,526,283

Data Storage and Attributes

Data Storage

All data are stored in Excel files:

  • App release notes are stored in dataset/*app_name*/*app_name*_Release_Origin.xlsx.
  • App user reviews are stored in dataset/*app_name*/*app_name*_Reviews_Origin.xlsx.
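
The naming pattern above can be turned into a small path helper. This is only a sketch: the function name `dataset_paths` and the `root` parameter are illustrative and not part of the repository, and it assumes the folder name is the app name without the leading #, as in the Spotify reading example.

```python
from pathlib import Path


def dataset_paths(app_name: str, root: str = "dataset") -> dict:
    """Build the expected Excel file paths for one app (hypothetical helper)."""
    app_dir = Path(root) / app_name
    return {
        "release": app_dir / f"{app_name}_Release_Origin.xlsx",
        "reviews": app_dir / f"{app_name}_Reviews_Origin.xlsx",
    }


paths = dataset_paths("Spotify")
print(paths["release"].as_posix())
print(paths["reviews"].as_posix())
```

Adjust `root` to wherever the dataset folder sits relative to your script.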

Release Note Dataset Attributes

  • version: update version number
  • date: update date*
  • release: release note content

* For readers who do not read Chinese: "年" means "year", "月" means "month", and "日" means "day". If you need to process the data, you can use the following Python code to convert such a date into a hyphen-separated form.

date = "2019年5月26日"

# Replace the year and month markers with hyphens and drop the trailing day marker
date = date.replace('年', '-').replace('月', '-').replace('日', '')

# expected output: date = "2019-5-26"
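
If you need real date objects rather than strings (for sorting or comparing with review dates), the same format can be parsed directly. The following is a minimal sketch: `parse_release_date` is a hypothetical helper, not part of the repository, and it assumes every value in the date column follows the year/month/day pattern shown above.

```python
import re
from datetime import date

# Matches dates such as "2019年5月26日" (month and day may have one or two digits)
_RELEASE_DATE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def parse_release_date(text: str) -> date:
    """Convert a Chinese-formatted release date string into a datetime.date."""
    match = _RELEASE_DATE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unexpected release date format: {text!r}")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


print(parse_release_date("2019年5月26日").isoformat())  # 2019-05-26
```

Raising an error on unexpected input makes it easier to spot rows that do not follow the assumed pattern.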

User Review Dataset Attributes

  • date: posting date, including the exact time (year-month-day hour:minute:second)
  • rating: ratings given by users (from 1 to 5)
  • title: user review title
  • content: user review content*

* For readers who do not read Chinese: "该条评论已经被删除" means "this review has been deleted".
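
Deleted reviews carry no usable text, so you may want to skip them before analysis. The sketch below assumes the placeholder makes up the entire content cell. `is_deleted_review` is a hypothetical helper, and the sample strings are made up for illustration.

```python
# Placeholder text that marks a deleted review in the content column
DELETED_MARKER = "该条评论已经被删除"


def is_deleted_review(content) -> bool:
    """Return True when a content cell holds only the deleted-review placeholder."""
    # Non-string cells (for example missing values) are not treated as deleted
    return isinstance(content, str) and content.strip() == DELETED_MARKER


samples = ["Love the new playlists", DELETED_MARKER, None]  # made-up sample cells
kept = [text for text in samples if not is_deleted_review(text)]
print(kept)
```

With a pandas DataFrame `df` loaded from a reviews file, the same check could be applied as `df[~df["content"].map(is_deleted_review)]`.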

How to read data

Python provides good APIs for reading Excel files. The following example shows one way to read the data.

import pandas as pd

# Reading .xlsx files with pandas requires the openpyxl package
df = pd.read_excel("../Spotify/Spotify_Reviews_Origin.xlsx")

# dropna() and reset_index() return new DataFrames, so assign the result back
df = df.dropna().reset_index(drop=True)

# If you want the user reviews that combine title and content
reviews = df['title'] + " " + df['content']

# If you only want the user review content rated 1 by users
content_rating1 = df.loc[df['rating'] == 1, 'content']
