The goal of LIHKGr is to scrape text data from LIHKG, the Hong Kong version of Reddit, for analysis. LIHKG gained popularity in 2016 and has become a popular research data source in recent years. Because LIHKG is currently protected by Google's reCAPTCHA, this package builds on RSelenium and adopts a semi-manual approach to bypass it.
lihkgr.R contains all the required functions. Please install the following packages: RSelenium, raster, magrittr, and rvest, and then follow this workflow:
- Run the R script to load the packages and define all the functions.
- Set the working directory; temporary files will be stored here.
- Run init_scraper() to initiate the scraper. Specify the range of post ids to scrape.
- Run launch_browser() and solve the reCAPTCHA when needed. The default browser is Chrome (77.0.3865.40); change the browser and version if necessary. This function builds on RSelenium::rsDriver(); see its help file for more information about supported browsers.
- Run start_scraping() to start scraping. The function currently produces the following files: LIHKGr.RData, which saves the workspace; lihkg_df.rds and lihkg_df.csv, which save the dataframe as .rds and .csv respectively; and lihkg_df_postid.txt, which saves the last scraped post id.
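The workflow above can be sketched in R roughly as follows. This is a minimal sketch, not the authors' own script: it assumes lihkgr.R sits in the working directory, and it calls the three functions without arguments because this README does not say how the post id range or the browser version are supplied (check the function definitions in lihkgr.R). The names required_pkgs, is_installed, and missing_pkgs are illustrative only.

```r
# Packages named in the README.
required_pkgs <- c("RSelenium", "raster", "magrittr", "rvest")

# Check which packages are not installed yet, without attaching anything.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message(
    "Missing packages, install with: install.packages(c(",
    paste0('"', missing_pkgs, '"', collapse = ", "),
    "))"
  )
}

# The remaining steps need a person to solve the reCAPTCHA, so they are only
# attempted in an interactive session where lihkgr.R is present.
if (interactive() && length(missing_pkgs) == 0 && file.exists("lihkgr.R")) {
  # setwd("path/to/project")  # hypothetical path; output files are written here
  source("lihkgr.R")  # step 1: load packages and define the functions
  init_scraper()      # step 3: argument names for the id range are not documented here
  launch_browser()    # step 4: solve the reCAPTCHA in the browser window if asked
  start_scraping()    # step 5: writes LIHKGr.RData, lihkg_df.rds, lihkg_df.csv, ...
}
```

The interactive() guard is a design choice of this sketch: running the script with Rscript would otherwise open a browser that nobody is present to operate.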
If the browser crashes, repeat steps 4 and 5 (launch_browser() and start_scraping()). If R crashes, load LIHKGr.RData and then repeat steps 4 and 5.
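A recovery session after an R crash might look like the sketch below. It assumes that LIHKGr.RData, being a saved workspace, also contains the function definitions, and that lihkg_df_postid.txt stores the id as plain text on its first line; neither detail is confirmed in this README. read_last_postid() is a hypothetical helper written for this example and is not part of lihkgr.R.

```r
# Restore the saved workspace if it exists in the working directory.
if (file.exists("LIHKGr.RData")) {
  load("LIHKGr.RData")
}

# Hypothetical helper: return the last scraped post id as a string,
# or NA if the file is missing or empty.
# Assumes the id is stored as plain text on the first line.
read_last_postid <- function(path = "lihkg_df_postid.txt") {
  if (!file.exists(path)) {
    return(NA_character_)
  }
  first_line <- readLines(path, n = 1, warn = FALSE)
  if (length(first_line) == 0) {
    return(NA_character_)
  }
  trimws(first_line)
}

message("Last scraped post id: ", read_last_postid())

# Steps 4 and 5 need a person at the keyboard, so they only run interactively
# and only if the functions were restored with the workspace.
if (interactive() && exists("launch_browser") && exists("start_scraping")) {
  launch_browser()
  start_scraping()
}
```

If the functions are not restored by load(), running lihkgr.R again before steps 4 and 5 should define them, though whether that also resets the scraper's state is not stated here.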
- Create R package using devtools
- Create arguments for specifying file outputs.
- Debug the error when scraping an empty last page
- Debug the error when scraping hidden posts
- Justin Chun-ting Ho
- Nick H. K. Or
Ho, J.C. & Or, N.H.K. (2020). LIHKGr. An application for scraping LIHKG. Source code and releases available at https://github.com/justinchuntingho/LIHKGr.