wendywangwwt / AcademicDatabaseToolkit

An R package for automating article searches & carrying out simple analysis on the search results.


Academic Database Toolkit

This package is a refactored and extended version of the scripts in WebScrapper-for-Academic-Databases, a side project started years ago.

This toolkit is intended to include:

  • A web scraper for academic databases, implemented as a (set of) R/Python function(s). It was initially built to accelerate meta-analysis work where one wants to try various combinations of keywords across a handful of databases. Manually typing the keywords into each database and eyeballing the resulting articles is extremely inefficient and painful, especially when an article already seen in a previous database comes up again in a new one.
  • A set of useful functions to summarize, analyze, and visualize the retrieved search results.

The web scraper automatically

  • generates queries based on the provided keywords
  • parses the result page (XML for PubMed, for example)
  • extracts a group of fields for each article, including
    • title
    • author(s)
    • published year
    • abstract
    • link to the article
    • availability (may not be available for all databases)
    • search term
    • database
  • returns the results as a data frame
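
As a rough illustration, the returned data frame can be inspected with base R. The search_database() call below mirrors the usage examples further down; the column name database is an assumption based on the field list above and may differ in practice.

library(AcademicDatabaseToolkit)

# small search against one database, then inspect the result
df_data <- search_database(c("decisions","decision-making"),
                           database_name='pubmed',
                           limit_per_search=50)
str(df_data)                 # one row per retrieved article
table(df_data$database)      # assumed column name: which database each hit came from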

Supported databases as of 2023-04-03, for ONE or MULTIPLE sets of keywords:

  • PubMed (pubmed)
  • Sage Journals (sage_journal): you may want to use the Article Type filter to include only research articles (there are other types, such as review articles) by passing search_database(...,additional_args=list(ContentItemType='research-article')); see the example below
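
For example, a call restricting Sage Journals results to research articles might look like the following; the keywords and limit are placeholders, while additional_args and ContentItemType come from the note above.

# restrict Sage Journals hits to research articles
df_sage <- search_database(c("decisions","decision-making"),
                           database_name='sage_journal',
                           limit_per_search=300,
                           additional_args=list(ContentItemType='research-article'))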

Supported databases as of 2023-04-04, for ONE set of keywords:

  • ProQuest (proquest): you need to provide a string or a vector of sub-database names to the subdb_proquest parameter to specify which sub-databases to search; if you do not know which ones are available, use the function get_proquest_subdb() (see the example below)
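
A rough sketch for ProQuest, assuming get_proquest_subdb() returns a character vector of sub-database names from which you pick the ones you need; the keywords, the selected sub-databases, and the limit are placeholders.

subdbs <- get_proquest_subdb()           # list the available ProQuest sub-databases
print(subdbs)

df_proquest <- search_database("consumer behavior",
                               database_name='proquest',
                               subdb_proquest=subdbs[1:2],   # placeholder: pick the sub-databases you need
                               limit_per_search=300)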

Databases that need code update as of 2023-04-03:

  • Science Direct (science_direct)

Current development work focuses on implementing automated search for MULTIPLE sets of keywords. This is the scenario where each concept can be described with slightly or largely different words/phrases, and the need is to search for articles that involve at least one keyword from each concept (relationship='or').
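
As an illustration only (not actual package output), the intended behavior is that keywords within a concept are joined by OR while the concepts themselves are combined with AND; the exact query syntax generated for each database may differ.

# for keywords <- list(c("decisions","decision-making"), c("consumer behavior"))
# the conceptual query is roughly:
query <- '("decisions" OR "decision-making") AND ("consumer behavior")'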

How to install

devtools::install_github('wendywangwwt/AcademicDatabaseToolkit')
library(AcademicDatabaseToolkit)

How to use

param: keywords

For ONE set of keywords, use a single keyword string or a vector of multiple keywords:

keywords <- c("decisions","decision-making")

For MULTIPLE sets of keywords, use a list where each element is a vector of one or more keywords that describe the same concept:

keywords <- list(c("decisions","decision-making"),
                 c( "consumer behavior"))

param: database_name

A string of database name:

database_name <- 'pubmed'

or a vector of multiple database names:

database_name <- c('pubmed','sage_journal')

param: limit_per_search

Set a limit to avoid collecting tens of thousands of results, unless that is intended.

df_data <- search_database(keywords,database_name=database_name,limit_per_search=300)

param: relationship

The relationship between keywords, if multiple are provided. Defaults to or, so the above example is equivalent to:

df_data <- search_database(keywords,relationship='or',database_name=database_name,limit_per_search=300)

If you want to concatenate your keywords with an AND relationship for the search, change the value to and.

df_data <- search_database(keywords,relationship='and',database_name=database_name,limit_per_search=300)

param: field

Which field to search. Defaults to abstract (depending on the database, this usually includes the article title & keywords as well). Optionally, you can switch to all to search the full article.

df_data <- search_database(keywords,relationship='and',field='all',database_name=database_name,limit_per_search=300)

param: no_duplicate

Whether to drop duplicated results. Defaults to TRUE. Duplicated results come from searches across keywords (relationship='or') and/or searches across databases. You may want to turn it off if you'd like to better understand which database + search term combinations bring up the same article (see the sketch below); a higher number of duplicates could indicate that the article is more relevant to the topic you intend to look into.

df_data <- search_database(keywords,relationship='or',database_name=database_name,no_duplicate=FALSE,limit_per_search=300)
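
A minimal sketch of spotting duplicates when no_duplicate=FALSE; the column name title is an assumption based on the extracted fields listed above and may differ in the actual package.

dup_counts <- table(df_data$title)     # how often each article title appears
dup_counts[dup_counts > 1]             # articles returned by more than one database/keyword combination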

Change log

2023-04-04:

  • added support for one set of keywords in proquest
  • added tests for proquest

2023-04-03:

  • refactored the code into an R package that can be installed from GitHub, for pubmed & sage journals
  • added tests for pubmed and sage journals
  • updated readme

2022-02-20:

  • refactored the code (not completely) and updated the readme file
  • only pubmed passed the tests, need to check and update the interaction with other databases later
