luispfonseca / stata-rcallstringdist

Call R's stringdist package from Stata using rcall

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

RCALLSTRINGDIST: Call R's stringdist package from Stata using rcall


Major updates

  • 0.3.0 11jul2019:
    • adds additional options to clean strings before comparison (ignorecase, ascii, whitespace, punctuation)
    • moved handling out of csv and into .dta files (faster when merging with original data) using R package haven (appears better at handling diacritics than readstata13)
    • small speed improvements
  • 0.2.0 16apr2019:
    • adds several options: matrix (for one and two variables), duplicates, and sortwords
    • (0.2.3) significant increases in speed
  • 0.1.0 15apr2019:
    • first version of the command

Description

This command uses rcall to call R's stringdist. It allows the user to obtain various measures of distances between text strings.

I'd like to thank the authors of both packages:

Alternatives

Install

  1. Install R directly or with RStudio for a graphical interface.
  2. Install this package using the github command by E. F. Haghish. This will also install dependencies automatically.
net install github, from("https://haghish.github.io/github/") replace
github install luispfonseca/stata-rcallstringdist

Dependencies

This Stata package requires R, the stringdist and haven R packages, and the rcall Stata package.

Optional dependencies:

  • Commands from gtools by Mauricio Caceres Bravo are used to speed up the command when available.

If R is installed on your machine, all these dependencies will be automatically installed when following the earlier instrutions. The file dependency.do is executed automatically after installing rcallstringdist package. Make sure R is installed on your machine before you attempt to install these packages on Stata.

Usage

See the help file in Stata for details about each option.

* Comparing two lists of strings
clear
input str30 nameA
"Gates Bill"
"Gates, Bill"
"bill gates"
"William H. Gates III"
end

input str30 nameB
"Bill Gates"
"Bill Gates"
"Bill Gates"
"William Henry Gates III"

compress

** Comparing two variables, row by row
*** default method (osa), default arguments, default generated variable name
rcallstringdist nameA nameB
*** specific variable names
rcallstringdist nameA nameB, gen(osa)
rcallstringdist nameA nameB, method(cosine) q(3) gen(cosine)
*** sometimes it's worth sorting words within each string. 
*** the first row will now be a perfect match
rcallstringdist nameA nameB, gen(osa_sortw) sortwords
*** it can also be worth cleaning up the strings before feeding them 
****(e.g. lowercase, remove punctuation and diacritics)
gen nameAclean = lower(nameA)
gen nameBclean = lower(nameB)
rcallstringdist nameAclean nameBclean, gen(osa_clean)
rcallstringdist nameAclean nameBclean, gen(osa_clean_sortw) sortwords

** Comparing two variables, all possible combinations
*** by calling the matrix option, we can compare all possible combinations 
*** of strings from one variable with the other variable
*** be aware: this option will clear your current working dataset from memory
*** see the following example
clear
input str30 nameA
"Gates Bill"
"Gates, Bill"
"bill gates"
"William H. Gates III"
"Bill Gates"
"Bill Gates"
end

input str30 nameB
"Bill Gates"
"William Henry Gates III"
"Bill Gates"
end

compress

save example_dataset

*** each string of nameA will be compared with each string of nameB
*** nameA has 5 unique strings, while name B has 2
*** 10 pairs will be compared
rcallstringdist nameA nameB, matrix

* Comparing one list of strings with itself, all possible combinations
*** if only one variable is passed, compare all pairs of strings within
*** we have 5 unique strings, 5x4/2=10 combinations
use example_dataset, clear
rcallstringdist nameA, matrix
*** to keep all permutations (5x4=20), we can use the keepduplicates option
use example_dataset, clear
rcallstringdist nameA, matrix keepduplicates

Author

Luís Fonseca
London Business School
lfonseca london edu
https://luispfonseca.com

About

Call R's stringdist package from Stata using rcall

License:MIT License


Languages

Language:Stata 99.6%Language:TeX 0.4%