gogit / globalnamedata

Tools to download and process name data from various sources.

What is this?

Most data on names and gender is ill-suited for any serious analytical purpose. Websites which collect data on birth names mainly offer searches, top ten lists and suggestions for parents. Most available data on the web comes either from commercial sources or from summary data.

We have collected birth record data from the United States and the United Kingdom, covering all births in the two countries across a number of years, and are releasing the collected and cleaned-up data here. We have also generated a simple gender classifier based on incidence of gender by name. You can use this data for any purpose compatible with the license.

And, unlike any other open record for name data, we've provided the scripts necessary to check our work! You don't need to trust us in order to trust the data.

You can read about some uses of this data along with code examples at the Bocoup blog.

Setup

The easiest way to set up Global Name Data is to install it as an R package with devtools. With devtools installed you can install the package directly from GitHub with install_github("OpenGenderTracking/globalnamedata"). Dependencies will be installed automatically.
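Recent devtools releases take the repository as a single "user/repo" string. The snippet below is a minimal sketch of that form, assuming the package is still hosted under the OpenGenderTracking account; the interactive() guard is an addition for illustration so that the install only runs when you are at the console.

```r
# Assumption: the package is hosted at OpenGenderTracking/globalnamedata.
repo <- "OpenGenderTracking/globalnamedata"

# Only attempt the install from an interactive session with devtools present.
if (interactive() && requireNamespace("devtools", quietly = TRUE)) {
  devtools::install_github(repo)
}
```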

Once installed, the package will make available datasets (in compressed .RData format) for each source of name data as well as functions to process and check that data against available records.

Not an R user?

If you're mainly interested in the data, pre- and post-classification name data is available in the assets directory. The installed package does not include these .csv files; it ships the same data as compressed binaries.

Contributing

We love pull requests. While not required, please try to adhere to Google's R Style Guide.

Classification

Currently the Global Name Data project is used to produce gender estimates for byline and content classification in Open Gender Tracker. Each name is associated with a gender through the nameBinom() function using a binomial estimate. The specific method can be passed in as an argument, as can the thresholds for acceptance.
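The idea of a binomial estimate with an acceptance threshold can be illustrated with base R's binom.test(). This is a minimal sketch and not the package's nameBinom() implementation: the helper name classify_name, its arguments, and the 0.5 cut-off are assumptions made for illustration.

```r
# Hypothetical helper: classify a name from its female and male birth counts.
# Uses the exact (Clopper-Pearson) interval returned by stats::binom.test().
classify_name <- function(female, male, conf.level = 0.95) {
  total <- female + male
  if (total == 0) {
    return("Unknown")  # no births recorded, nothing to estimate
  }
  # Confidence interval for the proportion of births recorded as female
  ci <- binom.test(female, total, conf.level = conf.level)$conf.int
  if (ci[1] > 0.5) {
    "Female"
  } else if (ci[2] < 0.5) {
    "Male"
  } else {
    "Unknown"  # interval straddles 0.5, so the name is left unclassified
  }
}

classify_name(990, 10)  # strongly female-associated counts
classify_name(10, 990)  # strongly male-associated counts
classify_name(50, 50)   # evenly split counts
```

Raising conf.level widens the interval, so fewer names clear the cut-off and more are left unclassified.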

The classifier is deliberately decoupled from the import and processing functions to allow for rapid testing and extension. Further, the classifier function itself can operate on individual years (or groups of years) rather than only on the whole dataset.
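Restricting the input to a range of years before classifying can be sketched with base R alone. The data frame below is toy data, and its column names (Name, Year, Female, Male) and the helper counts_for_years are assumptions for illustration, not the package's actual schema or API.

```r
# Toy birth counts; values and column names are illustrative only.
births <- data.frame(
  Name   = c("Leslie", "Leslie", "Leslie", "Mary", "Mary", "Mary"),
  Year   = c(1900, 1950, 2000, 1900, 1950, 2000),
  Female = c(10, 400, 900, 5000, 6000, 3000),
  Male   = c(990, 600, 100, 5, 10, 4)
)

# Hypothetical helper: sum counts per name over the requested years and
# compute the share of births recorded as female.
counts_for_years <- function(data, years) {
  in_range <- data[data$Year %in% years, ]
  totals <- aggregate(cbind(Female, Male) ~ Name, data = in_range, FUN = sum)
  totals$prop.female <- totals$Female / (totals$Female + totals$Male)
  totals
}

early <- counts_for_years(births, 1900:1950)  # group of years
late  <- counts_for_years(births, 2000)       # single year
```

In this toy data the share of births recorded as female for "Leslie" differs sharply between the two periods, which is the kind of shift that classifying by year range makes visible.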

Further improvements down the line:

  • Pruning of name data via actuarial tables. This may be left to end users but it can improve the performance of the classifier marginally.
  • Introduction of other features for gender classification. Phoneme distribution and last-letter choice are informative features of names for a gender classifier. The primary issue is that phoneme and letter patterns are culturally dependent and could introduce noise for non-anglophone names.
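As an illustration of the last-letter feature mentioned above, the final letter of each name can be extracted with base R string functions. The helper name last_letter is hypothetical and is not part of the package.

```r
# Hypothetical feature extractor: the final letter of each name, lower-cased.
last_letter <- function(names) {
  tolower(substring(names, nchar(names)))
}

# Toy input; a real analysis would tabulate this against recorded gender.
last_letter(c("Anna", "John", "Maria"))
```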

Data sources

Currently Global Name Data utilizes four sources, each described below: the United States, England and Wales, Northern Ireland, and Scotland.

Processed data are released under the Open Government Licence or into the public domain, as appropriate. See the LICENSE for details.

United States

The Social Security Administration provides records for name and gender by year for births between 1880 and 2011. In each year, only names with at least 5 births are counted. Data prior to 1937 should be considered suspect and retrospective, as names were only recorded for individuals who received a social security card and birth year was not comprehensively verified. More information can be found here.

United Kingdom

Records for the United Kingdom are broken out across England and Wales, Northern Ireland and Scotland. The Office for National Statistics records births for England and Wales, while births in Northern Ireland and Scotland are recorded separately. In all jurisdictions the minimum number of births per year for each name is 3. Each jurisdiction provides summary data (e.g. top 10 names per year) but we do not download this data or use it in any way.

England and Wales

Full name data is provided between 1996 and 2011. The ONS offers historical summary data for 1904-1994, but these are restricted to the most popular names per year and so are of little analytical value. Information on the data itself can be found here [PDF].

Northern Ireland

Northern Ireland provides full name data between 1997 and 2011. Like the ONS, summary data is offered but does not add much value. Information on NISRA data can be found here.

Scotland

Scotland only provides full name data for 2009 and 2010. Summary data is offered over the past 20 years. General information about birth record data in Scotland is available here.

License

See the LICENSE file for details.
