alexrutherford / name_gender_scraping

Notebook to scrape Indian names and genders

Summary

Quick notebook to parse names and genders from Behind the Name using Beautiful Soup.

Passes foreign names to Google Translate via TextBlob, and stores the translation along with the language detected by langid.

Borrows processed names data for the UK and US from OpenGenderTracker's GitHub repo.

Borrows processed names for Argentina/Uruguay from a GitHub repo.

Many resources from this blog post

DB Schema

The SQL DB stores each name, the number of male and female occurrences, a flag marking whether the name can be unisex, and country, region and language hints where available.

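| name | male | female | unisex | country | region | lang | lang_detected | name_eng |
|------|------|--------|--------|---------|--------|------|---------------|----------|
| احمد | 99999 | 0 | 0 | PK | asia | ur | ar | Ahmed |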
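
As a rough illustration of the schema above, the following sketch creates an equivalent table and inserts the example row. The column names come from the schema; the column types, the table name `names`, the helper names `create_db` and `insert_name`, and the use of SQLite are all assumptions made for the example, since the README does not say which SQL engine or types the notebook uses.

```python
import sqlite3

# Column names follow the README's schema. The types, the table name
# "names", and the choice of SQLite are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS names (
    name          TEXT NOT NULL,
    male          INTEGER NOT NULL DEFAULT 0,  -- male occurrences
    female        INTEGER NOT NULL DEFAULT 0,  -- female occurrences
    unisex        INTEGER NOT NULL DEFAULT 0,  -- 1 if the name can be unisex
    country       TEXT,
    region        TEXT,
    lang          TEXT,                        -- language hint
    lang_detected TEXT,                        -- language detected from the name
    name_eng      TEXT                         -- English translation
)
"""


def create_db(path=":memory:"):
    """Open a connection (in-memory by default) and create the table."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def insert_name(conn, row):
    """Insert one 9-field row, in the column order of the schema."""
    conn.execute("INSERT INTO names VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", row)
    conn.commit()


conn = create_db()
insert_name(conn, ("احمد", 99999, 0, 0, "PK", "asia", "ur", "ar", "Ahmed"))
example = conn.execute(
    "SELECT name_eng, male, female FROM names WHERE country = ?", ("PK",)
).fetchone()
```

Parameterised queries (`?` placeholders) are used so that names containing quotes or non-Latin characters are passed through safely.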

Additional Datasources

  1. Indian Hindi Baby Names

TODO

  1. Add in Wilson binomial correction
  2. Add in URL decomposition using urlparse
  3. Create fresh DB connections following this recipe to prevent timeout
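
The first two TODO items could look roughly like the sketch below. It is not the notebook's code: `wilson_interval` and `decompose_url` are hypothetical helper names, the 95% level (`z=1.96`) is an assumed default, and in Python 3 `urlparse` is imported from `urllib.parse`. The Wilson score interval gives a range for the true share of, say, female bearers of a name that stays sensible when a name has only a handful of occurrences.

```python
import math
from urllib.parse import urlparse


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes / total is the observed proportion (for example female
    occurrences over all occurrences of a name); z=1.96 corresponds to
    an assumed 95% confidence level.
    """
    if total == 0:
        # No observations: the proportion is unconstrained.
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return (max(0.0, centre - half), min(1.0, centre + half))


def decompose_url(url):
    """Split a URL into its main components using urlparse."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        "query": parts.query,
    }


low, high = wilson_interval(10, 10)
url_parts = decompose_url("https://example.com/names/list?page=2")
```

With 10 female occurrences out of 10, the raw proportion is 1.0, but the lower Wilson bound is noticeably below 1, which reflects how little evidence ten observations provide.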

Languages

HTML 95.7%, PHP 2.8%, JavaScript 1.2%, Python 0.3%