namsor / namsor-tools-v2

NamSor command line tools, to append gender, origin, diaspora or us 'race'/ethnicity to a CSV file.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

namsor-tools-v2

NamSor command line tools, to append gender, origin, diaspora or us 'race'/ethnicity to a CSV file. The CSV file should in UTF-8 encoding, pipe-| demimited. It can be very large. Check out also the Python CLT (https://github.com/namsor/namsor-python-tools-v2)

Installation

Please install Maven first, then build the executable with command

mvn clean package

NB: we use Unix conventions for file paths, ex. samples/some_fnln.txt but on MS Windows that would be samples\some_fnln.txt

Usage

java -jar target/NamSorToolsV2-0.26-SNAPSHOT.jar
usage: NamSorTools -apiKey <apiKey> [-basePath <basePath>] [-countryIso2
       <countryIso2>] [-digest] [-e <encoding>] -f <inputDataFormat> [-h]
       [-header] -i <inputFile> [-o <outputFile>] [-r] [-religionoption]
       [-s] -service <service> [-uid] [-usraceethnicityoption
       <usraceethnicityoption>] [-w]

       
 -apiKey,--apiKey <apiKey>
 NamSor API Key
 -basePath,--basePath <basePath>
 Base Path, ex. https://v2.namsor.com/NamSorAPIv2
 -countryIso2,--countryIso2 <countryIso2>
 countryIso2 default
 -digest,--digest
 SHA-256 digest names in output
 -e,--encoding <encoding>
 encoding : UTF-8 by default
 -f,--inputDataFormat <inputDataFormat>
 input data format : first name, last name (fnln) / first name, last name,
 geo country iso2 (fnlngeo) / full name (name) / full name, geo country
 iso2 (namegeo) / full name, geo country iso2, country subdivision
 (namegeosub)
 -h,--help
 get help
 -header,--header
 output header
 -i,--inputFile <inputFile>
 input file name
 -o,--outputFile <outputFile>
 output file name
 -r,--recover
 continue from a job (requires uid)
 -religionoption,--religionoption
 extra religion stats option X-OPTION-RELIGION-STATS for country / origin
 / diaspora
 -s,--skip
 skip errors
 -service,--endpoint <service>
 service : parse / gender / origin / country / diaspora / usraceethnicity
 / phoneCode / subclassification / religion / castegroup
 -uid,--uid
 input data has an ID prefix
 -usraceethnicityoption,--usraceethnicityoption <usraceethnicityoption>
 extra usraceethnicity option USRACEETHNICITY-4CLASSES
 USRACEETHNICITY-4CLASSES-CLASSIC USRACEETHNICITY-6CLASSES
 -w,--overwrite
 overwrite existing output file

Examples

To append the likely name gender to a list of first and last names : John|Smith

java -jar target/NamSorToolsV2-0.26-SNAPSHOT.jar -apiKey <yourAPIKey> -w -header -f fnln -i samples/some_fnln.txt -service gender

To append the likely name origin to a list of first and last names : John|Smith

java -jar target/NamSorToolsV2-0.26-SNAPSHOT.jar -apiKey <yourAPIKey> -w -header -f fnln -i samples/some_fnln.txt -service origin

To append the likely country of residence to a list of full names : John Smith

java -jar target/NamSorToolsV2-0.26-SNAPSHOT.jar -apiKey <yourAPIKey> -w -header -f name -i samples/some_name.txt -service country

To parse names into first and last name components (John Smith or Smith, John -> John|Smith)

java -jar target/NamSorToolsV2-0.26-SNAPSHOT.jar -apiKey <yourAPIKey> -w -header -f name -i samples/some_name.txt -service parse

The recommended input format is to specify a unique ID and a geographic context (if known) as a countryIso2 code.

To append gender to a list of id, first and last names, geographic context : id12|John|Smith|US

java -jar target/NamSorToolsV2-0.26-SNAPSHOT.jar -apiKey <yourAPIKey> -w -header -uid -f fnlngeo -i samples/some_idfnlngeo.txt -service gender

To append the ethnicity (in the sense of cultural heritage / country of origin of ascendents) from a list of id, first and last names, geographic context : id12|John|Smith|US

java -jar target/NamSorToolsV2-0.26-SNAPSHOT.jar -apiKey <yourAPIKey> -w -header -uid -f fnlngeo -i samples/some_idfnlngeo.txt -service diaspora

To parse name into first and last name components, a geographic context is recommended (esp. for Latam names) : id12|John Smith|US

java -jar target/NamSorToolsV2-0.26-SNAPSHOT.jar -apiKey <yourAPIKey> -w -header -uid -f namegeo -i samples/some_idnamegeo.txt -service parse

On large input files with a unique ID, it is possible to recover from where the process crashed and append to the existint output file, for example :

java -jar target/NamSorToolsV2-0.26-SNAPSHOT.jar -apiKey <yourAPIKey> -r -header -uid -f fnlngeo -i samples/some_idfnlngeo.txt -service gender

For Indian names (for now), you can infer the likely india state or union territory (ie. a subdivision of the country as per ISO 3166-2:IN)

java -jar target/NamSorToolsV2-0.26-SNAPSHOT.jar -apiKey <yourAPIKey>  -r -header -uid -f fnlngeo -i samples/some_indian_idfnlngeo.txt -service subdivision

For Indian names (for now), you can infer the likely religion (provided the IN country code and state/union territory as per ISO 3166-2:IN)

java -jar target/NamSorToolsV2-0.26-SNAPSHOT.jar -apiKey <yourAPIKey>  -r -header -uid -f namegeosub -i samples/some_indian_idnamegeosub.txt -service religion

For Indian names (for now), you can infer the likely caste group (provided the IN country code and state/union territory as per ISO 3166-2:IN)

java -jar target/NamSorToolsV2-0.26-SNAPSHOT.jar -apiKey <yourAPIKey>  -r -header -uid -f namegeosub -i samples/some_indian_idnamegeosub.txt -service castegroup

Anonymizing output data

The -digest option will digest personal names in file outpus, using a non reversible MD-5 hash. For example, John Smith will become 6117323d2cabbc17d44c2b44587f682c. Please note that this doesn't apply to the PARSE output.

Understanding output

Please read and contribute to the WIKI https://github.com/namsor/namsor-tools-v2/wiki/NamSor-Tools-V2

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

About

NamSor command line tools, to append gender, origin, diaspora or us 'race'/ethnicity to a CSV file.

License:GNU Lesser General Public License v3.0


Languages

Language:Java 100.0%