This is a Jupyter notebook working through data analysis for DOMM's subject conversion project. It mainly uses pandas
and some bash commands, with the desired output being a set of files that will serve as input to the subject conversion process code.
The following instructions are taken from the Subject Conversion Google Doc:

- Download `combined_subject_conversion_file.xslx` from the dams-metadata repo
- Remove all rows from the dataset containing `complexSubject` types
- De-dupe (i.e. remove all instances of rows after the first occurrence) all rows based on unique values of the `updated_label` column
- Create a new column, `external_authority_URI`, that contains only one URI: first filter out/ignore all "local" type values, then fill in values taken mostly from the `clustering id` column (TODO: confirm this is reliable logic)
- Create a new column, `other_URI`, that concatenates (de-duped against `clustering id` values) all the "URI" column values (`FAST_URI`, `LoC URI`, `AAT_URI`, `VIAF_URI`), separated by `|`
- Create a new column, `variant_label`, that contains unique values from `old_label` (de-duped against `updated_label` column values). TODO: investigate the level of effort to concatenate all reconciled labels, de-duped against both `old_label` and `updated_label` column values
- Add a key to the main data table. This is currently being hashed out, but if this key column is added to the original `combined_subject_conversion_file.xslx`, the key values can simply be preserved
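The steps above can be sketched in pandas roughly as follows. This is a minimal sketch, not the project's actual code: it uses a toy DataFrame in place of `combined_subject_conversion_file.xslx`, assumes a `type` column holds the subject type, assumes "local" values appear literally in `clustering id`, and only includes two of the four URI columns for brevity.

```python
import pandas as pd

# Toy stand-in for combined_subject_conversion_file.xslx (column names assumed
# from the instructions; in the notebook this would be pd.read_excel(...))
df = pd.DataFrame({
    "type": ["topic", "complexSubject", "topic", "topic"],
    "updated_label": ["Birds", "Birds--History", "Birds", "Fish"],
    "old_label": ["birds", "birds history", "Aves", "fishes"],
    "clustering id": ["http://id.worldcat.org/fast/1", "local",
                      "http://id.worldcat.org/fast/1", "local"],
    "FAST_URI": ["http://id.worldcat.org/fast/1", None, None,
                 "http://id.worldcat.org/fast/2"],
    "LoC URI": [None, None, "http://id.loc.gov/x", None],
})

# 1. Remove all complexSubject rows
df = df[df["type"] != "complexSubject"]

# 2. De-dupe on updated_label, keeping the first occurrence
df = df.drop_duplicates(subset="updated_label", keep="first")

# 3. external_authority_URI: take clustering id unless it is a "local" value
df["external_authority_URI"] = df["clustering id"].where(df["clustering id"] != "local")

# 4. other_URI: join the URI columns with "|", skipping blanks and values
#    already covered by clustering id, de-duping while preserving order
uri_cols = ["FAST_URI", "LoC URI"]  # the real file also has AAT_URI, VIAF_URI

def join_uris(row):
    uris = [row[c] for c in uri_cols
            if pd.notna(row[c]) and row[c] != row["clustering id"]]
    return "|".join(dict.fromkeys(uris))

df["other_URI"] = df.apply(join_uris, axis=1)

# 5. variant_label: old_label values not already present anywhere in updated_label
updated = set(df["updated_label"])
df["variant_label"] = df["old_label"].where(~df["old_label"].isin(updated))
```

Note that the order of steps matters: because step 2 drops duplicate `updated_label` rows, the `old_label` values of the dropped rows are lost before `variant_label` is built, which is exactly the concern behind the TODO about concatenating all reconciled labels.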