This is a Jupyter notebook working through data analysis for DOMM's subject conversion project. It mainly uses pandas
and some bash commands, with the desired output being a set of files that will serve as input to the subject conversion process code.
The following instructions are taken from the Subject Conversion Google Doc:

- Download `combined_subject_conversion_file.xslx` from the dams-metadata repo
- Remove all rows from the dataset containing `complexSubject` types
- De-dupe (i.e. remove all instances of rows after the first occurrence) all rows based on unique values of the `updated_label` column
- Create a new column, `external_authority_URI`, that contains only one URI: first filter out/ignore all "local" type values, then fill in values taken mostly from the `clustering id` column (TODO: confirm this is reliable logic)
- Create a new column, `other_URI`, that concatenates (de-duped against `clustering id` values) all the "URI" column values (`FAST_URI`, `LoC URI`, `AAT_URI`, `VIAF_URI`), separated by `|`
- Create a new column, `variant_label`, that contains unique values from `old_label` (de-duped against `updated_label` column values). TODO: investigate the level of effort to concatenate all reconciled labels, de-duped against both `old_label` and `updated_label` column values
- Add a key to the main data table. This is currently being hashed out, but if this key column is added to the original `combined_subject_conversion_file.xslx`, the key values can simply be preserved
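The steps above can be sketched in pandas roughly as follows. This is a minimal sketch, not the project's actual code: it uses a toy DataFrame in place of `combined_subject_conversion_file.xslx`, assumes a `type` column holds the subject type, assumes "local" values appear literally in `clustering id`, and only includes two of the four URI columns for brevity.

```python
import pandas as pd

# Toy stand-in for combined_subject_conversion_file.xslx (column names assumed
# from the instructions; in the notebook this would be pd.read_excel(...))
df = pd.DataFrame({
    "type": ["topic", "complexSubject", "topic", "topic"],
    "updated_label": ["Birds", "Birds--History", "Birds", "Fish"],
    "old_label": ["birds", "birds history", "Aves", "fishes"],
    "clustering id": ["http://id.worldcat.org/fast/1", "local",
                      "http://id.worldcat.org/fast/1", "local"],
    "FAST_URI": ["http://id.worldcat.org/fast/1", None, None,
                 "http://id.worldcat.org/fast/2"],
    "LoC URI": [None, None, "http://id.loc.gov/x", None],
})

# 1. Remove all complexSubject rows
df = df[df["type"] != "complexSubject"]

# 2. De-dupe on updated_label, keeping the first occurrence
df = df.drop_duplicates(subset="updated_label", keep="first")

# 3. external_authority_URI: take clustering id unless it is a "local" value
df["external_authority_URI"] = df["clustering id"].where(df["clustering id"] != "local")

# 4. other_URI: join the URI columns with "|", skipping blanks and values
#    already covered by clustering id, de-duping while preserving order
uri_cols = ["FAST_URI", "LoC URI"]  # the real file also has AAT_URI, VIAF_URI

def join_uris(row):
    uris = [row[c] for c in uri_cols
            if pd.notna(row[c]) and row[c] != row["clustering id"]]
    return "|".join(dict.fromkeys(uris))

df["other_URI"] = df.apply(join_uris, axis=1)

# 5. variant_label: old_label values not already present anywhere in updated_label
updated = set(df["updated_label"])
df["variant_label"] = df["old_label"].where(~df["old_label"].isin(updated))
```

Note that the order of steps matters: because step 2 drops duplicate `updated_label` rows, the `old_label` values of the dropped rows are lost before `variant_label` is built, which is exactly the concern behind the TODO about concatenating all reconciled labels.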