jhu-library-applications / dspace-data-collection-1

Scripts for extracting DSpace data

dspace-data-collection

Note: These scripts were updated in 05/2018 for the new authentication method used by DSpace 6.x

All of these scripts require a secret.py file in the same directory that must contain the following text:

        baseURL = 'https://dspace.myuni.edu'
        email = 'dspace_user@myuni.edu'
        password = 'my_dspace_password'
        filePath = '/Users/dspace_user/dspace-data-collection/data/'
        handlePrefix = 'http://dspace.myuni.edu/handle/'
        # A list of the 'uuid' of any collections that you wish the script to skip.
        skippedCollections = ['45794375-6640-4efe-848e-082e60bae375']

The 'filePath' is the directory into which output files will be written. The 'handlePrefix' may or may not differ from your DSpace URL, depending on your configuration. The secret.py file is listed in the repository's .gitignore file, so DSpace login details will not be inadvertently exposed through GitHub.

If you are using both a development server and a production server, you can create a separate secret file with a different name (e.g. secretProd.py) that contains the production server information. When running each of these scripts, you will be prompted to enter the file name (e.g. 'secretProd', without '.py') of an alternate secret file. If you skip the prompt or mistype the file name, the scripts will default to the information in the secret.py file. This ensures that you will only access the production server if you really intend to.
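The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function name `load_secret` is hypothetical, and it assumes the secret files are importable Python modules in the working directory.

```python
import importlib


def load_secret(name, default="secret"):
    """Return the named settings module, falling back to the default one.

    `load_secret` is a hypothetical helper used for illustration only.
    """
    name = (name or "").strip()
    if name:
        try:
            return importlib.import_module(name)
        except ImportError:
            # A mistyped or missing file name falls through to the default.
            print("Alternate secret file not found: " + name)
    return importlib.import_module(default)
```

A script could then call something like `secret = load_secret(input("Enter the name of the secret file: "))` and read `secret.baseURL`, `secret.email`, and so on. Because an empty or mistyped answer resolves to secret.py, the production settings are only used when they are named explicitly.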

Based on user input, extracts the values of two specified keys from a specified community to a CSV file for comparison.

Based on mjanowiecki's findInitialedNamesByCollection.py, finds values in name fields that appear to have first initials that could be expanded to full names, and provides a count for each collection when the count is more than zero.

Based on a CSV of item handles, extracts all metadata (except 'dc.description.provenance' values) from the selected items to a CSV file.

Extracts the item ID and the value of the key 'dc.identifier.uri' to a CSV file when the value does not begin with the handlePrefix specified in the secret.py file.
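The prefix check described above might look something like the sketch below. It assumes that item metadata has already been fetched into a dictionary mapping item IDs to lists of `{'key': ..., 'value': ...}` entries. That structure and the function name are assumptions made for illustration, not taken from the script.

```python
def find_nonconforming_uris(items, handle_prefix):
    """Return (item_id, uri) pairs whose dc.identifier.uri lacks the prefix.

    `items` is assumed to map item IDs to lists of metadata entries,
    each a dict with 'key' and 'value'.
    """
    rows = []
    for item_id, metadata in items.items():
        for entry in metadata:
            if entry["key"] != "dc.identifier.uri":
                continue
            if not entry["value"].startswith(handle_prefix):
                rows.append((item_id, entry["value"]))
    return rows
```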

Based on user input, extracts item IDs to a CSV file where there are multiple instances of the specified key in the item metadata.
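As a rough illustration of the repeated-key check, the sketch below counts how often a key occurs in each item's metadata. The input structure (item IDs mapped to lists of `{'key': ..., 'value': ...}` entries) and the function name are assumptions, not the script's own.

```python
def items_with_repeated_key(items, key):
    """Return the IDs of items in which `key` occurs more than once.

    `items` is assumed to map item IDs to lists of metadata entries.
    """
    repeated = []
    for item_id, metadata in items.items():
        occurrences = sum(1 for entry in metadata if entry["key"] == key)
        if occurrences > 1:
            repeated.append(item_id)
    return repeated
```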

Based on user input, extracts all of the item metadata from the specified collection to a JSON file.

Creates a 'completeValueLists' folder and, for each key used in the repository, writes a CSV file of all values for that key along with the item IDs. It also creates a 'uniqueValueLists' folder containing a CSV file for each key that lists every unique value and a count of how many times the value appears.

Creates a 'completeValueLists' folder and, for each key used in the specified community, writes a CSV file of all values for that key along with the item IDs. It also creates a 'uniqueValueLists' folder containing a CSV file for each key that lists every unique value and a count of how many times the value appears.
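The unique-value counting that both of these scripts perform could be sketched as below. The input structure (item IDs mapped to lists of `{'key': ..., 'value': ...}` entries), the function names, and the `value,count` column headers are all assumptions for illustration.

```python
import csv
from collections import Counter, defaultdict


def unique_value_counts(items):
    """Return {key: Counter of values} across all items."""
    counts = defaultdict(Counter)
    for metadata in items.values():
        for entry in metadata:
            counts[entry["key"]][entry["value"]] += 1
    return counts


def write_unique_values(counter, out_file):
    """Write one row per unique value, most frequent first."""
    writer = csv.writer(out_file)
    writer.writerow(["value", "count"])
    for value, count in counter.most_common():
        writer.writerow([value, count])
```

In practice each key's counter would be written to its own file in the 'uniqueValueLists' folder, opened with `newline=''` as the `csv` module recommends.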

Based on user input, extracts all values from 'dc.contributor.advisor' and 'dc.contributor.committeeMember' fields from items in collections in the specified community.

Extracts all unique language values used by metadata entries in the repository to a CSV file.

Based on user input, extracts all the handles and bitstreams associated with the items in the specified collection to a CSV file.

Extracts all unique pairs of keys and language values used by metadata entries in the repository to a CSV file.

This script finds names with initials in DSpace collections based on regular expression matches and prints the results to a CSV. In particular, it searches for names where the first name is an initial and has not been expanded. It ignores most instances of names where the initial is a middle initial.
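To make the idea concrete, here is one possible pattern for names stored as 'Last, First Middle'. It is an illustrative assumption only, not the regular expression the script actually uses, and the function name is hypothetical. It flags a single capital letter immediately after the comma and skips names where only the middle name is an initial.

```python
import re

# Hypothetical pattern: a comma, optional whitespace, one capital letter,
# an optional period, and then no further word character or apostrophe.
INITIAL_FIRST_NAME = re.compile(r",\s*[A-Z]\.?(?![\w'])")


def has_initial_first_name(name):
    """Return True if the first name looks like an unexpanded initial."""
    return INITIAL_FIRST_NAME.search(name) is not None
```

Under this pattern, 'Smith, J.' and 'Smith, J.R.' are flagged, while 'Smith, John R.' is not, because the letter after the comma is followed by more letters.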

This script prints all the dcElements being used in a specific DSpace collection.

This script produces a CSV file with the metadata for items from a specific DSpace collection that have a certain key/value pair.

Based on user input, extracts the ID and URI for all items in the repository with the specified key, as well as the value of the specified key, to a CSV file.

Based on user input, extracts the ID and URI for all items in the specified collection with the specified key, as well as the value of the specified key, to a CSV file.

Based on user input, extracts the ID and URI for all items in the repository with the specified key-value pair to a CSV file.

Based on user input, extracts the IDs of items from a specified community that do not have the specified key.

Creates a matrix containing a count of each time a key appears in each collection in the repository.
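One way to build such a matrix is sketched below, under the assumption that the metadata has already been gathered into a dictionary mapping each collection name to a list of items, where each item is a list of `{'key': ..., 'value': ...}` entries. The function name and input layout are hypothetical.

```python
from collections import Counter


def key_count_matrix(collections):
    """Return (header, rows) with one row per key and one column per collection."""
    counts = {
        name: Counter(entry["key"] for metadata in items for entry in metadata)
        for name, items in collections.items()
    }
    keys = sorted({key for counter in counts.values() for key in counter})
    header = ["key"] + list(counts)
    # A Counter returns 0 for keys a collection never uses.
    rows = [[key] + [counts[name][key] for name in counts] for key in keys]
    return header, rows
```

The header and rows can be passed directly to `csv.writer` to produce the output file.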

Produces several CSV files containing different information about the structure and metadata of the repository:

| File Name | Description |
| --- | --- |
| collectionMetadataKeys.csv | A list of all keys used in each collection, with the collection name, ID, and handle. |
| dspaceIDs.csv | A list of every item ID, along with the IDs of the collection and community that contain that item. |
| dspaceTypes.csv | A list of all unique values for the key 'dc.type'. |
| keyCount.csv | A list of all unique keys used in the repository, with a count of how many times each appears. |
| collectionStats.csv | A list of all collections in the repository, with the collection name, ID, handle, and number of items. |
