Project @HPI
Location: settings
directory
To run the application you need to specify a YAML configuration file. This file contains some settings and the datasets to be used. It has the following structure:
-
Global Settings:
Name Required Description Accepted Values Default directory
β Directory in repo root where data is or will be stored. File path string, e.g. data
default_phonetic_method
β Default method for phonetic matching. soundex
,nysiis
,metaphone
,match_rating
default_pair_method
β Default method for creating candidate matches. full
,block
,sortedneighbourhood
,random
sortedneighbourhood
number_indexing_keys
β Number of key columns to be used for indexing in case multi-key-indexing is needed. The ones with the highest entropies will be chosen. Integer, e.g. 2
1
default_similarity_measures: string
β Default similarity measure for string candidate matches. jaro
,jarowinkler
,levenshtein
,damerau_levenshtein
,qgram
,cosine
,smith_waterman
,lcs
levenshtein
default_similarity_measures: numeric
β Default similarity threshold for numeric candidate matches. step
,linear
,exp
,gauss
,squared
linear
-
Datasets: An array of datasets, each including:
Name Required Description Accepted Values Default id
β Unique identifier for the dataset. E.g. 'freedb_cds', 'hpi_cora', 'bikes' phonetic_method
β Method used for phonetic matching. If not specified, the default method from the global settings will be applied. soundex
,nysiis
,metaphone
,match_rating
pair_method
β Method used for creating candidate matches. If not specified, the default method from the global settings will be applied. full
,block
,sortedneighbourhood
,random
sortedneighbourhood
number_indexing_keys
β Number of key columns to be used for indexing in case multi-key-indexing is needed. The ones with the highest entropies will be chosen. Integer, e.g. 2
1
tables
β URL(s) or filename(s) of the dataset tables in csv format lying in the directory
specified in the global settings. Max number of tables: 2; min: 1E.g. 'freedb_cds.csv' or 'http://pages.cs.wisc.edu/~anhai/data/784_data/bikes/csv_files/bikedekho.csv' gold_standard
β URL or filename of the gold standard for the dataset in csv format lying in the directory
specified in the global settings.E.g. 'hpi_cora_hpi_cora_goldstandard.csv' or 'http://pages.cs.wisc.edu/~anhai/data/784_data/bikes/csv_files/labeled_data.csv' candidate_set
β URL or filename to the candidate set for the dataset in csv format lying in the directory
specified in the global settings .E.g. 'bikes_candset.csv' or 'http://pages.cs.wisc.edu/~anhai/data/784_data/bikes/csv_files/candset.csv' similarity_measures: string
β Similarity measure for string candidate matches. If not specified, the default string similarity measure from the global settings will be applied. jaro
,jarowinkler
,levenshtein
,damerau_levenshtein
,qgram
,cosine
,smith_waterman
,lcs
levenshtein
similarity_measures: numeric
β Similarity threshold for numeric candidate matches. If not specified, the default numeric similarity measure from the global settings will be applied. step
,linear
,exp
,gauss
,squared
linear
All values should be strings.
global_settings:
directory: 'data'
default_phonetic_method: 'soundex'
default_pair_method: 'sortedneighbourhood'
default_similarity_measures:
string: 'jaro'
numeric: 'exp'
datasets:
- id: 'freedb_cds'
tables:
- 'freedb_cds_freedb_cds.csv'
phonetic_method: 'soundex'
gold_standard: 'freedb_cds_freedb_cds_goldstandard.csv'
similarity_measures:
numeric: 'gauss'
- id: 'hpi_cora'
tables:
- 'hpi_cora_hpi_cora.csv'
gold_standard: 'hpi_cora_hpi_cora_goldstandard.csv'
pair_method: 'block'
similarity_measures:
string: 'jarowinkler'
- id: 'bikes'
tables:
- 'http://pages.cs.wisc.edu/~anhai/data/784_data/bikes/csv_files/bikedekho.csv'
- 'http://pages.cs.wisc.edu/~anhai/data/784_data/bikes/csv_files/bikewale.csv'
gold_standard: 'http://pages.cs.wisc.edu/~anhai/data/784_data/bikes/csv_files/labeled_data.csv'
candidate_set: 'http://pages.cs.wisc.edu/~anhai/data/784_data/bikes/csv_files/candset.csv'
python src/main.py --config settings/config.yaml
If no path to a config file is specified, the application will look for a file named config.yaml
in the settings
directory by default.
The datasets specified in the config will be either downloaded and saved in the specified directory or loaded from this directory.
A log/logs.log
file will be created in the repo root directory. It will save all logs from an application run.