BuildOccurrence: process function mismatches filenames
munnellg opened this issue · comments
Gary Munnelly commented
The process
function in BuildOccurrence
uses a regular expression to locate the date in the filename. However, the regular expression will match the first instance of a number it finds, which means that any ID numbers contained in the filename are erroneously extracted as years.
process
will also ignore files that do not have a file extension.
Pierpaolo Basile commented
In some cases, the time period is not equivalent to a specific year but can span several years. Or the corpus is not split according to time information (e.g. author, or other features). The number in the filename is used to sort the files and associates an id to each word space.
I will fix the documentation.