pippokill / tri

Temporal Random Indexing

BuildOccurrence: process function mismatches filenames

munnellg opened this issue

The process function in BuildOccurrence uses a regular expression to locate the date in the filename. However, the regular expression matches the first number it finds, so any ID number that appears before the date in the filename is erroneously extracted as the year.

The process function also ignores files that do not have a file extension.
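
To illustrate the reported behaviour, here is a minimal sketch in Python. It is not the project's code: the filename layout, the `\d+` pattern, and the helper `last_number` are all assumptions made for the example. It shows how a "first number" match picks up an ID instead of the year, and one possible alternative that takes the last number and tolerates a missing extension.

```python
import re

# Hypothetical filename: an ID followed by a year (layout is an assumption).
filename = "doc_1042_1987.txt"

# Assumed behaviour described in the issue: take the first run of digits.
first_number = re.search(r"\d+", filename).group()


def last_number(name):
    """Return the last run of digits before an optional extension, or None."""
    match = re.search(r"(\d+)(?:\.[^.]*)?$", name)
    return match.group(1) if match else None


print(first_number)                      # the ID, not the year
print(last_number(filename))             # the trailing number
print(last_number("doc_1042_1987"))      # also works without an extension
```

Taking the last number only helps if the corpus consistently puts the time label at the end of the filename, which is itself an assumption about the naming scheme.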

In some cases, the time period is not equivalent to a specific year but can span several years, or the corpus is not split according to time information at all (e.g. it is split by author or by other features). The number in the filename is used to sort the files and to associate an id with each word space.
I will fix the documentation.
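
As a rough sketch of that intended use, the number can be treated as a sort key and id rather than as a year. The filenames and the helper `sort_key` below are hypothetical and are not taken from the project.

```python
import re


def sort_key(name):
    """Use the first number in the filename as an integer id (-1 if none)."""
    match = re.search(r"\d+", name)
    return int(match.group()) if match else -1


# Hypothetical word-space files; the numbers are period ids, not years.
files = ["space_20.bin", "space_3.bin", "space_100.bin"]

# Numeric sorting keeps 3 < 20 < 100, unlike plain string sorting.
ordered = sorted(files, key=sort_key)
ids = {name: sort_key(name) for name in ordered}
print(ordered)
```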