viveksck / langchangetrack

Package for Statistically significant linguistic change


The code fails to create time series when using my original dataset: 'Number of words we are analyzing: 0'

fufufukakaka opened this issue · comments

When I run this code on my original dataset, it fails to create the time series.
I prepared the ngrams and common_vocab files in the same way as in this repository.
Why does it report 'Number of words we are analyzing: 0'?

$ ngrams_pipeline.py --corpus-dir ngram100 --file-extension "ngrams" --working-dir ./working --output-dir ./output --context-size 5 --epochs 3 --start-time-point 1905 --end-time-point 1975 --step-size 5 --vocabulary-file common_vocab2.txt --workers 16
Processing files, ngram100/1905.ngrams ngram100/1910.ngrams ngram100/1915.ngrams ngram100/1920.ngrams ngram100/1925.ngrams ngram100/1930.ngrams ngram100/1935.ngrams ngram100/1940.ngrams ngram100/1945.ngrams ngram100/1950.ngrams ngram100/1955.ngrams ngram100/1960.ngrams ngram100/1965.ngrams ngram100/1970.ngrams
Training embeddings
Models will be stored in, ./working/models
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence the citation notice: run 'parallel --citation'.


Computers / CPU cores / Max jobs to run
1:local / 8 / 4

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:4/0/100%/0.0s train_embeddings_ngrams.py -f ngram100/1920.ngrams -o ./working/models -p 1920 -e skipgram -workers 4 --epochs 3 -w 5
Building a model from the corpus.
Model built.
2016-07-14 11:23:41 INFO corpustoembeddings.py: 59 window size:5, alpha:0.01, embedding size:200, min_count:10, workers:4
2016-07-14 11:23:41 INFO word2vec.py: 395 collecting all words and their counts
2016-07-14 11:23:45 INFO word2vec.py: 423 PROGRESS: at sentence #0, processed 0 words and 0 word types
2016-07-14 11:23:45 INFO word2vec.py: 423 PROGRESS: at sentence #10000, processed 79436 words and 13124 word types
2016-07-14 11:23:45 INFO word2vec.py: 423 PROGRESS: at sentence #20000, processed 158902 words and 16428 word types
2016-07-14 11:23:45 INFO word2vec.py: 423 PROGRESS: at sentence #30000, processed 238544 words and 17704 word types
2016-07-14 11:23:45 INFO word2vec.py: 423 PROGRESS: at sentence #40000, processed 318058 words and 18188 word types
2016-07-14 11:23:45 INFO word2vec.py: 423 PROGRESS: at sentence #50000, processed 397421 words and 18453 word types
2016-07-14 11:23:45 INFO word2vec.py: 423 PROGRESS: at sentence #60000, processed 476940 words and 18540 word types
2016-07-14 11:23:45 INFO word2vec.py: 423 PROGRESS: at sentence #70000, processed 556187 words and 18582 word types
2016-07-14 11:23:45 INFO word2vec.py: 423 PROGRESS: at sentence #80000, processed 635476 words and 18601 word types
2016-07-14 11:23:45 INFO word2vec.py: 423 PROGRESS: at sentence #90000, processed 714631 words and 18615 word types

Config:
Input data frame file name: ./working/timeseries/source.csv
Vocab file common_vocab2.txt
Output pvalue file ./output/pvals.csv
Output sample file ./output/samples.csv
Columns to drop 1905
Normalize Time series: True
Threshold 1.75
Dropped column 1905
Columns of the data frame are Index([u'Unnamed: 0', u'word', u'1910', u'1915', u'1920', u'1925', u'1930',
       u'1935', u'1940', u'1945', u'1950', u'1955', u'1960', u'1965', u'1970'],
      dtype='object')
Number of words we are analyzing: 0
/Users/username/.pyenv/versions/2.7.9/lib/python2.7/site-packages/more_itertools-2.2-py2.7.egg/more_itertools/more.py:30: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
[Parallel(n_jobs=4)]: Done   0 out of   0 | elapsed:    0.0s finished
Traceback (most recent call last):
  File "/Users/username/.pyenv/versions/2.7.9/bin/detect_changepoints_word_ts.py", line 4, in <module>
    __import__('pkg_resources').run_script('langchangetrack==0.1.0', 'detect_changepoints_word_ts.py')
  File "/Users/username/.pyenv/versions/2.7.9/lib/python2.7/site-packages/pkg_resources.py", line 517, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Users/username/.pyenv/versions/2.7.9/lib/python2.7/site-packages/pkg_resources.py", line 1436, in run_script
    exec(code, namespace, namespace)
  File "/Users/username/.pyenv/versions/2.7.9/lib/python2.7/site-packages/langchangetrack-0.1.0-py2.7.egg/EGG-INFO/scripts/detect_changepoints_word_ts.py", line 227, in <module>
    main(args)
  File "/Users/username/.pyenv/versions/2.7.9/lib/python2.7/site-packages/langchangetrack-0.1.0-py2.7.egg/EGG-INFO/scripts/detect_changepoints_word_ts.py", line 189, in main
    pvals, num_samples = zip(*results)
ValueError: need more than 0 values to unpack
Traceback (most recent call last):
  File "/Users/username/.pyenv/versions/2.7.9/bin/ngrams_pipeline.py", line 4, in <module>
    __import__('pkg_resources').run_script('langchangetrack==0.1.0', 'ngrams_pipeline.py')
  File "/Users/username/.pyenv/versions/2.7.9/lib/python2.7/site-packages/pkg_resources.py", line 517, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Users/username/.pyenv/versions/2.7.9/lib/python2.7/site-packages/pkg_resources.py", line 1436, in run_script
    exec(code, namespace, namespace)
  File "/Users/username/.pyenv/versions/2.7.9/lib/python2.7/site-packages/langchangetrack-0.1.0-py2.7.egg/EGG-INFO/scripts/ngrams_pipeline.py", line 55, in <module>
    main(args)
  File "/Users/username/.pyenv/versions/2.7.9/lib/python2.7/site-packages/langchangetrack-0.1.0-py2.7.egg/EGG-INFO/scripts/ngrams_pipeline.py", line 28, in main
    subprocess.check_call(cmd, shell=True)
  File "/Users/username/.pyenv/versions/2.7.9/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'detect_cp_distributional.sh ./working/models ./working ./output 1905 1975 5 locallinear 1000 common_vocab2.txt 1000 1.75 4' returned non-zero exit status 1

Thank you.

I'm sorry, it was my fault.
My vocabulary file contained some numeric tokens, and those tokens were not read correctly.
I have now fixed this.
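For anyone hitting the same message: "Number of words we are analyzing: 0" means the intersection of the vocabulary file and the words in the time series data frame is empty. A minimal sketch of that failure mode, using hypothetical in-memory data (the real pipeline reads common_vocab2.txt and ./working/timeseries/source.csv), is:

```python
import pandas as pd

# Hypothetical reproduction: the pipeline analyzes only words that appear
# both in the vocabulary file and in the time series data frame, so any
# mismatch (numeric tokens parsed as numbers, stray whitespace, encoding
# differences) can empty the intersection.
vocab = {"apple", "banana", "1905"}              # as read from the vocab file
df = pd.DataFrame({"word": ["apple", "banana", 1905]})

# A purely numeric token may be stored as an int; comparing the int 1905
# against the string "1905" fails, so it drops out of the overlap.
raw_overlap = vocab & set(df["word"])
print(len(raw_overlap))   # only "apple" and "banana" match

# Normalizing both sides to stripped strings restores the match.
fixed_overlap = vocab & set(df["word"].astype(str).str.strip())
print(len(fixed_overlap))  # all three words match
```

Loading the CSV with dtype=str (or casting the word column as above) avoids the numeric-token mismatch.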