lojban / freq

Word frequency list generator.

Introduction

This code takes the data from:

http://www.lojban.org/corpus/ (specifically the "compressed" link in "Corpus (compressed)" )

and

http://jbovlaste.lojban.org/ (specifically the XML Export -> English link)

and combines them with a template to generate a list of words, ordered by how often they are used, in whatever format you specify.

There are two modes. In blob mode, the template is run exactly once, with all words. In normal mode, the template is run once for each word and the results are concatenated. Which mode to use depends on how the template is written; normal mode is the usual choice.
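The difference between the two modes can be sketched with Ruby's standard ERB library. This is only an illustration: the helper names render_normal and render_blob, the constants, and the trimmed-down sample data are hypothetical and are not taken from this repository's code.

```ruby
require 'erb'

# Hypothetical sample data; real entries carry many more fields.
SAMPLE_WORDS = [
  { 'word' => 'lo', 'type' => 'cmavo', 'frequency' => 374663 },
  { 'word' => 'gleki', 'type' => 'gismu', 'frequency' => 54166 }
].freeze

# Normal mode: render the template once per word, with each field
# exposed as its own variable, then concatenate the results.
def render_normal(template, words)
  words.map { |w| ERB.new(template).result_with_hash(w) }.join
end

# Blob mode: render the template exactly once, with the whole list
# exposed as the single variable "words".
def render_blob(template, words)
  ERB.new(template).result_with_hash('words' => words)
end

# A normal-mode template only describes one line of output.
NORMAL_TEMPLATE = "<%= word %>\t<%= frequency %>\n"

# A blob-mode template has to do its own looping.
BLOB_TEMPLATE = "<% words.each do |w| %><%= w['word'] %>\t<%= w['frequency'] %>\n<% end %>"

puts render_normal(NORMAL_TEMPLATE, SAMPLE_WORDS)
puts render_blob(BLOB_TEMPLATE, SAMPLE_WORDS)
```

Both calls produce the same two tab-separated lines here. Blob mode becomes worthwhile when the template needs to see all the words at once, for example to emit a header or footer around the list.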

Example Run

./run_docker.sh -t templates/anki_gismu.erb -o test.out

gives you test.out with the lines generated by the anki_gismu template.

To get a shell in the docker environment for testing/debugging:

./run_docker.sh shell

Template Details

The YAML sample at the end of this section shows what the "words" variable looks like when you use the -b option (blob mode). In normal mode the data is the same, except that the template sees one word at a time and each per-word item is its own variable: "type" is a variable, "word" is a variable, and so on.

In normal mode, all the parts of the current word are available as a hash at the variable "word_all", and all the words are available as an array at the variable "words_all".

There is also a derived variable, rafsi_or_selmaho: the word's rafsi joined with spaces if it has any, otherwise its selmaho, otherwise the empty string.
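One way such a value could be derived is sketched below. The function is a hypothetical reconstruction from the description above, assuming each word is a hash with string keys as in the YAML sample in this section; the repository's actual code may differ.

```ruby
# Hypothetical helper: "entry" is one word's hash.
# Returns the rafsi joined with spaces if there are any,
# otherwise the selmaho, otherwise the empty string.
def rafsi_or_selmaho(entry)
  rafsi = entry['rafsi'] || []
  return rafsi.join(' ') unless rafsi.empty?

  entry['selmaho'] || ''
end

puts rafsi_or_selmaho({ 'word' => 'gleki', 'rafsi' => %w[gek gei] })
puts rafsi_or_selmaho({ 'word' => 'lo', 'selmaho' => 'LE' })
```

The first call prints the joined rafsi ("gek gei") and the second falls back to the selmaho ("LE").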

The Ruby "||=" idiom is very useful here. For example, this:

<%= notes %>

will fail if any word has no notes field, but this:

<%= notes ||= '' %>

works fine: the assignment creates the variable with an empty default when it is missing, and this auto-vivification does no harm.
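The behaviour can be reproduced outside the generator with plain ERB. The sketch below assumes the per-word fields reach the template as local variables, which is what ERB's result_with_hash does; try_render and WORD_WITHOUT_NOTES are hypothetical names used only for this demonstration.

```ruby
require 'erb'

# A word without a "notes" field, mimicking normal-mode variables.
WORD_WITHOUT_NOTES = { 'word' => 'lo', 'type' => 'cmavo' }.freeze

# Render a template with the given variables. Returns the rendered
# text, or the name of the exception class if a NameError is raised.
def try_render(template, vars)
  ERB.new(template).result_with_hash(vars)
rescue NameError => e
  e.class.name
end

# A bare reference to a missing variable raises NameError.
puts try_render("<%= notes %>", WORD_WITHOUT_NOTES)

# With ||=, Ruby defines the local as nil and then assigns the default.
puts try_render("[<%= notes ||= '' %>]", WORD_WITHOUT_NOTES)
```

When the field is present, `||=` leaves its value alone, so the same template line is safe for every word.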

- definition: 'descriptor: the one, which (is / does) ... / those, which (are / do)
    ...'
  selmaho: LE
  notes: Terminated with {ku}. Under the xorlo reform, {lo} converts a selbri to a
    sumti in a generic way. In particular, lo broda = {zo'e} noi broda.
  glosswords:
  - word: that, which
  type: cmavo
  word: lo
  frequency: 374663
  rank_up: 1
  rank_down: 9720
- definition: "$x_{1}$ is happy/merry/glad/gleeful about $x_{2}$ (event/state)."
  notes: Adversity (= {kamnalgei}).  See also {badri}, {cinmo}.
  rafsi:
  - gek
  - gei
  glosswords:
  - word: happy
  type: gismu
  word: gleki
  frequency: 54166
  rank_up: 13
  rank_down: 9708
- definition: "$x_{1}$ (agent) expresses/says $x_{2}$ (sedu'u/text/lu'e concept) for
    audience $x_{3}$ via expressive medium $x_{4}$."
  notes: Also says.  See also {bacru}, {tavla}, {casnu}, {spuda}, cmavo list {cu'u},
    {bangu}, {dapma}, {jufra}, {pinka}.
  rafsi:
  - cus
  - sku
  glosswords:
  - word: say
  - word: express
  - word: mention
  keywords:
  - word: expresser
    place: '1'
  - word: expression
    sense: expressed information
    place: '2'
  - word: audience
    sense: receiver of expression
    place: '3'
  - word: medium
    sense: communication format
    place: '4'
  type: gismu
  word: cusku
  frequency: 19373
  rank_up: 47
  rank_down: 9674
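
As the sample shows, optional keys such as "rafsi", "selmaho", "notes" and "keywords" are simply absent when a word has no such data, so code that walks the structure should tolerate missing keys. The following is a minimal sketch using Ruby's standard YAML library on an abbreviated copy of two entries; gloss_list and SAMPLE_YAML are hypothetical names, not part of this repository.

```ruby
require 'yaml'

# Abbreviated copy of two entries from the sample above.
SAMPLE_YAML = <<~EOS
  - glosswords:
    - word: that, which
    type: cmavo
    word: lo
    frequency: 374663
  - rafsi:
    - gek
    - gei
    glosswords:
    - word: happy
    type: gismu
    word: gleki
    frequency: 54166
EOS

# Hypothetical helper: join the gloss words of one entry, tolerating
# entries that have no "glosswords" key at all.
def gloss_list(entry)
  (entry['glosswords'] || []).map { |g| g['word'] }.join('; ')
end

words = YAML.safe_load(SAMPLE_YAML)
words.each do |entry|
  # One tab-separated line per word: word, type, gloss words.
  puts [entry['word'], entry['type'], gloss_list(entry)].join("\t")
end
```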

License: MIT License