mcurmei627 / RabBIT

A collection of quick and dirty scripts to analyze the Wikipedia corpus

RabBIT

This repository contains R scripts, mainly for running simulations over the corpus using some of the functionality provided by the BitFunnel ingestor: https://github.com/BitFunnel

Creating manifest files

The chunk files can be found here: https://www.dropbox.com/s/ysava18au2q7t67/wiki-chunks.tar.gz?dl=0 There are 4 folders {AA, AB, AC, AD}, each containing 100 chunk files, except the last one, which has 96. Each chunk file contains around 500 documents.

A manifest file is a text file in which each line holds the fully specified (absolute) system path of one chunk file.
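
One possible way to build such a manifest is sketched below. It is not part of this repository: the function name make_manifest and the example paths ./wiki-chunks and ./manifest.txt are hypothetical, and it assumes the archive has already been extracted so that the chunk folders sit under a single directory.

```shell
#!/bin/sh
# Sketch only: make_manifest and the paths in the example are assumptions,
# not names used by the repository.
# Usage: make_manifest <chunk directory> <manifest file>
make_manifest() {
    chunk_dir=$1
    manifest=$2
    # Resolve the chunk directory to an absolute path, so every line in the
    # manifest is a fully specified path regardless of the working directory.
    abs_dir=$(cd "$chunk_dir" && pwd) || return 1
    # List every regular file below the chunk directory (all subfolders),
    # one path per line, sorted for a stable order.
    # Keep the manifest outside the chunk directory so it does not list itself.
    find "$abs_dir" -type f | sort > "$manifest"
}

# Example: make_manifest ./wiki-chunks ./manifest.txt
```

Restricting the manifest to a subset of folders (for example only AA) gives a smaller corpus for quick experiments; passing that folder as the first argument is enough.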

Obtaining a test corpus

Clone the experiment branch https://github.com/BitFunnel/BitFunnel/tree/experiment and run:

make && ./StatisticsBuilder <manifest filepath>

The result is two $-separated CSV files. They are not comma-separated because some terms contain commas in the posting string, which would confuse a CSV reader.
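
Because of the unusual separator, generic CSV tools need to be told to split on $ explicitly. The following is a minimal sketch for a quick look at one column from the shell; the function name dollar_field is hypothetical, and since the column layout of the output files is not described here, the field number is left to the caller.

```shell
#!/bin/sh
# Sketch only: dollar_field is a hypothetical helper, not part of the repository.
# Usage: dollar_field <file> <field number>
# Prints the requested '$'-separated field of every line in the file.
dollar_field() {
    file=$1
    n=$2
    # '[$]' is a bracket expression, so awk splits on a literal dollar sign.
    # Commas inside a field are left untouched.
    awk -F'[$]' -v n="$n" '{ print $n }' "$file"
}

# Example: dollar_field <output file> 1
```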
