CodedotAl / code_clippy_lm_dataformat

LM_Dataformat

Utilities for storing data for LM training.

Basic Usage

To write:

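from lm_dataformat import Archive

ar = Archive('output_dir')

# placeholder documents; replace with your own source of (text, label) pairs
documents = [('first document', 'a'), ('second document', 'b')]

for text, label in documents:
  # do other stuff
  ar.add_data(text, meta={
    'example': label,
    'someothermetadata': [len(text), 'otherrandomstuff'],
    'otherotherstuff': True
  })

# remember to commit at the end!
ar.commit()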

To read:

from lm_dataformat import Reader

rdr = Reader('input_dir_or_file')

for doc in rdr.stream_data():
  # do something with the document
  print(doc)
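
The write-then-commit and stream-read pattern can also be sketched without the library. The example below uses only the standard library and is a rough stand-in, not the library's implementation: TinyArchive, TinyReader, round_trip, the gzip-compressed JSON Lines layout, and the data_N.jsonl.gz file names are all assumptions made for illustration.

```python
import gzip
import json
import os
import tempfile


class TinyArchive:
    """Hypothetical stand-in: buffers documents and writes them on commit()."""

    def __init__(self, out_dir):
        self.out_dir = out_dir
        os.makedirs(out_dir, exist_ok=True)
        self._buffer = []
        self._chunk = 0

    def add_data(self, text, meta=None):
        # Nothing touches the disk until commit() is called.
        self._buffer.append({"text": text, "meta": meta or {}})

    def commit(self):
        if not self._buffer:
            return
        # Assumed file layout: one gzip-compressed JSON Lines file per commit.
        path = os.path.join(self.out_dir, f"data_{self._chunk}.jsonl.gz")
        with gzip.open(path, "wt", encoding="utf-8") as f:
            for record in self._buffer:
                f.write(json.dumps(record) + "\n")
        self._buffer = []
        self._chunk += 1


class TinyReader:
    """Hypothetical stand-in: yields document texts from every chunk file."""

    def __init__(self, in_dir):
        self.in_dir = in_dir

    def stream_data(self):
        for name in sorted(os.listdir(self.in_dir)):
            if not name.endswith(".jsonl.gz"):
                continue
            path = os.path.join(self.in_dir, name)
            with gzip.open(path, "rt", encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)["text"]


def round_trip(docs, commit=True):
    """Write docs to a temporary directory and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        ar = TinyArchive(tmp)
        for i, text in enumerate(docs):
            ar.add_data(text, meta={"index": i})
        if commit:
            ar.commit()
        return list(TinyReader(tmp).stream_data())
```

In this sketch, calling round_trip with commit=False returns an empty list, because the buffered documents are never written to disk. That is the reason for the "remember to commit at the end!" reminder in the write example.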

About

License: MIT License

