This tool extracts question-answer pairs from documents using a local LLM. It chunks each document, generates QA pairs from every chunk, and collects the results into a single CSV.
Install directly from GitHub:
pip install git+https://github.com/sbetko/genqa.git --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
GenQA works in two steps: extraction and collection.
python -m genqa.extract <input_files> --output_dir <output_directory> [options]
This processes documents and saves QA pairs as JSON files, preserving context and structure.
Options:
- `--temperature`: LLM temperature (default: 0.0)
- `--n_ctx`: Max context length (default: 16384)
- `--chunk_size`: Max text chunk size (default: 4096)
- `--max_questions`: Max questions per chunk (default: 3)
- `--overwrite`: Overwrite existing files
Example:
python -m genqa.extract docs/*.pdf docs/*.docx docs/*.html --output_dir qa_results --temperature 0.1 --max_questions 3
python -m genqa.make_csv <input_directory> <output_csv_file>
This compiles all QA pairs from JSON files into a single CSV file for easier analysis.
Example:
python -m genqa.make_csv qa_results output_qa_pairs.csv
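Once compiled, the CSV can be read back with the standard library alone. A short sketch, using the column names from the reference further down (the filename matches the example above):

```python
import csv

with open("output_qa_pairs.csv", newline="") as f:
    for row in csv.DictReader(f):
        # supporting_quotes joins the quotes with "|"; split them back apart.
        quotes = row["supporting_quotes"].split("|")
        print(row["file_path"], "->", row["question"])
        print("  quotes:", quotes)
```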
This two-step process lets you process documents in batches and review outputs before final collection.
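To drive both steps from one script, here is a minimal sketch that shells out to the two documented commands (the input glob, output directory, and CSV name are placeholders; adjust to your corpus):

```python
import subprocess
from pathlib import Path

# Placeholder corpus and output locations.
docs = [str(p) for p in Path("docs").glob("*.pdf")]
out_dir = "qa_results"

# Step 1: extraction. Writes one JSON file of QA pairs per document.
subprocess.run(
    ["python", "-m", "genqa.extract", *docs, "--output_dir", out_dir],
    check=True,
)

# ...review the JSON files in out_dir here before collecting...

# Step 2: collection. Compiles every JSON file into a single CSV.
subprocess.run(
    ["python", "-m", "genqa.make_csv", out_dir, "output_qa_pairs.csv"],
    check=True,
)
```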
There's also a standalone converter for DOCX, HTML, and PDF to Markdown:
python -m genqa.convert <input_file> > output.md
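To convert a whole folder, you can wrap the converter in a short script. This sketch assumes only what the command above shows, namely that the converter prints Markdown to stdout; the input directory and extension are placeholders:

```python
import subprocess
from pathlib import Path

# Convert every DOCX in a placeholder directory to a sibling .md file.
for src in Path("docs").glob("*.docx"):
    result = subprocess.run(
        ["python", "-m", "genqa.convert", str(src)],
        check=True, capture_output=True, text=True,
    )
    # The converter writes Markdown to stdout, so capture and save it.
    src.with_suffix(".md").write_text(result.stdout)
```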
The model supports up to 128,000 tokens but defaults to 16,384 to balance context and memory use. Adjust with `--n_ctx` if needed.
The default chunk size leaves room within the maximum context for the instructions, the chunk text, and the generated output. If generation halts due to context overflow, the script retries at a higher temperature or skips the chunk.
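As a back-of-the-envelope check on the defaults (treating `--chunk_size` as a token count for the arithmetic; if it is measured in characters, the margin is roughly four times larger at about four characters per token):

```python
# Illustrative context-budget arithmetic, not GenQA internals.
n_ctx = 16_384   # default --n_ctx, in tokens
chunk = 4_096    # default --chunk_size, assumed tokens for this estimate

# Whatever is left over must hold the instruction prompt plus the
# generated QA pairs; exceed it and generation overflows the context.
print(f"Budget for prompt + output: {n_ctx - chunk} tokens")  # 12288
```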
GPU support significantly speeds up processing. Installation instructions for GPU acceleration coming soon.
Supported input formats:

- PDF
- DOCX
- HTML
Each document gets a JSON file with:
- `source_filepath`: Original document path
- `markdown_text`: Full document text in Markdown
- `chunks`: Array of text chunks, each with:
  - `chunk_text`: Chunk content
  - `qa_pairs`: Array of QA pairs (question, answer, supporting_quotes)
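For reviewing a single result before collection, the documented keys are enough to walk the file. A minimal sketch (the filename is a placeholder, and `supporting_quotes` is assumed to be a JSON array, since `make_csv` later joins the quotes with pipes):

```python
import json

# Load one extraction result; the filename is a placeholder.
with open("qa_results/example.json") as f:
    doc = json.load(f)

print("source:", doc["source_filepath"])
for i, chunk in enumerate(doc["chunks"]):
    for qa in chunk["qa_pairs"]:
        print(f"[chunk {i}] Q: {qa['question']}")
        print(f"           A: {qa['answer']}")
        print(f"           quotes: {qa['supporting_quotes']}")
```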
The CSV contains:
- `file_path`: Original document path
- `chunk_number`: Chunk number in document
- `qa_number`: QA pair number in chunk
- `question`: Generated question
- `answer`: Generated answer
- `supporting_quotes`: Relevant quotes from chunk, delimited by pipes (`|`)
Uses Phi-3.1-mini-128k-instruct. The GGUF file auto-downloads to `~/.cache/huggingface/hub` on first run.