jackfsuia / BertChunker

BertChunker: Efficient and Trained Chunking for Unstructured Documents. Train BERT to do semantic chunking of documents.

BertChunker: Efficient and Trained Chunking for Unstructured Documents

Model | Paper

Code for generating the dataset and training BertChunker, a semantic chunker for unstructured documents.

Generate dataset

See generate_dataset.ipynb
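
The notebook is the authoritative reference for how the dataset is built. As a rough illustration only, one common way to create training data for a learned chunker is to splice together passages from unrelated documents and mark the join points as chunk boundaries. The sketch below assumes that scheme; the function name make_example and the character-offset label format are hypothetical and are not taken from generate_dataset.ipynb.

```python
import random
from typing import List, Tuple


def make_example(passages: List[str], k: int, seed: int = 0) -> Tuple[str, List[int]]:
    """Concatenate k randomly chosen passages and return (text, boundaries).

    boundaries holds the character offset at which each passage after the
    first one starts. This label format is an assumption for illustration.
    """
    rng = random.Random(seed)
    chosen = rng.sample(passages, k)

    text = ""
    boundaries: List[int] = []
    for i, passage in enumerate(chosen):
        if i > 0:
            text += " "                   # separator between spliced passages
            boundaries.append(len(text))  # next passage starts here
        text += passage.strip()
    return text, boundaries
```

Slicing the returned text at the boundary offsets recovers the original passages, which makes the labels easy to sanity-check before training.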

Train from the base model all-MiniLM-L6-v2

Run

bash train.sh

Inference

See test.py
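
test.py shows the actual inference path. Conceptually, a trained chunker can be applied by scoring candidate split positions and cutting wherever the score is high. The sketch below assumes the model has already produced one boundary probability per sentence; split_on_boundaries and its threshold and max_sentences parameters are hypothetical and do not reflect the repository's API.

```python
from typing import List, Sequence


def split_on_boundaries(
    sentences: Sequence[str],
    boundary_probs: Sequence[float],
    threshold: float = 0.5,
    max_sentences: int = 8,
) -> List[str]:
    """Group sentences into chunks.

    boundary_probs[i] is the assumed probability that a new chunk starts at
    sentences[i]. A chunk is also closed once it holds max_sentences
    sentences, so that no chunk grows without bound.
    """
    if len(sentences) != len(boundary_probs):
        raise ValueError("sentences and boundary_probs must have equal length")

    chunks: List[str] = []
    current: List[str] = []
    for sentence, prob in zip(sentences, boundary_probs):
        starts_new = prob >= threshold or len(current) >= max_sentences
        if current and starts_new:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The size cap is a design choice of this sketch: it keeps chunks usable for downstream retrieval even when the predicted probabilities never cross the threshold.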

Citation

If you find this work helpful, please cite it as:

@article{BertChunker,
  title={BertChunker: Efficient and Trained Chunking for Unstructured Documents}, 
  author={Yannan Luo},
  year={2024},
  url={https://github.com/jackfsuia/BertChunker}
}

About

License: Apache License 2.0


Languages: Python 72.9%, Jupyter Notebook 19.3%, Shell 7.8%