Count the tokens in any JSON dataset. Arbitrary JSON layouts are handled dynamically; just have your JSON file ready.
pip install -r requirements.txt
python calc_token.py <your-file.json>
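calc_token.py itself is not reproduced here, so the following is only a minimal sketch of how counting tokens over an arbitrary JSON layout could work. It assumes that only string values are counted (keys, numbers, booleans and null are skipped) and that the cl100k_base encoding is used; the names iter_strings, count_tokens and count_file_tokens are hypothetical and may not match the real script.

```python
import json
from typing import Any, Callable, Iterator, Sequence


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string value found anywhere in a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():  # assumption: object keys are not counted
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)
    # Numbers, booleans and null are ignored in this sketch.


def count_tokens(data: Any, encode: Callable[[str], Sequence]) -> int:
    """Sum the token counts of all string values, using the given encoder."""
    return sum(len(encode(text)) for text in iter_strings(data))


def count_file_tokens(path: str, encoding_name: str = "cl100k_base") -> int:
    """Load a JSON file and count its tokens with tiktoken."""
    import tiktoken  # third-party; imported lazily so the helpers above work without it

    enc = tiktoken.get_encoding(encoding_name)
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # encode_ordinary treats special-token text as plain text instead of raising.
    return count_tokens(data, enc.encode_ordinary)
```

Because count_tokens takes the encoder as a parameter, the traversal can be checked without network access by passing a stand-in such as str.split, which counts whitespace-separated words instead of real tokens.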
Linux
- Create and activate a Python venv
python -m venv env
source env/bin/activate
- install python packages
pip install -r requirements.txt
- Run once (tiktoken downloads and caches its encoding file on this first run)
python calc_token.py <your-file.json>
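- Copy the tmp cache files
The tiktoken package needs to download the cl100k_base encoding into its cache, which requires network access. For offline use, we can download the necessary file ahead of time and then "trick" tiktoken into treating it as already cached: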
https://stackoverflow.com/questions/76106366/how-to-use-tiktoken-in-offline-mode-computer
cp -r /tmp/data-gym-cache .
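The copy above works because of where tiktoken keeps its cached files. A minimal sketch of that lookup, assuming tiktoken's loader behaves as described in the linked answer: the cache directory comes from TIKTOKEN_CACHE_DIR, then DATA_GYM_CACHE_DIR, and otherwise falls back to data-gym-cache under the system temp directory, and each file is named after the SHA-1 hex digest of its download URL. These details and the URL below are assumptions that may differ between tiktoken versions, and the helper names are hypothetical.

```python
import hashlib
import os
import tempfile

# Assumed download URL for cl100k_base; check it against your tiktoken version.
CL100K_URL = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"


def cache_dir() -> str:
    """Return the directory tiktoken is assumed to use for cached encodings."""
    for var in ("TIKTOKEN_CACHE_DIR", "DATA_GYM_CACHE_DIR"):
        if var in os.environ:
            return os.environ[var]
    return os.path.join(tempfile.gettempdir(), "data-gym-cache")


def cache_key(url: str) -> str:
    """Return the assumed cache file name: the SHA-1 hex digest of the URL."""
    return hashlib.sha1(url.encode()).hexdigest()


def cached_file(url: str) -> str:
    """Return the full path where the encoding file is expected to be cached."""
    return os.path.join(cache_dir(), cache_key(url))


if __name__ == "__main__":
    path = cached_file(CL100K_URL)
    print(path, "exists" if os.path.exists(path) else "missing")
```

Under the same assumptions, pointing TIKTOKEN_CACHE_DIR at the unpacked data-gym-cache directory would be an alternative to copying it into /tmp.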
- Pack as tar.gz
cd ..
tar -czvf token-calc.tar.gz token-calc
- Set up the env (on the target machine, after copying token-calc.tar.gz over)
tar -zxvf token-calc.tar.gz
cd token-calc
cd env/bin && rm python* && ln -s python3 python && ln -s python3 python3.10 && ln -s /usr/bin/python3 python3
vi activate # Set "VIRTUAL_ENV" to the absolute path of the env directory in its new location
cd ../..
source env/bin/activate
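The manual edit of activate can also be scripted. A minimal sketch, assuming the script contains a line of the form VIRTUAL_ENV="..." at the start of a line, as python -m venv generates on Python 3.10 (other versions may format it differently); the helpers retarget_activate and fix_env are hypothetical.

```python
import re
from pathlib import Path

# Matches the assignment line only, not "export VIRTUAL_ENV" or indented uses.
_ASSIGNMENT = re.compile(r"^VIRTUAL_ENV=.*$", flags=re.MULTILINE)


def retarget_activate(script_text: str, env_dir: str) -> str:
    """Return the activate script with VIRTUAL_ENV pointing at env_dir."""
    new_line = f'VIRTUAL_ENV="{env_dir}"'
    # A function replacement avoids backslash-escape handling in the new path.
    updated, count = _ASSIGNMENT.subn(lambda _m: new_line, script_text, count=1)
    if count == 0:
        raise ValueError("no VIRTUAL_ENV assignment found in activate script")
    return updated


def fix_env(env_dir: str) -> None:
    """Rewrite <env_dir>/bin/activate in place for the env's current location."""
    env_path = Path(env_dir).resolve()
    activate = env_path / "bin" / "activate"
    activate.write_text(retarget_activate(activate.read_text(), str(env_path)))
```

Calling fix_env("env") from the unpacked token-calc directory would rewrite env/bin/activate so that VIRTUAL_ENV matches the new absolute path.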
- Copy the tmp cache files back
cp -r data-gym-cache /tmp
- run python
python calc_token.py <your-file.json>