Reads PDFs and images (JPG and PNG by default) and writes the recognized text to text files.
Install Tesseract and its German language data:
sudo apt install tesseract-ocr tesseract-ocr-deu
Install the Python environment:
poetry install
Convert the PDFs and images in the current directory to text files:
poetry run ./digitize.py .
See digitize.py -h for more options.
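For orientation, here is a minimal sketch of how a single image could be handed to the Tesseract command-line tool. It is not taken from digitize.py: the helper name tesseract_command, the default language deu, and the choice to write <stem>_ocr.txt next to the input are all assumptions made for illustration.

```python
from pathlib import Path


def tesseract_command(image: Path, lang: str = "deu") -> list[str]:
    """Build a Tesseract CLI call for one image (hypothetical helper)."""
    # Assumption: the output sits next to the input as <stem>_ocr.txt.
    # Tesseract appends ".txt" to the output base on its own.
    output_base = image.with_name(image.stem + "_ocr")
    return ["tesseract", str(image), str(output_base), "-l", lang]


# Show the command that would be run for a sample file name.
print(" ".join(tesseract_command(Path("scan.png"))))
```

Passing the returned list to subprocess.run(..., check=True) would run the OCR, provided tesseract and the matching language data are installed.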
Example:
poetry run ./digitize.py --exclude DSC IMAG foto picture photo book -r -- ~/sync/private/
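The selection that a call like the one above implies might look roughly as follows. This is a sketch under stated assumptions, not the logic of digitize.py: it assumes that --exclude takes case-sensitive substrings of file names, that -r means recursing into subdirectories, and that .pdf, .jpg and .png are the default suffixes. The names SUFFIXES, is_wanted and select_inputs are hypothetical.

```python
from pathlib import Path

# Assumed default suffixes; the real defaults may differ.
SUFFIXES = {".pdf", ".jpg", ".png"}


def is_wanted(path: Path, exclude: list[str]) -> bool:
    """True if the suffix is supported and no exclude word occurs in the name."""
    if path.suffix.lower() not in SUFFIXES:
        return False
    # Assumption: exclude words are plain, case-sensitive substrings.
    return not any(word in path.name for word in exclude)


def select_inputs(root: Path, exclude: list[str], recursive: bool = False) -> list[Path]:
    """Collect the input files below root, optionally descending into subdirectories."""
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in candidates if p.is_file() and is_wanted(p, exclude))
```

With these assumptions, select_inputs(Path.home() / "sync" / "private", ["DSC", "IMAG"], recursive=True) would skip typical camera file names such as DSC_0042.jpg while keeping scanned documents.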
You may want to exclude the generated files matching *_ocr.txt from sync.
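How to exclude files depends on the sync tool, which usually accepts the glob pattern directly in its ignore settings. As a tool-independent illustration, the following sketch filters a list of file names with the same pattern; the function name files_to_sync is hypothetical.

```python
from fnmatch import fnmatchcase

OCR_PATTERN = "*_ocr.txt"


def files_to_sync(names: list[str]) -> list[str]:
    """Drop every file name that matches the generated-output pattern."""
    # fnmatchcase compares case-sensitively on every platform.
    return [name for name in names if not fnmatchcase(name, OCR_PATTERN)]


print(files_to_sync(["scan.png", "scan_ocr.txt", "notes.txt"]))
```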