rename_by_content (RBC)
Automatically rename files by looking at their contents.
RBC is a Python script that automatically guesses (hopefully) useful names and dates for files. It was written to recover thousands of files that were deleted by mistake and partially recovered by the excellent tool photorec.
Supported file formats are: pdf, ai, doc, tar, zip, txt, mbox, ods, xls, xlsx, docx, docm, html, rtf, odt, png, jpg, gif, bmp, tif, ppt, pptx, odg.
For images, RBC uses optical character recognition (OCR) to try and extract information.
Requirements
- A Linux machine with several open-source utilities (should work on a Mac too, in principle):
  - exiftool (extracts file metadata). Please make sure that your exiftool install is complete. For instance, find a .docx file and run exiftool myfile.docx, then check the result for the line:
    File Type Extension : docx
  - tesseract (great OCR program). Use version 4 for best results (there is a PPA for Ubuntu).
  - libreoffice (to convert office documents to txt)
  - pdftotext (usually included in any Linux distro; otherwise install poppler-utils)
  - mutool (converts pdf to image: sudo apt install mupdf-tools. This one can be replaced by its many equivalents, but mupdf is great.)
  - pandoc (sudo apt install pandoc)
- Python 2.7, with additional packages:
  - pyexiftool (downloaded directly as exiftool.py, see Installation)
  - magic (sudo apt install python-magic)
  - dateparser (sudo pip install dateparser)
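Before running RBC, it can help to verify that the external tools are reachable. The following is a minimal sketch, not part of RBC: the helper names (find_tool, missing_tools, has_docx_extension) are hypothetical, and the command names are assumptions (some distributions install LibreOffice as soffice, for instance).

```python
import os

# Assumed command names; adjust them to your distribution if needed.
REQUIRED_TOOLS = ["exiftool", "tesseract", "libreoffice",
                  "pdftotext", "mutool", "pandoc"]


def find_tool(name, path=None):
    """Return the full path of executable `name` found on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(names=REQUIRED_TOOLS, path=None):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if find_tool(name, path) is None]


def has_docx_extension(exiftool_output):
    """Check exiftool's text output for the line 'File Type Extension : docx'."""
    for line in exiftool_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "File Type Extension" and value.strip() == "docx":
            return True
    return False
```

Calling missing_tools() returns the list of tools that still need to be installed. has_docx_extension expects the text printed by exiftool myfile.docx, decoded to a string if it was captured as bytes.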
Installation
- Download rename_by_content.py.
- Download exiftool.py into the same directory.
- Make sure the other tools mentioned above are installed on your system.
Usage
Command-line usage
python ./rename_by_content.py [-h] [-d] [-b]
[--output OUTPUT]
[--log LOG]
[--ocrdir OCRDIR]
files [files ...]
Search for a title and a date for all files, and copy the renamed files into OUTPUT. Inside the OUTPUT dir, paths have the form year/month/name_of_file.ext, for instance 2018/02/example.pdf.
The name is misleading: RBC actually copies the files into the OUTPUT directory. The original files are not affected (apart from being read, of course).
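The year/month/name layout can be illustrated with a short sketch. The function destination_path below is hypothetical and only mirrors the path scheme described above; it is not RBC's own code, and it assumes that the date and title have already been guessed.

```python
import datetime
import os


def destination_path(output_dir, date, title, ext):
    """Build OUTPUT/year/month/title.ext from a date and a guessed title."""
    year = "%04d" % date.year    # four-digit year, e.g. "2018"
    month = "%02d" % date.month  # zero-padded month, e.g. "02"
    return os.path.join(output_dir, year, month, title + ext)
```

For example, destination_path("/tmp/newfiles", datetime.date(2018, 2, 14), "example", ".pdf") gives the path 2018/02/example.pdf inside /tmp/newfiles.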
- files can be the path of a single file, or a shell pattern of the form dir/* if you want to process all files in the dir directory.
- The directory OCRDIR contains all the texts extracted from the given files. If you run RBC a second time with the same OCRDIR, it will use the previously generated text, and hence run much faster. On the other hand, it is safe to delete the OCRDIR directory to force text extraction to start over when running RBC again.
- The LOG file contains a list of all operations done, and the list of errors. This file can be used to cancel the operation, that is, to remove all files that have been copied. For this, use the Python function remove_from_summary.
- -b or --batch: batch mode, does not wait for user input.
- -d or --dry: dry-run mode, does everything but the final copy. However, the text files are still generated in OCRDIR and the LOG is written.
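The reason a second run with the same OCRDIR is faster can be sketched as a simple on-disk cache. This is an illustration under stated assumptions, not RBC's actual implementation: the function cached_text, the extract callback, and the naming scheme (one .txt file per source file name) are all hypothetical.

```python
import io
import os


def cached_text(source, ocr_dir, extract):
    """Return the text of `source`, reusing a copy stored in `ocr_dir` if present.

    `extract` is any function mapping a file path to a unicode string
    (for instance a wrapper around an OCR tool).
    """
    if not os.path.isdir(ocr_dir):
        os.makedirs(ocr_dir)
    # Hypothetical naming scheme: one .txt file per source file name.
    cache_file = os.path.join(ocr_dir, os.path.basename(source) + ".txt")
    if os.path.exists(cache_file):
        # Cache hit: skip the expensive extraction step.
        with io.open(cache_file, encoding="utf-8") as handle:
            return handle.read()
    text = extract(source)
    with io.open(cache_file, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text
```

Deleting the cache directory simply makes every lookup miss, so all texts are extracted again on the next run.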
Example
python ./rename_by_content.py --output /tmp/newfiles /home/joe/recup_dir/*
This will examine all files in /home/joe/recup_dir/ and copy them, with new names, into /tmp/newfiles.
In python programs
See the file example.py.
Essentially you have to do
import rename_by_content as rbc
and then you may use the function
- rbc.batch(files, newdir), which will process all files and copy them with their new title into newdir.
  You may also use the optional arguments dry and ocr_dir:
  - dry is a boolean. If true, the final copy is not done.
  - ocr_dir is the path of the temporary directory used to store texts extracted from the files.
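A possible way to call rbc.batch from your own script is sketched below. The wrapper rename_recovered is hypothetical, and passing dry and ocr_dir as keyword arguments is an assumption based on the description above; check example.py for the authoritative usage.

```python
import glob


def rename_recovered(pattern, newdir, ocr_dir, dry=True):
    """Run rbc.batch on every file matching `pattern`; return the file list."""
    # Imported here so that this helper can be defined even when
    # rename_by_content.py is not yet on the Python path.
    import rename_by_content as rbc

    files = sorted(glob.glob(pattern))
    if files:
        # Assumption: `dry` and `ocr_dir` are accepted as keyword arguments.
        rbc.batch(files, newdir, dry=dry, ocr_dir=ocr_dir)
    return files
```

Starting with dry=True lets you inspect the extracted texts in ocr_dir before any file is copied; calling the wrapper again with dry=False then performs the copy.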
Other utilities:
- rbc.mkdir(path): create the path directory if it does not exist.
- rbc.ocr_dir(): return the temporary directory used for storing extracted texts.
- rbc.clear_ocr(): remove that temporary directory.
- rbc.copy_unique(src_dir, dst_dir): copy all files from src_dir into dst_dir, but never overwrite: if a file with the same name already exists in dst_dir, the file from src_dir will get a numbered suffix like '_01'.
  This is useful if you have run rbc.batch with several destination directories, and finally you want to group everything in the same location.
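The no-overwrite behaviour described for rbc.copy_unique can be illustrated as follows. This is a sketch of the idea, not the function's actual source: the names unique_destination and copy_without_overwrite are hypothetical, and placing the suffix before the extension, skipping subdirectories, and requiring dst_dir to exist are assumptions.

```python
import os
import shutil


def unique_destination(dst_dir, name):
    """Return a path in `dst_dir` for `name` that does not exist yet."""
    candidate = os.path.join(dst_dir, name)
    stem, ext = os.path.splitext(name)
    counter = 1
    while os.path.exists(candidate):
        # Assumed scheme: report.pdf -> report_01.pdf, report_02.pdf, ...
        candidate = os.path.join(dst_dir, "%s_%02d%s" % (stem, counter, ext))
        counter += 1
    return candidate


def copy_without_overwrite(src_dir, dst_dir):
    """Copy regular files from `src_dir` to an existing `dst_dir`, never overwriting."""
    copied = []
    for name in sorted(os.listdir(src_dir)):
        source = os.path.join(src_dir, name)
        if os.path.isfile(source):
            target = unique_destination(dst_dir, name)
            shutil.copy2(source, target)
            copied.append(target)
    return copied
```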
TODO
- Language detection (English, French, etc.) for better date recognition. Currently you have to edit the MONTHS variable yourself if your documents are not in French.