Tool for archiving documents and pictures
- SCArchive scans the given local folders for supported file types (PDF and HTML so far, with more to come) and extracts metadata from each file.
- PDF files are OCR'd and their content is extracted with the help of PDFBox (https://pdfbox.apache.org/), tesseract (https://github.com/tesseract-ocr/tesseract), and GraphicsMagick (http://www.graphicsmagick.org/).
- The application uses Vaadin to provide a web UI where the user can search and edit the gathered metadata.
- Because all files and the gathered metadata are stored as local files, they can be synchronized to other machines with tools such as rsync or Resilio Sync.
- Java 8
- Spring-Boot
- Vaadin
- PDFBox
- tesseract
- GraphicsMagick
- Install the prerequisites
- Java 8 or greater (https://www.java.com/de/download/)
- tesseract (https://github.com/tesseract-ocr/tesseract#installing-tesseract)
- GraphicsMagick (http://www.graphicsmagick.org/download.html)
- Plenty of RAM and CPU capacity (OCR is resource-intensive)
- Installation is currently only possible from source
- Clone this repository
git clone git@github.com:scyv/SCArchive.git
- Run
mvnw package
- Navigate to ./target:
cd target
- Copy application.properties from src/main/resources into the current (target) directory: `cp ../src/main/resources/application.properties .`
- Edit application.properties to suit your needs (see below)
- Run `java -jar server-0.0.1-SNAPSHOT.jar`
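
Before building, it can help to confirm that the command-line prerequisites listed above are reachable. The following is a minimal sketch, not part of SCArchive: the helper name `check_tool` is hypothetical, and it assumes the tools are installed under their usual binary names (`java`, `tesseract`, and `gm` for GraphicsMagick) and are on the `PATH`.

```shell
#!/bin/sh
# Report whether each required command-line tool is on the PATH.
# "gm" is the usual name of the GraphicsMagick binary (assumption).
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Count the tools that could not be found.
missing=0
for tool in java tesseract gm; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing prerequisite(s) missing"
```

If a tool is installed outside the `PATH`, the check reports it as missing even though SCArchive can still use it, since the tesseract and GraphicsMagick binaries are configured by absolute path.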
Property key | Possible Values | Description |
---|---|---|
scarchive.documentpaths | e.g. /home/user/myFiles;/home/user/myOtherFiles | Semicolon-separated list of folders the application scans |
scarchive.scheduler.pollingInterval | Integer, e.g. 10 | Time between two scans, in seconds |
scarchive.tesseract.bin | e.g. /usr/bin/tesseract | Absolute path to the tesseract binary |
scarchive.graphicsmagick.bin | e.g. /usr/bin/gm | Absolute path to the GraphicsMagick binary |
scarchive.openlocal | true or false | When true, files are opened locally; when false, they are downloaded |
scarchive.enablescan | true or false | When true, file scanning is enabled; when false, no scanning takes place. This is especially useful if you want to serve the web UI without having the host do the scanning |
scarchive.maxfindings | e.g. 100 | Maximum number of results shown when searching the metadata |
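
As an illustration of the keys in the table, a filled-in application.properties might look like the sketch below. The values are placeholders taken from the examples in the table; the choices for `scarchive.openlocal` and `scarchive.enablescan` are arbitrary assumptions, not recommended defaults. The command writes the file into the current directory and overwrites an existing one.

```shell
# Write an example application.properties into the current directory.
# All values are illustrative placeholders; adjust them to your system.
cat > application.properties <<'EOF'
scarchive.documentpaths=/home/user/myFiles;/home/user/myOtherFiles
scarchive.scheduler.pollingInterval=10
scarchive.tesseract.bin=/usr/bin/tesseract
scarchive.graphicsmagick.bin=/usr/bin/gm
scarchive.openlocal=false
scarchive.enablescan=true
scarchive.maxfindings=100
EOF
```

On a machine that should only serve the web UI for files synchronized from elsewhere, `scarchive.enablescan=false` would be the setting to change.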