My local newspaper has local job classifieds. However, they are online only as a flipbook. Since flipbooks are in JPEG format they are not searchable. And it is too tedious to look through each one to find something in my interests.
I created this Python Django app to scrape the classifieds JPEG and then using tesseract OCR extracted the text. Then I displayed them in a nice format and added search to the app. So now you can easily find jobs with your desired keywords.
This is the stable version. See the source repository
$ heroku buildpacks:add --index 1 heroku-community/apt
$ touch Aptfile
tesseract-ocr-eng is the English language file for tesseract.
tesseract-ocr
tesseract-ocr-eng
We will use this path for the next step
$ heroku run bash
$ find -iname tessdata # this will give us the path we need
You can exit heroku shell now exit
Set a heroku config variable named TESSDATA_PREFIX to the path returned from find -iname tessdata
cmnd above
$ heroku config:set TESSDATA_PREFIX=./.apt/usr/share/tesseract-ocr/4.00/tessdata
Now set heroku set a heroku config variable named TESSDATA_PREFIX to the path returned from find -iname tessdata
Set a heroku config variable named TESSDATA_PREFIX to the path returned from find -iname tessdata cmnd above
$ git push heroku master
I hope this helps. Let me know if it works for you.