Aron-HD / convert-articles

Automating docx to html conversion, with image extraction and renaming

Convert Articles

A tool for automating the conversion of edited docx content to html with correctly named images.

  • Goal: direct cost savings for my company and reduced production time for web articles (case studies).
  • We currently pay an external company in India to convert the files to html and extract all images. This takes around a week, and only then can the article assets be uploaded to our CMS and published as articles on our website.

Considering splitting the html parsing out into a utils folder to keep things tidy.

ToDo

  • sub h5 - h3 titles.
  • fix objectives h3 on WA; also run a check to see if any headers are missing
  • add Sources heading to endnotes
  • nest under an Article class
  • ensure imgs are in their own p tags
  • regex matching for h3 titles to be more exact / account for spaces at line endings etc (or could strip()).
  • add requirements.txt / pip.lock to make it standalone
  • add logging
  • add file verification for sys.argv[1]
  • add unit testing, e.g. test_rename_docx_images(Path('test/131412/media'), IMGS={})
  • remove /media folder in output path
  • add warning for .emf files and tables / charts
  • split html amending to separate package
  • use pyinstaller to make exe
  • allow directories as well as single docx files, so the script doesn't start anew every time and create a new log.
  • use colour on warnings and flags for file or dir through click cli

Docs

INSTRUCTIONS

If running from command line: ./convert_articles.py <file_you_want_to_convert> <award_scheme>

  • e.g. ./convert_articles.py "test/131485.docx" "warc"

MAIN FUNCTIONS

  • log_setup():

Makes the log directory and sets the logger file. Uses 'log' for the main app and 'lgr1' for the Article class.
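A minimal sketch of what such a setup could look like, using only the standard logging module. The directory name, file name and message format below are assumptions for illustration; only the logger names 'log' and 'lgr1' come from the description above.

```python
import logging
from pathlib import Path


def log_setup(log_dir="logs", log_name="convert_articles.log"):
    """Create the log directory and return the two loggers writing to one file.

    log_dir and log_name are hypothetical defaults, not taken from the repo.
    """
    log_path = Path(log_dir)
    log_path.mkdir(parents=True, exist_ok=True)

    # One shared file handler so both loggers end up in the same log file.
    handler = logging.FileHandler(log_path / log_name, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
    )

    loggers = []
    for name in ("log", "lgr1"):  # main app logger, Article class logger
        logger = logging.getLogger(name)
        logger.setLevel(logging.INFO)
        logger.addHandler(handler)
        loggers.append(logger)
    return tuple(loggers)
```

Usage would be along the lines of log, lgr1 = log_setup(), after which log.info(...) and lgr1.info(...) both append to the same file, tagged with their logger name.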

  • load_infile():

Validates the input file passed as sys.argv[1].
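As a rough illustration of this kind of validation, the sketch below checks that an argument was given, that it has a docx or html suffix (the two input types listed for IN_FILE) and that the file exists. Taking argv as a parameter and raising exceptions are assumptions made here to keep the example testable; the real function may behave differently.

```python
from pathlib import Path

# Input types listed for IN_FILE in this README.
ALLOWED_SUFFIXES = {".docx", ".html"}


def load_infile(argv):
    """Return the input file as a Path, or raise if it is not usable."""
    if len(argv) < 2:
        raise ValueError("no input file given")

    in_file = Path(argv[1])
    if in_file.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"expected a .docx or .html file, got: {in_file.name}")
    if not in_file.is_file():
        raise FileNotFoundError(f"input file not found: {in_file}")
    return in_file
```

From the script it would be called as load_infile(sys.argv).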

  • load_award():

Validates the award passed as sys.argv[2] and returns the matching award code from the SUBS json.
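The layout of the SUBS json is not documented here, so the sketch below assumes a hypothetical "awards" key that maps scheme names to award codes. Both the key name and the example code value are made up for illustration.

```python
def load_award(argv, subs):
    """Look up the award code for the scheme named in argv[2].

    Assumes subs looks like {"awards": {"warc": "<code>"}}; this layout
    is hypothetical and may not match the real subs.json.
    """
    if len(argv) < 3:
        raise ValueError("no award scheme given")

    awards = subs.get("awards", {})
    scheme = argv[2].strip().lower()  # tolerate stray spaces and capitals
    if scheme not in awards:
        known = ", ".join(sorted(awards)) or "none"
        raise ValueError(f"unknown award scheme {scheme!r} (known: {known})")
    return awards[scheme]
```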

  • load_json():

Loads data from the specified json file.
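A loader of this kind is usually only a few lines. The version below is a sketch with an assumed signature (a single path argument), reading the file as UTF-8.

```python
import json
from pathlib import Path


def load_json(json_file):
    """Read a json file and return the parsed data."""
    with Path(json_file).open(encoding="utf-8") as fh:
        return json.load(fh)
```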

ARTICLE CLASS

Arguments:

  • IN_FILE: specify docx or html file.
  • TAGS: specify tags and attributes for bleach module in json file.
  • SUBS: specify substitutions for h3 headings in html.

CLASS FUNCTIONS

  • convert_docx():

Uses the pypandoc module to convert docx file to html content for parsing.
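For reference, a conversion with pypandoc might look roughly like the sketch below. pypandoc.convert_file and pandoc's --extract-media option are real, but the function signature, the pandoc_args helper and the choice of media directory are assumptions, not the repo's actual code.

```python
def pandoc_args(media_dir):
    """Build the extra pandoc arguments (hypothetical helper)."""
    # --extract-media tells pandoc to write embedded images under media_dir.
    return [f"--extract-media={media_dir}"]


def convert_docx(in_file, media_dir):
    """Convert a docx file to an html string, extracting its images."""
    # Third-party dependency (needs pandoc installed); imported here so the
    # helper above can be used without it.
    import pypandoc

    return pypandoc.convert_file(
        str(in_file), "html", format="docx", extra_args=pandoc_args(media_dir)
    )
```

For docx input, pandoc places the extracted images in a media subfolder of the given directory, which is likely the /media folder mentioned in the ToDo list.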

  • rename_docx_images():

Renames the extracted images. Returns a mapping of old image path to new image filename as json, for substitution in the html.
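The naming scheme is not spelled out here, so the sketch below uses a made-up one: article id plus a two-digit counter. The article_id parameter and the sort by trailing number (so that image10 follows image2) are also assumptions.

```python
import re
from pathlib import Path


def rename_docx_images(media_dir, article_id):
    """Rename every file in media_dir and return {old path: new filename}.

    The "<article_id>_<NN><ext>" pattern is hypothetical.
    """

    def sort_key(path):
        # Order by the number at the end of the stem, then by name.
        match = re.search(r"(\d+)$", path.stem)
        return (int(match.group(1)) if match else 0, path.name)

    images = sorted(
        (p for p in Path(media_dir).iterdir() if p.is_file()), key=sort_key
    )

    mapping = {}
    for index, old_path in enumerate(images, start=1):
        new_name = f"{article_id}_{index:02d}{old_path.suffix.lower()}"
        old_path.rename(old_path.with_name(new_name))
        mapping[old_path.as_posix()] = new_name
    return mapping
```

The returned dict can be passed straight to json.dumps, or used directly when replacing the img src values in the html.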

  • clean_html():

Uses the bleach module to clean unwanted html tags and limit attributes of allowed tags. Tags and attributes are stored in json folder under '/JSON/tags.json'.
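A sketch of how the bleach call could be wired up. bleach.clean with its tags, attributes and strip arguments is the real API; the layout assumed for tags.json (a "tags" list and an "attributes" mapping) and the bleach_options helper are hypothetical.

```python
def bleach_options(tags_config):
    """Turn the loaded tags json into keyword arguments for bleach.clean.

    Assumes tags_config looks like:
    {"tags": ["p", "h3", "img"], "attributes": {"img": ["src", "alt"]}}
    """
    return {
        "tags": set(tags_config["tags"]),
        "attributes": tags_config["attributes"],
        "strip": True,  # drop disallowed tags instead of escaping them
    }


def clean_html(html, tags_config):
    """Remove unwanted tags and attributes from the html string."""
    # Third-party dependency; imported here so bleach_options works without it.
    import bleach

    return bleach.clean(html, **bleach_options(tags_config))
```

With strip=True, a disallowed tag such as span is removed while its text content is kept.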

  • amend_html():

Parses cleaned html content from docx, running replacements to correct headings. Heading substitutes are stored in json folder under '/JSON/subs.json'. Also contains the award code variable for inserting in <img src=""/>.
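One way the heading and image replacements could work is sketched below with the standard re module. The arguments, the shape of the substitutions (old heading text mapped to new heading text) and the "/<award_code>/<filename>" src layout are all assumptions. The pattern tolerates spaces around the heading text, as suggested in the ToDo list, and only matches headings without attributes.

```python
import re


def amend_html(html, heading_subs, image_names, award_code):
    """Rewrite known headings as h3 and point img src at the renamed files."""
    for old_title, new_title in heading_subs.items():
        # Match any heading level whose text is old_title, ignoring case
        # and surrounding whitespace.
        pattern = re.compile(
            r"<h(\d)>\s*" + re.escape(old_title) + r"\s*</h\1>",
            flags=re.IGNORECASE,
        )
        html = pattern.sub(lambda m, t=new_title: f"<h3>{t}</h3>", html)

    for old_path, new_name in image_names.items():
        # Hypothetical src layout: /<award_code>/<new filename>
        html = html.replace(f'src="{old_path}"', f'src="/{award_code}/{new_name}"')
    return html
```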

  • write_html():

Outputs cleaned and amended html content to specified file name. Pass in file name and html contents.
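A short sketch of the output step, assuming the file name may include a directory that does not exist yet. Creating the parent directory and returning the path are additions made for this example.

```python
from pathlib import Path


def write_html(out_file, html):
    """Write the html string to out_file and return the path written."""
    out_path = Path(out_file)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(html, encoding="utf-8")
    return out_path
```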
