dwp / ms-html-to-pdfa

micro-service that takes an XHTML document and produces a pdf document at various conformance levels. The main driver for this is to create a 'generic' pdf-generator but also to create PDFA/1A compliant documents for DRS. It has recently been extended to support the PDF/UA standard for accessibility and uses an(other) in-house service which has abstracted the pdf 'build' activities (https://github.com/dwp/html-to-pdf)

ms-html-to-pdfa

RESTful service receiving json to construct a PDF document to various conformance levels

build & run

Standard maven build.

  • to package the jar file mvn clean package

  • to run the application execute java -jar /path/to/jar/app.jar server /path/to/config.yml

    • eg. java -jar target/ms-html-to-pdfa-1.0-SNAPSHOT.jar server src/main/properties/dev.yml
  • from the IDE run the uk.gov.dwp.pdfa.application.HtmlToPdfApplication with program arguments server path/to/properties.yml (eg. src/main/properties/dev.yml)

NOTE: this application accepts environment variables that will be picked up at runtime via the configuration file shown below (this file is bundled into the container). If https configuration is needed, a modified config.yml must be mounted into the container with the appropriate keystore/truststore locations (see dropwizard documentation).

server:
  applicationContextPath: ${SERVER_CONTEXT_PATH:-/}
  applicationConnectors:
  - type: ${SERVER_APP_CONNECTOR:-http}
    port: ${SERVER_APP_PORT:-6677}
  adminConnectors:
  - type: ${SERVER_ADMIN_CONNECTOR:-http}
    port: ${SERVER_ADMIN_PORT:-0}
  requestLog:
    type: ${SERVER_REQUEST_LOG_TYPE:-external}
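
Each ${NAME:-default} placeholder above takes its value from the named environment variable and falls back to the text after :- when the variable is not set. The rule can be illustrated with a minimal sketch in plain Java. This is not the service's or dropwizard's actual implementation; the class EnvDefaults and the method resolve are hypothetical names used only for illustration.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the ${NAME:-default} rule; not part of the service.
public class EnvDefaults {
    // Group 1 is the variable name, group 2 is the fallback value.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*):-([^}]*)\\}");

    public static String resolve(String line, Map<String, String> env) {
        Matcher matcher = PLACEHOLDER.matcher(line);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            // Use the environment value when present, otherwise the fallback.
            String value = env.getOrDefault(matcher.group(1), matcher.group(2));
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the port line as it would resolve in the current environment.
        System.out.println(resolve("port: ${SERVER_APP_PORT:-6677}", System.getenv()));
    }
}
```

With no SERVER_APP_PORT set, the port line resolves to the default 6677 shown in the configuration; exporting the variable before starting the container overrides it.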

performance testing

A k6 script is included to satisfy a basic load test. By default, this will target the application running on localhost, via the docker hostname host.docker.internal. This can be altered by passing an optional TARGET_HOST environment variable.

Ensure you have the service running, and execute the test as follows:

# Default target: host.docker.internal
docker run --rm -i --name loadtest \
  -v $PWD:/k6 \
  loadimpact/k6 run - < ./load-test/test.js

# Custom target (must be accessible from within the k6 container)
docker run --rm -i --name loadtest \
  -e TARGET_HOST=some-target:8080 \
  -v $PWD:/k6 \
  loadimpact/k6 run - < ./load-test/test.js

# Change no. virtual users and duration
docker run --rm -i --name loadtest \
  -v $PWD:/k6 \
  loadimpact/k6 run --vus 20 --duration 5m - < ./load-test/test.js

Default configuration and criteria for satisfying performance thresholds are bundled in the test scripts themselves.

For configuring the tests in the CI pipeline, refer to the official GitLab documentation or underlying template source.

/generatePdf

POST endpoint receiving the information to build the pdf file

{
    "colour_profile": "base64-encoded-file",
    "font_map": {
        "tahoma": "base64-encoded-file",
        "arial": "base64-encoded-file"
    },
    "page_html": "base64-encoded-html",
    "conformance_level": "PDFA_1_A"
}
  • colour_profile (optional) : The base64 encoded colour profile file contents to be embedded to the pdf. If this value is omitted or null the default colour profile will be applied (src/main/resources/colours/sRBG.icm)
  • font_map (optional): a map of font names to the fonts to be embedded into the pdf. If the font_map is missing or null then 2 default fonts will be embedded into the document.
    • arial to cover basic fonts and courier to cover monospace requirements.
    • The format for each key/value item is:-
      • the name of the font (eg. arial), this must be specified in the html style header using the same format
      • the base64 encoded version of the .ttf file contents to be embedded with the file
  • page_html (mandatory): The base64 encoded html document
  • conformance_level (optional): The conformance level for the resulting pdf.
    If this parameter is missing (or null) it will default to PDFA_UA; the tightest of all the conformance levels.

Pdf conformance levels are detailed here with acceptable values for this service as:-

The only mandatory parameter is the base64 encoded html. If only the html is passed a standard colour profile will be used, arial (standard) and courier (monospace) will be embedded to the pdf and the conformance level for the pdf will be PDF/UA
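
As a rough illustration of the request format described above, the following sketch builds a JSON body containing the mandatory page_html and, optionally, a conformance_level. The class GeneratePdfRequest and the method body are hypothetical names, and encoding the html as UTF-8 before base64 is an assumption rather than something this document states.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical client-side helper; not part of the service itself.
public class GeneratePdfRequest {
    // Builds the JSON body. Pass null for conformanceLevel to rely on the service default.
    public static String body(String xhtml, String conformanceLevel) {
        // Assumption: the html is UTF-8 text.
        String encoded = Base64.getEncoder()
                .encodeToString(xhtml.getBytes(StandardCharsets.UTF_8));
        // Standard base64 output only contains A-Z, a-z, 0-9, +, / and =,
        // so it can be placed in a JSON string without escaping.
        StringBuilder json = new StringBuilder("{\"page_html\":\"")
                .append(encoded).append('"');
        if (conformanceLevel != null) {
            // The level is inserted as-is, so it must not contain quotes or backslashes.
            json.append(",\"conformance_level\":\"").append(conformanceLevel).append('"');
        }
        return json.append('}').toString();
    }

    public static void main(String[] args) {
        System.out.println(body("<html/>", "PDFA_1_A"));
    }
}
```

The optional colour_profile and font_map entries would be added in the same way, each value being the base64 encoding of the corresponding file contents.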

Returns:-

  • 200 :: Success. Returns base64 encoded pdf in the response body
  • 400 :: Bad or Malformed json document or json elements. Returns a brief error message as the response body (full error is logged)
  • 500 :: Internal error occurred, bad html or conformance levels, font/colour profile embedding. Returns a brief error message as the response body (full error is logged)
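
A caller might handle the three outcomes listed above along these lines. This is a sketch only: the class GeneratePdfResponse and the method toPdfBytes are hypothetical names, and trimming the body before decoding is a defensive assumption in case the response ends with a newline.

```java
import java.util.Base64;

// Hypothetical client-side helper; not part of the service itself.
public class GeneratePdfResponse {
    // Returns the decoded pdf bytes for a 200, otherwise throws with the brief error message.
    public static byte[] toPdfBytes(int status, String body) {
        if (status == 200) {
            // The response body is the base64 encoded pdf.
            return Base64.getDecoder().decode(body.trim());
        }
        String kind;
        if (status == 400) {
            kind = "bad or malformed json";
        } else if (status == 500) {
            kind = "internal error";
        } else {
            kind = "unexpected status";
        }
        // The service logs the full error; the body only carries a brief message.
        throw new IllegalStateException(kind + " (" + status + "): " + body);
    }
}
```

The returned bytes can be written straight to a .pdf file, for example with java.nio.file.Files.write.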

Usage notes

For the incoming html there are 3 things to consider.

  • The pdf generator requires XHTML which requires careful closing of tags (https://www.w3schools.com/html/html_xhtml.asp)
  • In order to satisfy the font requirements of a PDFA_1_A document, all elements need to reference the font that will be embedded. This is best achieved by adding a <STYLE> element to the <HEAD> of the html and applying it to all items (eg body). The important point is to make sure that all fonts are explicitly specified in the html document.
  • If using images it is best to encode the images directly into the html. eg <img src="data:image/png;base64,<the-base64-encoded-string-of-the-image>"/>

eg.

<html>
    <head>
        <style>
            pre, code, var {
                font-family: 'courier', serif;
            }
            body {
                font-family: 'arial', serif;
            }
        </style>
    </head>
    <body>
        <h1>hello world</h1>
        <img
            width="250px" height="250px"
            src=""
            alt="base64 encoded embedded image"
        />
    </body>
</html>
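
The advice above about encoding images directly into the html can be sketched as follows. The class DataUri and the method imgTag are hypothetical names; the caller is assumed to supply the correct MIME type and an alt text that needs no XML escaping.

```java
import java.util.Base64;

// Hypothetical helper for embedding an image into the XHTML as a data URI.
public class DataUri {
    public static String imgTag(byte[] imageBytes, String mimeType, String altText) {
        String encoded = Base64.getEncoder().encodeToString(imageBytes);
        // The self-closing form keeps the element valid XHTML.
        return "<img src=\"data:" + mimeType + ";base64," + encoded
                + "\" alt=\"" + altText + "\"/>";
    }

    public static void main(String[] args) {
        // Three placeholder bytes stand in for real image data.
        System.out.println(imgTag(new byte[] {1, 2, 3}, "image/png", "example"));
    }
}
```

In practice the bytes would come from the image file, for example via java.nio.file.Files.readAllBytes.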

Common faults

  • fonts not embedded correctly :: will result in an error reporting Index: 0, Size: 0 or Index 0 out-of-bounds for length 0 which, whilst not very clear, is because the required font is not present in the embedded font list. All html tags should have an attached font (both normal and monospaced)
  • links not fully qualified :: any references to css or images that have relative paths will fail. A full, resolvable URL is required.
  • closing tags :: XHTML requires all tags to be terminated, this is easily missed.
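
Unclosed tags can be caught before calling the service by parsing the document as XML. The sketch below uses the JDK's built-in parser; the class XhtmlCheck and the method isWellFormed are hypothetical names. It only checks well-formedness, so it says nothing about fonts or unresolved links, and html named entities such as &nbsp; will be rejected unless the document declares them.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical pre-flight check; not part of the service itself.
public class XhtmlCheck {
    public static boolean isWellFormed(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Enable the standard JAXP secure-processing limits.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xhtml)));
            return true;
        } catch (Exception e) {
            // Any parse failure (eg an unterminated tag) means the input is not well-formed.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<html><body><br/></body></html>"));
        System.out.println(isWellFormed("<html><body><br></body></html>"));
    }
}
```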

/version-info

Endpoint to return a standard JSON document with build information.

  • name: the project.artifactId
  • version: the project.version
  • build: the jenkins build-number
  • build_time: the maven.build.timestamp

example output is:-

{
  "app": {
    "name": "ms-html-to-pdfa",
    "version": "1.6.0",
    "build": "133",
    "build_time": "2019-09-09T09:58:17Z"
  }
}

Examples

The following will base64 encode the html file contents, call the service, decode the response and write it to a file on *nix based operating systems (base64 -D is the macOS decode flag; GNU coreutils uses -d)

curl -m 10 -X POST --data '{"page_html":"'$(cat src/test/resources/successfulHtml.html | base64)'"}' http://localhost:6677/generatePdf | base64 -D > test.pdf

This example will return the current build information

curl http://localhost:6677/version-info
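
For callers written in Java, the /generatePdf curl example above could be approximated with the JDK 11+ HTTP client. This is a sketch under assumptions: the class GeneratePdfCall and the method buildRequest are hypothetical names, and the Content-Type header is an assumption (the curl example does not set one explicitly, so curl's default is sent there).

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical client-side helper; not part of the service itself.
public class GeneratePdfCall {
    // baseUrl is for example http://localhost:6677, jsonBody is the request document.
    public static HttpRequest buildRequest(String baseUrl, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/generatePdf"))
                // Roughly comparable to the 10 second limit set by curl -m 10.
                .timeout(Duration.ofSeconds(10))
                // Assumption: the service accepts an explicit json content type.
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }
}
```

The request can then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), after which the status code and the base64 response body are available from the returned HttpResponse.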

Continuous Integration (CI) Pipeline

For general information about the CI pipeline on this repository please see documentation at: https://confluence.service.dwpcloud.uk/x/_65dCg

Pipeline Invocation

This CI Pipeline now replaces the Jenkins Build CI Process for the ms-html-to-pdfa.

Gitlab CI will automatically invoke a pipeline run when pushing to a feature branch (this can be prevented using [skip ci] in your commit message if not required).

When a feature branch is merged into develop it will automatically start a develop pipeline and build the required artifacts.

For production releases please see the release process documented at: https://confluence.service.dwpcloud.uk/pages/viewpage.action?spaceKey=DHWA&title=SRE

A production release requires a manual pipeline (to be invoked by an SRE); this is only a release function. Production credentials are required.

localdev Usage

There is no change to the usage of localdev. The Gitlab CI build process creates artifacts using the same naming convention as the old (no longer utilised) Jenkins CI build process.

Therefore please continue to use branch-develop or branch-f-* (depending on branch name) for proving any feature changes.

Access

While this repository is open internally for read, no one has write access to this repository by default. To obtain access to this repository please contact #ask-health-platform within slack and a member will grant the appropriate level of access.

License:MIT License

