danfickle / openhtmltopdf

An HTML to PDF library for the JVM. Based on Flying Saucer and Apache PDF-BOX 2. With SVG image support. Now also with accessible PDF support (WCAG, Section 508, PDF/UA)!

Home Page: https://danfickle.github.io/pdf-templates/index.html


Substitute Non-Breaking-Space with Normal-Space for PDF font character lookup

rototor opened this issue

I've investigated the "#" problem I described in #19 a bit further. The problem is that &nbsp; (a non-breaking space) is rendered as '#'. The # comes from the default xhtmlrenderer.conf:

# When rendering text, not all fonts support all character glyphs. When set to true, this
# will replace any missing characters with the specified character to aid in the debugging
# of your PDF.  Currently only supported for PDF rendering.
xr.renderer.replace-missing-characters=false
xr.renderer.missing-character-replacement=#

The replacement character is used even if xr.renderer.replace-missing-characters=false. It seems no font has a glyph for the &nbsp; character. That makes some sense, as it is visually identical to a normal space.

Just replacing &nbsp; (character 160) with ' ' would fix the problem, but it does not feel like a correct fix to me:

--- a/openhtmltopdf-pdfbox/src/main/java/com/openhtmltopdf/pdfboxout/PdfBoxOutputDevice.java
+++ b/openhtmltopdf-pdfbox/src/main/java/com/openhtmltopdf/pdfboxout/PdfBoxOutputDevice.java
@@ -381,6 +332,8 @@ public class PdfBoxOutputDevice extends AbstractOutputDevice implements OutputDe
         for (int i = 0; i < str.length(); ) {
             int unicode = str.codePointAt(i);
             i += Character.charCount(unicode);
+            if( unicode == 160 )
+                unicode = ' ';
             String ch = String.valueOf(Character.toChars(unicode));
             boolean gotChar = false;

Especially because there are more spaces than just space and non-breaking space. For examples, see https://www.cs.tut.fi/~jkorpela/chars/spaces.html

I think (hope) using Character::isSpaceChar is the correct fix. We also need to make it easier to change the replacement character. Thanks @rototor for the patch, Daniel.
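As a rough illustration of the Character::isSpaceChar idea, here is a minimal standalone sketch that uses the same code-point loop as the patch above. The class and method names (SpaceNormalizer, normalizeSpaces) are hypothetical and not part of openhtmltopdf; the actual change in PdfBoxOutputDevice may look different.

```java
// Hypothetical helper, for illustration only; not part of openhtmltopdf.
final class SpaceNormalizer {

    /**
     * Returns a copy of str in which every code point that
     * Character.isSpaceChar reports as a space (Unicode categories
     * Zs, Zl and Zp) is replaced with a plain U+0020 space.
     */
    static String normalizeSpaces(String str) {
        StringBuilder sb = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); ) {
            int unicode = str.codePointAt(i);
            i += Character.charCount(unicode);
            if (Character.isSpaceChar(unicode)) {
                unicode = ' ';
            }
            sb.appendCodePoint(unicode);
        }
        return sb.toString();
    }
}
```

Note that Character.isSpaceChar also matches the line separator (U+2028) and paragraph separator (U+2029), so those are turned into plain spaces by this sketch as well.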

@danfickle I think using Character::isSpaceChar is really enough for now. If someone wants different "space-widths", they should just use a <span> with the needed styles (e.g. inline-block and width: 0.5em).

@danfickle is there a timeframe for having this fixed (in a non-snapshot version)? We have an application using your library that is supposed to go into production, but the customer ran into this problem in user acceptance testing and is not likely to approve this moving to production the way it is. Thanks!

Can you give me the weekend to clean up some svg code before deploying a release or do you need it immediately? It's nice to hear that people are using this.

Yeah that's no problem. Thanks for the quick response!

Just FYI, I came across another character that causes a "#" to show up. ​ which is classified as a zero-width space: https://en.wikipedia.org/wiki/Zero-width_space

I've put in some character replacement in our code to deal with this for the time being, but thought you'd like to know. Thanks again for the fast turnaround.

Sorry that was supposed to be &#8203;

@scoldwell - If you are pre-filtering as a temporary fix, you may wish to use this function:

    /**
     * Checks if a code point is printable. If false, it can be safely discarded at the
     * rendering stage, else it should be replaced with the replacement character,
     * if a suitable glyph cannot be found.
     * @param codePoint the Unicode code point to test
     * @return whether codePoint is printable
     */
    public static boolean isCodePointPrintable(int codePoint) {
        if (Character.isISOControl(codePoint))
            return false;

        int category = Character.getType(codePoint);

        return !(category == Character.CONTROL ||
                 category == Character.FORMAT ||
                 category == Character.UNASSIGNED ||
                 category == Character.PRIVATE_USE ||
                 category == Character.SURROGATE);
    }

As an implementation note, behavior will differ between Java 6 and later versions, as the Unicode version changed and Character::isWhitespace no longer returns true for zero-width spaces.
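For anyone pre-filtering input in the meantime, a minimal sketch of applying the function above to a whole string could look like this. The class and method names (PrintableFilter, stripNonPrintable) are hypothetical, and the printable check is copied in only so the example is self-contained.

```java
// Hypothetical pre-filter, for illustration only; not part of openhtmltopdf.
final class PrintableFilter {

    // Copy of the isCodePointPrintable function posted above.
    static boolean isCodePointPrintable(int codePoint) {
        if (Character.isISOControl(codePoint))
            return false;

        int category = Character.getType(codePoint);

        return !(category == Character.CONTROL ||
                 category == Character.FORMAT ||
                 category == Character.UNASSIGNED ||
                 category == Character.PRIVATE_USE ||
                 category == Character.SURROGATE);
    }

    /** Returns a copy of input with all non-printable code points dropped. */
    static String stripNonPrintable(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            i += Character.charCount(codePoint);
            if (isCodePointPrintable(codePoint)) {
                sb.appendCodePoint(codePoint);
            }
        }
        return sb.toString();
    }
}
```

On Java 7 and later this drops the zero-width space (U+200B, category FORMAT) and leaves the non-breaking space in place. Be aware that tabs and newlines are ISO control characters, so this sketch removes them too; it is meant for short text runs rather than whole documents.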

I'll close this issue now, as I think it is finally solved. Feel free to re-open if you find any other issues.