htmlEntityToUtf8 adds around 600 kb to binary size with -d:release on Windows

Question

metagn opened this issue 5 years ago

https://github.com/soasme/nim-markdown/blob/master/src/markdownpkg/entities.nim

Tested by manually removing its use in my local Nimble instance:

# markdown.nim
proc escapeHTMLEntity*(doc: string): string =
  var entities = doc.findAll(re"&([^;]+);")
  result = doc
  for entity in entities:
    if not IGNORED_HTML_ENTITY.contains(entity):
      let utf8Char = entity.htmlEntityToUtf8

Size of small website builder compiled with -d:release:

    if not IGNORED_HTML_ENTITY.contains(entity):
      let utf8Char = entity#.htmlEntityToUtf8

Same compilation settings:

Converting this to a constant table should save ~~a large amount of~~ space. A build option to turn it off, such as -d:markdownNoEntities, might work as a temporary measure though.
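A rough sketch of how such a build option could look, using Nim's `when defined(...)`. Both `lookupEntity` and `entityToText` are hypothetical names made up for illustration, and the three-entry `case` only stands in for the large generated table behind htmlEntityToUtf8; this is not how nim-markdown is actually structured.

```nim
# Sketch only: `lookupEntity` is a hypothetical stand-in for the large
# generated lookup behind htmlEntityToUtf8.
proc lookupEntity(entity: string): string =
  case entity
  of "&amp;": result = "&"
  of "&lt;": result = "<"
  of "&gt;": result = ">"
  else: result = ""

proc entityToText*(entity: string): string =
  when defined(markdownNoEntities):
    # Built with -d:markdownNoEntities: return the entity text unchanged.
    # lookupEntity is never referenced on this branch, so the compiler
    # should be able to leave it out of the binary.
    result = entity
  else:
    # Default build: decode known entities, pass unknown ones through.
    let decoded = lookupEntity(entity)
    result = if decoded.len > 0: decoded else: entity
```

Because the branch is chosen at compile time, a default build would behave as before and only builds that pass the flag would skip the decoding.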

Update: I tried changing it to a hash table, but it apparently does not save much space:

This makes sense because of the way case/of is optimized (case/of itself is probably faster than a hash table), but I expected it to have a bigger impact. My mistake.

What does save a little more space than that, though, is using an array of tuples and comparing against each entry on every lookup instead of hashing, which sacrifices speed:

This is just a bad idea for performance. I would really rather just not have all this in my binary.
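For reference, the array-of-tuples variant described above can be sketched roughly like this. `entityPairs` and `linearLookup` are hypothetical names, and the four entries are a tiny illustrative subset rather than the real table.

```nim
# Illustrative subset only; the full entity list is far larger.
const entityPairs = [
  ("&amp;", "&"),
  ("&lt;", "<"),
  ("&gt;", ">"),
  ("&copy;", "©")
]

proc linearLookup(entity: string): string =
  ## O(n) scan: compares the input against each entry until one matches.
  ## Returns "" when the entity is not in the list.
  for pair in entityPairs:
    if pair[0] == entity:
      return pair[1]
  result = ""
```

Every miss walks the whole array, which is the speed cost mentioned above.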

Forgot to mention this is on Nim 1.0.4.

Ju · Answer 1 · Fri Jan 24 2020 11:34:57 GMT+0800 (China Standard Time)

This was a workaround because the Nim standard library function htmlparser.entityToUtf8 can't translate all of the HTML entities required by the CommonMark spec, specifically those listed in https://html.spec.whatwg.org/multipage/entities.json.

The current Nim implementation also converts entities through a hash (source code), btw.

I'll open an upstream issue with Nim to report this, in the hope that more characters can be added to the standard library. If the proposal is approved, this module will no longer be needed in the library.

Adding a markdownNoEntities option would make nim-markdown incompatible with the CommonMark spec. I think correctness is very important as well.

Another way is to diff the above entities.json against the current entity set in the Nim implementation and add only the missing ones to nim-markdown. This is probably a solution that can both ensure correctness and reduce binary size without harming performance.
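A minimal sketch of that last idea: ask the standard library first and consult a small supplementary lookup only when it returns nothing. This assumes htmlparser.entityToUtf8 takes the bare entity name (without `&` and `;`) and returns an empty string for unknown names, as its documentation describes. `extraEntityToUtf8` and `decodeEntity` are hypothetical names, and `NotEqualTilde` is only an assumed example of a missing entity; the real list would come from the diff.

```nim
import htmlparser

proc extraEntityToUtf8(name: string): string =
  ## Hypothetical fallback holding only the names the stdlib lacks.
  ## The single entry here is an assumption, not a verified gap.
  case name
  of "NotEqualTilde": result = "\u2242\u0338"
  else: result = ""

proc decodeEntity*(name: string): string =
  ## `name` is the entity without the leading '&' and trailing ';'.
  ## Returns "" if neither the stdlib nor the fallback knows it.
  result = entityToUtf8(name)
  if result.len == 0:
    result = extraEntityToUtf8(name)
```

With this shape the supplementary `case` only has to grow by the size of the diff, not by the size of the whole entity list.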