citation-js / bibtex-parser-experiments

Experiments to determine a new BibTeX parser formula for Citation.js -- to be applied to other formats as well

Home Page: https://travis-ci.com/citation-js/bibtex-parser-experiments/builds

Argument commands

larsgw opened this issue

The citationjs parser needs to support more kinds of commands, mostly commands that take arguments. Arguments always seem to be handled the same way: a command takes either a braced block or the first character of the following text. Math blocks are the exception: \url takes the dollar sign verbatim, while \emph does not.

That's more a question of whether a command parses its argument in verbatim mode. \url expects one argument and parses it in verbatim mode; \href expects two arguments and parses the first verbatim and the second normally. \begin{verbatim} ... \end{verbatim} parses everything in that environment verbatim. \verb parses everything until the end of the block it's in verbatim.

There's simply no math in verbatim environments, because the $ is just a character there.
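A minimal sketch of that distinction, under stated assumptions: the `argumentModes` table and the `renderArgument` and `renderCommand` helpers are hypothetical names, not Citation.js code, and the `<math>` wrapper only marks where math handling would happen. Arguments are passed in with their outer braces already removed.

```javascript
// Hypothetical table: one parsing mode per expected argument.
const argumentModes = {
  url: ['verbatim'],
  href: ['verbatim', 'normal'],
  emph: ['normal']
}

// Toy renderer for the contents of a single argument.
// Normal mode treats `$...$` as a math block (wrapped in <math> tags here
// purely for illustration); verbatim mode leaves `$` as an ordinary character.
function renderArgument (text, mode) {
  if (mode === 'verbatim') {
    return text
  }
  return text.replace(/\$([^$]*)\$/g, (_, math) => `<math>${math}</math>`)
}

// Renders every argument of a command in the mode the table assigns to it.
function renderCommand (name, args) {
  const modes = argumentModes[name]
  if (!modes || modes.length !== args.length) {
    throw new Error(`unexpected arguments for \\${name}`)
  }
  return args.map((arg, i) => renderArgument(arg, modes[i]))
}
```

With this table, `renderCommand('href', ['a$b$', 'c $d$'])` leaves the first argument untouched and only applies math handling to the second.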

That's a bit annoying; I was planning to do something like the following:

// constants.js
export const argumentCommands = {
  href (url, text) { return text === url ? text : `${text} (${url})` }
}

// value.js (grammar)
import * as constants from './constants.js' // path assumed from the file comment above
const grammar = new Grammar({
  // ...

  Command () {
    const command = this.consumeToken('command').value

    if (command in constants.argumentCommands) {
      const func = constants.argumentCommands[command]
      const args = []
      // Function.length is the number of declared parameters
      // (default and rest parameters are not counted)
      let arity = func.length

      while (arity-- > 0) {
        this.consumeToken('whitespace', /* optional: */ true)
        args.push(this.consumeRule('Argument'))
      }

      return func(...args)
    } // else...
  },

  // ...
})

If you keep the original input text attached to each token while tokenizing, you can decide during the parsing phase how to handle the input. Basically, in normal mode you process the tokens according to their semantic meaning, and in verbatim mode you take the original text attached to the tokens and string it together.
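To illustrate, here is a rough sketch with a toy tokenizer; it is not the tokenizer of either project. The token types, the `raw` property, and the functions `tokenize`, `readVerbatim` and `readNormal` are all assumed names, and the normal mode is deliberately simplistic (it only knows `\%`).

```javascript
// Toy tokenizer: every token keeps the exact source text it came from (`raw`).
// The pattern covers every character, so joining all `raw` values
// reproduces the input exactly.
function tokenize (input) {
  const pattern = /\\[a-zA-Z]+|\\[\s\S]?|\$|[{}]|\s+|[^\\${}\s]+/g
  return Array.from(input.matchAll(pattern), ([raw]) => {
    let type = 'text'
    if (raw.startsWith('\\')) type = 'command'
    else if (raw === '$') type = 'mathShift'
    else if (raw === '{' || raw === '}') type = 'brace'
    else if (/^\s+$/.test(raw)) type = 'whitespace'
    return { type, raw }
  })
}

// Verbatim mode: ignore what the tokens mean and glue the source text together.
function readVerbatim (tokens) {
  return tokens.map(token => token.raw).join('')
}

// Toy normal mode: interpret tokens by type instead of copying them.
function readNormal (tokens) {
  return tokens.map(token => {
    switch (token.type) {
      case 'whitespace': return ' ' // collapse runs of whitespace
      case 'brace': return '' // grouping only, no output
      case 'mathShift': return '' // math handling omitted in this sketch
      case 'command': return token.raw === '\\%' ? '%' : '' // only \% is known here
      default: return token.raw
    }
  }).join('')
}
```

Because the decision is made per argument after tokenizing, the same token stream can be read either way: `readVerbatim` for the argument of \url, `readNormal` for the argument of \emph.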

Don't forget that commands can have arguments in square brackets. I simply ignore them, but for that I do have to parse them.
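A small sketch of what parsing and then ignoring such an argument could look like. The function name `skipOptionalArgument` and the string-plus-position interface are assumptions made for the example; a real grammar would work on tokens instead.

```javascript
// Skips one optional `[...]` argument starting at `pos`, if there is one.
// Returns the position just past the closing `]`, or `pos` unchanged
// when no optional argument starts there.
function skipOptionalArgument (input, pos) {
  if (input[pos] !== '[') return pos

  let depth = 0 // brace depth, so a `]` inside {...} does not end the argument
  for (let i = pos + 1; i < input.length; i++) {
    const char = input[i]
    if (char === '\\') i++ // skip the character after a backslash
    else if (char === '{') depth++
    else if (char === '}') depth--
    else if (char === ']' && depth === 0) return i + 1
  }

  throw new SyntaxError('unterminated optional argument')
}
```

For example, `skipOptionalArgument('[x]{y}', 0)` returns `3`, the position of the mandatory argument, so the contents of the brackets still have to be scanned even though they are thrown away.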

I think I might just let the command functions be called as if they're rules in the grammar, i.e. they can decide themselves how to parse their arguments. Perhaps a bit similar to what you're doing, based on what I saw. It feels a bit weird to make it that customisable but I don't think it can lead to code injection or the like.
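One way this could look, as a sketch only: `ToyParser`, `commandRules`, `consumeArgument` and `consumeVerbatimArgument` are hypothetical names, not the actual Grammar API, and escaped braces are not handled. Each command function is called with the parser as `this` and decides for itself how each of its arguments is read.

```javascript
// Hypothetical command rules: each one pulls its own arguments from the parser.
const commandRules = {
  url () {
    return this.consumeVerbatimArgument()
  },
  href () {
    const url = this.consumeVerbatimArgument() // first argument: verbatim
    const text = this.consumeArgument() // second argument: normal
    return text === url ? text : `${text} (${url})`
  }
}

// Toy parser over a plain string.
class ToyParser {
  constructor (input) {
    this.input = input
    this.pos = 0
  }

  // Returns the raw contents of one braced group: `{a{b}c}` gives `a{b}c`.
  consumeVerbatimArgument () {
    if (this.input[this.pos] !== '{') {
      throw new SyntaxError(`expected "{" at position ${this.pos}`)
    }
    const start = ++this.pos
    let depth = 1
    while (this.pos < this.input.length) {
      const char = this.input[this.pos++]
      if (char === '{') depth++
      else if (char === '}' && --depth === 0) {
        return this.input.slice(start, this.pos - 1)
      }
    }
    throw new SyntaxError('unterminated argument')
  }

  // Toy normal mode: same group, but nested braces are dropped.
  consumeArgument () {
    return this.consumeVerbatimArgument().replace(/[{}]/g, '')
  }

  // Reads `\name` and hands control to the matching command rule.
  consumeCommand () {
    const match = /^\\([a-zA-Z]+)/.exec(this.input.slice(this.pos))
    if (!match) throw new SyntaxError(`expected a command at position ${this.pos}`)
    this.pos += match[0].length
    const rule = commandRules[match[1]]
    if (!rule) throw new Error(`unknown command \\${match[1]}`)
    return rule.call(this)
  }
}
```

Since the command rules only call parser methods on input that is being parsed anyway, the customisation stays inside the grammar; nothing from the BibTeX input is evaluated as code.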

By the way, I am working on a prototype plugin for @citation-js/plugin-bibtex that extends unicode support with your unicode2latex tables. I don't really want to put an additional 400KB in the default browser bundle so I think an optional plugin to the plugin could work well. I am still working out how to add things like {\\'{}I} but that might be helped by the changes mentioned above.

From my pov you're making astounding progress.