citation-style-language / schema

Citation Style Language schema

Home Page: https://citationstyles.org/

Make punctuation collapsing localisable

georgd opened this issue

Punctuation in input data may collide with punctuation in style definitions. The citeprocs handle these punctuation clusters by applying hard-coded rules. While these rules have been chosen on a good basis and probably suffice for all English language documents and most others in Latin script, some style guides or non-English standards might require divergent collapsing.

I brought this up on the citeproc-js issue tracker for a German language style which requires suppression of a subsequent comma after titles ending in ! or ?, with the intention of a CSLm-extension: Juris-M/citeproc-js#154. @denismaier suggested discussing this here and also came up with possible solutions to make punctuation collapsing localisable on a per-style basis:

<punct-handling>
  <punct>
    <input value="?,"/>
    <output value="?"/>
  </punct>
</punct-handling>

Perhaps this is closer to how it looks in citeproc-js:

<punct-handling>
  <punct value="?">
    <next value="!" result="?!"/>
    <next value="." result="?"/>
    <next value=":" result="?"/>
    <next value="," result="?,"/>
    <next value=";" result="?;"/>
  </punct>
</punct-handling>

We can take the combinations from citeproc-js as a basis here if we can agree upon a syntax. I think CSL 1.1 is a good target for this. But perhaps we can even add it to 1.0.2, as this just makes the current behaviour explicit and configurable.

https://github.com/cormacrelf/citeproc-rs/blob/b724318fe90c4c401ffbd4f34e8916f4e07435c9/crates/io/src/output/markup/move_punctuation.rs#L853

This is closer to how it's implemented now, rather than nested punct/next blocks. There is no need for the hierarchy. But as you can see, implementation would not be terribly difficult: just swap out the lookup tables.

@cormacrelf so what do you suggest concerning style syntax?

And @fbennett do you have a preference for any of the two models?

Maybe we should step back from syntax and settle what the logic is?

@bdarcus What do you mean by "logic" in that case? Right now, we have an established set of replacements, and the question is whether we want to make this customizable. If yes, we just need to define a syntax for how those replacement pairs should be specified. Maybe I'm just not understanding what you mean?

I’m not sure this is what you’re missing, but it’s what comes to my mind:

  • Most of the hard-coded rules would never be touched so it makes sense to provide a syntax that allows redefinition of collapsing rules for single punctuation pairs.
  • As a consequence, the processors can’t just overwrite the whole ruleset.
  • Now, I see two possible ways to achieve the localisation:
    1. On a pair of punctuation first apply the localised punctuation rules, then run through the unchanged hard-coded set.
    2. Run through a modified copy of the hard-coded set.

Does this make sense? I think, the last point is getting a bit too deep into implementation details.
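As a rough illustration of the two options listed above, here is a minimal Python sketch. It assumes one reading of option 1, namely "look in the localised rules first, fall back to the defaults". The table contents and function names are hypothetical and are not taken from any citeproc.

```python
# Hypothetical default table; the real tables live inside each processor.
DEFAULT_RULES = {"!.": "!", "?.": "?", "?,": "?,", "..": "."}

# A style-level override for the German case discussed above.
LOCAL_RULES = {"?,": "?"}


def collapse_layered(pair: str) -> str:
    """Option 1: consult the localised rules first, then the untouched defaults."""
    if pair in LOCAL_RULES:
        return LOCAL_RULES[pair]
    return DEFAULT_RULES.get(pair, pair)


# Option 2: build one modified copy of the defaults up front.
MERGED_RULES = {**DEFAULT_RULES, **LOCAL_RULES}


def collapse_merged(pair: str) -> str:
    """Option 2: a single lookup in the merged copy."""
    return MERGED_RULES.get(pair, pair)
```

Under this reading both options give the same result for every pair; they differ only in whether the merge happens once up front or at every lookup.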

One thing I was considering is punctuation clusters that upon consolidation produce a new punctuation cluster. But I don’t think these should be handled differently from how they are handled normally, except that each time the hard-coded collapsing rules are applied, the localised set should be considered.

As @cormacrelf says, it's just a mapping table, and making changes isn't much of a problem in implementations. As @georgd points out, adjustments demanded by a style or language domain will affect only a small number of entries, so expressing the entire mapping table in XML shouldn't be necessary.

I wonder how much variation there is in punctuation merge and transformation conventions? If there are only a handful of patterns, they could be called by name, masking the details from CSL entirely. That would align with settings for quote-swapping of punctuation, where we have two modes and don't declare in CSL code exactly which punctuation marks are affected in which context.

WRT the logic of how these additional rules work, the word "override" would suffice; we can take it from there. You can specify "go back to no collapsing for !." with a rule that maps "!." to "!.".

So we're all on the same page, this is roughly how it's done in citeproc-rs. I'll get to my point in a moment.

  "String!" + ". etc"
= "String" + "!." + " etc"
= "String" + lookup("!.") + " etc"
= "String" + "!" + " etc"
= "String! etc"

A similar thing operates over quotation boundaries, with different lookup table and inside/outside location depending on your punctuation in quote settings. Hopefully it will be clear that a "flat" syntax ie a list of <punct pair="!." replace="!" /> would be easier to use, and would expose less implementationy stuff than a punct/next hierarchy, that IIRC is a detail of citeproc-js optimising the routine, whether that even helps or not I can't guess.

One optimisation I will not give up lightly is that each side contributes exactly one Unicode scalar value. AFAIK no language has punctuation represented with more than one scalar value. Punctuation smashing is already an expensive operation. But swapping out "lookup" for a different hash table is absolutely fine.
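To make the join procedure above concrete, here is a small Python sketch of a seam lookup in which each side contributes exactly one code point. It is an illustration only, not citeproc-rs code; the table contents and the name smash_join are assumptions.

```python
# Hypothetical lookup table, not the actual data of any processor.
LOOKUP = {"!.": "!", "?.": "?", "..": ".", ",,": ","}


def smash_join(left: str, right: str, lookup: dict = LOOKUP) -> str:
    """Join two strings, collapsing the punctuation pair at the seam."""
    if not left or not right:
        return left + right
    pair = left[-1] + right[0]  # one code point from each side
    replacement = lookup.get(pair)
    if replacement is None:
        return left + right  # no rule for this pair: plain concatenation
    return left[:-1] + replacement + right[1:]
```

Swapping in a different table changes the behaviour without touching the join itself, for example passing {"!.": "!."} to switch collapsing off for that pair.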

WRT adding the implementation detail to CSL, I'm not aware that the spec has any reference to punctuation smashing beyond punctuation-in-quote. So perhaps it at least deserves a mention, especially since it's already in the CSL test suite as something you have to have to be a real implementation.

The punctuation pairs are not really an implementation detail. It's just that the simplest user-facing conceptualisation happens to be very similar to the implementation. So I don't really see any downsides to exposing it. Because you still have complete flexibility to implement however you wish. And adding complete customisation of this does not realistically add any additional complexity that punctuation-smash-style="german" would not.

Compare this to defining in the spec where precisely punctuation smashing occurs. For example, does it happen inside "My title has !. in it"? As a processor dev, I need the flexibility to say no, if only because this saves a large number of CPU cycles scanning linearly through all text in all citations everywhere, for something that I don't believe anyone actually needs, because users already control what's in the title. Also, defining it would require acknowledging/cementing precise rich text structure in the spec, which is also undesirable. We get better at smashing punctuation as time goes on, and I don't want a plain description of the current best effort to become a limitation.

Where does that leave the punctuation pairs? I don't see that as a limitation, just an input parameter to a larger problem. Would you, for example, want multiple punctuation marks in a row to be customised? No, because users already control this by not writing dumb styles like delimiter=", " + <text value=".;/:!?"/> and expecting magic.

So final opinion is that sure, put some <punct pair="!." replace="!"/> syntax or a bikeshed thereof in the locales with normal term-like locale override semantics. Pair must be exactly two Unicode scalar values. Replace can be anything you like, including "SMASH_SUCCESS" in a test suite. Another thing to bikeshed is whether you can also indicate whether the pairing operates on a quotation border and where each mark goes in each case, which would ultimately build the mappings FULL_MONTY_QUOTES_IN/OUT in that file I linked above. You could use an attribute like in[/out]-quote="split|inside|outside". I'll yield my time there.
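A minimal Python sketch of how a processor might read such flat rules and enforce the two-scalar constraint follows. The wrapping locale element, the attribute names, and the function name are placeholders for whatever syntax is eventually agreed upon.

```python
import xml.etree.ElementTree as ET

# Hypothetical locale fragment using the flat syntax floated above.
LOCALE_XML = """
<locale>
  <punct pair="!." replace="!"/>
  <punct pair="?," replace="?"/>
</locale>
"""


def load_punct_rules(xml_text: str) -> dict:
    """Build a pair -> replacement table from <punct> elements."""
    rules = {}
    for el in ET.fromstring(xml_text).iter("punct"):
        pair = el.get("pair", "")
        # Python strings are indexed by code point, so len() checks the constraint.
        if len(pair) != 2:
            raise ValueError(f"pair must be exactly two code points: {pair!r}")
        rules[pair] = el.get("replace", "")
    return rules
```

The replace value is left unconstrained here, matching the suggestion that it can be any string.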

Still wondering whether a limited set of named punctuation-merger patterns might be viable. Thoughts about that notion, @bdarcus, @georgd, @denismaier, @bwiernik, @adam3smith, @rmzelle?

If you can really pin it down to one or two variations on the whole mapping, then yeah, that would be good probably. But that does sound like more work for us compiling various punctuation conventions than letting people figure it out over time. Custom is set and forget and we never have to worry about it again.

I think the big advantage of a limited set of predefined patterns is that it can be thoroughly tested.

I really can't tell how many patterns would appear but for German I can name two already that are not mutually exclusive: s/([!?]),/$1/ and (between title-main and title-sub) s/([!?]):/$1/ (or s/([!?]):/$1:/, depending on the actual default).

I don't expect lots of additional sets for Latin or Cyrillic script but every script with its own set of punctuation marks may add to this (Greek, Arabic, Hebrew,...).
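For readers who want to try the two German substitutions quoted above, here they are as Python functions. The Perl-style $1 becomes \1 in Python's re module. The sample strings in the usage note are made up for illustration.

```python
import re


def drop_comma_after_mark(text: str) -> str:
    """s/([!?]),/$1/ : drop a comma that directly follows ! or ?."""
    return re.sub(r"([!?]),", r"\1", text)


def drop_colon_after_mark(text: str) -> str:
    """s/([!?]):/$1/ : drop a colon that directly follows ! or ?."""
    return re.sub(r"([!?]):", r"\1", text)
```

For example, drop_comma_after_mark("Wer war das?, Berlin 2001") gives "Wer war das? Berlin 2001". The two functions are independent, so a style could apply either one, both, or neither.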

You can thoroughly test the operation of the customisable rules, if you build them. If you're saying "the built-in modes would be part of the CSL test suite", there are other ways to test your style and locale combo. I built one called jest-csl and Frank built another one independently at the same time (haha). But ultimately if you test that the rules engine works, you shouldn't really need a second test suite? It's a set of declarative rules that you would simply be replicating in test case form.

I accept the point that we could be doing all this for exactly one conflicting rule, and technically we could do some research and squish all the world's punctuation rules into one hashmap, simply special-casing German. But I would also welcome letting locale maintainers figure that out themselves and not have to worry about it here.

I accept the point that we could be doing all this for exactly one conflicting rule, and technically we could do some research and squish all the world's punctuation rules into one hashmap, simply special-casing German.

Just not to be mistaken: the two rules I posted above are not universally applicable for German but have to be set on a per style basis, either one of them alone, or both combined, or neither of them.

But I would also welcome letting locale maintainers figure that out themselves and not have to worry about it here.

I prefer this, too.

I also think that should be customizable as per @cormacrelf's suggestion if we cannot say that it comes to 2 or 3 alternatives.

If all this is is a simple input/output map, then that suggests something like this should be fine?

<punct-handling>
  <punct input="?," output="?"/>
  <punct input="." output="?"/>
</punct-handling>

... or even:

<punct-handling>
  <map input="?," output="?"/>
  <map input="." output="?"/>
</punct-handling>

As in, the values can just be attributes on some common element.

Could we add this, and document a default?

Yes, I think that would be good.

Where would you add this? Under locales? Or under style? While this is not really locale specific, adding it under locales has the advantage that we can easily add default mappings to the default locale.

Adding a mapping to a style would then just override this particular combination. Or add a new combination if it doesn't exist yet. Right?

I would think we'd need to allow it either place?

Why?

Because in general rules can be generic, locale-specific, or style-specific.

Isn't that the case here?

Or do you mean just the CSL syntax?

In any case, I'm not following this closely; feel free to disregard if I'm off-base.

I think, allowing both is ultimately the correct way to do it as a ruleset might apply for the whole style and yet a specific locale might want to set some combinations differently.

<punct-handling>
  <map input="?," output="?"/>
  <map input="." output="?"/>
</punct-handling>

As in, the values can just be attributes on some common element.

Could we add this, and document a default?

I like this one very much.

I'm not so sure regarding adding in two places. Shouldn't it just be like with terms? The syntax with map is nice, yes.

How I understand it: imagine Chicago mandates one set of collapsing rules which, however, contradicts, let’s say, French orthographic rules. The style-specific set would go into cs:style, while the French rules would go into cs:locale.

One could argue about script specific mappings which might be locale-independent on cs:style level or even global (hard-coded in the processor?). AFAICS there are currently only Latin mappings.

Existing style options can be citation-specific, bibliography-specific, global (meaning both citation and bibliography), or locale options (for separate formatting rules by locales). We don't currently have any options that are both locale options and citation/bibliography/global options. The current approach if a locale has a formatting option that a style wishes to override (e.g., to use en-GB with punctuation-in-quotes="true") is to specify that locale in the style with the new formatting option.

I don't think that we need to change that existing locale options behavior. These locale punctuation rules seem to be applied fairly consistently within a language, so I think it will work fine for these punctuation rules to be defined in a similar way as the existing locale options. That is, they are specified only as children of cs:locale. If a style like Chicago wants to override the baseline French orthographic rules, it can specify that within the French locale in the style.

If we did want to make this a global option, my preference would be to permit all locale options to be specified globally. This would make the style options more consistent than having separate application rules for individual options. In that case, a locale option set on cs:style would take priority over the option set on a locale (either within the style or from a locale file).
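The priority order described here can be sketched with Python's collections.ChainMap, which returns the value from the first mapping that contains a key. The four layers and their contents are hypothetical, and the cs:style-level layer only exists if the global option were adopted.

```python
from collections import ChainMap

# Hypothetical rule layers, from lowest to highest priority.
processor_defaults = {"?,": "?,", "!.": "!", "..": "."}
locale_file = {"?,": "?"}        # shipped locale file
in_style_locale = {"!.": "!."}   # cs:locale inside the style
style_global = {"?,": "?,"}      # hypothetical option set on cs:style

# ChainMap searches left to right, so the highest priority goes first.
rules = ChainMap(style_global, in_style_locale, locale_file, processor_defaults)
```

With these layers, rules["?,"] comes from the style-global layer, rules["!."] from the in-style locale, and rules[".."] falls through to the processor defaults.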

As we are talking about this, these requirements seem similar to the discussion about commas, colons, and semicolons from here (e.g., Arabic has special characters).

It seems like, rather than using the cs:term system, if we are going to add a new punctuation system, the semicolon-substitution rules, etc. could be incorporated here. My thinking is that something like this would be used:

<punct-handling>
  <map input=";" output="؛"/>
</punct-handling>

In that discussion, the convention in Arabic and Persian is to use the localized colon and semicolon for terms but not for punctuation that already exists in a title: Journal of Research: Theory and Practice, for example, would keep its colon as part of the title. However, a locale like French does want within-field punctuation resolution, e.g.:

<punct-handling>
  <map input=":" output=" :"/>
</punct-handling>

So, I think we might need an attribute like @substitute-within-fields to control whether the punctuation handling is applied within fields. What do you think, @cormacrelf?

As for the syntax itself, "mapping inputs to outputs" is pretty arcane programmer language. We all know it here, but this is not common and I catch very confused looks when I slip it into conversation. These things should be self-documenting, and input/output tells people exactly nothing about what it does. I think take another run at it.

As for #107, I think I've missed the motivation for it, and it's not linked there or I've missed that too. I think it's multilingual styles which want the punctuation to be different style-wide in different locales? A couple of points while I'm at it:

  • It probably (?) doesn't make sense to support this kind of global override in normal locale files, rather only in the in-style override locale units where the person doing the override knows what punctuation is in use. If only so we don't have to add "define comma as comma" as boilerplate to most styles to be sure the French don't swoop in and ruin everything.
  • It strikes me as quite a blunt instrument in general, but if you know what you're doing...

As for whether adding these two features together is good? Not like this, at least. It's confusing -- why are some of these single marks and some double? It seems apt that you could define the punctuation-global-find-and-replace terms by their actual mark rather than a name, but I no longer have any clue what that locale snippet does. They are two very different features. And I don't want to ask this because the answer is undoubtedly horrifying, but which transform do you suppose happens first? Workshop this a bit more in general. Give us a different element name for a different feature, and give us some instructive attribute names, and then we'll be talking.

As for the syntax itself, "mapping inputs to outputs" is pretty arcane programmer language. We all know it here, but this is not common and I catch very confused looks when I slip it into conversation. These things should be self-documenting, and input/output tells people exactly nothing about what it does. I think take another run at it.

What about replacements, find, replace, or so? That's more common language.

I’m sorry, but looking at the overall complexity, I think 'map input to output' is really a minor problem.

Right, fair, any ideas on the ordering of transformations?

  Smash(:,) => :
+ Replace(:) => $
= what, applied to a field ending
  with ":", but suffix ","

Is the answer "Field$," or "Field$"?
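The ordering question can be checked mechanically. The Python sketch below applies the two toy rules from the example in both orders; the function names are made up, and "$" is simply the placeholder replacement used above.

```python
SMASH = {":,": ":"}    # Smash(:,) => :
REPLACE = {":": "$"}   # Replace(:) => $


def smash(left: str, right: str) -> str:
    """Join two strings, collapsing the pair at the seam if a rule exists."""
    pair = left[-1:] + right[:1]
    if pair in SMASH:
        return left[:-1] + SMASH[pair] + right[1:]
    return left + right


def replace(text: str) -> str:
    """Apply single-character replacements across the whole string."""
    return "".join(REPLACE.get(ch, ch) for ch in text)


# A field ending with ":" followed by the suffix ",".
smash_then_replace = replace(smash("Field:", ","))           # "Field$"
replace_then_smash = smash(replace("Field:"), replace(","))  # "Field$,"
```

Smashing first removes the comma before the colon is replaced. Replacing first turns the colon into "$", so the ":," rule no longer matches and the comma survives.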

@cormacrelf the motivation is that some locales use different punctuation symbols entirely (e.g., Persian has its own semicolon symbol). Currently, supporting that requires a completely separate style, rather than just being able to localize the semicolon symbol.

It's a fairly similar situation as in French where a colon is localized to narrow-no-break-space-colon.

But the question for ordering is a good one: script-specific substitutions (e.g. colon > Greek ano teleia) need to be processed before cluster consolidation, while other typographic transformations (e.g. French colon spacing) should be applied at last.

@cormacrelf These both seem like similar issues to me--in both, it's localized punctuation substitution. In some cases, it's replacing two characters with one (e.g., the ?. that started the discussion), in some cases, it's replacing one character with one (e.g., Persian semicolon), in some cases, it's replacing one character with two (e.g., French putting spaces before : ? !). But all of them are locale-specific punctuation substitutions.

If we are going to make a localizable punctuation system (which we definitely should), I think having a common system for specifying them would be good.

Can you elaborate on your concern?

May I remind you of this issue?

I think, we agreed that a localisable punctuation system is desirable. The last question that wasn’t discussed to an end, if I read this thread correctly, was about ordering.

@georgd Could you give a summary of what the ordering concern is with an example?

@georgd do you think it would be sufficient to have a "pre" and "post" list of punctuation replacements? I mean "pre" = before smashing together adjacent fields and therefore being targetable in those smash rules, ie the example you gave of "colon > U+0387 · GREEK ANO TELEIA"; "post" = after adjacent fields are smashed together and only affecting presentation.

I'm curious about the ano teleia, why does it need to be targetable in smash rules? I can think of a use case for the French colon, which is to normalise " :" into ":" for smash rules and then finally present it as the narrow nbsp version. That would work with only pre and post rules.

I would prefer this as it would be easier to write fast code for than an arbitrary interleaving of smash and replace rules. I'm thinking it probably wouldn't excessively limit the things you can accomplish; is that right?

Technical implementation note

I was thinking about how I would execute a fully flexible list of replacements and punctuation smashes in any combined order, and it seems quite hard to optimise. The naive implementation would be to scan through a list of 40 replace/smash instructions for every single string join which is just going to be slow. In comparison if you only had pre and post replacements, the "mid-section" would be a single lookup of smashable punctuation.

@bwiernik The smash rules and normal replacements must be different because smash rules have to supply additional information for what to do across quote boundaries, with and without punctuation-in-quote enabled. By quote boundaries I mean joining the rightmost end of a quoted field with an unquoted suffix/field/etc. You need to be able to specify for “?.” which part (if any) goes inside the quote and which (if any) goes outside, and for PIQ off, what to put outside the quote, and for unquoted joins, what to replace it with.

The last two are never going to differ: I think it's safe to say that if punctuation-in-quote is off, there would be no reason for any punctuation to appear inside the quotes, nor for the behaviour to be different depending on whether the join is quoted or unquoted. (Unless someone can come up with an example where these would be different. You can always add a specialisation for this later.) So you can specify a smash rule in 3 outputs: PIQ-in, PIQ-out, and otherwise.
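A possible data structure for such a three-output smash rule is sketched below in Python. The class, the rule table, and the function are hypothetical. The sketch follows the assumption stated above that with PIQ off no punctuation stays inside the quotes, and it hard-codes curly quote marks for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmashRule:
    piq_in: str   # placed inside the closing quote when PIQ is on
    piq_out: str  # placed outside the closing quote when PIQ is on
    out: str      # used when PIQ is off, and for unquoted joins


# Hypothetical rule table with one example pair.
RULES = {"!;": SmashRule(piq_in="!", piq_out=";", out="!;")}


def join_at_boundary(inner: str, suffix: str, quoted: bool, piq: bool) -> str:
    """Join field text with a suffix; `inner` excludes the quote marks."""
    rule = RULES.get(inner[-1:] + suffix[:1])
    if rule is None:
        return (f"“{inner}”" if quoted else inner) + suffix
    body, rest = inner[:-1], suffix[1:]
    if not quoted:
        return body + rule.out + rest
    if piq:
        return f"“{body}{rule.piq_in}”{rule.piq_out}{rest}"
    return f"“{body}”{rule.out}{rest}"
```

The unquoted and PIQ-off branches both read the same out field, which is the simplification argued for above.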

Here’s a syntax proposal incorporating the pre and post stuff I just posted about, using verbs. Pre becomes “normalise” and post becomes “present”. The three element kinds form a list of normalisations to apply, a mapping of smash join strings to their replacements, and a set of presentation rules that you can pretty much apply at the very end.

<!-- fr-FR -->
<punctuation>
  <normalise text=" :" as=":" />
  <smash join="::" out=":" />
  <smash join="!;" piq-in="!" piq-out=";" out="!;" />
  <present text=":" as="&#8239;:" />
</punctuation>

The single line smash rule is equivalent to piq-in="" piq-out=":" out=":". The out can even be optional, to represent not doing any replacement with PIQ off or in unquoted joins (and if no piq rules are present either, then simply deleting any inherited rule). Hence <smash join="!;" piq-in="!" piq-out=";" /> would be equivalent to the one in the snippet above. The same goes for the normalise and present rules: omitting as= deletes the rule.
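Here is a minimal Python sketch of the three-stage pipeline (normalise, then smash, then present) using the fr-FR rules from the snippet. It ignores quote boundaries and PIQ entirely, applies the normalise and present rules with plain substring replacement, and uses invented names, so treat it as an illustration of the ordering only.

```python
# Rules transcribed from the fr-FR example; U+202F is the narrow no-break space.
NORMALISE = [(" :", ":")]
SMASH = {"::": ":"}
PRESENT = [(":", "\u202f:")]


def join_fields(left: str, right: str) -> str:
    """Normalise both sides, smash the seam, then apply presentation rules."""
    for text, as_ in NORMALISE:
        left = left.replace(text, as_)
        right = right.replace(text, as_)
    pair = left[-1:] + right[:1]
    if pair in SMASH:
        joined = left[:-1] + SMASH[pair] + right[1:]
    else:
        joined = left + right
    for text, as_ in PRESENT:
        joined = joined.replace(text, as_)
    return joined
```

For example, join_fields("Titre :", ": sous-titre") first normalises " :" to ":", collapses "::" to ":", and only then inserts the narrow no-break space, so the space never gets in the way of the smash lookup.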

There’s one final ordering clarification, which is in what order to apply inherited normalise rules... but this is easy. It’s more a matter of determinism than an actual feature. Build a list of normalisations from inherited locales first (starting with en-US, then fr-FR, then fr-CA). Each time in the inheritance chain you see a previously seen normalisation, delete the previous one and append the new one. You may have to override all the normalisations from parent locales to get the order you want, but in general you won’t have to, because there should not be much overlap between normalisations. Why would you write

<normalise text=" :" as="Potato" />
<normalise text="Potato" as=":" />

... when you could do it in one rule.

The other punctuation rules are unordered so no issues.
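The delete-and-append procedure for inherited normalisations can be sketched in a few lines of Python, relying on the fact that dicts preserve insertion order. The function name and the use of None to stand for an omitted as= attribute are assumptions made for this example.

```python
def build_normalisations(chain):
    """Merge normalise rules along a locale inheritance chain.

    `chain` lists each locale's rules as (text, as_) tuples, base locale
    first. A rule with as_ set to None stands for an omitted as= attribute.
    """
    ordered = {}
    for locale_rules in chain:
        for text, as_ in locale_rules:
            ordered.pop(text, None)  # delete the previously seen rule, if any
            if as_ is not None:      # None only deletes the inherited rule
                ordered[text] = as_  # re-inserting appends at the end
    return list(ordered.items())
```

For example, if en-US defines rules for " :" and "..", fr-FR adds " ;", and fr-CA redefines " :", the result lists "..", then " ;", then " :", because the redefined rule moves to the end.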

One more consideration is where to match or NOT to match these normalise and present strings. Do we for example want to make people match whitespace themselves? So actually <normalise text=" : " as=": " />? Or do that implicitly so we can be slightly smarter and not have to trust locale authors to test that kind of thing? I don't want to break some paper's title where there's a mid-word colon or something. I also don't want to give people the full power of arbitrary regex substitutions here. Is that unreasonable? I don't know.

(Edit because I don’t want to ping you all again: I’m at peace with people using this to add rules like the New Yorker <present text="re-e" as="reë" />. That would just be really funny. But also I’m thinking about unordered present rules: you have to avoid doing replacements of already-presented slices. It’s more of a scan through the text doing replacements of a regex like rule1|rule2 and building an output string in one go. Maybe they should be ordered so we don’t have to guess which order to match the alternatives in when scanning the string. Or implicitly ordered by longest match first. I’ll think about that. Also maybe people do need access to regex so they can use the \b word boundary and \s whitespace features.)
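The single-pass scan described in this edit could look roughly like the Python sketch below. It uses the implicit longest-match-first ordering floated above; the function name is made up and the rules are treated as literal strings, not regexes.

```python
import re


def present(text: str, rules: dict) -> str:
    """Apply all presentation rules in one left-to-right scan.

    Each position is matched against rule1|rule2|..., longest rule first,
    so text that was already replaced is never scanned again.
    """
    if not rules:
        return text
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: rules[m.group(0)], text)
```

With the rules {"a": "b", "b": "c"}, present("ab", rules) gives "bc" and not "cc", which shows that already-presented slices are not replaced again.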

@georgd Could you give a summary of what the ordering concern is with an example?

I’m seeing three groups of interaction with punctuation here (a fourth one is quote boundary handling, which I leave aside here):

  1. Script specific glyph replacement like Greek ano teleia for colon or Arabic comma for Latin comma.
  2. Typographic conventions like French ":" to "&nbsp;:".
  3. Smash rules.

IMO, group 2 should be applied only after the smash rules, so the NBSP doesn’t get in the way. As for group 1, this might be safe to apply after smashing (which may reduce the number of smash rules). There might, however, be cases where collision with a native punctuation mark would be handled differently from a collision with a non-native punctuation mark. So, we could need to handle ":·" differently from "··".

But now, this seems very hypothetical to me as every example I could make up would pose problems in Latin script, too. (Something like a title ending in a colon). With my little bit of knowledge of Greek, Arabic and Hebrew script, I can’t imagine a case where it is necessary to apply 1 before 3. Yet, they should definitely not be intermingled.

@georgd do you think it would be sufficient to have a "pre" and "post" list of punctuation replacements? I mean "pre" = before smashing together adjacent fields and therefore being targetable in those smash rules, ie the example you gave of "colon > U+0387 · GREEK ANO TELEIA"; "post" = after adjacent fields are smashed together and only affecting presentation.

I'm curious about the ano teleia, why does it need to be targetable in smash rules? I can think of a use case for the French colon, which is to normalise " :" into ":" for smash rules and then finally present it as the narrow nbsp version. That would work with only pre and post rules.

@cormacrelf in general the approach to have pre and post rules looks very nice. With what I said above, "pre" rules could be obsolete. I don’t think it makes a substantial difference to replace colon by ano teleia before smashing vs. afterwards. But I don’t know what other typographies in the world require...

I don't want to break some paper's title where there's a mid-word colon or something. I also don't want to give people the full power of arbitrary regex substitutions here. Is that unreasonable? I don't know.

That would definitely be inappropriate. I think, the field data should not be touched beyond smashing and quote boundary handling.