stencila / encoda

↔️ A format converter for Stencila documents

Home Page: https://stencila.github.io/encoda/

JATS: Plain text citation

rgieseke opened this issue

Mixed citations (e.g. originally from bibitems in LaTeX) are not parsed when reading from JATS-XML.

I think the mixed-content should actually be mixed-citation: https://github.com/stencila/encoda/blob/master/src/codecs/jats/index.ts#L1165

Even nicer would probably be to have all elements of the mixed-citation as inline elements to keep e.g. parts in italics.

<mixed-citation>Norman de Plume <italic>The book of XML problems</italic>. XtraPress 2021.</mixed-citation>

Maybe this could be filed as description or comment?

https://schema.stenci.la/creativework

I think the ideal way to handle a <mixed-citation> would be to try to "decode" it (i.e. parse it) into a CreativeWork. If we do that, then in-text Cite nodes will work as expected (i.e. show authors and years if needed).

With the fix that you made, the entirety of the <mixed-citation> ends up in the title:

type: Article
id: pone-0091296-Choat1
authors: []
title: >-
      Choat JH (2012) Spawning aggregations in reef fishes; ecological and
      evolutionary processes. In: Sadovy de Mitcheson Y, Colin PL, editors. Reef
      Fish Spawning Aggregations: Biology, Research and Management. Heidelberg:
      Springer. pp. 85–116.

whereas what we want is the bibliographic info to be parsed out of the <mixed-citation> into

type: CreativeWork
authors:
  - type: Person
    familyNames:
      - Choat
    givenNames:
      - John Howard
datePublished:
  type: Date
  value: '2011-09-20'
identifiers:
  - type: PropertyValue
    name: doi
    propertyID: https://registry.identifiers.org/registry/doi
    value: 10.1007/978-94-007-1980-4_4
isPartOf:
  type: Periodical
  name: 'Reef Fish Spawning Aggregations: Biology, Research and Management'
publisher:
  type: Organization
  name: Springer Netherlands
title: Spawning Aggregations in Reef Fishes; Ecological and Evolutionary Processes
url: http://dx.doi.org/10.1007/978-94-007-1980-4_4

In Encoda, rather than trying to parse references into a CreativeWork, we take the approach suggested here and query CrossRef for bibliographic info. I didn't write the above YAML out by hand but rather used the crossref codec:

./encoda convert "Choat JH (2012) Spawning aggregations in reef fishes; evolutionary processes." --from crossref - --to yaml

I suggest that we use this approach for JATS <mixed-citation> (as we do in the reshape function). However, I think it would be wise to perhaps put it in name or alternateNames or similar (I think description should be avoided because that is where the abstract goes and in some cases we actually have that; and comment has a different semantic structure) and then do the CrossRef querying as a separate enrichment step that won't cause a failure, if for instance there is no network connection.
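A minimal sketch of the "enrichment must not cause a failure" idea, assuming a hypothetical `lookup` function standing in for the CrossRef query. The names `CreativeWorkLike`, `Reference`, `Lookup` and `enrichReference` are illustrative and are not part of Encoda. If the lookup throws, for example because there is no network connection, the original citation text is kept:

```typescript
// Hypothetical, simplified shape of a parsed reference (not Encoda's schema types)
interface CreativeWorkLike {
  type: 'CreativeWork'
  title?: string
}

type Reference = string | CreativeWorkLike

// Stand-in for a CrossRef query; assumed to reject when offline
type Lookup = (citation: string) => Promise<CreativeWorkLike | undefined>

// Try to enrich a plain text citation; never throw, fall back to the string
async function enrichReference(
  reference: Reference,
  lookup: Lookup
): Promise<Reference> {
  // Already structured: nothing to do
  if (typeof reference !== 'string') return reference
  try {
    const work = await lookup(reference)
    // No match found: keep the original text
    return work ?? reference
  } catch {
    // Lookup failed (e.g. no network): keep the original text
    return reference
  }
}
```

With this shape, a failed or empty lookup leaves the reference as a plain string, so decoding still succeeds and the enrichment can be retried later.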

🎉 This issue has been resolved in version 0.111.0 🎉

Your semantic-release bot 📦🚀

I suggest that we use this approach for JATS <mixed-citation> (as we do in the reshape function). However, I think it would be wise to perhaps put it in name or alternateNames or similar (I think description should be avoided because that is where the abstract goes and in some cases we actually have that; and comment has a different semantic structure) and then do the CrossRef querying as a separate enrichment step that won't cause a failure, if for instance there is no network connection.

Yes, I mistakenly thought that description belonged to the citation and not to the entire CreativeWork.
The CrossRef querying approach sounds great, how could that work?
Should it be an extra conversion? JATS to CrossRef enhanced JATS? Or should it be tried in the JATS codec?

Should it be an extra conversion? JATS to CrossRef enhanced JATS?

Yes, that is what I am advocating for above. It shouldn't be part of the decode method of the JatsCodec but rather part of a generic function which can be applied to the references of any Article, no matter which format it originated from. That is exactly what currently happens here in the reshape function, but it is currently "converting" paragraphs into CreativeWorks using CrossRef:

encoda/src/util/reshape.ts

Lines 341 to 370 in 52b872a

      let text = textContent(following)
      // Remove leading numbers etc (if any)
      text = text.replace(/^\s*\d+\s*[.,:;]*\s*/, '')
      // Look for a DOI
      const match = /\b((DOI\s*:?\s*)|(https?:\/\/doi\.org\/))?(10.\d{4,9}\/[^\s]+)/i.exec(
        text
      )
      if (match) {
        // Remove trailing punctuation (if any)
        let doi = match[4]
        if (doi.endsWith('.') || doi.endsWith(',')) doi = doi.slice(0, -1)
        promises.push(decodeDoi(doi, text))
      } else {
        promises.push(decodeCrossref(text))
      }
    } else break
    step++
    // Limit the number of inflight requests to 10
    // Avoids this warning https://github.com/sindresorhus/got/issues/1523
    if (promises.length >= 10) {
      references.push(
        ...((await Promise.all(promises)) as schema.CreativeWork[])
      )
      promises = []
    }
  }

I think this code should be factored out into a separate enrich function and applied to Paragraphs in the references section, but also to string items in the references property of any CreativeWork (I had forgotten that string is a valid item in references).
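As a rough sketch of what such an enrich function might look like, the following applies the same DOI detection and the same limit of 10 in-flight requests to string items in a list of references. `Work`, `Resolvers`, `extractDoi` and `enrich` are hypothetical names; `byDoi` and `byText` stand in for `decodeDoi` and `decodeCrossref`, whose real signatures are assumed here. The dot in `10\.` is escaped in this sketch, which differs slightly from the pattern in the excerpt.

```typescript
// Hypothetical, simplified reference type (not Encoda's schema.CreativeWork)
interface Work {
  title?: string
  doi?: string
}

// Hypothetical resolver signatures, standing in for decodeDoi / decodeCrossref
interface Resolvers {
  byDoi: (doi: string, text: string) => Promise<Work>
  byText: (text: string) => Promise<Work>
}

// Extract a DOI from citation text, dropping a trailing '.' or ','
function extractDoi(text: string): string | undefined {
  const match = /\b(?:DOI\s*:?\s*|https?:\/\/doi\.org\/)?(10\.\d{4,9}\/[^\s]+)/i.exec(text)
  if (!match) return undefined
  const doi = match[1]
  return doi.endsWith('.') || doi.endsWith(',') ? doi.slice(0, -1) : doi
}

// Resolve string references in batches; structured items pass through unchanged
async function enrich(
  references: Array<string | Work>,
  resolvers: Resolvers,
  batchSize = 10
): Promise<Array<string | Work>> {
  const enriched: Array<string | Work> = []
  for (let start = 0; start < references.length; start += batchSize) {
    // At most `batchSize` requests are in flight at any time
    const batch = references
      .slice(start, start + batchSize)
      .map(async (ref): Promise<string | Work> => {
        if (typeof ref !== 'string') return ref
        // Remove leading numbers etc (if any)
        const text = ref.replace(/^\s*\d+\s*[.,:;]*\s*/, '')
        const doi = extractDoi(text)
        return doi !== undefined
          ? resolvers.byDoi(doi, text)
          : resolvers.byText(text)
      })
    enriched.push(...(await Promise.all(batch)))
  }
  return enriched
}
```

Because the function only looks at the items it is given, it could be applied to references from any source format, not just JATS.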

In summary, what needs to happen if we take this direction is:

  • In the jats codec, return a string for <mixed-citation> instead of setting the title:
    if (title === undefined && elem.name === 'mixed-citation') {
      title = textOrUndefined(elem)
    }
  • In the reshape function, turn Paragraphs that are in the "References" or "Bibliography" section into plain string items in Article.references
  • Move relevant code from reshape function and put into a new enrich function where it is applied to any item in Article.references that is a string
  • Enable enrich after decode by default but allow user to disable it (like we do for coerce and reshape)

    encoda/src/codecs/types.ts

    Lines 271 to 273 in 39813e0

    let node = await this.decode(await vfile.read(filePath), options)
    if (shouldCoerce) node = await coerce(node)
    if (shouldReshape) node = await reshape(node)
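The last step of the list could look roughly like the sketch below, which mirrors the `shouldCoerce` / `shouldReshape` pattern from the excerpt. The `shouldEnrich` flag, the `Steps` argument and the `postDecode` function are hypothetical, and the steps are passed in as stand-ins rather than being Encoda's real `coerce`, `reshape` or a real `enrich`. Defaulting all three flags to enabled is an assumption based on the description above.

```typescript
// Stand-in for a document node; Encoda's real type is schema.Node
type DocNode = unknown
type Step = (node: DocNode) => Promise<DocNode>

interface ReadOptions {
  shouldCoerce?: boolean
  shouldReshape?: boolean
  shouldEnrich?: boolean // hypothetical new flag
}

interface Steps {
  coerce: Step
  reshape: Step
  enrich: Step // hypothetical new step
}

// Apply the optional post-decode steps in order; each is enabled by default
async function postDecode(
  node: DocNode,
  steps: Steps,
  options: ReadOptions = {}
): Promise<DocNode> {
  const {
    shouldCoerce = true,
    shouldReshape = true,
    shouldEnrich = true,
  } = options
  if (shouldCoerce) node = await steps.coerce(node)
  if (shouldReshape) node = await steps.reshape(node)
  // Enrich runs last so that it sees references produced by reshape
  if (shouldEnrich) node = await steps.enrich(node)
  return node
}
```

Running enrich after reshape matters in this sketch because reshape is the step that would turn "References" paragraphs into string items for enrich to pick up.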