A non-validating tokenizer and parser for the RDF N-Triples and N-Quads serializations (or any “N-x” format).
Provides parsing of N-Triples and N-Quads statements from strings, as well as tokenizing of any “N-x” string.
There are already enough parsers that are faster (see the last section), but having a parser for Node.js is useful for building smaller tools.
npm install --save rdf-nx-parser
The module exports a parser object:
var parser = require('rdf-nx-parser');
Use parseTriple() to parse an N-Triples statement, parseQuad() for N-Quads. Both return an object, or null if the input can't be parsed.
var quad = parser.parseQuad(
    '_:foo ' +
    '<http://example.com/bar> ' +
    '"\\u9B3C\\u8ECA"@jp ' +
    '<http://example.com/baz> .'
);
console.log(JSON.stringify(quad, null, 4));
{
    "subject": {
        "type": "blankNode",
        "value": "foo"
    },
    "predicate": {
        "type": "iri",
        "value": "http://example.com/bar"
    },
    "object": {
        "type": "literal",
        "value": "鬼車",
        "language": "jp"
    },
    "graphLabel": {
        "type": "iri",
        "value": "http://example.com/baz"
    }
}
Literal objects can have an additional language or datatypeIri property.
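For example, a literal with a datatype ends up with a datatypeIri property. A minimal sketch (the subject, predicate and datatype IRIs are made-up placeholders):

var triple = parser.parseTriple(
    '<http://example.com/s> <http://example.com/p> ' +
    '"123"^^<http://example.com/int> .'
);
console.log(triple.object.datatypeIri);
// -> http://example.com/int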
The parser does not verify that the data adheres to the grammar. Instead, it will happily parse anything as well as it can:
> parser.parseQuad('<foo> <:///baz> "bar" <$!#]&> .');
{ subject: { type: 'iri', value: 'foo' },
  predicate: { type: 'iri', value: ':///baz' },
  object: { type: 'literal', value: 'bar' },
  graphLabel: { type: 'iri', value: '$!#]&' } }
You can optionally pass an options object to these methods as a second parameter, shown here with the defaults:

parser.parseTriple(input, {
    // Set to `true` to get unparsed strings as `value`
    // properties
    asString: false,
    // Include the unparsed token as `valueRaw` property
    // when returning objects
    includeRaw: false,
    // Decode Unicode escapes, `\uxxxx` and `\Uxxxxxxxx`
    // (but not percent encoding or punycode)
    unescapeUnicode: true
});
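For example, turning off unescapeUnicode should leave the escape sequences from the first example untouched. A sketch under the documented option semantics (the IRIs are placeholders):

> parser.parseTriple(
      '<a> <b> "\\u9B3C\\u8ECA"@jp .',
      { unescapeUnicode: false }
  ).object.value;
'\\u9B3C\\u8ECA'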
Parsing a whole file of N-Triples / N-Quads lines can easily be done, e.g., with Node's readline module; see the example.
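A minimal sketch of that approach (the file name triples.nt is a placeholder):

var fs = require('fs');
var readline = require('readline');
var parser = require('rdf-nx-parser');

var rl = readline.createInterface({
    input: fs.createReadStream('triples.nt'),
    terminal: false
});

rl.on('line', function (line) {
    var triple = parser.parseTriple(line);
    if (triple !== null) {
        // Do something with the parsed statement
        console.log(triple.subject.value);
    }
});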
An arbitrary number of “N-x” tokens can be extracted from a string into an array of token objects with the tokenize() method:
> parser.tokenize(
      '<foo> _:bar . "123"^^<http://example.com/int> ' +
      '"\u0068\u0065\u006C\u006C\u006F"@en-US . .'
  );
[ { type: 'iri', value: 'foo' },
  { type: 'blankNode', value: 'bar' },
  { type: 'endOfStatement', value: '.' },
  { type: 'literal',
    value: '123',
    datatypeIri: 'http://example.com/int' },
  { type: 'literal',
    value: 'hello',
    language: 'en-US' },
  { type: 'endOfStatement', value: '.' },
  { type: 'endOfStatement', value: '.' } ]
Each token has at least a type and a value property. There are four token types: iri, literal, blankNode and endOfStatement (they can be listed with the getTokenTypes() method).
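For example (the exact contents and order of the returned array are an assumption here):

> parser.getTokenTypes();
[ 'iri', 'literal', 'blankNode', 'endOfStatement' ]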
The implementation is based on regular expressions that split the input into tokens; they are pretty fast on V8. This regex-based implementation turned out to be faster than a previous simple state machine that read the input in a single scan; it seems that regexes can be compiled more effectively into machine code.
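An illustrative sketch of the technique, not the module's actual token pattern (which handles far more cases):

// Simplified pattern: IRIs, blank nodes, literals and the
// statement terminator, matched one token per exec() call.
var tokenPattern = /<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"|\./g;
var match;
while ((match = tokenPattern.exec('<foo> _:bar "baz" .')) !== null) {
    console.log(match[0]); // <foo>, _:bar, "baz", .
}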
Works with Node.js 0.10 and higher.
Run the tests with npm test (Mocha, Chai, Istanbul).
Other, faster parsers:
- Raptor library, C
- nxparser, Java