scramjetorg / scramjet

Public tracker for Scramjet Cloud Platform, a platform that brings data from many environments together.

Home Page: https://www.scramjet.org

XMLParse()

jocubeit opened this issue

Any chance you'll be including XMLParse() functionality?

I did think about this on a couple of occasions - any proposals for a good streaming XML parser module whose dependency weight wouldn't incur relativistic effects? ;)

fast-xml-parser seems to be the favourite, but it's not a streaming parser.

xml-flow looks ok - I've never used it though. Uses the sax library. Also has a BSD license - not sure if this is a concern for you.

xml-streamer is based on node-expat and it implements the Node.js stream.Transform API. Not sure you want to add node-expat and lodash as dependencies though.

muxml is a Node.js transform stream, is written with ES6, has a dependency on sax and is MIT licensed. I'm liking the look of this one.

So maybe the last one? ;-)

Looking at the options I'd just go and use sax as a dependency, as this would carry less weight...

OK, the last time I looked into this I thought the best way would be to take an XPath or querySelector expression as the argument that selects the data, and I don't see anything that could solve this for us - which sadly would mean a new project (cool!). My idea would then be:

  1. create a sax derived module that would do readable -> sax -> xpath -> readable stream
  2. integrate that into scramjet.

Are you willing to help?

I'm willing to help, not sure how much help I can be though... I come from the C# and ruby worlds, node and javascript are still on my learning curve. I'm using typescript, node and dart on a daily basis though, and I'm a fast learner. Just let me know what you would like me to tackle and I'll take a stab.

I'm thinking xpath will be challenging. We could avoid building a dom but still support querySelector by converting css queries to xpath. I've had a brief look at some xpath modules, but need to do more of a deep dive I think. Don't we have to parse the entire document to filter for an xpath?
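
To make the CSS-to-XPath idea concrete, here is a minimal sketch. It assumes only element names plus the descendant (space) and child (>) combinators need to be supported; classes, ids, attributes and pseudo-selectors are rejected. The name cssToXPath is hypothetical and does not come from any of the modules mentioned above.

```javascript
// Hypothetical helper: converts a tiny subset of CSS selectors to XPath.
// Supported: element names, descendant combinator (space), child combinator (>).
function cssToXPath(selector) {
    // Split on ">" (kept as a token) or on plain whitespace.
    const tokens = selector.trim().split(/\s*(>)\s*|\s+/).filter(Boolean);
    if (tokens.length === 0) {
        throw new Error("Empty selector");
    }

    let xpath = "";
    let axis = "//"; // a CSS selector matches anywhere, so start with descendant-or-self
    for (const token of tokens) {
        if (token === ">") {
            axis = "/"; // the next name must be a direct child
            continue;
        }
        if (!/^[A-Za-z_][\w-]*$/.test(token)) {
            throw new Error(`Unsupported selector part: ${token}`);
        }
        xpath += axis + token;
        axis = "//"; // default back to the descendant combinator
    }
    return xpath;
}

console.log(cssToXPath("catalog > book title")); // prints //catalog/book//title
```

Anything beyond this subset would probably be better served by an existing selector parser than by extending the regular expression.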

I'm thinking we can use an xpath-like filter for matching nodes. I'm assuming each match will then be considered a "row" and be surfaced as an object by the .consume() method. Sorry if this is a little incoherent, I'm typing as I think.
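
A rough sketch of how such a filter could surface each match as a row without holding the whole document. It assumes the filter is a simple absolute path like "/catalog/book" (no predicates, wildcards or sibling axes), in which case only the stack of currently open elements has to be tracked. PathMatcher and its open/text/close methods are hypothetical names; in practice they would be driven by the events of a SAX-style parser.

```javascript
// Hypothetical sketch: not part of sax, xml-flow or scramjet.
class PathMatcher {
    constructor(path, onRow) {
        this.target = path.split("/").filter(Boolean); // "/catalog/book" -> ["catalog", "book"]
        this.onRow = onRow;
        this.stack = []; // names of the elements that are currently open
        this.row = null; // the match being assembled, if any
    }

    // Call for every opening tag.
    open(name, attributes = {}) {
        this.stack.push(name);
        if (this.row === null && this.isAtTarget()) {
            this.row = {name, attributes, text: ""};
        }
    }

    // Call for every text node; text outside a match is dropped immediately.
    text(value) {
        if (this.row !== null) this.row.text += value;
    }

    // Call for every closing tag.
    close() {
        if (this.row !== null && this.isAtTarget()) {
            this.onRow(this.row); // the match is complete, surface it as one row
            this.row = null;
        }
        this.stack.pop();
    }

    // True when the open-element stack equals the target path exactly.
    isAtTarget() {
        return this.stack.length === this.target.length
            && this.stack.every((name, i) => name === this.target[i]);
    }
}

// Events for: <catalog><book id="1">Dune</book><note>not a match</note></catalog>
const rows = [];
const matcher = new PathMatcher("/catalog/book", row => rows.push(row));
matcher.open("catalog");
matcher.open("book", {id: "1"});
matcher.text("Dune");
matcher.close();
matcher.open("note");
matcher.text("not a match");
matcher.close();
matcher.close();
console.log(rows);
```

After these events, rows holds a single object for the book element. Memory use is bounded by the nesting depth plus the size of one match, which is why a downward-only path does not seem to require parsing the entire document first. Expressions that look backwards or at following siblings would need buffering and are outside this sketch.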

Good, so:

  1. No, I think the XMLParse should not be a plugin especially since CSVParse isn't. I'd start a separate repo just for the sax based streamed parser and use it as a dependency (so that's where your help would be appreciated).

  2. XPath and selectors could both be supported - we'd just need to see which ones are supported by sax and the other tools we could set up. I'm in the process of setting up Slack for Scramjet - I'll invite you as soon as we get going.

  3. Yes, "xpath-like filter", that's the idea I have, a bit like the JSONStream module that I use for JSONParse method.

OK, let me know when you have Slack set up; in the meantime I'll familiarise myself with Scramjet, JSONParse etc.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

There's a discussion on the just opened scramjet slack about this...

I think this could be a good way to approach it.

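const fs = require("fs");
const flow = require("xml-flow");
const {DataStream} = require("scramjet");

DataStream.from(function() {
    // xml-flow takes a readable stream, not a file path
    const xml = flow(fs.createReadStream("./path/to.xml"));
    const out = new DataStream();
    // forward every <mytag> element; pause the parser while `out` is full
    xml.on("tag:mytag", data => {
        if (!out.write(data)) {
            xml.pause();
            out.once("drain", () => xml.resume());
        }
    });
    // close the output once the document has been read, and pass parser errors on
    xml.on("end", () => out.end());
    xml.on("error", e => out.emit("error", e));
    return out;
})
    .each(console.log)
    .catch(e => {
        console.error(e.stack);
        process.exit(1);
    });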