rust-bakery / nom

Rust parser combinator framework

How to parse until a range of tags

frenetisch-applaudierend opened this issue · comments

I would like to parse arbitrary text with embedded sequences which are delimited by different tags into their parts. E.g.

Test <#embedded sequence 1#> and (*embedded sequence 2*)

should be parsed to Text("Test ") Embedded1("embedded sequence 1") Embedded2("embedded sequence 2"). Ideally, all strings in the tokens should be borrowed from the input string.

The embedded sequences are straightforward, but I cannot work out how to specify the parser for the Text tokens. Is it possible to take_until any one of a set of tags is encountered?

Hello @frenetisch-applaudierend.
I think the fifth chapter of The Nom Guide (Nominomicon) may be able to address your question.

Hi @coalooball

Thanks for the link, I haven't seen that one before!

However, I don't think it applies to my use case, since the parsers it mentions only allow a predicate on single characters. I would need a predicate built from other parsers (i.e. take_until(tag("<#").or(tag("(*")))), but this does not seem to be supported (or I did not see it).
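
To make the requested behavior concrete, here is a minimal sketch in plain Rust (std only, no nom). The helper name `take_until_any` is hypothetical and not part of nom; it only illustrates "take text until the earliest of several tags" and returns its result in the `(rest, text)` order that nom parsers use.

```rust
/// Hypothetical helper, not a nom API: splits `input` at the earliest
/// occurrence of any tag in `tags` and returns `(rest, text)`.
/// If no tag occurs, the whole input is returned as `text`.
fn take_until_any<'a>(input: &'a str, tags: &[&str]) -> (&'a str, &'a str) {
    // Earliest byte offset at which any tag starts, or the end of the input.
    let split = tags
        .iter()
        .filter_map(|tag| input.find(*tag))
        .min()
        .unwrap_or(input.len());
    // `find` returns char-boundary offsets, so `split_at` cannot panic here.
    let (text, rest) = input.split_at(split);
    (rest, text)
}

#[test]
fn test_take_until_any() {
    let input = "Test <#embedded sequence 1#> and (*embedded sequence 2*)";
    let (rest, text) = take_until_any(input, &["<#", "(*"]);
    assert_eq!(text, "Test ");
    assert_eq!(rest, "<#embedded sequence 1#> and (*embedded sequence 2*)");
}
```

Both returned slices borrow from `input`, so nothing is copied. Within nom, composing `recognize(many_till(anychar, peek(alt((tag("<#"), tag("(*"))))))` should behave similarly, with the caveat that it returns an error when the input ends before any tag is found.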

Hello again!
take_until really doesn't work that way, since it is the equivalent of a terminal node in BNF. I suppose you could use terminated to extract the arbitrary text.
Here is my approach, which is a bit cumbersome. I don't know whether there is a more concise way:

use nom::{
    branch::alt,
    bytes::complete::{tag, take_till, take_while1},
    character::{is_alphanumeric, is_space},
    sequence::{delimited, terminated},
    IResult,
};

// Returns true for the inner delimiter bytes '*' and '#'.
fn is_delimiter(s: u8) -> bool {
    s == b'*' || s == b'#'
}

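// Parses `<#...#>` or `(*...*)` and returns the inner bytes.
// The opening and closing delimiters are matched independently, so mixed
// pairs such as `<*...#)` are accepted too, and the content itself cannot
// contain '*' or '#'.
fn embedded_sequence(s: &[u8]) -> IResult<&[u8], &[u8]> {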
    delimited(
        alt((tag(b"<"), tag(b"("))),
        delimited(
            alt((tag(b"#"), tag(b"*"))),
            take_till(is_delimiter),
            alt((tag(b"#"), tag(b"*"))),
        ),
        alt((tag(b">"), tag(b")"))),
    )(s)
}

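// Parses leading text (alphanumerics and spaces only), then consumes and
// discards one embedded sequence. Only the leading text is returned.
fn parse(s: &[u8]) -> IResult<&[u8], &[u8]> {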
    terminated(
        take_while1(|x| is_alphanumeric(x) || is_space(x)),
        embedded_sequence,
    )(s)
}

fn main() {}

#[test]
fn test_embedded_sequence() {
    assert_eq!(
        embedded_sequence(b"<#embedded sequence 1#>111").unwrap(),
        (b"111".as_ref(), b"embedded sequence 1".as_ref())
    );
    assert_eq!(
        parse(b"Test <#embedded sequence 1#> and (*embedded sequence 2*)").unwrap(),
        (b" and (*embedded sequence 2*)".as_ref(), b"Test ".as_ref())
    )
}
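
As a possible alternative for the full use case from the original question, here is a rough sketch of a tokenizer in plain Rust (std only, no nom). The `Token` enum and the `tokenize` function are hypothetical names chosen for illustration. The sketch assumes that `<#` is always closed by `#>` and `(*` by `*)`, that sequences do not nest, and that an unclosed sequence should be reported as a failure.

```rust
/// Hypothetical token type; every variant borrows from the input string.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Text(&'a str),
    Embedded1(&'a str), // content of <# ... #>
    Embedded2(&'a str), // content of (* ... *)
}

/// Splits `input` into text and embedded sequences.
/// Returns `None` if an opening tag has no matching closing tag.
fn tokenize<'a>(mut input: &'a str) -> Option<Vec<Token<'a>>> {
    let mut tokens = Vec::new();
    while !input.is_empty() {
        if let Some(rest) = input.strip_prefix("<#") {
            let end = rest.find("#>")?;
            tokens.push(Token::Embedded1(&rest[..end]));
            input = &rest[end + 2..]; // skip the two-byte closing tag
        } else if let Some(rest) = input.strip_prefix("(*") {
            let end = rest.find("*)")?;
            tokens.push(Token::Embedded2(&rest[..end]));
            input = &rest[end + 2..];
        } else {
            // Plain text runs up to the earliest opening tag, or to the end.
            // `end` is never 0 here, because inputs that start with a tag
            // are handled by the branches above.
            let end = ["<#", "(*"]
                .iter()
                .filter_map(|tag| input.find(*tag))
                .min()
                .unwrap_or(input.len());
            tokens.push(Token::Text(&input[..end]));
            input = &input[end..];
        }
    }
    Some(tokens)
}

#[test]
fn test_tokenize() {
    let tokens = tokenize("Test <#embedded sequence 1#> and (*embedded sequence 2*)");
    assert_eq!(
        tokens,
        Some(vec![
            Token::Text("Test "),
            Token::Embedded1("embedded sequence 1"),
            Token::Text(" and "),
            Token::Embedded2("embedded sequence 2"),
        ])
    );
}
```

Unlike the character-predicate approach above, the text between sequences may contain any characters, and the " and " between the two sequences is kept as its own Text token.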