mjackson / citrus

Parsing Expressions for Ruby

Home Page: http://mjackson.github.io/citrus

Help creating a rule for a last name

rejeep opened this issue

Hi,

I'm trying to create a rule for a last name. This is what I have come up with:

rule last_name
  [A-Za-z]+ space [iI]+ | [iI]+ &[^iI]+ | [^iI] [A-Za-z]+
end

So:

  • Match a "name", followed by a space, followed by one or more i's. Or
  • If the first characters are one or more i's, then something that is not an i must follow. Or
  • If the first character is not an i, then any "name" can follow

If correct, this should be able to parse:

  • Rule 1) Love III
  • Rule 2) Immelman
  • Rule 3) Donald

But it fails on Immelman. It would also fail on, for example, Love IIIx.

I guess my second rule is wrong? But why?
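A note on the "why": as far as I can tell, the culprit is the & predicate in the second alternative. The first alternative fails on Immelman because [A-Za-z]+ consumes the whole word and leaves nothing for [iI]+. The second alternative then matches [iI]+ against the leading "I", and &[^iI]+ only checks that a non-i follows without consuming it. So the rule succeeds after a single character, and, assuming the parse must consume the entire input, the leftover "mmelman" makes it fail. The third alternative is never tried. The sketch below uses a plain Ruby Regexp lookahead as a stand-in for the Citrus predicate (an assumption: this is not Citrus itself) to show how little gets consumed:

```ruby
# Stand-in for the second alternative, [iI]+ &[^iI]+, using Ruby's Regexp.
# (?=...) is a positive lookahead: it checks what follows without consuming it,
# which is how the & predicate behaves in a parsing expression grammar.
input = "Immelman"
second_alternative = /\A[iI]+(?=[^iI]+)/

match = second_alternative.match(input)
consumed = match.end(0)

puts "matched #{match[0].inspect}, consumed #{consumed} of #{input.length} characters"
# prints: matched "I", consumed 1 of 8 characters
```

Love IIIx would fail for a similar reason: the first alternative matches "Love III" and leaves the trailing "x" unconsumed.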

I don't follow the logic that you're using. Why not just try a simpler pattern? For example: [A-Za-z]+ (" "* [iI]+)? Here's what I get when I use this pattern in irb:

irb> require 'citrus'
=> true
irb> rule = Citrus.rule '[A-Za-z]+ (" "* [iI]+)?'
=> /[A-Za-z]/+ (" "* /[iI]/+)?
irb> rule.test 'Love III'
=> 8
irb> rule.test 'Immelman'
=> 8
irb> rule.test 'Donald'
=> 6

I can't do that because I have another rule that would conflict with it. If I do it your way, the name David Love III would parse as first name David, middle name Love and last name III. But the first name should be David and the last name Love III. What I'm trying to do with my rule is make sure that the last name cannot consist only of I's.
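To make the conflict concrete, here is a rough sketch using a plain Ruby Regexp with named captures as a stand-in for the first_name / middle_name / last_name sequence (an assumption for illustration only: it is not Citrus, and the capture names are just labels). The last name uses the simpler pattern suggested above:

```ruby
# Regexp stand-in (not Citrus) for: first_name space middle_name space last_name,
# where last_name is the simpler pattern [A-Za-z]+ (" "* [iI]+)?
name_pattern = /\A(?<first>[A-Za-z]+)[ \t]*(?<middle>[A-Za-z]\.|[A-Za-z]+)[ \t]*(?<last>[A-Za-z]+(?: *[iI]+)?)\z/

m = name_pattern.match("David Love III")
puts "first=#{m[:first]} middle=#{m[:middle]} last=#{m[:last]}"
# prints: first=David middle=Love last=III
```

Because "III" is itself a run of letters, nothing stops it from being taken as the whole last name, which pushes "Love" into the middle name slot.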

Maybe it's simpler if I give you the whole grammar:

grammar Name
  rule name
    first_name space middle_name space last_name |
    first_name space last_name |
    first_name
  end

  rule first_name
    [A-Za-z]+
  end

  rule last_name
    [A-Za-z]+ space [iI]+ | [iI]+ &[^iI]+ | [^iI] [A-Za-z]+
  end

  rule middle_name
    ([A-Za-z] '.') {
      delete('.')
    }
    | [A-Za-z]+
  end

  rule space
    [ \t]*
  end
end

Why don't you try something like this:

require 'citrus'

Citrus.eval(<<CITRUS)
grammar Name
  rule name
    first_name space middle_name space last_name space suffix? |
    first_name space last_name space suffix? |
    first_name
  end

  rule first_name
    [A-Za-z]+
  end

  rule middle_name
    ([A-Za-z] '.') {
      delete('.')
    }
    | [A-Za-z]+
  end

  rule last_name
    !suffix [A-Za-z]+
  end

  rule suffix
    [iI]+ | `jr` '.'?
  end

  rule space
    [ \t]*
  end
end
CITRUS

puts Name.parse("David Love III").dump

This grammar separates out the suffix of the name (I've allowed for "jr." as well, just to demonstrate) from the last name. You can see in the dump of the match how the various tokens are broken up.
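One thing that may be worth checking: as written, !suffix fails whenever the input at that point starts with something suffix can match, and [iI]+ matches a single leading "I". That would seem to rule out last names such as Immelman from the original question. The sketch below uses Ruby Regexp negative lookaheads as stand-ins for the ! predicate (an assumption: this is not Citrus, and the "tightened" variant is only one hypothetical way to address it) to compare the two behaviours:

```ruby
# Regexp stand-ins, not Citrus: (?!...) plays the role of the ! predicate.
suffix = "(?:[iI]+|[jJ][rR]\\.?)"

# Mirrors `!suffix [A-Za-z]+`: reject if a suffix can match at the start.
as_written = /\A(?!#{suffix})[A-Za-z]+\z/

# Hypothetical variant: reject only if the suffix is not followed by another letter.
tightened = /\A(?!#{suffix}(?![A-Za-z]))[A-Za-z]+\z/

%w[Love Immelman III Jr].each do |name|
  puts format("%-9s as written: %-5s tightened: %s",
              name, as_written.match?(name), tightened.match?(name))
end
```

In Citrus notation the tightened check might look like !(suffix ![A-Za-z]) [A-Za-z]+, though I have not run that against the grammar above.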

Ahh, nice. Thanks!