msp-strath / Mary

Mary is the successor of Marx, a content delivery and assessment engine based on markdown and git

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Patterns for strings

gallais opened this issue · comments

If we are going to want to write parsers in Shonkier, we are going to need
to have patterns for strings. My first instinct was to reuse list syntax and
write:

['a'|str] to match strings with a prefix "a"
[c'''c|str] to match strings with a prefix "'" (c can be any identifier)
[c d e|str] to match strings of length at list 3

However this makes it impossible to distinguish between a pattern for a list
of string and one for a string. This breaks examples such as enum where
I used [_|_] to detect whether something was a (non-empty) list or whether
I could assume it was a string.

But maybe such examples should be forbidden? Thoughts?

There is a huge design space here, but we certainly need to get off the ground sooner rather than later. Head-tail splitting and character literals certainly do that.

Let me try to reconstruct what I had in old Shonky, but with something like our current notation. String patterns were strings with patterns spliced in.

  1. If the string's delimiters are foo"..."foo, then its splices are delimited by foo...foo.

  2. Splices should have at least one character between them (otherwise, their behaviour is undefined).

  3. A splice is a tail-pattern. Every splice has a closing delimiter, be it the nonempty text which follows it, or end-of-text. If a tail-pattern is exactly a variable, it matches the text up to but not including the first occurrence of its closing delimiter.

     "this`x`bar`y`" ? "this is barbaric" -> {x = " is ", y = "baric"}
    
  4. There are also head-patterns. A cell is an x-pattern if its head is a head-pattern and its tail is an x-pattern. If a head-pattern is exactly a variable, it matches a single character.

     "this`x`bar`[y|ys]`" ? "this is barbaric" -> {x = " is ", y='b', ys = "aric"}
    

We frown upon left-nested patterns, but they do have defined behaviour.

  1. A string with no splices, used as a pattern, matches exactly itself.

  2. A string with pattern splices may be used as a tail-pattern, but AFAICS, there is never a good reason to do that.

(It's not entirely clear whether this approach is a good idea, but if it is, it may be worth asking whether character literals need to be a separate thing.)

(I'm wondering whether some sort of Kleene star could be added to this technology.)

I'll hack up an experiment outside the main code base.

Meanwhile, I'm also thinking about adding the generation of parsers from grammars, and how that might fit with some sort of "type system"...

some sort of Kleene star

I'm worried of built-in regular expressions on the LHS. Seems like regexp are
rarely a good idea and it makes the whole pattern fragment a lot more complicated.

I do agree it would be nice to be able to write a full string instead of a list
of characters to match a prefix though and it is not clear how to do it by
reusing the existing pattern syntax. But we could maybe have e.g.
["hello " ||xs] to mean that we are matching "hello " ++ xs.

I'm not talking even all the regular languages, and I'm taking care to avoid search. But it'd be cool if we could, say, take a comma-separated list apart.

Here's a wee experiment.

{-# LANGUAGE OverloadedStrings, TupleSections #-}

module Spat where

import Control.Arrow ((***))
import Data.Text as T

myBreakOn :: Text -> Text -> Maybe (Text, Text)
myBreakOn "" t = Just ("", t)
myBreakOn needle haystack = case breakOn needle haystack of
  (_, "") -> Nothing
  (pre, incl) -> Just (pre, T.drop (T.length needle) incl)

type SP = Odd TPat Text

data Odd  a b = a :/: Even b a deriving Show
data Even a b = a :\: Odd b a | Done deriving Show

infixr 4 :/:
infixr 4 :\:

data TPat
  = Str String
  | HPat :& TPat
  | Nil
  deriving Show
infixr 5 :&

data HPat
  = Chr String
  | Lit Text
  deriving Show

type Matching = ([(String, Char)],[(String, Text)])

hMatch :: HPat -> Text -> Maybe ([(String, Char)], Text)
hMatch (Lit p) t = ([],) <$> stripPrefix p t
hMatch (Chr x) t = (((:[]) . (x,)) *** id) <$> uncons t
  
tMatch :: SP -> Text -> Maybe Matching
tMatch (Nil :/: Done) "" = return ([],[])
tMatch (Nil :/: Done) _  = Nothing
tMatch (Nil :/: d :\: r) t = do
  t <- stripPrefix d t
  tMatch r t
tMatch (h :& p :/: r) t = do
  (cm0, t)  <- hMatch h t
  (cm1, sm) <- tMatch (p :/: r) t
  return (cm0 ++ cm1, sm)
tMatch (Str x :/: Done) t = return ([], [(x, t)])
tMatch (Str x :/: d :\: r) t = do
  (s, t) <- myBreakOn d t
  (cm, sm) <- tMatch r t
  return (cm, (x, s) : sm)

example :: SP
example = Nil :/: "this" :\: Str "x" :/: "bar" :\: Chr "y" :& Str "ys" :/: Done

-- tMatch example "this is barbaric"
-- = Just ([("y",'b')],[("x"," is "),("ys","aric")])

We've been chatting about whether

"`[x|xs]`foo`ys`"

should match

"foofoo"

The question is which to do

  • find the first foo, then match what's before it agains [x|xs], in which case fail
  • match x with "f", then only hunt for foo once we have a string variable, giving xs = "oo" and ys = ""

We came to the conlusion that we should do the former, as the latter is encodable if we have an appropriate notion of guard, which is worth having, anyway.

So, here's the deal.

A string pattern is a string literal with patterns spliced into it.

"rhubarb`[x|xs]`custard`[y|ys]`crumble"

giving rise to an alternating sequence d0,p0,d1,p1,...dn.

To match this, a string must end as dn. It is then enough to consider how to match the remaining prefix.

What is a pattern and what does it match? That depends on its position. The whole of a pattern is in a tail position. If a cell pattern is in tail position, then its tail is also in a tail position. So in

[a b]

neither a nor b is in tail position, but in

[a | b]

b is in tail position.

Pattern variables in tail positions match any number of characters, up to the first occurrence of the delimiter which follows them, or to end of input if there is no such. Pattern variables in head positions match single characters. Atoms match their names (where [] has the empty name). A nested string pattern is treated by matching the string that a variable would match.

That is,

"`"`[x|xs]`"`foo`ys`"

tests whether the text before the first foo is nonempty.

OK, it turns out that when atoms show up spuriously in string splice patterns, I'm forced either to make them fail or make them match nothing. Either that, or I have to ditch being polymorphic in the representation of atoms, or be more precise about their relationship with text. I currently make them fail. It's just occurred to me that that is inconsistent with their meaning in expression splices. All atoms should match the empty input.

OK, folks, I think we're there! It's not the last word on writing parsers by a long chalk, but it's a good opening gambit.