ashinn / irregex

Portable Efficient IrRegular Expressions for Scheme

Issues with or, eow/bow and any

silversquirl opened this issue · comments

This code works as expected:

(irregex-extract '(: bow "foo" eow) "foo foo bar foo") ; => '("foo" "foo" "foo")

This code doesn't:

(irregex-extract '(or (: bow "foo" eow) any) "foo foo bar foo") ; => '("foo" " " "f" "o" "o" " " "b" "a" "r" " " "f" "o" "o")
;; Expected: '("foo" " " "foo" " " "b" "a" "r" " " "foo")

If I can help out with fixing this (if it is a bug, that is) please let me know.

Any ideas what this issue is, where in the irregex source code I should look to find out, or how I could temporarily fix it?

I'm actually using irregex-fold, by the way; it's just quicker to show the issue using irregex-extract.

I wrote this piece of CHICKEN Scheme code (could easily be adapted to portable Scheme), which works similarly to irregex-fold (though it's probably slower) and works as expected:

(define (irregex-fold* irx
                       kons
                       knil
                       str
                       #!optional
                       (start 0)
                       (end (string-length str)))

  (let loop ((start start)
             (seed knil))

    (let ((m (irregex-search irx str start end)))
      (if m
          (loop (irregex-match-end-index m)
                (kons m seed))

          ;; There are no more matches
          seed))))

It can be used like this, to simulate irregex-extract:

(reverse (irregex-fold* '(or (: bow "foo" eow) any) (lambda (m seed) (cons (irregex-match-substring m) seed)) '() "foo foo bar foo")) ; => ("foo" " " "foo" " " "b" "a" "r" " " "foo")

I will use this as a workaround until this issue is fixed.

The code that detects the start of a word compares src with (car init), and that comparison only holds in irregex-fold/fast on the first iteration of lp. The same comparison also exists in bos, where it is correct. I think in this case it can be tossed out. If this is the first chunk in a sequence, I think we can safely say that it's a beginning of word.

I think this could be a correct patch, but I don't grok the chunking code enough to know for sure:

diff --git a/irregex.scm b/irregex.scm
index 7f29d4d..37ee19a 100644
--- a/irregex.scm
+++ b/irregex.scm
@@ -3400,11 +3400,10 @@
                (fail))))
         ((bow)
          (lambda (cnk init src str i end matches fail)
-           (if (and (or (if (> i ((chunker-get-start cnk) src))
-                            (not (char-alphanumeric? (string-ref str (- i 1))))
-                            (let ((ch (chunker-prev-char cnk src end)))
-                              (and ch (not (char-alphanumeric? ch)))))
-                        (and (eq? src (car init)) (eqv? i (cdr init))))
+           (if (and (if (> i ((chunker-get-start cnk) src))
+                        (not (char-alphanumeric? (string-ref str (- i 1))))
+                        (let ((ch (chunker-prev-char cnk src end)))
+                          (or (not ch) (not (char-alphanumeric? ch)))))
                     (if (< i end)
                         (char-alphanumeric? (string-ref str i))
                         (let ((next ((chunker-get-next cnk) src)))

Maybe @ashinn can take a look?
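
To make the intent of the patched condition easier to follow, here is a minimal sketch of the same beginning-of-word test on a single plain string, with no chunker involved. The names word-char?, bow-at? and bow-positions are hypothetical and not part of irregex; the only point carried over from the patch is that a missing previous character counts as a boundary, so the result depends only on the neighbouring characters and not on where a search happened to start.

```scheme
;; Simplified illustration only: one plain string, no chunking.
;; word-char?, bow-at? and bow-positions are hypothetical helpers,
;; not part of irregex.
(define (word-char? ch)
  (or (char-alphabetic? ch) (char-numeric? ch)))

;; True when index i is a beginning of word: there is no previous
;; character or it is not a word character, and the character at i
;; exists and is a word character.
(define (bow-at? str i)
  (and (or (= i 0)
           (not (word-char? (string-ref str (- i 1)))))
       (< i (string-length str))
       (word-char? (string-ref str i))))

;; Collect every index in str that is a beginning of word.
(define (bow-positions str)
  (let loop ((i 0) (acc '()))
    (if (> i (string-length str))
        (reverse acc)
        (loop (+ i 1)
              (if (bow-at? str i) (cons i acc) acc)))))

(bow-positions "foo foo bar foo") ; => (0 4 8 12)
```

Each of the four words is reported as a word start, including the ones a fold would only reach after its first iteration.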

Definitely a bug, and @sjamaan's analysis is correct, but I'll have to take a closer look at the patch. Should have time this weekend.

Great! Thanks @ashinn and @sjamaan.

There was another bug: the code should have been using (chunker-prev-char cnk init src). I fixed that, but otherwise adopted @sjamaan's patch as is. The code in eow appears to already be correct, though we should indeed have more tests for this.
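
As a starting point for such tests, the two cases from the original report could be turned into a small regression check. This is only a sketch: it assumes CHICKEN 5, where irregex is available as (chicken irregex), and it uses plain assert rather than whatever test framework the irregex test suite actually uses. The expected values are the ones given in the report above.

```scheme
;; Assumption: CHICKEN 5, with irregex bundled as (chicken irregex).
;; Other Scheme implementations load irregex differently.
(import (chicken base) (chicken irregex))

(define input "foo foo bar foo")

;; Baseline case from the report, which already worked.
(assert (equal? (irregex-extract '(: bow "foo" eow) input)
                '("foo" "foo" "foo")))

;; The case that failed: bow inside an or, alongside any.
;; The expected list is the one stated in the original report.
(assert (equal? (irregex-extract '(or (: bow "foo" eow) any) input)
                '("foo" " " "foo" " " "b" "a" "r" " " "foo")))
```

With an irregex that predates the fix, the second assertion should fail with the character-by-character result shown in the report.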

Thanks for the help guys!