db47h / ragel

Go driver for ragel scanners

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

ragel-go

godocb goreportb

This package provides a driver for ragel based scanners in Go for streamed input. It takes care of all the boilerplate code, letting the user focus on the ragel state machine definition.

It is only intended for use with ragel machines defined as scanners. See section 6.3 of the ragel guide.

Rationale

Streaming input data to ragel -- i.e. reading from an io.Reader as opposed to loading the whole input into memory, requires a driver, a piece of code that buffers input and manipulates pointers for the ragel state machine.

Here is the Go version of the main loop from the C-like scanner in the ragel examples:

func scanner(r io.Reader) {
    var (
        buf = make([]byte, BUFSIZE)
        cs, act, ts, te int
        have, curLine = 0, 1
        done bool
    )

    // ragel init code
    %% write init;

    for !done {
        var (
            err error
            p = have
            ln, pe int
            eof = -1
        )

        if ( p == len(buf) ) {
            fmt.Fprintf(os.Stderr, "out of buffer space\n" )
            os.Exit(1)
        }

        for {
            ln, err = r.Read(buf[p:])
            if ln != 0 || err != nil {
                break
            }
        }
        pe = p + ln;

        // trigger end of file on any error
        if err != nil {
            eof = pe
            done = true
        }

        // Run the ragel state machine
        %% write exec;

        if cs == clang_error {
            fmt.Fprintf(os.Stderr, "parse error (invalid char)\n" )
            break
        }

        if ts == 0 {
            have = 0
        } else {
            // There is a prefix to preserve, shift it over.
            have = pe - ts;
            copy(buf, buf[ts:pe])
            te -= ts
            ts = 0
        }
    }
}

In this example, the state machine actions just write tokens to stdout. This is a lot of scaffolding and yet a few important things are still missing, like a convenient way to retrieve the tokens and some mechanism to keep track of line and column position for these tokens.

This package provides all the boilerplate code needed to read data from an io.Reader, buffering, token position tracking (line and column) and error handling and lets you focus on your ragel state machine definition.

Quick start

In essence, creating a scanner is as simple as creating a ragel machine definition for your language of choice, along with a tiny Go interface implementation.

Ragel state machine definition

Here is an abbreviated version for the C-like language mentioned above (with support for comments, strings and char literals removed for brevity):

// scanner/myc.rl

// Package scanner provides a scanner for my-C language.
//
package scanner // import "github.com/me/myc/scanner"

import "github.com/db47h/ragel/v2"

// Tokens for my-C
//
const (
    Ident ragel.Token = iota
    Float
    Symbol
)

// ragel state machine definition.
%%{
    machine lang;

    # utf-8 support
    include WChar "utf8.rl";

    newline = '\n' @{ s.Newline(p) };

    main := |*
        alpha_u = uletter | '_';
        alnum_u = alpha_u | digit;

        # punctuation
        punct {
            s.Emit(ts, Symbol, string(data[ts:te]));
        };

        # identifiers
        alpha_u alnum_u* {
            s.Emit(ts, Ident, string(data[ts:te]))
        };

        # ignore white space and control codes
        newline | 0x00..0x20 | 0x7f;

        # integers
        digit+ ('.' digit+)* {
            s.Emit(ts, Float, string(data[ts:te]))
        };
    *|;
}%%

%%write data nofinal;

// MyC implements ragel.Interface. This is the canonical implementation
// available at https://godoc.org/github.com/db47h/ragel#Interface
// with the type name changed to MyC.
//
type MyC struct {}

func (MyC) Init(s *ragel.State) (int, int) {
    var cs, ts, te, act int
    %%write init;
    s.SaveVars(cs, ts, te, act)
    return %%{ write start; }%%, %%{ write error; }%%
}

func (MyC) Run(s *ragel.State, p, pe, eof int) (int, int) {
    cs, ts, te, act, data := s.GetVars()
    %%write exec;
    s.SaveVars(cs, ts, te, act)
    return p, pe
}

Note that this example imports a file named "utf8.rl" for UTF-8 support. This should be generated first and only once:

go run $GOPATH/github.com/db47h/ragel/cmd/unicode2ragel/main.go -o utf8.rl

Now supposing that the above example is saved in a file named myc.rl, run it through ragel with the following command:

    ragel -Z -G2 myc.rl -o myc.go

This will generate a myc.go file. I tend to use the -G2 because it yields the fastest running code in most of my use cases. Your mileage may vary.

You might also want add go:generate directives in some other .go file in the same package:

//go:generate ragel -Z -G2 myc.rl -o myc.go
//go:generate sed -i "1s|.*|// Code generated by ragel. DO NOT EDIT.|" myc.go

The second directive changes the first line of the generated code so that the generated file gets ignored by the Go tooling.

The MyC struct defined in the ragel state machine definition file is the ragel.Interface implementation that is used in the call to ragel.New. Besides changing the struct name to suit your needs (e.g. make it private), this bit of code must be copy-pasted as-is from the reference implementation. If you feel that you need to change it for some reason, please file an issue.

Scanning

Call ragel.New with an io.Reader and a MyC struct instance, then iterate over tokens with Scanner.Next:

// main.go

package main // import "github.com/me/myc"

import (
    "os"
    "fmt"

    "github.com/db47h/ragel/v2"
    "github.com/me/myc/scanner"
)

func main() {
    s := ragel.New("stdin", os.Stdin, scanner.MyC{})
    for {
        pos, tok, literal := s.Next()
        switch tok {
        case ragel.EOF:
            return
        case ragel.Error:
            fmt.Fprintf(os.Stderr, "scan error: %v: %v\n", s.Pos(pos), literal)
        default:
            fmt.Printf("%v: %d %s\n", s.Pos(), tok, literal)
        }
    }
}

That's it.

Supported Go and Ragel versions

Go

This package uses the Go modules convention introduced in Go 1.11. The current major version is v2, so it should be imported like this:

    import "github.com/db47h/ragel/v2"

This is supported by Go 1.11, 1.10.4 and 1.9.7.

It might be possible to get it to work with earlier versions using third-party package management tools, like dep, but this has not been tested.

Ragel

Ragel provides good Go support from version 6.9 onwards.

The development releases (versions 7.x) should be avoided for the time being. They have issues with that bit of code in the Go generator:

    return %%{ write start; }%%, %%{ write error; }%%

If you need to use a 7.x version, just change the write statements like so:

    return MachineName_start, MachineName_error

where MachineName is the FSM name as used in the ragel machine statement.

For all ragel versions, the alphabet data type needs to left to the default setting of byte, even for UTF-8 input. i.e. if you have any alphtype … ; statement, remove it.

Handling UTF-8 input

UTF-8 input is supported by including UTF-8 state machines automatically generated by the companion tool unicode2ragel.

Implementation details

The scanner reads input data in chunks of 32KiB (the default setting), then tokenizes the whole buffer and pushes the tokens found in a FIFO queue.

scanner.Next pops tokens from the queue and populates it as needed.

An alternative to using a queue would be to run the scanning loop in a goroutine and use a channel to emit tokens -- like text/template/parse in Go's standard library. This approach is however 2 to 5 times slower than using a queue (channels are great, just not here). See #4.

Error handling is done by returning a ragel.Error token. This can happen in 4 cases:

  • an IO error while reading from the input reader
  • the input buffer gets full
  • an illegal symbol is encountered (i.e. unhandled by the state machine definition)
  • a ragel action calls State.Errorf

The first two cases are non-recoverable errors; any subsequent call to scanner.Next returns ragel.EOF. For illegal symbols, scanning resumes at the following byte. Actions calling State.Errorf must update ragel's p variable (or issue an appropriate fgoto) for non-fatal errors.

The standard ragel state variables data, p, pe, eof, cs, ts, te and act are directly available in state machine definitions, i.e. data[p] is the current character.

The ragel.Interface

Ragel state machine definitions are templates where Go code is intermingled with ragel directives (like the %% write exec; in the above example) and ragel generates compilable Go code out of these templates.

The ragel directives %% write exec; and %% write init; generate code that expects a few variables to be accessible in the program scope where these directives appear. As a result, and although the variables names can be customized (they could even be struct fields), the ragel state machine definition is tightly coupled with the Go code that drives it. In order to have a package that is go-gettable and updatable independently of users' own state machine definitions, I needed to find a way to minimize the amount of Go code present in a state machine definition file.

Since there is no way around some copy-paste, I came up with the ragel.Interface interface and a canonical implementation that must be placed in state machine definition files. This implementation uses as little code as possible in order to limit potential bugs and avoid painful updates.

API

The ragel.Interface implementation discussed above brings a *State instance s into the scope of actions in ragel state machine definitions. Actions can use its Emit, Errorf and Newline methods. The first two are mostly self-explanatory. Users who need accurate line and column tracking must take care of calling Newline at the right places (see how this is done in the above example).

The State.SaveVars and State.GetVars methods manage the standard ragel state variables and are only meant to be used by the ragel.Interface implementation. As such, they must not be used by client code and should be considered private for all intents and purposes.

The Scanner type wraps things together and provides the scanning client API.

Behind the scenes, State and Scanner are thin wrappers around state, their sole purpose being a clean separation of concerns between the API used by state machine definitions from the API used by parsers.

LICENSE

Copyright 2018 Denis Bernard db047h@gmail.com

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

About

Go driver for ragel scanners

License:MIT License


Languages

Language:Go 100.0%