remove dependency on yara

williballenthin opened this issue a year ago

yara is currently used as a pattern matching engine to find code sequences here:
https://github.com/mandiant/GoReSym/blob/f2009bd92819df2f69d74c4a32f1300be284a4fa/objfile/scanner.go

however, yara is written in C and has to be linked into this Go project. compilation, build, and distribution are tricky; see #23 #24 and #27 for issues influenced by this complexity.

the patterns that are passed to yara are pretty simple - they could be trivially converted to binary regular expressions since there's no use of yara's condition logic.

the Go regexp module doesn't support binary regexes; however, the https://github.com/rsc/binaryregexp module does, and it's easy to use. we should consider migrating from yara to binaryregexp to keep GoReSym pure-Go and therefore easier to build and distribute.

here's an example test demonstrating the translation from a yara pattern to a binaryregexp:

package main

import (
	"testing"

	"rsc.io/binaryregexp"
)

func TestBinaryRegex(t *testing.T) {
	t.Run("basic non-UTF-8 data", func(t *testing.T) {
		r := binaryregexp.MustCompile(`\xfd\xe2`)

		if !r.MatchString("\xfd\xe2") {
			t.Errorf("failed to match non-UTF-8 data")
		}
	})

	// x64firstmoduledata
	// $sig = { 48 8D 0? ?? ?? ?? ?? EB ?? 48 8? 8? ?? 02 00 00 66 0F 1F 44 00 00 }
	t.Run("x64firstmoduledata", func(t *testing.T) {
		// (?s) lets "." match 0x0A as well, so each "." behaves like yara's ??
		r := binaryregexp.MustCompile(`(?s)\x48\x8D[\x00-\x0F]....\xEB.\x48[\x80-\x8F][\x80-\x8F].\x02\x00\x00\x66\x0F\x1F\x44\x00\x00`)

		// 0x000000000044D80A: 48 8D 0D 8F DA 26 00                    lea     rcx, runtime_firstmoduledata
		// 0x000000000044D811: EB 0D                                   jmp     short loc_44D820
		// 0x000000000044D813: 48 8B 89 30 02 00 00                    mov     rcx, [rcx+230h]
		// 0x000000000044D81A: 66 0F 1F 44 00 00                       nop     word ptr [rax+rax+00h]    <- always seems to be present
		if !r.Match([]byte{0x48, 0x8D, 0x0D, 0x8F, 0xDA, 0x26, 0x00, 0xEB, 0x0D, 0x48, 0x8B, 0x89, 0x30, 0x02, 0x00, 0x00, 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00}) {
			t.Errorf("failed to match data verbatim")
		}

		// extra bytes at start
		if !r.Match([]byte{0xFF, 0xFF, 0x48, 0x8D, 0x0D, 0x8F, 0xDA, 0x26, 0x00, 0xEB, 0x0D, 0x48, 0x8B, 0x89, 0x30, 0x02, 0x00, 0x00, 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00}) {
			t.Errorf("failed to match data with prefix bytes")
		}

		// extra bytes at end
		if !r.Match([]byte{0x48, 0x8D, 0x0D, 0x8F, 0xDA, 0x26, 0x00, 0xEB, 0x0D, 0x48, 0x8B, 0x89, 0x30, 0x02, 0x00, 0x00, 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00, 0xFF, 0xFF}) {
			t.Errorf("failed to match data with postfix bytes")
		}

		// first byte doesn't match
		if r.Match([]byte{0xFF, 0x8D, 0x0D, 0x8F, 0xDA, 0x26, 0x00, 0xEB, 0x0D, 0x48, 0x8B, 0x89, 0x30, 0x02, 0x00, 0x00, 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00}) {
			t.Errorf("unexpected match")
		}

		// byte 2 at the low end of its range (0x00)
		if !r.Match([]byte{0x48, 0x8D, 0x00, 0x8F, 0xDA, 0x26, 0x00, 0xEB, 0x0D, 0x48, 0x8B, 0x89, 0x30, 0x02, 0x00, 0x00, 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00}) {
			t.Errorf("failed to match data variant 1")
		}
		// byte 2 at the high end of its range (0x0F)
		if !r.Match([]byte{0x48, 0x8D, 0x0F, 0x8F, 0xDA, 0x26, 0x00, 0xEB, 0x0D, 0x48, 0x8B, 0x89, 0x30, 0x02, 0x00, 0x00, 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00}) {
			t.Errorf("failed to match data variant 2")
		}
	})
}

Running tool: /usr/local/go/bin/go test -timeout 30s -run ^TestBinaryRegex$ github.com/mandiant/GoReSym

ok  	github.com/mandiant/GoReSym	0.002s
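
The hand translation in the test above could also be automated. The following is a minimal sketch, not GoReSym's implementation: `tokenToRegex` and `yaraHexToRegex` are hypothetical helper names, and the sketch only handles two-character hex tokens (`48`, `8?`, `?8`, `??`). It prepends `(?s)` so that `.` also matches the byte 0x0A, on the assumption that binaryregexp honors that flag the same way the standard regexp package does.

```go
package main

import (
	"fmt"
	"strings"
)

const hexDigits = "0123456789ABCDEF"

// tokenToRegex converts one two-character yara hex token into a regex fragment.
// Hypothetical helper for illustration only.
func tokenToRegex(tok string) (string, error) {
	if len(tok) != 2 {
		return "", fmt.Errorf("unsupported token %q", tok)
	}
	tok = strings.ToUpper(tok)
	hi, lo := tok[0], tok[1]
	isHex := func(c byte) bool { return strings.IndexByte(hexDigits, c) >= 0 }

	switch {
	case hi == '?' && lo == '?':
		// Full wildcard: any byte.
		return ".", nil
	case isHex(hi) && isHex(lo):
		// Literal byte.
		return `\x` + tok, nil
	case isHex(hi) && lo == '?':
		// High nibble fixed: the 16 candidates form one contiguous range.
		return fmt.Sprintf(`[\x%c0-\x%cF]`, hi, hi), nil
	case hi == '?' && isHex(lo):
		// Low nibble fixed: the 16 candidates are scattered, so list them all.
		var b strings.Builder
		b.WriteByte('[')
		for i := 0; i < len(hexDigits); i++ {
			fmt.Fprintf(&b, `\x%c%c`, hexDigits[i], lo)
		}
		b.WriteByte(']')
		return b.String(), nil
	}
	return "", fmt.Errorf("unsupported token %q", tok)
}

// yaraHexToRegex translates the whitespace-separated body of a yara hex string.
// Jumps such as [0-50] and alternations are deliberately not handled here.
func yaraHexToRegex(sig string) (string, error) {
	var b strings.Builder
	b.WriteString("(?s)") // let "." match 0x0A too
	for _, tok := range strings.Fields(sig) {
		frag, err := tokenToRegex(tok)
		if err != nil {
			return "", err
		}
		b.WriteString(frag)
	}
	return b.String(), nil
}

func main() {
	pattern, err := yaraHexToRegex("48 8D 0? ?? ?? ?? ?? EB ?? 48 8? 8? ?? 02 00 00 66 0F 1F 44 00 00")
	if err != nil {
		panic(err)
	}
	fmt.Println(pattern)
}
```

The resulting string is meant to be passed to `binaryregexp.MustCompile`; generating it needs nothing beyond the standard library.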

Willi Ballenthin · Answer 1 · Wed Aug 02 2023 19:54:58 GMT+0800 (China Standard Time)

if you'd like me to open a PR with these changes I'd be happy to do so.

Stephen Eckels · Answer 2 · Wed Aug 02 2023 21:22:17 GMT+0800 (China Standard Time)

I completely agree with you here; I'd thought the cgo process would be more transparent. I originally introduced the dependency out of necessity for an efficient signature scanner. I had hand-rolled my own and testing showed it was orders of magnitude slower than yara, hence the switch.

We have a few interesting constraints, such as nibble level byte signatures being used, and also I believe skip ranges (which could be converted to two or more sub regexes matching within a range). If these constraints can be obeyed I would be in complete support of this!
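
The "two or more sub regexes matching within a range" idea can be sketched independently of the regex engine. The example below assumes the match offsets of each sub-pattern have already been collected (for instance via a `FindAllIndex`-style call, which binaryregexp is expected to mirror from the standard regexp package). The `span` type and `joinWithinGap` function are hypothetical names, not part of GoReSym.

```go
package main

import "fmt"

// span is a half-open [start, end) byte range of one sub-pattern match.
type span struct{ start, end int }

// joinWithinGap pairs every match of the first sub-pattern with every match
// of the second that begins between minGap and maxGap bytes after the first
// one ends. Each result covers the combined range of the pair.
func joinWithinGap(first, second []span, minGap, maxGap int) []span {
	var out []span
	for _, a := range first {
		for _, b := range second {
			gap := b.start - a.end
			if gap >= minGap && gap <= maxGap {
				out = append(out, span{a.start, b.end})
			}
		}
	}
	return out
}

func main() {
	first := []span{{0, 2}, {100, 102}}
	second := []span{{10, 12}, {500, 502}}
	// Only the pair separated by 8 bytes falls inside the 0..50 byte window.
	fmt.Println(joinWithinGap(first, second, 0, 50)) // prints [{0 12}]
}
```

The nested loop is quadratic and kept only for clarity; with both slices sorted by offset, a two-pointer sweep would do the same pairing in linear time.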

Willi Ballenthin · Answer 3 · Wed Aug 02 2023 21:42:14 GMT+0800 (China Standard Time)

> such as nibble level byte signatures being used

as shown above, i think we can do things like: [\x80-\x8F] (regex) instead of 8? (yara). It's a little more verbose, but the logic is equivalent.

> and also I believe skip ranges

i think we could do .{0,50} (regex, with no space inside the braces) instead of [0-50] (yara).
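
A minimal sketch of that jump translation follows. `jumpToRegex` is a hypothetical helper, not an existing GoReSym or binaryregexp function. It handles only bounded jumps (`[n]` and `[n-m]`) and rejects bounds above 1000, because Go's regexp syntax caps repeat counts at 1000 and binaryregexp, being a copy of that package, presumably inherits the limit. The `.` only covers every byte when the surrounding pattern is compiled with the `(?s)` flag.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// maxRepeat mirrors the repeat-count cap in Go's regexp syntax; that
// binaryregexp shares this cap is an assumption.
const maxRepeat = 1000

// jumpToRegex converts a bounded yara jump such as "[0-50]" or "[4]" into a
// regex fragment such as ".{0,50}" or ".{4}". Unbounded jumps like "[10-]"
// are rejected.
func jumpToRegex(jump string) (string, error) {
	body := strings.TrimSpace(jump)
	if !strings.HasPrefix(body, "[") || !strings.HasSuffix(body, "]") {
		return "", fmt.Errorf("not a jump: %q", jump)
	}
	body = body[1 : len(body)-1]

	parts := strings.SplitN(body, "-", 2)
	lo, err := strconv.Atoi(strings.TrimSpace(parts[0]))
	if err != nil {
		return "", fmt.Errorf("bad lower bound in %q", jump)
	}
	hi := lo
	if len(parts) == 2 {
		hi, err = strconv.Atoi(strings.TrimSpace(parts[1]))
		if err != nil {
			return "", fmt.Errorf("bad upper bound in %q", jump)
		}
	}
	if lo < 0 || hi < lo || hi > maxRepeat {
		return "", fmt.Errorf("unsupported jump bounds in %q", jump)
	}

	// No space after the comma: "{0, 50}" would not parse as a repetition.
	if lo == hi {
		return fmt.Sprintf(".{%d}", lo), nil
	}
	return fmt.Sprintf(".{%d,%d}", lo, hi), nil
}

func main() {
	frag, err := jumpToRegex("[0-50]")
	if err != nil {
		panic(err)
	}
	fmt.Println(frag) // prints .{0,50}
}
```

Jumps wider than the cap would need another strategy, such as splitting the signature into sub-patterns and checking the distance between their matches.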

Stephen Eckels · Answer 4 · Wed Aug 02 2023 21:46:07 GMT+0800 (China Standard Time)

Sorry I didn't fully review your suggested alternative (it's late here). That logic seems perfectly sound to me 😁

If you'd like to open a PR I would review more carefully soon and be very appreciative!