remove dependency on yara
williballenthin opened this issue · comments
yara is currently used as a pattern matching engine to find code sequences here:
https://github.com/mandiant/GoReSym/blob/f2009bd92819df2f69d74c4a32f1300be284a4fa/objfile/scanner.go
however, yara is written in C and has to be linked into this Go project. compiling, building, and distributing the result is tricky; see #23, #24, and #27 for issues influenced by this complexity.
the patterns that are passed to yara are pretty simple - they could be trivially converted to binary regular expressions since there's no use of yara's condition logic.
Go's regexp package doesn't support binary regexes; however, the https://github.com/rsc/binaryregexp module does, and it's easy to use. we should consider migrating from yara to binaryregexp to keep GoReSym pure-Go and therefore easier to build and distribute.
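To make the limitation concrete, here is a minimal sketch using only the standard library. It relies on the documented behavior that regexp treats patterns and input as UTF-8, so an escape like `\xfd` means the code point U+00FD rather than the raw byte 0xFD, and invalid UTF-8 bytes in the input are decoded as U+FFFD. The function names are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// stdlibMatchesRawBytes reports whether the standard regexp package matches
// the pattern \xfd\xe2 against the two raw bytes 0xFD 0xE2.
// (Hypothetical helper name, for illustration only.)
func stdlibMatchesRawBytes() bool {
	r := regexp.MustCompile(`\xfd\xe2`)
	// Both bytes are invalid UTF-8 on their own, so each is decoded as
	// U+FFFD and neither equals the code points U+00FD / U+00E2.
	return r.Match([]byte{0xfd, 0xe2})
}

// stdlibMatchesUTF8 reports whether \xfd matches the UTF-8 encoding of
// U+00FD, which is the two-byte sequence 0xC3 0xBD.
func stdlibMatchesUTF8() bool {
	r := regexp.MustCompile(`\xfd`)
	return r.Match([]byte{0xc3, 0xbd})
}

func main() {
	fmt.Println("raw bytes matched:", stdlibMatchesRawBytes())
	fmt.Println("UTF-8 encoding matched:", stdlibMatchesUTF8())
}
```

The first check should come out false and the second true, which is the opposite of what a byte-signature scanner needs.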
here's an example test demonstrating the translation from a yara pattern to a binaryregexp:
package main

import (
	"testing"

	"rsc.io/binaryregexp"
)

func TestBinaryRegex(t *testing.T) {
	t.Run("basic non-UTF-8 data", func(t *testing.T) {
		r := binaryregexp.MustCompile(`\xfd\xe2`)
		if !r.MatchString("\xfd\xe2") {
			t.Errorf("failed to match non-UTF-8 data")
		}
	})

	// x64firstmoduledata
	// $sig = { 48 8D 0? ?? ?? ?? ?? EB ?? 48 8? 8? ?? 02 00 00 66 0F 1F 44 00 00 }
	t.Run("x64firstmoduledata", func(t *testing.T) {
		r := binaryregexp.MustCompile(`\x48\x8D[\x00-\x0F]....\xEB.\x48[\x80-\x8F][\x80-\x8F].\x02\x00\x00\x66\x0F\x1F\x44\x00\x00`)

		// 0x000000000044D80A: 48 8D 0D 8F DA 26 00    lea rcx, runtime_firstmoduledata
		// 0x000000000044D811: EB 0D                   jmp short loc_44D820
		// 0x000000000044D813: 48 8B 89 30 02 00 00    mov rcx, [rcx+230h]
		// 0x000000000044D81A: 66 0F 1F 44 00 00       nop word ptr [rax+rax+00h] <- always seems to be present
		if !r.Match([]byte{0x48, 0x8D, 0x0D, 0x8F, 0xDA, 0x26, 0x00, 0xEB, 0x0D, 0x48, 0x8B, 0x89, 0x30, 0x02, 0x00, 0x00, 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00}) {
			t.Errorf("failed to match data verbatim")
		}

		// extra bytes at start
		if !r.Match([]byte{0xFF, 0xFF, 0x48, 0x8D, 0x0D, 0x8F, 0xDA, 0x26, 0x00, 0xEB, 0x0D, 0x48, 0x8B, 0x89, 0x30, 0x02, 0x00, 0x00, 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00}) {
			t.Errorf("failed to match data with prefix bytes")
		}

		// extra bytes at end
		if !r.Match([]byte{0x48, 0x8D, 0x0D, 0x8F, 0xDA, 0x26, 0x00, 0xEB, 0x0D, 0x48, 0x8B, 0x89, 0x30, 0x02, 0x00, 0x00, 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00, 0xFF, 0xFF}) {
			t.Errorf("failed to match data with postfix bytes")
		}

		// first byte doesn't match
		if r.Match([]byte{0xFF, 0x8D, 0x0D, 0x8F, 0xDA, 0x26, 0x00, 0xEB, 0x0D, 0x48, 0x8B, 0x89, 0x30, 0x02, 0x00, 0x00, 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00}) {
			t.Errorf("unexpected match")
		}

		// third byte at the low end of its nibble range (0x00)
		if !r.Match([]byte{0x48, 0x8D, 0x00, 0x8F, 0xDA, 0x26, 0x00, 0xEB, 0x0D, 0x48, 0x8B, 0x89, 0x30, 0x02, 0x00, 0x00, 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00}) {
			t.Errorf("failed to match data variant 1")
		}

		// third byte at the high end of its nibble range (0x0F)
		if !r.Match([]byte{0x48, 0x8D, 0x0F, 0x8F, 0xDA, 0x26, 0x00, 0xEB, 0x0D, 0x48, 0x8B, 0x89, 0x30, 0x02, 0x00, 0x00, 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00}) {
			t.Errorf("failed to match data variant 2")
		}
	})
}
Running tool: /usr/local/go/bin/go test -timeout 30s -run ^TestBinaryRegex$ github.com/mandiant/GoReSym
ok github.com/mandiant/GoReSym 0.002s
if you'd like me to open a PR with these changes I'd be happy to do so.
I completely agree with you here, I'd thought the cgo process would be more transparent. I originally introduced the dependency out of necessity for an efficient signature scanner. I had hand-rolled my own, and testing showed it was orders of magnitude slower than yara, hence the switch.
We have a few interesting constraints, such as nibble-level byte signatures being used, and also I believe skip ranges (which could be converted as two or more sub-regexes matching within a range). If these constraints can be obeyed I would be in complete support of this!
such as nibble level byte signatures being used
as shown above, i think we can do things like: [\x80-\x8F] (regex) instead of 8? (yara). It's a little more verbose, but the logic is equivalent.
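The nibble translation can be sketched as a small token converter. This is only an illustration under some assumptions: `tokenToRegex` and `patternToRegex` are hypothetical names (not part of GoReSym, yara, or binaryregexp), the input is a space-separated yara hex string without jumps or alternations, and a high-nibble wildcard such as `?8` is expanded into an explicit 16-entry character class.

```go
package main

import (
	"fmt"
	"strings"
)

const hexDigits = "0123456789ABCDEF"

func isHex(c byte) bool { return strings.IndexByte(hexDigits, c) >= 0 }

// tokenToRegex converts one two-character yara hex token ("48", "8?", "?8",
// "??") into a regex fragment. Hypothetical helper for illustration.
func tokenToRegex(tok string) (string, error) {
	tok = strings.ToUpper(tok)
	if len(tok) != 2 {
		return "", fmt.Errorf("token %q is not two characters", tok)
	}
	hi, lo := tok[0], tok[1]
	hiWild, loWild := hi == '?', lo == '?'
	if (!hiWild && !isHex(hi)) || (!loWild && !isHex(lo)) {
		return "", fmt.Errorf("token %q is not a hex byte or nibble wildcard", tok)
	}
	switch {
	case hiWild && loWild:
		// Any byte. Assumption: the caller compiles with the (?s) flag,
		// since "." normally excludes the newline byte 0x0A.
		return ".", nil
	case loWild:
		// Low nibble free: a contiguous range such as [\x80-\x8F].
		return fmt.Sprintf(`[\x%c0-\x%cF]`, hi, hi), nil
	case hiWild:
		// High nibble free: 16 scattered values, listed explicitly.
		var b strings.Builder
		b.WriteByte('[')
		for i := 0; i < len(hexDigits); i++ {
			fmt.Fprintf(&b, `\x%c%c`, hexDigits[i], lo)
		}
		b.WriteByte(']')
		return b.String(), nil
	default:
		return fmt.Sprintf(`\x%c%c`, hi, lo), nil
	}
}

// patternToRegex converts a whole space-separated hex string.
func patternToRegex(pattern string) (string, error) {
	var b strings.Builder
	for _, tok := range strings.Fields(pattern) {
		frag, err := tokenToRegex(tok)
		if err != nil {
			return "", err
		}
		b.WriteString(frag)
	}
	return b.String(), nil
}

func main() {
	re, err := patternToRegex("48 8D 0? ?? ?? ?? ?? EB ?? 48 8? 8? ?? 02 00 00 66 0F 1F 44 00 00")
	if err != nil {
		panic(err)
	}
	fmt.Println(re)
}
```

One caveat worth checking before relying on this: in Go's standard regexp syntax `.` does not match `\n` unless the `s` flag is set, and binaryregexp presumably inherits that, so a wildcard byte that happens to be 0x0A would be missed unless the final pattern is prefixed with `(?s)`.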
, and also I believe skip ranges
i think we could do .{0,50} (regex) instead of [0-50] (yara).
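A rough sketch of the jump translation, assuming the yara forms `[n]`, `[n-m]`, and `[n-]` (the fully unbounded `[-]` form is rejected here for simplicity). `jumpToRegex` is a hypothetical name. The 1000 cap mirrors the documented repetition limit of Go's standard regexp syntax; whether binaryregexp enforces the same limit is an assumption to verify.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// maxRepeat mirrors the repetition cap documented for Go's regexp syntax.
// Assumption: binaryregexp, as a fork of regexp, has the same restriction.
const maxRepeat = 1000

// jumpToRegex converts a yara jump such as "[0-50]" into a counted wildcard
// such as ".{0,50}". Hypothetical helper for illustration.
func jumpToRegex(jump string) (string, error) {
	if len(jump) < 3 || jump[0] != '[' || jump[len(jump)-1] != ']' {
		return "", fmt.Errorf("jump %q is not of the form [n], [n-m] or [n-]", jump)
	}
	parts := strings.SplitN(jump[1:len(jump)-1], "-", 2)

	lo, err := strconv.Atoi(strings.TrimSpace(parts[0]))
	if err != nil || lo < 0 || lo > maxRepeat {
		return "", fmt.Errorf("jump %q has an unsupported lower bound", jump)
	}
	if len(parts) == 1 {
		// Fixed-length jump: [n]
		return fmt.Sprintf(".{%d}", lo), nil
	}

	hiStr := strings.TrimSpace(parts[1])
	if hiStr == "" {
		// Open-ended jump: [n-]
		return fmt.Sprintf(".{%d,}", lo), nil
	}
	hi, err := strconv.Atoi(hiStr)
	if err != nil || hi < lo || hi > maxRepeat {
		return "", fmt.Errorf("jump %q has an unsupported upper bound", jump)
	}
	// Note: no space after the comma; ".{0, 50}" is not a counted repetition.
	return fmt.Sprintf(".{%d,%d}", lo, hi), nil
}

func main() {
	for _, j := range []string{"[0-50]", "[4]", "[10-]"} {
		re, err := jumpToRegex(j)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> %s\n", j, re)
	}
}
```

As with single-byte wildcards, the resulting `.` only covers every byte value if the pattern is compiled with the `(?s)` flag. Jumps longer than the cap would need a different approach, such as the split into sub-regexes suggested earlier in the thread.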
Sorry I didn't fully review your suggested alternative (it's late here). That logic seems perfectly sound to me 😁
If you'd like to open a PR I would review more carefully soon and be very appreciative!