google / zoekt

Fast trigram based code search

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Panic when indexing document with symbols that cross rune boundaries

mmm444 opened this issue · comments

Sometimes ctags (both exuberant and universal) return symbol boundaries that cross rune boundaries. This leads to a panic during indexing.

I have reduced the issue that happened during indexing my codebase to this test:

func TestBadRunesFromCtags(t *testing.T) {
	content := []byte("Hlášení: {0}")
	b := testIndexBuilder(t, nil,
		Document{
			Name:    "bad_rune.cs",
			Content: content,
			Symbols: []DocumentSection{{6, 9}}, // splits í rune (ctags does this)
		},
	)
	searchForTest(t, b, &query.Substring{Pattern: "Hlášení"})
}

which panics:

--- FAIL: TestBadRunesFromCtags (0.00s)
panic: runtime error: index out of range [recovered]
        panic: runtime error: index out of range

goroutine 340 [running]:
testing.tRunner.func1(0xc42012a2d0)
        /usr/local/go/src/testing/testing.go:711 +0x2d2
panic(0x799260, 0xa05be0)
        /usr/local/go/src/runtime/panic.go:491 +0x283
github.com/google/zoekt.(*postingsBuilder).newSearchableString(0xc42007c500, 0xc42012c34f, 0xf, 0x1, 0xc42012c350, 0x1, 0x1, 0x2, 0xc42023fb30, 0x0, ...)
        /home/mmm/go/src/github.com/google/zoekt/indexbuilder.go:138 +0xc07
github.com/google/zoekt.(*IndexBuilder).Add(0xc4200b8420, 0x803c18, 0x7, 0xc42012c340, 0xf, 0x10, 0x0, 0x0, 0x0, 0x0, ...)
        /home/mmm/go/src/github.com/google/zoekt/indexbuilder.go:376 +0x451
github.com/google/zoekt.testIndexBuilder(0xc42012a2d0, 0x0, 0xc420051f20, 0x1, 0x1, 0x10)
        /home/mmm/go/src/github.com/google/zoekt/index_test.go:48 +0x1b4
github.com/google/zoekt.TestBadRunesFromCtags(0xc42012a2d0)
        /home/mmm/go/src/github.com/google/zoekt/index_test.go:1743 +0x18b
testing.tRunner(0xc42012a2d0, 0x81e5c0)
        /usr/local/go/src/testing/testing.go:746 +0xd0
created by testing.(*T).Run
        /usr/local/go/src/testing/testing.go:789 +0x2de
exit status 2
FAIL    github.com/google/zoekt 0.042s

This happened on latest HEAD which was at 3f9ebd3 at time of writing this issue report.

I think the issue can be solved by changing == to <= on lines https://github.com/google/zoekt/blob/master/indexbuilder.go#L94 and https://github.com/google/zoekt/blob/master/indexbuilder.go#L118 . But I am not sure if this is the right way and place to deal with it.

The document sections are generated by Zoekt too, as CTags doesn't actually output byte ranges, IIRC. Can you check how this invalid input actually got generated?

I'm a bit torn whether this should generate an error rather than silently swallowing/modifying the given section .

Uh oh, sorry about the confusion. I was expecting something that is not true.

The way to trick ctags into spliting the rune is simple:

# cat test.cs
aá:{0}
# hexdump -C test.cs 
00000000  61 c3 a1 3a 7b 30 7d 0a                           |a..:{0}.|
00000008
# ctags --version
Universal Ctags 0.0.0(7918d19f), Copyright (C) 2015 Universal Ctags Team
Universal Ctags is derived from Exuberant Ctags.
Exuberant Ctags 5.8, Copyright (C) 1996-2009 Darren Hiebert
  Compiled: Feb 12 2018, 11:44:40
  URL: https://ctags.io/
  Optional compiled features: +wildcards, +regex, +multibyte, +option-directory, +xpath
# ctags -f- test.cs
aÃ	test.cs	/^aá:{0}$/;"	p
# ctags -f- test.cs | hexdump -C
00000000  61 c3 09 74 65 73 74 2e  63 73 09 2f 5e 61 c3 a1  |a..test.cs./^a..|
00000010  3a 7b 30 7d 24 2f 3b 22  09 70 0a                 |:{0}$/;".p.|
0000001b

In the hexdump you can see that there are only 2 bytes before the tab 0x09. So the reasons why this happens are basically a confusing file extension and some non-ASCII UTF-8 inside that resembles a programming language.

The more I think about it I am also not sure whether this should be fixed on the Zoekt side. But where? Is this a bug of ctags encoding handling then?

Which version of ctags do you use? I think universal-ctag --json should handle this correctly, no?

oh sorry, Ctags 0.0.0(7918d19f) - I missed the blurb.

I'll have a look.

I was using HEAD version universal-ctags but I didn't rename it to universal-ctags. I instead changed the name in the zoekt source code. And that turned out to be a big mistake.

When I run the universal-ctags with JSON output it fails in a strange way:

# universal-ctags --output-format=json -f- test.cs 
(null)

But this output is actually handled in zoekt. So in the end everything works.

Do you think it is worth reporting the (null) output issue to universal-ctags?

Feel free to close this issue. Thank you for you time!

yes, please do.

There is still something to do here, though, but I'll take care of it.