google / codesearch

Fast, indexed regexp search over large file trees

Home Page: http://swtch.com/~rsc/regexp/regexp4.html

cindex is silently ignoring some text files and there's no way to tell why

victor-sudakov opened this issue

I have a couple of text files (UTF-8, with mostly ASCII and Cyrillic characters) which cindex/csearch ignore.

The worst problem is that I cannot tell why cindex ignores them: there is no "verbose" option to cindex. Maybe there is a character somewhere in the file that cindex does not like, but how do I tell?

iconv -f utf-8 -t utf-16 < text/book1.txt > /dev/null never complains, so I presume book1.txt is valid UTF-8. But cindex still excludes it from search.
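As a cross-check on the iconv result, a short script can report where decoding fails, if it fails at all. This is only a sketch: the name `first_invalid_utf8` is made up for illustration, and it tests strict UTF-8 validity, which may not be the same check cindex applies.

```python
import sys
from typing import Optional


def first_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that could not be decoded.
        return err.start
    return None


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_utf8.py text/book1.txt
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            offset = first_invalid_utf8(fh.read())
        if offset is None:
            print(f"{path}: valid UTF-8")
        else:
            print(f"{path}: invalid byte at offset {offset}")
```

If this also reports the file as valid, the encoding is probably not what cindex is objecting to.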

codesearch version:
codesearch/oldstable,now 0.0~hg20120502-3+b11 amd64 on Debian 10.

The problem may be related to #26

I believe there is also a line length limit that causes files to not be indexed.

You might have better luck switching to zoekt if possible.

> I believe there is also a line length limit that causes files to not be indexed.

I've just tried glimpse on it. glimpseindex skips this file too, but it can be forced to index it with glimpseindex -E.
There are indeed long lines somewhere in the file.

$ file text/book1.txt 
text/book1.txt: UTF-8 Unicode text, with very long lines

I should probably grep the text for long lines and see what comes out.
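One rough way to do that is to list every line over some byte length. The sketch below is a throwaway helper, not anything shipped with codesearch: `long_lines` and the 2000-byte default are placeholders, since the limit cindex actually uses is not stated in this thread.

```python
import sys
from typing import Iterator, Tuple


def long_lines(data: bytes, limit: int = 2000) -> Iterator[Tuple[int, int]]:
    """Yield (line_number, byte_length) for each line longer than `limit` bytes."""
    # Lengths are measured in bytes, so a Cyrillic character counts as two.
    for number, line in enumerate(data.split(b"\n"), start=1):
        if len(line) > limit:
            yield number, len(line)


if __name__ == "__main__":
    # Usage (hypothetical script name): python long_lines.py text/book1.txt
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            for number, length in long_lines(fh.read()):
                print(f"{path}:{number}: {length} bytes")
```

Lowering `limit` step by step shows which lines stand out, which is handy when the real threshold is unknown.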

> You might have better luck switching to zoekt if possible.

Not in the Debian repo unfortunately.

I've found the offending line. It is not even long, but removing it allows indexing again. The whole line is below (yes, that is the entire line):

(HTTPConnectionPool(host='172.31.38.116', port=8008): Max retries exceeded with

I'm really surprised. cindex should have a switch either to disable its file-content heuristics or to report verbosely what they decide.