cindex is silently ignoring some text files and there's no way to tell why

Question

victor-sudakov opened this issue 3 years ago

I have a couple of text files (UTF-8, with mostly ASCII and Cyrillic characters) which cindex/csearch ignore.

The worst problem is that I cannot tell why cindex ignores them: cindex has no "verbose" option. Maybe there is a character somewhere in the file that cindex does not like, but how do I tell?

iconv -f utf-8 -t utf-16 < text/book1.txt > /dev/null never complains so I presume the book1.txt file is valid UTF-8. But cindex excludes it from search.
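The iconv check above can be cross-checked with a short script that also reports where the first bad byte sits, if there is one. This is only a sketch: the function name `first_invalid_utf8` is made up for illustration, and it checks nothing beyond UTF-8 validity, so a clean result does not explain why cindex skips the file.

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (byte_offset, line_number) of the first invalid UTF-8
    sequence in data, or None if the whole buffer decodes cleanly.

    Hypothetical helper for illustration; not part of codesearch.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the first undecodable byte.
        # Lines are 1-based: count the newlines before that offset.
        line_number = data.count(b"\n", 0, err.start) + 1
        return err.start, line_number
    return None
```

To use it on the file in question, read it in binary mode, e.g. `first_invalid_utf8(open("text/book1.txt", "rb").read())`.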

codesearch version:
codesearch/oldstable,now 0.0~hg20120502-3+b11 amd64 on Debian 10.

The problem may be related to #26

Damian Gryski · Answer 1 · Tue Jan 04 2022 10:52:03 GMT+0800 (China Standard Time)

I believe there is also a line length limit that causes files to not be indexed.

You might have better luck switching to zoekt if possible.

Victor Sudakov · Answer 2 · Tue Jan 04 2022 15:08:09 GMT+0800 (China Standard Time)

> I believe there is also a line length limit that causes files to not be indexed.

I've just tried glimpse on it. glimpseindex skips this file too, though it can be forced to index it with glimpseindex -E.
There are indeed long lines somewhere in the file.

$ file text/book1.txt 
text/book1.txt: UTF-8 Unicode text, with very long lines

I should probably grep the text for long lines and see what comes out.
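One way to do that search is a small script that lists every line above a given length. This is a sketch under assumptions: `long_lines` is a hypothetical helper, the default limit of 2000 is a placeholder (the actual cindex threshold is not stated in this thread), and lengths are measured in bytes rather than characters, which matters for Cyrillic text.

```python
from typing import List, Tuple


def long_lines(data: bytes, limit: int = 2000) -> List[Tuple[int, int]]:
    """Return (line_number, byte_length) for each line longer than limit.

    The default limit is an assumed placeholder, not a confirmed
    cindex setting. Line numbers are 1-based.
    """
    hits = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        if len(line) > limit:
            hits.append((number, len(line)))
    return hits
```

For example, `long_lines(open("text/book1.txt", "rb").read())` returns an empty list when no line exceeds the limit.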

> You might have better luck switching to zoekt if possible.

Not in the Debian repo unfortunately.

Victor Sudakov · Answer 3 · Tue Jan 04 2022 17:21:30 GMT+0800 (China Standard Time)

I've found the offending line. It is not even long, but removing it allows indexing again. The whole line is below (yes, that is the entire line by itself):

(HTTPConnectionPool(host='172.31.38.116', port=8008): Max retries exceeded with

I'm really surprised. There should be a cindex switch to either disable the file-content heuristics or report verbosely why a file was skipped.
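Until such a switch exists, a line like this can be located by bisection instead of by eye. The sketch below assumes a hypothetical callback `is_accepted(lines)` that returns True when the indexer accepts a file made of those lines; in practice it might write the lines to a temporary file, index it, and check whether a search finds it. It also assumes that once a prefix of the file is rejected, every longer prefix is rejected too.

```python
from typing import Callable, List, Optional


def find_offending_line(
    lines: List[str],
    is_accepted: Callable[[List[str]], bool],
) -> Optional[int]:
    """Return the 0-based index of the line at which the file first
    becomes rejected, or None if the whole file is accepted.

    is_accepted is a hypothetical callback supplied by the caller.
    Assumes the empty file is accepted and that rejection persists
    as the prefix grows.
    """
    if is_accepted(lines):
        return None
    # Invariant: lines[:lo] is accepted, lines[:hi] is rejected.
    lo, hi = 0, len(lines)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_accepted(lines[:mid]):
            lo = mid
        else:
            hi = mid
    return hi - 1
```

This needs about log2(n) indexing runs for an n-line file, which is far fewer than removing lines one at a time.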