Missing files and results

nikhilkalige opened this issue

I am unable to see all search results for a string that I was trying to search for. If I just look at the index file and grep it, I get more results than what I am seeing in the webpage.

I also found that certain files are never indexed. A file search for these shows zero results, and these files do not show up in the index file either.

it's hard to give specific answers without specific data.

files that are never indexed are usually too large or binary
see also

var sizeMax = flag.Int("file_limit", 128*1024, "maximum file size")

which version are you using?

do the stats (bottom of page) indicate that data was skipped?

can you reduce the example to something smaller?


 Used 10M mem for 16953 documents (58M) from 1 repositories.

The file is a .c file with size 259K, which is greater than 128K. That may be the problem.
However, it's hard to pass options into git_index_flags; I can't find a way to pass more than one flag into it.
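The size limit discussed above can be illustrated with a minimal sketch. It assumes the indexer drops files larger than the file_limit value and files that look binary; the helper name skipReason and the NUL-byte test for binary content are assumptions made for illustration, not necessarily what zoekt does.

```go
package main

import (
	"bytes"
	"fmt"
)

// defaultFileLimit mirrors the default of the file_limit flag quoted above.
const defaultFileLimit = 128 * 1024

// skipReason reports why a file would be left out of the index, or "" if it
// would be kept. Hypothetical helper: the NUL-byte test for binary content is
// an assumption of this sketch.
func skipReason(content []byte, sizeMax int) string {
	if len(content) > sizeMax {
		return "too large"
	}
	if bytes.IndexByte(content, 0) >= 0 {
		return "binary"
	}
	return ""
}

func main() {
	// A text file of roughly the size reported in this thread (259K).
	big := bytes.Repeat([]byte("a"), 259*1024)

	fmt.Printf("default limit: %q\n", skipReason(big, defaultFileLimit))
	fmt.Printf("1M limit:      %q\n", skipReason(big, 1024*1024))
}
```

Under these assumptions a 259K source file is skipped at the default limit and kept once the limit is raised to 1M.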


I think you can do -git_index_flags="-flag1 flag2"

args = append(args, strings.Split(indexFlags, " ")...)

if you want to clarify the help string there, that would be great.

I tried
"-branches=master,develop sizeMax=1048576",
"-branches=master,develop -sizeMax=1048576"
"'-branches=master,develop sizeMax=1048576'"

Maybe we could do strings.Split() and add - as a prefix, so that you could pass "flag1=data flag2=data"

yeah, good idea. Send me a change.
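The prefixing idea suggested above could look roughly like the sketch below. The helper name splitIndexFlags is hypothetical and this is not the change that was eventually sent. Note that the flag declared in the snippet quoted earlier is named file_limit (sizeMax is only the Go variable), so the example uses that name.

```go
package main

import (
	"fmt"
	"strings"
)

// splitIndexFlags splits a space-separated flag string into arguments and adds
// a leading "-" to any token that lacks one, so "a=1 b=2" and "-a=1 -b=2"
// produce the same result. Hypothetical helper; flag values that contain
// spaces are not supported by this sketch.
func splitIndexFlags(s string) []string {
	var args []string
	for _, tok := range strings.Fields(s) {
		if !strings.HasPrefix(tok, "-") {
			tok = "-" + tok
		}
		args = append(args, tok)
	}
	return args
}

func main() {
	args := splitIndexFlags("branches=master,develop file_limit=1048576")
	for _, a := range args {
		fmt.Println(a)
	}
}
```

Using strings.Fields rather than strings.Split(s, " ") also avoids empty arguments when the string contains repeated spaces.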

any further comments on "If I just look at the index file and grep it, I get more results than what I am seeing in the webpage"? Is this a cutoff by number of matches, or does it really not show up? (Try restricting the search to the file you know it should be in.)

I think increasing the size fixed it. Let me investigate more and see if I can get more info if it recurs.

I still seem to get this problem, cat reponame.zoekt | grep stringval gives me 19 values, while the search gives me only 4. The stringval also shows up in files that are not as big as the one I mentioned in prior comments.

The files that contain stringval are indexed properly, as I can get good results for other values from these files. Does the length (42 characters) of the searched string matter?

if you do search for stringval, and restrict the search to a file that you know contains it (using "f:path/to/file"), does that return the data?

(I'd also be happy to debug the shard directly, if you are able to share it privately with me.)

Yup, using f:path works..

Sorry :(, can't really share the data..

can you check that for incomplete results, the following condition triggers?


Line 159 in 2f0c630


I think this is related to the web server. If I run ./zoekt -index_dir /var/data/index/ "stringval" | wc -l, then I get 19 results.

If that is true, the webserver should show a "Show more" link next to the results.

oops.. crap.. It does... My mistake.. sorry about bothering you.. I did not expect that..
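The mismatch above (19 matches on the command line, 4 on the page) is what you would see if the web UI only displays part of the match list and offers "Show more" for the rest. A minimal sketch of that kind of capping follows; capMatches and the limit of 4 are hypothetical and are not taken from zoekt's web server code.

```go
package main

import "fmt"

// capMatches returns at most limit matches, plus a flag that reports whether
// any matches were held back. A UI would render a "Show more" link when the
// flag is true. Hypothetical helper for illustration only.
func capMatches(matches []string, limit int) (shown []string, more bool) {
	if limit < 0 {
		limit = 0
	}
	if len(matches) <= limit {
		return matches, false
	}
	return matches[:limit], true
}

func main() {
	all := make([]string, 19) // 19 matches, as reported by the command-line tool
	shown, more := capMatches(all, 4)
	fmt.Printf("showing %d of %d, show more: %v\n", len(shown), len(all), more)
}
```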

did you get many matches for "stringval" that were symbol definitions?

How large is the corpus (number of files, number of bytes)? You can query "r:"

I got bitten by this today as well. I think we should make this more visible.

Found 1 repositories (17517 files, 82Mb content)

I would say 11 of the 19 are valid code and the remaining 8 are comments. If I try sym:stringval, I get 4 results.
I considered every occurrence of stringval inside a piece of code to be a symbol; maybe that's wrong?

The other problem seems to be numerous files that show up tagged Duplicate result, but with the same path.

the sym: operator looks for symbol definitions, eg

class Blabla { .. }

in c++.

Looks like your files are tiny (~ 500 bytes each), which throws off some coarse heuristics for matchcount that I introduced.

Re: duplicate results, are you indexing multiple branches? Does your project use submodules?
Which branches do the duplicate results come from?

I am trying to index three branches. The result I get is something like

MdEmbed.c [branch1]

MdEmbed.c [branch1] DuplicateResult
MdEmbed.c [branch1] DuplicateResult

MdEmbed.c [branch2]

MdEmbed.c [branch2] DuplicateResult
MdEmbed.c [branch2] DuplicateResult

MdEmbed.c [branch3]

MdEmbed.c [branch3] DuplicateResult
MdEmbed.c [branch3] DuplicateResult

that is weird. Each (branch, filename) combo should be there just once. How many files does a single branch have, and how many distinct (filename, filecontent) pairs should you have roughly?
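One way to check the "each (branch, filename) combo just once" expectation against a list of results is sketched below. The result type and duplicateCounts helper are hypothetical, written only to show the check; they do not correspond to zoekt's own types.

```go
package main

import "fmt"

// result identifies one search hit by branch and file path.
// Hypothetical type for this sketch.
type result struct {
	Branch string
	Path   string
}

// duplicateCounts returns every (branch, path) pair that occurs more than
// once in results, together with how often it occurs.
func duplicateCounts(results []result) map[result]int {
	seen := make(map[result]int)
	for _, r := range results {
		seen[r]++
	}
	dups := make(map[result]int)
	for r, n := range seen {
		if n > 1 {
			dups[r] = n
		}
	}
	return dups
}

func main() {
	results := []result{
		{"branch1", "MdEmbed.c"},
		{"branch1", "MdEmbed.c"},
		{"branch1", "MdEmbed.c"},
		{"branch2", "MdEmbed.c"},
	}
	for r, n := range duplicateCounts(results) {
		fmt.Printf("%s [%s] appears %d times\n", r.Path, r.Branch, n)
	}
}
```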

The 3 branches are almost the same, they are usually merged back and forth every 2-3 days.

how many files does each branch have?

can you try the latest version and see if it improved?

16977 files..
Awesome.. that was perfect