google / codesearch

Fast, indexed regexp search over large file trees

Home Page: http://swtch.com/~rsc/regexp/regexp4.html

Cannot index / search one file

GoogleCodeExporter opened this issue · comments

What steps will reproduce the problem?
1. Index the attached file with cindex
2. Search for a pattern inside it
3. No hits

What is the expected output? What do you see instead?

repro$
repro$ cindex -reset
repro$ cindex badfile
2013/03/13 18:51:52 index /tmp/repro/badfile
2013/03/13 18:51:52 flush index
2013/03/13 18:51:52 merge 0 files + mem
2013/03/13 18:51:52 0 data bytes, 92 index bytes
2013/03/13 18:51:52 done
repro$ cindex -list
/tmp/repro/badfile
repro$
repro$
repro$
repro$ grep main badfile
libc.so.6        __libc_start_main
torch             main
torch              realmain(int, char**)
libglib-2.0....          g_main_context_iteration
libglib-2.0....           g_main_context_prepare
libglib-2.0....            g_main_context_dispatch
#85 0x00000032f5c38f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#87 0x00000032f5c3ca3a in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#93 0x000000000040e74d in realmain(int, char**) ()
#94 0x000000000040e933 in main ()
repro$
repro$ csearch main  <= no results here !!
repro$ 
repro$ grep threads badfile 
============ All threads ==========
============ All threads ==========
repro$
repro$ csearch threads <= no results here !!
repro$ 

I cannot find (with csearch) text that is in a file I have indexed (with cindex).

What version of the product are you using? On what operating system?

I'm using the Linux binaries that are available on the Download page.
I tried to compile Go / codesearch myself but couldn't make it work (my Go
install might be funky).

Please provide any additional information below.

It looks like the problem happens at indexing time.

Original issue reported on code.google.com by bserg...@gmail.com on 14 Mar 2013 at 1:57

Also, I have one line that is crazy long: 2245 characters. Maybe the problem is 
that the indexer reads line by line and has a hardcoded limit on the number 
of characters in a single line?

Original comment by bserg...@gmail.com on 14 Mar 2013 at 2:04

Try indexing with the -verbose and -logskip flags to see whether the file is 
getting skipped.

The arbitrary limits are in the source, so you can always hand-edit and tweak 
them. I have a fork at

http://github.com/junkblocker/codesearch

which I made specifically to add such options.

Original comment by manpreet...@gmail.com on 14 Mar 2013 at 4:07

Also, I've been using codesearch as part of a webapp at work that does forensic 
analysis of crashes (by letting us search through backtraces), and it's amazing 
:)

I'm kinda stuck right now because I cannot index some files, and I'm thinking 
about using a different indexer / search system. But really codesearch is all I 
need, so if someone can figure out what the problem is, that would be awesome.

Thanks !!

Original comment by bserg...@gmail.com on 14 Mar 2013 at 2:00

Thanks for the tip. Indeed, I removed those long lines and now everything 
works fine. I see that your copy of the code has a -maxlinelen option, which 
should be what I need. Now I have to figure out how to build a Go program ...

Original comment by bserg...@gmail.com on 14 Mar 2013 at 5:57

Feel free to close the issue whoever can.

Original comment by bserg...@gmail.com on 14 Mar 2013 at 6:39

I'm going to leave this open until I can get something like -logskip into
the mainline codesearch branch.

Original comment by rsc@golang.org on 14 Mar 2013 at 2:08

I don't know how far you guys should go with this, but having two options 
to set maxLineLen and maxFileSize on the command line would also help.

The default behavior could be to print a message like the following (probably 
with better phrasing and different option names) whenever a file is skipped:

=> /tmp/foo wasn't indexed (maxLine too long) / try to reindex with cindex -maxLineLen 3000

=> /tmp/foo wasn't indexed (file too big) / try to reindex with cindex -maxFileSize 1M
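
To make the suggestion concrete, here is a rough Go sketch of what such a check might look like. It is a standalone illustration, not cindex's actual code: the flag names are the ones proposed in this comment, the skipReason helper is hypothetical, and the file-size default is an arbitrary placeholder.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// skipReason is a hypothetical helper: it returns a short reason why a file
// would be skipped under the given limits, or "" if the file passes.
// Line length is counted in bytes, excluding the newline.
func skipReason(data []byte, maxLineLen, maxFileSize int) string {
	if len(data) > maxFileSize {
		return "file too big"
	}
	lineLen := 0
	for _, b := range data {
		if b == '\n' {
			lineLen = 0
			continue
		}
		lineLen++
		if lineLen > maxLineLen {
			return "line too long"
		}
	}
	return ""
}

func main() {
	// Flag names follow the proposal in this thread; they are assumptions,
	// not flags of the mainline cindex binary.
	maxLineLen := flag.Int("maxLineLen", 2000, "skip files containing a longer line (bytes)")
	maxFileSize := flag.Int("maxFileSize", 1<<30, "skip files larger than this (bytes, placeholder default)")
	flag.Parse()

	for _, name := range flag.Args() {
		data, err := os.ReadFile(name)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		if reason := skipReason(data, *maxLineLen, *maxFileSize); reason != "" {
			fmt.Printf("=> %s wasn't indexed (%s)\n", name, reason)
		}
	}
}
```

Reading the whole file into memory keeps the sketch short; a real indexer would more likely check the limits while streaming.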

Original comment by bserg...@gmail.com on 14 Mar 2013 at 4:26

Alright, I figured it out, thanks.

repro$ awk '{print length($0)}' badfile | sort -n | tail
972
1001
1043
1071
1456
1529
1724
1792
2259
2328

and in index/write.go there's a 
    maxLineLen      = 2000
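
In the awk output above, two lines (2259 and 2328 characters) are over that 2000 limit. For anyone who wants the same check without awk, a small Go sketch follows. The longLines helper is hypothetical and counts bytes per line, which may differ from awk's character count on non-ASCII input; the 2000 default simply mirrors the constant quoted above.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// longLines is a hypothetical helper: it returns the 1-based numbers of all
// lines in data whose length in bytes (excluding the newline) exceeds limit.
func longLines(data []byte, limit int) []int {
	var out []int
	for i, line := range bytes.Split(data, []byte("\n")) {
		if len(line) > limit {
			out = append(out, i+1)
		}
	}
	return out
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: longlines FILE")
		os.Exit(2)
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// 2000 mirrors the maxLineLen value quoted from index/write.go.
	for _, n := range longLines(data, 2000) {
		fmt.Printf("%s:%d: line longer than 2000 bytes\n", os.Args[1], n)
	}
}
```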

Original comment by bserg...@gmail.com on 14 Mar 2013 at 6:36
