gvansickle / ucg

UniversalCodeGrep (ucg) is an extremely fast grep-like tool specialized for searching large bodies of source code.

Home Page: https://gvansickle.github.io/ucg/

Cut ripgrep some slack in benchmarks ;-), document how to run benchmarks

gvansickle opened this issue

@BurntSushi rightly reported that the 0.3.0 benchmarks do not pass '-u' to rg, thus giving the other utilities, which don't look at .gitignore files, an arguably unfair advantage. Address this.

Also, document how to obtain the corpora and run the benchmarks. This was slated for 0.3.0 (in my best intentions at least), but didn't make it. Maybe add rg's Linux corpus into the mix as well; it's a more typical use case.
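As a rough illustration of the kind of comparison being discussed, here is a minimal timing sketch. It is not the project's benchmark suite. The function names, the corpus directory, and the exact command lines are assumptions made for this example; the only detail taken from the thread is that rg gets '-u' so it does not skip files listed in .gitignore.

```python
import subprocess
import time


def best_wall_time(argv, runs=3):
    """Run argv `runs` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard output so terminal rendering does not skew the measurement.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(commands, runs=3):
    """Map each label in {label: argv} to its best wall-clock time."""
    return {label: best_wall_time(argv, runs) for label, argv in commands.items()}


def example_commands(corpus_dir, pattern="PM_RESUME"):
    """Hypothetical command lines; adjust binaries, pattern, and corpus as needed."""
    return {
        "ucg": ["ucg", pattern, corpus_dir],
        # -u asks ripgrep not to respect .gitignore files.
        "rg -u": ["rg", "-u", pattern, corpus_dir],
    }
```

Something like `compare(example_commands("/path/to/corpus"))` would then time both tools, provided both binaries are on the PATH. Taking the minimum of several runs is one simple way to reduce the effect of a cold file cache on the first run.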

@BurntSushi : In early results, it does look like you're beating me with the '-u':

| Test Case | built_ucg | inst_ucg | inst_ag | inst_ripgrep | inst_pcre2grep | inst_system_grep | inst_gnu_grep_e |
|-|-|-|-|-|-|-|-|
| TC2 | 0.266699 | 0.26657 | 1.55662 | 0.181029 | 0.915668 | 0.324283 | 0.322742 |

... and tables don't work in GitHub issue comments, great. :-/ Anyway, this is your second benchmark, PM_RESUME against the built Linux tree: ucg == 0.267, rg -u == 0.181. Except something's not right: you and everyone else are getting 5 hits, and I'm getting 11. As the wise Mark Freuder Knopfler, OBE once said, "Two men say they're Jesus/One of them must be wrong", and I'm guessing that may be me in this instance...

...well, wait again. I'm actually getting 5 hits, but I'm also detecting 6 recursive directory loops due to symlinks (which are mistakenly being counted as hits). You're doing a physical traversal, right? I'm defaulting to logical.

Yeah, my impression is that standard behavior is to not follow symlinks, so ripgrep won't do it by default. If you pass -L, then that should do the trick.
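To make the physical versus logical distinction concrete, here is a small sketch. It is not how ucg or ripgrep implement traversal; the `walk` function and its `revisits` counter are hypothetical names for this example. It assumes a platform where `st_dev` and `st_ino` identify a directory uniquely, as on Linux.

```python
import os


def walk(root, follow_symlinks=False):
    """Return (files, revisits) for a traversal rooted at `root`.

    follow_symlinks=False is a physical walk: symlinks are not followed.
    follow_symlinks=True is a logical walk: symlinked directories are entered,
    and any directory reached a second time is skipped and counted. A symlink
    loop shows up as such a revisit.
    """
    files = []
    revisits = 0
    visited = set()  # (device, inode) of every directory already entered
    stack = [root]
    while stack:
        directory = stack.pop()
        info = os.stat(directory)  # resolves symlinks, so this identifies the target
        key = (info.st_dev, info.st_ino)
        if key in visited:
            revisits += 1
            continue
        visited.add(key)
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=follow_symlinks):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=follow_symlinks):
                    files.append(entry.path)
    return files, revisits
```

Keeping the revisit count separate from the file list is what avoids the situation described above, where detected loops end up mixed into the hit count.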

Also, I'm kind of surprised at how slow ag is for you. Are you running on a VM?

Yeah, the Fedora 24 numbers are on VirtualBox. If GitHub supported tables, I could post the system info my benchmark suite obtains for you here. I'm working on getting the results into HTML form suitable for posting (graphs and everything), but I'm not quite there yet. Let me try the table for that specific system:

Test System Details

| Parameter | Value |
|-|-|
| Distribution | Fedora 24 (Workstation Edition) |
| Kernel name | Linux |
| Kernel release | 4.7.9-200.fc24.x86_64 |
| Kernel build info | #1 SMP Thu Oct 20 14:26:16 UTC 2016 |
| CPU model name | Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz |
| CPU architecture | x86_64 |
| CPU number of sockets | 1 |
| CPU cores per socket | 4 |
| CPU threads per core | 1 |
| CPU ISA extensions | apic clflush cmov constant_tsc cx8 de eagerfpu fpu fxsr ht hypervisor lahf_lm lm mca mce mmx msr mtrr nonstop_tsc nopl nx pae pat pge pni pse pse36 rdtscp rep_good sep sse sse2 sse4_1 sse4_2 ssse3 syscall tsc vme xtopology |
| Hypervisor present | Yes |
| Hypervisor vendor | Oracle VM VirtualBox |
| Hypervisor type | full |

I guess that's almost readable. Like I said before, it's way past time for me to get a new rig. Never enough round tuits.....

Yeah, in my testing the silver searcher does much worse in a virtual machine than on a native system, and my current hypothesis is that memory maps are the cause. You can test it for yourself by passing the --mmap flag to ripgrep. I bet you'll see a noticeable slowdown. :-)

(That's not to say it invalidates your benchmark. Running these tools in a VM is a perfectly common and legitimate use case. But it's probably important to acknowledge or at least understand.)
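For anyone who wants to try the memory-map hypothesis outside of any particular tool, a sketch along these lines compares a memory-mapped literal search with a plain read. Both function names are made up for this example, and it says nothing about how ag or ripgrep use memory maps internally. Wrapping each call in a timer on a VM and on a native system would give a rough feel for the difference.

```python
import mmap


def count_with_mmap(path, needle):
    """Count non-overlapping occurrences of `needle` (bytes) via a read-only memory map."""
    if not needle:
        raise ValueError("needle must be non-empty")
    with open(path, "rb") as handle:
        # An empty file cannot be mapped, so treat it as zero matches.
        if handle.seek(0, 2) == 0:
            return 0
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            count = 0
            position = mapped.find(needle)
            while position != -1:
                count += 1
                position = mapped.find(needle, position + len(needle))
            return count


def count_with_read(path, needle):
    """Count the same occurrences after reading the whole file into memory."""
    if not needle:
        raise ValueError("needle must be non-empty")
    with open(path, "rb") as handle:
        return handle.read().count(needle)
```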

Yep, I've done the experiments too (ucg still has some dormant mmap code in it if you look hard enough). I can build for Cygwin, and even there it's a hit. Not knowing what all is going on under the hood, it sure does seem contrary to what one would expect. But yeah, with virtualization, it makes a bit more sense that there'd be a hit here.

Similar topic: Have you tried asynchronous I/O for reading in the files? I have not, but I'm curious if that's any better than just a read() loop.

@gvansickle I've always heard pretty terrible things about async I/O on Linux, so I've never tried it. See: http://stackoverflow.com/questions/8513663/linux-disk-file-aio

Note that ripgrep does I/O differently from ucg. When it doesn't use memory maps, it reads incrementally. I think ucg just slurps the entire file in at once and then searches it, right?

Right. I try to read the entire file with one read() call. Honestly I was a bit surprised that worked as well as it does. Again it's probably due to the use-case: mostly smallish files. I should gather statistics on that.....
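The two I/O strategies being contrasted can be sketched roughly as follows. This is an illustration only, not ucg's or ripgrep's actual code: the function names and the 64 KiB chunk size are assumptions, and the search is a plain literal match per line.

```python
import os


def slurp(path):
    """Read a whole file, attempting a single read() sized from fstat()."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        data = os.read(fd, size)
        # read() may return fewer bytes than requested, so loop as a fallback.
        while len(data) < size:
            more = os.read(fd, size - len(data))
            if not more:
                break
            data += more
        return data
    finally:
        os.close(fd)


def count_lines_slurp(path, needle):
    """Count lines containing `needle` (non-empty bytes) after slurping the file."""
    return sum(1 for line in slurp(path).split(b"\n") if needle in line)


def count_lines_incremental(path, needle, chunk_size=64 * 1024):
    """Count lines containing `needle` while reading the file one chunk at a time."""
    matches = 0
    carry = b""  # trailing partial line left over from the previous chunk
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            buffer = carry + chunk
            # Search only complete lines; keep the unfinished tail for the next round.
            cut = buffer.rfind(b"\n") + 1
            complete, carry = buffer[:cut], buffer[cut:]
            matches += sum(1 for line in complete.split(b"\n") if needle in line)
    # A final line without a trailing newline still counts as a line.
    if carry and needle in carry:
        matches += 1
    return matches
```

The slurp version holds the whole file in memory at once. The incremental version holds roughly one chunk plus the partial line carried over, which is the trade-off that only starts to matter once files get large.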

> Honestly I was a bit surprised that worked as well as it does.

Me too. :P It actually holds up pretty well even on largeish files too. (Look at the subtitle benchmarks.)

@BurntSushi : I just updated the one benchmark in the README.md. Sorry it took so long (in more ways than one: now you're winning! ;-))
I still have to document the "how to reproduce" better, but it's still little more than a "make check" away. Hopefully this long weekend.