larkery / zsh-histdb

A slightly better history for zsh

Do we really care a lot about duplicate commands in `histdb` call? Improves histdb perf by 100x

phiresky opened this issue · comments

(semi-related to #2)

When you remove the group by clause and add an index on time:

create index hist_time on history(start_time);

For example compare this (slow):

-- explain query plan
select session, dir, replace(commands.argv, '
', '\n') as argv, max(start_time) as max_start
from
  history
  left join commands on history.command_id = commands.id
  left join places on history.place_id = places.id
where not (commands.argv like 'histdb%') and places.host='phirearch'
group by history.command_id, history.place_id
order by max_start desc
limit 62

to this:

-- explain query plan
select session, dir, replace(commands.argv, '
', '\n') as argv, start_time
from
  history
  left join commands on history.command_id = commands.id
  left join places on history.place_id = places.id
where not (commands.argv like 'histdb%') and places.host='phirearch'
order by history.start_time desc
limit 62
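A quick way to see why the second shape is faster is to compare the two query plans. Below is a minimal sketch in Python's sqlite3, using a trimmed-down stand-in for the histdb schema (column lists are invented for illustration; only the join keys and `start_time` matter for the plans):

```python
import sqlite3

# Simplified stand-in for the histdb schema, plus the proposed index.
db = sqlite3.connect(":memory:")
db.executescript("""
create table commands (id integer primary key, argv text);
create table places   (id integer primary key, host text, dir text);
create table history  (id integer primary key, session int,
                       command_id int, place_id int, start_time int);
create index hist_time on history(start_time);
""")

def plan(sql):
    # Each EXPLAIN QUERY PLAN row's last column is a human-readable step.
    return "\n".join(row[-1] for row in db.execute("explain query plan " + sql))

grouped = """select commands.argv, max(start_time) as max_start
from history
left join commands on history.command_id = commands.id
left join places on history.place_id = places.id
group by history.command_id, history.place_id
order by max_start desc limit 62"""

ungrouped = """select commands.argv, start_time
from history
left join commands on history.command_id = commands.id
left join places on history.place_id = places.id
order by history.start_time desc limit 62"""

# The grouped query has to visit every history row and build temp b-trees
# for the group by / order by; the ungrouped one can walk the hist_time
# index backwards and stop as soon as 62 rows are produced.
print(plan(grouped))
print(plan(ungrouped))
```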

For me, this reduces the duration of simple calls like `histdb` or `histdb borg` from 1100ms to 10ms.

It would even be possible to wrap this in another query to remove duplicates, with the only caveat that you may then get somewhat fewer results than you requested with `--limit`.

OK, it looks like sqlitebrowser itself also has a huge impact on performance, so it's not a good tool for comparison. Running the queries in the shell, the time is 200ms with the original query and 3ms with my adjustment.

When filtering by a directory, the original query takes 400ms in sqlitebrowser but only 100ms from the shell. The modified query takes 10ms in sqlitebrowser and 4ms in the shell.

Let me know if you think this change is a good idea, I can send a PR.

Indexing on time makes a lot of sense, as this doesn't affect the outputs. Similarly the changes in your other issue make sense for this reason.

I am not sure about changing the outputs, since quite a lot of people use this and I personally often get upset when people change the behavior of things that I use.

I think my preferred solution if you want the speed benefit of removing the group by is just to hack it locally. This is probably also influenced by the fact that I don't find the delay perceptible (and I work on a pretty old computer).

Well, for a normal `histdb` call it doesn't really matter much, I think. But do you also not notice a delay in the interactive version, where it runs a synchronous query on every keystroke?

My goal is to use hstr (https://github.com/dvorka/hstr), which is awesome, with histdb as a backend, since that is much nicer than the current histdb-interactive script. But since none of this needs a db change (except for adding the index), it shouldn't be a problem to use the faster query there but not here.

For whatever reason, in the reverse-isearch mine is not noticeably slow.
I work on a thinkpad x230 so it is not a beast of a machine particularly.
My history list is a bit smaller than yours (~10M) but that's not a radical difference.
I wonder whether there's something else happening to make it slow, although I don't have a good hypothesis for this.

That hstr thing looks like something you could imitate well using fzf and a command to run your faster query. If you make something that works well, let me know and I will put it in the readme or include it as a script, since many people are keen on that sort of interface.

Indexing on time makes a lot of sense

Forgot to respond to this earlier: sadly, indexing on time doesn't help at all as long as you keep the group by (try `explain query plan` and you'll see it still scans the whole table).

mine is not noticeably slow

It's not bad enough to be really annoying, but it's definitely noticeable for me, especially for the first three characters you type (try "cat"); the delay is gone once you enter a longer string. My patch is below, by the way. I think the interactive part can be applied without any visible changes (since you only see one result anyway). The creation of the time index is not in there; I could also send a PR if you want, but it's probably not worth another migration (just add it on creation of new databases).

That hstr thing looks like something you could imitate well using fzf

It would probably be a bit harder, since I also want an interactive toggle to make it only search for commands in the current dir (which IMO is kind of a killer feature of histdb), or maybe change the order according to that. I'll let you know when I make something.

diff --git a/histdb-interactive.zsh b/histdb-interactive.zsh
index decc877..c2139d7 100644
--- a/histdb-interactive.zsh
+++ b/histdb-interactive.zsh
@@ -59,7 +59,7 @@ _histdb_isearch_query () {
 commands.argv,
 places.dir,
 places.host,
-datetime(max(history.start_time), 'unixepoch')
+datetime(history.start_time, 'unixepoch')
 from history left join commands
 on history.command_id = commands.rowid
 left join places
@@ -67,8 +67,7 @@ on history.place_id = places.rowid
 where commands.argv glob '*$(sql_escape ${BUFFER})*'
 ${where_host}
 ${where_dir}
-group by commands.argv, places.dir, places.host
-order by ${maxmin}(history.start_time) ${ascdesc}
+order by history.start_time ${ascdesc}
 limit 1
 offset ${offset}"
     local result=$(_histdb_query -separator $'\n' "$query")
diff --git a/sqlite-history.zsh b/sqlite-history.zsh
index 033199c..70a7057 100644
--- a/sqlite-history.zsh
+++ b/sqlite-history.zsh
@@ -311,7 +311,7 @@ histdb () {
     local seps=$(echo "$cols" | tr -c -d ',' | tr ',' $sep)
     cols="${cols}, replace(commands.argv, '
 ', '
-$seps') as argv, max(start_time) as max_start"
+$seps') as argv, start_time as max_start"
 
     local mst="datetime(max_start, 'unixepoch')"
     local dst="datetime('now', 'start of day')"
@@ -325,7 +325,6 @@ from
   left join commands on history.command_id = commands.id
   left join places on history.place_id = places.id
 where ${where}
-group by history.command_id, history.place_id
 order by max_start desc
 ${limit:+limit $limit}) order by max_start asc"
 
@@ -336,7 +335,6 @@ from
   left join commands on history.command_id = commands.id
   left join places on history.place_id = places.id
 where ${where}
-group by history.command_id, history.place_id
 order by max_start desc) order by max_start asc"
 
     if [[ $debug = 1 ]]; then

OK - thanks - I'll try and look into this a bit more this week. Even for single character queries I find the isearch is basically instant, so I think there must be something weird going on but I can't guess what.

I have a feeling the query plan for this is a bit stupid anyway - it should be scanning commands for matching history rows and then using the index on command_id to efficiently search history, rather than scanning history. I will see whether I can coerce it into doing something else with the same output by moving some joins around.

This might be a definite no and I quite understand if it is, but is there any chance I could have your history file to see whether I can reproduce the performance you are getting?

Alternatively, if you could run `select count(*)` on all the tables or gather some other statistics (e.g. distinct counts on columns), I can try generating some test data of the same size.

Also, as another line of inquiry, could you try `create index if not exists history_command_place on history(command_id, place_id);` and see whether this helps the group by?

I tried making that index before, and sqlite used it in the select query, but it didn't really change performance, I think. My histdb file is maybe 90% imported with https://github.com/phiresky/ts-histdbimport. That could change some perf characteristics, I guess, because most entries end up in the same "place". I also vacuumed it before testing.

I just wrote a script to generate a history file: https://gist.github.com/phiresky/f9de114c7eda7cf80409a1fe304787ca . I'll send a copy of a generated db and also see if that is the same as mine.

Here you go, here's a slightly larger file (generated with above script): 400k entries, 40MB. zsh-history.db.gz

(After running `update places set host = currenthost` and `vacuum`.)

Timing of a simple histdb call: 0.99s to 1.003s, via SCAN TABLE (once for the count, once for the results).

After creating the index and applying the patch: 0.8s to 0.90s. Almost all of that time is spent on the `select count(*)`; without the count, the time is 0.05s.

If I replace the count query with a simplified one, it goes down to ~0.15s:

@@ -325,19 +326,16 @@ from
   left join commands on history.command_id = commands.id
   left join places on history.place_id = places.id
 where ${where}
-group by history.command_id, history.place_id
 order by max_start desc
 ${limit:+limit $limit}) order by max_start asc"
 
     ## min max date?
-    local count_query="select count(*) from (select ${cols}
+    local count_query="select count(*)
 from
   history
   left join commands on history.command_id = commands.id
   left join places on history.place_id = places.id
-where ${where}
-group by history.command_id, history.place_id
-order by max_start desc) order by max_start asc"
+where ${where}"
 
     if [[ $debug = 1 ]]; then
         echo "$query"

(not sure why the select is wrapped in another select?)

Thanks for all this effort - I'm quite busy with work at the moment so I don't have much spare time to think about other programs, but I will get to this sooner or later.

The inner / outer select is there to get the presentation right, and it may be it could be rolled into one select. The difficulty is probably that it's sorted one way on the inside and the other way on the outside.
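The effect of that nesting can be seen on a toy table: the inner select picks the N newest rows, and the outer select re-sorts them ascending so the most recent command prints last. A minimal sketch in Python's sqlite3 (the two-column table layout is invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table history (argv text, start_time int)")
db.executemany("insert into history values (?, ?)",
               [("cmd%d" % i, i) for i in range(10)])

# Inner select: grab the 3 newest rows (order desc + limit).
# Outer select: flip them back to ascending so the newest prints last,
# matching shell-history convention.
rows = db.execute("""
    select argv from (select argv, start_time from history
                      order by start_time desc limit 3)
    order by start_time asc
""").fetchall()
print([r[0] for r in rows])  # → ['cmd7', 'cmd8', 'cmd9']
```

Rolling this into a single select would mean ordering descending with the limit and then reversing client-side, which is exactly what the outer select does server-side.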

I just had a quick go with the db you are looking at there; generating a query with

$ histdb -d bash --host 
select strftime(case when datetime(max_start, 'unixepoch') > datetime('now', 'start of day') then '%H:%M' else '%d/%m' end, max_start, 'unixepoch', 'localtime') as time, session as ses, dir, host, argv as cmd from (select session, replace(places.dir, '/home/hinton', '~') as dir, places.host as host, replace(commands.argv, '
', '
') as argv, max(start_time) as max_start
from
  commands 
  join history on history.command_id = commands.id
  join places on history.place_id = places.id
where true and commands.argv glob '*bash*'
group by history.command_id, history.place_id
order by max_start desc
limit 57) order by max_start asc

and for the count query

select count(*) from (select session, replace(places.dir, '/home/hinton', '~') as dir, places.host as host, replace(commands.argv, '
', '
') as argv, max(start_time) as max_start
from
  commands 
  join history on history.command_id = commands.id
  join places on history.place_id = places.id
where true and commands.argv glob '*bash*'
group by history.command_id, history.place_id
order by max_start desc
)

I have rearranged the joins a bit in that query from what master produces at the moment.

If I chuck that into sqlite with .timer on I get Run Time: real 0.170 user 0.126354 sys 0.043123 for the limited query and Run Time: real 0.167 user 0.123798 sys 0.042933 for the count query.

This is a bit slow, but maybe 5x quicker than on your machine; roughly what spec are you running on? Mine is an i5-3320 at 2.6GHz with a reasonable SSD and 16GB of memory.

The reason for the group by is that I often run the same command lots of times in a row because I'm changing inputs and inspecting outputs - I don't like having these dups in the history or the reverse-isearch as they are clutter. However I do store the repeat runs just in case I want to know about them later for some reason (admittedly this has never come up yet).

I don't know if there is a more efficient way to get this effect - I could in principle prevent the append of the duplicates into the log, but then their time & exit status information would be lost.

Sorry, I didn't really look into this further. I'm using histdb in parallel with normal history currently, so I only use it for archiving and specific queries (filtering by dir is awesome) - So I currently don't really care about performance. Just one more note since I just thought of this:

The query with duplicates could have a LIMIT set to 1.5 × the wanted limit, then be wrapped in another query that filters the duplicates out. Then, if the number of returned entries is less than the wanted number of rows, the query can be re-executed in a loop with the limit increasing exponentially (2x, then 4x, then 8x) until the desired number of rows is returned or the limit reaches the db size. That way the change would be invisible from the user's perspective and good for performance, with the only cost being code complexity.
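That over-fetch-and-retry loop could look roughly like the sketch below (Python for brevity; as a simplification it dedups on the full command text, whereas histdb's group by dedups on the (command_id, place_id) pair):

```python
import sqlite3

def dedup_history(db, want):
    """Fetch up to `want` distinct commands, newest first, without GROUP BY.

    Over-fetch with an exponentially growing LIMIT and deduplicate in the
    client, so the database can keep using an index on start_time.
    """
    limit = int(want * 1.5)
    total = db.execute("select count(*) from history").fetchone()[0]
    while True:
        rows = db.execute(
            "select argv, start_time from history "
            "order by start_time desc limit ?", (limit,)).fetchall()
        seen, out = set(), []
        for argv, t in rows:
            if argv not in seen:
                seen.add(argv)
                out.append((argv, t))
            if len(out) == want:
                return out
        if limit >= total:   # whole table scanned; no more rows to find
            return out
        limit *= 2           # not enough distinct rows yet; try a bigger slice

db = sqlite3.connect(":memory:")
db.execute("create table history (argv text, start_time int)")
db.executemany("insert into history values (?, ?)",
               [("make", 1), ("make", 2), ("ls", 3), ("make", 4), ("git st", 5)])
print(dedup_history(db, 3))  # → [('git st', 5), ('make', 4), ('ls', 3)]
```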

I think the same; this is the last improvement that I hope zsh-histdb can adopt.

As usual, happy to accept pull requests that improve perf whilst preserving the output I want. Duplicate commands in the log are not output that I want, so unfortunately the performance cost of removing them has to be paid.

@tmpm697 you have made several requests for things to be different in some way, but you haven't shown any work you've done to make them different.

I know this probably comes across harshly, but I want to make the terms on which this software is made available quite clear: I made this program to meet my needs, which it does, so I don't need to do any more work on it. If you want it to change, do some work to change it and you will find satisfaction.

Phiresky's last comment describes one way that you could try to improve this aspect of query performance. I'm not going to do it because I also don't care to spend time making history queries faster; queries I make return in about 0.04 seconds, and I think I would not intervene until the latency was about 0.2 seconds.