bmoscon / cryptostore

A scalable storage service for cryptocurrency data


InfluxDB L3/L2 Books very slow.

JamesKBowler opened this issue · comments

Hello,
I have cryptostore working with InfluxDB 2.0 via influxdb-client-python, and I have a few visual charts set up in the GUI. Every so often (on book_interval) data seems to stop being written to InfluxDB. After some further digging, it looks like the problem is the ts logic in these two elif statements.

        elif data_type == L2_BOOK:
            for entry in self.data:
                ts = int(Decimal(entry["timestamp"]) * 1000000000)
                while ts in used_ts:
                    ts += 1
                used_ts.add(ts)

                agg.append(f'{data_type}-{exchange},symbol={pair},exchange={exchange},delta={entry["delta"]} side="{entry["side"]}",timestamp={entry["timestamp"]},receipt_timestamp={entry["receipt_timestamp"]},price={entry["price"]},amount={entry["size"]} {ts}')
        elif data_type == L3_BOOK:
            for entry in self.data:
                ts = int(Decimal(entry["timestamp"]) * 1000000000)
                while ts in used_ts:
                    ts += 1
                used_ts.add(ts)

                agg.append(f'{data_type}-{exchange},symbol={pair},exchange={exchange},delta={entry["delta"]} side="{entry["side"]}",id="{entry["order_id"]}",timestamp={entry["timestamp"]},receipt_timestamp={entry["receipt_timestamp"]},price="{entry["price"]}",amount="{entry["size"]}" {ts}')
                ts += 1

Order books can take minutes to iterate over because of the growing size of used_ts and the repeated Decimal calls; while this is happening, messages keep aggregating in Redis, which eats memory.
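
To show the effect without the pickle, here is a rough worst-case sketch with synthetic data (not taken from the snapshot): when thousands of entries share one base timestamp, each later entry has to step past every nanosecond slot already taken, so the total work grows roughly quadratically with the length of the run.

# while_loop_worst_case.py - synthetic illustration, not from cryptostore
import time
from decimal import Decimal

# every entry shares one base timestamp, like a single large book snapshot
entries = [{"timestamp": "1617129070.429816"}] * 20000

used_ts = set()
start = time.time()
for entry in entries:
    ts = int(Decimal(entry["timestamp"]) * 1000000000)
    while ts in used_ts:  # entry k has to step past the k slots already taken
        ts += 1
    used_ts.add(ts)
print(f"{len(entries)} entries took {time.time() - start:.1f}s")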

For instance, here is the current setup.

# influx_current.py
from decimal import Decimal

import pickle

with open('l3_orderbook.pickle', 'rb') as handle:
    data_list = pickle.load(handle)
data = (d for d in data_list)

# From current setup
used_ts = set()
agg = []
exchange, data_type, pair = 'COINBASE', 'l3_book', 'BTC-USD'
for entry in data:
    ts = int(Decimal(entry["timestamp"]) * 1000000000)
    while ts in used_ts:
        ts += 1
    used_ts.add(ts)

    agg.append(f'{data_type}-{exchange},symbol={pair},exchange={exchange},delta={entry["delta"]} side="{entry["side"]}",id="{entry["order_id"]}",timestamp={entry["timestamp"]},receipt_timestamp={entry["receipt_timestamp"]},price="{entry["price"]}",amount="{entry["size"]}" {ts}')
    ts += 1

Result:

In [1]: %time %run influx_current.py
   ...: 

CPU times: user 7min 36s, sys: 112 ms, total: 7min 36s
Wall time: 7min 36s

And now the faster version:

# influx_speedup.py
from decimal import Decimal
import pandas as pd
import numpy as np

import pickle

with open('l3_orderbook.pickle', 'rb') as handle:
    data_list = pickle.load(handle)

data = (d for d in data_list)


df = pd.DataFrame(list(data))
df['ts'] = df['timestamp'].apply(Decimal) * 1000000000
df.index = df['ts'].apply(int)
values = df.index.duplicated(keep='first').astype(float)
values[values == 0] = np.NaN
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff
df.index = df.index + np.cumsum(values).astype(int)

agg = []
exchange, data_type, pair = 'COINBASE', 'l3_book', 'BTC-USD'
for ts, entry in df.to_dict('index').items():
    agg.append(
        f'{data_type}-{exchange},symbol={pair},exchange={exchange},delta={entry["delta"]} side="{entry["side"]}",id="{entry["order_id"]}",timestamp={entry["timestamp"]},receipt_timestamp={entry["receipt_timestamp"]},price="{entry["price"]}",amount="{entry["size"]}" {ts}')

Result:

In [1]: %time %run influx_speedup.py
CPU times: user 1.62 s, sys: 538 ms, total: 2.16 s
Wall time: 1.56 s

Code inspired by this answer on StackOverflow
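
In isolation, that duplicated/cumsum block turns each run of duplicate index values into 0, 1, 2, ... offsets that reset at the next unique value. A tiny sketch of my understanding (made-up numbers):

# dedup_block_isolated.py - made-up index, just to show what the block does
import numpy as np
import pandas as pd

idx = pd.Index([5, 5, 5, 9, 9, 12])
values = idx.duplicated(keep='first').astype(float)
values[values == 0] = np.nan
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff
print(list(idx + np.cumsum(values).astype(int)))  # [5, 6, 7, 9, 10, 12]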

Am I missing something here?

Thanks
l3_orderbook.zip

Added a zip file which contains the pickled self.data.

Interestingly, this fails; the problem can be solved, but I'm still working on it.

assert True not in df.index.duplicated()

Failing orderbook snapshot
l3_orderbook_1617129070.429816.zip
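
Here is a toy reproduction of why the assertion can fail, based on my reading of the influx_speedup.py logic above (the numbers are made up, not from the snapshot): when a run of duplicate timestamps is followed by a timestamp only slightly larger, the bumped values spill into it.

# toy_duplicate_spill.py - illustrative only, not from the snapshot
import numpy as np
import pandas as pd

# a run of duplicates followed by a timestamp that is only one unit larger
df = pd.DataFrame({'price': [0, 1, 2]}, index=[5, 5, 6])

values = df.index.duplicated(keep='first').astype(float)
values[values == 0] = np.nan
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff
df.index = df.index + np.cumsum(values).astype(int)

print(list(df.index))                 # [5, 6, 6] - the bumped 5 collides with 6
print(True in df.index.duplicated())  # True, so the assert fails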

The order book pickle has 83,260 entries, and 81,057 timestamp modifications are needed to make the timestamps unique, so the current code ends up doing a huge number of iterations.
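
For reference, those counts can be checked straight from the pickle; every entry beyond the first for a given timestamp forces at least one modification. (The filename below is just what I assume is inside the zip.)

# count_duplicates.py - checking the numbers above against the pickle
import pickle
from collections import Counter

# assumed filename for the pickled self.data inside the zip
with open('l3_orderbook_1617129070.429816.pickle', 'rb') as handle:
    entries = pickle.load(handle)

counts = Counter(entry["timestamp"] for entry in entries)
print(len(entries))                         # total entries
print(sum(c - 1 for c in counts.values()))  # entries that need a modified ts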

A very simple example:

import pandas as pd
import numpy as np

# The problem, make a duplicate index have unique sequential values.
s = pd.Series([1., 2., 3., 4., 3., 5., 6., 6., 7., 9., 10., 9.])

# Solution.
s.sort_values(inplace=True)
v = s.duplicated(keep='first').astype(float)
v[v == 0] = np.NaN
m = np.isnan(v)
c = np.cumsum(~m)
r = c + s
print(pd.DataFrame({'s': s, 'v': v, 'm': m, 'c': c, 'r': r}))
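
Working that example through by hand (so the exact numbers are my own arithmetic, not program output):

# s (sorted): 1  2  3  3  4  5  6  6  7  9  9  10
# c:          0  0  0  1  1  1  1  2  2  2  3  3
# r = c + s:  1  2  3  4  5  6  7  8  9  11 12 13
# The duplicates become unique, but every value after a duplicate is also
# shifted, because the cumulative sum never resets.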

Adding this function to the influx.py file:

    def ts_sort(self):
        df = pd.DataFrame(list(self.data))
        df.index = df['timestamp'].apply(Decimal) * 1000000000
        df.index = df.index.map(int)
        df.sort_index(inplace=True)
        v = df.index.duplicated(keep='first').astype(float)
        v[v == 0] = np.NaN
        df.index = df.index + np.cumsum(~np.isnan(v))
        assert True not in df.index.duplicated()
        return df.to_dict('index')

Then modify the elif statements:

        elif data_type == L2_BOOK:
            for ts, entry in self.ts_sort().items():
                agg.append(f'{data_type}-{exchange},symbol={pair},exchange={exchange},delta={entry["delta"]} side="{entry["side"]}",timestamp={entry["timestamp"]},receipt_timestamp={entry["receipt_timestamp"]},price={entry["price"]},amount={entry["size"]} {ts}')

        elif data_type == L3_BOOK:
            for ts, entry in self.ts_sort().items():
                agg.append(f'{data_type}-{exchange},symbol={pair},exchange={exchange},delta={entry["delta"]} side="{entry["side"]}",id="{entry["order_id"]}",timestamp={entry["timestamp"]},receipt_timestamp={entry["receipt_timestamp"]},price="{entry["price"]}",amount="{entry["size"]}" {ts}')

OK, this is a harder problem to solve than I initially thought, at least for my brain. No matter how I construct the DataFrame, I end up needing to know a value that is only calculated later. The problem with the code above is that it messes with the order of the data. There are other issues too: the cumsum never resets, so every value ends up getting incremented even when it isn't required, and when I do add a cumsum that resets, the result can end up lower than the previous value, causing another duplicate.

Still, I think a speed improvement could be made, but my brain is hurting now.
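
One order-preserving idea I have not wired into influx.py yet, so treat it as a sketch rather than a fix: offset each entry by its position within its own run of identical timestamps using groupby().cumcount(). It keeps the original order and skips the Python-level while loop, but it still has the collision problem described above whenever a run is long enough to spill into the next base timestamp.

# cumcount_sketch.py - order-preserving offset, not committed anywhere
from decimal import Decimal
import pandas as pd

entries = [  # toy stand-in for self.data
    {"timestamp": "1617129070.429816"},
    {"timestamp": "1617129070.429816"},
    {"timestamp": "1617129070.429816"},
    {"timestamp": "1617129070.429817"},
]
df = pd.DataFrame(entries)
ts = (df["timestamp"].map(Decimal) * 1000000000).map(int)
df["ts"] = ts + ts.groupby(ts).cumcount()  # offset within each run, order kept
print(df["ts"].tolist())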

Personally, I think time series databases with this timestamp restriction are ill-suited to book data, unless you have a specific use case (like only saving the top 10 levels), in which case you can add an appropriate tag for each level to make the points unique. The Influx best practices guide recommends two ways to handle this sort of data (duplicate timestamps), and I opted to increment the timestamp. A better solution would probably be to just store the update as a JSON blob, since Influx will store string/binary data without issue.

see: https://docs.influxdata.com/influxdb/v2.0/write-data/best-practices/

So, I (or someone else) can make this change: store the entire book/delta as a JSON blob. I'm not sure if it will work for very large updates (I haven't seen anything about a maximum length), so it may not be possible.
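
For illustration, a line protocol point carrying the whole update as a JSON blob might look something like the sketch below; the field name and escaping are guesses on my part, not the eventual change.

# json_blob_sketch.py - illustrative shape of such a point
import json

update = {"bid": {"58123.45": "0.5"}, "ask": {"58124.00": "1.2"}}  # toy delta
# line protocol string fields need backslashes and double quotes escaped
blob = json.dumps(update).replace('\\', '\\\\').replace('"', '\\"')
point = f'l2_book-COINBASE,symbol=BTC-USD,exchange=COINBASE book="{blob}" 1617129070429816000'
print(point)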