Remove redundant fields
kderme opened this issue
The dbsync schema currently has a number of redundant fields that can be removed. Some of them are:
- tx_out.address
- tx_out.paymentCred
- collateral_tx_out.address
- collateral_tx_out.paymentCred
- stake_address.view
- drep_hash.view
- pool_hash.view
We could consider removing them from the default schema.
Some JSON fields are theoretically redundant and could be parsed by clients; however, that is not as easy as it is for the other fields.
In keeping with the various CIPs on human-readable representation, it would be beneficial to retain the bech32-encoded fields and remove the raw ones (e.g. drop address_raw instead of address), but either way it would help reduce the amount of storage on mainnet.
I understand that most users use the bech32 tx_out.address field (base58 for Byron), but there are some advantages to the raw version:
- bech32 requires computing before inserting, since the address comes in the raw format from the ledger. This adds to syncing time.
- For other tables, keeping the raw format is necessary for db-sync, so it feels more consistent to also keep the raw format for addresses.
- bech32 is the user-facing standard, but db-sync works as a backend, so the conversion can happen at the app level.
- It takes less disk space. This shows the difference in size (in bytes) between the two fields:
select count (*) as count, pg_column_size(address) - pg_column_size(address_raw) as diff from tx_out group by pg_column_size(address) - pg_column_size(address_raw) order by pg_column_size(address) - pg_column_size(address_raw);
count | diff
-----------+------
51 | 15
5381517 | 16
1 | 19
1 | 21
1 | 22
75 | 24
177 | 25
1256 | 27
14441735 | 28
39974479 | 29
5 | 30
69 | 31
8 | 32
128 | 34
6344 | 35
1 | 40
119554727 | 46
1 | 50
1 | 51
1 | 85
1 | 188
1 | 254
1 | 303
1 | 371
1 | 969
1 | 1186
1 | 1415
1 | 1417
1 | 1431
1 | 1743
1 | 1811
1 | 1994
1 | 2409
1 | 2812
1 | 4406
1 | 6460
1 | 6491
1 | 7218
1 | 7493
1 | 7542
1 | 7882
1 | 8264
(42 rows)
This gives a minimum total estimate of 7.15GB, not taking into account differences in the indexes. The database is also not fully synced, so the difference will grow over time.
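As a rough cross-check of that estimate, the histogram can be summed directly. This is only a sketch: the (count, diff) pairs are copied by hand from the query output above, and `ROWS` and `total_extra_bytes` are names made up for illustration.

```python
# (row_count, size_difference_in_bytes) pairs, copied from the psql output above.
ROWS = [
    (51, 15), (5381517, 16), (1, 19), (1, 21), (1, 22), (75, 24),
    (177, 25), (1256, 27), (14441735, 28), (39974479, 29), (5, 30),
    (69, 31), (8, 32), (128, 34), (6344, 35), (1, 40), (119554727, 46),
    (1, 50), (1, 51), (1, 85), (1, 188), (1, 254), (1, 303), (1, 371),
    (1, 969), (1, 1186), (1, 1415), (1, 1417), (1, 1431), (1, 1743),
    (1, 1811), (1, 1994), (1, 2409), (1, 2812), (1, 4406), (1, 6460),
    (1, 6491), (1, 7218), (1, 7493), (1, 7542), (1, 7882), (1, 8264),
]


def total_extra_bytes(rows):
    """Sum count * diff over every bucket of the histogram."""
    return sum(count * diff for count, diff in rows)


total = total_extra_bytes(ROWS)
# Decimal gigabytes (1 GB = 10**9 bytes); column data only, indexes excluded.
print(f"{total / 1e9:.2f} GB")
```

Four buckets (diffs of 16, 28, 29 and 46 bytes) account for nearly all of the total; the single-row outliers are negligible.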
Ultimately this boils down to users' preferences and whether they want to make the transition for the benefits it provides. I just wanted to give the full picture.
Either way is fine for me personally, but I feel dbsync's utility would be greater if it retained bech32:
> bech32 requires computing before inserting, since the address comes in the raw format from the ledger. This adds to syncing time.
For new addresses it would; for existing addresses it could be a single numeric field if it were a foreign key to an address table.
> bech32 is the user-facing standard, but db-sync works as a backend, so the conversion can happen at the app level.
Yes, most clients will represent addresses as bech32. On the client side that conversion could be repeated every time a given address is queried, while in dbsync it could be done only once, the first time the address is seen on-chain.
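For context on what the conversion involves wherever it ends up running, here is a minimal sketch of bech32 encoding of raw address bytes, following the BIP-173 algorithm. It is not db-sync's code. The `raw_to_bech32` helper, the default `addr` prefix, the `lru_cache` memoisation and the all-zero placeholder bytes are assumptions for illustration; a real client would pick the prefix from the address header and handle Byron (base58) addresses separately.

```python
from functools import lru_cache

CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"
GENERATOR = [0x3B6A57B2, 0x26508E6D, 0x1EA119FA, 0x3D4233DD, 0x2A1462B3]


def bech32_polymod(values):
    """BCH checksum over a sequence of 5-bit values (BIP-173)."""
    chk = 1
    for value in values:
        top = chk >> 25
        chk = ((chk & 0x1FFFFFF) << 5) ^ value
        for i in range(5):
            if (top >> i) & 1:
                chk ^= GENERATOR[i]
    return chk


def bech32_hrp_expand(hrp):
    """Spread the human-readable part into the values fed to the checksum."""
    return [ord(c) >> 5 for c in hrp] + [0] + [ord(c) & 31 for c in hrp]


def bech32_encode(hrp, data):
    """Encode 5-bit groups `data` under the prefix `hrp`."""
    values = bech32_hrp_expand(hrp) + data
    polymod = bech32_polymod(values + [0] * 6) ^ 1
    checksum = [(polymod >> 5 * (5 - i)) & 31 for i in range(6)]
    return hrp + "1" + "".join(CHARSET[d] for d in data + checksum)


def to_5bit_groups(raw):
    """Regroup 8-bit bytes into 5-bit values, zero-padding the last group."""
    acc, bits, out = 0, 0, []
    for byte in raw:
        acc = ((acc << 8) | byte) & 0xFFF
        bits += 8
        while bits >= 5:
            bits -= 5
            out.append((acc >> bits) & 31)
    if bits:
        out.append((acc << (5 - bits)) & 31)
    return out


@lru_cache(maxsize=None)
def raw_to_bech32(raw, hrp="addr"):
    """Hypothetical app-level helper; the cache makes each distinct address cost one encode."""
    return bech32_encode(hrp, to_5bit_groups(raw))


# Placeholder bytes with the length of a 57-byte address payload, not a real address.
view = raw_to_bech32(bytes(57))
print(len(view))
```

A 57-byte payload encodes to 103 characters, which appears to line up with the most common difference of 46 in the table above. The memoised helper also illustrates the point about repeated work: the encoding cost is per distinct address, not per query, wherever the cache lives.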
> It takes less disk space. This shows the difference in size between the two fields
More space would be saved if the address were moved to a separate table and referenced only by foreign key from tx_out and collateral_tx_out (avoiding repetition of keys).
The entire basis of comparison could change drastically if #1333 were prioritised (it could even hold both formats if needed, and it avoids repeated entries every time an address has a tx).
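To make the separate-table idea concrete, here is a small sketch of the layout being discussed. It uses in-memory SQLite purely so the example runs anywhere, whereas db-sync uses PostgreSQL. The table and column names (`address`, `address_id`, `raw`, `view`) and the placeholder values are assumptions for illustration, not the actual db-sync schema or whatever #1333 ends up specifying.

```python
import sqlite3

# Hypothetical normalised layout: each distinct address is stored once,
# and tx_out rows carry only a numeric foreign key.
SCHEMA = """
CREATE TABLE address (
    id   INTEGER PRIMARY KEY,
    raw  BLOB NOT NULL UNIQUE,   -- raw bytes as they come from the ledger
    view TEXT                    -- optional bech32/base58 rendering
);
CREATE TABLE tx_out (
    id         INTEGER PRIMARY KEY,
    address_id INTEGER NOT NULL REFERENCES address(id),
    value      INTEGER NOT NULL
);
"""


def address_id(conn, raw, view=None):
    """Return the id for `raw`, inserting the address only on first sight."""
    conn.execute(
        "INSERT OR IGNORE INTO address (raw, view) VALUES (?, ?)", (raw, view)
    )
    return conn.execute(
        "SELECT id FROM address WHERE raw = ?", (raw,)
    ).fetchone()[0]


def insert_tx_out(conn, raw, value, view=None):
    """Insert an output that references its address by foreign key."""
    conn.execute(
        "INSERT INTO tx_out (address_id, value) VALUES (?, ?)",
        (address_id(conn, raw, view), value),
    )


def demo():
    """Three outputs to the same placeholder address give one address row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    raw = bytes(57)  # placeholder bytes, not a real address
    for value in (1_000_000, 2_000_000, 3_000_000):
        insert_tx_out(conn, raw, value, view="placeholder-view")
    addresses = conn.execute("SELECT count(*) FROM address").fetchone()[0]
    outputs = conn.execute("SELECT count(*) FROM tx_out").fetchone()[0]
    return addresses, outputs


print(demo())
```

With this shape, the raw-versus-bech32 question costs storage per distinct address, not per output, which is why it could change the comparison above.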