Remove redundant fields

Question

kderme opened this issue 5 months ago · comments

Kostas Dermentzis commented 5 months ago

The dbsync schema currently has a number of redundant fields that can be removed. Some of them are:

tx_out.address, tx_out.payment_cred
collateral_tx_out.address, collateral_tx_out.payment_cred
stake_address.view
drep_hash.view
pool_hash.view

We could consider removing them from the default schema.
Some JSON fields are theoretically redundant and could be parsed by clients; however, that is not as easy as it is for the other fields.

RdLrT · Answer 1 · Wed Jan 03 2024 05:40:02 GMT+0800 (China Standard Time)

In keeping with the various CIPs on human-readable representation, it would be beneficial to retain the bech32-encoded fields and remove the raw ones (e.g. drop address_raw instead of address), but either way it would help reduce the amount of storage on mainnet.

Kostas Dermentzis · Answer 2 · Tue Jan 09 2024 17:25:34 GMT+0800 (China Standard Time)

I understand that most users use the bech32 tx_out.address field (base58 for Byron), but there are some advantages in the raw version:

bech32 requires computation before insertion, since the address comes from the ledger in raw format. This adds to syncing time.
For other tables, keeping the raw format is necessary for db-sync, so it feels more consistent to also keep the raw format for addresses.
bech32 is the user-facing standard, but db-sync works as a backend, so the transition can happen at the app level.
It takes less disk space. This shows the difference in sizes between the two fields:

select count(*) as count,
       pg_column_size(address) - pg_column_size(address_raw) as diff
from tx_out
group by pg_column_size(address) - pg_column_size(address_raw)
order by pg_column_size(address) - pg_column_size(address_raw);
   count   | diff 
-----------+------
        51 |   15
   5381517 |   16
         1 |   19
         1 |   21
         1 |   22
        75 |   24
       177 |   25
      1256 |   27
  14441735 |   28
  39974479 |   29
         5 |   30
        69 |   31
         8 |   32
       128 |   34
      6344 |   35
         1 |   40
 119554727 |   46
         1 |   50
         1 |   51
         1 |   85
         1 |  188
         1 |  254
         1 |  303
         1 |  371
         1 |  969
         1 | 1186
         1 | 1415
         1 | 1417
         1 | 1431
         1 | 1743
         1 | 1811
         1 | 1994
         1 | 2409
         1 | 2812
         1 | 4406
         1 | 6460
         1 | 6491
         1 | 7218
         1 | 7493
         1 | 7542
         1 | 7882
         1 | 8264
(42 rows)

This gives a minimum estimated total of 7.15 GB, not counting differences in the indexes. The database was also not fully synced when this was measured, so the gap grows over time.
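As a rough cross-check of that figure, the four most common size differences in the query output can be multiplied out directly. This is a minimal sketch in Python that uses only counts and byte differences copied from the table above; the variable names are illustrative, and the remaining groups are small enough not to change the rounded result.

```python
# (row count, per-row size difference in bytes) for the four dominant
# groups in the query output; the other groups contribute well under 1 MB.
dominant_rows = [
    (5_381_517, 16),
    (14_441_735, 28),
    (39_974_479, 29),
    (119_554_727, 46),
]

# Extra bytes taken by the text address column compared with the raw column.
extra_bytes = sum(count * diff for count, diff in dominant_rows)

print(extra_bytes)                  # 7149250185
print(round(extra_bytes / 1e9, 2))  # 7.15 (decimal GB)
```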

Kostas Dermentzis · Answer 3 · Tue Jan 09 2024 17:26:39 GMT+0800 (China Standard Time)

Ultimately this boils down to users' preferences, and whether they want to make the transition for the benefits it provides. I just wanted to give the full picture.

RdLrT · Answer 4 · Tue Jan 09 2024 19:55:23 GMT+0800 (China Standard Time)

Either way is fine for me personally, but I feel dbsync would be more useful if it retained bech32:

> bech32 requires computation before insertion, since the address comes from the ledger in raw format. This adds to syncing time.

For new addresses it would; for existing addresses it could be a single numeric field if it were a foreign key to an address table.

> bech32 is the user-facing standard, but db-sync works as a backend, so the transition can happen at the app level.

Yes, most clients will represent addresses as bech32. On the client side, that conversion would be repeated every time a given address is queried, while in dbsync it could be done just once, the first time the address is seen on-chain.
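For reference, the app-level conversion being discussed is small. The following is a minimal sketch of a bech32 encoder in Python, following the checksum algorithm from BIP-173 without enforcing its length limit. It assumes the `addr` prefix that CIP-5 lists for mainnet payment addresses; the payload bytes are a made-up placeholder rather than a real address, and Byron (base58) addresses are not covered.

```python
# Minimal bech32 encoder (BIP-173 checksum, no length limit enforced).
CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"
GENERATOR = (0x3B6A57B2, 0x26508E6D, 0x1EA119FA, 0x3D4233DD, 0x2A1462B3)


def polymod(values):
    """BCH checksum over a sequence of 5-bit values."""
    chk = 1
    for value in values:
        top = chk >> 25
        chk = ((chk & 0x1FFFFFF) << 5) ^ value
        for i, gen in enumerate(GENERATOR):
            if (top >> i) & 1:
                chk ^= gen
    return chk


def hrp_expand(hrp):
    """Spread the human-readable prefix into 5-bit values for the checksum."""
    return [ord(c) >> 5 for c in hrp] + [0] + [ord(c) & 31 for c in hrp]


def to_5bit(data):
    """Regroup 8-bit bytes into 5-bit values, zero-padding the last group."""
    acc, bits, out = 0, 0, []
    for byte in data:
        acc = (acc << 8) | byte
        bits += 8
        while bits >= 5:
            bits -= 5
            out.append((acc >> bits) & 31)
        acc &= (1 << bits) - 1  # keep only the bits not yet emitted
    if bits:
        out.append((acc << (5 - bits)) & 31)
    return out


def bech32_encode(hrp, raw):
    """Encode raw bytes as a bech32 string with the given prefix."""
    data = to_5bit(raw)
    check = polymod(hrp_expand(hrp) + data + [0] * 6) ^ 1
    checksum = [(check >> (5 * (5 - i))) & 31 for i in range(6)]
    return hrp + "1" + "".join(CHARSET[v] for v in data + checksum)


# Placeholder payload: 57 made-up bytes, NOT a real on-chain address.
raw_address = bytes([0x01]) + bytes(range(56))
view = bech32_encode("addr", raw_address)
print(view)
```

The same function would also cover the other view columns under this assumption, by passing a different prefix such as `stake` or `pool` together with the corresponding raw bytes.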

> It takes less disk space. This shows the difference in sizes between the two fields.

More space would be saved if the address were moved to a separate table and referenced only by foreign key from tx_out and collateral_tx_out (avoiding repetition of keys).

The entire basis of comparison could change drastically if #1333 were prioritised (which could even hold both formats if needed, and avoids repeated entries every time an address has a tx).
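The separate-table idea can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module with a hypothetical, heavily simplified schema; the table and column names are assumptions, not the actual db-sync tables or the design tracked in #1333.

```python
import sqlite3

# Hypothetical, simplified schema: NOT the actual db-sync tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (
        id   INTEGER PRIMARY KEY,
        raw  BLOB NOT NULL UNIQUE,   -- stored once per distinct address
        view TEXT                    -- optional bech32 form, also stored once
    );
    CREATE TABLE tx_out (
        id         INTEGER PRIMARY KEY,
        address_id INTEGER NOT NULL REFERENCES address(id),
        value      INTEGER NOT NULL
    );
""")


def insert_tx_out(raw, value):
    """Insert an output, creating the address row only on first sight."""
    conn.execute("INSERT OR IGNORE INTO address (raw) VALUES (?)", (raw,))
    (address_id,) = conn.execute(
        "SELECT id FROM address WHERE raw = ?", (raw,)
    ).fetchone()
    conn.execute(
        "INSERT INTO tx_out (address_id, value) VALUES (?, ?)",
        (address_id, value),
    )


# Three outputs paying to only two distinct (made-up) addresses.
insert_tx_out(b"\x01" * 57, 1_000_000)
insert_tx_out(b"\x02" * 57, 2_000_000)
insert_tx_out(b"\x01" * 57, 3_000_000)

address_rows = conn.execute("SELECT count(*) FROM address").fetchone()[0]
tx_out_rows = conn.execute("SELECT count(*) FROM tx_out").fetchone()[0]
print(address_rows, tx_out_rows)  # 2 3
```

In this layout each output row carries only an integer key, so the cost of storing either address format is paid once per distinct address rather than once per output.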

RdLrT · Answer 5 · Sat Feb 03 2024 07:36:50 GMT+0800 (China Standard Time)

@kderme - payment cred is currently left over in the linked PR closing this issue. I assume it is still something to be dropped, or would that happen in future when #1396 is considered?