IntersectMBO / cardano-db-sync

A component that follows the Cardano chain and stores blocks and transactions in PostgreSQL

Remove redundant fields

kderme opened this issue

The dbsync schema currently has a number of redundant fields that can be removed. Some of them are:

  • tx_out.address, tx_out.paymentCred
  • collateral_tx_out.address, collateral_tx_out.paymentCred
  • stake_address.view
  • drep_hash.view
  • pool_hash.view

We could consider removing them from the default schema.
Some JSON fields are theoretically redundant and could be parsed by clients; however, that's not as easy as with the other fields.

Keeping in line with the various CIPs for human-readable representation, it would be beneficial to retain the bech32-encoded fields and remove the raw ones (e.g. drop address_raw rather than address), but either way it would help reduce the amount of storage needed on mainnet.

I understand that most users use the bech32 tx_out.address field (base58 for Byron), but there are some advantages in the raw version:

  • bech32 requires computing before inserting, since the address comes in the raw format from the ledger. This takes syncing time
  • for other tables keeping the raw format is necessary for db-sync, so it feels more consistent to also keep the raw for addresses
  • bech32 is the user-facing standard, but db-sync works as a backend, so the transition can happen at the app level.
  • It takes less disk space. This shows the difference in sizes between the two fields:
select count (*) as count,
       pg_column_size(address) - pg_column_size(address_raw) as diff
  from tx_out
 group by pg_column_size(address) - pg_column_size(address_raw)
 order by pg_column_size(address) - pg_column_size(address_raw);
   count   | diff 
-----------+------
        51 |   15
   5381517 |   16
         1 |   19
         1 |   21
         1 |   22
        75 |   24
       177 |   25
      1256 |   27
  14441735 |   28
  39974479 |   29
         5 |   30
        69 |   31
         8 |   32
       128 |   34
      6344 |   35
         1 |   40
 119554727 |   46
         1 |   50
         1 |   51
         1 |   85
         1 |  188
         1 |  254
         1 |  303
         1 |  371
         1 |  969
         1 | 1186
         1 | 1415
         1 | 1417
         1 | 1431
         1 | 1743
         1 | 1811
         1 | 1994
         1 | 2409
         1 | 2812
         1 | 4406
         1 | 6460
         1 | 6491
         1 | 7218
         1 | 7493
         1 | 7542
         1 | 7882
         1 | 8264
(42 rows)

This gives a total minimum estimate of 7.15 GB. That figure does not take into account differences in the indexes, and the database was not fully synced, so the gap will grow over time.
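The estimate can be re-derived from the query output by summing count times diff over every row. The sketch below only redoes that arithmetic on the figures pasted above; the variable and function names are made up for illustration.

```python
# (row count, size difference in bytes) pairs copied from the query output above.
SIZE_DIFF_HISTOGRAM = [
    (51, 15), (5381517, 16), (1, 19), (1, 21), (1, 22), (75, 24),
    (177, 25), (1256, 27), (14441735, 28), (39974479, 29), (5, 30),
    (69, 31), (8, 32), (128, 34), (6344, 35), (1, 40), (119554727, 46),
    (1, 50), (1, 51), (1, 85), (1, 188), (1, 254), (1, 303), (1, 371),
    (1, 969), (1, 1186), (1, 1415), (1, 1417), (1, 1431), (1, 1743),
    (1, 1811), (1, 1994), (1, 2409), (1, 2812), (1, 4406), (1, 6460),
    (1, 6491), (1, 7218), (1, 7493), (1, 7542), (1, 7882), (1, 8264),
]


def total_extra_bytes(histogram):
    """Sum count * diff over all buckets: extra bytes used by the text column."""
    return sum(count * diff for count, diff in histogram)


total = total_extra_bytes(SIZE_DIFF_HISTOGRAM)
# Report in decimal GB (10**9 bytes) and binary GiB (2**30 bytes).
print(f"{total} bytes = {total / 1e9:.2f} GB = {total / 2**30:.2f} GiB")
```

The 7.15 figure corresponds to decimal gigabytes; almost all of it comes from the four largest buckets (diffs 16, 28, 29 and 46).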

Ultimately this boils down to users' preferences and whether they want to make the transition for the benefits it provides. I just wanted to give the full picture.

Either way is fine for me personally, but I feel dbsync would be more useful if it retained bech32:

> bech32 requires computing before inserting, since the address comes in the raw format from the ledger. This takes syncing time

For new addresses it would; for existing addresses it could be a single numeric field if it were a foreign key to an address table.

> bech32 is the user-facing standard, but db-sync works as a backend, so the transition can happen at the app level.

Yes, most clients will represent addresses as bech32. On the client side the conversion for a given address could be repeated every time that address is queried, whereas in dbsync it could be done once, the first time the address is seen on-chain.

> It takes less disk space. This shows the difference in sizes between the two fields

Even more space would be saved if the address were moved to a separate table and referenced only by foreign key from tx_out and collateral_tx_out (avoiding repetition of keys).

The entire basis of comparison could change drastically if #1333 were prioritised (it could even hold both formats if needed, and it avoids a repeated entry every time an address has a tx).

@kderme: payment cred is currently left over in the linked PR closing this issue. I assume it is still to be dropped, or could that happen in future when #1396 is considered?
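A rough sketch of the separate-table idea, using an in-memory SQLite database as a stand-in for PostgreSQL. The table layout, column names and helper functions here are hypothetical; they are not the db-sync schema and not necessarily what #1333 proposes.

```python
import sqlite3


def open_db():
    """Create a toy schema in which each distinct address is stored once."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE address (
            id  INTEGER PRIMARY KEY,
            raw BLOB NOT NULL UNIQUE
        );
        CREATE TABLE tx_out (
            id         INTEGER PRIMARY KEY,
            address_id INTEGER NOT NULL REFERENCES address (id),
            value      INTEGER NOT NULL
        );
        """
    )
    return conn


def address_id(conn, raw):
    """Return the id for a raw address, inserting it only if it is new."""
    conn.execute("INSERT OR IGNORE INTO address (raw) VALUES (?)", (raw,))
    row = conn.execute("SELECT id FROM address WHERE raw = ?", (raw,)).fetchone()
    return row[0]


def insert_tx_out(conn, raw, value):
    """Store an output that references its address by numeric key."""
    conn.execute(
        "INSERT INTO tx_out (address_id, value) VALUES (?, ?)",
        (address_id(conn, raw), value),
    )


conn = open_db()
addr_a = bytes([0x01]) + bytes(56)  # dummy payloads, not real addresses
addr_b = bytes([0x61]) + bytes(28)
for raw, value in [(addr_a, 5), (addr_b, 7), (addr_a, 11)]:
    insert_tx_out(conn, raw, value)

n_addresses = conn.execute("SELECT count(*) FROM address").fetchone()[0]
n_outputs = conn.execute("SELECT count(*) FROM tx_out").fetchone()[0]
print(n_addresses, n_outputs)  # 2 3
```

Three outputs end up referencing two address rows, so each additional output to a known address costs one integer key instead of another copy of the address.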