bmstar2012 / solana-rpc-ansible

Ansible config for Solana RPC

Solana RPC role

An Ansible role to deploy a Solana RPC node. This configures the validator software in RPC mode running under the user solana. The RPC service is installed as a user service running under this same user.

Updates

  • 16/02 - From Solana 1.8.15 (mainnet) and 1.9.6 (testnet) onwards you will need to specify solana_full_rpc_api: true for this role to actually create a fully exposed RPC API node.

Hardware Requirements

An RPC server requires at least the same specs as a Solana validator, but typically has higher requirements. In particular, we recommend using 256 GB of RAM in order to store indexes. For more information about hardware requirements, please see https://docs.solana.com/running-validator/validator-reqs. We strongly recommend you use a bare-metal provider (not Hetzner) rather than a cloud provider unless you know what you're doing (and then why are you reading this page?).

Before deploying, you should prepare the host so that the directories you use for your accounts database and your ledger are properly configured. This can include setting up a tmpfs folder for accounts and a separate filesystem (ideally on an NVMe drive) for the ledger. A common way to configure this might be:

/solana/tmpfs - a 100 GB tmpfs partition to hold accounts state
/solana/ledger - a 2 TB NVME drive to hold ledger
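
The layout above could be prepared with a small playbook along these lines. This is only a sketch: the device path /dev/nvme0n1p1, the ext4 filesystem and the mount options are assumptions, not something this role sets up for you. On newer Ansible versions the mount module lives in the ansible.posix collection.

```yaml
# Sketch only: device path, filesystem type and mount options are assumptions.
- hosts: rpc_nodes
  become: true
  tasks:
    - name: Mount a 100 GB tmpfs to hold accounts state
      mount:
        path: /solana/tmpfs
        src: tmpfs
        fstype: tmpfs
        opts: size=100G
        state: mounted

    - name: Mount the NVMe drive that holds the ledger
      mount:
        path: /solana/ledger
        src: /dev/nvme0n1p1   # hypothetical device, check lsblk on your host
        fstype: ext4          # assumes the partition is already formatted as ext4
        opts: defaults,noatime
        state: mounted
```

With state: mounted the module creates the mount point if needed, mounts it immediately and adds an fstab entry so the mount survives a reboot.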

Why bare metal and not cloud?

Cloud servers (AWS, GCP, etc.) are generally unsuitable for Solana for a number of reasons:

  1. Egress is really expensive and Solana will use a lot of egress
  2. Single-core performance is generally too low and cannot boost the way bare metal can
  3. Many cloud providers do not want Solana's kind of workload on their lower-cost instances, which pushes you toward very expensive bare-metal instances.

Why not Hetzner?

Hetzner has decided that they do not want Solana RPC services running on their network. They actively block connections to Solana entrypoints and ratelimit any Solana traffic. Not only will your Solana node struggle to keep up with the network (on mainnet it will likely never catch up), but Hetzner is also very likely to shut down your account.

Software Requirements

  • Ansible >= 2.7 (tested primarily on Ansible 2.8)
  • Ubuntu 18.04+ on the target deployment machine

This role assumes some familiarity with the Solana validator software deployment process.

Role Variables

The deploy ensures that the checksum for the version of solana-installer that you are downloading matches the one given in vars/main.yml. If you want to install a Solana version not listed there, you should first download the solana-installer script (https://raw.githubusercontent.com/solana-labs/solana/master/install/solana-install-init.sh) and check its sha256 checksum.
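
One way to obtain that checksum is a short local playbook like the sketch below. The /tmp destination is an arbitrary choice, and how the resulting value is recorded in vars/main.yml depends on the layout of that file, which is not shown here.

```yaml
# Sketch: download the installer script and print its sha256 checksum.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Download the installer script for inspection
      get_url:
        url: https://raw.githubusercontent.com/solana-labs/solana/master/install/solana-install-init.sh
        dest: /tmp/solana-install-init.sh   # arbitrary scratch location
        mode: "0644"

    - name: Compute the sha256 checksum of the downloaded file
      stat:
        path: /tmp/solana-install-init.sh
        checksum_algorithm: sha256
        get_checksum: true
      register: installer

    - name: Show the checksum
      debug:
        msg: "sha256: {{ installer.stat.checksum }}"
```

Read the downloaded script before you trust the checksum it produces.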

There are a large number of configurable parameters for Solana. Many of these have workable defaults, so you can deploy a Solana RPC node with this role without changing any of them and still have a decent experience. If you run this role without specifying any parameters, it will configure a standard mainnet RPC node.

Basic variables

These are the basic variables that configure the setup of the validators. They have default values but you probably want to customise them based on your setup.

| Name | Default value | Description |
|---|---|---|
| solana_version | stable | The Solana version to install. |
| solana_full_rpc_api | true | Whether to enable the full RPC API or not. That's typically what you want. |
| solana_root | /solana | Main directory for the Solana ledger and accounts. |
| solana_ledger_location | /solana/ledger | Storage for the Solana ledger (should be on NVMe). |
| solana_accounts_location | /solana/ledger/accounts | Storage for Solana accounts information. If you use tmpfs for accounts, this should be a subdirectory of your tmpfs mount point (e.g. /solana/tmpfs/accounts if tmpfs is mounted on /solana/tmpfs). |
| solana_snapshots_location | | Storage for Solana snapshots. Can be useful to keep on a separate NVMe from your ledger. |
| solana_keypairs | [] | List of keypairs to copy to the validator node. Each entry in the list should have a key and name entry. This will create /home/solana/<name>.json containing the value of key. |
| solana_generate_keypair | true | Whether or not to generate a keypair. If you haven't specified solana_keypairs and you set this to true, a new key will be generated and placed in /home/solana/identity.json. |
| solana_public_key | /home/solana/identity.json | Location of the identity of the validator node. |
| solana_network | mainnet | The Solana network that this node is supposed to be part of. |
| solana_environment | see defaults/main.yml | Environment variables to specify for the validator node, most importantly RUST_LOG. |
| solana_enabled_services | [ solana-rpc ] | List of services to start automatically on boot. |
| solana_disabled_services | [ ] | List of services to set as disabled. |
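
As an illustration, the basic variables could be set in a group variables file like the sketch below. The file name and the vault_solana_identity_key variable are hypothetical. The accounts location assumes a tmpfs mounted on /solana/tmpfs.

```yaml
# group_vars/rpc_nodes.yml (hypothetical file name)
solana_version: stable
solana_full_rpc_api: true
solana_network: mainnet
solana_root: /solana
solana_ledger_location: /solana/ledger
solana_accounts_location: /solana/tmpfs/accounts   # assumes tmpfs mounted on /solana/tmpfs

# Copy an existing identity instead of generating a new one.
solana_generate_keypair: false
solana_keypairs:
  - name: identity   # written to /home/solana/identity.json
    key: "{{ vault_solana_identity_key }}"   # hypothetical Ansible Vault variable
solana_public_key: /home/solana/identity.json
```

Keeping the key material in Ansible Vault rather than in plain text is a design choice of this sketch, not a requirement of the role.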

Ports

The following ports need to be configured for your RPC server.

| Name | Default value | Description |
|---|---|---|
| solana_gossip_port | 8001 | Port for gossip traffic (needs to be open publicly in firewall for both TCP and UDP) |
| solana_rpc_port | 8899 (+8900) | Ports for incoming RPC (and websocket). This is typically only open on localhost. Place a proxy like haproxy in front of these port(s) and don't expose them publicly. |
| solana_rpc_bind_address | 127.0.0.1 | Address to bind RPC on. This should typically be localhost. Place a proxy like haproxy in front of this to accept public traffic |
| solana_dynamic_port_range | 8002-8020 | Port for incoming solana traffic. May need to be open publicly in firewall for UDP. |

From this list, you can tell that you need at least 8001-8020 open in your firewall for incoming traffic in the default case.
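
If the host uses ufw as its firewall (an assumption, this role does not require it), the default ports from the table could be opened with tasks like the following sketch. It only adds rules and does not enable ufw itself, so that you do not lock yourself out of SSH by accident.

```yaml
# Sketch: open the default gossip and dynamic ports, assuming ufw is in use.
- hosts: rpc_nodes
  become: true
  tasks:
    - name: Allow gossip traffic on TCP and UDP
      ufw:
        rule: allow
        port: "8001"
        proto: "{{ item }}"
      loop:
        - tcp
        - udp

    - name: Allow the dynamic port range on UDP
      ufw:
        rule: allow
        port: "8002:8020"
        proto: udp
```

The RPC ports 8899 and 8900 are deliberately left out, since they should stay bound to localhost behind a proxy.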

For pure RPC nodes it may be possible to close down the TPU and TPU forward ports. These ports are dynamically allocated and you can see them by looking at your node in solana gossip. If you want to firewall them, you can use this utility: https://github.com/rpcpool/tpu-traffic-classifier. Using this tool you can block incoming TPU and TPU forward on a local node by running:

./tpu-traffic-classifier -config-file config.yml -our-localhost -tpu-policy DROP -fwd-policy DROP -update=false

Put this in a systemd service to have it start when the node boots and keep it running continuously.
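
A minimal sketch of such a service, installed through Ansible, is shown below. The install location /opt/tpu-traffic-classifier and the unit name are assumptions. The command line is the one given above.

```yaml
# Sketch: run tpu-traffic-classifier as a system service.
- hosts: rpc_nodes
  become: true
  tasks:
    - name: Install a unit file for tpu-traffic-classifier
      copy:
        dest: /etc/systemd/system/tpu-traffic-classifier.service
        mode: "0644"
        content: |
          [Unit]
          Description=Solana TPU traffic classifier
          After=network-online.target

          [Service]
          # Hypothetical install location; adjust to where you built the tool.
          WorkingDirectory=/opt/tpu-traffic-classifier
          ExecStart=/opt/tpu-traffic-classifier/tpu-traffic-classifier -config-file config.yml -our-localhost -tpu-policy DROP -fwd-policy DROP -update=false
          Restart=on-failure

          [Install]
          WantedBy=multi-user.target

    - name: Enable and start the service
      systemd:
        name: tpu-traffic-classifier
        daemon_reload: true
        enabled: true
        state: started
```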

Network specific variables

Default values for these variables are specified in vars/{{ solana_network }}-default.yml (e.g. vars/mainnet-default.yml). You can also specify your own by providing the file {{ solana_network }}.yml. You will need to specify all these variables unless you rely on the defaults.

| Name | Default value | Description |
|---|---|---|
| solana_network | mainnet | The solana network this node should join |
| solana_metrics_config | see vars/mainnet-default.yml | The metrics endpoint |
| solana_genesis_hash | see vars/mainnet-default.yml | The genesis hash for this network |
| solana_entrypoints | see vars/mainnet-default.yml | Entrypoint hosts |
| solana_known_validators | see vars/mainnet-default.yml | Known validators from where to fetch snapshots and genesis bin on start up |
| solana_expected_bank_hash | see vars/mainnet-default.yml | Expected bank hash |
| solana_expected_shred_version | see vars/mainnet-default.yml | Expected shred version |
| solana_index_exclude_keys | see vars/mainnet-default.yml | Keys to exclude from indexes for performance reasons |

RPC specific variables

| Name | Default value | Description |
|---|---|---|
| solana_rpc_faucet_address | | Specify an RPC faucet |
| solana_rpc_history | true | Whether to provide historical values over RPC |
| solana_account_index | program-id spl-token-owner spl-token-mint | Which indexes to enable. These greatly improve performance but slow down start up and can increase memory requirements. |

Performance variables

These are variables you can tweak to improve performance

| Name | Default value | Description |
|---|---|---|
| solana_snapshot_compression | | Whether to compress snapshots or not. Specify none to improve performance. |
| solana_snapshot_interval_slots | | How often to take snapshots. Increase to improve performance. Suggested value is 500. |
| solana_pubsub_max_connections | 1000 | Maximum number of pubsub connections to allow. |
| solana_bpf_jit | | Whether to enable BPF JIT. Default on for devnet. |
| solana_banking_threads | 16 | Number of banking threads. |
| solana_rpc_threads | | Number of RPC threads (default: maximum threads/cores on system) |
| solana_limit_ledger_size | Solana default, 250 million | Size of the local ledger to store. For a full epoch set a value between 350 million and 500 million. For best performance set 50 (minimal value). |
| solana_accounts_db_caching | | Whether to enable accounts db caching |
| solana_accounts_shrink_path | | You may want to specify another location for the accounts shrinking process |
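
As an example, the suggestions from the table could be combined as in the sketch below. It assumes that solana_limit_ledger_size takes a plain shred count; check the role's templates before relying on that.

```yaml
# Sketch: performance-oriented overrides using the values suggested above.
solana_snapshot_compression: none
solana_snapshot_interval_slots: 500
solana_pubsub_max_connections: 1000
solana_banking_threads: 16
# Assumption: the value is a plain number of shreds (here 350 million).
solana_limit_ledger_size: 350000000
```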

Bigtable

You can specify Google Bigtable account credentials for querying blocks not present in local ledger.

| Name | Default value | Description |
|---|---|---|
| solana_bigtable_enabled | false | Enable bigtable access |
| solana_bigtable_upload_enabled | false | Enable bigtable uploading (the credentials you provide below need write access) |
| solana_bigtable_project_id | | Bigtable project id |
| solana_bigtable_private_key_id | | Bigtable private key id |
| solana_bigtable_private_key | | Bigtable private key |
| solana_bigtable_client_email | | Bigtable client email |
| solana_bigtable_client_id | | Bigtable client id |
| solana_bigtable_client_x509_cert_url | | Bigtable cert url |

For more information about BigTable see https://github.com/solana-labs/solana-bigtable .
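
A read-only Bigtable setup might look like the following sketch. Every vault_* variable is a hypothetical name for a secret you would store in Ansible Vault. The values themselves come from your Google service account credentials.

```yaml
# Sketch: enable Bigtable lookups without uploading.
solana_bigtable_enabled: true
solana_bigtable_upload_enabled: false
# All vault_* names below are hypothetical Ansible Vault variables.
solana_bigtable_project_id: "{{ vault_bigtable_project_id }}"
solana_bigtable_private_key_id: "{{ vault_bigtable_private_key_id }}"
solana_bigtable_private_key: "{{ vault_bigtable_private_key }}"
solana_bigtable_client_email: "{{ vault_bigtable_client_email }}"
solana_bigtable_client_id: "{{ vault_bigtable_client_id }}"
solana_bigtable_client_x509_cert_url: "{{ vault_bigtable_client_x509_cert_url }}"
```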

Handling forks

Occasionally devnet/testnet will experience forks. In these cases use the following parameters as instructed in Discord:

| Name | Default value | Description |
|---|---|---|
| solana_hard_fork | | Hard fork |
| solana_wait_for_supermajority | | Whether node should wait for supermajority or not |

CPU governor & Sysctl settings

There are certain configuration changes that you need to make to get your RPC node running properly. This role can help you make some of these standard changes. However, full optimisation depends greatly on your hardware, so you need to take the time to become familiar with how to configure your hardware correctly.

The most important element of optimisation is the CPU performance governor. This controls boost behaviour and energy usage. Many hosts in data centres are configured for a balance between performance and energy usage. In the case of Solana we really need them to perform at their fastest. To set the server's CPU governor there are three options:

  1. You have access to BIOS and you set the BIOS CPU setting to max performance. This seems to work well for HPE systems. In this case, specify the variable cpu_governor: bios. This is sometimes required for AMD EPYC systems too.
  2. You have access to BIOS and you set the BIOS CPU setting to OS control. This should be the typical default. In this case you can leave the cpu_governor variable as default or set it explicitly to cpu_governor: performance.
  3. You don't have access to BIOS or CPU governor settings. If possible, try to set cpu_governor: performance. Otherwise, hopefully your provider has configured it for good performance!
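
To confirm which governor is actually active after a deploy, a check like the sketch below can help. It assumes the kernel exposes the cpufreq interface in sysfs. That may not be the case when the governor is controlled by the BIOS, and the first task will then fail.

```yaml
# Sketch: verify that cpu0 runs with the performance governor.
- hosts: rpc_nodes
  gather_facts: false
  tasks:
    - name: Read the active CPU frequency governor of cpu0
      command: cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
      register: governor
      changed_when: false

    - name: Fail if the governor is not performance
      assert:
        that:
          - governor.stdout == "performance"
        msg: "cpu0 governor is {{ governor.stdout }}"
```

This only inspects the first core. Extending it to every core is straightforward but left out to keep the sketch short.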

The second change you need to make is to tune various kernel parameters to fit the Solana RPC use case.

One option is to deploy solana-sys-tuner together with this config to autotune some variables for you.

A second option, especially if you are new to performance tuning, is tuned and tuned-adm from Red Hat, where the throughput-performance profile is suitable.

Finally, if you deploy through this role you can also specify a list of sysctl values for this playbook to automatically set up on your host. This allows full control and sets them so that they are permanently configured. Here is a list of sysctl values that we have used on rpcpool:

sysctl_optimisations:
  vm.max_map_count: 700000
  kernel.nmi_watchdog: 0
  # Minimal preemption granularity for CPU-bound tasks
  # (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
  kernel.sched_min_granularity_ns: '10000000'
  # SCHED_OTHER wake-up granularity
  # (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
  kernel.sched_wakeup_granularity_ns: '15000000'
  vm.swappiness: '30'
  kernel.hung_task_timeout_secs: 600
  # Virtual memory statistics are gathered less often, which is a reasonable trade-off for lower latency
  vm.stat_interval: 10
  vm.dirty_ratio: 40
  vm.dirty_background_ratio: 10
  vm.dirty_expire_centisecs: 36000
  vm.dirty_writeback_centisecs: 3000
  vm.dirtytime_expire_seconds: 43200
  kernel.timer_migration: 0
  # A suggested value for pid_max is 1024 * <# of cpu cores/threads in system>
  kernel.pid_max: 65536
  net.ipv4.tcp_fastopen: 3
  # From solana-sys-tuner
  # Reference: https://medium.com/@CameronSparr/increase-os-udp-buffers-to-improve-performance-51d167bb1360
  net.core.rmem_max: 134217728
  net.core.rmem_default: 134217728
  net.core.wmem_max: 134217728
  net.core.wmem_default: 134217728
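
The pid_max suggestion in the comment above can be derived from gathered facts rather than hard-coded. The sketch below assumes that fact gathering is enabled and that the role templates the values of sysctl_optimisations before applying them, which has not been verified here.

```yaml
# Sketch: compute pid_max as 1024 * number of CPU threads.
# Requires gathered facts (ansible_processor_vcpus).
sysctl_optimisations:
  vm.max_map_count: 700000
  kernel.pid_max: "{{ 1024 * ansible_processor_vcpus }}"
```

On a host with 64 threads this would evaluate to 65536, the same value as in the list above.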

Example Playbooks

Mainnet node:

    - hosts: rpc_nodes
      become: true
      become_method: sudo
      roles:
         - { role: rpcpool.solana-rpc, solana_network: mainnet }

Testnet node:

    - hosts: rpc_nodes
      become: true
      become_method: sudo
      roles:
         - { role: rpcpool.solana-rpc, solana_network: testnet }

Devnet node:

    - hosts: rpc_nodes
      become: true
      become_method: sudo
      roles:
         - { role: rpcpool.solana-rpc, solana_network: devnet }

Starting the RPC node

After the deploy you can log in to the machine and run su -l solana to become the solana user.

To see the Solana validator command line generated for you during the deploy, take a look at /home/solana/bin/solana-rpc.sh. Remember that any changes to this file will be overwritten the next time you run this Ansible role.

For the first start up, you should comment out --no-genesis-fetch and --no-snapshot-fetch in the file /home/solana/bin/solana-rpc.sh. This will allow solana to download the basic files it requires for first time start up. Remember to activate these lines again after you have started the validator for the first time.

Then start up the solana RPC process by running systemctl --user start solana-rpc. You can see status of the process by running systemctl --user status solana-rpc. The first start up will take some time. You can monitor start up by running solana catchup --our-localhost.

Finally, to see logs for your Solana RPC node run journalctl --user -u solana-rpc -f.

If this is your first time running a Solana node, you can find more details about how to operate the node on https://docs.solana.com/running-validator/validator-start and https://github.com/agjell/sol-tutorials/.

Checking the RPC node

The basic check after you've verified that the node has started is to track catchup:

solana catchup --our-localhost

After this you can continue to check that it is serving RPC calls correctly.

Testing RPC access

You can also try a few easy validation commands (thanks buffalu: https://gist.github.com/buffalu/db6458d4f6a0b70ac303027b61a636af):

curl http://localhost:8899 -X POST -H "Content-Type: application/json" -d '
  {"jsonrpc":"2.0","id":1, "method":"getSlot", "params": [
      {
        "commitment": "processed"
      }
    ]}
'

curl http://localhost:8899  -X POST -H "Content-Type: application/json" -d '
  {"jsonrpc":"2.0","id":1, "method":"getSlot"}
'
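
The same getSlot check can be run from Ansible, for example as a smoke test after a deploy. This is a sketch that assumes the default RPC port 8899 and that the task runs on the RPC host itself, since the RPC port is bound to localhost.

```yaml
# Sketch: call getSlot on the local RPC endpoint and print the result.
- hosts: rpc_nodes
  gather_facts: false
  tasks:
    - name: Ask the local RPC endpoint for the current slot
      uri:
        url: http://localhost:8899
        method: POST
        body_format: json
        body:
          jsonrpc: "2.0"
          id: 1
          method: getSlot
        return_content: true
      register: slot

    - name: Show the slot the node reports
      debug:
        msg: "Current slot: {{ slot.json.result }}"
```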

Testing websocket access

The easiest way to test websockets is to install the utility wscat. To do so you'll need to install NodeJS and NPM and then run npm install wscat.

You can then connect to your websocket in the following way:

wscat -c localhost:8900

From there you'll get a command prompt where you can manually enter your websocket subscription requests:

> {"jsonrpc":"2.0", "id":1, "method":"slotSubscribe"}

You should now start receiving regular updates on the slots as they are confirmed by your RPC node.

RPC node falling behind/not catching up

The most typical performance issue that an RPC node can face is that it keeps falling behind the network and is not able to catch up.

If it can't catch up the first time you started it up, this would typically be due to a misconfiguration. The most common issue is your CPU boost frequencies (for more details on CPU config see above):

  • Check that your CPU is recent enough (anything older than EPYC 2nd gen on AMD or Cascade Lake on Intel will struggle)
  • Check that your CPU governor is not set to energy saving mode in BIOS and in your kernel settings
  • Observe the CPU frequencies when running Solana with watch -n 1 grep MHz /proc/cpuinfo. As a rule of thumb, you'll typically need it to be > 3 GHz on all cores. You do not want to see any core dropping to 1.4-1.8 GHz, ever.

If it used to be able to catch up but is no longer (or if fixing the CPU didn't solve it):

  • Check memory/cpu/network - do you have good CPU frequencies, are you dipping into swap (not enough memory) or is your provider throttling UDP packets?
    • CPU: Fix the performance governor/boost setting, or get a newer generation CPU or a CPU with better all-core turbo (check wikichip for details). Remember that MHz is not the same across different generations. Broadwell 3.0 GHz is not the same as Cascade Lake 3.0 GHz or EPYC 3rd gen 3.0 GHz.
    • Network: Check UDP packet throttling and connectivity. You need at least a 500 Mbps pipe without any throttling on UDP. Some providers like to block UDP or throttle it for DDoS protection. This applies to both incoming and outgoing traffic. If you are throttled on incoming your node will not receive shreds from the network in time. Check your firewalls that you are not
    • Memory: Download more RAM. Solana doesn't like to run on swap so if you are regularly dipping into swap you need to fix that. One temporary solution can be to disable the spl-token-owner / spl-token-mint indexes. They have grown really big.
    • Disk: Check that your NVMe for holding ledger and/or accounts isn't dead or dying. A simple dmesg or SMART status query should be able to tell you.
  • There's a bug where the node stays permanently behind after heavy getBlocks calls over RPC. Try restarting the node; if that helps, this may be your issue.
  • Have you tried unplugging it and plugging it in again? Sometimes it can help to clean your ledger and restart.
  • Check your traffic patterns. Certain RPC traffic patterns can easily push your node behind. Maybe you need to add another node and split your RPC traffic or you need to ratelimit your calls to problematic queries like getProgramAccounts.

Access to historical data

By default, when you start the RPC node it will begin building its local ledger from the blocks that it receives over the Solana network. This local ledger starts from the point of the accounts snapshot that you downloaded when your node was starting. If you don't add --no-snapshot-fetch to your solana-validator command line, the validator will often pull a snapshot from the network when it is starting. This will leave holes or gaps in your ledger between the point where you stopped your RPC node and the point at which it downloaded the accounts snapshot. To avoid this, always specify --no-snapshot-fetch after the first time you start the node. Remember that any time you pull a snapshot you will create a hole in the local ledger.

The size of the local ledger is determined by the parameter --limit-ledger-size, which is measured in shreds. A shred is a fixed data unit. The conversion between shreds and blocks is not fixed, as blocks vary in size. Therefore it is very difficult to say how much history, measured in time or in number of blocks, your node will store. You will have to tune it according to your needs. A good starting point can be 250-350 million shreds, which should cover approximately an epoch, which should in turn mean approximately 3 days.

The exact amount of data the RPC node will store also depends on the parameters --enable-cpi-and-log-storage and --enable-rpc-transaction-history. These are necessary for the node to retain and serve full block and transaction data.

Your node can only provide data which it has stored in its local ledger. This means that your history will always begin from the point at which you started the node (actually: the snapshot slot from which you started the node). If the network is currently at slot N and you pulled a snapshot at slot M, then your node will start to rebuild its history between slot M and slot N. This is what happens during catchup: the node processes (replays) everything that happened between M and N until it catches up with the network and can process all the current incoming data.

The node can (in theory) store as much history as you can fit on high speed storage (e.g. if you /don't/ specify --limit-ledger-size or you give it a huge value). However, this doesn't scale back to genesis. To get all history, you can use the built in Google BigTable support. You can set your node to upload data to a Google BigTable instance, where it is permanently available for historical querying. You can also configure your node to serve queries from a BigTable instance. In this case, for any query that the node cannot answer from its local ledger, it makes a request to Google BigTable and, if the data is found there, pulls it from there.

Some RPC providers and the Solana Foundation have copies of BigTable that go back to genesis. For more information about this, see https://github.com/solana-labs/solana-bigtable .

Indexes and performance: or, why is my RPC so slow?

There are three indexes that the Solana validator generates: program-id, spl-token-mint and spl-token-owner. The last two are used to support queries via getTokenAccountsByOwner, getTokenLargestAccounts or getTokenAccountsByDelegate. They are also used to support getProgramAccounts queries which employ specific filters.

These indexes have started to grow huge. If you do not need these queries to be fast for your RPC node, then you should remove them as you will reduce memory usage of your node considerably as well as improve start up times.

If you DO need these RPC calls then you DO need to activate the indexes via the account index flag, otherwise these calls will run intolerably slowly. This will require a lot of RAM. Generally we don't recommend deploying these with less than 512 GB of RAM available.
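
With this role, the set of indexes is controlled by solana_account_index. The sketch below keeps only the program-id index. It assumes that the variable accepts a YAML list, so check defaults/main.yml for the format the role actually expects.

```yaml
# Sketch: drop the two SPL token indexes to reduce memory usage.
# Assumption: solana_account_index is given as a YAML list.
solana_account_index:
  - program-id
```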

An alternative to these might be using Geyser plugins, such as the postgres plugin, that can help speed up queries without relying on in-memory indexes: https://github.com/rpcpool/solana-geyser-park.

Security concerns

Security is a big field and you cannot rely on a small guide in a GitHub repo. Typically, at the very least you should make sure that your RPC server does not expose port 8899 and 8900 directly without any kind of proxy and access control in front of it. An easy way to do this is to use nginx or HAproxy as a reverse proxy. You can add SSL support and authentication in this way through the built in tools of each of these.

To be safe, you can ensure that your rpc-bind-address is set to 127.0.0.1 (the default for this role) so that it will only respond to requests locally.

Other playbooks

Usually you will want to deploy a reverse proxy in front of the Solana RPC. HAproxy is a great option and we have a playbook for configuring HAproxy for a solana rpc server here.

Other guides and docs

These are some other guides, resources and docs written about Solana RPC:

We make no claims as to the accuracy or quality of any of these docs. Please review them and make up your own mind about which docs to follow!

License

MIT

Author Information

This role was originally developed by Triton One. Patches, suggestions and improvements are always welcome.
