Our Node Snapshot Architecture

Polkachu Team | 2022-03-18

In the past few months, we have developed a robust infrastructure to take and host node snapshots for 3 Substrate-based blockchains and 16 Tendermint-based blockchains. Our scripts are open-sourced and our hosting website has clear and easy-to-follow instructions. The service has made life easier for many validators and node operators.

In this post, we lay out the overall architecture of our snapshot service and put all our open-source resources together in one place. For one thing, this makes our work more transparent to the community; for another, we hope it inspires others to develop complementary services so that node snapshots become more redundant. We use Tendermint-based chains for illustration in this post, but a similar approach applies to Substrate-based chains.

Overall Node Operations as Validator

Before we dive into the snapshot service specifically, it is important to understand our 3-node infrastructure. For each Tendermint-based blockchain where we run an active/inactive validator, we run 3 nodes:

A validator node: This node runs on high-grade hardware. It only signs blocks and serves no other purpose.
A backup/snapshot node: This node is hosted in a different data center from the validator node. It serves two purposes: in normal times, it takes node snapshots and pushes them to the hosting file server; in an emergency, when the validator node is down and cannot be recovered, it turns off the snapshot service and temporarily becomes the validator node until we find a replacement. The node has "100/0/10" pruning and "null" indexer settings to keep the storage requirement small, since it is used for snapshots.
A relayer node: This node attaches to a relayer hub. It also serves two purposes: it serves as the light client for our IBC relay operations, and it serves as a state-sync server. The state-sync server is open for public use. We use the service ourselves to periodically state-sync our backup/snapshot node to keep its storage requirement to a minimum. After all, we want the snapshot service to provide small but viable snapshots to the community. At the same time, our own usage serves as a periodic test to ensure that the state-sync service is always up. This node has "40000/2000/10" pruning, "kv" indexer and "2000/5" snapshot settings.
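
As a rough illustration of the settings quoted above, the sketch below writes them into a Cosmos SDK style app.toml and a Tendermint config.toml. It assumes that the three pruning numbers map to pruning-keep-recent, pruning-keep-every and pruning-interval under pruning = "custom", and that "2000/5" maps to snapshot-interval and snapshot-keep-recent. The set_toml_key helper and both function names are hypothetical, and this is not necessarily how the playbooks apply the values.

```shell
#!/usr/bin/env bash
# Sketch only: set_toml_key is a hypothetical helper and the key mapping is assumed.
set -euo pipefail

# Overwrite the value of an existing `key = value` line in a TOML file.
set_toml_key() {
  local file="$1" key="$2" value="$3"
  sed -i.bak -E "s|^${key} *=.*|${key} = ${value}|" "$file"
}

# Backup/snapshot node: "100/0/10" pruning and "null" indexer.
apply_snapshot_node_settings() {
  local app_toml="$1" config_toml="$2"
  set_toml_key "$app_toml" pruning '"custom"'
  set_toml_key "$app_toml" pruning-keep-recent '"100"'
  set_toml_key "$app_toml" pruning-keep-every '"0"'
  set_toml_key "$app_toml" pruning-interval '"10"'
  set_toml_key "$config_toml" indexer '"null"'
}

# Relayer node: "40000/2000/10" pruning, "kv" indexer, "2000/5" snapshot settings.
apply_relayer_node_settings() {
  local app_toml="$1" config_toml="$2"
  set_toml_key "$app_toml" pruning '"custom"'
  set_toml_key "$app_toml" pruning-keep-recent '"40000"'
  set_toml_key "$app_toml" pruning-keep-every '"2000"'
  set_toml_key "$app_toml" pruning-interval '"10"'
  set_toml_key "$app_toml" snapshot-interval 2000
  set_toml_key "$app_toml" snapshot-keep-recent 5
  set_toml_key "$config_toml" indexer '"kv"'
}
```

Each function takes the paths to app.toml and config.toml as its two arguments. The helper only rewrites keys that already exist in the file and leaves a .bak copy next to each file it touches.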

The core idea of our 3-node infrastructure is that the validator node and relayer node should be separate and always up, while the backup/snapshot node can have downtime. We use such downtime to take daily node snapshots, and to periodically reset the node with state-sync to keep the snapshot size small.
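
To make the state-sync reset a little more concrete, here is a minimal sketch of the values a node needs before it can state-sync: a trust height some distance behind the chain tip, the block hash at that height, and at least two RPC servers. The function names and the 2000-block offset are assumptions for illustration; the latest height and the hash are expected to be fetched from the RPC server beforehand.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the default offset are assumptions.

# Pick a trust height a fixed number of blocks behind the chain tip.
trust_height() {
  local latest="$1" offset="${2:-2000}"
  local height=$(( latest - offset ))
  if (( height < 1 )); then
    height=1
  fi
  echo "$height"
}

# Print [statesync] lines for config.toml from already-fetched values.
# The same RPC server is listed twice because two entries are required.
statesync_settings() {
  local rpc="$1" height="$2" hash="$3"
  printf 'enable = true\n'
  printf 'rpc_servers = "%s,%s"\n' "$rpc" "$rpc"
  printf 'trust_height = %s\n' "$height"
  printf 'trust_hash = "%s"\n' "$hash"
}
```

The trust hash must be the hash of the block at the chosen trust height, so both values should come from the same query against the state-sync server.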

Our Snapshot Service

Our snapshot service has three components. Here we briefly introduce each component, and then break them down in more detail in later sections.

The snapshot node: As explained above, this node is always fully synced but can be temporarily paused to take a node snapshot or be reset with state-sync. It uses a scheduled cron task to take one snapshot a day and push it to the file servers.
The file servers: We use two file servers to host all snapshots. These servers need large storage and high network bandwidth, but they do not require much CPU or memory. Our file storage software is Minio, which is S3 compatible.
The front-end web server: This server hosts the site that our end users directly interact with. It communicates with the file servers to retrieve the snapshot information; it hosts our step-by-step instructions on how to process the snapshots; and it also runs tasks to prune old snapshots.

Snapshot Node

This node is quite straightforward for validators. It is just a fully synced node with a cron job that takes a snapshot and pushes it to the file server daily. All our deployment scripts are open-sourced in this GitHub repo.

For the node deployment, we have one playbook for each Tendermint-based chain. For example, the command below deploys an Osmosis node to a server named osmosis_main defined in the inventory:

ansible-playbook -i inventory osmosis.yml -e "target=osmosis_main"

To take snapshots and push them to the file servers, the repo has another playbook that deploys a shell script to the destination server. For example:

ansible-playbook -i inventory snapshot.yml -e "target=osmosis_main"

Finally, we add a cron job to the server. For example, if we want to take the snapshot at midnight each day, we add the following cron task:

0 0 * * * /bin/bash /home/<USER>/snapshot.sh
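
For readers who want a feel for what such a script does, here is a minimal sketch of the stop, archive, restart and upload sequence. It is not the script shipped in the repo. The service name, node home directory, the myminio/snapshots target, the gzip archive format and every function name are assumptions, and the upload uses the Minio client command mc cp. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a daily snapshot script, not the one shipped in the repo.
set -euo pipefail

# Assumed names: systemd unit, node home directory, and an `mc` alias/bucket.
SERVICE="${SERVICE:-osmosisd}"
NODE_HOME="${NODE_HOME:-${HOME:-/tmp}/.osmosisd}"
MINIO_TARGET="${MINIO_TARGET:-myminio/snapshots}"

# With DRY_RUN=1, print each command instead of running it.
run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Archive name built from the chain name and block height.
snapshot_name() {
  echo "${1}_${2}.tar.gz"
}

# Stop the node, archive its data directory, restart the node, then upload.
take_snapshot() {
  local chain="$1" height="$2" file status=0
  file="/tmp/$(snapshot_name "$chain" "$height")"
  run sudo systemctl stop "$SERVICE"
  # Capture a tar failure so the node is restarted either way.
  run tar -C "$NODE_HOME" -czf "$file" data || status=$?
  run sudo systemctl start "$SERVICE"
  if (( status != 0 )); then
    echo "tar failed with status $status" >&2
    return "$status"
  fi
  run mc cp "$file" "$MINIO_TARGET/"
  run rm -f "$file"
}
```

A call such as take_snapshot osmosis 3500000 with DRY_RUN=1 lists the five commands in order without touching the node. The block height is passed in as an argument here; a real script would read it from the node before stopping the service.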

File Server

We use Minio as our file server software. It is S3 compatible, so its API is easy to interact with using many open-source libraries. It has a distributed mode so that you can deploy the software in a cluster for high availability and redundancy (we do not use this feature yet). It also has an admin portal so an administrator can directly interact with the files in a web UI.

Deploying Minio is rather straightforward if you follow its documentation. After deploying the software a few times, we decided to write an Ansible playbook to make our life easier. We hope it is helpful for others who want to deploy the same software with Ansible.

Front-End Web Server

This is the only part of our stack that is closed-source, mostly for security reasons. As the server hosts some sensitive environment variables as well as some user info, we want this code base to be as obscure as possible. However, here is the high-level summary:

It is a PHP Laravel server-side application. We also sprinkle some VueJS to make user interaction better when needed.
We use a popular PHP S3 adapter library to interact with the Minio file server API.
We use Laravel Task Scheduler to serve as a cron job wrapper to prune the old snapshots and free up the file server storage.
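
The pruning task itself lives in the closed-source application, so the following is only a shell sketch of one plausible retention rule: keep the snapshots with the highest block heights and list the rest for deletion. The function name, the chain_height.tar.gz naming scheme and the retention count are all assumptions, and the input is expected to cover a single chain whose name contains no underscore.

```shell
#!/usr/bin/env bash
# Sketch only: a hypothetical retention rule, not the Laravel task itself.

# Read object names such as "osmosis_3500000.tar.gz" on stdin and print
# the ones to delete, keeping the `keep` highest block heights.
snapshots_to_prune() {
  local keep="$1"
  # Sort by the numeric height after the underscore, newest first,
  # then skip the first `keep` lines.
  sort -t_ -k2,2nr | tail -n +"$(( keep + 1 ))"
}
```

The names printed by the function could then be passed to whatever delete call the storage client offers.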

Besides trying our best to have a clean web UI, we also spend much time refining our instructions to make them clear and easy to follow. At the time of this writing, our snapshot services support 16 Tendermint-based blockchains.

Final Words

The above architecture is by no means final. In fact, we have been evolving our snapshot setup quite a lot in the past few months. Every time we review it, we find room for improvement. Over time, however, we have found that the pace of change is slowing, and we have reached a comfortable spot to share what we have learned to benefit the community at large. We hope you find this post and our open-source repos helpful. If you have any feedback, please reach out to us via email at [email protected] or directly submit a GitHub PR. Thanks for reading.