Polkachu Team | 2022-03-18
In the past few months, we have developed a robust infrastructure to take and host node snapshots for 3 Substrate-based blockchains and 16 Tendermint-based blockchains. Our scripts are open-sourced, and our hosting website has clear, easy-to-follow instructions. The service has made life easier for many validators and node operators.
In this post, we lay out the overall architecture of our snapshot service and put all our open-sourced resources together in one place. For one thing, this makes our work more transparent to the community; for another, we hope it inspires others to develop complementary services so that node snapshots become more redundant. We use Tendermint-based chains for illustration in this post, but a similar approach applies to Substrate-based chains.
Before we dive into the snapshot service specifically, it is important to understand our 3-node infrastructure. For each Tendermint-based blockchain where we run an active/inactive validator, we run 3 nodes:
The core idea of our 3-node infrastructure is that the validator node and relayer node should be separate and always up, while the backup/snapshot node can have downtime. We use this downtime to take daily node snapshots and to periodically reset the node with state-sync, which keeps the snapshot size small.
Our snapshot service has three components. Here we briefly introduce each component, and then we break them down in more detail in later sections.
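As a rough illustration of the state-sync reset, the sketch below rewrites the `[statesync]` keys in a Tendermint `config.toml`. The function name `enable_statesync` and its arguments are hypothetical, and the trust height and hash are assumed to have been looked up from a trusted RPC endpoint beforehand. State-sync only works on a node with empty state, so the existing data would also have to be wiped first; that step is deliberately left out.

```shell
#!/bin/sh
# Sketch only: switch on state-sync in a Tendermint config.toml.
# enable_statesync is a hypothetical helper, not taken from any repo.
# Arguments: CONFIG_FILE RPC_SERVERS TRUST_HEIGHT TRUST_HASH
enable_statesync() {
    config="$1"; rpc="$2"; height="$3"; hash="$4"
    tmp="${config}.tmp"
    # Rewrite each key in place; "|" is the sed delimiter so URLs need no escaping.
    sed \
        -e "s|^enable *=.*|enable = true|" \
        -e "s|^rpc_servers *=.*|rpc_servers = \"${rpc}\"|" \
        -e "s|^trust_height *=.*|trust_height = ${height}|" \
        -e "s|^trust_hash *=.*|trust_hash = \"${hash}\"|" \
        "$config" > "$tmp" && mv "$tmp" "$config"
}
```

After the node has caught up via state-sync, its data directory is much smaller than one that has accumulated history for months, which is what keeps the daily snapshot small.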
This node is quite straightforward for validators. It is just a fully synced node with a cron job that takes a snapshot and pushes it to the file server daily. All our deployment scripts are open-sourced in this GitHub repo.
For the node deployment, we have one playbook for each Tendermint-based chain. For example, the command below deploys an Osmosis node to a server named osmosis_main defined in the inventory:
ansible-playbook -i inventory osmosis.yml -e "target=osmosis_main"
To take snapshots and push them to the file server, the repo has another playbook that deploys a shell script to the destination server. For example:
ansible-playbook -i inventory snapshot.yml -e "target=osmosis_main"
Finally, we add a cron job to the server. For example, if we want to take the snapshot at midnight each day, we add the following cron task:
0 0 * * * /bin/bash /home/<USER>/snapshot.sh
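For readers who want a feel for what such a snapshot script could look like, here is a minimal sketch. It is written under assumptions and is not the script from the repo: the function names, the gzip archive format, the `data` directory under the node home, and the systemd service name are all illustrative.

```shell
#!/bin/sh
# Sketch only: every name and path below is an assumption.

# Build a file name such as osmosis_20220318.tar.gz from a chain name and a date stamp.
snapshot_name() {
    printf '%s_%s.tar.gz\n' "$1" "$2"
}

# Archive the "data" directory found under NODE_HOME into OUT_FILE.
make_snapshot() {
    node_home="$1"; out_file="$2"
    tar -czf "$out_file" -C "$node_home" data
}

# Full run: stop the node, archive its data, start the node again.
# Defined here but not called; a cron entry would invoke it.
run_snapshot() {
    service="$1"; node_home="$2"; out_dir="$3"; chain="$4"
    out_file="$out_dir/$(snapshot_name "$chain" "$(date +%Y%m%d)")"
    sudo systemctl stop "$service"
    make_snapshot "$node_home" "$out_file"
    status=$?
    sudo systemctl start "$service"   # restart even if archiving failed
    return "$status"
}
```

Stopping the node before archiving avoids copying a database that is being written to, which is why this job belongs on the backup/snapshot node rather than on the validator.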
We use Minio as our file server software. It is S3 compatible, so its API is easy to interact with using many open-source libraries. It has a distributed mode, so you can deploy the software in a cluster for high availability and redundancy (we do not use this feature yet). It also has an admin portal, so an administrator can interact with the files directly in a web UI.
Deploying Minio is rather straightforward if you follow its docs. We deployed the software a few times and decided to write an Ansible playbook to make our life easier. We hope it is helpful for others who want to deploy the same software with Ansible.
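As one hedged example of interacting with the file server, the sketch below uses the MinIO client `mc` to register a server alias and copy a file into a bucket. The function names, alias, bucket, and the `MINIO_URL`, `MINIO_ACCESS_KEY`, and `MINIO_SECRET_KEY` environment variables are assumptions for illustration, and the download URL helper assumes path-style addressing on a bucket that allows anonymous reads.

```shell
#!/bin/sh
# Sketch only: assumes the MinIO client `mc` is installed.
# Alias, bucket, and environment variable names are made up for illustration.

# Path-style URL of an object: <base>/<bucket>/<key>
object_url() {
    printf '%s/%s/%s\n' "${1%/}" "$2" "$3"
}

# Register the server under an alias, then copy one file into a bucket.
upload_snapshot() {
    file="$1"; alias_name="$2"; bucket="$3"
    mc alias set "$alias_name" "$MINIO_URL" "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY" &&
        mc cp "$file" "$alias_name/$bucket/"
}
```

Any other S3-compatible client would work the same way, since only the endpoint and credentials are specific to the server.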
This is the only part of our stack that is closed-source, mostly for security reasons. As the server hosts some sensitive environment variables as well as some user info, we want this code base to be as obscure as possible. However, here is the high-level summary:
Besides trying our best to have a clean web UI, we also spend much time refining our instructions to make them clear and easy to follow. At the time of this writing, our snapshot service supports 16 Tendermint-based blockchains.
The above architecture is by no means final. In fact, we have been evolving our snapshot setup quite a lot in the past few months. Every time we review it, we find room for improvement. Over time, however, the pace of change has slowed, and we have reached a comfortable spot to share what we have learned with the community at large. We hope you find this post and our open-source repos helpful. If you have any feedback, please reach out to us via email at [email protected] or directly submit a GitHub PR. Thanks for reading.