Holy Trinity: A System Approach to Tendermint-Based Chain Validation

Polkachu Team | 2022-04-14

When we started our first validator node on a Tendermint-based chain, maintaining a single server seemed like a daunting challenge. Over time, as we expanded our operations, we had to pause once in a while, think from a system perspective, and refactor our operations. Each refactor has allowed us to scale horizontally to more chains, add more services (state-sync, snapshot, public RPC, IBC relaying, etc.), and develop robust monitoring. At the same time, these changes have also reduced our operational complexity and helped us make fewer mistakes.

Now that we are actively validating on close to 20 Tendermint-based chains, we feel it is a good time to summarize our system approach to Tendermint-based chain validation. For one thing, the writing process helps us clarify our own thoughts. For another, we hope it can be helpful to aspiring and established validator operators. A quick warning though: the post can appear very opinionated, since it is specific to how we run our operation. While other operators might not want to adopt our system wholesale, we hope that they can still be inspired by our system approach.

Many thanks to our fellow validators who hang out with us on public/private Discord channels. They have taught us everything. This post is a summary of what we have learned from the community.

Three-Node Infrastructure

While it is often referred to as "running a validation node", we like to think of our job as "running a validation system". In an earlier post, we have already introduced our 3-node infrastructure. It is worth a recap here, as it is the foundation of what follows.

For each Tendermint-based blockchain where we run a validator, we operate 3 nodes:

Validator Node: This node runs on high-grade hardware on a dedicated server. Its only job is to sign blocks; it serves no other purpose.
Backup Node: This node is hosted in a different data center from Validator Node. It serves two purposes. In peacetime, it takes node snapshots and pushes them to the hosting file server. In an emergency, when Validator Node is down and cannot be recovered, it turns off the snapshot service and temporarily becomes the validator until we find a replacement Validator Node. Backup Node has "100/0/10" pruning and "null" indexer settings to keep the storage requirement small.
Relayer Node: This node lives on a relayer hub (a hub hosts many nodes on the same server). It also serves two purposes: for one thing, it serves as the light client to run our IBC relay operations; for another, it serves as a state-sync server. These state-sync servers are open for public consumption. We use the service ourselves to periodically state-sync Backup Node to keep its storage requirement to a minimum. These nodes also expose public RPC and API endpoints, although we like to keep them low-key as they are mostly used for internal consumption. Our own usage of these services serves as a periodic test to ensure that they are always up. This node has "40000/2000/10" pruning, "kv" indexer and "2000/5" snapshot settings.
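As a rough illustration of the settings quoted above, here is a minimal Python sketch that expands the shorthand into config-style keys. It assumes the pruning shorthand reads as keep-recent/keep-every/interval and the snapshot shorthand as interval/keep-recent, and it assumes key names in the style of a Cosmos SDK app.toml from that era; check your own chain's config files before relying on any of these names. The helper names are hypothetical.

```python
from typing import Dict


def parse_pruning(shorthand: str) -> Dict[str, str]:
    """Expand "keep_recent/keep_every/interval" (assumed order) into custom pruning keys."""
    keep_recent, keep_every, interval = shorthand.split("/")
    return {
        "pruning": "custom",
        "pruning-keep-recent": keep_recent,
        "pruning-keep-every": keep_every,
        "pruning-interval": interval,
    }


def parse_snapshot(shorthand: str) -> Dict[str, int]:
    """Expand "interval/keep_recent" (assumed order) into state-sync snapshot keys."""
    interval, keep_recent = (int(part) for part in shorthand.split("/"))
    return {"snapshot-interval": interval, "snapshot-keep-recent": keep_recent}


# Per-role profiles built from the values quoted in the post.
NODE_PROFILES = {
    "backup": {**parse_pruning("100/0/10"), "indexer": "null"},
    "relayer": {
        **parse_pruning("40000/2000/10"),
        "indexer": "kv",
        **parse_snapshot("2000/5"),
    },
}
```

Keeping the profiles in one table like this makes it easy to see at a glance how the roles differ: Backup Node keeps almost nothing, while Relayer Node keeps enough history to serve state-sync snapshots.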

The 3 nodes work together as our validation system, a system that we internally refer to as "Holy Trinity". The system has the following benefits that we will detail one by one in later sections.

Redundancy in Validation Service
Public Services with Efficient Resource Utilization
Rotational Assistance to Reduce Storage Bloat
Layered Monitoring
Cost Optimization
Automate or Die

Redundancy in Validation Service

As a Tendermint-based chain validator, it is important to keep high uptime while eliminating the double-sign risk (which carries a big slashing penalty). While there are sophisticated ways to manage a High-Availability setup through Horcrux or tmkms, we have found it is just easier to keep hot spares in case of disasters.

Here is what we do. We always keep a backup copy of the validator key off Validator Node. If Validator Node fails and we cannot recover it quickly, we switch the validator key to Backup Node by strictly following our checklist. If, God forbid, Backup Node fails at the same time, we turn to Relayer Node as the last resort. All 3 nodes are fully synced at all times, except for a brief period each midnight when Backup Node is down for the node snapshot. All 3 nodes are in different data centers, so it is extremely unlikely that they fail at the same time.

Of course, this system cannot address network-related failures, when Validator Node gets cut off and we are not sure if it is still running or when it will be back online. To be fail-safe in such situations, you will probably need Horcrux. We think such an event is tolerable given our small share of voting power in most networks. Instead, we turn this scenario on its head and use the chance of potential network failures as a selection criterion for cloud infrastructure providers. More details in a later section.
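The failover order and the network-outage caveat described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the checklist itself: the status names and the `next_signer` helper are hypothetical, and the sketch assumes that "unreachable" means the node might still be signing.

```python
from enum import Enum
from typing import Dict, Optional


class Status(Enum):
    HEALTHY = "healthy"
    DOWN = "down"                # confirmed stopped, so the key can be moved safely
    UNREACHABLE = "unreachable"  # cut off from us, but possibly still signing


# Spares are tried in this order, as described in the post.
FAILOVER_ORDER = ["backup", "relayer"]


def next_signer(statuses: Dict[str, Status]) -> Optional[str]:
    """Return the node that should hold the validator key, or None to wait."""
    validator = statuses["validator"]
    if validator is Status.HEALTHY:
        return "validator"
    if validator is Status.UNREACHABLE:
        # The old node may still be signing; promoting a spare risks a double-sign.
        return None
    for name in FAILOVER_ORDER:
        if statuses.get(name) is Status.HEALTHY:
            return name
    return None
```

Returning `None` for the unreachable case encodes the trade-off in the text: some downtime is accepted rather than risking two nodes signing with the same key.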

Public Services with Efficient Resource Utilization

When someone visits our website, they often come away with the impression that we run a vast server infrastructure. Since we offer validation service, node snapshot, state-sync, RPC, REST API, and IBC relaying, we must be running a large server farm for each chain, right? No. We only run 2.x servers for each network.

Validator Node is its own high-quality dedicated server; Backup Node is typically a low-quality cloud VPS instance; and Relayer Node resides on a shared high-quality dedicated server with many other Relayer Nodes. Validator Node's sole purpose is to be a validator. Backup Node is the backbone of our Node Snapshot service. Relayer Node is where we offer IBC Relaying, State-Sync, the RPC endpoint and the REST API endpoint. Such a setup allows us to offer all these services with only 2.x servers.
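To make the "2.x" figure concrete, here is a small arithmetic sketch. It assumes each network gets one full Validator Node server, one full Backup Node server, and an equal share of a relayer hub; the hub sizes of 5 and 10 come from the range given later in this post. The function name is hypothetical.

```python
def servers_per_network(nodes_on_relayer_hub: int) -> float:
    """One validator server + one backup server + an equal share of the relayer hub."""
    if nodes_on_relayer_hub < 1:
        raise ValueError("a relayer hub hosts at least one node")
    return 1 + 1 + 1 / nodes_on_relayer_hub


# A hub shared by 5 networks costs each network about 2.2 servers;
# a hub shared by 10 networks brings that down to about 2.1.
for hub_size in (5, 10):
    print(hub_size, round(servers_per_network(hub_size), 2))
```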

Disclaimer: We understand it is not ideal to mix IBC relayers with other services. However, we currently offer IBC relaying as a voluntary backup for well-established channels behind major relay operators. We feel we can get away with this poor man's version. If we have more resources in the future, we will separate them out with a 4-node system. By then, our Holy Trinity will evolve into Unbreakable Square.

Rotational Assistance to Reduce Storage Bloat

Tendermint-based chains have a storage problem. Every node will eventually fill up with old states, which are obsolete unless you run an archive node. Some chains grow a few hundred MB per day, while Osmosis currently grows 7 GB per day even with the most aggressive pruning.
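A quick back-of-the-envelope sketch shows why this matters. The 7 GB/day growth rate is the Osmosis figure quoted above; the 1000 GB of free disk is a purely hypothetical number chosen for illustration, and both function names are made up for this example.

```python
def monthly_growth_gb(growth_gb_per_day: float, days: int = 30) -> float:
    """Storage added over a month at a constant daily growth rate."""
    return growth_gb_per_day * days


def days_until_full(free_gb: float, growth_gb_per_day: float) -> float:
    """Days before the free space is exhausted at a constant daily growth rate."""
    if growth_gb_per_day <= 0:
        raise ValueError("growth rate must be positive")
    return free_gb / growth_gb_per_day


# At 7 GB/day a node gains 210 GB per month.
print(monthly_growth_gb(7))
# A hypothetical 1000 GB of free disk would last roughly 143 days.
print(round(days_until_full(1000, 7)))
```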

We solve the storage-bloat problem by periodically re-syncing nodes with the help of other nodes in the system.

Once a month (or more frequently if needed), we re-sync Backup Node, because we want to keep the public-facing node snapshot as small as possible. We pause Backup Node, delete all old states with the "unsafe-reset-all" command, and state-sync with the help of Relayer Node. Most likely this succeeds and the step is done. In rare instances, state-sync fails. In that case, we first restore Backup Node from the latest node snapshot on the file server (made within the last 24 hours by Backup Node itself), so that the 3-node redundancy is back as quickly as possible. We then spin up a test server to troubleshoot the state-sync issue. Hopefully we fix the issue quickly and can then state-sync Backup Node. Once Backup Node is re-synced, we manually trigger a snapshot, which is typically under 1 GB at that time.
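State-sync needs a trusted height and block hash before it can start. The sketch below shows one common way to derive them from an RPC node's `/block` responses, assuming the usual Tendermint RPC JSON shape and the `[statesync]` keys `enable`, `rpc_servers`, `trust_height` and `trust_hash` in config.toml. The `fetch_block` callable is a hypothetical stand-in for an HTTP call to Relayer Node, and the default offset of 2000 blocks is an assumption that simply mirrors the snapshot interval quoted earlier.

```python
from typing import Callable, Dict, Optional


def statesync_params(
    fetch_block: Callable[[Optional[int]], dict],
    rpc_servers: str,
    offset: int = 2000,
) -> Dict[str, object]:
    """Build [statesync] values by trusting a block `offset` heights behind the tip.

    fetch_block(None) must return the latest /block response and
    fetch_block(h) the /block response at height h.
    """
    latest = int(fetch_block(None)["result"]["block"]["header"]["height"])
    trust_height = latest - offset
    if trust_height < 1:
        raise ValueError("chain is shorter than the requested offset")
    trust_hash = fetch_block(trust_height)["result"]["block_id"]["hash"]
    return {
        "enable": True,
        "rpc_servers": rpc_servers,  # comma-separated list of RPC addresses
        "trust_height": trust_height,
        "trust_hash": trust_hash,
    }
```

Passing the fetcher in as a parameter keeps the logic testable without a live node; in practice it would wrap a request to the state-sync server's RPC endpoint.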

Then we ask ourselves: Do we need to reduce the storage bloat for Validator Node and/or Relayer Node at this time? Most likely the answer is no, as we typically have large disk spaces for these two nodes and we are not as concerned about storage-bloat since they do not produce public-facing snapshots. Once in a while, however, the answer will be yes, because no amount of storage space can satisfy the disk greed of the Tendermint God.

If we are to re-sync Validator Node, we will follow our checklist to temporarily swap the validator key to Backup Node (thus promoting it to be a validator), delete all old states from Validator Node, either download the latest tiny snapshot or state-sync with the help of Relayer Node (note that this state-sync process is guaranteed to work since we have just done it with Backup Node), and then swap the validator key back to Validator Node.
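The one rule that must never be broken during this swap is that the validator key is active on at most one node at a time. Here is a minimal sketch of the sequence with that invariant enforced in code. It is an illustration only, not the checklist itself: the `KeyLedger` class, the `resync_validator` function and the step descriptions are hypothetical.

```python
from typing import List, Optional


class KeyLedger:
    """Tracks which node is signing and refuses any step that would create two signers."""

    def __init__(self, active: Optional[str]) -> None:
        self.active = active

    def deactivate(self, node: str) -> None:
        if self.active != node:
            raise RuntimeError(f"{node} is not the active signer")
        self.active = None

    def activate(self, node: str) -> None:
        if self.active is not None:
            raise RuntimeError(f"double-sign risk: {self.active} is still signing")
        self.active = node


def resync_validator(ledger: KeyLedger) -> List[str]:
    """Walk through the re-sync steps from the text, returning a log of what was done."""
    log = []
    ledger.deactivate("validator")
    log.append("stop signing on Validator Node")
    ledger.activate("backup")
    log.append("promote Backup Node with the validator key")
    log.append("delete old states on Validator Node")
    log.append("restore Validator Node from snapshot or state-sync")
    ledger.deactivate("backup")
    log.append("stop signing on Backup Node")
    ledger.activate("validator")
    log.append("return the validator key to Validator Node")
    return log
```

Every activation is preceded by a deactivation, so a step executed out of order fails loudly instead of silently producing two signers.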

If we are to re-sync Relayer Node, we will temporarily disable IBC relaying, download the latest tiny snapshot and restart. This process can be a bit more careless than that of Validator Node, because there is no validator key to manage and no slashing risk to worry about.

After all is done, all 3 nodes should have a small storage footprint. They get to live to fight the Tendermint God another day.

Layered Monitoring

You cannot spell "validator" without l-a-y-e-r-e-d m-o-n-i-t-o-r-i-n-g. Okay, we totally made that up, but it does not diminish the universal truth of the statement. It is very important to set up a layered monitoring system. This way, a non-critical issue triggers one alert but a critical issue triggers multiple alerts.

Our monitoring system has many different components. Different components all send data to a central monitor server via either pull (e.g. the monitor server scrapes Prometheus endpoints) or push (e.g., each node uses Promtail to push logs to the monitor server's Loki endpoint). The central monitor server uses Grafana to visualize the data, Alert Manager to trigger alerts and Loki to receive and organize logs.

Overall, we have the following 6 layers of monitoring:

1. General server performance: We monitor server performance (CPU, memory, storage, I/O) using node-exporter. It sits on all 3 nodes.
2. Tendermint node performance: We monitor a Tendermint node using the default Prometheus output that comes with a standard installation. It sits on all 3 nodes.
3. Validator performance: We monitor if a validator has missed a threshold amount of blocks using the excellent TenderDuty. Counter-intuitively, it should NOT sit on Validator Node. For one thing, Validator Node should do as little work as possible other than signing blocks. More importantly, if Validator Node gets cut off from the network, we need TenderDuty to still send alerts. All things considered, we put this monitoring on Relayer Node only.
4. Tendermint chain performance: We monitor the overall chain performance using Cosmos-Exporter and this Grafana template. This component monitors the whole active set of validators. We do not really need this one for our own validator, since it is well covered by TenderDuty. However, we use it on an ad-hoc basis for other validators. At times, we will send a friendly manual alert to them when we see something unusual on their nodes. This component sits on Relayer Node only.
5. Bot wallet balance: We use many bot wallets and need to keep them refilled whenever the balance is low. Some of them are used for IBC relayers and others are used for Authz-enabled restaking with the help of the excellent restake.app. In total, we have close to 30 bot wallets across scores of networks. It will only grow in the future. Since it is a pain to monitor them individually, we consolidate the wallet balance monitoring into one single Grafana dashboard. This monitoring component uses Cosmos-Exporter and a customized version of this Grafana template. This component sits on Relayer Node only.
6. Log monitoring: Each node uses Promtail to push logs to the monitor server's Loki endpoint. This way, all logs are in one place. It is easy to search and perfect for live monitoring during node upgrades or troubleshooting. This component sits on all 3 nodes.
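To give a feel for what the Prometheus layer sees, here is a minimal sketch that parses a node's text-format metrics and checks whether the chain height is advancing between two scrapes. The metric name `tendermint_consensus_height` is an assumption based on Tendermint's default metrics namespace, and both helper functions are hypothetical. In a real setup Prometheus and Alert Manager do this work; the sketch only illustrates the idea.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition into {metric_name: value}.

    Simplifications: labels are dropped (a later series with the same name
    overwrites an earlier one) and lines are assumed to carry no timestamp.
    """
    values: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        try:
            values[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return values


def is_stalled(
    previous: Dict[str, float],
    current: Dict[str, float],
    metric: str = "tendermint_consensus_height",
) -> bool:
    """True if the height did not increase between two scrapes."""
    return current.get(metric, 0.0) <= previous.get(metric, 0.0)
```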

Components 1, 2 and 3 all have email or PagerDuty alerts attached. When a non-critical issue happens, one alert will fire. For example, if Backup Node crashes while its underlying server is still functional, the alert attached to Component 2 will fire. When a critical issue happens, all three will fire. For example, if Validator Node gets cut off from the network, the alerts attached to all 3 components will fire.
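The "one alert versus three alerts" idea can be sketched as a tiny classifier. The layer names are hypothetical labels for the three alerting components, and the handling of exactly two firing layers is an assumption, since the text only describes the one-alert and three-alert cases.

```python
from typing import Iterable

# Hypothetical labels for the three alerting components.
ALERTING_LAYERS = ("server", "node", "validator")


def classify(firing: Iterable[str]) -> str:
    """Map the set of firing alert layers to a rough severity."""
    count = len(set(firing) & set(ALERTING_LAYERS))
    if count == 0:
        return "ok"
    if count == 1:
        return "non-critical"
    if count == len(ALERTING_LAYERS):
        return "critical"
    return "investigate"  # two layers firing: not covered by the text, our assumption
```

For example, a crashed node process on a healthy server yields `classify({"node"})`, which is "non-critical", while a Validator Node cut off from the network trips all three layers and yields "critical".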

Components 4, 5, and 6 do not have alerts attached. They are monitored on an ad-hoc basis.

While we are proud of our monitoring system, there are many areas for improvement. One major hole in our system is the lack of monitoring for our IBC relaying. Right now, we just watch our IBC relayers' bot wallets as a proxy. When the balances go down, we know the relayers are working. Please let us know if you have better ways to monitor IBC relayers.
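The wallet-balance proxy can be sketched as follows. Given a window of balance samples per wallet, the function flags wallets that need a refill and wallets whose balance has not dropped, which for a relayer wallet may mean no fees were spent. The wallet names, the sample values and the refill threshold in the usage example are all hypothetical, and a refill inside the window would make a busy wallet look idle.

```python
from typing import Dict, List


def wallet_report(
    samples: Dict[str, List[float]], refill_below: float
) -> Dict[str, List[str]]:
    """Flag wallets that are low, and wallets whose balance did not decrease.

    samples maps a wallet name to balances ordered oldest to newest.
    """
    needs_refill = sorted(
        name for name, series in samples.items()
        if series and series[-1] < refill_below
    )
    possibly_idle = sorted(
        name for name, series in samples.items()
        if len(series) >= 2 and series[-1] >= series[0]
    )
    return {"needs_refill": needs_refill, "possibly_idle": possibly_idle}


# Hypothetical balances: wallet-b has not spent anything, wallet-c is running low.
report = wallet_report(
    {
        "wallet-a": [10.0, 9.5, 9.1],
        "wallet-b": [5.0, 5.0, 5.0],
        "wallet-c": [0.4, 0.3, 0.2],
    },
    refill_below=1.0,
)
```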

Cost Optimization

Many think that a validator's performance is a function of its infrastructure spending. We like to think about this question from a different angle. Because we have a validation system, it is much more productive to analyze what each node should be optimized for and then plan infrastructure spending accordingly. The result is that we are able to be a high-performing validator (never slashed for downtime or double-sign) on a small budget (under $60/month for a small network and $100-150/month for a large network).

Validator Node: It is a given that Validator Node requires a high-quality dedicated server. We typically follow each chain team's hardware guideline and use TenderDuty to ensure that our server has sufficient specs. However, we see no need to go overboard on hardware specs. In our view, the most important element to optimize for is the cloud provider's network reliability. As discussed above, because of our validation system, we can easily tolerate faults related to Validator Node's server. On the other hand, we can only wait and pray during a network outage, when we do not know if the node is alive or when connectivity will be restored. For the same reason, it is never a good idea to host Validator Node at home.

Backup Node: In our validation system, Backup Node only needs to be alive on an as-needed basis. To push the logic to the extreme, in peacetime it only needs to be alive a few minutes a day for the snapshot to be taken. Therefore, you can get away with an extremely cheap shared cloud VPS. For this node, the most important element to optimize for is its data center location. It has to sit in a different data center from Validator Node. When Validator Node's data center announces scheduled maintenance, you can switch the key to Backup Node and just shut down Validator Node during the maintenance. It is passable to host this node at home, although we do not recommend it.

Relayer Node: For this server, the most important element to optimize for is its room for future growth. We typically put 5-10 Tendermint-based nodes on the same server. Sometimes we stretch one server to the extreme before splitting its nodes across two servers. For this server, we go for a dedicated server (so no CPU stolen by your neighbors) with the best performance-to-price ratio. While it is hard to be specific, here is a rule of thumb: This server should be a total waste if you only plan to run 1 node on it, but it should not break the bank.

Automate or Die

We follow the time-tested method of PDD: Pain-Driven Development. Because we run 3 nodes for each network (3x20 nodes since we operate on 20 networks), there is a lot of painfully repetitive work. On top of that, as described above, we periodically re-sync all these nodes to manage Tendermint's storage bloat. It gets overwhelmingly painful over time.

Whenever we repeat the same task 3 times, we try our best to automate and we open-source these scripts.

1. Secure Server Setup
2. Monitor Server Setup
3. Tendermint Node Installation
4. Node Snapshot Script, State-Sync Script, Auto-Compound Script, Hermes Relayer Config Script: All in the same repo as 3, but with different Ansible playbooks

Again, if you think that you are running a "validation node", the above automation can easily be dismissed as out of scope. However, if you think of it as running a "validation system", such automation becomes critical to keep the system running without pushing the team to the edge. Moreover, automation makes deployments less error-prone and the execution environment more predictable. This can be as critical as ensuring that every server has a watertight security setup, or as trivial as always having a custom shortcut "system" to go to the "systemd" folder because we go there so often.

Final Words

As mentioned above, this post is very opinionated because it is based on how we validate on Tendermint-based chains. For many established validators who read this post, many of our processes are either unique or different. In extreme cases, they can be regarded as "wrong" (for example, we do not run sentry nodes).

That said, we hope that your key takeaway is our mental model to think of the validation service as a system. In this system, you have several elements working together to ensure high performance, low downtime and no errors. In this system, you get to optimize cost without sacrificing performance. In this system, you have layered monitoring to always stay on top without the need to always stay awake. In this system, you automate, refactor and free up time to expand operations to new chains. This way, you grow your operations to the whole Interchain Galaxy and get to face the Tendermint God as the final boss.

P.S.: This article is dedicated to many fellow validators who have taught us all. Any errors are our own though. Many thanks to Stakepile, windpowerstake, Binary Holdings, and NosNode for supporting us on Osmosis. Their generous support has reduced our daily grind to stay alive on a major network, so we have time to think, write and at times shitpost.