Polkachu Team | 2022-04-14
When we started our first validator node on a Tendermint-based chain, maintaining a single server seemed like a daunting challenge. Over time, as we expanded our operations, we had to pause once in a while, think from a system perspective, and refactor our operations. Each refactor has allowed us to scale horizontally to more chains, add more services (state-sync, snapshots, public RPC, IBC relaying, etc.), and develop robust monitoring. At the same time, these changes have reduced our operational complexity and helped us make fewer mistakes.
Now that we are actively validating on close to 20 Tendermint-based chains, we feel it is a good time to summarize our system approach to Tendermint-based chain validation. For one thing, the writing process helps us clarify our own thoughts. For another, we hope it can be helpful to aspiring and established validator operators. A quick warning though: the post can appear very opinionated since it is specific to how we run our operation. While other operators might not want to adopt our system wholesale, we hope they can still be inspired by our system approach.
Many thanks to our fellow validators who hang out with us on public/private Discord channels. They have taught us everything. This post is a summary of what we have learned from the community.
While it is often referred to as "running a validation node", we like to think of our job as "running a validation system". In an earlier post, we introduced our 3-node infrastructure. It is worth a recap here, as it is the foundation of what follows.
For each Tendermint-based blockchain where we run a validator, we operate 3 nodes:
- Validator Node: its sole purpose is to sign blocks as the validator
- Backup Node: a fully synced hot spare that also produces our daily node snapshot
- Relayer Node: serves IBC relaying, state-sync, public RPC and REST API
The 3 nodes work together as our validation system, a system that we internally refer to as the "Holy Trinity". The system has several benefits, which we detail one by one in the sections below.
As a Tendermint-based chain validator, it is important to maintain high uptime while eliminating the double-sign risk (big slashing). While there are sophisticated ways to manage a high-availability setup with Horcrux or tmkms, we have found it is simply easier to keep hot spares in case of disasters.
Here is what we do. We always keep a backup copy of the validator key off of Validator Node. If Validator Node fails and we cannot recover it in a short time, we switch the validator key to Backup Node by strictly following our checklist. If, God forbid, Backup Node fails at the same time, we turn to Relayer Node as the last resort. All 3 nodes are fully synced at all times, except for a brief period each midnight when Backup Node is down for its node snapshot. All 3 nodes sit in different data centers, so it is extremely unlikely that they all fail at the same time.
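For illustration, here is a minimal sketch of what that failover step could look like if scripted in Python. The host names, key path, and systemd service name are made up for this example, and the real procedure is a manually executed checklist; the one non-negotiable detail the sketch captures is that the old Validator Node must be stopped and disabled before the key goes live anywhere else.

```python
#!/usr/bin/env python3
"""Hypothetical failover sketch: promote Backup Node to validator.

Host names, paths, and the systemd service name below are placeholders.
The real process is a manual checklist; the critical ordering is that the
old validator is stopped AND disabled before the key signs anywhere else.
"""
import subprocess

SERVICE = "appd"                                           # assumed service name
KEY_DST = "/home/app/.app/config/priv_validator_key.json"  # assumed key path
OLD_HOST = "validator-node.example.com"                    # assumed hosts
NEW_HOST = "backup-node.example.com"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Stop and disable the failed Validator Node (if still reachable), so it
#    can never restart and double-sign with the old key.
run(["ssh", OLD_HOST, "sudo", "systemctl", "disable", "--now", SERVICE])

# 2. Copy the validator key (kept off-server) onto Backup Node.
run(["scp", "priv_validator_key.json", f"{NEW_HOST}:{KEY_DST}"])

# 3. Restart Backup Node so it starts signing as the validator.
run(["ssh", NEW_HOST, "sudo", "systemctl", "restart", SERVICE])
```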
Of course, this system cannot address network-related failures where Validator Node gets cut off and we are not sure whether it is still running or when it will be back online. To be fail-safe in such situations, you will probably need Horcrux. We think such an event is tolerable given our small share of voting power in most networks. Instead, we turn this scenario on its head and use the chance of potential network failures as a selection criterion for cloud infrastructure providers. More details in a later section.
When someone visits our website, they often come away with the impression that we run a vast server infrastructure. Since we offer validation, node snapshots, state-sync, RPC, REST API, and IBC relaying, we must be running a large server farm for each chain, right? No. We only run 2.x servers for each network.
Validator Node is its own high-quality dedicated server; Backup Node is typically a low-quality cloud VPS instance; and Relayer Node resides on a shared high-quality dedicated server alongside many other Relayer Nodes. Validator Node's sole purpose is to be a validator. Backup Node is the backbone of our node snapshots. Relayer Node is where we offer IBC relaying, state-sync, the RPC endpoint and the REST API endpoint. Such a setup allows us to offer all these services with only 2.x servers.
Disclaimer: We understand it is not ideal to mix IBC relayers with other services. However, we currently offer IBC relaying as a voluntary backup for well-established channels behind the major relay operators, so we feel we can get away with this poor man's version. If we have more resources in the future, we will separate them out into a 4-node system. By then, our Holy Trinity will evolve into an Unbreakable Square.
Tendermint-based chains have a storage problem. Every node eventually fills up with old states, which are obsolete unless you run an archive node. Some chains grow by a few hundred MB per day, while Osmosis currently grows by 7 GB per day even with the most aggressive pruning.
We solve the storage-bloat problem by periodically re-syncing nodes with the help of other nodes in the system.
Once a month (or more frequently if needed), we re-sync Backup Node, because we want to keep the public-facing node snapshot as small as possible. We pause Backup Node, delete all old states with the "unsafe-reset-all" command, and state-sync with the help of Relayer Node. Most of the time this succeeds and the step is done. In the rare instances when state-sync fails, we restore Backup Node from the latest node snapshot on the file server (made within the last 24 hours by Backup Node itself) so that the 3-node redundancy is back as quickly as possible. We then spin up a test server to troubleshoot the state-sync issue. Hopefully we fix the issue quickly and can then state-sync Backup Node. Once Backup Node is re-synced, we manually trigger a snapshot, which is typically under 1 GB at that point.
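As a rough sketch of the idea (the real process is a checklist plus the shell scripts we open-source), the snippet below fetches a trust height and hash from Relayer Node's RPC, prints the [statesync] values to put in config.toml, then wipes the old state and restarts the node. The binary name appd, the RPC URL, and the home directory are assumptions.

```python
#!/usr/bin/env python3
"""Rough sketch of the monthly Backup Node re-sync. The binary name (appd),
the RPC URL, and the home directory are assumptions; the real process is a
checklist plus the shell scripts we open-source."""
import json
import subprocess
import urllib.request

RPC = "https://relayer-node.example.com:26657"   # assumed Relayer Node RPC
HOME = "/home/backup/.app"                       # assumed chain home directory

def rpc(path):
    with urllib.request.urlopen(f"{RPC}{path}", timeout=10) as resp:
        return json.loads(resp.read())["result"]

# Pick a trust height a couple of thousand blocks back and grab its hash.
latest = int(rpc("/block")["block"]["header"]["height"])
trust_height = latest - 2000
trust_hash = rpc(f"/block?height={trust_height}")["block_id"]["hash"]

print("Set in config.toml under [statesync]:")
print("  enable = true")
print(f'  rpc_servers = "{RPC},{RPC}"')
print(f"  trust_height = {trust_height}")
print(f'  trust_hash = "{trust_hash}"')

# Stop the node, wipe the old state, and restart so it state-syncs from scratch.
# (On older Cosmos SDK versions the reset command is `appd unsafe-reset-all`.)
subprocess.run(["sudo", "systemctl", "stop", "appd"], check=True)
subprocess.run(["appd", "tendermint", "unsafe-reset-all", "--home", HOME], check=True)
subprocess.run(["sudo", "systemctl", "start", "appd"], check=True)
```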
Then we ask ourselves: do we need to reduce the storage bloat of Validator Node and/or Relayer Node at this time? Most likely the answer is no, as we typically have large disk space on these two nodes and we are less concerned about storage bloat there since they do not produce public-facing snapshots. Once in a while, however, the answer is yes, because no amount of storage space can satisfy the disk greed of the Tendermint God.
If we are to re-sync Validator Node, we follow our checklist: temporarily swap the validator key to Backup Node (thus promoting it to validator), delete all old states from Validator Node, either download the latest tiny snapshot or state-sync with the help of Relayer Node (this state-sync is virtually guaranteed to work, since we have just done it with Backup Node), and then swap the validator key back to Validator Node.
If we are to re-sync Relayer Node, we temporarily disable IBC relaying, download the latest tiny snapshot, and restart. This process can be a bit more relaxed than the one for Validator Node, because there is no validator key to manage and no slashing risk to worry about.
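For illustration, a restore from one of these compressed snapshots might look like the sketch below. The snapshot URL, home directory, and service name are placeholders, and our published instructions use plain shell commands rather than Python.

```python
#!/usr/bin/env python3
"""Sketch of restoring a node from a compressed snapshot. The snapshot URL,
home directory, and service name are placeholders for illustration only."""
import subprocess

SNAPSHOT_URL = "https://snapshots.example.com/chain/latest.tar.lz4"  # placeholder
HOME = "/home/relayer/.app"                                          # assumed chain home
SERVICE = "appd"                                                     # assumed service name

# Stop the node and clear out the old state before extracting the snapshot.
subprocess.run(["sudo", "systemctl", "stop", SERVICE], check=True)
subprocess.run(["appd", "tendermint", "unsafe-reset-all", "--home", HOME], check=True)

# Stream the snapshot straight into the chain home: curl | lz4 | tar.
subprocess.run(
    f"curl -L {SNAPSHOT_URL} | lz4 -dc - | tar -xf - -C {HOME}",
    shell=True, check=True,
)

subprocess.run(["sudo", "systemctl", "start", SERVICE], check=True)
```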
After all is done, all 3 nodes should have a small storage footprint. They get to live to fight the Tendermint God another day.
You cannot spell "validator" without l-a-y-e-r-e-d m-o-n-i-t-o-r-i-n-g. Okay, we totally made that up, but it does not diminish the universal truth of the statement. It is very important to set up a layered monitoring system, so that a non-critical issue triggers one alert while a critical issue triggers multiple alerts.
Our monitoring system has many different components. Each component sends data to a central monitor server via either pull (e.g., the monitor server scrapes Prometheus endpoints) or push (e.g., each node uses Promtail to push logs to the monitor server's Loki endpoint). The central monitor server uses Grafana to visualize the data, Alertmanager to trigger alerts, and Loki to receive and organize logs.
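To make the pull side concrete, here is a minimal sketch of a per-node exporter built on Tendermint's standard RPC /status endpoint and the prometheus_client Python library. The port numbers are arbitrary, and in practice the chain binary's own Prometheus metrics plus node_exporter cover most of this.

```python
#!/usr/bin/env python3
"""Minimal sketch of a per-node Prometheus exporter built on the Tendermint
RPC /status endpoint. The monitor server would scrape this alongside
node_exporter and the chain's own Prometheus metrics."""
import json
import time
import urllib.request

from prometheus_client import Gauge, start_http_server  # pip install prometheus-client

RPC_STATUS = "http://localhost:26657/status"  # local Tendermint RPC (assumed default port)

block_height = Gauge("tendermint_latest_block_height", "Latest block height reported by the node")
catching_up = Gauge("tendermint_catching_up", "1 if the node is still catching up, else 0")
node_up = Gauge("tendermint_node_up", "1 if the RPC endpoint answered, else 0")

def scrape():
    try:
        with urllib.request.urlopen(RPC_STATUS, timeout=5) as resp:
            sync = json.loads(resp.read())["result"]["sync_info"]
        block_height.set(int(sync["latest_block_height"]))
        catching_up.set(1 if sync["catching_up"] else 0)
        node_up.set(1)
    except Exception:
        node_up.set(0)

if __name__ == "__main__":
    start_http_server(9200)   # metrics exposed on :9200/metrics (arbitrary port)
    while True:
        scrape()
        time.sleep(15)
```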
Overall, our monitoring has 6 layers, which we refer to below as Components 1 through 6.
Components 1, 2 and 3 all have email or PagerDuty alerts attached. When a non-critical issue happens, one alert fires. For example, if Backup Node crashes while its underlying server is still functional, the alert attached to Component 2 fires. When a critical issue happens, all three fire. For example, if Validator Node gets cut off from the network, the alerts attached to all 3 components fire.
Components 4, 5, and 6 do not have alerts attached. They are monitored on an ad-hoc basis.
While we are proud of our monitoring system, there are many areas for improvement. One major hole is the lack of monitoring of our IBC relaying. Right now, we just watch our IBC relayers' bot wallets as a proxy: when the balances go down, we know they are working. Please let us know if you have better ways to monitor IBC relayers.
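For what it is worth, the proxy check itself is simple. A sketch along these lines, with a placeholder wallet address, fee denom, and REST endpoint, queries the standard bank balances endpoint and flags a wallet that needs a refill; a falling balance between runs is the (crude) signal that the relayer is actually paying fees.

```python
#!/usr/bin/env python3
"""Sketch of the 'watch the relayer wallet' proxy check. Address, denom, and
endpoint are placeholders. A falling balance means the relayer is paying fees
(i.e., relaying); a flat balance for too long, or a balance below the refill
threshold, is worth an alert."""
import json
import urllib.request

REST = "https://rest.example.com"                       # assumed REST (LCD) endpoint
ADDRESS = "cosmos1exampleexampleexampleexampleexample"  # placeholder relayer wallet
DENOM = "uatom"                                         # placeholder fee denom
LOW_WATERMARK = 1_000_000                               # refill threshold in base units

def get_balance():
    url = f"{REST}/cosmos/bank/v1beta1/balances/{ADDRESS}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        balances = json.loads(resp.read())["balances"]
    return next((int(b["amount"]) for b in balances if b["denom"] == DENOM), 0)

if __name__ == "__main__":
    balance = get_balance()
    if balance < LOW_WATERMARK:
        print(f"ALERT: relayer wallet low: {balance}{DENOM}")
    else:
        print(f"OK: relayer wallet balance {balance}{DENOM}")
    # Comparing against the previous run (e.g., a value stored on disk) would
    # also catch the "balance has not moved, relayer may be stuck" case.
```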
Many think that a validator's performance is a function of its infrastructure spending. We like to think about this question from a different angle. Because we have a validation system, it is much more productive to analyze what each node should be optimized for and then plan infrastructure spending accordingly. The result is that we are able to be a high-performing validator (never slashed for downtime or double-signing) on a small budget (under $60/month for a small network and $100-150/month for a large network).
Validator Node: It is a given that Validator Node requires a high-quality dedicated server. We typically follow the chain team's hardware guidelines and use TenderDuty to confirm that our server's specs are sufficient. However, we see no need to go overboard on hardware specs. In our view, the most important element to optimize for is the cloud provider's network reliability. As discussed above, because of our validation system we can easily tolerate faults related to Validator Node's server. On the other hand, we can only wait and pray during a network outage, when we do not know whether the node is alive or when the outage will end. For the same reason, it is never a good idea to host Validator Node at home.
Backup Node: In our validation system, Backup Node only needs to be alive on an as-needed basis. To push the logic to the extreme, in peacetime it only needs to be alive for a few minutes a day for the snapshot to be taken. Therefore, you can get away with an extremely cheap shared cloud VPS. The most important element to optimize for is its data center location: it has to sit in a different data center from Validator Node. When Validator Node's data center announces a scheduled maintenance, you can switch the key to Backup Node and simply shut down Validator Node during the maintenance. It is passable to host this node at home, although we do not recommend it.
Relayer Node: For this server, the most important element to optimize for is room for future growth. We typically put 5-10 Tendermint-based nodes on the same server, and sometimes we stretch one server to the extreme before splitting all the nodes across two servers. We go for a dedicated server (so no CPU stolen by noisy neighbors) with the best performance-to-price ratio. While it is hard to be specific, here is a rule of thumb: this server should feel like a total waste if you only plan to run 1 node on it, but it should not break the bank.
We follow the time-tested method of PDD: Pain-Driven Development. Because we run 3 nodes for each network (3 x 20 nodes, since we operate on close to 20 networks), there is a lot of painfully repetitive work. On top of that, as described above, we periodically re-sync all these nodes to manage Tendermint's storage bloat. It gets overwhelmingly painful over time.
Whenever we find ourselves repeating the same task 3 times, we try our best to automate it, and we open-source these scripts.
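As a toy example of the spirit (not our actual scripts), the snippet below runs the same health check across a hypothetical inventory of chains instead of SSH-ing into each server by hand. The hosts and service names are placeholders.

```python
#!/usr/bin/env python3
"""Toy example of the 'automate after the third repetition' rule: run the same
health check across every chain we operate. Hosts and service names below are
placeholders, not our real inventory."""
import subprocess

# Hypothetical inventory: chain -> (host, systemd service)
CHAINS = {
    "osmosis": ("relayer-1.example.com", "osmosisd"),
    "juno": ("relayer-1.example.com", "junod"),
    "cosmoshub": ("relayer-2.example.com", "gaiad"),
}

for chain, (host, service) in CHAINS.items():
    # `systemctl is-active` prints active/inactive/failed for the unit.
    result = subprocess.run(
        ["ssh", host, "systemctl", "is-active", service],
        capture_output=True, text=True,
    )
    status = result.stdout.strip() or "unreachable"
    print(f"{chain:10s} {service:10s} {status}")
```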
Again, if you think of yourself as running a "validation node", the above automation can easily be disregarded as out of scope. However, if you think of it as running a "validation system", such automation becomes critical to keeping the system running without pushing the team to the edge. Moreover, these automations make deployments error-proof and the execution environment more predictable. This can be as critical as ensuring that every server has a watertight security setup, or as trivial as always having a custom shortcut "system" to jump to the "systemd" folder because we go there so often.
As mentioned above, this post is very opinionated because it is based on how we validate on Tendermint-based chains. For many established validators reading this post, many of our processes will look unusual or simply different. In extreme cases, they might be regarded as "wrong" (for example, we do not run sentry nodes).
That said, we hope that your key takeaway is our mental model to think of the validation service as a system. In this system, you have several elements working together to ensure high performance, low downtime and no errors. In this system, you get to optimize cost without sacrificing performance. In this system, you have layered monitoring to always stay on top without the need to always stay awake. In this system, you automate, refactor and free up time to expand operations to new chains. This way, you grow your operations to the whole Interchain Galaxy and get to face the Tendermint God as the final boss.
P.S.: This article is dedicated to many fellow validators who have taught us all. Any errors are our own though. Many thanks to Stakepile, windpowerstake, Binary Holdings, and NosNode for supporting us on Osmosis. Their generous support has reduced our daily grind to stay alive on a major network, so we have time to think, write and at times shitpost.