Payout Guard Bug Post-Mortem

Polkachu Team | 2022-04-15

Yesterday, our Payout Guard product experienced a disruption for about 12 hours. It has been restored. We have also put additional safeguards in place for the future.

Issue: Payout Guard was down for about 12 hours. This affected automatic payouts for 2 Kusama eras. We were not aware of the issue until a fellow validator reached out via Element chat. The same bug would have affected Polkadot and Polkadex too, except that the outage was too short to disrupt any payouts on those chains.

First-Order Cause: We use PHP as the backend and frontend for the Polkachu web app, and we use a small NodeJS app (just open-sourced today) that integrates with the PolkadotJS API library. All the complicated business logic is housed inside the PHP code base. When the PHP code needs to interact with a Substrate blockchain, it sends an API request to the NodeJS app and then consumes the response.

Payout Guard has two steps to interact with the API. The PHP code first sends a list of validator stashes to the NodeJS app's "unclaimed_eras" endpoint to get their unclaimed eras. Then it sends a batch of stashes along with the unclaimed eras to the NodeJS app's "payout" endpoint to process the payout. Each batch has up to 6 payouts, so it might take multiple batches each era to complete the task.
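The batching step described above can be sketched as follows. This is only an illustration, not the code from our app: the `PayoutTask` shape and the function names `toPayoutTasks` and `toBatches` are hypothetical, and the sketch assumes that each (stash, era) pair counts as one payout toward the limit of 6.

```typescript
// Hypothetical shape of one payout: a validator stash and one unclaimed era.
interface PayoutTask {
  stash: string;
  era: number;
}

// Maximum number of payouts per batch, as described in the post.
const MAX_PAYOUTS_PER_BATCH = 6;

// Flatten a "stash -> unclaimed eras" map into individual payout tasks.
function toPayoutTasks(unclaimed: Record<string, number[]>): PayoutTask[] {
  const tasks: PayoutTask[] = [];
  for (const stash of Object.keys(unclaimed)) {
    for (const era of unclaimed[stash] ?? []) {
      tasks.push({ stash, era });
    }
  }
  return tasks;
}

// Split items into consecutive batches of at most `size` elements.
function toBatches<T>(items: T[], size: number = MAX_PAYOUTS_PER_BATCH): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new RangeError("batch size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

With 8 pending payouts, for example, this would produce one batch of 6 followed by one batch of 2, each of which would then go to the "payout" endpoint as a separate request.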

The bug was in the "unclaimed_eras" endpoint. Somehow the NodeJS app treated the PHP request payload as an object rather than an array, and the code just skipped the whole object rather than iterating through an array that it was expecting.
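A defensive way to handle this kind of mismatch is to normalize the payload before iterating, so that both shapes are accepted. The sketch below is an assumption about how that could look, not the code in our NodeJS app; `normalizeStashes` is a hypothetical helper name. As general background, and not something we have confirmed happened here, PHP's `json_encode` emits a JSON object instead of an array when the array's keys are not sequential from 0 (for example after `array_filter`).

```typescript
// Accept either a JSON array of stashes or a JSON object whose values are
// stashes, and always return a plain array of strings.
function normalizeStashes(payload: unknown): string[] {
  let values: unknown[];
  if (Array.isArray(payload)) {
    values = payload;
  } else if (payload !== null && typeof payload === "object") {
    // Object payload such as {"0": "stashA", "2": "stashB"}: keep the values.
    const obj = payload as Record<string, unknown>;
    values = Object.keys(obj).map((key) => obj[key]);
  } else {
    throw new TypeError("expected an array or object of stashes");
  }

  // Fail loudly on unexpected entries instead of silently skipping them.
  const stashes: string[] = [];
  for (const value of values) {
    if (typeof value !== "string") {
      throw new TypeError("every stash must be a string");
    }
    stashes.push(value);
  }
  return stashes;
}
```

The design choice here is to throw on anything unexpected: an endpoint that silently skips a payload it does not recognize looks healthy while doing nothing, which is how this kind of outage goes unnoticed.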

Root Cause: Unfortunately, we do not know. The NodeJS code had handled the PHP payload as an array just fine since inception. Yesterday, we did not release any new code, nor did we upgrade the server OS, nor did we change any library dependency. We reviewed our PHP code and it is definitely sending an array as the payload, and the NodeJS app had treated it as an array until yesterday.

Solution: We updated the NodeJS code to handle an object rather than an array. This is an unsatisfying solution. We feel that the product might break again in the future since we still do not know the root cause. We have just open-sourced our NodeJS app and hope the community can spot the real issue. After all, we are not strong in JavaScript.

Other Safeguards: We have added an email alert so that we become aware of such issues sooner in the future. The threshold is the duration of 1.5 eras (9 hours for Kusama and 36 hours for Polkadot). Separately, we have also subscribed to the Polkadot JS API library's release updates on GitHub, so we can upgrade to its latest version on time.
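The alert rule can be sketched as a simple staleness check. This is a minimal illustration under stated assumptions rather than our actual alerting code: the names `ERA_HOURS`, `alertThresholdMs`, and `shouldAlert` are hypothetical, and the era lengths (6 hours for Kusama, 24 hours for Polkadot) are assumed values that are not stated in this post.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Assumed era lengths in hours (not stated in the post).
const ERA_HOURS: Record<string, number> = {
  kusama: 6,
  polkadot: 24,
};

// Alert once no payout has succeeded for this many eras.
const ALERT_ERA_MULTIPLE = 1.5;

// Threshold in milliseconds for a given chain.
function alertThresholdMs(chain: string): number {
  const eraHours = ERA_HOURS[chain];
  if (eraHours === undefined) {
    throw new Error("unknown chain: " + chain);
  }
  return eraHours * ALERT_ERA_MULTIPLE * HOUR_MS;
}

// True when the last successful payout is older than the threshold.
function shouldAlert(chain: string, lastPayoutMs: number, nowMs: number): boolean {
  return nowMs - lastPayoutMs > alertThresholdMs(chain);
}
```

A scheduled job could call `shouldAlert` with the timestamp of the last successful payout for each chain and send the email when it returns true.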

Some Learnings: Typing is hard. Polkachu is a dinosaur in the world of coding. In his youth he took a wrong turn and learned PHP and JavaScript (both loosely typed languages). What makes it more complex is the type conversion between different code bases. There is still so much to learn...

Follow our official account and intern account on Twitter