HF27 and the scheduling problem.


by @andablackwidow · 6 min read

As you already know, we are having HF27 tomorrow. It contains only two fixes. The first addresses an old, rare bug in the snapshot mechanism that obviously had to happen eventually. The second fixes a problem with the backup witness schedule, described e.g. here, an unforeseen side effect of changes related to the new HF26 OBI mechanism.


Snapshot problem

Just a quick word about the first one. When a snapshot is generated, it basically stores the state on disk, so it can be restored later, making setting up nodes much faster than a full replay. The bug was caused by the fact that next_id - an internal counter of objects of a particular type - was not stored, so when the snapshot was loaded, it was set to the id of the last restored object plus one. Object ids are an internal thing, and while they are returned by some APIs, database_api in particular, they should not matter... except that proposals use that internal id as the proposal id, which is then used in transactions to target the proposal, e.g. during an update. So when a snapshot was prepared after the last proposal ever created had already been removed, the value of next_id in the original state was different (bigger) than its value after the snapshot was loaded. That was the setup necessary for disaster. As a result, a newly created proposal had a different identifier on the original node and on a node recreated from the snapshot, which meant that a transaction containing a proposal update that was correct on one node was guaranteed to fail on the other. And that's what happened, killing some API nodes set up with the use of a snapshot.
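A minimal sketch of the failure mode (the structures and names below are illustrative, not the actual chain code): when next_id is reconstructed as "highest surviving id plus one" instead of being stored, the restored node diverges whenever the most recently created object had already been removed.

```cpp
#include <cstdint>
#include <iostream>
#include <map>

struct proposal { uint64_t id; };

// Toy stand-in for an object index; the real chain keeps a per-type next_id.
struct proposal_index
{
   std::map< uint64_t, proposal > objects;
   uint64_t next_id = 0;

   uint64_t create() { uint64_t id = next_id++; objects.emplace( id, proposal{ id } ); return id; }
   void remove( uint64_t id ) { objects.erase( id ); }
};

int main()
{
   proposal_index original;
   original.create();                    // proposal 0
   original.remove( original.create() ); // proposal 1 created and later removed

   // The snapshot stored only the surviving objects, not next_id (the bug),
   // so on load next_id was reconstructed as "last restored id + 1".
   proposal_index restored;
   restored.objects = original.objects;
   restored.next_id = restored.objects.rbegin()->first + 1; // 1, while the original still has 2

   // The next proposal gets id 2 on the original node but id 1 on the restored one,
   // so an update transaction targeting it by id succeeds on one node and fails on the other.
   std::cout << "original: " << original.create() << ", restored: " << restored.create() << "\n";
}
```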


The scheduling problem is more interesting, because the scheduling itself is more complicated. I'll try to express it in "human language" :o)

https://images.ecency.com/DQmemDHXQLydYppbc8sU272MGvBnqJpuGMw11d3jMqGxLxm/run.jpg

The schedule

Imagine a bunch of runners (witnesses) competing to see who can reach the top of the hill first. There are also crowds of people backing those runners (voters). The reward is a place in history for all eternity (a chance to produce a block as part of the immutable blockchain). Every 21 blocks the winners are elected by an impartial judge (the scheduling algorithm) in the following way:

  • the first 20 winners (witness_schedule_object.max_voted_witnesses) are chosen based on their backing (total power of votes).

As the winners are selected, they are teleported back to the starting line, so they are not chosen again in further selection steps.

  • the next 0 winners (witness_schedule_object.max_miner_witnesses) are chosen for their effort
  • finally, the remaining number needed to fill the quota of 21 (HIVE_MAX_WITNESSES), that is 1 (witness_schedule_object.max_runner_witnesses), is selected based on an extrapolation of their current position and pace.

Since blocks need to be produced, the scheduling can't wait until witnesses actually reach the finish line.
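For reference, here is a simplified sketch of that selection, under the assumption that the "pace extrapolation" is represented by a precomputed virtual_scheduled_time per witness (names and types are illustrative, not the actual scheduler code):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative witness record (not the actual chain object).
struct witness
{
   std::string name;
   uint64_t    votes = 0;                  // total power of backing votes
   uint64_t    virtual_scheduled_time = 0; // extrapolated moment of reaching the top
};

// One scheduling round: top 20 by votes, 0 miner slots, and 1 "runner" slot
// filled by the earliest extrapolated finish among everyone else.
std::vector< std::string > schedule_round( std::vector< witness > all )
{
   const size_t max_voted = 20, max_total = 21;
   std::vector< std::string > schedule;

   // The 20 winners chosen for their backing.
   std::sort( all.begin(), all.end(),
      []( const witness& a, const witness& b ){ return a.votes > b.votes; } );
   for( size_t i = 0; i < max_voted && i < all.size(); ++i )
      schedule.push_back( all[ i ].name );

   // The remaining slot(s): whoever is projected to reach the top first.
   if( all.size() > max_voted )
   {
      std::sort( all.begin() + max_voted, all.end(),
         []( const witness& a, const witness& b )
         { return a.virtual_scheduled_time < b.virtual_scheduled_time; } );
      for( size_t i = max_voted; i < max_total && i < all.size(); ++i )
         schedule.push_back( all[ i ].name );
   }
   return schedule;
}
```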

So far so good. The bug was not in the scheduling itself but in the handling of special events related to voting.

Most of the time people in the audience just sit and wave their flags. However, once in a while a person will stand up and either start cheering, which increases the pace of the runner they are backing (increased vote), or start booing, sometimes throwing their flag away completely, which decreases the pace of the runner (decreased vote). When such an event happens, the judge has to recalculate their estimate of when the race will be completed: the old power of the backing is applied from the time of the last estimate up to the current time, and the new value of that power is used for the rest of the run. The problem arose not from what changed, but from what did not change but should have.
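In code, the judge's recalculation looks roughly like the sketch below (a hedged, simplified version of the virtual-race bookkeeping; field names, types and the lap length are illustrative, and the real code uses 128-bit arithmetic): the distance covered so far is settled at the old pace, then the finish estimate is redone at the new pace.

```cpp
#include <cstdint>

// Illustrative per-witness race state (not the actual chain object).
struct witness_race_state
{
   uint64_t votes = 0;                  // current power of the backing
   uint64_t virtual_position = 0;       // distance covered in the virtual race
   uint64_t virtual_last_update = 0;    // virtual time of the last recalculation
   uint64_t virtual_scheduled_time = 0; // estimated moment of reaching the top
};

constexpr uint64_t LAP_LENGTH = 1'000'000'000; // "height of the hill", arbitrary here

void on_vote_change( witness_race_state& w, uint64_t now_virtual_time, int64_t vote_delta )
{
   // 1. Settle the distance covered since the last update, at the OLD pace.
   w.virtual_position += w.votes * ( now_virtual_time - w.virtual_last_update );
   w.virtual_last_update = now_virtual_time;

   // 2. Apply the vote change (cheering or booing).
   w.votes = uint64_t( int64_t( w.votes ) + vote_delta );

   // 3. Re-estimate when the runner reaches the top, at the NEW pace.
   w.virtual_scheduled_time = now_virtual_time
      + ( LAP_LENGTH - w.virtual_position ) / ( w.votes + 1 );
}
```

The virtual time passed in as now_virtual_time has to come from the schedule object the scheduler actually works with - and, as described below, that is exactly what went wrong after HF26.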

The change

One of the main features of HF26 was OBI (One-Block Irreversibility). It works by allowing witnesses to exchange confirmations of block validity outside of the blockchain, without waiting for the actual vote in the form of a signed block. But the confirmations can't be sent by just any witness - they have to be signed by the witnesses that are going to produce blocks in the near future. For that reason we have to know in advance who they are. With just the current witness schedule, depending on where we are within it, we might know only a couple of witnesses, not enough for full irreversibility. That's why the new future witness schedule was introduced, so we always know enough witnesses in advance. It looked like that new object needed to be used only for OBI and as a target for the scheduling mechanism, but the bug showed that such an assumption was wrong. As a result, when a change in vote happened, witnesses were recalculated using wrong data (from the previous schedule).

The bug

Imagine you are a runner and that every time one of your backers stands up, even if it is to cheer you on, they also throw a bottle of water at you. Of course that would slow you down, and in some cases it might even make you roll downhill. Moreover, the more individual active backers you have, the more frequently they stand up, and therefore the more bottles are thrown your way. Even worse, you are pushed back proportionally to the power of your support. In other words, you are hit by the bug more frequently if you have more active voters, and the bug hits you harder if your overall votes are big. There is one more aspect to it. If at some point, despite the bug, you are a winner - which puts you back at the starting line - and then you are hit by the bug, you might be pushed back below the starting line (underflow), which teleports you near the top of the hill, which in turn makes you likely to be a winner again.
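The "below the starting line" part is plain unsigned underflow: a correction larger than the current position wraps around to a huge value on subtraction. A trivial sketch:

```cpp
#include <cstdint>
#include <iostream>

int main()
{
   // A freshly scheduled winner: position reset to the starting line.
   uint64_t virtual_position = 0;

   // A buggy correction (computed from mismatched schedule data) that is
   // larger than the current position wraps around on subtraction...
   uint64_t bad_correction = 1000;
   virtual_position -= bad_correction;

   // ...teleporting the runner almost to the top of the hill.
   std::cout << virtual_position << "\n"; // 18446744073709550616
}
```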

"But is it really that common? I mean, the bug only happens when voters change their vote."
It might look weird at first, but it is frequent due to claim_reward_balance_operation. Be it curation, authoring or beneficiary rewards, the reward always contains a part that is expressed in VESTs, and claiming it influences voting power. Each claim changes the power behind all of the claimer's witness votes, so every witness they vote for gets their estimate recalculated.


Positives

There are two positives of the bug. First, the backup witnesses, which are normally very rarely selected, are now battle tested. They have proof that their nodes are not just mock-ups and are up to the task. Second, it showed the importance of testing on the mirror-net. It is a new tool, so not everyone knew about it or was convinced enough to join, but if enough real people had taken part in the testing, the bug would have been trivial to find and fix. All it took was to spot that it was there. We can't expect @gtg to do that when he is substituting for several witnesses all at once :o)
