Future optimizations (speculation)


andablackwidow · 2 years ago · 19 min read

Some of the following changes are very likely possibilities, others might be in the realm of science fiction, with a strong lean towards the latter. I'll start with the least interesting but more likely ones, drifting towards daydreaming at the end :o)


Execution time.

In the last article I described some of the optimizations that were made for the coming HF26. I also said there is still room for improvement.

Undo sessions.
While running a flood test with 1MB blocks I noticed some peculiar values of execution time. See for yourself (irrelevant parts of block stats cut out): "exec":{"offset":-399927,"pre":21,"work":200490,"post":33088,"all":233599}. Just a reminder that in case of the node that produced a block, exec.work contains the time of phase 0, while exec.post contains phases 1 and 2. In this particular case there were no pending transactions left (other than those that were put inside the block), so only phase 1, but still: phase 0 applies transactions and packs them into the block, phase 1 reapplies the freshly produced block, that is, applies the same transactions and on top of that does the automatic state processing. How come exec.post is 6 times shorter than exec.work? I did some more logging and it turned out one key element is responsible for all the difference. The block producer has to apply transactions one by one, opening individual undo sessions and squashing them when a transaction is ok. When a block is applied, it works under just one undo session - for the entire block.

The flood test works with just custom_jsons, so there are no changes to the state - the undo sessions were empty. Custom_jsons are also very fast to execute (well, unless they are delegate_rc_operation - not the case in this test), so the influence of the preparation/cleanup code necessary to execute each and every transaction is brought out. When sessions are empty, opening/squashing them consists of only running over all the registered state indexes and adding/removing individual index session objects onto/from a stack. My gut feeling tells me the latter is the problem, although I've not confirmed it yet. If that is the case, then we could avoid constantly talking to the allocator by introducing new squash_and_reopen and undo_and_reopen functionality - we'd just open one session and keep it open, reusing it after squashing and/or undo. If that helps, then there is also reapplication of pending transactions, which also uses individual sessions for each transaction. Finally - a lot trickier to change, but it should be doable - transactions coming from API/P2P also use separate sessions.
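To illustrate the idea, here is a hedged sketch (not hived's actual session API; all names are invented): the point is to keep one set of per-index session objects alive and reuse them, instead of allocating and destroying them for every transaction.

```cpp
#include <vector>

// Sketch only: stand-in for the per-index undo bookkeeping a real session holds.
struct index_session
{
  void squash() { /* merge recorded changes into the parent session */ }
  void undo()   { /* revert recorded changes */ }
  void reset()  { /* clear bookkeeping so the object can be reused */ }
};

struct reusable_session
{
  // one entry per registered state index; allocated once, not per transaction
  std::vector< index_session > per_index;

  // after a successful transaction: keep the changes but make the same session
  // objects ready for the next transaction, without touching the allocator
  void squash_and_reopen()
  {
    for( auto& s : per_index ) { s.squash(); s.reset(); }
  }

  // after a failed transaction: roll back, then reuse the same objects
  void undo_and_reopen()
  {
    for( auto& s : per_index ) { s.undo(); s.reset(); }
  }
};
```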

Accessing singletons.
Some state objects exist in only one copy, but they are constantly accessed. One is particularly popular - dynamic_global_property_object. I wonder if asking the index for it every time is not a bit wasteful. Most likely the effect won't even be noticeable, but who knows; I always wanted to try to just pull it once at the start and then use it through a normal pointer. We've recently discovered that a new boost version had some nasty things to say about such practice, since it started to erase the object from the index if an exception is thrown during its modification. That would mean we could not rely on a pointer to a state object staying the same all the time. We've worked around that problem, so testing the idea should be safe again.
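A minimal sketch of what such caching could look like (the struct layout and the lookup helper below are placeholders, not hived's real code):

```cpp
#include <cstdint>

// placeholder for the real state object
struct dynamic_global_property_object
{
  uint32_t head_block_number = 0;
  // ... remaining fields
};

// stand-in for the index lookup that currently happens on every access
const dynamic_global_property_object& find_dgpo_in_index()
{
  static dynamic_global_property_object instance; // placeholder storage
  return instance;
}

class database
{
public:
  void open_state()
  {
    // one lookup at startup; this relies on the object's address staying stable,
    // which is why the boost "erase on failed modify" behaviour had to be worked around first
    _dgpo = &find_dgpo_in_index();
  }

  const dynamic_global_property_object& get_dynamic_global_properties() const
  {
    return *_dgpo; // hot path: no index walk
  }

private:
  const dynamic_global_property_object* _dgpo = nullptr;
};
```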

P2P.
I've mentioned before that there is a certain instability in the time P2P consumes to process block messages. Blocks take priority when sending fetch requests, but is that also true for answering such requests? I don't know. There are probably just two people who know enough about the code to analyze and potentially address the issue, but they are engaged in more than just Hive, so changes here should be put in the realm of fiction.


RAM consumption.

HF24 was the one that brought memory consumption down from ridiculous to the meh-but-ok levels Hive nodes need now. There are more changes of the same type that can still be done, but at this point all of them together might free up maybe 1GB of memory - not that impressive considering the amount of work required. It would look a lot better though if we did such a change after first cutting memory consumption down by a factor of 2 or more. Sounds interesting?

Drop inactive data to DB.
Currently the state index that takes the most memory (over 9GB), despite all the trimming, is the one related to comment_object. Then there are all the indexes related to accounts (2.6GB). But how many comments are actively replied to or voted on? How many accounts are involved in transactions at any given time? Maybe it is not that necessary to keep all that data in memory "just in case"? The idea reuses the concept of MIRA (Multi-Index RocksDB Adapter), but in a more selective way. MIRA communicated all the changes to RocksDB immediately, so while the node did not consume a lot of memory, it was very slow (one of the reasons it was dropped from the code). But we could do it in a bit more clever way: keep the users whose data we need in memory, and only write to the database when we are ready to drop the data from RAM, after enough time has passed for the last changes to become irreversible (now faster with OBI! :o) ). For comments we'd need to change the method of hashing, otherwise we would not gain anything (right now the hash covers both author and permlink, so the database index can't selectively load only for active users).

The only downside of this approach is slightly slower execution of transactions that force the node to pull data from the database into memory when an inactive user becomes active.
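A hedged sketch of the concept (the class, its names and the trivial on-disk placeholders are invented; a real implementation would talk to RocksDB or similar and would also track which objects are still actively used):

```cpp
#include <cstdint>
#include <unordered_map>

template< typename Object >
class tiered_index
{
public:
  // hot path: objects of active users stay in RAM; an inactive user becoming
  // active triggers a (slightly slower) load from the on-disk store
  Object& get( uint32_t id )
  {
    auto it = _hot.find( id );
    if( it == _hot.end() )
      it = _hot.emplace( id, load_from_disk( id ) ).first;
    return it->second;
  }

  // called when last-irreversible advances (now faster with OBI): anything whose
  // last change is already irreversible can be written out and dropped from RAM
  void evict_irreversible( uint32_t last_irreversible_block )
  {
    for( auto it = _hot.begin(); it != _hot.end(); )
    {
      if( it->second.last_modified_block <= last_irreversible_block ) // assumes Object tracks this
      {
        store_to_disk( it->first, it->second );
        it = _hot.erase( it );
      }
      else
        ++it;
    }
  }

private:
  Object load_from_disk( uint32_t ) { return Object{}; } // placeholder for a DB point lookup
  void   store_to_disk( uint32_t, const Object& ) {}     // placeholder for a DB write
  std::unordered_map< uint32_t, Object > _hot;
};
```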

Multi-index.
Let's get more radical. All the state objects reside in boost multi-indexes. While the structure provides various types of indexes, hived only uses one: the red-black tree. Each node of such a tree consists of three pointers and a flag, taking 32 bytes for each index type (times the number of indexed objects). For example, comment_object itself takes 32 bytes, but it has 2 indexes, which add 32 bytes each, for 96 bytes per comment. If you are surprised that a comment only takes that little, it is because the content is in HAF/Hivemind, not in node memory, and the data needed for calculating rewards is in a separate temporary comment_cashout_object. What's left is only the data that we need permanently to be able to tell that the comment exists and how it is related to other comments. Ok, so 32 bytes for an index node does not look like that much, but it can be brought down to 16 bytes (12 if we are extra aggressive about it, under the assumption that we'd never have more than 2 billion objects of the same type in memory; note that we already assume no more than 4 billion, since object ids are 32-bit). Since all state objects are stored in multi-index structures, saving on the overhead of such a structure should add up to serious numbers.
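To make the arithmetic concrete, here is a toy comparison of the per-node overhead (the layouts only approximate what boost's red-black tree node looks like):

```cpp
#include <cstdint>
#include <cstdio>

struct rb_node_now        // roughly what each ordered index adds per object today
{
  void*   parent;
  void*   left;
  void*   right;
  uint8_t color;          // red/black flag
};                        // 3*8 + 1, padded to 32 bytes on a 64-bit build

struct rb_node_compact    // the same node with 4-byte pool indexes instead of pointers
{
  uint32_t parent;
  uint32_t left;
  uint32_t right;
  uint8_t  color;
};                        // 3*4 + 1, padded to 16 bytes (12 if the flag is packed into an index)

int main()
{
  std::printf( "now: %zu bytes, compact: %zu bytes\n",
               sizeof( rb_node_now ), sizeof( rb_node_compact ) );
}
```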

To turn 8-byte pointers into 4-byte indexes, we'd need to use a pool allocator (I know this concept as a "block allocator" by the way). Let's say we are going to use blocks of 4k objects. Then the lower 12 bits of an object index are a chunk index within the block and the upper bits represent the index of the block. We can use that trick not just for index tree nodes, but for state objects as well. It could even let us drop the now mandatory by_id indexes (since the allocator will be enough to find an object by its block+chunk id), although that won't be applicable to the most important types of objects, since, as proposed above, we want to remove them from memory, therefore an object's index and its allocation index might diverge.
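A minimal sketch of the handle encoding (illustrative only; freeing and slot reuse are omitted, and a real pool allocator would not use a vector of vectors):

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t CHUNK_BITS        = 12;                // 4k objects per block
constexpr uint32_t OBJECTS_PER_BLOCK = 1u << CHUNK_BITS;
constexpr uint32_t CHUNK_MASK        = OBJECTS_PER_BLOCK - 1;

template< typename Object >
class pool_allocator
{
public:
  // allocate one slot and return its 4-byte handle: upper bits = block, lower 12 bits = chunk
  uint32_t allocate()
  {
    if( _next_chunk == 0 )
      _blocks.emplace_back( OBJECTS_PER_BLOCK );          // grow by one block of 4k objects
    const uint32_t handle =
      ( uint32_t( _blocks.size() - 1 ) << CHUNK_BITS ) | _next_chunk;
    _next_chunk = ( _next_chunk + 1 ) & CHUNK_MASK;
    return handle;
  }

  // translate a handle back into an object reference - this is what lets tree
  // nodes store 4-byte handles instead of 8-byte pointers
  Object& dereference( uint32_t handle )
  {
    const uint32_t block = handle >> CHUNK_BITS;
    const uint32_t chunk = handle & CHUNK_MASK;
    return _blocks[ block ][ chunk ];
  }

private:
  std::vector< std::vector< Object > > _blocks;
  uint32_t _next_chunk = 0;
};
```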

Now there is one more mystery. When we add up all the memory officially taken by state objects, we can account for maybe 2/3 at best. Where is the rest? It is an issue that needs investigating, but in case it is related to the overhead of the boost interprocess allocator (the one that is used to store all the objects in the shared_memory_file) or to memory fragmentation, use of pool allocators might mitigate most of it (that would be the best outcome, because we'd gain a lot more from optimizations that we want to do anyway).


Storage consumption.

It might not look that important at first, since it is just relatively cheap disk storage. However, nodes keep eating more and more, even more now with HAF in place, which calls for exceptionally fast NVMe storage. Some steps to address the problem were already taken for HF26 in the form of a compressed block log, but in the long run that just buys some time. HAF is in a way still in its infancy and there are signs that its storage can be radically optimized (although don't ask me for any details, as I only know the general concept of HAF and never worked on its code). Hivemind is ridiculously bad, and it starts with its APIs - the best approach is to just phase it out in favor of something new.

Node operators, especially witnesses, usually don't run just one node. Each node consumes at least the space required by the block_log (350GB or something close after compression). Even if we managed to aggressively reduce memory consumption, the space requirements would still remain off-putting for regular users who wouldn't otherwise mind running their own node.

Any change in this field requires preparation in the form of isolating the block log code from the rest of the hived code. Once it is put behind a well-defined interface, we can start testing and implementing various ideas catering to different node operators. It would allow selecting the features you need, just like it can be done with plugins.
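As a rough idea of what such an interface might look like (everything below is invented for illustration, it is not an existing hived API), with the concrete backends matching the options listed next:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

using block_bytes = std::vector< char >;   // serialized (possibly compressed) block

struct block_log_backend
{
  virtual ~block_log_backend() = default;

  virtual void append( uint32_t block_num, const block_bytes& block ) = 0;

  // may return nothing for a pruned block; the caller can then fall back to a peer
  virtual std::optional< block_bytes > read( uint32_t block_num ) const = 0;

  virtual uint32_t head_block_num() const = 0;
};

// concrete backends a node operator could select, matching the options below:
//   pruned_block_log  - keeps only the N most recent blocks, read() falls back to a peer
//   archive_block_log - moves old chunks of blocks to slower/cheaper storage
//   haf_block_log     - serves reads from the HAF database, no local block_log file
//   shared_block_log  - many local nodes share one file through a write service
```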

1. Pruning with fallback.

The node would keep a selected range of the most recent blocks either in memory or on local disk (depending on the settings), frequently pruning them as they become old. If some peer asks for a block that was pruned, the node would fall back to redirecting the call to some known unpruned node. The fallback could probably also be implemented in a load balancer.

Pruned nodes would be most suitable for people who haven't operated any nodes so far, or for those who operate many nodes in a single location. The downside is that if you ever need to replay, you'd need to acquire a block log from somewhere or sync from scratch, which is going to take a lot longer than just a pure replay.

By the way, the idea of pruning with fallback can also be partially applied to Hivemind or HAF.

2. Archiving.

Similar to the above, however instead of removing pruned blocks, the node would occasionally store a whole chunk of blocks on slower/cheaper storage. The block log would be split into many smaller files, easier to store and transfer. If fallback was also added, the archiving could target even sequential tape storage, which would make the node ready for the next hundred years of Hive's existence while keeping the benefit of having access to your own block_log in case of replay.

3. Reusing HAF.

At the moment HAF probably does not contain all the data necessary to implement the block API - it wasn't designed for that after all - but it shouldn't be too hard to supplement. The block data is pushed to HAF anyway, so if you are operating a HAF node, there should be no need to keep a separate block_log.

4. Sharing.

When you have many nodes in a single location (witness and API nodes perhaps?), and you want all of them to be able to provide any block, they could use the same file(s) on a shared network location. The problem is that all those nodes would also want to write to that shared block_log. That's why they wouldn't be writing directly but through a service. Only the first node would actually write; for all other nodes the service would only verify that what they wanted to write is the same (and frankly, if they disagree on old irreversible blocks, they are running different chains already).

There can probably be more ideas on how to provide the information stored in the block_log, but the above should suffice for now :o)


Volatile sidechain.

Still about storage consumption, but the idea is radical enough to deserve its own paragraph (of all of them, it sits deepest in fiction territory).

The great benefit of having a blockchain is that whatever you do will stay available forever. The great downside of having a blockchain is that whatever you do will keep consuming storage forever. And there are things you wouldn't actually mind losing, as they don't bring value beyond a limited timeframe and at the same time are not necessary to restore state during replay.

Let's start with something easy. You are playing in The Speed-chess Battle Tournament. You pay an entry fee, then you play against other players, and finally the rewards are given out. All the monetary transfers are recorded on the blockchain, and so are the moves of all the players (so everyone can verify that there was no foul play). But once some time has passed, all that matters is the fees paid and the rewards earned. What if we could store the moves of the players in a separate chain, designed to only keep the records for a month? But not a private one kept by the app operator - we still want all the benefits of having a wide range of nodes validating and transmitting transactions related to that chain, we just want it to be cheaper. Since the records are dropped after a month, the RC cost related to history_bytes (the dominant cost of many operations) should be a lot lower. And we could make it more responsive while at it (half-second blocks?).

Wouldn't it put more work on the shoulders of witnesses to manage two chains at once? Not necessarily. The amount of incoming transactions would be the same and the process of validation would be almost the same, just some transactions would be marked as volatile to indicate that they should be placed outside of the main chain. Since by definition such transactions should be generally stateless for consensus code, it could turn out that they are easier to apply than regular transactions (because they might not even need undo sessions). Each witness, instead of producing one block every 3 seconds, would be producing that one block plus a couple of consecutive sidechain blocks. The sidechain could be handled by a separate thread, so there would be no problem of one blocking production of the other.

Of course there are challenges. First of all, not all operations make sense to be stored in the volatile chain. F.e. those related to assets or voting cannot be stored there. In the case of custom_jsons there really is no way for consensus code to tell which of them could be allowed to be volatile (at least until we gain the ability to define and enforce schemas for custom_jsons, which calls for 2nd layer governance, which in turn calls for SMTs), so apps that use them would have to be prepared that their operations might be marked wrong and react accordingly. In practice it would be the responsibility of the app's client code to send properly marked transactions according to the specific demands of the application server. Even then some operations might still be marked wrong. F.e. Hivemind should ignore a volatile follow; however, orders to clear notifications, which should disappear on their own after a month anyway, could be volatile (and interpreted as such even if placed in the main chain). I think the biggest problem is that there are a lot of opportunities to introduce bugs in the apps. The general rule is simple though: if the operation changes state, it cannot be volatile (and if marked as one, it should be ignored), unless the state itself is volatile (like in the case of notifications).

Another challenge is the relation between the main chain and the sidechain. Now, when we have only one chain of blocks, it is the witnesses that decide on the final order of transactions. The sidechain would be similar within itself. Since the main chain cannot rely on data from the sidechain, it would not have any connection to it. So only the sidechain should link to the main chain, at least once in a while. The fragments of the main chain and the sidechain that come after the last connection should be considered concurrent by the apps (and therefore as not having a definite order). If there was just one sideblock between regular blocks, considering all the optimizations already done, it would be no challenge to just link from the sideblock to the last regular block; however, if the sidechain was to be considerably faster, it would need to link to at least one block in the past (because at the time the first sideblock of the batch is produced, "the last regular block" to link to, at least in order of timestamps, could still be in production).

https://images.ecency.com/DQmUdKGsixhxxeidHAweHWkYAry4ZkhD9xiTsD5TmJBkbgY/chains.png
The relation of block 2 and vb2.1+vb2.2 is unknown (despite the volatile blocks having later timestamps) - they should be treated as concurrent.

Can a witness miss a regular block but not the volatile blocks? Can they miss some but not all volatile blocks? When can the next witness assume volatile blocks were missed?

Solidify.
Imagine you have a Hive-based version of Signal. Normally you talk to your wife about the grocery list, arrange a meeting with buddies, pass links to funny cats during work or haggle over the price of drugs with your dealer (not judging, I don't know the content of your conversation, it is all encrypted). Such conversations disappearing after a month won't cause problems. However, someone might have promised you something and you want to be able to go back to that message. Solidify! You pass a transaction to the regular chain that says: "volatile block X, transaction Y, looked like this (full content)". As long as it is still within a month (and block X is in the definite past - see the ordering problems above) the node can accept such a transaction, confirming that it took place. Of course solidification means you will be paying the history_bytes RC cost for someone else's transaction (since its full content is contained in your own). Notice that when replaying or syncing from the deep past, the node won't have the ability to validate such a transaction. It would need to assume it was ok if it is contained within a correctly signed block. Which brings me to another idea:

Volatile signatures.
Let's store signatures as volatile and forget about them after a month. When replaying, unless run with --validate-during-replay, the node ignores signatures. When syncing from the deep past the node validates everything, however it doesn't really need to. The past block was signed by the proper witness, the next blocks are based on that block, and looking a couple of blocks into the relative future it is clear that enough witnesses confirmed it to make it irreversible. But what if witnesses conspired to rewrite the past? Not having to fake user signatures, they would be able to do that. True, however they'd need to toss out a whole month's worth of the most recent transactions, since for the last month you'd still have all the data, like today. User signatures on the last month's transactions protect those from being faked, and TaPoS on them protects the past from being rewritten.

We wouldn't be able to remove signatures from the main chain retroactively, since block ids are based on block headers that contain the merkle root, which in turn depends on the merkle digests of the contained transactions. Those are calculated from the whole signed transactions, signatures included. But for future blocks, making signatures volatile (and not including them in the digests) could reduce the amount of data permanently stored in the main chain.
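A toy illustration of that dependency (the types and the hash placeholder are invented; hived's real serialization is different): today the merkle leaf commits to the signatures, so dropping them would change block ids, while a digest over the unsigned body would not have that problem.

```cpp
#include <functional>
#include <string>
#include <vector>

struct transaction        { std::string body; };
struct signed_transaction { transaction tx; std::vector< std::string > signatures; };

// toy placeholder for the real SHA-256 based digest
std::string digest( const std::string& data )
{
  return std::to_string( std::hash< std::string >{}( data ) );
}

// today: the merkle leaf covers the whole signed transaction, signatures included
std::string merkle_leaf_with_signatures( const signed_transaction& stx )
{
  std::string serialized = stx.tx.body;
  for( const auto& sig : stx.signatures )
    serialized += sig;
  return digest( serialized );
}

// possible future blocks: commit only to the unsigned body, so signatures can be
// forgotten after a month without invalidating block ids
std::string merkle_leaf_without_signatures( const signed_transaction& stx )
{
  return digest( stx.tx.body );
}
```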

Volatile comments and votes.
Not all of them of course, but "Cheers. Great post!" or "Congratulations, your post received 2.71% upvote from manual-curator bot" might be nice when you see them, but they don't have a lasting effect; they bring cost but not a lot of value. Same with dust or ineffective post-cashout votes. They act as "likes", a notification that someone spent time reading your content. They don't really need to be there after a month. While third party apps using custom_jsons don't have the ability to enforce their rules on operation volatility, hived itself obviously can. Keep in mind the general rule - if it changes state, it cannot be volatile. The following rules would have to apply:

  • volatile comments can only be edited, deleted, replied to or voted on with volatile operations
  • you can post a volatile reply to a non-volatile comment; you can also cast a volatile vote on a non-volatile comment (it won't have any effect though)
  • volatile comments are not subject to author or curation rewards, therefore it should also not be possible to set options for them
  • volatile comments can't be used in proposals
  • it is up to services like Hivemind to decide whether edits of volatile comments should become inaccessible together with the main comment after a month, or whether such an edit would prolong the life of the main comment; a similar thing applies to replies to such comments; from the perspective of consensus code each volatile transaction has to die at a definite time after it was included in a block (so f.e. you won't be able to solidify someone's comment after a month, even if there are later replies to that comment that are younger than a month)
  • when replaying the blockchain, once the node reaches the most recent month for which it has volatile blocks, extra care needs to be taken, f.e. instead of asserting on a missing target - a volatile comment that was replied to - the node needs to assume it was there, but is now outside its time window

There are most likely more rules that need to be enforced, but that's the general idea.
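For illustration, a hedged sketch of how consensus code might enforce a couple of the rules above (the operation shapes and flags are invented for this example):

```cpp
#include <stdexcept>

struct comment_operation
{
  bool is_volatile        = false; // marked for the volatile sidechain
  bool parent_is_volatile = false; // the comment being replied to is volatile
};

struct vote_operation
{
  bool is_volatile        = false;
  bool target_is_volatile = false; // the comment being voted on is volatile
};

// rule: volatile comments can only be replied to with volatile operations
void validate( const comment_operation& op )
{
  if( op.parent_is_volatile && !op.is_volatile )
    throw std::runtime_error( "non-volatile reply to a volatile comment" );
}

// rule: volatile comments can only be voted on with volatile operations
// (a volatile vote on a non-volatile comment is allowed, it just has no effect)
void validate( const vote_operation& op )
{
  if( op.target_is_volatile && !op.is_volatile )
    throw std::runtime_error( "non-volatile vote on a volatile comment" );
}
```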


Which of the above ideas would you prefer to see implemented?
