What the Heck is a Merkle Tree?
How it’s often useful to learn things no one asks you to
They’re used to structure Blockchains, which is why you might have heard of them, but also you might not know that Git uses Merkle Trees to store commits and compare diffs between files.
In the corporate world, aka at work, the conceptual stuff tends to take a back seat.
I’m often asked to learn a new framework or library but no one ever asks you learn about data structures — but I have a sneaking suspicion they’re quite important.
Why are they useful?
Merkle Trees are a way to structure your data so that it has a few very useful properties.
- Deduplication — it’s impossible to have two duplicate nodes in a Merkle Tree, so they’re efficient.
- Self-discovery — each node in a Merkle Tree contains its own address, so it doesn’t matter so much where the node is kept on a network as long as you can verify that it’s the node you want.
- Immutability — they’re difficult to change, Merkle Trees tend to get added to rather than updated, I’ll explain why this is good thing later on.
How do they work?
In a directory tree a child can only have one direct parent but a child in a Merkle Tree can have any number of direct parents. Parents are more like tags than they are folders.
Say I have two files both with a picture of a dog in. They’ll be represented as nodes in the tree, both with an address to identify them:
These two nodes both contain an image of a dog so I might want to group them together.
In a directory I’d take these two nodes (or files) and put them inside a folder. In a Merkle structure I’d add another node on top that has the addresses to each of these nodes.
I can add any number of parents to these nodes without ever effecting the first parent I added:
This is more like a tagging system than a directory one but better because I can include parent addresses inside any other node, essentially tagging tags:
It has the advantages of a tagging structure and a directory structure.
Node Addresses (or CIDs)
The address for the node in a real Merkle Tree is called a Content Identifier or CID, it’s an id that’s tied to the actual content of the node.
Now here’s the interesting bit: the CID of the node will change if the content changes.
That’s why it’s impossible to have two nodes with the same address.
It’s also why Merkle Trees tend to be immutable; because if we change the content of the node, its address will change…so it’s no longer the same node.
Take our original example, we have two nodes with dogs in and a parent to group them together:
Say one of the dog images changes, its address will also have to change because the address is tied to the content of the node:
This breaks the parent link. Now the parent needs to change to point to the new address, which will change its address as well.
This is because the child addresses in the parent are its contents. If the content changes then the address also needs to change:
Now say the contents of
address2 were being shared over a network and everyone on the network had a copy of this tree, when
address1 changed to
address9 and transformed the tree, the other people of the network would still have the original tree with
Rather than mutate everyone’s tree, we can just add another parent node to it like this:
One of the reasons this is useful is that it’s so efficient for storing and addressing data online.
HTTP addresses vs Merkle Tree addresses
With the current server system we use, data is stored at addressed servers. We might store our dog image at:
If someone else wants to use that same image they’re likely to copy the same image and upload it again to their own server at:
Now we’ve got two identical dog images being served from two different places.
On a Merkle Tree these two dog images might be stored on two or more computers but if the images are identical they’ll always get the same CID address. So, to the network, they’re the same node.
For actual data Merkle Trees are really efficient. But what about all those parent nodes?
Is the tree itself wasteful?
If the nodes on a Merkle Tree are immutable (e.g. never erased) and we’re constantly adding data and new parent nodes to group that data won’t the tree become a huge mess as it grows and grows?
There is a point to this argument…if you’re using server architecture. Say this tree was held inside a database on a server somewhere in an AWS farm. After a while the data-set would be massive.
This is partly why Merkle Trees are best suited for peer to peer networks. On a peer to peer network no one peer has the whole tree, they just have parts of the tree. If a part of the tree is never used by a peer, it dies.
This leads to another point…
Downloading old or unpopular content
On peer to peer networks the content has to be popular, meaning lots of nodes have it. Otherwise the content becomes difficult to find or simply disappears completely.
This isn’t such a problem for real-time content, like a chat app or a live video streaming network. But what about situations where you want to keep a history of the tree?
Blockchains and Git
I’ve already mentioned that Merkle Trees are used for Git because it’s such a good data structure for tracking and comparing nodes. Each node’s address allows you to make quick comparisons and track a history of changes.
This is why they’re also really good for keeping a ledger or blockchain to track transactions:
A blockchain is shared over a peer to peer network so each node can add a new transaction to the chain. Every new transaction adds another parent to the tree.
Usually a peer to peer network doesn’t really care about a node dying if the content isn’t popular but for a blockchain this situation doesn’t apply. An old node in the tree could still be important.
Say people begin saving their wills or other important legal documents to a blockchain. This isn’t popular content but it is important. The blockchain has to be preserved somewhere.
And that blockchain will just get bigger and bigger as more transactions are added? Um…yeah.
Isn’t that a problem? I think so, but I’m not an expert.
Merkle Trees have some more benefits and interesting features that make them a very useful data structure. If you’re interested in learning more Merkle and p2p related stuff I’d recommend having a look at Proto Schools tutorials on Content Addressing and Merkle DAGs.