<h1 id="my-favorite-papers">My favorite papers</h1>
<p><em>2021-08-17</em></p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Sometimes people ask me which computer science papers they should read and I can't really answer that question, but I can list the papers I've enjoyed reading over the past years.</p>— Pedro Tavareλ (@ordepdev) <a href="https://twitter.com/ordepdev/status/1426499455218450435?ref_src=twsrc%5Etfw">August 14, 2021</a></blockquote>
<p>Following the tweet above, I’ve decided to do a <em>thread dump</em> of my favorite computer science papers.</p>
<p>This is not a <em>you should read these papers</em> kind of post; it’s a curated list of great
computer science papers that I’ve enjoyed reading and re-reading over the past years.</p>
<p>(I think you should read them as well!)</p>
<h3 id="-the-design-and-implementation-of-a-log-structured-file-system">📃 <a href="https://people.eecs.berkeley.edu/~brewer/cs262/LFS.pdf">The Design and Implementation of a Log-Structured File System</a></h3>
<p>💡 You’ll learn about a technique called a log-structured file system that
writes all modifications to disk sequentially, thereby speeding up both file
writing and crash recovery.</p>
<p><img src="https://pbs.twimg.com/media/E8vxs24VIAUh-a4?format=png" alt="The Design and Implementation of a Log-Structured File System" /></p>
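<p>The core idea can be sketched with a toy in-memory store (my own illustration, not the paper’s implementation): every write is appended to a sequential log, and crash recovery is just a replay of that log.</p>

```scala
import scala.collection.mutable

// Toy log-structured store: all writes append to a sequential log;
// the read index is rebuilt by replaying the log (crash recovery).
class LogStructuredStore {
  private val log = mutable.ArrayBuffer.empty[(String, String)] // sequential "disk"
  private var index = Map.empty[String, Int]                    // key -> log offset

  def put(key: String, value: String): Unit = {
    log.append((key, value))          // sequential write, never in-place
    index += key -> (log.length - 1)
  }

  def get(key: String): Option[String] = index.get(key).map(log(_)._2)

  // Recovery: rebuild the index by scanning the log from the start;
  // later entries win, so the latest value for each key survives.
  def recover(): Unit = {
    index = Map.empty
    for (i <- log.indices) index += log(i)._1 -> i
  }
}
```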
<h3 id="-the-ubiquitous-b-tree">📃 <a href="http://carlosproal.com/ir/papers/p121-comer.pdf">The Ubiquitous B-Tree</a></h3>
<p>💡 You’ll learn about a disk-based index structure called B-Tree and its
different variations. The paper does quite a good job of explaining why
they have been so successful over the years.</p>
<p><img src="https://pbs.twimg.com/media/E8vxtaxVoAYjdme?format=png&name=large" alt="The Ubiquitous B-Tree" /></p>
<h3 id="-the-log-structured-merge-tree">📃 <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.2782&rep=rep1&type=pdf">The Log-Structured Merge-Tree</a></h3>
<p>💡 You’ll continue to learn about low-cost indexing for a file experiencing
a high rate of record inserts over an extended period. The paper also provides
a nice comparison of LSM-tree and B-tree I/O costs.</p>
<p><img src="https://pbs.twimg.com/media/E8vxuCgVUAUFwoz?format=png&name=large" alt="" /></p>
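<p>A minimal sketch of the idea (a toy, not the paper’s algorithm): writes land in an in-memory sorted memtable that is periodically flushed as an immutable sorted run, and reads check the memtable first, then the runs from newest to oldest.</p>

```scala
import scala.collection.immutable.TreeMap

// Toy LSM-tree: writes go to an in-memory sorted memtable; when it
// fills up, it is flushed as an immutable sorted run. Reads check the
// memtable first, then runs from newest to oldest.
class ToyLsmTree(memtableLimit: Int = 2) {
  private var memtable = TreeMap.empty[String, String]
  private var runs = List.empty[TreeMap[String, String]] // newest first

  def put(key: String, value: String): Unit = {
    memtable += key -> value
    if (memtable.size >= memtableLimit) {
      runs = memtable :: runs // sequential flush, no in-place updates
      memtable = TreeMap.empty
    }
  }

  def get(key: String): Option[String] =
    memtable.get(key).orElse(runs.collectFirst {
      case run if run.contains(key) => run(key)
    })
}
```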
<h3 id="-kafka-a-distributed-messaging-system-for-log-processing">📃 <a href="http://notes.stephenholiday.com/Kafka.pdf">Kafka: a Distributed Messaging System for Log Processing</a></h3>
<p>💡 You’ll learn about log processing, Kafka’s architecture, and design principles
including producers, brokers, and consumers.</p>
<p><img src="https://pbs.twimg.com/media/E8vxujhVcAAOQlD?format=png&name=large" alt="" /></p>
<h3 id="-zookeeper-wait-free-coordination-for-internet-scale-systems">📃 <a href="https://www.usenix.org/legacy/event/atc10/tech/full_papers/Hunt.pdf">ZooKeeper: Wait-free coordination for Internet-scale systems</a></h3>
<p>💡 You’ll learn about the ZooKeeper wait-free coordination kernel and a lot of
distributed systems concepts that are nicely described in the paper.</p>
<p><img src="https://pbs.twimg.com/media/E8vxvGIUYAQ_xu3?format=png&name=large" alt="" /></p>
<h3 id="-a-certified-digital-signature">📃 <a href="http://www.merkle.com/papers/Certified1979.pdf">A Certified Digital Signature</a></h3>
<p>💡 You’ll learn about one-way functions, the Lamport-Diffie one-time signature,
and a new “tree-signature” also known as Merkle tree.</p>
<p><img src="https://pbs.twimg.com/media/E8vxvodVIAAYyLv?format=png&name=large" alt="" /></p>
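<p>The tree-signature construction boils down to hashing pairs of child hashes until a single root remains, so any change in any leaf changes the root. A toy sketch (SHA-256 is my choice here; the paper predates it):</p>

```scala
import java.security.MessageDigest

// Toy Merkle tree root: hash the leaves, then repeatedly hash pairs of
// child hashes until a single root remains.
def sha256(bytes: Array[Byte]): Array[Byte] =
  MessageDigest.getInstance("SHA-256").digest(bytes)

def merkleRoot(leaves: Seq[String]): String = {
  def combine(level: Seq[Array[Byte]]): Seq[Array[Byte]] =
    level.grouped(2).map {
      case Seq(l, r) => sha256(l ++ r)
      case Seq(l)    => l // odd node promoted unchanged
    }.toSeq

  var level: Seq[Array[Byte]] = leaves.map(s => sha256(s.getBytes("UTF-8")))
  while (level.size > 1) level = combine(level)
  level.head.map("%02x".format(_)).mkString
}
```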
<h3 id="-time-clocks-and-the-ordering-of-events-in-a-distributed-system">📃 <a href="https://www.microsoft.com/en-us/research/uploads/prod/2016/12/Time-Clocks-and-the-Ordering-of-Events-in-a-Distributed-System.pdf">Time, Clocks and the Ordering of Events in a Distributed System</a></h3>
<p>💡 Leslie Lamport’s most cited paper. You’ll learn about logical clocks,
real-time synchronization, and concepts such as “total ordering” and “happened-before”.</p>
<p><img src="https://pbs.twimg.com/media/E8vxwPuUcAESScD?format=png&name=large" alt="" /></p>
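<p>The logical clock rules from the paper fit in a few lines: tick on every local event, and on message receipt jump past the sender’s timestamp. A toy sketch:</p>

```scala
// Toy Lamport clock: a counter that ticks on local events and, on
// message receipt, jumps above the sender's timestamp. This yields the
// "happened-before" partial order described in the paper.
class LamportClock {
  private var time = 0L
  def localEvent(): Long = { time += 1; time }
  def send(): Long = localEvent() // timestamp attached to the outgoing message
  def receive(senderTime: Long): Long = {
    time = math.max(time, senderTime) + 1
    time
  }
  def now: Long = time
}
```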
<h3 id="-harvest-yield-and-scalable-tolerant-systems">📃 <a href="https://github.com/papers-we-love/papers-we-love/blob/master/distributed_systems/harvest-yield-and-scalable-tolerant-systems.pdf">Harvest, Yield, and Scalable Tolerant Systems</a></h3>
<p>💡 You’ll learn about strategies for improving a system’s overall availability
while tolerating some kind of graceful degradation.</p>
<p><img src="https://pbs.twimg.com/media/E8vxw1CVgAQsZk0?format=png&name=large" alt="" /></p>
<h3 id="-the-byzantine-generals-problem">📃 <a href="http://www.andrew.cmu.edu/course/15-749/READINGS/required/resilience/lamport82.pdf">The Byzantine Generals Problem</a></h3>
<p>💡 You’ll learn about reliability in computer systems that must cope
with the failure of one or more of their components.</p>
<p><img src="https://pbs.twimg.com/media/E8vxxX0VUAM4MQu?format=png&name=large" alt="" /></p>
<h3 id="-linearizability-a-correctness-condition-for-concurrent-objects">📃 <a href="http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf">Linearizability: A Correctness Condition for Concurrent Objects</a></h3>
<p>💡 You’ll learn about a strong correctness condition for concurrent objects that
guarantees a strict time ordering of read and write operations in a multi-threaded environment.</p>
<p><img src="https://pbs.twimg.com/media/E8vxyOlVEAUjqbu?format=png&name=large" alt="" /></p>
<h3 id="-conflict-free-replicated-data-types">📃 <a href="https://arxiv.org/abs/1805.06358">Conflict-free Replicated Data Types</a></h3>
<p>💡 You’ll learn about a data structure that makes the eventual consistency of a
distributed object possible without coordination between replicas.</p>
<p><img src="https://pbs.twimg.com/media/E8vxyw1UcAMlVpS?format=png&name=large" alt="" /></p>
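<p>A grow-only counter (G-Counter) is the classic minimal example: each replica increments its own slot, and merging takes the element-wise maximum. Because merge is commutative, associative, and idempotent, replicas converge without coordination. A toy sketch:</p>

```scala
// Toy state-based G-Counter CRDT: each replica increments its own slot;
// merge takes the element-wise max, so replicas converge regardless of
// the order in which states are exchanged.
case class GCounter(counts: Map[String, Long] = Map.empty) {
  def increment(replica: String, by: Long = 1): GCounter =
    GCounter(counts.updated(replica, counts.getOrElse(replica, 0L) + by))

  def value: Long = counts.values.sum

  def merge(other: GCounter): GCounter =
    GCounter((counts.keySet ++ other.counts.keySet).map { k =>
      k -> math.max(counts.getOrElse(k, 0L), other.counts.getOrElse(k, 0L))
    }.toMap)
}
```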
<h3 id="-delta-state-replicated-data-types">📃 <a href="https://arxiv.org/abs/1603.01529">Delta State Replicated Data Types</a></h3>
<p>💡 You’ll learn about an optimization to state-based CRDTs that ensures convergence
by disseminating only recently applied changes, instead of the entire (possibly large) state.</p>
<p><img src="https://pbs.twimg.com/media/E8vxzQAUYAIdq6Q?format=jpg&name=large" alt="" /></p>
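<p>The delta idea can be sketched on a G-Counter: a mutation returns only the changed entry (the delta), which is all that needs to be shipped, and joining a delta uses the same element-wise maximum as a full-state merge. A toy sketch:</p>

```scala
// Toy delta-state G-Counter: an increment returns only the mutated
// entry (the delta). Replicas ship deltas instead of the full state,
// and joining a delta uses the same element-wise max as a full merge.
type State = Map[String, Long]

def increment(state: State, replica: String): (State, State) = {
  val delta: State = Map(replica -> (state.getOrElse(replica, 0L) + 1))
  (join(state, delta), delta) // new local state + delta to disseminate
}

def join(a: State, b: State): State =
  (a.keySet ++ b.keySet).map { k =>
    k -> math.max(a.getOrElse(k, 0L), b.getOrElse(k, 0L))
  }.toMap

def value(s: State): Long = s.values.sum
```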
<h3 id="-making-reliable-distributed-systems-in-the-presence-of-software-errors">📃 <a href="https://erlang.org/download/armstrong_thesis_2003.pdf">Making reliable distributed systems in the presence of software errors</a></h3>
<p>💡 You’ll learn about Erlang, concurrent programming, message passing, fault-tolerance,
and the concept of “let it crash”.</p>
<p><img src="https://pbs.twimg.com/media/E8v5esPXIAcbCaW?format=jpg&name=large" alt="" /></p>
<h3 id="looking-for-more-papers">Looking for more papers?</h3>
<p>These are my favorites.</p>
<p>I might be missing a few papers, for sure.</p>
<p>You can still find a lot of curated papers for you to read at
<a href="https://twitter.com/papers_we_love">@papers_we_love</a>,
<a href="https://twitter.com/intensivedata">@intensivedata</a>,
and <a href="https://twitter.com/therealdatabass">@therealdatabass</a>.</p>
<h1 id="stop-fighting">Stop fighting</h1>
<p><em>2021-08-09</em></p>
<p>You want to change everything and everyone around you, not just because you want to, but because you believe it’s the <em>best</em> thing to do.</p>
<p>If you’re like me, working with computers daily for an organization with one or more products, you have to work under a certain amount of <em>rules</em>.</p>
<p>Some of them were in place way before you joined the organization and the rest of them were set up during your tenure.</p>
<p>You don’t agree with all of them, and that’s fine, since we can’t please everyone. But you are a true warrior, and you never settle: you fight for your <strong>beliefs</strong>, whether the matter is technical, cultural, or behavioral.</p>
<p>Sadly, <strong>you’ll lose</strong> most of the <em>battles</em>.</p>
<p>Not because your <em>beliefs</em> suck. Not because the organization won’t benefit from them. You’ll lose because <strong>you’re against someone</strong> else’s <em>beliefs</em>.</p>
<p>I know you want to (continue to) work remotely. I know you want to do some functional programming. I know you want to focus on infrastructure and performance problems. Yet it’s <strong>not going to happen there</strong>.</p>
<p>Take a deep breath, <strong>stop fighting and embrace it</strong>; you still have the choice to leave and move on!</p>
<h1 id="being-on-call">Being on-call</h1>
<p><em>2021-03-29</em></p>
<p>Being on-call: why, what, and how.</p>
<h2 id="why-should-you-be-on-call"><strong>Why should you be on-call?</strong></h2>
<p>I’m assuming you’re in a software development team and you write software that gets deployed into production. I’m also assuming you have paying customers, otherwise, your on-call program wouldn’t exist in the first place. If all this is true, you should be on-call, at least for the stuff you write.</p>
<h3 id="responsibility"><strong>Responsibility</strong></h3>
<p>Being on-call gives you a sense of responsibility and accountability for the code you deploy into production. You build it, you run it, you own it. You know you need to be super careful with your code, otherwise you’ll get paged at night. You don’t want that, and neither do I.</p>
<h3 id="learning"><strong>Learning</strong></h3>
<p>Incidents are the best learning platform; you have to debug your system looking for clues, and you get to truly know it while running it in production. Setting up meaningful logs and tracing provides you with a way to track what your system is doing. Setting up metrics and alerts provides you with a way to track how your system is performing. If you have an incident in production, you will re-evaluate these things and come up with finer-grained logs and metrics that will tell you even more about your system.</p>
<h3 id="money"><strong>Money</strong></h3>
<p>You shouldn’t be on-call <em>just</em> for money, but if you’re in a company that provides you a stipend for being on-call, it’s a win-win. At the end of the day, they pay you for writing great systems and for being responsible for them out of working hours.</p>
<h2 id="what-should-you-expect-from-being-on-call"><strong>What should you expect from being on-call?</strong></h2>
<p>I’ll tell you what your organization expects from you: “can you fix it?”.</p>
<h3 id="you-shouldnt-just-land-there"><strong>You shouldn’t</strong> <strong>just</strong> <strong>land there</strong></h3>
<p>Even if you’re a seasoned engineer, you shouldn’t <em>just</em> land in an on-call rotation without first experiencing it alongside others. Each system is unique and has its own traits. It takes time to pinpoint the possible root causes in an unfamiliar system, so, at first, you should observe.</p>
<p>Someone has to coach you for one or more rotations. You should try to shadow on-call engineers. You’ll get familiar with a given system by receiving the same alerts they do, watching them look for the root cause, and being the first to read the post-mortem.</p>
<p>After you get a sense of how things behave, you should be ready to start your first on-call rotation.</p>
<p>I call this <strong>observe, collect, act later</strong>.</p>
<h2 id="how-to-perform-during-an-on-call-rotation"><strong>How to perform during an on-call rotation?</strong></h2>
<p>You need to act fast, be accurate, and communicate a lot. A lot of text, during an incident, is not a lot of text. Do yourself a favor by communicating a lot during an incident, even if it’s a <em>small thing</em>.</p>
<p>You shouldn’t deal with incidents alone during working hours. Everyone should be able to contribute and you must keep them posted.</p>
<p>You shouldn’t hide what you’re doing to solve a production incident. Everyone interested should be able to jump on a call with you. Having extra eyes looking into the problem will help you solve the issue. Additionally, you’ll get at least a free pair of eyes to review the <em>post-mortem</em>. Everyone involved in the incident should contribute to it.</p>
<p>Make sure you behave the same way during incidents outside working hours. It may feel creepy writing to yourself at 3 AM, but once your team is up, they will thank you for keeping them posted.</p>
<h2 id="how-can-you-improve-your-performance-during-an-on-call-rotation"><strong>How can you improve your performance during an on-call rotation?</strong></h2>
<h3 id="weekly-drills"><strong>Weekly drills</strong></h3>
<p>If you belong to an on-call rotation, you must be able to solve incidents. If you have few incidents per rotation, how do you stay up-to-date with the system? Will you be able to solve a given incident two months from now without practicing? Google calls this “Operational Underload”.</p>
<blockquote>
<p>Being on-call for a quiet system is blissful, but what happens if the system is too quiet or when SREs are not on-call often enough? An operational underload is undesirable for an SRE team. Being out of touch with production for long periods can lead to confidence issues, both in terms of overconfidence and underconfidence, while knowledge gaps are discovered only when an incident occurs.</p>
</blockquote>
<p>You build confidence in your system and on your on-call rotation if you practice a lot beforehand. A great solution to build confidence in your rotation is to take part in <em>fire-drills</em>. It shouldn’t be difficult to set them in place every week to match each rotation. This way, everyone on-call can solve at least one incident per rotation.</p>
<p>Having weekly <em>fire-drills</em> will help you keep your <em>runbooks</em> updated. If your organization lacks <em>runbooks</em>, <em>fire-drills</em> will get you started. Make sure you don’t skip the creation of meaningful logs and metrics after an incident or a fire-drill. All this combined will boost your confidence to solve incidents after hours on your own.</p>
<h3 id="debugging-skills"><strong>Debugging skills</strong></h3>
<p>Sometimes, your metrics and logs won’t tell you everything about your system during an incident. You’ll have to debug a given service in production. Make sure you practice how to do it during the <em>fire-drills</em>, otherwise you’ll struggle. Know your stack, from top to bottom, and learn how to debug things in production; you won’t regret it.</p>
<h3 id="writing-skills"><strong>Writing skills</strong></h3>
<p>Good communicators benefit the whole organization, during and after an incident. Make sure you communicate during the incident, frequently and clearly. Ask others to review your <em>post-mortems</em>. Learn from the best contributors you know. Copy their writing style and keep improving. Everyone in the organization will thank you for being an excellent writer.</p>
<p>Keep practicing, keep writing 🖖</p>
<h1 id="standups">Standups</h1>
<p><em>2021-01-13</em></p>
<p>Do you like your team standups? What about your team, do they like their standups?
Are they useful? You might think they are, but you might also be missing the point.</p>
<h2 id="whats-wrong">What’s wrong?</h2>
<p>I don’t want to sound harsh (dammit, I did it again), but… I bet your team is doing
things wrong. It’s not hard at all to mess things up and continuously think that
everything is fine. It happens a lot with shy teams; they might talk to each other about
these things, but they avoid raising the flag.</p>
<p>So, my advice here is, don’t trust your feelings; pause and observe their mood and
interactions during the meeting (take this advice to the next level and apply it to
every other meeting, you might see the same patterns) and ask yourself a simple question:
are they worth it?</p>
<h2 id="how-they-really-are-most-of-the-times">How they really are (most of the time)</h2>
<ul>
<li>Waste of time;</li>
<li>Unnecessary pressure;</li>
<li>Live status report.</li>
</ul>
<p>If your standup looks like a status report (you might have to pause and observe first),
you’ve landed on the right post. It’s not cool to have status reports, not cool at all.
It’s stressful and depressing, and not the right way to start the day.</p>
<h2 id="you-dont-need-another-status-report">You don’t need another status report</h2>
<p>There’s a common practice of using standups to ask developers how long it takes to finish
a given feature. “Are you done yet?”, “And now, is it finished?”, “How long will it take?”.</p>
<p>I don’t like these types of questions. It triggers feelings of pressure, insecurity, doubt,
you name it; I can’t find any valid reason to ask these kinds of questions during a standup.</p>
<p>A lot of people fear the spotlight, and that’s fine. A lot of people fear falling behind, and
again, that’s fine. It can be stressful to watch your teammates deliver multiple features
while you’re struggling with something complex. What if you struggle for two days or more?</p>
<p>In a standard status report, you either say that you finished something yesterday, or that
you’ll try to finish it today; you don’t say you’re struggling, you don’t ask for help, but
you should, and your teammates should offer their help.</p>
<h2 id="yesterday-yesterday-yesterday">Yesterday, yesterday, yesterday</h2>
<p>There’s another common practice of <em>just</em> telling everyone what you did yesterday. Do you really need
to tell your teammates what you did yesterday? Unless you’re not making any kind of progress,
or you’re doing solo work (which you shouldn’t, if you work on a team), your teammates should
already know what you did yesterday.</p>
<p>We live in a world where every single line of code you write is probably linked to the related
feature. There are <em>issues</em>, <em>commits</em>, <em>pull-requests</em>, <em>merges</em>, <em>validations</em>, <em>deploys</em>,
<em>releases</em>. There are notifications and logs for each of these actions, and all of them are linked;
all of them should involve multiple people (if you work in a small team, ideally it should involve
the whole team). So the question is: why should you repeat yourself every single day?</p>
<p>The answer is: you shouldn’t; everyone on your team should already know what you did yesterday.
You did your work, you communicated with your teammates, you even shipped the feature
into production; everyone at the company should know by now what you’ve been doing!</p>
<h2 id="what-about-writing-instead-of-speaking">What about writing instead of speaking?</h2>
<p>I like to write things down:</p>
<ul>
<li>It helps me formalize my thoughts</li>
<li>It serves as a personal diary</li>
<li>It tends to improve my writing skills</li>
<li>I can share them in multiple mediums</li>
<li>I can look back and see what I wrote a few weeks ago.</li>
</ul>
<p>What if you wrote down your daily/weekly/monthly progress, instead of giving a live status report
every single day? If you value outstanding documentation, you’re probably surrounded by people
with outstanding writing skills. Writing should not be a problem for these kinds of individuals.</p>
<p>Everything you need for a <em>status report</em> like standup is already available on your digital collaboration
platform. You have multiple features, multiple work items, multiple states (todo, in-progress, done)
for each work item, multiple assignees for each work item, plus their corresponding commits, pull-requests,
deploys…</p>
<h2 id="tracking-disperse-but-valuable-information">Tracking dispersed (but valuable) information</h2>
<p>You can make use of this valuable (but dispersed) information and track it on your team channel. At the
end of the day, everyone in the team should know that Pedro took an item from <em>todo</em>, had a discussion
with another two teammates in the issue itself before merging three detailed commits into <em>master</em> that
were deployed into production a few minutes later.</p>
<p>If you want to give daily detailed updates, you can still write a compacted version of the activity log.
Give the activity log a try, and go with weekly updates; encourage your team to write their macro
accomplishments of the week; bundle those together and give a nice presentation to the whole company.
There are so many companies and open-source projects out there that deliver amazing products without
syncing every single day, why should you?</p>
<h2 id="i-need-human-contact">I need human contact</h2>
<p>If you want to have some kind of human contact, write down your status and use those standups for things
that really matter: collaboration, collaboration, and collaboration. Check for dependencies, ask for help,
offer help, unblock the blockers, discuss and plan your day (especially if you’re pairing/mobbing), and for
f*cks sake, say hi to your teammates…</p>
<h2 id="is-there-any-action-point">Is there any action point?</h2>
<p>Optimize your daily team routine, drop those f*cking status reports and use that slot to start the day
(or week) in the right mood!</p>
<h1 id="tales-from-running-kafka-streams-in-production">Tales from running Kafka Streams in Production</h1>
<p><em>2019-10-30</em></p>
<h2 id="why-using-kafka-streams-in-the-first-place">Why use Kafka Streams in the first place?</h2>
<p>If you’re running a Kafka cluster, Kafka Streams comes in handy for three main
reasons: (1) it’s a high-level wrapper around consumers/producers on top of Kafka,
(2) it supports stateful streams using RocksDB, and (3) it handles partition
assignment across your processing nodes.</p>
<h2 id="plan-the--of-partitions-in-advance">Plan the # of partitions in advance</h2>
<p>To scale your processing you can either (1) move to a stronger CPU,
more memory, and a faster disk, or (2) increase the number of processing instances.
Always keep in mind that a single Kafka partition can only be processed by
one consumer in a group; that’s why the number of partitions is important: if it’s
too low, a single consumer may not be able to keep up with your throughput. Make sure you
plan the number of partitions in advance or your consumer lag will grow.</p>
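<p>As a back-of-the-envelope aid (the throughput figures in the test are made up for illustration), the minimum partition count follows directly from the one-consumer-per-partition rule:</p>

```scala
// Back-of-the-envelope partition sizing: since one partition is consumed
// by at most one consumer in a group, the partition count caps the
// parallelism you can reach by adding instances.
def minPartitions(targetMsgsPerSec: Long, perConsumerMsgsPerSec: Long): Int =
  math.ceil(targetMsgsPerSec.toDouble / perConsumerMsgsPerSec).toInt
```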
<h2 id="be-careful-with-your-persisted-schemas">Be careful with your persisted schemas</h2>
<p>When processing and storing events in a state store you must be very
careful with the event schema, especially if you rely on the JSON format. Making
breaking changes to the event schema means that the processing layer will fail
to parse the JSON when reading from the state store, which will probably lead
to data loss if you skip the event, or to a crash loop if you retry the
processing.</p>
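<p>One defensive pattern is to keep the old schema readable alongside the new one, so reads from the state store neither crash-loop nor silently drop records. A hypothetical sketch (not the Kafka Streams API), using plain pipe-delimited strings instead of JSON to stay dependency-free:</p>

```scala
// Hypothetical versioned decoder for state-store values: the current
// "v2" schema is tried first, the previous "v1" schema is migrated on
// the fly, and anything else is quarantined instead of crash-looping.
case class Event(name: String, count: Long)

def decode(raw: String): Option[Event] = raw.split('|') match {
  case Array("v2", name, count) => Some(Event(name, count.toLong))
  case Array("v1", name)        => Some(Event(name, 0L)) // migrate old schema
  case _                        => None // quarantine, don't crash or drop silently
}
```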
<h2 id="dont-rely-on-internal-changelogs-for-downstream-processing">Don’t rely on internal changelogs for downstream processing</h2>
<p>For each state store, Kafka Streams maintains a replicated changelog Kafka topic in which
it tracks any state updates. Every time we insert a <em>key-value</em> pair into our state
store, a Kafka message is sent to the corresponding topic. After a crash,
the application will rebuild the state store from the corresponding changelog
topic. If we want to apply another processing layer, either in the current application
or in another one downstream, we should always use <code class="language-plaintext highlighter-rouge">context.forward(k, v)</code> (using the
<code class="language-plaintext highlighter-rouge">Processor API</code>) to forward our processor output to a given <em>sink</em> topic.</p>
<h2 id="state-stores-dont-have-ttl">State Stores don’t have TTL</h2>
<p>While the state store changelog topic uses <em>log compaction</em> so that old data can be
purged to prevent the topic from growing indefinitely, the state store itself doesn’t
have this kind of mechanism and yes, it will grow forever, unless the application
crashes and the state store is rebuilt from a now shorter version of the changelog.</p>
<p>(<a href="https://stackoverflow.com/questions/50622369/kafka-streams-is-it-possible-to-have-compact-delete-policy-on-state-stores">https://stackoverflow.com/questions/50622369/kafka-streams-is-it-possible-to-have-compact-delete-policy-on-state-stores</a>)</p>
<blockquote>
<p>Log compaction ensures that Kafka will always retain at least the last known
value for each message key within the log of data for a single topic partition.
It addresses use cases and scenarios such as restoring state after application
crashes or system failure, or reloading caches after application restarts during
operational maintenance. Let’s dive into these use cases in more detail and then
describe how compaction works.</p>
</blockquote>
<p>[Edit 1] According to <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-258%3A+Allow+to+Store+Record+Timestamps+in+RocksDB">KIP-258</a> there is an ongoing effort to add TTL
to state stores. Record timestamps were <a href="https://issues.apache.org/jira/browse/KAFKA-6521">added</a>
to <code class="language-plaintext highlighter-rouge">KTable</code>, allowing this initiative to move forward. If you’re asking yourself
why <code class="language-plaintext highlighter-rouge">TTL state stores</code> are <em>not yet</em> supported in Kafka Streams, it’s mainly because Kafka Streams
relies on <em>changelogs</em> as the source of truth, not the state stores. The two must be in
sync; otherwise, if we deleted old state store records, we might restore all of them
from the changelog during a rebalance, for example.</p>
<h2 id="the-restore-process">The restore process</h2>
<p>Every time the application starts, or in the worst case restarts, the state stores
will be restored from the corresponding changelogs. If we pay attention, we’ll notice
that a processor that relies on a given state store doesn’t start processing
while that state store is recovering. To get more visibility into this, we can
add a custom <code class="language-plaintext highlighter-rouge">StateRestoreListener</code> to track the state store restore process.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nn">com.typesafe.scalalogging.Logger</span>
<span class="k">import</span> <span class="nn">org.apache.kafka.common.TopicPartition</span>
<span class="k">import</span> <span class="nn">org.apache.kafka.streams.processor.StateRestoreListener</span>
<span class="k">class</span> <span class="nc">LoggingStateRestoreListener</span> <span class="k">extends</span> <span class="nc">StateRestoreListener</span> <span class="o">{</span>
<span class="k">val</span> <span class="nv">logger</span> <span class="k">=</span> <span class="nc">Logger</span><span class="o">(</span><span class="n">classOf</span><span class="o">[</span><span class="kt">LoggingStateRestoreListener</span><span class="o">])</span>
<span class="k">override</span> <span class="k">def</span> <span class="nf">onRestoreStart</span><span class="o">(</span><span class="n">topicPartition</span><span class="k">:</span> <span class="kt">TopicPartition</span><span class="o">,</span>
<span class="n">storeName</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span>
<span class="n">startingOffset</span><span class="k">:</span> <span class="kt">Long</span><span class="o">,</span>
<span class="n">endingOffset</span><span class="k">:</span> <span class="kt">Long</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="nv">logger</span><span class="o">.</span><span class="py">info</span><span class="o">(</span><span class="n">s</span><span class="s">"Restore started for $storeName and partition ${topicPartition.partition}..."</span><span class="o">)</span>
<span class="nv">logger</span><span class="o">.</span><span class="py">info</span><span class="o">(</span><span class="n">s</span><span class="s">"Total records to be restored ${endingOffset - startingOffset}."</span><span class="o">)</span>
<span class="o">}</span>
<span class="k">override</span> <span class="k">def</span> <span class="nf">onBatchRestored</span><span class="o">(</span><span class="n">topicPartition</span><span class="k">:</span> <span class="kt">TopicPartition</span><span class="o">,</span>
<span class="n">storeName</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span>
<span class="n">batchEndOffset</span><span class="k">:</span> <span class="kt">Long</span><span class="o">,</span>
<span class="n">numRestored</span><span class="k">:</span> <span class="kt">Long</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="nv">logger</span><span class="o">.</span><span class="py">info</span><span class="o">(</span>
<span class="n">s</span><span class="s">"Restored batch $numRestored for $storeName and partition ${topicPartition.partition}."</span><span class="o">)</span>
<span class="o">}</span>
<span class="k">override</span> <span class="k">def</span> <span class="nf">onRestoreEnd</span><span class="o">(</span><span class="n">topicPartition</span><span class="k">:</span> <span class="kt">TopicPartition</span><span class="o">,</span>
<span class="n">storeName</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span>
<span class="n">totalRestored</span><span class="k">:</span> <span class="kt">Long</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="nv">logger</span><span class="o">.</span><span class="py">info</span><span class="o">(</span><span class="n">s</span><span class="s">"Restore completed for $storeName and partition ${topicPartition.partition}."</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>You’ll be amazed, or outraged, by the amount of time wasted restoring large
changelogs. Note to self: if you’re running Kafka Streams on top of Kubernetes, make
sure you have persistent storage, otherwise these restore processes will kill your SLAs.</p>
<p>In a disaster scenario, when a particular instance crashes, configuring
<code class="language-plaintext highlighter-rouge">num.standby.replicas</code> may minimize the restore process by introducing shadow copies
of the local state stores.</p>
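<p>Enabling standbys is a one-line configuration change. A sketch (<code>num.standby.replicas</code> is the real Kafka Streams config key; the application id and value below are illustrative):</p>

```scala
import java.util.Properties

// Sketch: enable standby replicas so another instance keeps a warm
// shadow copy of each local state store, shortening restores after a crash.
val props = new Properties()
props.put("application.id", "my-streams-app") // illustrative id
props.put("num.standby.replicas", "1")        // one shadow copy per store
```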
<h2 id="oh-the-memory-overhead">Oh, the memory overhead</h2>
<p>Assigning large heaps to the JVM sounds reasonable at first, but Kafka Streams
uses lots of <em>off-heap</em> memory when using RocksDB, which eventually leads to
applications crashing for lack of free memory.</p>
<p>RocksDB stores data in at least four data structures: (1) <code class="language-plaintext highlighter-rouge">memtable</code>, (2) <code class="language-plaintext highlighter-rouge">bloomfilter</code>,
(3) <code class="language-plaintext highlighter-rouge">index</code>, and (4) <code class="language-plaintext highlighter-rouge">blockcache</code>. Besides that, it has lots of configurable
properties, which makes it a difficult job to tune properly.</p>
<blockquote>
<p>Unfortunately, configuring RocksDB optimally is not trivial. Even we as RocksDB
developers don’t fully understand the effect of each configuration change. If
you want to fully optimize RocksDB for your workload, we recommend experiments
and benchmarking, while keeping an eye on the three amplification factors.</p>
</blockquote>
<p>(<a href="https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide">https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide</a>)</p>
<p>On the other hand, if you don’t tune it, the memory usage of your application
will grow, and grow, and grow. So make sure you know the number of source
topics your application is consuming from, as well as the number of partitions and state stores.</p>
<blockquote>
<p>If you have many stores in your topology, there is a fixed per-store memory cost. E.g., if RocksDB is your default store, it uses some off-heap memory per store. Either consider spreading your app instances on multiple machines or consider lowering RocksDb’s memory usage using the RocksDBConfigSetter class.</p>
</blockquote>
<blockquote>
<p>If you take the latter approach, note that RocksDB exposes several important memory configurations. In particular, these settings include block_cache_size (16 MB by default), write_buffer_size (32 MB by default) write_buffer_count (3 by default). With those defaults, the estimate per RocksDB store (let’s call it estimate per store) is (write_buffer_size_mb * write_buffer_count) + block_cache_size_mb (112 MB by default).</p>
</blockquote>
<blockquote>
<p>Then if you have 40 partitions and using a windowed store (with a default of 3 segments per partition), the total memory consumption is 40 * 3 * estimate per store (in this example that would be 13440 MB).</p>
</blockquote>
<p>(<a href="https://docs.confluent.io/current/streams/sizing.html">https://docs.confluent.io/current/streams/sizing.html</a>)</p>
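<p>These quoted numbers are easy to sanity-check; a quick sketch of the arithmetic (values taken from the quotes above, with Elixir used purely as a calculator):</p>

```elixir
# Per-store estimate: (write_buffer_size_mb * write_buffer_count) + block_cache_size_mb
write_buffer_size_mb = 32
write_buffer_count = 3
block_cache_size_mb = 16

per_store_mb = write_buffer_size_mb * write_buffer_count + block_cache_size_mb
# => 112

# 40 partitions, windowed store with a default of 3 segments per partition
total_mb = 40 * 3 * per_store_mb
# => 13440
```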
<p>Having configured <code class="language-plaintext highlighter-rouge">ROCKSDB_BLOCK_CACHE_SIZE_MB</code>, <code class="language-plaintext highlighter-rouge">ROCKSDB_BLOCK_SIZE_KB</code>, <code class="language-plaintext highlighter-rouge">ROCKSDB_WRITE_BUFFER_SIZE_MB</code>, and <code class="language-plaintext highlighter-rouge">ROCKSDB_WRITE_BUFFER_COUNT</code> to the best possible values, we’re able
to estimate the cost of a single store. Obviously, if we have lots of streams with lots
of stores, it will require lots of memory.</p>
<p>[Edit 2] <a href="https://issues.apache.org/jira/browse/KAFKA-8323">KAFKA-8323: Memory leak of BloomFilter Rocks object</a>
and <a href="https://issues.apache.org/jira/browse/KAFKA-8215">KAFKA-8215: Limit memory usage of RocksDB</a>
from <a href="https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html">Kafka 2.3.0</a> might help/solve some
memory issues.</p>
<h2 id="dont-forget-the-disk-space">Don’t forget the disk space</h2>
<p>Consuming from large source topics and performing processing that stores <code class="language-plaintext highlighter-rouge">n</code>
records in RocksDB for each message leads to a fairly large amount of data stored
on disk. Without proper monitoring, it is very easy to run out of space.</p>
<h2 id="its-ok-to-do-external-lookups">It’s ok to do external lookups</h2>
<p>Well, most people will say to load all the needed data into a <code class="language-plaintext highlighter-rouge">Stream</code> or <code class="language-plaintext highlighter-rouge">Store</code> and
perform the <code class="language-plaintext highlighter-rouge">joins</code> or <em>local lookups</em> while processing our messages. Sometimes it’s easier
to perform external lookups, ideally against a <em>fast</em> database, and I must say that’s fine. Obviously,
it depends on your load, how many external lookups are performed per message, and how fast your
database can handle them. Making this kind of external call inside a stream
may introduce extra latency that impacts the consumer lag of downstream systems,
so please, use it carefully.</p>
<blockquote>
<p>Data locality is critical for performance. Although key lookups are typically very fast,
the latency introduced by using remote storage becomes a bottleneck when you’re working at scale.</p>
</blockquote>
<blockquote>
<p>The key point here isn’t the degree of latency per record retrieval, which may be minimal.
The important factor is that you’ll potentially process millions or billions of records through
a streaming application. When multiplied by a factor that large, even a small degree of network
latency can have a huge impact.</p>
</blockquote>
<p>The cool thing about having to lookup for data from an <em>external state store</em> is that we can
abstract our <em>external state store</em> as a simple <em>StateStore</em> and use it like the others, without
changing any existing code.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nn">org.apache.kafka.streams.processor.</span><span class="o">{</span><span class="nc">ProcessorContext</span><span class="o">,</span> <span class="nc">StateRestoreCallback</span><span class="o">,</span> <span class="nc">StateStore</span><span class="o">}</span>
<span class="k">class</span> <span class="nc">LookupStore</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">storeName</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">pool</span><span class="k">:</span> <span class="kt">Pool</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span> <span class="k">extends</span> <span class="nc">StateStore</span> <span class="o">{</span>
<span class="k">override</span> <span class="k">def</span> <span class="nf">init</span><span class="o">(</span><span class="n">context</span><span class="k">:</span> <span class="kt">ProcessorContext</span><span class="o">,</span> <span class="n">root</span><span class="k">:</span> <span class="kt">StateStore</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="nv">context</span><span class="o">.</span><span class="py">register</span><span class="o">(</span><span class="n">root</span><span class="o">,</span> <span class="k">new</span> <span class="nc">StateRestoreCallback</span><span class="o">()</span> <span class="o">{</span>
<span class="k">override</span> <span class="k">def</span> <span class="nf">restore</span><span class="o">(</span><span class="n">key</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span class="kt">Byte</span><span class="o">],</span> <span class="n">value</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span class="kt">Byte</span><span class="o">])</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{}</span>
<span class="o">})</span>
<span class="o">}</span>
<span class="k">override</span> <span class="k">def</span> <span class="nf">name</span><span class="o">()</span><span class="k">:</span> <span class="kt">String</span> <span class="o">=</span> <span class="k">this</span><span class="o">.</span><span class="py">storeName</span>
<span class="k">override</span> <span class="k">def</span> <span class="nf">flush</span><span class="o">()</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">()</span>
<span class="k">override</span> <span class="k">def</span> <span class="nf">close</span><span class="o">()</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="nf">pool</span><span class="o">().</span><span class="py">close</span><span class="o">()</span>
<span class="k">override</span> <span class="k">def</span> <span class="nf">persistent</span><span class="o">()</span><span class="k">:</span> <span class="kt">Boolean</span> <span class="o">=</span> <span class="kc">true</span>
<span class="k">override</span> <span class="k">def</span> <span class="nf">isOpen</span><span class="k">:</span> <span class="kt">Boolean</span> <span class="o">=</span> <span class="o">!</span><span class="nf">pool</span><span class="o">().</span><span class="py">isClosed</span>
<span class="k">def</span> <span class="nf">pool</span><span class="o">()</span><span class="k">:</span> <span class="kt">Pool</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="k">=</span> <span class="n">pool</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Note to self: every partition from a processor that’s using this <em>custom store</em>
opens a new connection to the database, which is not sustainable for the number
of processors and partitions that may exist. It’s advisable to use a shared connection
pool to reduce and control the number of open connections.</p>
<h2 id="streams-may-die-a-lot">Streams may die, a lot</h2>
<p>Within a Kafka cluster, there are <em>leader elections</em> among the available nodes. The
bad thing about them is that <em>sometimes</em> they have a negative impact on the currently
running streams.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2019-07-08 14:27:54 [my-stream-d98754ff-6690-4040-ae3c-fbe51f9cf39f-StreamThread-2] WARN o.apache.kafka.clients.NetworkClient - [Consumer clientId=my-stream-d98754ff-6690-4040-ae3c-fbe51f9cf39f-StreamThread-2-consumer, groupId=my-stream] 266 partitions have leader brokers without a matching listener, including [my-topic-15, my-topic-9, my-topic-3, my-topic-10, my-topic-16, my-topic-4, __transaction_state-0, __transaction_state-30, __transaction_state-18,
__transaction_state-6]
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2019-07-08 14:29:51 [my-stream-d98754ff-6690-4040-ae3c-fbe51f9cf39f-StreamThread-2] WARN o.apache.kafka.streams.KafkaStreams - stream-client [my-stream-d98754ff-6690-4040-ae3c-fbe51f9cf39f] All stream threads have died. The instance will be in error state and should be closed.
</code></pre></div></div>
<p>The logs above show that <code class="language-plaintext highlighter-rouge">266</code> partitions lost their <em>leader</em> and eventually the corresponding
stream just stopped. Without proper monitoring and alerting, mainly on consumer lag and the number
of live streams, we can go without processing messages for a long time. Having some kind of
retry mechanism may help as well. Just don’t trust that your stream will be up and running all the time;
bad things happen.</p>
<h2 id="timeouts-and-rebalances">Timeouts and rebalances</h2>
<p>From time to time, applications get stuck in a rebalancing state leading to several
timeouts.</p>
<p>Often it’s caused by processing very large batches of messages that take more
than the five-minute threshold to commit. By the time processing finishes, the stream
is already considered <em>dead</em> and the offset can’t be committed. On rebalance, the same
messages are fetched once again from Kafka and the same error occurs.</p>
<p>Reducing the <code class="language-plaintext highlighter-rouge">max.poll.records</code> value, often to <code class="language-plaintext highlighter-rouge">1</code>, would sometimes <em>alleviate</em> this
specific issue ¯\_(ツ)_/¯.</p>
<h2 id="still-">Still, …</h2>
<p>Building <em>real-time</em> applications with Kafka Streams is quick, easy, powerful, and very
natural after grasping the non-trivial stuff that comes with the full package.</p>Why using Kafka Streams in the first place?Papers, Love, and Meetup2019-09-19T00:00:00+00:002019-09-19T00:00:00+00:00https://ordep.dev/posts/papers-love-and-meetup<p>This should be titled <em>“Why I didn’t renewed Papers We Love @ Porto subscription on Meetup.com”</em> but you get the idea. Last August I didn’t pay for another half year subscription to have a placeholder for this scientific community; you might wonder why.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Yikes! Your Meetup Group will shut down in 48 hours!
Hi Pedro Tavares,
Nobody has stepped up yet to become the new Organizer of Papers We Love @ Porto. There are only 48 hours left before the group is shut down and removed from Meetup.com forever.
As the new Organizer you get to:
Determine the future direction of the Meetup Group
Change the name, description, group colors, etc...
Schedule Meetups
Decide who can join the group
Make it whatever you and your members want!
</code></pre></div></div>
<p>Let’s dissect what I get to do as an Organizer besides, well, organize:</p>
<ul>
<li><em>Determine the future direction of the group</em>, which is the same as when I started - promote a slot where Porto’s tech companies and universities could chat about interesting computer science topics.</li>
<li><em>Change the name, description, and group colors</em> - not going to happen.</li>
<li><em>Schedule Meetups</em> - oh, yeah, the critical one.</li>
<li><em>Decide who can join the group</em> - there are no gatekeepers around here, the group is open to everyone!</li>
<li><em>Make it whatever you and your members want</em> - hmm, a place to share and talk about interesting computer science topics!</li>
</ul>
<p>None of these bullets actually convinced me to stay on this platform for another six months, mainly because they don’t cover the bad things that come with the full package - bad pricing and <em>RSVPs</em>.</p>
<h2 id="pricing-and-rsvps">Pricing and RSVPs</h2>
<p>The <em>coolest</em> thing about this platform is that we’re able to create a brand new community, make it discoverable around the locals, schedule some events, and get the full list of attendees.</p>
<p>The truth is that we’re not able to do all of this for free, nor without getting upset about the # of RSVPs vs the # of people that actually show up for the events - yeah, that’s the saddest thing about this platform: people who continuously RSVP and never show up (come on folks, what are you doing?).</p>
<p>I may sound a bit extreme, but without sponsorships (I don’t want them, seriously) and as a solo organizer, I’m really not interested in paying a fee for a slot on a website that enables people to interact via <em>false</em> RSVPs.</p>
<p>It would be cool if they provided free plans for communities that don’t rely on any kind of sponsorship and don’t use the platform to expose the brand behind it, but it seems that’s not an option, maybe because it would lead to a vast number of zombie meetups (the ones that are removed after the first 30 <em>free</em> days).</p>
<h2 id="some-stats-so-far">Some stats so far…</h2>
<p>The meetup is going great, not as regular as I wanted, though, but still great. So far I’ve hosted <strong>10</strong> successful events and <strong>15</strong> amazing talks. I had to step up as a speaker twice in order to keep the <em>momentum</em> going - I don’t mind, really, but it’s not supposed to be that way :) <em>Community-wise</em>, we’re <strong>194</strong> on twitter and <strong>438</strong> on meetup - great numbers for such a <em>specific</em> community.</p>
<h2 id="plans-for-the-future">Plans for the future</h2>
<p>I’ll rely solely on the <a href="https://twitter.com/pwlporto">@pwlporto</a> twitter account to announce future events and cool papers that we should all read. It might be a daunting task to reach everyone interested, because not everyone has a twitter account, but it’s free after all! Maybe adding a companion website together with a simple newsletter does the trick.</p>This should be titled “Why I didn’t renewed Papers We Love @ Porto subscription on Meetup.com” but you get the idea. Last August I didn’t pay for another half year subscription to have a placeholder for this scientific community; you might wonder why.Diving into Merkle Trees2019-02-20T00:00:00+00:002019-02-20T00:00:00+00:00https://ordep.dev/posts/diving-into-merkle-trees<blockquote>
<p>This is a transcript of my talk on Diving into Merkle Trees that I will give
at Lambda Days and ScaleConf Colombia. Slides and video should be up soon!</p>
</blockquote>
<p><img src="/assets/images/thesis.png" alt="thesis" /></p>
<p>Introduced in 1979 by Ralph C. Merkle in his thesis <em>Secrecy, Authentication,
and Public Key Systems</em>, the Merkle Tree, also known as a binary hash tree, is a
data structure used for efficiently <em>summarizing and verifying the integrity of
large sets of data</em>, enabling users to verify the authenticity of their received
responses.</p>
<blockquote>
<p>“The general idea in the new system is to use an
infinite tree of one-time signatures. […] Each node of the tree performs three
functions: (1) it authenticates the left sub-node (2) it authenticates the right
sub-node (3) it signs a single message.”</p>
</blockquote>
<p><img src="/assets/images/tree.png" alt="tree" /></p>
<p>Initially, it was used for the purpose of one-time signatures and authenticated
public key distribution, namely providing authenticated responses as to the
validity of a certificate.</p>
<p><img src="/assets/images/digital-signature.png" alt="digital-signature" /></p>
<p>Ralph C. Merkle described a new digital signature that was able to sign an
unlimited number of messages and the signature size would increase
logarithmically as a function of the number of messages signed.</p>
<p>At this point we can identify the two main purposes of a Merkle Tree: (1)
summarize large sets of data and (2) verify that a specific piece of data
belongs to a larger data set.</p>
<h2 id="one-way-hashing-functions">One-Way Hashing Functions</h2>
<p>Before diving into <em>one-time signatures</em>, let’s first get comfortable with
<em>one-way functions</em>. Usually, a one-way function is a mathematical algorithm,
such as MD5, SHA-3, or SHA-256, that takes an input and produces a unique, fixed-length output.</p>
<p><img src="/assets/images/one-way-hashing-functions.png" alt="one-way-hashing-functions" /></p>
<blockquote>
<p>A one-way function <em>F</em> is a function that is
easy to compute, but difficult to invert. Given <em>x</em> and <em>F</em>, it is easy to
compute <em>y=F(x)</em>, but given <em>y</em> and <em>F</em>, it is effectively impossible to compute
<em>x</em>.</p>
</blockquote>
<p>One-way hashing functions are especially useful within Merkle Trees for two
obvious reasons: <em>storage</em> and <em>privacy</em>.</p>
<p>With systems that contain massive amounts of data, the benefits of being
able to store and identify data with a fixed length output can create vast
storage savings and help to increase efficiency.</p>
<p>The person who computes <em>y=F(x)</em> is the only person who knows <em>x</em>. If <em>y</em> is
publicly revealed, only the originator of <em>y</em> knows <em>x</em>, and can choose to
reveal or conceal <em>x</em> at his whim!</p>
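<p>The <em>storage</em> argument is easy to see in practice: whatever the size of the input, the digest has a fixed length. A quick sketch in Elixir (SHA-256, as used later in this post):</p>

```elixir
# A one-byte input and a one-megabyte input both map to a 32-byte digest
small = :crypto.hash(:sha256, "x")
large = :crypto.hash(:sha256, String.duplicate("x", 1_000_000))

byte_size(small)
# => 32
byte_size(large)
# => 32
```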
<h2 id="one-time-signatures">One-time Signatures</h2>
<p>Also in 1979, Leslie Lamport published his concept of <em>One-time Signatures</em>.
Most signature schemes rely in part on <em>one-way functions</em>, typically hash
functions, for their security proofs. The beauty of Lamport’s scheme was that its
security relied only on these one-way functions!</p>
<blockquote>
<p>One time signatures are practical between a single pair
of users who are willing to exchange the large amount of data necessary but they
are not practical for most applications without further refinements.</p>
</blockquote>
<blockquote>
<p>If 1000 messages are to be signed before new public
authentication data is needed, over 20,000,000 bits or 2.5 megabytes must be
stored as public information.</p>
</blockquote>
<p>If <em>B</em> had to keep 2.5 megabytes of data for 1000 other users, <em>B</em> would have to
store 2.5 gigabytes of data. With further increases in the number of users, or
in the number of message each user wants to be able to sign, the system would
eventually become burdensome.</p>
<h2 id="improving-one-time-signatures">Improving One-time Signatures</h2>
<p>Merkle focused on how to eliminate the huge storage requirements in the Lamport
method and proposed an improved <em>One-time Signature</em> that reduced the size of
signed messages by almost a <em>factor of 2</em>.</p>
<p>This improved method was easy to implement and cut the size of the signed
message almost in half, although this was still too large for most applications;
instead of storing <code class="language-plaintext highlighter-rouge">2.5 gigabytes</code> of data, B only had to store <code class="language-plaintext highlighter-rouge">1.25 gigabytes</code>.</p>
<p><img src="/assets/images/tree-authentication.png" alt="tree-authentication" /></p>
<p>The method is called tree authentication because its computation forms a binary
tree of recursive calls. Using this method requires only <code class="language-plaintext highlighter-rouge">log2 n</code>
transmissions. A close look at the algorithm reveals that half of the
transmissions are redundant, since we’re able to compute a given parent node <code class="language-plaintext highlighter-rouge">A</code>
from its children <code class="language-plaintext highlighter-rouge">A1</code> and <code class="language-plaintext highlighter-rouge">A2</code>, so there’s really no need to send <code class="language-plaintext highlighter-rouge">A</code>.</p>
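<p>This is the idea behind what we now call a <em>Merkle proof</em>: to show that a leaf belongs to the tree, we only need to transmit the sibling hashes along its path to the root. A minimal sketch in Elixir (the <code class="language-plaintext highlighter-rouge">verify</code> function and its path format are illustrative, not from the thesis):</p>

```elixir
# Recompute the root from a leaf hash and its authentication path.
# Each path entry is {side_of_the_sibling, sibling_hash}.
hash = fn input -> :crypto.hash(:sha256, input) |> Base.encode16() end

verify = fn leaf_hash, path, root ->
  path
  |> Enum.reduce(leaf_hash, fn
    {:left, sibling}, acc -> hash.(sibling <> acc)
    {:right, sibling}, acc -> hash.(acc <> sibling)
  end)
  |> Kernel.==(root)
end

# Tiny tree over four blocks
[h1, h2, h3, h4] = Enum.map(["L1", "L2", "L3", "L4"], hash)
root = hash.(hash.(h1 <> h2) <> hash.(h3 <> h4))

# Proving membership of "L3" needs only two hashes, not the whole tree
verify.(h3, [{:right, h4}, {:left, hash.(h1 <> h2)}], root)
# => true
```

<p>For a tree of <code class="language-plaintext highlighter-rouge">n</code> leaves, the path holds only <code class="language-plaintext highlighter-rouge">log2 n</code> hashes.</p>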
<h2 id="how-to-compute-a-merkle-root">How to compute a Merkle Root?</h2>
<p>Given that we have a data file represented by a set of <em>blocks</em> <code class="language-plaintext highlighter-rouge">[L1, L2]</code>.</p>
<p><img src="/assets/images/building-blocks-01.png" alt="building-blocks-01" /></p>
<p>We start by applying a <em>one-way hashing function</em> to <code class="language-plaintext highlighter-rouge">L1</code>, <code class="language-plaintext highlighter-rouge">h(📄L1) = 9ec4</code>.</p>
<p><img src="/assets/images/building-blocks-02.png" alt="building-blocks-02" /></p>
<p>The next step is to apply the same function to <code class="language-plaintext highlighter-rouge">L2</code>, <code class="language-plaintext highlighter-rouge">h(📄L2) = 7e6a</code>.</p>
<p><img src="/assets/images/building-blocks-03.png" alt="building-blocks-03" /></p>
<p>To calculate the parent node, we <em>always</em> need to concatenate both child hashes
<code class="language-plaintext highlighter-rouge">h(📄L1)</code> and <code class="language-plaintext highlighter-rouge">h(📄L2)</code> before applying, once again, the <em>one-way hashing
function</em>: <code class="language-plaintext highlighter-rouge">h(h(📄L1) || h(📄L2)) = aea9</code>.</p>
<p><img src="/assets/images/building-blocks-04.png" alt="building-blocks-04" /></p>
<p>At this point we know the building blocks of a Merkle Tree; let’s represent it
in Elixir.</p>
<h2 id="building-a-merkle-tree">Building a Merkle-Tree</h2>
<p>In order to build a Merkle Tree, we need to define three new types: <code class="language-plaintext highlighter-rouge">Leaf</code>,
<code class="language-plaintext highlighter-rouge">Node</code>, and the <code class="language-plaintext highlighter-rouge">MerkleTree</code> itself. Let’s start by defining <code class="language-plaintext highlighter-rouge">Leaf</code> – it
should contain the <code class="language-plaintext highlighter-rouge">hash</code> and the <code class="language-plaintext highlighter-rouge">value</code> of a given data block.</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">defmodule</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span> <span class="k">do</span>
<span class="k">defstruct</span> <span class="p">[</span><span class="ss">:hash</span><span class="p">,</span> <span class="ss">:value</span><span class="p">]</span>
<span class="nv">@type</span> <span class="n">hash</span> <span class="p">::</span> <span class="no">String</span><span class="o">.</span><span class="n">t</span>
<span class="nv">@type</span> <span class="n">value</span> <span class="p">::</span> <span class="no">String</span><span class="o">.</span><span class="n">t</span>
<span class="nv">@type</span> <span class="n">t</span> <span class="p">::</span> <span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span><span class="p">{</span>
<span class="ss">hash:</span> <span class="n">hash</span><span class="p">,</span>
<span class="ss">value:</span> <span class="n">value</span>
<span class="p">}</span>
<span class="k">end</span>
</code></pre></div></div>
<p>The next type is <code class="language-plaintext highlighter-rouge">Node</code> – it should contain the <code class="language-plaintext highlighter-rouge">left</code> and <code class="language-plaintext highlighter-rouge">right</code> <em>child</em>
nodes, and the <code class="language-plaintext highlighter-rouge">hash</code> value of the concatenation of both child <em>hashes</em>.</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">defmodule</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Node</span> <span class="k">do</span>
<span class="k">defstruct</span> <span class="p">[</span><span class="ss">:hash</span><span class="p">,</span> <span class="ss">:left</span><span class="p">,</span> <span class="ss">:right</span><span class="p">]</span>
<span class="nv">@type</span> <span class="n">hash</span> <span class="p">::</span> <span class="no">String</span><span class="o">.</span><span class="n">t</span>
<span class="nv">@type</span> <span class="n">left</span> <span class="p">::</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Node</span><span class="o">.</span><span class="n">t</span> <span class="o">|</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span><span class="o">.</span><span class="n">t</span>
<span class="nv">@type</span> <span class="n">right</span> <span class="p">::</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Node</span><span class="o">.</span><span class="n">t</span> <span class="o">|</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span><span class="o">.</span><span class="n">t</span>
<span class="nv">@type</span> <span class="n">t</span> <span class="p">::</span> <span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Node</span><span class="p">{</span>
<span class="ss">hash:</span> <span class="n">hash</span><span class="p">,</span>
<span class="ss">left:</span> <span class="n">left</span><span class="p">,</span>
<span class="ss">right:</span> <span class="n">right</span>
<span class="p">}</span>
<span class="k">end</span>
</code></pre></div></div>
<p>And, finally, the Merkle Tree itself.</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">defmodule</span> <span class="no">MerkleTree</span> <span class="k">do</span>
<span class="k">defstruct</span> <span class="p">[</span><span class="ss">:root</span><span class="p">]</span>
<span class="nv">@type</span> <span class="n">root</span> <span class="p">::</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Node</span><span class="o">.</span><span class="n">t</span>
<span class="nv">@type</span> <span class="n">t</span> <span class="p">::</span> <span class="p">%</span><span class="no">MerkleTree</span><span class="p">{</span>
<span class="ss">root:</span> <span class="n">root</span>
<span class="p">}</span>
<span class="k">end</span>
</code></pre></div></div>
<h3 id="hashing-the-data-blocks">Hashing the data blocks</h3>
<p>The first step in building a Merkle Tree is to <em>hash</em> the data blocks and
convert them to <em>leaves</em>. In order to hash something, we need to define a new
module <code class="language-plaintext highlighter-rouge">Crypto</code> with a single function <code class="language-plaintext highlighter-rouge">hash</code> that accepts an input and is
responsible for encoding it with the appropriate <em>one-way hashing function</em>.</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">defmodule</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Crypto</span> <span class="k">do</span>
<span class="k">def</span> <span class="n">hash</span><span class="p">(</span><span class="n">input</span><span class="p">,</span> <span class="n">type</span> <span class="p">\\</span> <span class="ss">:sha256</span><span class="p">)</span> <span class="k">do</span>
<span class="n">type</span>
<span class="o">|></span> <span class="ss">:crypto</span><span class="o">.</span><span class="n">hash</span><span class="p">(</span><span class="s2">"</span><span class="si">#{</span><span class="n">input</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="o">|></span> <span class="no">Base</span><span class="o">.</span><span class="n">encode16</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Having <code class="language-plaintext highlighter-rouge">blocks = ["L1", "L2", "L3", "L4"]</code>, the expected output would be:</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span>
<span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span><span class="p">{</span>
<span class="ss">value:</span> <span class="s2">"L1"</span><span class="p">,</span>
<span class="ss">hash:</span> <span class="s2">"DFFE8596427FC50E8F64654A609AF134D45552F18BBECEF90B31135A9E7ACAA0"</span>
<span class="p">},</span>
<span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span><span class="p">{</span>
<span class="ss">value:</span> <span class="s2">"L2"</span><span class="p">,</span>
<span class="ss">hash:</span> <span class="s2">"D76354D8457898445BB69E0DC0DC95FB74CC3CF334F8C1859162A16AD0041F8D"</span>
<span class="p">},</span>
<span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span><span class="p">{</span>
<span class="ss">value:</span> <span class="s2">"L3"</span><span class="p">,</span>
<span class="ss">hash:</span> <span class="s2">"842983DE8FB1D277A3FAD5C8295C7A14317C458718A10C5A35B23E7F992A5C80"</span>
<span class="p">},</span>
<span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span><span class="p">{</span>
<span class="ss">value:</span> <span class="s2">"L4"</span><span class="p">,</span>
<span class="ss">hash:</span> <span class="s2">"4A5A97C6433C4C062457E9335709D57493E75527809D8A9586C141E591AC9F2C"</span>
<span class="p">}</span>
<span class="p">]</span>
</code></pre></div></div>
<p>By defining a function <code class="language-plaintext highlighter-rouge">new</code> that accepts <code class="language-plaintext highlighter-rouge">blocks</code>, we should be able to hash
the data blocks and convert them into <code class="language-plaintext highlighter-rouge">Leafs</code>.</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">defmodule</span> <span class="no">MerkleTree</span> <span class="k">do</span>
<span class="n">alias</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span>
<span class="n">alias</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Crypto</span>
<span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">blocks</span><span class="p">)</span> <span class="k">do</span>
<span class="n">blocks</span>
<span class="o">|></span> <span class="no">Enum</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="o">&</span><span class="no">Leaf</span><span class="o">.</span><span class="n">build</span><span class="p">(</span><span class="nv">&1</span><span class="p">,</span> <span class="no">Crypto</span><span class="o">.</span><span class="n">hash</span><span class="p">(</span><span class="nv">&1</span><span class="p">)))</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Where <code class="language-plaintext highlighter-rouge">Leaf.build/2</code> is just:</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">defmodule</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span> <span class="k">do</span>
<span class="k">defstruct</span> <span class="p">[</span><span class="ss">:value</span><span class="p">,</span> <span class="ss">:hash</span><span class="p">]</span>
<span class="k">def</span> <span class="n">build</span><span class="p">(</span><span class="n">value</span><span class="p">,</span> <span class="n">hash</span><span class="p">)</span> <span class="k">do</span>
<span class="p">%</span><span class="no">__MODULE__</span><span class="p">{</span><span class="ss">value:</span> <span class="n">value</span><span class="p">,</span> <span class="ss">hash:</span> <span class="n">hash</span><span class="p">}</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Calling the function above <code class="language-plaintext highlighter-rouge">MerkleTree.new ["L1", "L2", "L3", "L4"]</code> should
yield the expected output. However, we’re not done yet.</p>
<h3 id="hashing-the-nodes">Hashing the nodes</h3>
<p>Remember that to create a <code class="language-plaintext highlighter-rouge">Node</code> we need to join both child <em>hashes</em> and
apply the <em>one-way hashing function</em> once again?</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">defmodule</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Node</span> <span class="k">do</span>
<span class="n">alias</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Crypto</span>
<span class="k">defstruct</span> <span class="p">[</span><span class="ss">:hash</span><span class="p">,</span> <span class="ss">:left</span><span class="p">,</span> <span class="ss">:right</span><span class="p">]</span>
<span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">nodes</span><span class="p">)</span> <span class="k">do</span>
<span class="n">nodes</span>
<span class="o">|></span> <span class="no">Enum</span><span class="o">.</span><span class="n">map_join</span><span class="p">(</span><span class="o">&</span><span class="p">(</span><span class="nv">&1</span><span class="o">.</span><span class="n">hash</span><span class="p">))</span>
<span class="o">|></span> <span class="no">Crypto</span><span class="o">.</span><span class="n">hash</span>
<span class="o">|></span> <span class="n">build</span><span class="p">(</span><span class="n">nodes</span><span class="p">)</span>
<span class="k">end</span>
<span class="k">def</span> <span class="n">build</span><span class="p">(</span><span class="n">hash</span><span class="p">,</span> <span class="p">[</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">])</span> <span class="k">do</span>
<span class="p">%</span><span class="no">__MODULE__</span><span class="p">{</span><span class="ss">hash:</span> <span class="n">hash</span><span class="p">,</span> <span class="ss">left:</span> <span class="n">left</span><span class="p">,</span> <span class="ss">right:</span> <span class="n">right</span><span class="p">}</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>That’s basically what <code class="language-plaintext highlighter-rouge">new</code> is doing before calling <code class="language-plaintext highlighter-rouge">build(nodes)</code>. Once we
have the <code class="language-plaintext highlighter-rouge">Node</code> <em>hash</em> value, we’re ready to create a new <code class="language-plaintext highlighter-rouge">Node</code> with <code class="language-plaintext highlighter-rouge">hash</code>,
<code class="language-plaintext highlighter-rouge">left</code>, and <code class="language-plaintext highlighter-rouge">right</code>. As an example, by calling the function above with these two
<em>leaves</em>:</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Node</span><span class="o">.</span><span class="n">new</span><span class="p">([</span>
<span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span><span class="p">{</span>
<span class="ss">value:</span> <span class="s2">"L1"</span><span class="p">,</span>
<span class="ss">hash:</span> <span class="s2">"DFFE8596427FC50E8F64654A609AF134D45552F18BBECEF90B31135A9E7ACAA0"</span>
<span class="p">},</span>
<span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span><span class="p">{</span>
<span class="ss">value:</span> <span class="s2">"L2"</span><span class="p">,</span>
<span class="ss">hash:</span> <span class="s2">"D76354D8457898445BB69E0DC0DC95FB74CC3CF334F8C1859162A16AD0041F8D"</span>
<span class="p">}</span>
<span class="p">])</span>
</code></pre></div></div>
<p>Would yield the following <code class="language-plaintext highlighter-rouge">Node</code>:</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Node</span><span class="p">{</span>
<span class="ss">hash:</span> <span class="s2">"8C569660D98A20D59DE10E134D81A8CE55D48DD71E21B8919F4AD5A9097A98C8"</span><span class="p">,</span>
<span class="ss">left:</span> <span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span><span class="p">{</span>
<span class="ss">value:</span> <span class="s2">"L1"</span><span class="p">,</span>
<span class="ss">hash:</span> <span class="s2">"DFFE8596427FC50E8F64654A609AF134D45552F18BBECEF90B31135A9E7ACAA0"</span>
<span class="p">},</span>
<span class="ss">right:</span> <span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span><span class="p">{</span>
<span class="ss">value:</span> <span class="s2">"L2"</span><span class="p">,</span>
<span class="ss">hash:</span> <span class="s2">"D76354D8457898445BB69E0DC0DC95FB74CC3CF334F8C1859162A16AD0041F8D"</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="all-the-way-up">All the way up</h3>
<p>Having a way to build a <code class="language-plaintext highlighter-rouge">Node</code> from a pair of <em>nodes</em>, we can now make use of
recursion to calculate the remaining nodes up to the <em>Merkle root</em>. Let’s
complete the <code class="language-plaintext highlighter-rouge">new</code> function:</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">defmodule</span> <span class="no">MerkleTree</span> <span class="k">do</span>
<span class="n">alias</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span>
<span class="n">alias</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Crypto</span>
<span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">blocks</span><span class="p">)</span> <span class="k">do</span>
<span class="n">blocks</span>
<span class="o">|></span> <span class="no">Enum</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="o">&</span><span class="no">Leaf</span><span class="o">.</span><span class="n">build</span><span class="p">(</span><span class="nv">&1</span><span class="p">,</span> <span class="no">Crypto</span><span class="o">.</span><span class="n">hash</span><span class="p">(</span><span class="nv">&1</span><span class="p">)))</span>
<span class="o">|></span> <span class="n">build</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Currently, this <code class="language-plaintext highlighter-rouge">new</code> function yields a list of <em>leaves</em>. Let us define a
<code class="language-plaintext highlighter-rouge">build</code> function that accepts that list of leaf <em>nodes</em> and groups it into
pairs, building each parent <code class="language-plaintext highlighter-rouge">Node</code> by concatenating the <code class="language-plaintext highlighter-rouge">left</code> and
<code class="language-plaintext highlighter-rouge">right</code> child <em>hashes</em>.</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">defp</span> <span class="n">build</span><span class="p">(</span><span class="n">nodes</span><span class="p">)</span> <span class="k">do</span>
<span class="n">nodes</span>
<span class="o">|></span> <span class="no">Enum</span><span class="o">.</span><span class="n">chunk_every</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="o">|></span> <span class="no">Enum</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="o">&</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Node</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="nv">&1</span><span class="p">))</span>
<span class="o">|></span> <span class="n">build</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Note that we’re making use of <em>tail recursion</em> to build our Merkle Tree from the
ground up to the <em>root</em>. Still, we need a base case to stop the recursion.</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">defp</span> <span class="n">build</span><span class="p">([</span><span class="n">root</span><span class="p">])</span> <span class="k">do</span>
<span class="p">%</span><span class="no">MerkleTree</span><span class="p">{</span><span class="ss">root:</span> <span class="n">root</span><span class="o">.</span><span class="n">hash</span><span class="p">,</span> <span class="ss">tree:</span> <span class="n">root</span><span class="p">}</span>
<span class="k">end</span>
<span class="k">defp</span> <span class="n">build</span><span class="p">(</span><span class="n">nodes</span><span class="p">)</span> <span class="k">do</span>
<span class="n">nodes</span>
<span class="o">|></span> <span class="no">Enum</span><span class="o">.</span><span class="n">chunk_every</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="o">|></span> <span class="no">Enum</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="o">&</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Node</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="nv">&1</span><span class="p">))</span>
<span class="o">|></span> <span class="n">build</span>
<span class="k">end</span>
</code></pre></div></div>
<p>By pattern matching a single element <code class="language-plaintext highlighter-rouge">root</code> in the list of nodes, we’re now able
to stop the processing and return a <code class="language-plaintext highlighter-rouge">MerkleTree</code>.</p>
<p>The final <code class="language-plaintext highlighter-rouge">new</code> function would look like this:</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">defmodule</span> <span class="no">MerkleTree</span> <span class="k">do</span>
<span class="n">alias</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span>
<span class="n">alias</span> <span class="no">MerkleTree</span><span class="o">.</span><span class="no">Crypto</span>
<span class="k">defstruct</span> <span class="p">[</span><span class="ss">:root</span><span class="p">,</span> <span class="ss">:tree</span><span class="p">]</span>
<span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">blocks</span><span class="p">)</span> <span class="k">do</span>
<span class="n">blocks</span>
<span class="o">|></span> <span class="no">Enum</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="o">&</span><span class="no">Leaf</span><span class="o">.</span><span class="n">build</span><span class="p">(</span><span class="nv">&1</span><span class="p">,</span> <span class="no">Crypto</span><span class="o">.</span><span class="n">hash</span><span class="p">(</span><span class="nv">&1</span><span class="p">)))</span>
<span class="o">|></span> <span class="n">build</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Finally, calling <code class="language-plaintext highlighter-rouge">MerkleTree.new ["L1", "L2", "L3", "L4"]</code> would
yield the following result:</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">%</span><span class="no">MerkleTree</span><span class="p">{</span>
<span class="ss">root:</span> <span class="s2">"B8EC6582F5B8ED1CDE7712275C02C8E4CC0A2AC569A23F6B7738E6B69BF132E6"</span><span class="p">,</span>
<span class="ss">tree:</span> <span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Node</span><span class="p">{</span>
<span class="ss">hash:</span> <span class="s2">"B8EC6582F5B8ED1CDE7712275C02C8E4CC0A2AC569A23F6B7738E6B69BF132E6"</span><span class="p">,</span>
<span class="ss">left:</span> <span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Node</span><span class="p">{</span>
<span class="ss">hash:</span> <span class="s2">"8C569660D98A20D59DE10E134D81A8CE55D48DD71E21B8919F4AD5A9097A98C8"</span><span class="p">,</span>
<span class="ss">left:</span> <span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span><span class="p">{</span>
<span class="ss">value:</span> <span class="s2">"L1"</span><span class="p">,</span>
<span class="ss">hash:</span> <span class="s2">"DFFE8596427FC50E8F64654A609AF134D45552F18BBECEF90B31135A9E7ACAA0"</span>
<span class="p">},</span>
<span class="ss">right:</span> <span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span><span class="p">{</span>
<span class="ss">value:</span> <span class="s2">"L2"</span><span class="p">,</span>
<span class="ss">hash:</span> <span class="s2">"D76354D8457898445BB69E0DC0DC95FB74CC3CF334F8C1859162A16AD0041F8D"</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="ss">right:</span> <span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Node</span><span class="p">{</span>
<span class="ss">hash:</span> <span class="s2">"29C5146A0AABBC4444D91087D91D2637D8EB4620A686CF6179CCD7A0BFB9B8EF"</span><span class="p">,</span>
<span class="ss">left:</span> <span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span><span class="p">{</span>
<span class="ss">value:</span> <span class="s2">"L3"</span><span class="p">,</span>
<span class="ss">hash:</span> <span class="s2">"842983DE8FB1D277A3FAD5C8295C7A14317C458718A10C5A35B23E7F992A5C80"</span>
<span class="p">},</span>
<span class="ss">right:</span> <span class="p">%</span><span class="no">MerkleTree</span><span class="o">.</span><span class="no">Leaf</span><span class="p">{</span>
<span class="ss">value:</span> <span class="s2">"L4"</span><span class="p">,</span>
<span class="ss">hash:</span> <span class="s2">"4A5A97C6433C4C062457E9335709D57493E75527809D8A9586C141E591AC9F2C"</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="audit-proof">Audit Proof</h2>
<p>As we already know, we can use a Merkle Tree to verify that a specific piece of
data belongs to a larger data set. The Merkle <em>audit proof</em> is the missing nodes
required to compute all of the nodes between the data block and the Merkle root.
If a Merkle <em>audit proof</em> fails to produce a <em>root hash</em> that matches the
original Merkle <em>root hash</em>, it means that our data block is not present in the
tree.</p>
<p><img src="/assets/images/audit-proof.png" alt="audit-proof" /></p>
<p>In this example, we need to provide a <em>proof</em> that the data block <code class="language-plaintext highlighter-rouge">L1</code> exists in
the tree. Since we already know the <em>hash value</em> of <code class="language-plaintext highlighter-rouge">L1</code>, we’ll need the hash
value of <code class="language-plaintext highlighter-rouge">L2</code> (<code class="language-plaintext highlighter-rouge">H2</code>) in order to compute <code class="language-plaintext highlighter-rouge">P1</code>. Now that we are able to compute <code class="language-plaintext highlighter-rouge">P1</code>, we
only need <code class="language-plaintext highlighter-rouge">P2</code> to compute <code class="language-plaintext highlighter-rouge">R</code>. In this specific case, the Merkle <em>audit
proof</em> is a list of nodes <code class="language-plaintext highlighter-rouge">[H2, P2]</code>.</p>
<blockquote>
<p>The use of tree authentication is now fairly clear. A given user
A transmits R to another user B. A then transmits the authentication path for
Yi. B knows R, the root of the authentication tree, by prior arrangement. B can
then authenticate Yi, and can accept any Ynth from A as genuine.</p>
</blockquote>
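<p>To make the audit check concrete, here is a minimal sketch of proof verification. It assumes the same scheme used throughout this post (SHA-256 over the concatenated child hashes, uppercase hex); the <code class="language-plaintext highlighter-rouge">MerkleAudit</code> module and the <code class="language-plaintext highlighter-rouge">{:left, hash}</code>/<code class="language-plaintext highlighter-rouge">{:right, hash}</code> path encoding are hypothetical, not part of the implementation above:</p>

```elixir
defmodule MerkleAudit do
  # Hypothetical helper, not part of the MerkleTree module above.
  # Same scheme as the post: SHA-256, uppercase hex, and
  # parent hash = hash(left_hash <> right_hash).
  def hash(data), do: :crypto.hash(:sha256, data) |> Base.encode16()

  # `path` lists the sibling hashes from the leaf up to the root,
  # each tagged with the side the sibling sits on.
  def verify(leaf_hash, path, root_hash) do
    computed =
      Enum.reduce(path, leaf_hash, fn
        {:right, sibling}, acc -> hash(acc <> sibling)
        {:left, sibling}, acc -> hash(sibling <> acc)
      end)

    computed == root_hash
  end
end
```

<p>For the example above, proving <code class="language-plaintext highlighter-rouge">L1</code> means folding over <code class="language-plaintext highlighter-rouge">[H2, P2]</code>: the first step recomputes <code class="language-plaintext highlighter-rouge">P1</code>, the second recomputes <code class="language-plaintext highlighter-rouge">R</code>, and the result is compared against the known <em>root hash</em>.</p>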
<h2 id="how-they-are-useful">How are they useful?</h2>
<p>Merkle trees are especially useful in distributed, peer-to-peer systems where
the same data should exist in multiple places. Using Merkle Trees, we can
detect inconsistencies between replicas, reduce the amount of data transferred
in peer-to-peer file sharing, and maintain several versions of the same
tree, also called <em>persistent</em> data structures.</p>
<h3 id="detect-inconsistencies">Detect inconsistencies</h3>
<p>Having a data file represented by a Merkle Tree, we’re able to <strong>detect
inconsistencies between replicas of that same tree</strong>. Take, for example, three
replicas of the same Merkle Tree: just by comparing the root nodes, we can
tell that those trees are not the same, meaning there are inconsistencies
between them.</p>
<p><img src="/assets/images/replicas-00.png" alt="replicas-00" /></p>
<p>By using an <em>Anti-entropy mechanism</em>, we’re able to notice that both trees have
inconsistent data and that triggers a process that copies <em>only</em> the data needed
to repair the inconsistent tree.</p>
<p><img src="/assets/images/replicas-01.png" alt="replicas-01" /></p>
<p>To compare the state of two nodes, they exchange the corresponding Merkle Trees
by levels, only descending further down the tree if the corresponding hashes are
different. If two corresponding leaf nodes have different hashes, then there are
objects which must be repaired.</p>
<p><img src="/assets/images/replicas-02.png" alt="replicas-02" /></p>
<p>This is actually used by Dynamo, Riak, and Cassandra to repair bad replicas!</p>
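<p>A minimal sketch of this level-by-level comparison, assuming leaves and inner nodes shaped like the structs built earlier in this post (the <code class="language-plaintext highlighter-rouge">AntiEntropy</code> module is hypothetical, and both replicas are assumed to have the same shape):</p>

```elixir
defmodule AntiEntropy do
  # Hypothetical sketch: leaves are %{value: v, hash: h} and inner nodes
  # are %{hash: h, left: l, right: r}, mirroring the structs built above.

  # Equal hashes: the whole subtree is identical, so prune it.
  def diff(%{hash: h}, %{hash: h}), do: []

  # Two leaves with different hashes: this value needs to be repaired.
  def diff(%{value: _, hash: _}, %{value: value, hash: _}), do: [value]

  # Inner nodes with different hashes: descend both sides.
  def diff(a, b), do: diff(a.left, b.left) ++ diff(a.right, b.right)
end
```

<p>Only subtrees whose hashes differ are visited, so two mostly-identical replicas exchange very little data before finding the objects to repair.</p>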
<h3 id="peer-to-peer-file-sharing">Peer-to-peer file sharing</h3>
<p>The principal advantage of Merkle Trees is that each branch of the tree can be
checked independently without requiring nodes to download the entire data set.
This makes <em>peer-to-peer file sharing</em> another good use for Merkle Trees, where
we start by fetching the root of the tree from a <em>trusted</em> source to access a
given file.</p>
<p><img src="/assets/images/peer-to-peer-01.png" alt="peer-to-peer-01" /></p>
<p>Since we can fetch single parts of a tree, <strong>reducing the amount of transferred
data</strong>, we then fetch chunks of data from untrusted sources.</p>
<p><img src="/assets/images/peer-to-peer-02.png" alt="peer-to-peer-02" /></p>
<p>We start by fetching <code class="language-plaintext highlighter-rouge">L3</code> and deriving its <em>hash</em>, <code class="language-plaintext highlighter-rouge">b2d0</code>. To allow us to get to
the root, we must fetch the <em>hash</em> value from the right leaf, <code class="language-plaintext highlighter-rouge">8f14</code>. With these
two nodes, we can derive the next <em>hash</em> value, <code class="language-plaintext highlighter-rouge">165f</code>. By fetching the last
<em>hash</em>, <code class="language-plaintext highlighter-rouge">e831</code>, we can use it, alongside <code class="language-plaintext highlighter-rouge">165f</code>, to derive the <em>root hash</em>,
which is indeed <code class="language-plaintext highlighter-rouge">9cee</code>.</p>
<p><img src="/assets/images/peer-to-peer-03.png" alt="peer-to-peer-03" /></p>
<p>We have been building a <em>partial tree</em> from just the <em>root hash</em> and a given data
block. If the root computed from the audit path matches the true root, then the
audit path is <em>proof</em> that the data block exists in the tree.</p>
<h3 id="copy-on-write">Copy-On-Write</h3>
<p>Copy-on-write data structures are also called persistent data structures, since
the old version is preserved. The idea is to share the unchanged parts of the tree
between the copy and the original, instead of taking a full copy.</p>
<p><img src="/assets/images/copy-on-write-01.png" alt="copy-on-write-01" /></p>
<p>When a single data block <code class="language-plaintext highlighter-rouge">L4</code> of a given tree is updated, the
branch that links to it must recompute its <em>hashes</em> all the way up to the
<em>root</em>, while the other branches stay intact.</p>
<p><img src="/assets/images/copy-on-write-02.png" alt="copy-on-write-02" />
<img src="/assets/images/copy-on-write-03.png" alt="copy-on-write-03" /></p>
<p>If we take a closer look, we can see that the new version of the data structure
adds only one new data block and three new hashes. All the other data blocks,
potentially gigabytes of data, are shared between both versions!</p>
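<p>The update above can be sketched as <em>path copying</em>. The <code class="language-plaintext highlighter-rouge">Cow</code> module below is hypothetical: it uses plain maps with the same SHA-256 scheme as before, rebuilds only the nodes along the updated path, and shares every other subtree between versions:</p>

```elixir
defmodule Cow do
  # Hypothetical sketch of a copy-on-write update over a Merkle Tree
  # built from plain maps, using the same hashing scheme as the post.
  def hash(data), do: :crypto.hash(:sha256, data) |> Base.encode16()

  def leaf(value), do: %{value: value, hash: hash(value)}
  def node(l, r), do: %{hash: hash(l.hash <> r.hash), left: l, right: r}

  # Follow the path of :left | :right steps down to the leaf, replace it,
  # and rebuild (re-hash) only the nodes on the way back up.
  def update(%{value: _}, [], new_value), do: leaf(new_value)

  def update(%{left: l, right: r}, [:left | rest], v),
    do: node(update(l, rest, v), r)

  def update(%{left: l, right: r}, [:right | rest], v),
    do: node(l, update(r, rest, v))
end
```

<p>After updating <code class="language-plaintext highlighter-rouge">L4</code>, the old and new versions share the entire left subtree; only a new leaf, a new inner node, and a new root are created, i.e., exactly three new hashes.</p>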
<h2 id="wrapping-up">Wrapping up</h2>
<p><img src="/assets/images/wrapping-up.png" alt="wrapping-up" /></p>
<p>Merkle trees are <em>just</em> binary trees of <em>cryptographic hashes</em>,
where <em>leaves</em> contain hashes of data blocks and <em>nodes</em>
contain hashes of their children. They also produce a <em>Merkle Root</em> that summarizes
the entire data set; it is publicly distributed across the network and can easily
prove that a given data block exists in the tree.</p>
<p>You can find them in Cassandra, IPFS, Riak, Ethereum, Bitcoin, Open ZFS, and
much more. There are also plenty of papers to read if you want to dive
even deeper.</p>
<p>Have fun!</p>
<h2 id="references">References</h2>
<ul>
<li><a href="http://www.merkle.com/papers/Thesis1979.pdf">Secrecy, Authentications, and Public Key Systems</a></li>
<li><a href="http://www.merkle.com/papers/Certified1979.pdf">A Certified Digital Signature</a></li>
<li><a href="http://people.eecs.berkeley.edu/~raluca/cs261-f15/readings/merkle.pdf">A Digital Signature Based on a Conventional Encryption Function</a></li>
<li><a href="http://people.eecs.berkeley.edu/~raluca/cs261-f15/readings/merkleodb.pdf">Providing Authentication and Integrity in Outsourced Databases using Merkle
Hash Tree’s</a></li>
<li><a href="https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesManualRepair.html">Manual repair: Anti-entropy repair</a></li>
<li><a href="http://docs.basho.com/riak/kv/2.2.3/learn/concepts/active-anti-entropy/">Active Anti-Entropy</a></li>
</ul>
<p>This is a transcript of my talk on Diving into Merkle Trees that I will give at Lambda Days and ScaleConf Colombia. Slides and video should be up soon!</p>
Half a year of Papers We Love @ Porto (2018-07-02) https://ordep.dev/posts/papers-we-love
<h2 id="how-it-all-started">How it all started</h2>
<p>Back to September 2017. Porto’s tech scene was booming, still, it was lacking
tech events backed by theoretical computer science.</p>
<p>At that time, I was
following <a href="https://twitter.com/papers_we_love">@papers_we_love</a> and started slowly
moving from reading toxic blog posts and threads on
<a href="https://www.reddit.com/r/programming/">/r/programming</a> to reading computer science
papers and watching <a href="https://www.youtube.com/user/PapersWeLove">PWL talks on Youtube</a>.
It was hard at first: papers are hard to read and digest. But after a while, I was
able to read papers, extract some cool content, and explain it to others.</p>
<p>After an amazing edition of <a href="https://pixels.camp">Pixels Camp</a>, I was waiting for
a connection flight from Lisbon to Porto and started talking about Papers We Love.
I was looking at the list of chapters and thought out loud: “We should start
something like this in Porto”.</p>
<p>After a few months, Papers We Love @ Porto was a real thing… :)</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Hello <a href="https://twitter.com/hashtag/Porto?src=hash&ref_src=twsrc%5Etfw">#Porto</a>! We are the latest chapter of <a href="https://twitter.com/papers_we_love?ref_src=twsrc%5Etfw">@papers_we_love</a>. We're interested in reading and sharing ideas from computer science papers. Joins us!</p>— Papers We Love @ Porto (@pwlporto) <a href="https://twitter.com/pwlporto/status/953775150540890113?ref_src=twsrc%5Etfw">January 17, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>During the past six months, I was able to meet very kind and interesting
people. The first one was <a href="https://twitter.com/@xmal">Carlos Baquero</a>, a
Distributed Systems Professor from Minho University and Co-creator of CRDTs.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Good move! I and others had been entertaining the idea of doing a PWL Braga/Porto, but maybe now it’s better just to join efforts. We should get in touch.</p>— Carlos Baquero (@xmal) <a href="https://twitter.com/xmal/status/953920048678277120?ref_src=twsrc%5Etfw">January 18, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Carlos introduced me to <a href="https://twitter.com/old_sound">Alvaro Videla</a> and
after a few emails, we managed to schedule the very first session of <a href="https://twitter.com/@pwlporto">@pwlporto</a>. We had dinner the night before and talked about the gap
between industry and academia and how this group could help with that. It was really
nice to meet and chat with Carlos and Alvaro!</p>
<blockquote class="twitter-tweet" data-cards="hidden" data-lang="en"><p lang="en" dir="ltr">Great news! Alvaro Videla (<a href="https://twitter.com/old_sound?ref_src=twsrc%5Etfw">@old_sound</a>) will be presenting Harmful GOTOs, Premature Optimizations and Programming Myths are The Root of All Evil on February 22nd at the very first <a href="https://twitter.com/papers_we_love?ref_src=twsrc%5Etfw">@papers_we_love</a> <a href="https://twitter.com/hashtag/Porto?src=hash&ref_src=twsrc%5Etfw">#Porto</a> - we're excited, you should be too!<br /><br />RSVP: <a href="https://t.co/IsqTLNRJUI">https://t.co/IsqTLNRJUI</a></p>— Papers We Love @ Porto (@pwlporto) <a href="https://twitter.com/pwlporto/status/964938414528266240?ref_src=twsrc%5Etfw">February 17, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Alvaro introduced some of the myths in our industry with the objective of promoting a culture of reading more books, academic papers and related material
that refers to the history of our field and programming in general. It was the
best start that we could have, thanks to <a href="https://twitter.com/old_sound">Alvaro Videla</a> and <a href="https://twitter.com/@xmal">Carlos Baquero</a>!</p>
<p>We had 15 attendees at the very first edition of Papers We Love. I remember that we
were making bets on how many people would attend: with only one person sharing and
spreading the word, and most tech people interested only in hot topics (papers, what?),
our numbers ranged from 5 to 15. I was the optimist, and we had very nice folks attending!</p>
<h2 id="why-do-we-need-this-kind-of-events-in-the-tech-community">Why do we need these kinds of events in the tech community?</h2>
<p>Nowadays, developers only care about trends: languages, frameworks, databases,
and methodologies. That’s the sad thing about the current state of our tech
industry. We need to encourage developers to read good literature, instead of
random blog posts that eventually will lead to another internet flame war.</p>
<p>My goal for this chapter of Papers We Love is to have more and more people
reading computer science books/papers and giving talks about it. You don’t
really need to be an expert to give a talk. If you like a specific computer
science topic or even a small algorithm, step up and give a talk!</p>
<h2 id="topics-presented-so-far">Topics presented so far…</h2>
<p>Five sessions and seven talks, that’s our record so far! I truly want to
express my gratitude to these folks for stepping up and sharing their
knowledge. This wouldn’t be possible without them!</p>
<blockquote>
<p>All Aboard the Natural Language Processing Train</p>
</blockquote>
<p>From the depths of intuition to practice, <a href="https://www.linkedin.com/in/josemarcelino1">José Marcelino</a> guided us through the
field of Natural Language Processing. Sequence labeling tasks, such as
Part-of-Speech Tagging or Named Entity Recognition, represent a crucial
step towards understanding language.</p>
<blockquote>
<p>Performance testing of open-source HTTP web frameworks</p>
</blockquote>
<p><a href="https://www.linkedin.com/in/michaelapdomingues">Michael Domingues</a> presented
his study on performance testing against three web frameworks written in Go.
The results have shown that Gin contributed to the fastest response times for a set
of requests that vary on processing and retrieved data complexity.</p>
<blockquote>
<p>Causality is simple</p>
</blockquote>
<p><a href="https://twitter.com/@xmal">Carlos Baquero</a> brought back the intuition on
causality and showed that keeping some simple concepts in mind allows
us to understand how version vectors and vector clocks work, where they
differ, and how to use more sophisticated mechanisms to handle millions of
concurrent clients in modern distributed data stores.</p>
<blockquote>
<p>Things you should know about Database Storage and Retrieval</p>
</blockquote>
<p>It was a hard month. Everyone I’d invited was busy, so I had to step up
and give a talk :) We discussed and examined some core data structures, such as Hash Indexes, SSTables, LSM-Trees, and B-Trees, that are used in traditional relational and NoSQL databases.</p>
<blockquote>
<p>Knee Deep Into P2P</p>
</blockquote>
<p><a href="https://twitter.com/fribmendes">Fernando Mendes</a> introduced some simple P2P
topologies, such as gossip and trees, and then moved to more complex ones,
such as Gnutella2, HyParView, and Plumtrees, analyzing their problems. He
also took a look at what CRDTs are and how awesome they can be for shared data.</p>
<blockquote>
<p>Ethereum: A secure decentralised generalised transaction ledger</p>
</blockquote>
<p><a href="https://www.linkedin.com/in/hugopeixoto">Hugo Peixoto</a> presented the
Ethereum original paper, by introducing the core concepts of blockchain
and how it relates to Bitcoin.</p>
<blockquote>
<p>Visualising graphs with millions of edges using edge bundling</p>
</blockquote>
<p><a href="https://www.linkedin.com/in/dmoura">Daniel Moura</a> explained the
rationale of graph bundling based on kernel density estimation, proposed
simplifications that make the algorithm more efficient and easier to
implement, and showed how the base algorithm can be extended to handle
different clustering criteria and to produce videos of evolving graphs.</p>
<h2 id="a-very-big-thanks-to-these-companies">A very big thanks to these companies!</h2>
<p>People like hot topics. Companies too. So far, our meetups were hosted
by Farfetch, XING, Subvisual, and Veniam. Kudos to them! I’m running this
meetup without any sort of sponsor and it’s great to have support from local
companies to help this group to grow!</p>
<h2 id="looking-forward">Looking forward…</h2>
<p>As I’m writing this post, we are 171 members on
<a href="https://www.meetup.com/Papers-We-Love-Porto">meetup.com</a> spread between
Porto and Braga! I really want to keep these two cities as close as possible.</p>
<p>We’re growing, slowly, but growing!</p>How it all startedWhat you should know about database storage and retrieval.2018-05-07T00:00:00+00:002018-05-07T00:00:00+00:00https://ordep.dev/posts/what-you-should-know-about-database-storage-and-retrieval<p>This post is a transcript of the talk I gave at <a href="https://www.meetup.com/Papers-We-Love-Porto/events/248728411/">Papers We Love @ Porto</a>.</p>
<h2 id="log-structured-file">Log-Structured File</h2>
<p>In 1991, Mendel Rosenblum and John K. Ousterhout introduced a new technique
for disk storage management called <em>log-structured file system</em>.</p>
<blockquote>
<p>A log-structured file system writes all modifications to disk sequentially
in a log-like structure, thereby speeding up both file writing and crash
recovery.</p>
</blockquote>
<p><img src="/assets/images/log-structured-file.png" alt="log-structured-file" /></p>
<p>The idea of the log-structured file system is to collect large amounts of
new data in a file cache in main memory, then write the data to disk in a
single large I/O.</p>
<h2 id="how-do-we-avoid-running-out-of-space">How do we avoid running out of space?</h2>
<p><img src="/assets/images/log-structured-file-compactation.png" alt="log-structured-file-compactation" /></p>
<p>The log-structured system’s solution is to break the log into several immutable
segments by closing a segment file when it reaches a certain size
and making subsequent writes to a new segment file.</p>
<p><img src="/assets/images/log-structured-file-merging-and-compactation.png" alt="log-structured-file-merging-and-compactation" /></p>
<p>Since compaction makes segments much smaller, we can also merge several
segments together at the same time as performing the compaction. This process
can be done in a background thread while we continue to serve read and write
requests from the old segment files.</p>
<h2 id="why-using-an-append-only-log">Why use an append-only log?</h2>
<p>The immutable append-only design turns out to be good for several reasons!</p>
<ol>
<li>Sequential write operations are much faster than random writes;</li>
<li>Concurrency and crash recovery are much simpler;</li>
<li>Merging old segments avoids fragmentation over time.</li>
</ol>
<h2 id="how-do-we-find-the-value-of-a-given-key">How do we find the value of a given key?</h2>
<p>The solution is to introduce an additional data structure derived from the
data: the <em>index</em>; the idea is to keep some additional metadata on the side that
helps to locate the data. However, maintaining such structures incurs
overhead, especially on writes!</p>
<h2 id="hash-indexes">Hash Indexes</h2>
<p>The simplest possible indexing strategy is to keep an in-memory hash map where
each key is mapped to a byte offset in the data file. The hash map is used to
find the offset in the data file, seek to that location, and read the value.</p>
<p><img src="/assets/images/hash-indexes.png" alt="hash-indexes" /></p>
<p>Note that the hash map is always updated when a new key-value pair is appended
to the file in order to reflect the current data offset.</p>
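As a rough sketch of this strategy (in Python, purely illustrative; a real engine appends to a file on disk and the index stores byte offsets into that file), the hash-indexed log might look like this:

```python
class HashIndexedLog:
    """Append-only log plus an in-memory hash map of key -> byte offset."""

    def __init__(self):
        self.log = bytearray()  # stands in for the append-only data file
        self.index = {}         # key -> offset of the latest record for it

    def put(self, key, value):
        # remember where this record starts, then append it to the log
        self.index[key] = len(self.log)
        self.log += f"{key},{value}\n".encode()

    def get(self, key):
        # the hash map gives the offset; "seek" there and read one record
        offset = self.index[key]
        end = self.log.index(b"\n", offset)
        _, value = self.log[offset:end].decode().split(",", 1)
        return value
```

Note that `put` never rewrites old records: updating a key appends a new record and simply repoints the index at it, so the latest write always wins on reads.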
<p>This solution may sound too simplistic, right? Actually, it is a viable approach.
In 2010, Basho Technologies, the distributed systems company behind the
key-value NoSQL database Riak, wrote a paper introducing Bitcask,
Riak’s default storage engine. Bitcask uses this concept of an in-memory hash map
and offers high-performance reads and writes. The tradeoff is that all keys must fit
in memory.</p>
<p><img src="/assets/images/bitcask-paper.png" alt="bitcask-paper" /></p>
<blockquote>
<p>When a write occurs, the keydir is atomically updated with the location of
the newest data. The old data is still present on disk, but any new reads will
use the latest version available in the keydir.</p>
</blockquote>
<p>However, the in-memory hash map strategy has some limitations. As we saw from
the Bitcask example, all keys must fit in the in-memory hash map, so this
indexing strategy is not suitable for a very large number of keys. And since
the keys are not sorted, scanning over a range of keys is not efficient —
it would be necessary to look up each key individually in the in-memory
hash map.</p>
<h2 id="sorted-string-tables">Sorted-String Tables</h2>
<p>In 2006, Google wrote the Bigtable paper, which introduced, among other things,
the <em>SSTable</em> - <em>Sorted String Table</em>; a sequence of key-value pairs
sorted by key.</p>
<p><img src="/assets/images/bigtable-paper.png" alt="bigtable-paper" /></p>
<blockquote>
<p>An SSTable provides a persistent, ordered immutable map from keys to values,
where both keys and values are arbitrary byte strings.</p>
</blockquote>
<blockquote>
<p>A lookup can be performed by first finding the appropriate block with a
binary search in the in-memory index, and then reading the appropriate block
from disk.</p>
</blockquote>
<p><img src="/assets/images/sparse-in-memory-index.png" alt="sparse-in-memory-index" /></p>
<p>The recently committed records are stored in memory in a sorted buffer called
a <em>memtable</em>. The <em>memtable</em> maintains the updates on a row-by-row basis, where each
row is copy-on-write to maintain row-level consistency. Older updates are
stored in a sequence of immutable <em>SSTables</em>.</p>
<p>As recently committed records are being stored in the <em>memtable</em>, its size
increases, and when it reaches a threshold:</p>
<ol>
<li>the <em>memtable</em> is frozen;</li>
<li>a new <em>memtable</em> is created;</li>
<li>and the frozen <em>memtable</em> is converted to a <em>SSTable</em> and persisted on disk.</li>
</ol>
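These three steps can be modeled with a toy sketch (hypothetical names; a real engine writes the frozen memtable to disk as an SSTable file rather than keeping it in a list):

```python
import bisect

class MiniLSM:
    """Toy model of the memtable flush: freeze, replace, persist as an SSTable."""

    def __init__(self, threshold=4):
        self.threshold = threshold
        self.memtable = {}   # recently committed records, kept in memory
        self.sstables = []   # newest-first list of sorted, immutable segments

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            frozen = sorted(self.memtable.items())  # 1. freeze the memtable
            self.memtable = {}                      # 2. create a new one
            self.sstables.insert(0, frozen)         # 3. "persist" it, sorted by key

    def get(self, key):
        if key in self.memtable:            # the newest data wins
            return self.memtable[key]
        for table in self.sstables:         # then SSTables, newest to oldest
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None
```

Because each frozen segment is sorted, a lookup inside it is a binary search rather than a scan, which is what makes the in-memory index of the real design so compact.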
<p>A lookup can be performed with a single disk seek by first finding the
appropriate block by performing a binary search in the in-memory index and then
reading the appropriate block from disk.</p>
<p><img src="/assets/images/sstable-merging-and-compactation.png" alt="sstable-merging-and-compactation" /></p>
<p>Since the segments are sorted by key, the merging approach is like the one used
in the mergesort algorithm:</p>
<ol>
<li>we start reading the segment files side by side;</li>
<li>look at the first key in each file;</li>
<li>copy the lowest key to the new segment file;</li>
<li>and repeat.</li>
</ol>
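Assuming each segment holds unique keys and segments are passed newest first, the merge steps above can be sketched like this (illustrative only):

```python
import heapq

def merge_segments(newer, older):
    """Merge two sorted (key, value) segments, mergesort-style.

    On duplicate keys, the value from the newer segment is kept."""
    merged = []
    # heapq.merge reads both segments side by side, always yielding the
    # lowest key; the middle tuple element breaks ties so that the record
    # from the newer segment comes out first
    for key, _, value in heapq.merge(
        ((k, 0, v) for k, v in newer),
        ((k, 1, v) for k, v in older),
    ):
        if merged and merged[-1][0] == key:
            continue  # a newer value for this key was already copied
        merged.append((key, value))
    return merged
```

The output is itself a sorted segment, so the same routine can keep folding older segments into newer ones during compaction.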
<h2 id="lsm-tree">LSM-Tree</h2>
<p>Storage engines that are based on this principle of merging and compacting sorted
files are often called LSM storage engines. This concept was introduced in 1996
in The Log-Structured Merge-Tree paper; a disk-based data structure designed to
provide low-cost indexing for a file experiencing a high rate of record inserts
over an extended period.</p>
<p><img src="/assets/images/lsm-tree-paper.png" alt="lsm-tree-paper" /></p>
<blockquote>
<p>The LSM-tree uses an algorithm that defers and batches index changes, cascading
the changes from a memory-based component through one or more disk components in
an efficient manner reminiscent of merge sort.</p>
</blockquote>
<h2 id="what-about-performance">What about performance?</h2>
<p>The algorithm can be slow when looking for keys that do not exist in the database.
Before we can make sure that a key does not exist, we first need to check
the memtable and then the segments all the way back to the oldest. The solution is
to introduce another data structure: the <em>Bloom Filter</em> - a memory-efficient
data structure for approximating the contents of a set.</p>
<blockquote>
<p>A Bloom filter allows us to ask whether an SSTable might contain any data
for a specified row/column pair. For certain applications, the small amount
of tablet server memory used for storing Bloom filters drastically reduces
the number of disk seeks required for read operations. Our use of Bloom
filters also avoids disk accesses for most lookups of non-existent rows
or columns.</p>
</blockquote>
<p>Basically, the <em>Bloom Filter</em> can tell us if a key does not exist in the
database, saving many unnecessary disk reads for non-existent keys. However,
because the <em>Bloom Filter</em> is a probabilistic function, it can result in
false positives.</p>
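A minimal Bloom filter sketch makes the asymmetry concrete (illustrative; production engines use carefully sized bit arrays and much faster hash functions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a 'no' answer is definitive, a 'yes' may be wrong."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # derive `hashes` bit positions from the key
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # if any probed bit is unset, the key was definitely never added;
        # if all are set, the key is *probably* present (false positives happen)
        return all(self.bits[pos] for pos in self._positions(key))
```

An LSM engine keeps one such filter per SSTable and skips the disk read whenever `might_contain` returns `False`.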
<h2 id="b-trees">B-Trees</h2>
<p>In 1970, a paper on <em>Organization and Maintenance of Large Ordered Indices</em>
introduced the concept of <em>B-Trees</em>, which in less than 10 years became, <em>de
facto</em>, the standard for file organization. They still remain the standard
index implementation in almost all relational databases, and many non-relational
databases use them as well.</p>
<p><img src="/assets/images/b-trees-papers.png" alt="b-trees-papers" /></p>
<blockquote>
<p>The index is organized in pages of a fixed size capable of holding up to
2k keys, but pages need only be partially filled.</p>
</blockquote>
<p>Basically, <em>B-trees</em> break the database down into fixed-size pages, and
read or write one page at a time!</p>
<p><img src="/assets/images/b-trees.png" alt="b-trees" /></p>
<p>Each page can be identified using an address that allows one page to refer to
another page. One of those pages is designated as the <em>root</em> of the <em>B-tree</em>, and
whenever we want to look up a key in the index, we start from there and traverse
the tree recursively down to the leaves.</p>
<blockquote>
<p>The retrieval algorithm is simple logically, but to program it for a computer
one would use an efficient technique, e.g., a binary search, to scan a page.</p>
</blockquote>
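The traversal plus per-page binary search can be sketched as follows (a simplified in-memory page layout with hypothetical names; real B-trees store fixed-size pages on disk):

```python
import bisect

class Page:
    """A B-tree page: sorted keys and, for interior pages, keys + 1 children."""

    def __init__(self, keys, values, children=None):
        self.keys = keys
        self.values = values
        self.children = children or []  # empty for leaf pages

def lookup(page, key):
    # binary-search within the page, as the paper suggests, not a linear scan
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    if not page.children:
        return None                        # reached a leaf: key is absent
    return lookup(page.children[i], key)   # recurse into the covering child
```

Each recursive step descends exactly one level, so a lookup touches only as many pages as the tree is tall.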
<p><img src="/assets/images/b-trees-animation.gif" alt="b-trees-animation" /></p>
<p>In order to add a new key, we need to find the page whose range contains the key
and split it into two pages if there’s no space to accommodate it. This is
the only way in which the height of the tree can increase.</p>
<h2 id="what-about-resilience">What about resilience?</h2>
<p>If we need to split a page because an insertion caused it to be overfull,
we need to write the two pages that were split, and also overwrite their
parent page to update the references to the two child pages. Overwriting
several pages at once is a dangerous operation that can result in a
corrupted <em>index</em> if the database crashes.</p>
<p>The solution to this problem is to introduce an additional structure -
the <em>Write-Ahead Log</em> (WAL) - to which every modification is written before it
is applied to the tree itself. If the database crashes, the WAL is
used to restore the <em>tree</em> back to a consistent state.</p>
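A sketch of the write-ahead idea (hypothetical names; a real WAL is an fsync’d file on disk, and the tree is made of pages, not a dict):

```python
class WriteAheadLog:
    """Every change is appended here before the tree itself is touched."""

    def __init__(self):
        self.records = []  # stands in for a durable log file

    def append(self, key, value):
        self.records.append((key, value))

class Tree:
    """Stand-in for the B-tree pages; a dict keeps the sketch short."""

    def __init__(self, wal):
        self.wal = wal
        self.pages = {}

    def put(self, key, value):
        self.wal.append(key, value)  # 1. log the modification first
        self.pages[key] = value      # 2. only then overwrite the pages

def recover(wal):
    # after a crash, replaying the log restores a consistent tree
    tree = Tree(wal)
    for key, value in wal.records:
        tree.pages[key] = value
    return tree
```

If the crash happens between steps 1 and 2, the modification is already in the log, so replaying it on restart produces the same tree the write was supposed to create.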
<p>But writing all modifications to the WAL introduces another problem - <em>write
amplification</em>: one write to the database results in multiple writes
to disk, which has a direct performance cost.</p>
<h2 id="wrapping-up">Wrapping up!</h2>
<p>With this brief introduction to several types of data storage engines, we can draw
some conclusions:</p>
<ul>
<li><em>B-Trees</em> are <em>mutable</em> and allow in-place updates;</li>
<li><em>LSM-Trees</em> are <em>immutable</em> and require complete file rewrites;</li>
<li>Writes are slower on <em>B-Trees</em> since they must write every piece of data at
least twice;</li>
<li>Reads are slower on <em>LSM-Trees</em> since they have to check the <em>memtable</em>,
<em>bloom filter</em>, and possibly multiple <em>SSTables</em> with different sizes;</li>
<li><em>LSM-Trees</em> are able to sustain higher write throughput due to lower <em>write
amplification</em> and sequential writes, but they can consume lots of resources
on merging and compaction processes, especially when the throughput is very high
and the database keeps growing.</li>
</ul>
<h2 id="whats-next">What’s next?</h2>
<blockquote>
<p>There is no quick and easy rule for determining which type of storage engine
is better for your use case, so it is worth testing empirically.</p>
</blockquote>
<p>You should read papers, and this list may help you with that:</p>
<ul>
<li><a href="https://people.eecs.berkeley.edu/~brewer/cs262/LFS.pdf">Log-Strucured File System</a></li>
<li><a href="https://github.com/basho/bitcask/blob/develop/doc/bitcask-intro.pdf">Bitcask</a></li>
<li><a href="https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf">Bigtable: A Distributed Storage System for Structured Data</a></li>
<li><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.2782&rep=rep1&type=pdf">The log-structured merge-tree</a></li>
<li><a href="http://www.inf.fu-berlin.de/lehre/SS10/DBS-Intro/Reader/BayerBTree-72.pdf">Organization and Maintenance of Large Ordered Indices</a></li>
<li><a href="http://www.ezdoum.com/upload/14/20020512204603/TheUbiquitousB-Tree.pdf">Ubiquitous B-Tree</a></li>
</ul>This post is a transcript of the talk I gave at Papers We Love @ Porto.Don’t be fooled by 100% code coverage.2018-01-14T00:00:00+00:002018-01-14T00:00:00+00:00https://ordep.dev/posts/code-coverage<blockquote>
<p>A program with high test coverage, measured as a percentage, has had more of its
source code executed during testing which suggests it has a lower chance of containing
undetected software bugs compared to a program with low test coverage.</p>
</blockquote>
<blockquote>
<p>“Let’s make it clear, then: don’t set goals for code coverage. You may think that
it could make your code base better, but asking developers to reach a certain code
coverage goal will only make your code worse.”</p>
<p>— <cite>Mark Seemann</cite></p>
</blockquote>
<h2 id="why-its-bad-to-use-high-code-coverage-as-a-goal">Why is it bad to use high code coverage as a goal?</h2>
<p>In the snippet below we have a function <code class="language-plaintext highlighter-rouge">divide</code> that accepts two <code class="language-plaintext highlighter-rouge">float</code>
arguments, <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code>, and performs a division between them. Note that we don’t
have any kind of guards on our code.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="nf">divide</span><span class="o">(</span><span class="kt">float</span> <span class="n">x</span><span class="o">,</span> <span class="kt">float</span> <span class="n">y</span><span class="o">)</span> <span class="o">{</span>
<span class="k">return</span> <span class="n">x</span> <span class="o">/</span> <span class="n">y</span><span class="o">;</span>
<span class="o">}</span>
</code></pre></div></div>
<p>With the <code class="language-plaintext highlighter-rouge">divide</code> function we also provide a simple unit test that makes
sure that our function does the job. With this test, we have 100% code coverage.
That means our code is bulletproof, right?</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Test</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">divide_with_valid_arguments</span><span class="o">()</span> <span class="o">{</span>
<span class="n">assertThat</span><span class="o">(</span><span class="k">new</span> <span class="nc">Calculator</span><span class="o">().</span><span class="na">divide</span><span class="o">(</span><span class="mi">10</span><span class="o">,</span> <span class="mi">2</span><span class="o">)).</span><span class="na">isEqualTo</span><span class="o">(</span><span class="mi">5</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Nope. We have 100% coverage, that’s a fact. But the code itself is not correct.
Also, we’re testing only one scenario; a positive and limited scenario. We should
always test for failure. In this particular case, what happens if we try to
divide by zero? We should check if <code class="language-plaintext highlighter-rouge">y</code> is equal to zero and throw a proper
exception.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="nf">divide</span><span class="o">(</span><span class="kt">float</span> <span class="n">x</span><span class="o">,</span> <span class="kt">float</span> <span class="n">y</span><span class="o">)</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="n">y</span> <span class="o">==</span> <span class="mi">0</span><span class="o">)</span> <span class="o">{</span>
<span class="k">throw</span> <span class="k">new</span> <span class="nf">ArithmeticException</span><span class="o">(</span><span class="s">"Can't divide by zero."</span><span class="o">);</span>
<span class="o">}</span>
<span class="k">return</span> <span class="n">x</span> <span class="o">/</span> <span class="n">y</span><span class="o">;</span>
<span class="o">}</span>
</code></pre></div></div>
<p>What’s the problem with adding decision branches? Coverage drops and
there’s no time to write another test. Having a good code coverage may be a sign
that we have a solid test suite, but if we’re using it as a mandatory target,
the codebase will eventually suffer.</p>
<p>Humans always take shortcuts. When we have two possible choices, we always
choose the easier one. If the coverage value is part of the merging process,
developers will adapt the code to meet those requirements. A stronger test
suite, still meeting the 100% code coverage criteria, would look like the
snippet below.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Test</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">divide_with_valid_arguments</span><span class="o">()</span> <span class="o">{</span>
<span class="n">assertThat</span><span class="o">(</span><span class="k">new</span> <span class="nc">Calculator</span><span class="o">().</span><span class="na">divide</span><span class="o">(</span><span class="mi">10</span><span class="o">,</span> <span class="mi">2</span><span class="o">)).</span><span class="na">isEqualTo</span><span class="o">(</span><span class="mi">5</span><span class="o">);</span>
<span class="o">}</span>
<span class="nd">@Test</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">divide_with_invalid_arguments_should_throw_exception</span><span class="o">()</span> <span class="o">{</span>
<span class="n">assertThatThrownBy</span><span class="o">(()</span> <span class="o">-></span> <span class="k">new</span> <span class="nc">Calculator</span><span class="o">().</span><span class="na">divide</span><span class="o">(</span><span class="mi">10</span><span class="o">,</span> <span class="mi">0</span><span class="o">))</span>
<span class="o">.</span><span class="na">isInstanceOf</span><span class="o">(</span><span class="nc">ArithmeticException</span><span class="o">.</span><span class="na">class</span><span class="o">)</span>
<span class="o">.</span><span class="na">hasMessageContaining</span><span class="o">(</span><span class="s">"Can't divide by zero."</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>
<p>For a simple division function, we can successfully achieve 100% code coverage
with these two tests, but are we really done with the testing? Do our tests
cover a reasonable number of scenarios that make us feel confident about our
code? When testing, the hardest question is when to stop. For some, a shiny
100% code coverage is the answer to that question. It is important to look
for other quality factors beyond code coverage. Check if the test case is useful
and is intended to find failures in the system. If you’re looking only at code
coverage as a quality criterion, the test below would do the job.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Test</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">divide_with_valid_arguments</span><span class="o">()</span> <span class="o">{</span>
<span class="n">assertThat</span><span class="o">(</span><span class="k">new</span> <span class="nc">Calculator</span><span class="o">().</span><span class="na">divide</span><span class="o">(</span><span class="mi">10</span><span class="o">,</span> <span class="mi">2</span><span class="o">)).</span><span class="na">isNotZero</span><span class="o">();</span>
<span class="o">}</span>
</code></pre></div></div>
<blockquote>
<p>“I don’t know if they did code coverage analysis on this project, but of
course you can do this and have 100% code coverage - which is one reason
why you have to be careful on interpreting code coverage data.”</p>
<p>— <cite>Martin Fowler</cite></p>
</blockquote>
<h2 id="amplifying-the-scenarios-with-parameterized-tests">Amplifying the scenarios with parameterized tests</h2>
<p><em>Parameterized tests</em> are a <em>data-driven testing</em> technique that uses test inputs
and expected outcomes as data, normally in a tabular format, so that a single
driver script can execute all of the designed test cases. A suitable scenario
for <em>data-driven testing</em> is when two or more test cases require
the same instructions but different inputs and different expected outcomes.</p>
<p>A nice technique to evaluate the different test inputs is to perform an
<em>equivalence class partitioning</em>, where we divide all possible inputs into
classes such that there is a finite number of input equivalence classes. Once
they’re set, we may assume that:</p>
<ul>
<li>the program behaves analogously for inputs in the same class;</li>
<li>one test with a representative value from a class is sufficient;</li>
<li>if the representative detects a defect, then other class members would detect the same defect.</li>
</ul>
<p>For <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> we can divide the inputs into four partitions: <code class="language-plaintext highlighter-rouge">{1, 20}</code>,
<code class="language-plaintext highlighter-rouge">{1.0, 20.0}</code>, <code class="language-plaintext highlighter-rouge">{-20, -1}</code>, and <code class="language-plaintext highlighter-rouge">{-20.0, -1.0}</code>. By combining
<em>equivalence class partitioning</em> and <em>parameterized tests</em> we can write
a single test with multiple scenarios like the snippet below.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@ParameterizedTest</span>
<span class="nd">@CsvSource</span><span class="o">({</span>
<span class="s">"10, 5, 2"</span><span class="o">,</span>
<span class="s">"-10, -5, 2"</span><span class="o">,</span>
<span class="s">"10.5, 5.25, 2"</span><span class="o">,</span>
<span class="s">"-5.0, -10.0, 0.5"</span>
<span class="o">})</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">divide_with_valid_fields</span><span class="o">(</span><span class="kt">float</span> <span class="n">x</span><span class="o">,</span> <span class="kt">float</span> <span class="n">y</span><span class="o">,</span> <span class="kt">float</span> <span class="n">z</span><span class="o">)</span> <span class="o">{</span>
<span class="n">assertThat</span><span class="o">(</span><span class="k">new</span> <span class="nc">Calculator</span><span class="o">().</span><span class="na">divide</span><span class="o">(</span><span class="n">x</span><span class="o">,</span> <span class="n">y</span><span class="o">)).</span><span class="na">isEqualTo</span><span class="o">(</span><span class="n">z</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>
<p>What is the coverage of this <em>parameterized test</em>? 100%. The same coverage
as the test below. A test that is not looking for failures in the system.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Test</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">divide_with_valid_arguments</span><span class="o">()</span> <span class="o">{</span>
<span class="n">assertThat</span><span class="o">(</span><span class="k">new</span> <span class="nc">Calculator</span><span class="o">().</span><span class="na">divide</span><span class="o">(</span><span class="mi">10</span><span class="o">,</span> <span class="mi">2</span><span class="o">)).</span><span class="na">isNotZero</span><span class="o">();</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Parameterized tests contribute to a much more solid test suite, since we’re
testing multiple scenarios with some edge cases, even though they don’t increase
code coverage.</p>
<h2 id="summary">Summary</h2>
<blockquote>
<p>“Designing your initial test suite to achieve 100% coverage is an even worse
idea. It’s a sure way to create a test suite weak at finding those all-important
faults of omission.”</p>
<p>— <cite>Brian Marick</cite></p>
</blockquote>
<p>High code coverage is not directly related to code quality and should not be used
as a key metric or a goal. Try to look for other metrics, expand the test
scenarios, and don’t stop when you reach that shiny 100% mark.</p>
<hr />
<p>More readings on the topic.</p>
<ul>
<li><a href="http://www.exampler.com/testing-com/writings/coverage.pdf">How to Misuse Code Coverage</a></li>
<li><a href="http://blog.ploeh.dk/2015/11/16/code-coverage-is-a-useless-target-measure/">Code coverage is a useless target measure</a></li>
<li><a href="https://softwareengineering.stackexchange.com/questions/216301/are-there-any-formalized-mathematical-theories-of-software-testing">Are there any formalized/mathematical theories of software testing?</a></li>
</ul>A program with high test coverage, measured as a percentage, has had more of its source code executed during testing which suggests it has a lower chance of containing undetected software bugs compared to a program with low test coverage.