29 October 2019

Lambda Architecture

About four years ago, in 2015, Nathan Marz and James Warren published a book (https://www.amazon.com/Big-Data-Principles-practices-scalable/dp/1617290343) introducing a relatively fresh view on how reliable data-processing architectures can be designed by leveraging both batch and stream processing at the same time. This concept was named the Lambda Architecture.

Today the concepts introduced in this book are used in many companies, from small to large, but the book itself is a little outdated by now. It basically describes a specific technology stack (Apache Storm, Hadoop), which is still pretty good, but it may give you the wrong impression that Lambda Architecture means using that exact stack with minimal changes. Also, when I read the book, I would have liked to learn more about the hidden complexity and the problems our team would need to solve while implementing this approach. Anyway, it's still a good introduction to the topic.

So, it all starts with collecting data from various sources, like rivers flowing into a lake. That is, in fact, what this pool of data is often called - a Data Lake. And the Lambda Architecture can be a part of it, helping your analysts swim in that lake. We'll get to that.

I'm not really going to walk through the different tools for collecting entries from various sources - every company has its own processes. Usually it's just some kind of ETL or processing framework, like Sqoop, Flume, Talend, Pentaho and so on. It may even be some incredibly simple self-written tool or a native DB connector. Storing the data is also not a problem - it's getting cheaper every day, and you probably already have some OLTP database as a backend for your website, as well as something like HDFS for long-term storage. In the modern world, the latter is often replaced by a full-featured Data Warehouse, like Amazon Redshift, MemSQL, Vertica or ClickHouse. These give you more flexibility in terms of what to keep in memory and how to optimize your queries, instead of dealing with tons of MapReduce jobs.

Let’s say you’re selling IoT devices. We can think of something well known, like a fitness bracelet that collects data over time and gives some health recommendations based on this information. This means the user agrees to send the data over to your servers for analysis and gets a nice online dashboard in return.

This most likely means that you’ll need some Event Bus (aka Data Bus) for collecting the data from these devices. As a business requirement, you may now want to count the daily number of steps per customer, to be able to recommend spending more time walking around, if needed. Also, as another service for elderly people, you may monitor some vital parameters and generate an emergency alarm if something is out of the ordinary. The question is, how will you implement these features?
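Just to make the examples below a bit more concrete, here is a rough sketch (in Go) of what one such incoming event could look like. The schema and field names are completely made up for illustration; the real payload would depend on your devices and firmware:

    package events

    import "time"

    // Event is a hypothetical payload a bracelet ships to the Data Bus.
    // The real schema depends on your devices; these fields are made up.
    type Event struct {
        CustomerID string    `json:"customer_id"`
        Timestamp  time.Time `json:"timestamp"`
        Steps      int64     `json:"steps"`      // steps counted since the previous event
        HeartRate  int       `json:"heart_rate"` // beats per minute
    }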

There are a lot of products that can be used as a Data Bus, starting from Redis Pub/Sub and ending with Apache Kafka or cloud-powered tools like AWS SQS or Kinesis. So, as a first approach, it seems like a good idea to take a simple key-value store to keep track of the current counter values and to spawn a microservice that listens to the stream and increments them. Some infrastructural issues can be mitigated using various safety mechanisms (I would highly recommend reading this cool book on how that can be done - https://www.manning.com/books/streaming-data).
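Here is a minimal sketch of that first approach. To keep it self-contained, a buffered channel stands in for the Data Bus subscription and a plain map stands in for the key-value storage; in a real system these would be something like Kafka or Kinesis and Redis, and the event fields are again made up:

    package main

    import "fmt"

    // StepEvent is a simplified, made-up version of the event above.
    type StepEvent struct {
        CustomerID string
        Date       string // e.g. "2019-10-29"
        Steps      int64
    }

    func main() {
        // A channel stands in for the Data Bus subscription,
        // a map stands in for the key-value storage of counters.
        stream := make(chan StepEvent, 3)
        counters := map[string]int64{}

        stream <- StepEvent{"alice", "2019-10-29", 120}
        stream <- StepEvent{"alice", "2019-10-29", 80}
        stream <- StepEvent{"bob", "2019-10-29", 300}
        close(stream)

        // The microservice simply increments the daily counter per customer.
        for ev := range stream {
            key := ev.CustomerID + ":" + ev.Date
            counters[key] += ev.Steps
        }

        fmt.Println(counters) // map[alice:2019-10-29:200 bob:2019-10-29:300]
    }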

Seems viable? Well, almost - until at some point you get a bug report saying that the step counters are being incremented twice instead of once for some customers. Not a big deal, everyone makes mistakes, but this problem is almost unsolvable purely at the streaming level, because we have no historical data to go back to, only aggregated numbers that are not valid anymore.

So, we need long-term storage, and we also need to be able to query all the data. This will be our single source of truth, which will always give us a precise result for any query across the whole dataset. There is another issue here though - it takes a huge amount of time. Even simple MapReduce jobs may take minutes or hours to look through all the data and give you the precise result. This isn’t a big deal for people getting their health recommendations once or twice a day, but it is absolutely unacceptable for the second business case, where an immediate response to an anomaly in the transmitted data may be crucial.
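For illustration, here is a toy sketch of such a batch recomputation in Go. An in-memory slice stands in for the long-term raw event store (in reality this would be a MapReduce job or a warehouse SQL query), and I'm assuming each measurement carries a unique EventID, which lets the job drop exactly the duplicates that broke the streaming counters:

    package main

    import "fmt"

    // RawEvent is a made-up record in the long-term event store.
    type RawEvent struct {
        EventID    string // assumed unique per measurement
        CustomerID string
        Date       string
        Steps      int64
    }

    // recomputeDailySteps scans the full history and rebuilds the counters
    // from scratch - slow, but it gives the precise answer.
    func recomputeDailySteps(history []RawEvent) map[string]int64 {
        seen := map[string]bool{}
        totals := map[string]int64{}
        for _, ev := range history {
            if seen[ev.EventID] { // drop the duplicate delivery
                continue
            }
            seen[ev.EventID] = true
            totals[ev.CustomerID+":"+ev.Date] += ev.Steps
        }
        return totals
    }

    func main() {
        history := []RawEvent{
            {"e1", "alice", "2019-10-29", 120},
            {"e1", "alice", "2019-10-29", 120}, // delivered twice
            {"e2", "bob", "2019-10-29", 300},
        }
        fmt.Println(recomputeDailySteps(history)) // map[alice:2019-10-29:120 bob:2019-10-29:300]
    }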

So, from a bird’s-eye view, we have a situation where we need both data circuits simultaneously:

1) Long-term batch storage with the ability to query all the data we have at once and get the results after some period of time (whether by running a MapReduce job or an SQL query over a Data Warehouse). Slow, but a reliable source of truth. This is the Batch Layer.

2) A short-lived but fast streaming bus that allows us to do aggregations almost in real time. These aggregations are very approximate, but really fast to calculate. This is the Speed Layer.

These two things seem independent, but you can notice that one is weak exactly where the other is strong, and vice versa. Is there a way to combine them somehow? Well, there is, and it is called the Lambda Architecture. Finally!

So, one circuit can refine and update the results of the other. The Speed Layer keeps calculating aggregations the same way it did before, but from time to time they are corrected with trustworthy results from the Batch Layer. Besides, it's really helpful to be able to completely rewrite the data generated by the Speed Layer from the Batch Layer if anything goes wrong with it (say, we find another bug in how steps are counted). Also, we may use some precomputed batch views on the Speed Layer to make decisions (say, filtering), which means that if the Batch Layer is temporarily unavailable, the Speed Layer can still show at least something meaningful to our customers, so we don't lose track of what's going on while we're troubleshooting.
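A toy sketch of that combination could look like this, assuming the batch views are keyed by customer and day and the Speed Layer only holds the deltas accumulated since the last batch run (the function and key names are made up):

    package main

    import "fmt"

    // mergedSteps answers a dashboard query by combining the trustworthy but
    // stale batch view with the fresh but approximate speed-layer delta.
    func mergedSteps(batchView, speedDeltas map[string]int64, key string) int64 {
        return batchView[key] + speedDeltas[key]
    }

    func main() {
        batchView := map[string]int64{"alice:2019-10-29": 120}  // recomputed a few hours ago
        speedDeltas := map[string]int64{"alice:2019-10-29": 40} // counted since that run

        fmt.Println(mergedSteps(batchView, speedDeltas, "alice:2019-10-29")) // 160

        // When a new batch run finishes, it overwrites the batch view and the
        // speed-layer deltas for that period are reset (or simply ignored).
    }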

Looks like we can now handle both of our business cases at the same time! Also, if we suddenly decide to implement some new real-time feature, we don’t need to start counting from zero again, because we already have prior knowledge. Adding another counter is also a problem that isn't really solvable on the Speed Layer alone, but it works well when we have both.

Unfortunately, the main strength of the Lambda Architecture is also its weakness. It seems reasonable to have these two circuits running simultaneously, but it also means we'll have duplicated logic, doing the same thing, inside both of them. The two implementations will most likely even be written in different languages (say, Go and SQL). This is yet another thing the original book doesn't cover in detail, but we can still try to solve it by encapsulating small pieces of these computations into separate microservices. Whether it is a cron-powered SQL query producing health recommendations or a Go binary running anomaly detection 24/7, it's best to keep things small and simple, so that you can play around and replace pieces of the puzzle without disturbing the whole pipeline. That definitely requires a proper CI/CD workflow to orchestrate things flawlessly, so in our projects we usually use a combination of Jenkins, Groovy pipelines (an Infrastructure-as-Code approach) and Kubernetes to move containers around.

Okay, so where exactly do the results of the Speed Layer and the Batch Layer go? In our IoT case they are just stored as aggregates in a key-value storage where our data scientists can find them. But apart from that, in order to do more research on how to develop the business, they will need more queries and more aggregates than are actually used in production right now. Still, if we keep running ad-hoc analytical queries over all of our data at once, we may end up using huge amounts of resources and doing the same computations over and over again.

Let’s add one more feature to our customer dashboard - we may recommend that customers who are into running meet up with other people in the same district who follow similar routes and behavior patterns. To do that, we don’t need to run totally custom queries on top of our analytical database; we can just add one more cron job that periodically recalculates more aggregates. In this case, for example, it may be a feature vector per customer, such as daily running distance grouped by district, used to calculate similarity.
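As a rough illustration, such an aggregate and a naive similarity check could look like the sketch below, with made-up district names and a plain cosine similarity over the per-district distance vectors (a real recommender would of course use more features):

    package main

    import (
        "fmt"
        "math"
    )

    // districts fixes the order of vector components; the names are made up.
    var districts = []string{"north", "center", "south"}

    // toVector turns "daily running distance (km) per district" into a feature vector.
    func toVector(kmByDistrict map[string]float64) []float64 {
        v := make([]float64, len(districts))
        for i, d := range districts {
            v[i] = kmByDistrict[d]
        }
        return v
    }

    // cosine is a naive similarity measure between two customers' vectors.
    func cosine(a, b []float64) float64 {
        var dot, na, nb float64
        for i := range a {
            dot += a[i] * b[i]
            na += a[i] * a[i]
            nb += b[i] * b[i]
        }
        return dot / (math.Sqrt(na) * math.Sqrt(nb))
    }

    func main() {
        alice := toVector(map[string]float64{"north": 5, "center": 2})
        bob := toVector(map[string]float64{"north": 4, "center": 3})
        fmt.Printf("similarity: %.2f\n", cosine(alice, bob))
    }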

So, this mysterious layer of precalculated views that data scientists look at is usually referred to as the Serving Layer. But it's not only for data scientists: it is the same layer (maybe even the same in-memory cache) that stores the batch-calculated views used by the Speed Layer as a source of historical truth, which I mentioned above.

So, these three layers - Batch, Speed and Serving - are the main logical building blocks of what's called the Lambda Architecture.

Oh, by the way, have you ever heard of the Kappa Architecture? Some engineers claim that with modern versions of Apache Kafka and streaming frameworks like Samza it's possible to work around those issues with the Speed Layer and remove the need for the Batch Layer, by keeping the full log around and simply replaying it from an earlier offset whenever you need to recompute something. This is an interesting approach too, but it also means you're tying yourself to specific frameworks and implementation details (like relying on Kafka retaining the whole history and on consumer offsets to reprocess it), which is a form of vendor lock-in.

So, I hope these concepts help you think in more detail about what to do with your data. As a great mind once said, "Everything should be made as simple as possible, but not simpler." In my experience, one check for simplicity and consistency is people saying "Ah, so THAT is what it's called! We're using this approach already!" after they hear some of these terms. I met several such people when I gave a talk on this topic at the DataFest event in Moscow on May 11th, 2019, which should be a good sign. Anyway, I'd really like to know your thoughts. See you at conferences!