Lambda Architecture
Making Sense of it All
Building a well-designed, reliable and functional big data application that caters to a variety of end-user latency requirements can be an extremely challenging proposition. It can be daunting enough to just keep up with the rapid pace of technology innovation happening in this space, let alone building applications that work for the problem at hand. “Start slow and build one application at a time” is perhaps the most common advice given to beginners today. However, there are certain high-level architectural constructs that can help you mentally visualize how different types of applications fit into the big data architecture and how some of these technologies are transforming the existing enterprise software landscape.
Lambda Architecture
Lambda Architecture is a useful framework to think about designing big data applications.Nathan Marz came up with the term Lambda Architecture (LA) for a generic, scalable and fault-tolerant data processing architecture, based on his experience working on distributed data processing systems at Backtype and Twitter.
The LA aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.
Here’s how it looks like, from a high-level perspective:
The Lambda Architecture as seen in the picture has three major components.
-
Batch layer that provides the following functionality
- managing the master dataset, an immutable, append-only set of raw data
- pre-computing arbitrary query functions, called batch views.
-
Serving layer—This layer indexes the batch views so that they can be queried in ad hoc with low latency.
-
Speed layer—This layer accommodates all requests that are subject to low latency requirements. Using fast and incremental algorithms, the speed layer deals with recent data only.
- All data entering the system is dispatched to both the batch layer and the speed layer for processing.
- The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) to pre-compute the batch views.
- The serving layer indexes the batch views so that they can be queried in low-latency, ad-hoc way.
- The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
- Any incoming query can be answered by merging results from batch views and real-time views.
In the case of failure or orderly exit of a topology, the new version of the bolt can read this log and reconstruct the necessary state of the bolt very quickly. Once the log is read, tuples coming from the spout can be processed as if nothing ever happened. Since all tuples that arrived after the last record in the log have not been acknowledged, the spout will replay them so the bolt will get a complete set of tuples.