Session Title: Intelligent Lakehouse: Think Beyond Transactional Data Lake with Apache Hudi
Speaker(s): Sagar Sumit
Abstract: The rise of cloud computing, the decoupling of storage and compute, and the need to store and process large amounts of unstructured or semi-structured data led to the data lake movement a decade ago. Yet the promise of that movement remains unfulfilled. The lack of transactions and isolation guarantees in the data lake forced users to maintain both a lake and a data warehouse in their data architecture. As if that were not enough, the problems of data quality and governance have grown along with the data lake movement. To bridge some of these gaps, Apache Hudi, created in 2016 at Uber, pioneered the modern transactional data lake architecture, also known as the lakehouse. In this session, we will discuss the design of Hudi as an embedded database kernel, and how it unlocks transactional updates and deletes with tunable concurrency control, as well as incremental change streams from tables directly on lake storage. Then we will go one step further and discuss what it takes to build a lake that is not only transactional but also intelligent. With its rich set of table services and indexing schemes that exploit data locality, Hudi attempts to balance the tradeoff between incremental ingestion and query latency in a novel way. Today, Hudi already powers data lakes of varying scales, from a few terabytes to exabytes.
500+ sessions are now available on-demand from Data Platform Summit 2022, 2021 & 2020 at no cost.