Session Title: Building Stream Pipeline With Azure Databricks Using Structured Stream And Delta Lake
Speaker: Mohit Batra
Abstract: Modern data pipelines do not only process data in batches; they often include streaming data that must be processed in real time. Requirements frequently go further still: the processed data may then be consumed by downstream batch pipelines as well as by other streaming pipelines. This raises several challenges.
The first is to use a common platform to build unified batch and streaming pipelines. The second is to store the processed data in a data lake while ensuring that downstream systems can consume it reliably, without consistency issues. The third is to have a unified environment for development and deployment. This session addresses all three.
This session will take you through three components:
Structured Streaming: the stream processing component of Apache Spark, used to build reliable streaming pipelines.
Delta Lake: an open-source storage layer that brings ACID transactions and reliability to your data lakes.
Azure Databricks: a unified analytics platform that runs on Azure, handles the infrastructure, and provides a collaborative environment for development.
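As a rough illustration of how these components can fit together, the sketch below streams incoming files into a Delta table and then reads that same table both as a batch and as a stream. It is not taken from the session itself: the function names, paths, and column names are hypothetical, and it assumes an existing SparkSession (`spark`) on a cluster where Delta Lake is available, as is the case on Azure Databricks.

```python
# Minimal sketch, not the speaker's code. Assumes `spark` is an existing
# SparkSession with Delta Lake available. Paths and columns are hypothetical.

# Hypothetical event schema, written as a Spark DDL string.
EVENT_SCHEMA = "device_id STRING, reading DOUBLE, event_time TIMESTAMP"


def start_ingest(spark, source_path, table_path, checkpoint_path):
    """Start a streaming query that appends JSON files to a Delta table."""
    events = (
        spark.readStream
        .format("json")
        .schema(EVENT_SCHEMA)  # streaming file sources need an explicit schema
        .load(source_path)
    )
    return (
        events.writeStream
        .format("delta")
        .outputMode("append")
        # The checkpoint lets the query resume where it left off after a restart.
        .option("checkpointLocation", checkpoint_path)
        .start(table_path)
    )


def read_as_batch(spark, table_path):
    """Downstream batch consumer: read the current contents of the table."""
    return spark.read.format("delta").load(table_path)


def read_as_stream(spark, table_path):
    """Downstream streaming consumer: read the same table as a stream."""
    return spark.readStream.format("delta").load(table_path)
```

In a notebook, one might call `start_ingest(spark, "/mnt/raw/events", "/mnt/delta/events", "/mnt/chk/events")` and then pass the same table path to either reader; all three paths here are placeholders.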
We will first talk about the end-to-end architecture. Then we will spend time understanding these three components and the challenges they address, and see them in action in demos. Finally, you will see demos of how they come together to make it much easier to build data pipelines for many common scenarios.