
WORKSHOPS


APRIL 10 WORKSHOPS

Build Your Own Data Lakehouse (Starburst)

11:00am – 12:30pm
Room 404

The data lakehouse architecture has taken the analytics world by storm, bringing critical data warehouse-like capabilities to the cloud data lake and enabling possibilities beyond the traditional data warehouse or legacy data lake architectures. To get there, you need to select the key component that powers your lakehouse: a query engine. In this workshop, you will build and manage an open data lakehouse architecture powered by the Trino query engine to support your growing analytics needs. Trino is an open-source, highly parallel, distributed query engine built from the ground up at Facebook for efficient, low-latency analytics. You will combine the query engine with a modern table format, low-cost object storage, and proper security and governance to build out a model data lakehouse. In this session, you will configure and build your own data lakehouse by connecting to multiple data sources, transforming your data, and producing a final output ready to be used by downstream consumers.
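
To give a flavor of the kind of federated query the lab builds toward, here is a minimal sketch using the open-source trino Python client. The host, catalog names (an Iceberg catalog over object storage plus a PostgreSQL source), and table names are illustrative placeholders, not the lab's actual environment.

# Minimal sketch: join a data-lake table with an operational source through Trino.
# Host, catalogs, schemas, and tables below are hypothetical placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",    # assumed coordinator address
    port=8080,
    user="workshop-user",
    catalog="lakehouse",         # assumed Iceberg catalog over object storage
    schema="analytics",
)

cur = conn.cursor()
cur.execute("""
    SELECT o.order_id, o.order_total, c.customer_name
    FROM lakehouse.analytics.orders AS o          -- table in the data lake
    JOIN postgres.public.customers AS c           -- table in an operational database
      ON o.customer_id = c.customer_id
    WHERE o.order_date >= DATE '2024-01-01'
""")
for row in cur.fetchmany(10):
    print(row)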


Understanding Modern Table Formats (Starburst)

3:00pm – 4:30pm
Room 404

Table formats provide an abstraction layer that makes it easier to interact with the files in a data lake by defining schemas, columns, and data types. Limitations of Apache Hive, the first-generation table format, drove the creation of modern open-source table formats such as Apache Iceberg, Delta Lake, and Apache Hudi. These formats store their metadata on the data lake alongside the actual data and provide ACID transactions for data transformation statements, enabling a full-featured data lakehouse. Using modern table formats in your architectural design allows for data warehouse-like functionality on the data lake, improves performance and scalability, and adds new versioning features such as time travel and rollbacks. In addition to providing a base knowledge of the leading modern table formats, the workshop includes a hands-on exercise with Apache Iceberg in your data lakehouse architecture so you can see firsthand the benefits that modern table formats offer.
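
As a rough illustration of the versioning features described above, the sketch below issues Iceberg-aware SQL through a Python Trino connection. The catalog, table, and snapshot values are hypothetical, and the exact time-travel and rollback syntax can vary between engines and versions.

# Illustrative only: Iceberg time travel and rollback through Trino SQL.
# Catalog, schema, table, and snapshot id are hypothetical placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="workshop-user",
    catalog="iceberg", schema="demo",
)
cur = conn.cursor()

# List the snapshots (versions) an Iceberg table has accumulated.
cur.execute('SELECT snapshot_id, committed_at FROM "orders$snapshots" ORDER BY committed_at')
snapshots = cur.fetchall()

# Time travel: read the table as of its earliest snapshot.
cur.execute(f"SELECT count(*) FROM orders FOR VERSION AS OF {snapshots[0][0]}")
print(cur.fetchone())

# Roll the table back to that snapshot (procedure name and syntax may differ by engine version).
cur.execute(f"ALTER TABLE orders EXECUTE rollback_to_snapshot({snapshots[0][0]})")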


APRIL 11 WORKSHOPS

Build SQL Data Pipelines and Data Products (Starburst)

11:00am – 12:30pm
Room 404

Data products are curated collections of purpose-built, reusable data sets with business-approved metadata, designed to solve specific, targeted business problems. The best part about data products? They can be built with the standard ANSI SQL we all know and love. In this hands-on, instructor-led lab, you will build federated SQL data pipelines within your data lake and create curated data products ready to be distributed to downstream consumers. You will experience a full end-to-end SQL data pipeline, from data discovery and data transformation to the creation of data products, and implement the security and governance required to publish those data products through role-based access control.
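
For context on what such a pipeline step can look like, here is a hedged sketch that lands a curated table in the lake with a federated CREATE TABLE AS SELECT and then grants access to a role. All catalog, schema, table, and role names are invented for illustration, and the grant syntax depends on the governance layer in use.

# Sketch of a federated SQL pipeline step and a simple role-based access grant.
# All catalog, schema, table, and role names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="workshop-user",
    catalog="lakehouse", schema="curated",
)
cur = conn.cursor()

# Transform and land a curated, reusable data set in the data lake.
cur.execute("""
    CREATE TABLE lakehouse.curated.daily_revenue AS
    SELECT o.order_date, c.region, sum(o.order_total) AS revenue
    FROM postgres.public.orders AS o
    JOIN salesforce.crm.customers AS c ON o.customer_id = c.customer_id
    GROUP BY o.order_date, c.region
""")

# Publish the data product to downstream consumers via role-based access control.
cur.execute("GRANT SELECT ON lakehouse.curated.daily_revenue TO ROLE analyst")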


Build Python Data Pipelines and Data Products (Starburst)

3:00pm – 4:30pm
Room 404

Data products are curated collections of purpose-built, reusable data sets with business-approved metadata, designed to solve specific, targeted business problems. They can be built with SQL, but data engineers can also construct them with Python: the PyStarburst API lets data engineers apply their existing DataFrame API skills to surface data products as well. In this hands-on, instructor-led lab, you will build Python-based federated data pipelines for your data lake and create curated data products ready to be distributed to downstream consumers. After using the data discovery features available in Starburst Galaxy, participants will build an end-to-end programmatic data pipeline covering data transformations and the creation of data products.
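
As a rough sketch of the DataFrame-style workflow, the example below follows PyStarburst's Snowpark-style conventions. The connection settings, catalog and table names, and exact method names are assumptions made to illustrate the shape of the code, not the lab's actual environment; consult the lab guide for the real setup.

# Hedged sketch of a DataFrame-style pipeline with PyStarburst.
# Connection details, catalog/table names, and the exact API surface are
# assumptions based on PyStarburst's Snowpark-style conventions.
import trino
from pystarburst import Session
from pystarburst.functions import col, sum as sum_

session = Session.builder.configs({
    "host": "galaxy.example.com",                       # assumed Galaxy cluster host
    "port": 443,
    "http_scheme": "https",
    "auth": trino.auth.BasicAuthentication("user", "password"),
}).create()

orders = session.table("lakehouse.raw.orders")
customers = session.table("postgres.public.customers")

# Federated transform: join, filter, and aggregate, all pushed down to the engine.
daily_revenue = (
    orders.join(customers, orders["customer_id"] == customers["customer_id"])
          .filter(col("order_date") >= "2024-01-01")
          .group_by("order_date", "region")
          .agg(sum_("order_total").alias("revenue"))
)

# Materialize the curated data product for downstream consumers.
daily_revenue.write.save_as_table("lakehouse.curated.daily_revenue")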



Building an Apache Iceberg Data Lakehouse with Nessie/Dremio on Your Laptop (Dremio)

Presenter:  Alex Merced - Developer Advocate, Dremio
12:30pm – 1:30pm 
Room 406

In this hands-on workshop, participants will embark on a journey to construct their very own data lakehouse platform using their laptops. 

The workshop introduces three pivotal tools in the data lakehouse architecture, Dremio, Nessie, and Apache Iceberg, and guides participants through their setup and use. Each of these tools plays a crucial role in combining the flexibility of data lakes with the efficiency and ease of use of data warehouses, aiming to simplify and economize data management. Participants will start by setting up a Docker environment that runs all the necessary services: a notebook server, Nessie for catalog tracking with Git-like versioning, MinIO as an S3-compatible storage layer, and Dremio as the core lakehouse platform.

The workshop will provide a practical, step-by-step guide to federating data sources, organizing and documenting data, and performing queries with Dremio; tracking table changes and branching with Nessie; and creating, querying, and managing Apache Iceberg tables for an ACID-compliant data lakehouse. Prerequisites for the workshop include having Docker installed on your laptop. 

Attendees will be taken through the process of creating a docker-compose file to spin up the required services, configuring Dremio to connect with Nessie and MinIO, and finally executing SQL queries to manipulate and query data within their lakehouse. This immersive session aims not just to educate but to empower attendees with the knowledge and tools needed to experiment with and implement their own data lakehouse solutions. By the end of the workshop, participants will have a functional data lakehouse environment on their laptops, enabling them to explore further and apply what they have learned to real-world scenarios.
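
To preview what querying the finished environment can look like, here is a hedged sketch that sends a SQL query to Dremio's Arrow Flight endpoint from Python. The port, credentials, and table path are placeholders for a typical local docker-compose setup and may differ from the lab's configuration.

# Hedged sketch: run a SQL query against the laptop lakehouse through Dremio's
# Arrow Flight endpoint. Port, credentials, and the table path are placeholders
# for a local docker-compose environment.
from pyarrow import flight

client = flight.FlightClient("grpc://localhost:32010")   # assumed Dremio Flight port
token = client.authenticate_basic_token("dremio_user", "dremio_password")
options = flight.FlightCallOptions(headers=[token])

# Query an Apache Iceberg table registered in the Nessie-backed catalog.
descriptor = flight.FlightDescriptor.for_command(
    "SELECT * FROM nessie.demo.orders LIMIT 10"
)
info = client.get_flight_info(descriptor, options)
reader = client.do_get(info.endpoints[0].ticket, options)
print(reader.read_all().to_pandas())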

Whether you're looking to improve your data management strategies or are simply curious about the data lakehouse architecture, this workshop will provide a solid foundation and practical experience.