Being Data Driven At Stripe With Trino And Iceberg

This episode explores how Stripe uses Trino and Iceberg in its data lakehouse architecture. After outgrowing Redshift, Stripe adopted Trino for read-heavy business analytics, while Spark handles data transformation. The discussion covers the challenges of scaling to petabytes of data and the concurrency problems caused by many simultaneous queries. It highlights the transition from Hive to Iceberg, where Iceberg's metadata management improves cost and efficiency over Hive's reliance on expensive S3 listing operations. Turning to infrastructure, the episode details the ongoing migration from Hive Metastore to the Iceberg REST catalog, which decouples the query engine from the metadata layer for greater flexibility and scalability. The interview concludes with how Stripe uses Trino and Iceberg for metadata analysis and platform observability, which supports efficient table deprecation and resource management and illustrates the value of a well-integrated data lakehouse ecosystem.

Outlines

Part 1: Introduction and Transition

Part 2: REST Catalog and Advanced Analytics

Part 3: Future, Challenges, and Community

Data Engineering Podcast

Part 1: Introduction and Transition

00:51 Introduction and Overview of Trino and Iceberg at Stripe

04:36 Bottlenecks and the Transition from Hive to Iceberg

09:07 Infrastructure Details and Migration from Hive Metastore

Part 2: REST Catalog and Advanced Analytics

12:02 Motivation for Developing the Python Iceberg REST Catalog

14:11 Leveraging Trino and Iceberg for Advanced Analytics and Observability

21:13 Use Cases Enabled by Enhanced Metadata and Query Analysis

24:10 Managing Multi-Tool Data Ecosystems and the Importance of the REST Catalog

Part 3: Future, Challenges, and Community

27:01 Future of the REST Catalog and Addressing Pain Points

30:59 Unexpected Uses of Trino and Iceberg and Community Collaboration

34:26 When Trino and Iceberg are Not the Right Choice and Addressing Latency Issues

40:40 Lessons Learned and Future Directions