This episode explores the complexities of the modern streaming data ecosystem and the challenges companies face in using real-time data effectively for machine learning applications. Even with readily available streaming platforms like Kafka and Kinesis, the discussion highlights how hard it is to put that data to work in applications, owing to fragmented tooling and the specialized expertise required. More significantly, the conversation delves into the high cost of maintaining these systems, spanning both infrastructure and personnel, with examples of tens to hundreds of thousands of dollars spent on simple streaming pipelines. For instance, the guests discuss how aggressive checkpointing settings can significantly inflate costs, and how optimizing data schemas and storage choices (e.g., using S3 instead of local disks) can lead to substantial savings.

Pivoting to organizational aspects, the interview reveals communication gaps between data scientists, data engineers, and SRE teams, emphasizing the need for better collaboration to optimize pipelines. In contrast to the complexity of building and maintaining these systems in-house, the episode also highlights the emergence of managed services from vendors like Confluent and Databricks, which simplify operations but come with their own cost considerations.

Ultimately, the episode underscores the importance of choosing the right tool for the job, balancing cost, data freshness, and product velocity, and points to the potential of open standards like Iceberg to reduce vendor lock-in and improve interoperability in the future.
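The checkpointing-cost point can be made concrete with back-of-the-envelope arithmetic. The function and numbers below are illustrative assumptions, not figures from the episode; they model full (non-incremental) snapshots, where checkpoint write volume scales inversely with the checkpoint interval:

```python
def hourly_checkpoint_writes_gb(state_size_gb: float, interval_s: float) -> float:
    """Estimate GB written to checkpoint storage per hour, assuming each
    checkpoint snapshots the full pipeline state (no incremental checkpoints)."""
    checkpoints_per_hour = 3600 / interval_s
    return state_size_gb * checkpoints_per_hour

# Hypothetical pipeline with 50 GB of state:
aggressive = hourly_checkpoint_writes_gb(50, 10)   # 10 s interval -> 18,000 GB/hour
relaxed = hourly_checkpoint_writes_gb(50, 600)     # 10 min interval -> 300 GB/hour
print(aggressive, relaxed)
```

A 60x difference in write volume translates directly into storage and request costs, which is why checkpoint interval tuning (and incremental checkpointing, where the engine supports it) is one of the first levers teams reach for.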