This episode explores the complexities of the modern streaming data ecosystem and the challenges companies face in using real-time data effectively for machine learning applications. Even with readily available streaming platforms like Kafka and Kinesis, the discussion highlights how hard it is to put that data to work in applications, owing to fragmented tooling and the specialized expertise required. More significantly, the conversation delves into the high cost of maintaining these systems, spanning both infrastructure and personnel, with examples of tens to hundreds of thousands of dollars spent on simple streaming pipelines. For instance, the guests discuss how aggressive checkpointing settings can significantly inflate costs, and how optimizing data schemas and storage choices (e.g., using S3 instead of local disks) can lead to substantial savings.

Pivoting to organizational aspects, the interview reveals communication gaps between data scientists, data engineers, and SRE teams, emphasizing the need for better collaboration to optimize pipelines. In contrast to the complexity of building and maintaining these systems in-house, the episode also highlights the emergence of managed services from vendors like Confluent and Databricks, which simplify operations but come with their own cost considerations.

Ultimately, the episode underscores the importance of choosing the right tool for the job, balancing cost, data freshness, and product velocity, and points to the potential of open standards like Iceberg to reduce vendor lock-in and improve interoperability in the future.
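The checkpointing-cost point can be made concrete with back-of-the-envelope arithmetic. The function and numbers below are illustrative assumptions, not figures from the episode; they model full (non-incremental) snapshots, where checkpoint write volume scales inversely with the checkpoint interval:

```python
def hourly_checkpoint_writes_gb(state_size_gb: float, interval_s: float) -> float:
    """Estimate GB written to checkpoint storage per hour, assuming each
    checkpoint snapshots the full pipeline state (no incremental checkpoints)."""
    checkpoints_per_hour = 3600 / interval_s
    return state_size_gb * checkpoints_per_hour

# Hypothetical pipeline with 50 GB of state:
aggressive = hourly_checkpoint_writes_gb(50, 10)   # 10 s interval -> 18,000 GB/hour
relaxed = hourly_checkpoint_writes_gb(50, 600)     # 10 min interval -> 300 GB/hour
print(aggressive, relaxed)
```

A 60x difference in write volume translates directly into storage and request costs, which is why checkpoint interval tuning (and incremental checkpointing, where the engine supports it) is one of the first levers teams reach for.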