The panel explores the challenges of deploying AI agents in production, balancing enthusiasm for the technology against the need to prevent real-world damage. The discussion highlights how agents are shifting software engineering: tasks that once required extensive engineering resources can now be handled by AI. Panelists emphasize narrowing the scope of agent projects initially, focusing on specific, manageable tasks to ensure reliability and build trust. They advocate eval-driven development, combining unit tests, integration tests, and production telemetry to monitor agent performance and keep agent behavior aligned with domain expertise. They also caution against granting agents direct database access, recommending read-only access to sandboxed environments instead.
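The eval-driven development the panelists describe can be sketched as a small harness that scores an agent on fixed cases and gates deployment on the pass rate. Everything here is illustrative: `run_agent` is a hypothetical stand-in for a real agent call, and the cases and threshold are placeholders.

```python
# Minimal sketch of eval-driven development for an agent.
# `run_agent`, the eval cases, and the 90% threshold are all assumptions,
# not the panel's actual setup.

def run_agent(prompt: str) -> str:
    # Stand-in for a real agent call (e.g. an LLM with tool access).
    return {"What is 2 + 2?": "4", "Capital of France?": "Paris"}.get(prompt, "")

EVAL_CASES = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
]

def run_evals(threshold: float = 0.9) -> bool:
    """Score the agent on fixed cases; deploy only if the pass rate clears the bar."""
    passed = sum(
        case["expected"].lower() in run_agent(case["prompt"]).lower()
        for case in EVAL_CASES
    )
    rate = passed / len(EVAL_CASES)
    print(f"pass rate: {rate:.0%}")
    return rate >= threshold

run_evals()
```

In practice the same harness runs as unit tests in CI, while production telemetry feeds failing conversations back into `EVAL_CASES` so the suite grows with observed failure modes.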
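The read-only-access recommendation can be illustrated with SQLite's URI read-only mode, where the connection handed to the agent's query tool physically cannot write. The file name and schema are illustrative, not from the panel.

```python
import sqlite3

# Sketch of "read-only access to a sandboxed environment": the agent's SQL
# tool only ever receives a connection opened with mode=ro, so any
# INSERT/UPDATE/DELETE it attempts fails at the database layer.

# Set up a throwaway sandbox copy of the data (names are illustrative).
setup = sqlite3.connect("sandbox.db")
setup.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER, name TEXT)")
setup.execute("DELETE FROM users")
setup.execute("INSERT INTO users VALUES (1, 'alice')")
setup.commit()
setup.close()

# This read-only handle is the only one the agent sees.
ro = sqlite3.connect("file:sandbox.db?mode=ro", uri=True)
print(ro.execute("SELECT name FROM users").fetchone()[0])  # reads succeed

try:
    ro.execute("DELETE FROM users")  # writes are rejected by SQLite itself
except sqlite3.OperationalError:
    print("write blocked")
ro.close()
```

The same pattern applies to production databases via a read-only role or replica; the point is that the restriction lives in the database, not in the agent's prompt.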