Cracking the AI black box: Challenges and solutions in testing generative AI

Testing generative AI is one of the hardest problems in tech today. Outputs are fuzzy, grading is subjective, and risks around bias and safety are real. 

In this talk, you’ll learn how to systematically evaluate generative AI applications using modern approaches like LLM-as-a-judge and structured evaluation toolkits. You’ll see how to balance innovation with responsibility, ensuring AI systems are not only powerful but also trustworthy and ethical.

Who is this talk for?

This talk is for QA professionals, responsible AI adoption consultants, test automation engineers, AI/ML practitioners, product owners, and engineering leaders who are working with—or about to adopt—generative AI in their products.

What will attendees take away?

  • A clear understanding of why testing AI is fundamentally different from traditional software testing.

  • Practical strategies to grade and evaluate non-deterministic outputs.

  • Knowledge of AI Foundry and LLM-as-a-judge approaches.

  • A framework for responsible AI testing: balancing speed, scale, and ethics.

  • Insights into the future of QA in an AI-first world.
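The LLM-as-a-judge approach mentioned above can be sketched in a few lines: a judge model is given a grading rubric plus the question and answer, and its reply is parsed into a numeric score. This is a minimal illustration only, not the toolkit covered in the talk; `call_llm` is a hypothetical placeholder for whatever chat-completion client you use, stubbed here with a fixed reply so the sketch runs standalone.

```python
import re

# Rubric prompt sent to the judge model (illustrative wording).
RUBRIC = """You are an impartial judge. Score the ANSWER to the QUESTION
on a 1-5 scale for factual accuracy and helpfulness.
Reply with exactly one line: SCORE: <1-5>.

QUESTION: {question}
ANSWER: {answer}"""

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: swap in a real model client here.
    return "SCORE: 4"

def judge(question: str, answer: str) -> int:
    """Ask the judge model to grade an answer and parse the numeric score."""
    reply = call_llm(RUBRIC.format(question=question, answer=answer))
    match = re.search(r"SCORE:\s*([1-5])", reply)
    if match is None:
        raise ValueError(f"Unparseable judge reply: {reply!r}")
    return int(match.group(1))

print(judge("What is 2+2?", "4"))  # prints 4 with the stubbed judge
```

In practice you would run the judge over a whole test set, aggregate scores, and track them across model or prompt changes; constraining the judge to a strict output format, as the rubric does here, is what makes the scores machine-parseable.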

Deep Pandey

He/him
Senior Consultant in AI @ Planit

Deep Pandey is a technology leader with nearly 20 years’ experience in Quality Engineering and Solution Architecture. He has led AI-driven process transformation in banking and government, using Lean methods to modernize operations, enhance delivery, and scale quality practices.

Deep specialises in AI applications, LLM evaluation, and building structured generative AI pipelines. Known for bridging quality, architecture, and innovation, he combines operational excellence with responsible AI adoption to deliver scalable, resilient, and future-ready solutions.
