Standout Ideas from Lesson 7 of the AI Evals Course

AI Evals
In this blog post, I highlight standout ideas from the seventh lesson of the AI evals course by Hamel Husain and Shreya Shankar.
Author: Vishal Bakshi

Published: August 12, 2025

Lesson 7: Interfaces for Human Review

Idea 1: Custom UIs = 10x review throughput compared to reviewing in a spreadsheet. This is because custom UIs allow a domain-aware view (emails structured like your inbox instead of a string of text) and hotkeys for navigation or one-click tags, and a prototype takes only about an hour to build nowadays. A middle ground between spreadsheets and custom UIs: a Jupyter notebook (a pseudo-interface), especially with the _repr_html_ method (see the sketch below).
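
To make the notebook idea concrete, here is a minimal sketch of how _repr_html_ turns a plain object into a pseudo-interface. The Trace class and its fields are hypothetical, purely for illustration; the mechanism (Jupyter calling _repr_html_ to render a cell's output) is standard.

```python
# Defining _repr_html_ on a class makes Jupyter render rich HTML
# instead of a raw string, e.g. an email-like card for each trace.
import html

class Trace:
    def __init__(self, subject: str, body: str, label: str = ""):
        self.subject = subject
        self.body = body
        self.label = label

    def _repr_html_(self) -> str:
        # Jupyter calls this automatically when the object is a cell's output.
        return f"""
        <div style="border:1px solid #ccc; padding:8px; font-family:sans-serif;">
          <div style="font-weight:bold;">{html.escape(self.subject)}</div>
          <div style="white-space:pre-wrap;">{html.escape(self.body)}</div>
          <div style="color:#888;">label: {html.escape(self.label or "unlabeled")}</div>
        </div>
        """

# In a notebook cell, this renders like an email card, not a string:
Trace("Re: refund request", "Hi team,\nCould you process my refund?\n...")
```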

Idea 2: HCI Principles for UIs (Nielsen, 1994). Visibility of status (let your user know where they are); recognition over recall (assign tags instead of free-form text in the second round of error analysis and beyond); match the real world (display traces in the native end-user form, which catches errors only apparent in that form); user control (pass/fail with a single keypress, undo, tag selection with number keys, a "defer" option for uncertainty; the goal is to get the reviewer into a flow state); minimalist first (expand on demand). Add a progress bar whenever you're making a user wait for something. Overall principle: reduce friction. A toy review loop applying several of these principles follows.
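
Here is a rough, terminal-based sketch (not the course's code) of what those principles look like in a review loop: one-key pass/fail, number keys for tags, "d" to defer, "u" to undo, and a simple progress indicator. The tag names and trace data are made up.

```python
# Hypothetical tag set for a second-round error-analysis pass.
TAGS = {"1": "hallucination", "2": "formatting", "3": "tone"}

def review(traces: list[str]) -> list[dict]:
    labels: list[dict] = []
    i = 0
    while i < len(traces):
        print(f"\n[{i + 1}/{len(traces)}] {traces[i]}")   # visibility of status
        key = input("p=pass f=fail 1-3=tag d=defer u=undo: ").strip().lower()
        if key == "u" and labels:                          # user control: undo
            labels.pop()
            i -= 1
        elif key in ("p", "f", "d") or key in TAGS:
            labels.append({"trace": traces[i],
                           "label": TAGS.get(key, {"p": "pass", "f": "fail",
                                                   "d": "defer"}.get(key))})
            i += 1
        # any other key: re-prompt without advancing (no penalty for typos)
    return labels

if __name__ == "__main__":
    print(review(["Email draft #1 ...", "Email draft #2 ..."]))
```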

Idea 3: Nerd-snipe your features. Shreya implemented a highlight feature: on the backend, their app looks for semantic or keyword similarities with previously failed samples and highlights those words in the current example's display, flagging common issues for easier identification. Super cool. Another similar example: batch-label similar traces after clustering to wipe out repeat bugs. I had never considered integrating machine intelligence into error analysis before this! A rough sketch of the keyword-overlap version is below.
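
The following is a rough sketch of the keyword-overlap half of that idea (the semantic version would use embeddings instead); all function names here are my own, hypothetical ones, not Shreya's implementation.

```python
import re

def failure_vocabulary(failed_traces: list[str]) -> set[str]:
    """Collect words that appeared in previously failed samples."""
    words: set[str] = set()
    for trace in failed_traces:
        words.update(re.findall(r"[a-z']+", trace.lower()))
    return words

def highlight(current: str, failed_vocab: set[str]) -> str:
    """Wrap words seen in past failures in <mark> tags for the review UI."""
    def mark(m: re.Match) -> str:
        word = m.group(0)
        return f"<mark>{word}</mark>" if word.lower() in failed_vocab else word
    return re.sub(r"[A-Za-z']+", mark, current)

vocab = failure_vocabulary(["refund was never processed", "refund denied twice"])
print(highlight("Customer asks why the refund is still pending", vocab))
# -> words like "refund" come back wrapped in <mark> tags
```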

Idea 4: Criteria drift happens! Reviewers' definitions change over time, so keep rubrics and labels editable. What you consider acceptable or unacceptable shifts as you review real traces. Additionally, humans' understanding of LLM capabilities also evolves over time (i.e., humans align with LLMs as LLMs align with humans).