LLM-as-judge — when the judge is another model (step 8/9) · eval-driven ai development

Build a rubric-style eval suite. Write run_judge_suite(cases) that:

Takes a list of dicts, each shaped {"question": str, "answer": str, "rubric": str}.
For each case, calls judge_rubric(question, answer, rubric). The judge returns {"passed": bool, "critique": str}.
Counts how many cases pass.
Returns a dict {"total": <number of cases>, "passed": <count of passes>, "failed": <count of fails, i.e. total minus passed>, "pass_rate": <passed / total, 0.0-1.0, rounded to 2 places>}.

The script will run a 4-case suite. Expected output:

total=4 passed=2 failed=2 pass_rate=0.5
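
One possible shape for the solution is sketched below. The lesson's real judge_rubric calls a model and is not shown here, so this sketch swaps in a hypothetical stub that passes a case when the rubric text appears in the answer. The four sample cases are also made up for illustration, and returning a pass_rate of 0.0 for an empty suite is an assumption the exercise does not specify.

```python
def judge_rubric(question, answer, rubric):
    # Hypothetical stand-in for the real judge model call (an assumption):
    # a case passes when the rubric text appears in the answer.
    passed = rubric.lower() in answer.lower()
    critique = "rubric phrase found" if passed else "rubric phrase missing"
    return {"passed": passed, "critique": critique}


def run_judge_suite(cases):
    total = len(cases)
    # Ask the judge about every case and count the verdicts that passed.
    passed = sum(
        1
        for case in cases
        if judge_rubric(case["question"], case["answer"], case["rubric"])["passed"]
    )
    # Assumption: an empty suite reports 0.0 instead of dividing by zero.
    pass_rate = round(passed / total, 2) if total else 0.0
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": pass_rate,
    }


# Illustrative 4-case suite (made-up data): two answers satisfy the stub judge.
cases = [
    {"question": "Capital of France?", "answer": "Paris is the capital.", "rubric": "Paris"},
    {"question": "2 + 2?", "answer": "The result is 5.", "rubric": "4"},
    {"question": "Boiling point of water at sea level?", "answer": "100 degrees Celsius.", "rubric": "100"},
    {"question": "Largest planet?", "answer": "Saturn.", "rubric": "Jupiter"},
]

result = run_judge_suite(cases)
print(
    f"total={result['total']} passed={result['passed']} "
    f"failed={result['failed']} pass_rate={result['pass_rate']}"
)
```

Deriving failed as total minus passed keeps the three counts consistent by construction, so only the judge's "passed" field needs to be read for each case.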
