Ask HN: How to unit test AI responses?
I've been tasked with building a customer support chat. The AI should be trained on company docs. How can I be sure the AI won't hallucinate a bad response to a customer?
You need evals. I found this post extremely helpful in building out a set of evals for my AI product: https://hamel.dev/blog/posts/evals/
+1 to evals
https://github.com/anthropics/courses/tree/master/prompt_eva...
https://cookbook.openai.com/examples/evaluation/getting_star...
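If it helps to see the shape of it: a minimal eval harness is just a loop over a golden set of questions with checks on the answers. Sketch below, assuming an OpenAI-compatible Python client; the model name and test cases are made up, and the links above cover proper grading.

    # Minimal eval sketch: run each case through the model and check that the
    # answer mentions the facts a correct response must contain.
    from openai import OpenAI

    client = OpenAI()

    GOLDEN_SET = [
        {
            "question": "How do I reset my password?",
            "must_mention": ["Settings", "Reset password"],
        },
    ]

    def ask(question: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[
                {"role": "system", "content": "Answer using only the company docs."},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content

    def run_evals() -> None:
        failures = 0
        for case in GOLDEN_SET:
            answer = ask(case["question"])
            missing = [fact for fact in case["must_mention"] if fact not in answer]
            if missing:
                failures += 1
                print(f"FAIL: {case['question']} -> missing {missing}")
        print(f"{failures} failures out of {len(GOLDEN_SET)} cases")

    if __name__ == "__main__":
        run_evals()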
Your question is very general. "A customer support app" can mean many things, from an FAQ to a case management interface.
If you absolutely cannot tolerate "bad" answers, only use the LLM in the front end to map the user's input onto a set of templated questions with templated answers. In the worst case, the user gets the right answer to the wrong question.
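Concretely, that can look like the sketch below: the LLM only picks an intent from a fixed list, and the customer only ever sees a pre-approved template. The client, model name, intents, and answers are all placeholder assumptions.

    # Sketch: the model classifies the message into a fixed set of intents;
    # the text shown to the customer is always a pre-approved template, so the
    # worst case is the wrong template, never a hallucinated answer.
    from openai import OpenAI

    client = OpenAI()

    TEMPLATED_ANSWERS = {
        "reset_password": "To reset your password, go to Settings > Security > Reset password.",
        "billing_question": "You can view and download invoices under Account > Billing.",
        "unknown": "I'm not sure I understood. Could you rephrase, or contact support?",
    }

    def classify_intent(user_message: str) -> str:
        prompt = (
            "Classify the customer's message into exactly one of these intents: "
            + ", ".join(TEMPLATED_ANSWERS)
            + ". Reply with the intent name only.\n\nMessage: "
            + user_message
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[{"role": "user", "content": prompt}],
        )
        intent = response.choices[0].message.content.strip()
        return intent if intent in TEMPLATED_ANSWERS else "unknown"

    def answer(user_message: str) -> str:
        return TEMPLATED_ANSWERS[classify_intent(user_message)]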
This is a good reply
To ensure the AI doesn't hallucinate bad responses, focus on the following steps:
Quality Training Data: Train the model on high-quality, up-to-date company documents, ensuring it reflects accurate information.
Fine-tuning: Regularly fine-tune the model on specific support use cases and real customer interactions.
Feedback Loops: Implement a system for human oversight where support agents can review and correct the AI's responses.
Context Awareness: Design the system to ask clarifying questions when it is uncertain rather than stating false information outright (see the sketch after this list).
Monitoring: Continuously monitor and evaluate the AI’s performance to catch and address any issues promptly.
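For the feedback-loop and context-awareness points, one rough pattern is to gate any answer that isn't clearly grounded in the retrieved docs behind a human agent. Sketch below; all names are placeholders for your own retrieval and ticketing code, and the grounding check is deliberately crude.

    # Sketch of a review/escalation gate: if the model's draft isn't grounded
    # in the retrieved docs, a human agent handles it instead.
    REVIEW_QUEUE: list[dict] = []  # stand-in for your ticketing system

    def is_grounded(answer: str, sources: list[str]) -> bool:
        # Crude check: the answer must quote at least one retrieved snippet.
        # Real systems use citation checks or an entailment/judge model.
        return any(snippet and snippet in answer for snippet in sources)

    def handle(question: str, draft_answer: str, sources: list[str]) -> str:
        if not is_grounded(draft_answer, sources):
            REVIEW_QUEUE.append({"question": question, "draft": draft_answer})
            return "I've passed your question to a support agent who will follow up."
        return draft_answer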
You can’t (practically) unit test LLM responses, at least not in the traditional sense. Instead, you do runtime validation with a technique called “LLM as judge.”
This involves having another prompt, and possibly another model, evaluate the quality of the first response. Then you write your code to try again in a loop and raise an alert if it keeps failing.
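Roughly like the sketch below, assuming an OpenAI-compatible client; the judge rubric, model names, and retry limit are placeholders.

    # LLM-as-judge sketch: a second prompt grades the first answer, and we
    # retry a few times before alerting a human.
    import logging
    from openai import OpenAI

    client = OpenAI()

    def generate_answer(question: str, context: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[
                {"role": "system", "content": "Answer only from these docs:\n" + context},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content

    def judge(question: str, context: str, answer: str) -> bool:
        verdict = client.chat.completions.create(
            model="gpt-4o",  # can be a different, often stronger, model
            messages=[{
                "role": "user",
                "content": (
                    "Docs:\n" + context
                    + "\n\nQuestion: " + question
                    + "\n\nAnswer: " + answer
                    + "\n\nIs the answer fully supported by the docs? Reply PASS or FAIL."
                ),
            }],
        )
        return "PASS" in verdict.choices[0].message.content.upper()

    def answer_with_validation(question: str, context: str, max_retries: int = 3) -> str:
        for _ in range(max_retries):
            answer = generate_answer(question, context)
            if judge(question, context, answer):
                return answer
        logging.error("Judge rejected all %d attempts for: %s", max_retries, question)
        return "Sorry, I couldn't find a reliable answer. A support agent will follow up."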
You don't. You have to separate concerns between deterministic and stochastic code inputs/outputs. You need evals for the stochastic parts, and mocking wherever the stochastic output is consumed by deterministic code.
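For the mocking half, the unit tests never call the model: the deterministic wrapper is tested with the LLM call patched to a fixed string. A pytest-style sketch with made-up names to show the shape of it:

    # Sketch: pin the stochastic part to a fixed string so the test of the
    # deterministic code around it is fast and repeatable.
    from unittest.mock import patch

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("real model call lives here")

    def build_ticket_reply(ticket_id: str, question: str) -> str:
        # Deterministic code: prompt construction and reply formatting.
        answer = call_llm(f"Answer this support question: {question}")
        return f"[Ticket {ticket_id}] {answer}"

    def test_reply_includes_ticket_id():
        with patch(__name__ + ".call_llm", return_value="You can reset it in Settings."):
            reply = build_ticket_reply("T-1234", "How do I reset my password?")
        assert "T-1234" in reply
        assert "Settings" in reply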