They say trust is earned, but how do you trust your AI?
On the Mechanical Orchard R&D team, we’ve been using AI as much as possible to help us understand and analyze complex legacy mainframe systems.[1] [2] I’ve asked DALL-E to illustrate our general approach:
It’s mostly right—sort of. It’s not great at spelling all the words, and step 3 traditionally comes after step 2, but you get the point. Or do you?
Therein lies the problem. While LLMs can dramatically assist us in our everyday work, we can only rely on them to the extent that they are, well… reliable. And AI can be notoriously unreliable.
We recently made a bespoke AI agent that can answer complex questions about the structure and statistics of the code in our client’s mainframe system. At first, it wasn’t very accurate or reliable. Here are some of the ways we fixed it so that we could trust what the agent says.
Through some of our prior AI efforts, we have a graph database of all of the jobs, source code, files, and database tables in the mainframe, including the relations between them and generated summaries of what they all do. It’s pretty big. Here is a view of the results from querying a subset of the nodes and relationships in our graph database (we use Memgraph):
We can write queries to find out the answers to interesting questions, like “Which source file is used by the most jobs?” It can be very powerful, but the querying syntax is a bit esoteric and the results can be hard to interpret.
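For a sense of what this looks like in practice, here is a minimal sketch in Python of running that "most-used source file" query against Memgraph over the Bolt protocol. The Job and SourceFile labels, the USES relationship, and the connection details are illustrative assumptions, not our actual schema:

# Sketch: find the source file used by the most jobs.
# Labels, relationship type, and credentials are illustrative.
from neo4j import GraphDatabase  # Memgraph speaks the Bolt protocol

QUERY = """
MATCH (j:Job)-[:USES]->(f:SourceFile)
RETURN f.name AS file, count(j) AS jobs
ORDER BY jobs DESC
LIMIT 1
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))
with driver.session() as session:
    for record in session.run(QUERY):
        print(f"{record['file']} is used by {record['jobs']} jobs")
driver.close()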
Now imagine that, instead of this, you have a chat interface where you can simply ask your questions in plain English and get the answers you need. This also makes it effortless to ask follow-up questions, without losing your train of thought by context-switching to how to formulate the proper query. By removing the friction of crafting queries, users can stay in the flow and focus on the business value.
Sounds like a job for RAG: Retrieval-Augmented Generation. RAG’s basic idea is to give the AI access to the graph database and let it figure out how to query it and interpret the results. It’s the bread and butter of the AI world, but it never quite seems to “just work.”
Let’s take a look at all of the places things could go wrong. Keep in mind: this is a multi-stage process and any errors in early stages propagate to later stages, compounding inaccuracies in the answer you are seeking.
We used several design patterns to give us the best chance of reliable results:
While we could have easily designed our agent as a single prompt, we decided to break it up into a series of steps in a chain, with a separate LLM for each step.
This architecture has three phases, shown in the diagram below.
The colored boxes in the diagram represent each phase’s own LLM, complete with its custom prompt, memory context, model parameters, and underlying model.
By using separate phases, we can tailor each LLM to focus on a specific task for the best output, evaluating each phase individually and fine-tuning specific phases’ LLMs as needed.
Each LLM is only given the context it needs, for example, its relevant graph schema and the output from previous steps in a templated prompt. This keeps things more focused, as compared to having to share the full conversation history and all context with each LLM.
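To make the pattern concrete, here is a minimal sketch of the chain in Python. The phase boundaries, prompt wording, schema snippet, and the llm helper are illustrative stand-ins rather than our actual implementation:

# Sketch of the three-phase chain: one narrowly scoped LLM call per phase.
GRAPH_SCHEMA = "(:Job)-[:USES]->(:SourceFile) ..."  # trimmed, illustrative

def llm(system: str, user: str) -> str:
    """Stand-in for a chat-completion call; wire up your provider here."""
    raise NotImplementedError

def understand(question: str) -> str:
    # Phase 1: restate the question in terms of the graph schema.
    return llm(
        f"Restate the user's question using only terms from this schema:\n{GRAPH_SCHEMA}",
        question,
    )

def generate_query(restated: str) -> str:
    # Phase 2: sees only the restated question and the schema, nothing else.
    return llm(
        f"Write a single Cypher query for this schema:\n{GRAPH_SCHEMA}",
        restated,
    )

def answer(question: str, query: str, results: str) -> str:
    # Phase 3: interpret the query results for the user.
    return llm(
        "Answer the user's question from the query results below.",
        f"Question: {question}\nQuery: {query}\nResults: {results}",
    )

Each function can be prompted, evaluated, and swapped independently, which is the point of the chain.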
If you can’t trust your AI, hopefully you can trust your users. We give users multiple opportunities to keep the LLM agent on the right track, such as confirming or refining its understanding of the question before moving on.[3]
We also explicitly show the generated query at the end, so that users can not only run it themselves if they want, but also eyeball it to see whether it looks right.
We have a few tricks up our sleeves for getting better results. Restating the question in terms of the graph schema makes generating a query more likely to succeed, compared to only basing it directly on the user’s wording of the question.[4] [5]
We use the same pattern when handling errors internally, asking for the error to be explained first, which helps the LLM correctly fix the broken query.[6]
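Here is a rough sketch of that explain-then-fix retry loop, reusing the llm stand-in from the chain sketch above; the prompts and retry budget are illustrative:

# Sketch of the self-correcting loop: explain the failure first, then repair.
def run_with_repair(query: str, run_query, llm, max_retries: int = 3) -> str:
    for _ in range(max_retries):
        try:
            return run_query(query)
        except Exception as err:
            # Asking for a diagnosis before a fix tends to produce better fixes.
            diagnosis = llm(
                "Explain why this Cypher query failed.",
                f"Query: {query}\nError: {err}",
            )
            query = llm(
                "Rewrite the query so it no longer fails. Return only Cypher.",
                f"Query: {query}\nDiagnosis: {diagnosis}",
            )
    raise RuntimeError("query could not be repaired within the retry budget")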
We also pulled out some decision-making logic from our main prompts and put it into an extra “hidden” LLM, whose sole job is to decide if it has enough context to answer a follow-up question or if it needs to run a new query.[7] This makes sure the flow loops back to the beginning instead of letting the agent get stuck in Phase 3 and hallucinate follow-up answers.
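A sketch of that router decision, again with illustrative prompts; keeping the output to a strict yes/no makes it cheap to run and easy to evaluate:

# Sketch of the "hidden" router LLM: decide whether a follow-up question
# needs a fresh query (loop back to Phase 1) or can be answered from the
# results already in context.
def needs_new_query(followup: str, prior_results: str, llm) -> bool:
    verdict = llm(
        "Answer strictly YES or NO: is a new database query required to "
        "answer this follow-up, given the results already in hand?",
        f"Follow-up: {followup}\nResults so far: {prior_results}",
    )
    return verdict.strip().upper().startswith("YES")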
We’ve had good results so far. Based on our team’s experience, the agent has reliably helped save time and provide fresh insights. We have also set up an automated evaluation loop to identify and improve any lapses in reliability.
As an example, here is an actual conversation that happened:
Agent
Hi, I am an agent with access to the mainframe dependency graph. Try asking me a question about mainframe jobs and I will do my best to answer.
You
I'd like to know which 3-letter prefixes our jobs have
Agent
To ensure I understand your request correctly, you want to find out the unique three-letter prefixes that are present in the 'name' property of 'Job' nodes in our database, correct?
You
Yes. Oh, also can you include the number of jobs per prefix?
Agent
To clarify, you're asking for the distinct three-letter prefixes from the 'name' property of 'Job' nodes and also, you'd like to know the count of jobs that start with each of these specific prefixes in our database. Is this correct?
You
that's right
Agent
Great, give me a minute to try to answer that for you, please wait...
Agent
The original question was:
What are the unique three-letter prefixes used in the 'name' property of 'Job' nodes in our database and what is the count of jobs that use each of these prefixes?
The query used to get this information was:
MATCH (j:"Job")
WITH substring(j.name, 0, 3) AS Prefix, count(*) as JobsCount
RETURN Prefix, JobsCount
ORDER BY JobsCount DESC
Based on the results of the query, the three-letter prefixes used in the 'name' property of 'Job' nodes and the count of jobs that use each of these prefixes are as follows:
KLM: Used by 343 jobs
NIT: Used by 170 jobs
UUT: Used by 80 jobs
ACS: Used by 57 jobs
BAK: Used by 45 jobs
GGA: Used by 17 jobs
IRM: Used by 5 jobs
The prefix "KLM" is the most commonly used, appearing in the names of 343 jobs.
Perhaps the takeaway here is that trusting an LLM is not that different from trusting a human, as long as you give it a way to be honest, communicate clearly, rectify its mistakes, and listen to your needs.
[1]: We experiment with the technology frequently, to see how we can weave Generative AI into our XP practices. Our CTO, Roberto Ostinelli, goes into how one might perform pair programming with an AI companion on our blog.
[2]: Our VP of Product, Dan Podsedly, explains how we use generative AI to help us with some of the heavy lifting with reverse engineering in our legacy modernization process.
[3]: “Flipped interaction” pattern – Lets the LLM ask the user questions. Flipped interactions often help capture the best context for the upcoming interactions.
[4]: “Question refinement” pattern – A common best practice in RAG applications; it lets the LLM restate your prompt in a format better suited for the upcoming retrieval step.
[5]: “Plan and execute” pattern – Improves generated results by first planning what to do, then executing the sub-tasks; inspired by research into agents like BabyAGI.
[6]: “Self-correcting” pattern – Gives the LLM a way to recover from an error by retrying a failed task with additional context.
[7]: “LLM Router” pattern – Lets the LLM decide the most appropriate task to take next based on the current context.