Machine learning systems are notoriously difficult to debug for a few reasons:
Stochasticity: Because results are not repeatable, you must run experiments with multiple trials to determine the true behavior: the mean and variance of the results.
Attribution: When piping emails, documents, and bank transaction details into a machine learning system, it becomes very difficult to attribute importance to each factor. For example: would the system still recognize the email as important without the attached document? Without the related bank transaction information?
The combination of these two factors leads to a lack of interpretability: it is very difficult to determine why an ML system acted the way it did. The cause could be information-theoretic incompleteness (missing inputs), incorrect training, or simply a random wrong answer.
These same problems apply to LLM-based systems: (1) LLMs may randomly answer incorrectly or even hallucinate, and (2) LLMs integrate many different input factors from prompts, tools, and input files.
Truly solving (1) is an open research problem, and solving (2) may not be feasible without imposing strong constraints on the system. However, it turns out that achieving interpretability reliably enough to be useful is quite simple.
Having LLMs Explain Their Reasoning
{
  "result": "str",
  "reasoning": "str"
}
Adding a reasoning field to the structured output of any LLM call works surprisingly well. There are theoretical limits, such as the possibility of deception or hallucination, but in practice it is often enough to read the model's reasoning and identify missing context or prompts that need to be updated.
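As a rough sketch of this pattern in Python, assuming the OpenAI SDK's structured-output support and Pydantic (the model name, prompt, and class names below are illustrative placeholders, not Bridger's actual pipeline code):

# Sketch: request a result plus the model's reasoning from a single LLM call.
# Assumes the OpenAI Python SDK (>= 1.40) and Pydantic; all names are placeholders.
from openai import OpenAI
from pydantic import BaseModel

class AnswerWithReasoning(BaseModel):
    result: str
    reasoning: str  # the model's explanation of how it arrived at `result`

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o",  # placeholder; any model with structured-output support
    messages=[
        {"role": "system", "content": "Answer the question and explain your reasoning."},
        {"role": "user", "content": "..."},  # the actual question and context go here
    ],
    response_format=AnswerWithReasoning,
)

answer = completion.choices[0].message.parsed
print(answer.result)
print(answer.reasoning)  # read this first when an answer looks wrong

Because the reasoning lives in the same structured response as the result, it can be logged and inspected without changing anything else in the pipeline.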
# Prompt
Which states did the client file taxes in?
# Result
{
"states": [
"NY",
"NJ"
]
}
# Reasoning
The <redacted> attachment, explicitly identified as containing the most recent available tax forms, shows filings in both New York (IT-204) and New Jersey (NJ-1065) for the <redacted> tax year. While other financial statements indicate ongoing tax obligations in New York through <redacted>, these are listed as 'payables' and the <redacted> returns are the latest explicitly identified as filed tax returns.
This is an example from a real AI pipeline within Bridger. The reasoning field shows clearly why the model concluded that the client filed in NY and NJ.
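A hypothetical schema for the question above might look like the following; the class and field names are illustrative, not the actual Bridger implementation, but they produce the same output shape shown in the example.

# Hypothetical schema for the tax-filing question; names are illustrative.
from pydantic import BaseModel, Field

class FiledStates(BaseModel):
    states: list[str] = Field(description="Two-letter codes of the states with filed returns")
    reasoning: str = Field(description="Explanation citing the source documents")

# The parsed output serializes to the same shape as the example above, e.g.:
# {"states": ["NY", "NJ"], "reasoning": "The ... attachment ... shows filings ..."}

A field description that nudges the model to cite its source documents tends to make the reasoning easy for a human to check, which matters for the next point.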
Building User Trust
In addition to debugging, model reasoning can be critical for building user trust. Using the above example: a model telling an accountant that the client filed in NY and NJ isn't actually that helpful on its own, since the answer is unverifiable. The accountant must either (1) blindly trust the model or (2) do all the work of gathering the answer themselves.
In contrast, having the reasoning makes it much faster for a human to verify that the logic is correct. In this example, an accountant can open the attachment, confirm that it is the right source of information for the question, and quickly check the extracted states.