grow more complex, conventional logging and monitoring fall short. What teams really need is observability: the ability to trace agent decisions, evaluate response quality automatically, and detect drift over time, without writing and maintaining large amounts of custom evaluation and telemetry code.
Teams therefore need to adopt the right observability platform so they can focus on the core task of building and improving their agents' orchestration, and they need to integrate their application with that platform with minimal overhead to their functional code. In this article, I'll demonstrate how to set up an open-source AI observability platform to do the following using a minimal-code approach:
- LLM-as-a-Judge: Configure pre-built evaluators to score responses for Correctness, Relevance, Hallucination and more. Display scores across runs with detailed logs and analytics.
- Testing at scale: Set up datasets to store regression test cases for measuring accuracy against expected ground-truth responses. Proactively detect LLM and agent drift.
- MELT data: Track metrics (latency, token usage, model drift), events (API calls, LLM calls, tool usage), and logs (user interaction, tool execution, agent decision making) together with detailed traces, all without writing detailed telemetry and instrumentation code.
We will be using Langfuse for observability. It is open-source, framework-agnostic, and works with popular orchestration frameworks and LLM providers.
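Before diving into the full example, here is a minimal sketch of how little wiring the integration needs: Langfuse hooks into LangChain and LangGraph through a callback handler. This assumes the LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST variables are already set in the environment; the graph.invoke call is illustrative.
# Minimal integration sketch: forward LangChain/LangGraph events to Langfuse via a callback handler
from langfuse import Langfuse
from langfuse.langchain import CallbackHandler

langfuse = Langfuse()                 # reads the LANGFUSE_* environment variables
langfuse_handler = CallbackHandler()  # captures chain, LLM and tool events as traces

# Pass the handler to any chain, model or compiled graph invocation, for example:
# result = graph.invoke({"ticket": "My invoice is wrong"}, config={"callbacks": [langfuse_handler]})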
Multi-agent application
For this demonstration, I have attached the LangGraph code of a Customer Service application. The application accepts a ticket from the user, classifies it as Technical, Billing or Both using a Triage agent, then routes it to the Technical Support agent, the Billing Support agent, or both. A Finalizer agent then synthesizes the responses from both agents into a coherent, more readable format. The flowchart is as follows:
The code is attached here:
# --------------------------------------------------
# 0. Load .env
# --------------------------------------------------
from dotenv import load_dotenv
load_dotenv(override=True)

# --------------------------------------------------
# 1. Imports
# --------------------------------------------------
import os
from typing import TypedDict

from langgraph.graph import StateGraph, END
from langchain_openai import AzureChatOpenAI
from langfuse import Langfuse
from langfuse.langchain import CallbackHandler

# --------------------------------------------------
# 2. Langfuse Client
# --------------------------------------------------
langfuse = Langfuse(
    host="https://cloud.langfuse.com",
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
)
langfuse_callback = CallbackHandler()

os.environ["LANGGRAPH_TRACING"] = "false"
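# NOTE (assumption): the .env file loaded above is expected to provide at least the
# following keys; AzureChatOpenAI below additionally reads AZURE_OPENAI_API_KEY and
# AZURE_OPENAI_ENDPOINT from the environment:
#   LANGFUSE_PUBLIC_KEY=pk-lf-...
#   LANGFUSE_SECRET_KEY=sk-lf-...
#   AZURE_OPENAI_API_KEY=...
#   AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
#   AZURE_OPENAI_DEPLOYMENT_NAME=<your-deployment-name>
#   AZURE_OPENAI_API_VERSION=2025-01-01-preview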
# --------------------------------------------------
# 3. Azure OpenAI Setup
# --------------------------------------------------
llm = AzureChatOpenAI(
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
    temperature=0.2,
    callbacks=[langfuse_callback],  # 🔑 enables token usage capture on LLM calls
)

# --------------------------------------------------
# 4. Shared State
# --------------------------------------------------
class AgentState(TypedDict, total=False):
    ticket: str
    category: str
    technical_response: str
    billing_response: str
    final_response: str
# --------------------------------------------------
# 5. Agent Definitions
# --------------------------------------------------
def triage_agent(state: dict) -> dict:
    """Classify the ticket as Technical, Billing or Both."""
    with langfuse.start_as_current_observation(
        as_type="span",
        name="triage_agent",
        input={"ticket": state["ticket"]},
    ) as span:
        span.update_trace(name="Customer Service Query - LangGraph Demo")
        response = llm.invoke([
            {
                "role": "system",
                "content": (
                    "Classify the query as one of: "
                    "Technical, Billing, Both. "
                    "Respond with only the label."
                ),
            },
            {"role": "user", "content": state["ticket"]},
        ])
        raw = response.content.strip().lower()
        if "both" in raw:
            category = "Both"
        elif "technical" in raw:
            category = "Technical"
        elif "billing" in raw:
            category = "Billing"
        else:
            category = "Technical"  # ✅ safe fallback
        span.update(output={"raw": raw, "category": category})
        return {"category": category}
def technical_support_agent(state: dict) -> dict:
    """Produce a step-by-step technical answer for the ticket."""
    with langfuse.start_as_current_observation(
        as_type="span",
        name="technical_support_agent",
        input={
            "ticket": state["ticket"],
            "category": state.get("category"),
        },
    ) as span:
        response = llm.invoke([
            {
                "role": "system",
                "content": (
                    "You are a technical support specialist. "
                    "Provide a clear, step-by-step solution."
                ),
            },
            {"role": "user", "content": state["ticket"]},
        ])
        answer = response.content
        span.update(output={"technical_response": answer})
        return {"technical_response": answer}
def billing_support_agent(state: dict) -> dict:
    """Answer the billing-related portion of the ticket."""
    with langfuse.start_as_current_observation(
        as_type="span",
        name="billing_support_agent",
        input={
            "ticket": state["ticket"],
            "category": state.get("category"),
        },
    ) as span:
        response = llm.invoke([
            {
                "role": "system",
                "content": (
                    "You are a billing support specialist. "
                    "Answer clearly about payments, invoices, or accounts."
                ),
            },
            {"role": "user", "content": state["ticket"]},
        ])
        answer = response.content
        span.update(output={"billing_response": answer})
        return {"billing_response": answer}
def finalizer_agent(state: dict) -> dict:
    """Synthesize the agent responses into one customer-facing answer."""
    with langfuse.start_as_current_observation(
        as_type="span",
        name="finalizer_agent",
        input={
            "ticket": state["ticket"],
            "technical": state.get("technical_response"),
            "billing": state.get("billing_response"),
        },
    ) as span:
        parts = []
        if state.get("technical_response"):
            parts.append(f"Technical:\n{state['technical_response']}")
        if state.get("billing_response"):
            parts.append(f"Billing:\n{state['billing_response']}")

        if not parts:
            final = "Error: No agent responses available."
        else:
            response = llm.invoke([
                {
                    "role": "system",
                    "content": (
                        "Combine the following agent responses into ONE clear, professional, "
                        "customer-facing answer. Do not mention agents or internal labels. "
                        f"Answer the user's query: '{state['ticket']}'."
                    ),
                },
                {"role": "user", "content": "\n\n".join(parts)},
            ])
            final = response.content
        span.update(output={"final_response": final})
        return {"final_response": final}
# --------------------------------------------------
# 6. LangGraph Construction
# --------------------------------------------------
builder = StateGraph(AgentState)
builder.add_node("triage", triage_agent)
builder.add_node("technical", technical_support_agent)
builder.add_node("billing", billing_support_agent)
builder.add_node("finalizer", finalizer_agent)
builder.set_entry_point("triage")

# Conditional routing after triage: 'Both' starts with the technical agent
builder.add_conditional_edges(
    "triage",
    lambda state: state["category"],
    {
        "Technical": "technical",
        "Billing": "billing",
        "Both": "technical",
    },
)

# After the technical agent: continue to billing only if the ticket is 'Both'
builder.add_conditional_edges(
    "technical",
    lambda state: "billing" if state.get("category") == "Both" else "finalizer",
    {
        "billing": "billing",
        "finalizer": "finalizer",
    },
)
builder.add_edge("billing", "finalizer")
builder.add_edge("finalizer", END)

graph = builder.compile()
# --------------------------------------------------
# 7. Main
# --------------------------------------------------
if __name__ == "__main__":
    print("===============================================")
    print(" Conditional Multi-Agent Support System (Ready)")
    print("===============================================")
    print("Enter 'exit' or 'quit' to stop the program.\n")

    while True:
        # Get user input for the ticket
        ticket = input("Enter your support query (ticket): ")

        # Check for exit command
        if ticket.lower() in ["exit", "quit"]:
            print("\nExiting the support system. Goodbye!")
            break
        if not ticket.strip():
            print("Please enter a non-empty query.")
            continue

        try:
            # --- Run the graph with the user's ticket ---
            result = graph.invoke(
                {"ticket": ticket},
                config={"callbacks": [langfuse_callback]},
            )

            # --- Print Results ---
            category = result.get("category", "N/A")
            print(f"\n✅ Triage Classification: **{category}**")

            # Check which agents were executed based on the presence of a response
            executed_agents = []
            if result.get("technical_response"):
                executed_agents.append("Technical")
            if result.get("billing_response"):
                executed_agents.append("Billing")
            print(f"🛠️ Agents Executed: {', '.join(executed_agents) if executed_agents else 'None (Triage Failed)'}")

            print("\n================ FINAL RESPONSE ================\n")
            print(result["final_response"])
            print("\n" + "=" * 60 + "\n")
        except Exception as e:
            # Print the exception type and message to aid debugging
            print(f"\nAn error occurred during processing ({type(e).__name__}): {e}")
            print("\nPlease try another query.")
            print("\n" + "=" * 60 + "\n")
Observability Configuration
To set up Langfuse, go to https://cloud.langfuse.com/ and create an account on a billing tier (a Hobby tier with generous limits is available), then set up a Project. In the project settings, you can generate the public and secret keys that need to be provided at the start of the code. You also need to add the LLM connection, which will be used for the LLM-as-a-Judge evaluation.
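Before running the application, you can quickly verify that the keys and host are picked up correctly. A minimal sketch, assuming the LANGFUSE_* variables are set in the environment:
# Sanity check: confirm the Langfuse credentials and host are valid
from langfuse import Langfuse

langfuse = Langfuse()          # reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST
print(langfuse.auth_check())   # prints True if authentication succeeds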

LLM-as-a-Judge setup
This is the core of the performance evaluation setup for agents. Here you can configure various pre-built evaluators from the Evaluator Library, which will score responses on criteria such as Conciseness, Correctness, Hallucination, Answer Critic and so on. These should suffice for most use cases; otherwise, custom evaluators can be set up as well. Here is a view of the Evaluator Library:

Select the evaluator you wish to use, say Relevance. You can choose to run it for new or existing traces or for Dataset runs. In addition, review the evaluation prompt to make sure it satisfies your evaluation objective. Most importantly, the query, generation and other variables need to be correctly mapped to their source (usually the Input and Output of the application trace). In our case, these are the ticket entered by the user and the response generated by the Finalizer agent, respectively. For Dataset runs, you can also compare the generated responses to the ground-truth responses stored as expected outputs (explained in the next sections).
Here is the configuration of the 'GT Accuracy' evaluation I set up for new Dataset runs, along with the variable mapping. The evaluation prompt preview is also shown. Most of the evaluators score within a range of 0 to 1:


For the customer service demo, I have configured three evaluators: Relevance and Conciseness, which run for all new traces, and GT Accuracy, which runs for Dataset runs only.
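If a pre-built or UI-configured custom evaluator does not cover a metric you care about, scores can also be attached to traces programmatically. A minimal sketch, assuming you already hold the trace_id of the run you want to score; the metric name here is purely illustrative:
# Sketch: recording a custom score against an existing trace via the SDK
langfuse.create_score(
    trace_id=trace_id,           # id of the application trace to score (obtained elsewhere)
    name="policy_compliance",    # hypothetical custom metric
    value=0.8,                   # evaluators in this article score in a 0-1 range
    comment="Response follows the refund policy wording.",
)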

Datasets setup
Create a dataset to use as a test case repository. Here, you can store test cases with the input query and the ideal expected response. To create the dataset, there are three choices: create one record at a time, upload a CSV of queries and expected responses, or, quite conveniently, add inputs and outputs directly from application traces whose responses are judged to be of good quality by human experts.
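If you prefer to script the first two options rather than use the UI, the same can be done with a couple of SDK calls. A minimal sketch (the dataset name 'Regression' matches the demo; the example item is illustrative):
# Sketch: create a dataset and add a test case with its expected (ground-truth) output
langfuse.create_dataset(name="Regression")
langfuse.create_dataset_item(
    dataset_name="Regression",
    input={"ticket": "I was charged twice for my subscription this month."},
    expected_output="Apologize, confirm the duplicate charge, and explain the refund timeline.",
)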
Here is the dataset I have created for the demo. It is a mix of technical, billing and 'Both' queries, and I created all the records from application traces:

That's it! The configuration is done and we are ready to run observability.
Observability Results
The Langfuse Home page is a dashboard of several useful charts. It shows the count of execution traces, scores and averages at a glance, traces over time, model usage and cost, and so on.

MELT data
The most useful observability data is available under the 'Tracing' option, which displays summarized and detailed views of all executions. Here is a view of the dashboard showing the time, name, input, output and the key latency and token usage metrics. Note that for every agent execution of our application, two evaluation traces are generated, one each for the Conciseness and Relevance evaluators we set up.


Let's look at the details of one of the executions of the Customer Service application. On the left panel, the agent flow is depicted both as a tree and as a flowchart. It shows the LangGraph nodes (agents) and the LLM calls along with their token usage. If our agents had tool calls or human-in-the-loop steps, they would be depicted here as well. Note that the evaluation scores for Conciseness and Relevance are also displayed at the top, 0.40 and 1 respectively for this run. Clicking on a score shows the rationale behind it and a link that takes us to the evaluator trace.
On the right, for each agent, LLM and tool call, we can see the input and the generated output. For instance, here we see that the query was classified as 'Both', and accordingly the left panel shows that both the technical and billing support agents were called, which confirms our flow is working as expected.

At the top of the right-hand panel is the 'Add to datasets' button. At any step of the tree, clicking this button opens a panel like the one shown below, where you can add the input and output of that step directly to a test dataset created in the previous section. This is a useful feature for human experts to add frequently occurring user queries and good responses to the dataset during normal agent operations, thereby building a regression test repository with minimal effort. In future, when there is a major upgrade or release of the application, the regression dataset can be run and the generated outputs scored against the expected outputs (ground truth) recorded here, using the 'GT Accuracy' evaluator we created during the LLM-as-a-Judge setup. This helps detect LLM drift (or agent drift) early and take corrective action.

Here is one of the evaluation traces (Conciseness) for this application trace. The evaluator gives its reasoning for the score of 0.4 it assigned to this response.

Scores
The Scores option in Langfuse provides a list of all the evaluation runs from the various active evaluators along with their scores. More pertinent is the Analytics dashboard, where two scores can be selected and metrics such as mean and standard deviation, along with trend lines, can be viewed.


Regression testing
With Datasets in place, we are ready to run regression testing using the test case repository of queries and expected outputs. We have stored four queries in our Regression dataset, with a mix of technical, billing and 'Both' queries.
For this, we can run the attached code, which fetches the relevant dataset and runs the experiment. All the test runs are logged along with their average scores. We can view the result of a specific test, with Conciseness, GT Accuracy and Relevance scores for each test case, in a single dashboard. And as needed, the detailed trace can be accessed to see the reasoning behind a score.
You can view the code here:
from dotenv import load_dotenv
load_dotenv(override=True)

import os
from langfuse import Langfuse
from langchain_openai import AzureChatOpenAI

# Initialize the Langfuse client
langfuse = Langfuse(
    host="https://cloud.langfuse.com",
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
)

llm = AzureChatOpenAI(
    azure_deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
    temperature=0.2,
)

# Define the task function: it receives a dataset item and returns the generated output
def my_task(*, item, **kwargs):
    question = item.input["ticket"]
    response = llm.invoke([{"role": "user", "content": question}])
    return response.content

# Get the dataset from Langfuse
dataset = langfuse.get_dataset("Regression")

# Run the experiment directly on the dataset
result = dataset.run_experiment(
    name="Production Model Test",
    description="Monthly evaluation of our production model",
    task=my_task,  # see above for the task definition
)

# Use the format method to display the results
print(result.format())
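The task above calls the LLM directly; in practice you would usually exercise the whole multi-agent graph so the regression scores reflect real application behaviour. A hedged variant, assuming the compiled graph and callback handler from the earlier listing can be imported (customer_service_graph is a hypothetical module name):
# Sketch (assumption): run each dataset item through the full LangGraph application
from customer_service_graph import graph, langfuse_callback  # hypothetical import path

def graph_task(*, item, **kwargs):
    result = graph.invoke(
        {"ticket": item.input["ticket"]},
        config={"callbacks": [langfuse_callback]},
    )
    return result["final_response"]

result = dataset.run_experiment(
    name="Regression - full graph",
    description="Dataset run through the complete customer service graph",
    task=graph_task,
)
print(result.format())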


Key Takeaways
- AI observability doesn't have to be code-heavy. Most evaluation, tracing, and regression testing capabilities for LLM agents can be enabled through configuration rather than custom code, significantly reducing development and maintenance effort.
- Rich evaluation workflows can be defined declaratively. Capabilities such as LLM-as-a-Judge scoring (relevance, conciseness, hallucination, ground-truth accuracy), variable mapping, and evaluation prompts are configured directly in the observability platform, without writing bespoke evaluation logic.
- Datasets and regression testing are configuration-first features. Test case repositories, dataset runs, and ground-truth comparisons can be set up and reused through the UI or simple configuration, allowing teams to run regression tests across agent versions with minimal extra code.
- Full MELT observability comes "out of the box." Metrics (latency, token usage, cost), events (LLM and tool calls), logs, and traces are automatically captured and correlated, avoiding the need for manual instrumentation across agent workflows.
- Minimal instrumentation, maximum visibility. With lightweight SDK integration, teams gain deep visibility into multi-agent execution paths, evaluation results, and performance trends, freeing developers to focus on agent logic rather than observability plumbing.
Conclusion
As LLM agents become more complex, observability is no longer optional. Without it, multi-agent systems quickly turn into black boxes that are difficult to evaluate, debug, and improve.
An AI observability platform shifts this burden away from developers and application code. With a minimal-code, configuration-first approach, teams can enable LLM-as-a-Judge evaluation, regression testing, and full MELT observability without building and maintaining custom pipelines. This not only reduces engineering effort but also accelerates the path from prototype to production.
By adopting an open-source, framework-agnostic platform like Langfuse, teams gain a single source of truth for agent performance, making AI systems easier to trust, evolve, and operate at scale.
Want to know more? The Customer Service agentic application presented here follows a manager-worker architecture pattern, which does not work in CrewAI. Read how observability helped me fix this well-known issue with CrewAI's manager-worker hierarchical process, by tracing agent responses at each step and refining them to get the orchestration to work as it should. Full analysis here: Why CrewAI's Manager-Worker Architecture Fails — and How to Fix It
Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI
All images and data used in this article are synthetically generated. Figures and code were created by me.

