Introduction
I built an agentic AI network for my company that advises manufacturing plants on how to mature their operations. The system was designed to be data-driven, allowing users to upload assessment data directly via the chat interface. The first working prototype was completed surprisingly quickly, and at first glance the results looked promising.
There was only one problem: many of the results were wrong!
Even worse, the AI quickly learned which numerical ranges looked plausible and started producing convincing but fabricated outputs. Combined with the eloquent language generation of the LLM, these results could easily be mistaken for the truth. And this behavior was not limited to a single model. Similar patterns appeared across all tested systems: ChatGPT, Gemini Enterprise, DIA Mind, and Microsoft Copilot.
But plausible data is not enough. Enterprise AI systems require reliable data!
Further investigation revealed recurring failure modes. Even with “Code Interpreter” enabled, the systems:
- skipped rows or columns,
- applied incorrect filters,
- returned identical results for different inputs,
- silently mixed parts of the dataset,
- or simply collapsed under more complex analytical tasks.
This led to an important realization:
Probabilistic reasoning is extremely powerful for interpretation and interaction, but foundational data analysis requires deterministic execution.
Table of contents
1 The Use Case
2 The Hybrid Architecture
3 The Analysis Planner
4 The Analysis Engine
5 An End-to-End Example
6 Why AI Architecture Matters
1 The Use Case
Although the specific use case is of secondary importance, it is briefly outlined here to support the practical understanding of the underlying architectural challenge.
The primary task of our agent is to advise manufacturing plants and value streams on how to improve their operational maturity: optimizing processes, improving productivity, reducing inventory levels, and ultimately lowering operational costs. To achieve this, the consultation agent operates in two modes:
- It provides generic recommendations for improving specific operational topics based on the retrieval of specialized “how-to” documentation and assessment questionnaires.
- It analyzes the current situation of a plant or value stream based on assessment results and the assessors’ written feedback. Based on this analysis, it is expected to provide highly specific recommendations for the next improvement steps.
In both modes, as with most LLM-based AI models, the user can interactively discuss ideas and recommendations with the agent in order to derive the most suitable action plan.
For the second operating mode, it is essential that the agent can reliably process and analyze assessment data. In our case, this data is provided as an Excel export from a central database. Ideally, the agent should be able to process the file without any prior manual preparation.
The structure of the file, however, is challenging. Since all assessment results, intermediate calculations, metadata, and detailed assessment questions are stored in separate columns, the worksheet contains more than 800 columns. The number of rows corresponds to the number of assessments in the database and can range from one to several hundred (Fig. 1). Assessment ratings are represented as integers from 0 to 4. In addition, the file contains more than 160 free-text fields with qualitative observations, strengths, weaknesses, and recommendations from the assessors.
The analytical tasks of the agent include filtering relevant rows and columns for a specific request, calculating averages, aggregating maturity scores, summarizing textual feedback, and deriving meaningful improvement suggestions from the results.
Initially, these tasks appeared to be well within the capabilities of modern LLM-based AI systems, especially with “Code Interpreter” mode enabled. As already mentioned in the introduction, this assumption quickly turned out to be a misconception.
2 The Hybrid Architecture
The core idea for overcoming the analytical challenge was to clearly separate deterministic data analysis from LLM-based reasoning and interpretation. Fig. 2 shows the chosen system architecture after several improvement iterations. The system was implemented in Microsoft Copilot Studio because the platform allows deterministic workflow elements, such as topics and flows, to be combined with LLM-based reasoning components.

The parent agent handles all communication with the user. It orchestrates the sub-agents and the analytics module, delegates tasks to them, receives their responses, and composes the final answer.
The sub-agents are specialized LLM-based modules with access to specific knowledge sources. These include descriptions of maturity-level expectations for the value streams, questionnaires with detailed assessment questions, and more general guidelines for operational excellence. The sub-agents are called by the parent agent according to their specific capabilities and respond to the parent agent rather than directly to the user.
The analytics module is the main focus of this article. It performs the deterministic data analysis and is designed to provide reproducible and reliable analytical results. It receives an analysis instruction in natural language from the parent agent, called Parent_Instruction. The analytics module itself consists of topics, flows, and AI modules, which are called “prompts” in Copilot Studio.
The topic T_receive_Excel_File handles the upload and storage of assessment files. It is triggered when a file is uploaded in the chat window, indicated by the variable System.Activity.Attachments having a value. The topic checks whether the uploaded file is an Excel file and, if so, stores it in the global variable Assessment_File.
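The file check in T_receive_Excel_File is configured inside Copilot Studio, so the following Python is only a sketch of the decision it has to make. The function name, the accepted extensions, and the content types are assumptions for illustration and are not taken from the actual topic.

```python
from pathlib import PurePath

# Assumed acceptance criteria; the real topic may check different properties.
EXCEL_EXTENSIONS = {".xlsx", ".xlsm", ".xls"}
EXCEL_CONTENT_TYPES = {
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.ms-excel",
}


def is_excel_attachment(name: str, content_type: str = "") -> bool:
    """Return True if the attachment looks like an Excel workbook."""
    suffix = PurePath(name).suffix.lower()
    return suffix in EXCEL_EXTENSIONS or content_type.lower() in EXCEL_CONTENT_TYPES


print(is_excel_attachment("assessments.XLSX"))
print(is_excel_attachment("notes.pdf", "application/pdf"))
```

Only when this kind of check succeeds would the file be stored in Assessment_File; everything else is ignored by the analytics module.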
The topic T_analyze_assessments is actively called by the parent agent when it has an analytics task to conduct, and it receives Parent_Instruction as input. A second input is the assessment data stored in the global variable Assessment_File. The topic contains the two core analytics components: Analysis_Planner and Analysis_Engine. Both are embedded in agentic flows, F_Call_Analysis_Planner and F_Call_Analysis_Engine. These flows serve as connectors between the topic T_analyze_assessments and the AI prompts P_Analysis_Planner and P_Analysis_Engine.
F_Call_Analysis_Planner receives only one input, Parent_Instruction, and forwards it to P_Analysis_Planner. This component generates the Selection_Rule, the core analysis instruction to be executed by P_Analysis_Engine. The inner workings of P_Analysis_Planner are discussed in chapter 3.
F_Call_Analysis_Engine receives three inputs: the Selection_Rule from Analysis_Planner, a Mapping_File provided from SharePoint, and the Assessment_File. All three inputs are forwarded to the AI prompt P_Analysis_Engine, which conducts the data analysis as specified by Analysis_Planner. The P_Analysis_Engine is discussed in detail in chapter 4.
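The hand-over between the two flows can be sketched as a small pipeline. This is not the Copilot Studio implementation: run_analytics, fake_planner, and fake_engine are hypothetical stand-ins for the topic and the two prompts, and the sketch assumes the planner replies with JSON text that carries a status field and a selection_rule.

```python
import json
from typing import Callable


def run_analytics(parent_instruction: str,
                  planner: Callable[[str], str],
                  engine: Callable[[dict, str, str], dict],
                  mapping_file: str,
                  assessment_file: str) -> dict:
    """Chain planner and engine: plan first, execute only a valid plan."""
    plan = json.loads(planner(parent_instruction))  # planner returns JSON text
    if plan.get("status") != "success":
        # Never execute anything if the planner could not build a rule.
        return {"status": "error", "warnings": plan.get("warnings", [])}
    return engine(plan["selection_rule"], mapping_file, assessment_file)


# Stubs standing in for P_Analysis_Planner and P_Analysis_Engine.
def fake_planner(instruction: str) -> str:
    rule = {"analysis_type": "numeric_mean"}
    return json.dumps({"status": "success", "selection_rule": rule, "warnings": []})


def fake_engine(rule: dict, mapping_file: str, assessment_file: str) -> dict:
    return {"status": "success", "analysis_type": rule["analysis_type"]}


outcome = run_analytics("Average maturity for plant AbcP",
                        fake_planner, fake_engine,
                        "mapping.xlsx", "assessments.xlsx")
print(outcome)
```

The point of the structure is that the engine never sees the natural-language request, only the rule derived from it.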
3 The Analysis Planner
The P_Analysis_Planner is the intelligent part of the data analysis pipeline and generates the analysis instruction, called Selection_Rule. This instruction is a translation of the natural-language Parent_Instruction and is generally unique for each request. In order to minimize probabilistic variation, the translation process is constrained by strict rules.
The Analysis_Planner does not analyze the assessment data itself. Its sole responsibility is to translate the probabilistic Parent_Instruction into a deterministic analysis specification.
In the following, we will examine selected parts of the instruction in more detail. You can download the full instruction here.
You are Analysis_Planner, an expert assistant for translating natural-language assessment analysis requests into structured Selection_Rules.
Your task is to create a Selection_Rule JSON object for the Analysis_Engine.
You receive only one input:
1. Parent_Instruction:
   A natural-language analysis request from the parent agent (orchestrator).
You must analyze Parent_Instruction and determine:
- which type of analysis is required,
- which assessment content categories are relevant,
- whether concept or execution maturity/findings are requested,
- whether specific chapters are requested,
- and whether row filters are required.
The Selection_Rule you generate will later be used by the Analysis_Engine together with:
- the real assessment data file,
- and the Mapping_File
to execute the analysis deterministically.
The code box above shows the initial instruction for P_Analysis_Planner. It clearly defines purpose and scope and explicitly separates planning from execution. The planner interprets the request, while the actual execution is delegated to the P_Analysis_Engine.
Next follows a longer section describing the semantics of the assessment data. Of course, this part is highly specific to the individual use case and dataset. It defines the semantic categories used for row filtering (ROW FILTER CATEGORIES) and the categories used to select the actual analysis targets (TARGET CONTENT CATEGORIES and TARGET SELECTION ATTRIBUTES).
ASSESSMENT DATA SEMANTICS
The assessment data can be addressed through the following semantic categories.
ROW FILTER CATEGORIES
Use these categories only for row_filters:
- VS_Nr:
  Unique identifier of the value stream.
  Use when filtering by value stream number.
- Value Stream:
  Name of the value stream.
  Use when filtering by value stream name.
- ...
TARGET CONTENT CATEGORIES
Use these categories only in target_selection_rules.data_category:
- chapter_score:
  Numeric maturity score.
  Use for maturity calculations, score analysis, and general maturity analysis.
- strength:
  Assessor statements describing strengths.
- ...
TARGET SELECTION ATTRIBUTES
Use these attributes only inside target_selection_rules:
- data_category:
  Defines which target content category is required.
- aggregation_allowed:
  Use:
  - mean for numeric maturity averages
  - summary for textual summaries
- ...
The planner never interacts directly with physical dataset columns. Instead, it operates on a semantic abstraction layer that decouples natural language from the underlying dataset structure.
This separation is crucial because the assessment dataset contains more than 800 columns, including:
- maturity ratings,
- textual assessor findings,
- metadata,
- organizational mappings,
- questionnaire variants,
- and concept/execution distinctions.
Selecting the correct target columns therefore becomes a critical part of the analysis process.
Limiting the allowed analysis types is equally important. The planner is intentionally prevented from inventing arbitrary analytical operations. The section ANALYSIS TYPES therefore defines the only valid analysis types, currently just two. This significantly improves the predictability and robustness of downstream execution. Of course, the list can easily be extended for individual use cases.
ANALYSIS TYPES
Use exactly one of these analysis_type values:
- numeric_mean
  Use for:
  - average maturity
  - mean maturity
  - ...
- text_summary
  Use for:
  - strengths
  - improvement potentials
  - ...
The next section defines how the planner selects the relevant target columns in an abstract and deterministic way. The rules distinguish between the two predefined analysis types numeric_mean and text_summary and ultimately determine which dataset columns are selected for a specific request.
RULES FOR target_selection_rules
NUMERIC MATURITY ANALYSIS
For numeric maturity analysis:
- analysis_type must be:
  "numeric_mean"
- data_category must be:
  ["chapter_score"]
- ...
TEXT SUMMARY ANALYSIS
For textual summary analysis:
- analysis_type must be:
  "text_summary"
- data_category:
  include only requested categories:
  - "strength"
  - "potential"
  - "recommendation"
  - "comment"
- ...
A similar logic applies to the row filtering process.
RULES FOR row_filters
Use row_filters only for filtering rows in the assessment dataset.
Allowed row filter keys are:
- VS_Nr
- Value Stream
- ...
Do NOT use row_filters for:
- chapter_id
- ...
These belong only to target_selection_rules.
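Because these rules are only enforced by the prompt, it can be worth re-checking them in code before a rule reaches the engine. The following is a minimal sketch of such a check; the whitelists are assumptions (the source elides the full lists), and check_row_filters is a hypothetical helper that is not part of the published instruction.

```python
# Assumed whitelists for illustration; the real lists are longer.
ALLOWED_ROW_FILTER_KEYS = {"VS_Nr", "Plant"}
TARGET_ONLY_KEYS = {"chapter_id", "data_category",
                    "aggregation_allowed", "concept_execution"}


def check_row_filters(row_filters: dict) -> list[str]:
    """Return a list of problems; an empty list means the filters are valid."""
    problems = []
    for key in row_filters:
        if key in TARGET_ONLY_KEYS:
            problems.append(f"'{key}' belongs in target_selection_rules, not row_filters")
        elif key not in ALLOWED_ROW_FILTER_KEYS:
            problems.append(f"'{key}' is not an allowed row filter key")
    return problems


print(check_row_filters({"Plant": ["AbcP"]}))
print(check_row_filters({"chapter_id": ["1.4"]}))
```

Returning a list of problems instead of raising immediately makes it easy to pass every violation back to the planner or into the warnings field in one go.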
Finally, the instruction defines the required output structure together with several strict “do-not” rules. This section is particularly important because the generated output is forwarded directly to the P_Analysis_Engine and therefore must follow a clearly defined, machine-readable structure.
OUTPUT FORMAT
Return only valid JSON.
Do not return markdown.
Do not return Python code.
...
Use exactly this structure:
{
  "status": "success",
  "parent_instruction_summary": "",
  "selection_rule": {
    "analysis_type": "",
    "target_selection_rules": {
      "data_category": [],
      "aggregation_allowed": [],
      "concept_execution": null,
      "chapter_id": null
    },
    "row_filters": {}
  },
  "warnings": []
}
If the request is unclear, the planner must explicitly return an error structure instead of “guessing” a potentially wrong analysis instruction.
If the task is unclear, return:
{
  "status": "error",
  "parent_instruction_summary": "",
  "selection_rule": {
    "analysis_type": null,
    "target_selection_rules": {
      "data_category": [],
      "aggregation_allowed": [],
      "concept_execution": null,
      "chapter_id": null
    },
    "row_filters": {}
  },
  "warnings": [
    "The analysis task is not clearly understood."
  ]
}
At this point, the planner has transformed ambiguous natural language into a deterministic analysis specification. However, the actual data execution has still not happened.
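Since the planner is still an LLM, its reply may occasionally deviate from the required structure. A defensive check between planner and engine could look like the sketch below. It assumes the reply uses a status field and the two analysis types named in the instruction; validate_planner_output is a hypothetical helper, not part of the published flows.

```python
import json

ALLOWED_ANALYSIS_TYPES = {"numeric_mean", "text_summary"}
REQUIRED_TARGET_KEYS = {"data_category", "aggregation_allowed",
                        "concept_execution", "chapter_id"}


def validate_planner_output(raw: str) -> dict:
    """Parse the planner reply and return the selection rule, or raise ValueError."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"planner did not return valid JSON: {exc}") from exc
    if not isinstance(reply, dict) or reply.get("status") != "success":
        raise ValueError("planner reported an error or an unknown status")
    rule = reply.get("selection_rule") or {}
    if rule.get("analysis_type") not in ALLOWED_ANALYSIS_TYPES:
        raise ValueError(f"unsupported analysis_type: {rule.get('analysis_type')!r}")
    missing = REQUIRED_TARGET_KEYS - set(rule.get("target_selection_rules") or {})
    if missing:
        raise ValueError(f"missing target attributes: {sorted(missing)}")
    if not isinstance(rule.get("row_filters"), dict):
        raise ValueError("row_filters must be a JSON object")
    return rule
```

Rejecting a malformed rule at this boundary keeps the engine free of any repair logic, which matches the intent that the engine only executes.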
In chapter 5, we will follow a real user request through the complete pipeline and examine how P_Analysis_Planner generates the Selection_Rule and how P_Analysis_Engine executes it on the assessment dataset.
4 The Analysis Engine
Unlike the P_Analysis_Planner, the P_Analysis_Engine does not reason about the task. It only executes the analysis specification generated by P_Analysis_Planner.
As in chapter 3, we will focus only on the most relevant parts of the instruction. The full specification can be downloaded here.
The instruction of P_Analysis_Engine starts with the basic task definition. In essence, the AI prompt is used as a controlled Python execution environment. The code is predefined in the prompt instruction and must only be executed, not modified.
You are Analysis_Engine, a deterministic pandas-based analysis executor.
Your task is to analyze an Excel assessment dataset using Code Interpreter.
You receive three inputs:
1. doc
   The Excel file containing the assessment data.
2. Mapping_File
   The Excel file describing the columns of doc.
3. Selection_Rule
   A JSON object that defines:
   - which columns to select from Mapping_File
   - which row filters to apply to doc
   - which type of analysis to perform
You must not reinterpret the original user request.
You must not infer additional columns.
You must not change Selection_Rule.
You must not generate a new analysis approach.
You must only execute the deterministic Python script below.
Use Code Interpreter to execute the Python script.
Return only the JSON result printed by the script.
Do not return markdown.
Do not explain the code.
Do not add text before or after the JSON result.
P_Analysis_Engine receives three inputs:
- The Assessment_File uploaded by the user in the chat interface. It is stored in the prompt-internal variable doc.
- A Mapping_File, which the flow F_Call_Analysis_Engine loads from SharePoint in preparation for the execution.
- The Selection_Rule generated by P_Analysis_Planner (see chapter 3).
The Mapping_File plays a crucial role in defining the semantics of the many columns in Assessment_File at a higher level of abstraction. With this abstraction layer, the Selection_Rule only needs to specify which type of information is required, while the P_Analysis_Engine selects the corresponding dataset columns during execution.

Mapping_File | image by author
Fig. 3 shows the structure of Mapping_File. It contains a row for each column of Assessment_File that is potentially relevant for the data analysis. Data columns that are clearly irrelevant are not represented in Mapping_File and are therefore not visible to P_Analysis_Engine. For each row, the file specifies the selection criteria:
- data_category: Functional meaning of the column, e.g. maturity score, strength, plant name, region, or season.
- chapter_id: Unique identifier of the assessment chapter.
- chapter_name: Human-readable name of the assessment chapter.
- concept_execution: Indicates whether the column belongs to concept or execution maturity.
- aggregation_allowed: Defines which type of aggregation is valid for the column, e.g. mean for numeric maturity scores or summary for textual findings.
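To make the structure more tangible, here is a sketch of what a few mapping rows might look like once loaded into memory. The attribute names follow the description above and the source_column_name field used in the engine script; the row contents and column labels are invented for illustration and do not come from the real Mapping_File.

```python
# Illustrative mapping rows; the actual file has one row per relevant column.
mapping_rows = [
    {"source_column_name": "Plant", "data_category": "Plant",
     "chapter_id": None, "chapter_name": None,
     "concept_execution": None, "aggregation_allowed": None},
    {"source_column_name": "1.4 CON Score", "data_category": "chapter_score",
     "chapter_id": "1.4", "chapter_name": "Failure Prevention System",
     "concept_execution": "concept", "aggregation_allowed": "mean"},
    {"source_column_name": "1.4 EXE L2 Improvement potentials",
     "data_category": "potential",
     "chapter_id": "1.4", "chapter_name": "Failure Prevention System",
     "concept_execution": "execution", "aggregation_allowed": "summary"},
]

# Columns that may be averaged are identified by attribute, never by name.
numeric_columns = [row["source_column_name"] for row in mapping_rows
                   if row["aggregation_allowed"] == "mean"]
print(numeric_columns)  # ['1.4 CON Score']
```

The same lookup works unchanged if columns are renamed in the export, as long as the mapping rows are updated.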
Next in P_Analysis_Engine’s instruction comes a paragraph on how to interpret the Selection_Rule.
Rules for Selection_Rule:
- analysis_type = "numeric_mean":
  Calculate arithmetic means for all selected numeric target columns.
- analysis_type = "text_summary":
  Collect non-empty text entries from all selected text target columns.
- target_selection_rules:
  Select target columns by matching Mapping_File attributes.
  A rule value of null means: do not filter by this attribute.
  A list means: keep rows where the Mapping_File attribute is in the list.
- row_filters:
  Apply row filters to doc.
  Keys are data_category values from Mapping_File, such as "Plant", "Region", "Production Principle", "Season".
  Values are lists of accepted values.
The Selection_Rule specifies:
- which analysis operation must be executed (analysis_type),
- how relevant target columns are selected from the Mapping_File (target_selection_rules),
- and how the assessment dataset is filtered before the analysis is performed (row_filters).
This instruction is intentionally deterministic. The P_Analysis_Engine is not allowed to reinterpret the original user request or invent additional analytical operations.
After the instruction block, the P_Analysis_Engine receives the actual Python script. The full script contains more than 300 lines of code and is part of the AI prompt instruction. It is linked at the top of this chapter and can be downloaded. Many of the code lines are not conceptually important for the architecture. They handle practical robustness: cleaning column names, normalizing input values, handling missing columns, converting Copilot wrapper objects, and returning structured error messages.
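The robustness helpers are not reproduced in the article. To give an impression of what such helpers could look like, the sketch below shows one plausible version of two functions named in the excerpts, strip_column_names and normalize_for_matching. The bodies are assumptions and may differ from the downloadable script.

```python
import pandas as pd


def strip_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose column names have no leading or trailing whitespace."""
    out = df.copy()
    out.columns = [str(col).strip() for col in out.columns]
    return out


def normalize_for_matching(value):
    """Map a cell to a trimmed, case-insensitive string; missing values become None."""
    if pd.isna(value):
        return None
    # Collapse inner whitespace and ignore case so "AbcP " matches "abcp".
    return " ".join(str(value).split()).casefold()


demo = strip_column_names(pd.DataFrame({" Plant ": ["AbcP"]}))
print(list(demo.columns))                  # ['Plant']
print(normalize_for_matching("  abcP "))   # abcp
```

Normalizing both sides of every comparison in the same way is what makes the later filters tolerant of stray spaces and inconsistent capitalization in the Excel export.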
For this article, I will focus only on the central logic.
The first important step is that the engine loads the uploaded assessment data (now available in doc) and the Mapping_File. From this point on, the LLM is no longer interpreting the user request. It only executes the deterministic script based on the Selection_Rule.
mapping_df = pd.read_excel(Mapping_File)
data_df = pd.read_excel(doc)
mapping_df = strip_column_names(mapping_df)
data_df = strip_column_names(data_df)
The key architectural element is the selection of target columns. The P_Analysis_Engine never guesses which Excel columns may be relevant. Instead, it filters the Mapping_File according to the attributes defined in target_selection_rules.
target_mapping = mapping_df.copy()
for attr, rule_value in target_selection_rules.items():
    values = normalize_rule_value(rule_value)
    values = normalize_list_for_matching(values)
    # A null rule value means: do not filter by this attribute
    if values is None:
        continue
    target_mapping = target_mapping[
        target_mapping[attr]
        .apply(normalize_for_matching)
        .isin(values)
    ]
selected_target_columns = (
    target_mapping["source_column_name"]
    .dropna()
    .tolist()
)
This is the point where the abstract analysis instruction becomes concrete. For example, a rule such as chapter_id = ["3.5"], data_category = ["chapter_score"], and aggregation_allowed = ["mean"] is translated into the actual Excel columns containing the Concept and Execution maturity scores for chapter 3.5.
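The chapter 3.5 example can be reproduced on a toy mapping table. The sketch below is self-contained and uses a simplified inline normalization instead of the script's helpers; the four mapping rows and their column labels are invented for illustration.

```python
import pandas as pd


def select_target_columns(mapping_df: pd.DataFrame, target_selection_rules: dict) -> list:
    """Keep mapping rows that satisfy every non-null rule and return their columns."""
    def norm(value):
        return None if pd.isna(value) else str(value).strip().casefold()

    selected = mapping_df
    for attr, rule_value in target_selection_rules.items():
        if rule_value is None:  # null: do not filter by this attribute
            continue
        accepted = {norm(v) for v in rule_value}
        selected = selected[selected[attr].map(norm).isin(accepted)]
    return selected["source_column_name"].dropna().tolist()


mapping_df = pd.DataFrame([
    {"source_column_name": "3.5 CON Score", "data_category": "chapter_score",
     "chapter_id": "3.5", "concept_execution": "concept", "aggregation_allowed": "mean"},
    {"source_column_name": "3.5 EXE Score", "data_category": "chapter_score",
     "chapter_id": "3.5", "concept_execution": "execution", "aggregation_allowed": "mean"},
    {"source_column_name": "3.5 EXE Strengths", "data_category": "strength",
     "chapter_id": "3.5", "concept_execution": "execution", "aggregation_allowed": "summary"},
    {"source_column_name": "1.4 CON Score", "data_category": "chapter_score",
     "chapter_id": "1.4", "concept_execution": "concept", "aggregation_allowed": "mean"},
])

rules = {"data_category": ["chapter_score"], "aggregation_allowed": ["mean"],
         "concept_execution": None, "chapter_id": ["3.5"]}
columns = select_target_columns(mapping_df, rules)
print(columns)  # ['3.5 CON Score', '3.5 EXE Score']
```

Because concept_execution is null in the rule, both the Concept and the Execution score columns survive the selection.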
The same principle is applied to row filters. Again, the engine does not infer anything from natural language. It only applies the filters explicitly provided in the Selection_Rule.
filtered_df = data_df.copy()
for filter_category, filter_values in row_filters.items():
    # Normalize the accepted values the same way as the column values
    values = normalize_list_for_matching(filter_values)
    # Look up the physical column that carries this filter category
    filter_mapping = mapping_df[
        mapping_df["data_category"]
        .apply(normalize_for_matching)
        == normalize_for_matching(filter_category)
    ]
    filter_col = filter_mapping["source_column_name"].iloc[0]
    filtered_df = filtered_df[
        filtered_df[filter_col]
        .apply(normalize_for_matching)
        .isin(values)
    ]
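A small, self-contained run shows the effect of this filter step. The sketch again inlines a simplified normalization, and the column name "Plant name" as well as the sample rows are invented for illustration. Unlike the excerpt, it raises an explicit error when a filter category has no mapped column; that behavior is an assumption about what a robust version should do.

```python
import pandas as pd


def apply_row_filters(data_df: pd.DataFrame, mapping_df: pd.DataFrame,
                      row_filters: dict) -> pd.DataFrame:
    """Apply every row filter by resolving its category through the mapping."""
    def norm(value):
        return None if pd.isna(value) else str(value).strip().casefold()

    filtered = data_df
    for category, accepted_values in row_filters.items():
        match = mapping_df[mapping_df["data_category"].map(norm) == norm(category)]
        if match.empty:
            raise KeyError(f"no mapped column for row filter '{category}'")
        column = match["source_column_name"].iloc[0]
        accepted = {norm(v) for v in accepted_values}
        filtered = filtered[filtered[column].map(norm).isin(accepted)]
    return filtered


mapping_df = pd.DataFrame([{"source_column_name": "Plant name", "data_category": "Plant"}])
data_df = pd.DataFrame({"Plant name": ["AbcP", " abcp ", "XyzP"],
                        "1.4 CON Score": [2, 3, 4]})

subset = apply_row_filters(data_df, mapping_df, {"Plant": ["AbcP"]})
print(subset["1.4 CON Score"].tolist())  # [2, 3]
```

The second row is kept despite its different spelling, which is exactly the kind of case where an unnormalized comparison would silently drop data.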
After column selection and row filtering, the actual analysis logic becomes intentionally simple. For numeric maturity analysis, the engine calculates arithmetic means for all selected numeric target columns.
if analysis_type == "numeric_mean":
    numeric_result = {}
    for col in available_target_columns:
        # Non-numeric cells become NaN and are excluded from the mean
        series = pd.to_numeric(filtered_df[col], errors="coerce")
        valid_count = int(series.notna().sum())
        numeric_result[col] = {
            "mean": float(series.mean()) if valid_count > 0 else None,
            "valid_count": valid_count
        }
    result["result"] = numeric_result
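The role of errors="coerce" and valid_count is easiest to see on a single column with messy content. The values below are invented; the calls are the same pandas functions used in the excerpt.

```python
import pandas as pd

# A score column as it might arrive from Excel: numbers, text, and gaps mixed.
scores = pd.Series([3, "2", None, "n/a", 4])

numeric = pd.to_numeric(scores, errors="coerce")  # "n/a" and None become NaN
valid_count = int(numeric.notna().sum())
summary = {
    "mean": float(numeric.mean()) if valid_count > 0 else None,
    "valid_count": valid_count,
}
print(summary)  # {'mean': 3.0, 'valid_count': 3}
```

Reporting valid_count next to the mean lets the parent agent see that the average rests on three ratings rather than five, instead of hiding the missing values.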
For textual analysis, the engine collects non-empty assessor statements instead of calculating values.
elif analysis_type == "text_summary":
    text_result = {}
    for col in available_target_columns:
        values = [
            clean_text_value(v)
            for v in filtered_df[col].tolist()
        ]
        # Drop empty cells so that only real statements are returned
        values = [v for v in values if v is not None]
        text_result[col] = {
            "entries": values,
            "entry_count": len(values)
        }
    result["result"] = text_result
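The helper clean_text_value is not shown in the excerpt. A plausible minimal version, written here as an assumption about its behavior, turns empty cells and NaN into None and tidies whitespace in real statements.

```python
import math


def clean_text_value(value):
    """Return a tidy string for a text cell, or None if the cell is empty."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):  # empty Excel cell
        return None
    text = " ".join(str(value).split())  # collapse line breaks and double spaces
    return text or None


cells = ["Root causes are not tracked. ", None, float("nan"), "   ",
         "Escalation rules\nare unclear."]
entries = [t for t in (clean_text_value(c) for c in cells) if t is not None]
print(entries)
# ['Root causes are not tracked.', 'Escalation rules are unclear.']
```

Of the five cells, only the two real statements remain, so entry_count reflects actual assessor findings rather than the number of rows.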
Finally, the result is returned as JSON. This is important because the output is not yet the final user-facing answer. It is the reliable analytical foundation for the next LLM step: interpretation by the parent agent.
print(json.dumps(result, indent=2, ensure_ascii=False))
This design deliberately keeps the P_Analysis_Engine “boring”. It does not reason, it does not explain, and it does not improve the analysis. It only executes. And that is exactly the point. The more deterministic this layer is, the more trust can be placed in the later LLM-generated interpretation.
5 End-to-End Example
To illustrate the complete workflow, let us follow a realistic example through the entire pipeline.
Triggered by the user interaction, the parent agent might pass the following Parent_Instruction to the analytics module:
“Summarize the main improvement potentials for chapter 1.4 Failure Prevention System in plant AbcP.”
The request looks simple to a human reader, but it already contains several semantic tasks:
- identify the requested assessment chapters,
- detect the requested content type,
- apply a row filter,
- retrieve the correct text columns,
- aggregate textual findings,
- and finally generate a meaningful interpretation (→ parent agent).
This is exactly the type of task where a pure LLM-based analysis becomes unreliable. The system therefore separates the workflow into deterministic execution steps and probabilistic interpretation steps.
5.1 Translation by Analysis Planner
The first step is performed by P_Analysis_Planner.
It translates the natural-language request into a deterministic Selection_Rule.
{
  "status": "success",
  "parent_instruction_summary": "Summarize improvement potentials for chapter 1.4 Failure Prevention System in plant AbcP.",
  "selection_rule": {
    "analysis_type": "text_summary",
    "target_selection_rules": {
      "data_category": ["potential"],
      "aggregation_allowed": ["summary"],
      "concept_execution": null,
      "chapter_id": ["1.4"]
    },
    "row_filters": {
      "Plant": ["AbcP"]
    }
  },
  "warnings": []
}
The Selection_Rule already contains the complete deterministic analysis specification:
- analysis_type = "text_summary" means that textual assessor findings must be collected instead of performing numeric calculations.
- data_category = ["potential"] restricts the analysis to improvement potentials.
- chapter_id = ["1.4"] limits the analysis to the Failure Prevention System chapter.
- row_filters = {"Plant": ["AbcP"]} restricts the dataset to the requested plant.
At this stage, no data analysis has happened yet. The result is only an execution instruction for the next step.
5.2 Execution by Analysis Engine
This Selection_Rule is handed over to P_Analysis_Engine for execution. First, the engine selects all matching target columns from the Mapping_File.
target_mapping = target_mapping[
    target_mapping[attr]
    .apply(normalize_for_matching)
    .isin(values)
]
This translates the abstract selection criteria into real dataset columns, for example:
selected_target_columns = [
    "1.4 CON L2 Improvement potentials",
    "1.4 CON L3 Improvement potentials",
    "1.4 EXE L2 Improvement potentials",
    "1.4 EXE L3 Improvement potentials"
]
Next, the row filters are applied:
filtered_df = filtered_df[
    filtered_df[filter_col]
    .apply(normalize_for_matching)
    .isin(values)
]
In this example, the dataset is reduced to assessment rows belonging to plant AbcP.
Finally, the engine collects all non-empty text entries from the selected columns.
values = [
    clean_text_value(v)
    for v in filtered_df[col].tolist()
]
values = [v for v in values if v is not None]
As we can see, the engine does not interpret the findings. It only retrieves and structures them according to the Python script.
The engine’s output is a collection of the assessors’ written statements about the value stream’s improvement potentials, returned as a JSON object.
{
  "entry_count": 6,
  "entries": [
    "Root causes are not systematically tracked.",
    "Escalation rules for recurring failures are unclear.",
    "Lessons learned are not transferred between shifts.",
    "Preventive maintenance findings are not integrated into CIP activities.",
    "Failure trends are visualized inconsistently.",
    "Problem-solving activities focus mainly on symptoms instead of root causes."
  ]
}
At this point, the system has still not generated any recommendations. It has only produced a reliable collection of relevant assessment findings. This JSON object is returned to the parent agent for interpretation and generation of the final response to the user.
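How the parent agent consumes this JSON is handled inside Copilot Studio and is not shown in the article. Purely as an illustration of the idea, the sketch below turns the engine output into a numbered grounding text that an LLM could be asked to interpret. The function findings_to_context and the wording of the closing instruction are hypothetical.

```python
import json


def findings_to_context(engine_json: str, topic: str) -> str:
    """Format the engine's JSON result as numbered grounding text for the LLM."""
    payload = json.loads(engine_json)
    lines = [f"Verified assessment findings for {topic} "
             f"({payload['entry_count']} entries):"]
    lines += [f"{i}. {entry}" for i, entry in enumerate(payload["entries"], start=1)]
    lines.append("Base every recommendation only on the findings listed above.")
    return "\n".join(lines)


engine_json = json.dumps({
    "entry_count": 2,
    "entries": [
        "Root causes are not systematically tracked.",
        "Escalation rules for recurring failures are unclear.",
    ],
})
context = findings_to_context(engine_json, "chapter 1.4 in plant AbcP")
print(context)
```

Numbering the entries makes it possible for the generated answer to refer back to individual findings, which keeps the interpretation traceable to the deterministic result.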
5.3 Interpretation by Parent Agent
In the final step, the parent agent collects all responses (possibly including additional responses from the sub-agents) and generates the final output.
The collected findings indicate that the Failure Prevention System is
currently more reactive than preventive. Most gaps are related to missing
systematic root-cause management and weak organizational learning across
shifts and teams. The highest-leverage improvements would likely come from
strengthening escalation routines, integrating preventive maintenance findings
into CIP activities, and establishing consistent cross-shift learning
mechanisms.
To summarize the central architectural idea of the system:
The LLM no longer creates the analytical foundation itself. Instead, it interprets a deterministic set of already validated findings.
The probabilistic reasoning capability of the LLM is used where it creates value: interpretation, prioritization, explanation, and communication, not data processing itself.
6 Why AI Architecture Matters
Large Language Models are naturally strong at interpretation, reasoning, and language generation, but still weak at reliable numerical analytics. Their optimization objective is plausibility, not deterministic reproducibility. Even with extensions such as “Code Interpreter”, this weakness remains visible in more complex analytical scenarios.
The good news is that this limitation can largely be compensated for through intelligent system architecture. The key is a clear separation of responsibilities: deterministic data-processing layers execute the analytical foundation, while LLMs focus on interpretation, prioritization, explanation, and communication.
In the presented approach, the most important design decision was therefore not adding more AI to the system. It was defining very carefully where probabilistic reasoning should end and deterministic execution should begin.
Reliable agentic systems will likely require exactly these kinds of hybrid architectures: combining the robustness of classical data science pipelines with the inference capabilities of Large Language Models.

