Beyond Prompting: The Power of Context Engineering

Context is everything an LLM can see before it generates an answer. This includes the prompt itself, instructions, examples, retrieved documents, tool outputs, and even the prior conversation history.

Context has a huge effect on answer quality. For example, if you ask an LLM to write a SQL query without providing the data schema, the result will almost certainly be suboptimal. Worse, if the model has no access to the database at all, it may simply hallucinate a query that doesn’t work. Even when tools are available, the model still needs extra time and effort to infer the schema before it can produce a correct answer.
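
To make the SQL example concrete, here is a minimal sketch of the same request with and without schema context. The orders table and the build_prompt helper are hypothetical, invented purely for illustration.

```python
from typing import Optional

# Hypothetical schema, invented for this illustration
SCHEMA = """CREATE TABLE orders (
  order_id INTEGER PRIMARY KEY,
  customer_id INTEGER,
  amount_usd REAL,
  created_at TEXT
);"""


def build_prompt(question: str, schema: Optional[str] = None) -> str:
    """Assemble a prompt, optionally grounding it with the table schema."""
    parts = []
    if schema:
        parts.append("Database schema:\n" + schema)
    parts.append("Task: write a SQL query that answers the question.")
    parts.append("Question: " + question)
    return "\n\n".join(parts)


question = "What was the total revenue per customer last month?"
bare_prompt = build_prompt(question)                      # model has to guess table and column names
grounded_prompt = build_prompt(question, schema=SCHEMA)   # model can see the real names
```

The only difference between the two prompts is the schema block, which is exactly the kind of context the model would otherwise have to guess or discover through tool calls.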

Because context plays such a central role in LLM-based applications, context engineering has emerged as a discipline focused on systematically optimising what information goes into a model’s prompt. The goal is to build “self-improving” systems that learn from experience without relying on expensive fine-tuning (retraining models and updating millions of parameters).

Context engineering comes with several key advantages:

it’s cheaper and doesn’t require specialised fine-tuning expertise;
context and instructions remain transparent, interpretable, and easy for humans to modify;
iteration cycles are much faster, since updates can be made instantly without retraining or redeploying models;
it’s more agile, especially when information needs to be forgotten for privacy or legal reasons.

With all these advantages, it’s not surprising that context engineering is gaining so much attention. What’s interesting, though, is how quickly the approaches themselves are evolving. In this article, I’ll walk through that evolution and then experiment with one of the newer frameworks for prompt optimisation: Agentic Context Engineering (ACE).

Evolution of context engineering approaches

Context engineering didn’t appear overnight. It has evolved through several distinct phases.

The earliest stage was static prompting. Here, prompts were hand-crafted instructions that never changed. Most of the effort went into classic prompt engineering: carefully choosing wording, structure, and formatting to squeeze better performance out of the model.

The next major step was dynamic retrieval. Instead of relying on a fixed prompt, systems began pulling in relevant information (documents, examples, or facts) at inference time. Retrieval-Augmented Generation (RAG) became one of the most popular approaches in this category. By grounding responses in external data, RAG significantly improved accuracy and reduced hallucinations, especially for knowledge-heavy tasks.

More recently, the focus has shifted towards self-improving contexts. Rather than treating context as something that’s merely retrieved or injected, these approaches allow the system to update and refine its own context based on past performance. In other words, the prompt itself becomes adaptive, evolving through reflection and feedback.

A number of frameworks have emerged around this idea. Below are some of the most influential ones.

One of the earliest and most important works is “Reflexion: Language Agents with Verbal Reinforcement Learning” by Shinn et al. This research introduced the idea that language agents can learn from mistakes through natural language reflection rather than gradient-based updates. Reflexion agents analyse feedback from previous attempts, generate verbal reflections about what went wrong, and store these reflections in an episodic memory buffer. These stored reflections then guide better decision-making in subsequent trials.
Another important contribution is “TextGrad: Automatic Differentiation via Text” by Yuksekgonul et al. TextGrad borrows concepts from deep learning optimisation (such as gradients, backpropagation, and gradient descent) but replaces numerical derivatives with natural language feedback. In this framework, LLMs generate textual critiques describing how a variable should change to improve the outcome. These “textual gradients” are then propagated backwards through the system using prompting, effectively performing a natural-language version of backpropagation across a compound AI system.
The paper “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning” by Agrawal et al. takes a different angle by combining evolutionary algorithms with language-based reflection. Prompts are treated like organisms: they mutate, compete, and evolve under selection pressure. Over time, better-performing prompts survive and propagate. This approach is implemented in DSPy, and Hugging Face provides a practical guide for applying it in real-world use cases.
Finally, “Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory” by Suzgun et al. explores test-time learning through persistent memory. In this setup, a black-box LLM is given a notebook where it can write down useful strategies, patterns, and code snippets during inference. Instead of repeatedly rediscovering the same insights, the model accumulates and reuses knowledge across tasks. This adaptive memory significantly improves performance without requiring explicit labels or human feedback.

Agentic Context Engineering

Now that we’ve covered how context engineering has evolved, let’s take a closer look at Agentic Context Engineering (ACE), one of the newer approaches and the main focus of this article. ACE is introduced in the paper “Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models” by Zhang et al., published in 2025.

The paper begins by identifying two key problems with existing self-improving context methods.

Brevity bias is the tendency for systems to oversimplify important details and gradually collapse towards short, generic prompts. While compact prompts are attractive, they often lose the nuances that actually drive good performance.
Context collapse happens when systems repeatedly rewrite the entire prompt and, in doing so, forget useful knowledge accumulated earlier. Over time, this leads to instability and regressions rather than steady improvement.

To address these issues, the authors propose Agentic Context Engineering (ACE), a framework designed for scalable and efficient context adaptation in both offline settings (such as system prompt optimisation) and online scenarios (like test-time memory adaptation). Instead of compressing knowledge into a single static prompt, ACE allows the model to continuously evolve its context by accumulating successful strategies, reflecting on failures, and organising knowledge in a structured way. Conceptually, it resembles an AI assistant that improves over time by keeping detailed notes and refining its own playbook.

At the core of ACE is an agentic learning loop that mirrors how humans learn through experimentation: try, reflect, and consolidate. The framework consists of three components:

Generator, which produces reasoning trajectories while solving tasks;
Reflector, which analyses successes and failures and distils actionable insights;
Curator, which integrates these insights into the shared context as small, incremental updates.

Rather than maintaining a single monolithic prompt, ACE organises context as a playbook made up of structured bullet points. Each bullet contains metadata (such as a unique identifier and counters tracking how often it has been helpful or harmful) as well as content representing a small, reusable unit of knowledge. This might be a general strategy, a domain-specific concept, or a common failure mode.
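
As a rough illustration of that bullet structure, here is a minimal sketch of one playbook entry. The Bullet class and its field names are my own assumptions rather than the data model of the ACE codebase; the rendered line loosely mirrors the playbook excerpt shown in the results section.

```python
from dataclasses import dataclass


@dataclass
class Bullet:
    """One playbook entry: metadata plus a small unit of knowledge (hypothetical model)."""
    bullet_id: str      # unique identifier, e.g. "cls-00001"
    section: str        # playbook section the bullet belongs to
    content: str        # the strategy, concept, or failure mode
    helpful: int = 0    # times the bullet was marked helpful
    harmful: int = 0    # times the bullet was marked harmful

    def render(self) -> str:
        """Serialise the bullet as a single line of text."""
        return f"[{self.bullet_id}] helpful={self.helpful} harmful={self.harmful} :: {self.content}"


bullet = Bullet("cls-00001", "CLASSIFICATION PRINCIPLES", "Prefer the most specific matching category.")
bullet.helpful += 1  # the counters are updated as feedback comes in
print(bullet.render())
# [cls-00001] helpful=1 harmful=0 :: Prefer the most specific matching category.
```

Keeping the counters next to the content means the evidence for or against each bullet travels with it whenever the playbook is inspected or pruned.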

Figure from the paper (Zhang et al., 2025)

The ACE workflow consists of several phases.

Generation phase. The Generator tackles new problems using the current playbook, marking which bullets were helpful or misleading.
Reflection phase. The Reflector analyses the full trajectory, extracting lessons from both successes and failures through iterative refinement.
Curation phase. The Curator turns these insights into compact “delta” updates (new or modified bullets) that are merged into the existing playbook using lightweight, non-LLM logic.
Grow-and-refine phase. New bullets are appended, existing ones are updated in place, and periodic deduplication removes redundancy using semantic embeddings.
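
The last two phases can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the playbook is a simple dict, merge_delta and deduplicate are hypothetical helpers, and a bag-of-words vector stands in for a real semantic embedding model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a semantic embedding: bag-of-words counts."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def merge_delta(playbook: dict, delta: list) -> dict:
    """Append new bullets and update known ones in place, with no LLM call."""
    merged = dict(playbook)
    for op in delta:
        merged[op["id"]] = op["content"]  # new id -> append, known id -> overwrite
    return merged


def deduplicate(playbook: dict, threshold: float = 0.9) -> dict:
    """Keep a bullet only if it is not a near-duplicate of an earlier one."""
    kept = {}
    for bullet_id, content in playbook.items():
        if all(cosine(embed(content), embed(other)) < threshold for other in kept.values()):
            kept[bullet_id] = content
    return kept


playbook = {"pat-00001": "Phrases like 'still pending' indicate a pending transaction."}
delta = [
    {"id": "pat-00002", "content": "Phrases like 'still pending' indicate a PENDING transaction."},
    {"id": "pat-00003", "content": "A card kept by the ATM maps to card_swallowed."},
]
playbook = deduplicate(merge_delta(playbook, delta))
print(sorted(playbook))  # ['pat-00001', 'pat-00003']
```

Because the merge is deterministic dictionary logic, only the reflection and curation steps need model calls, which is where the incremental design saves tokens.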

This design enables parallel processing of multiple updates and supports multi-epoch adaptation, where the same queries can be revisited to progressively strengthen the context over time.

Empirically, ACE delivers strong results. On benchmark evaluations, it outperforms other self-improving context approaches, achieving a +10.6% improvement on AI agent tasks and a +8.6% gain in specialised domains such as finance.

Figure from the paper (Zhang et al., 2025)

Beyond accuracy, ACE is also more cost-efficient thanks to its incremental update mechanism, showing 83.6% lower token costs compared to baseline methods.

Together, these results position ACE as a practical and scalable step forward in building self-improving LLM systems.

Using ACE for banking intent data

The ACE framework looks promising on paper, so the next step is to see how it performs in practice. Fortunately, the authors have shared an open-source implementation on GitHub, which gives us a solid starting point.

Loading the data

To keep the experiment focused, I decided to apply ACE to a classification task. I’m using a publicly available dataset of banking intents released by PolyAI. This dataset reflects a very common real-world problem: identifying customer intent when someone contacts customer support. Accurate intent classification is essential for routing requests to the right team, triggering semi-automated responses, or simply tracking recurring issues.

In this dataset, each customer message (for example, “I’m not sure why my card didn’t work”) needs to be mapped to a specific banking intent, such as declined_card_payment. In total, there are 77 distinct intent categories.

To keep the experiment manageable, I sampled 500 examples from the dataset and split them into training, test, and validation sets. Below is the code used to load the data and create the splits.

import random

import pandas as pd

full_df = pd.read_csv('./poly_ai_banking_data/train.csv')

# params
total_number_of_samples = 500
train_share = 0.5
test_share = 0.4
val_share = 0.1

# draw a reproducible random sample from the full dataset
sample_df = (
  full_df.sample(n=total_number_of_samples, random_state=42)
  .reset_index(drop=True)
)

# assign each row to a split at random; split sizes only approximate the shares
random.seed(42)
sample_df['group'] = random.choices(['train', 'test', 'val'],
  weights=(train_share, test_share, val_share), k=total_number_of_samples)

train_df = sample_df[sample_df['group'] == 'train'].reset_index(drop=True)
test_df = sample_df[sample_df['group'] == 'test'].reset_index(drop=True)
val_df = sample_df[sample_df['group'] == 'val'].reset_index(drop=True)
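
One detail worth knowing: drawing a label per row with weights gives split sizes that only approximate the requested shares. If exact sizes matter, a possible alternative is to build the label list first and shuffle it. The assign_groups helper below is a hypothetical sketch, not part of the original workflow.

```python
import random


def assign_groups(n: int, shares: dict, seed: int = 42) -> list:
    """Return a shuffled list of n group labels with exact per-group counts."""
    names = list(shares)
    counts = {name: int(n * shares[name]) for name in names}
    # Give any rounding remainder to the first group so the counts sum to n
    counts[names[0]] += n - sum(counts.values())
    labels = [name for name in names for _ in range(counts[name])]
    random.Random(seed).shuffle(labels)  # local RNG, global random state untouched
    return labels


groups = assign_groups(500, {"train": 0.5, "test": 0.4, "val": 0.1})
print(groups.count("train"), groups.count("test"), groups.count("val"))  # 250 200 50
```

The resulting list can be assigned directly to the group column of the sampled DataFrame in place of the weighted draw.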

Extending ACE to banking intent data

The next step is to extend the ACE framework so it can work with our banking intent dataset. Fortunately, the authors provide a detailed guide that makes this process relatively straightforward.

In addition to plugging in the new dataset, I made a couple of small modifications to the core framework to support Anthropic models and configurable temperature settings. You can find the complete, modified version of the code on GitHub.

Preparing the data

The first thing we need to do is prepare the dataset in the format that ACE expects. I saved the training, validation, and test splits as CSV files under banking/data. Each example contains:

text: the customer support message,
category: the target intent label we want to predict,
group: an auxiliary field indicating whether the example belongs to the train, test, or validation set.

The group field won’t be used later by the framework itself, but it’s convenient for dataset management and reproducibility.

Here’s what the data format looks like.

text,category,group
Is it possible for me to change my PIN number?,change_pin,test
What is the $1 transaction on my account?,extra_charge_on_statement,test
How much does top up fees cost?,top_up_by_card_charge,test
I live in the EU - can I get a card?,country_support,test

Next, we need to tell ACE where to find each split. This is done by specifying dataset paths in banking/data/task_config.json.

{
  "banking": {
    "train_data": "./banking/data/train.csv",
    "val_data": "./banking/data/val.csv",
    "test_data": "./banking/data/test.csv"
  }
}

Implementing the DataProcessor

To integrate a new task, the framework requires a custom DataProcessor module. According to the guide, this involves implementing three core methods: process_task_data, answer_is_correct and evaluate_accuracy.

In addition, we need a helper function to load the raw data from disk. Let’s start with that.

Below is the implementation of the data-loading function. It checks that the CSV file exists, reads it, and converts each row into a dictionary that the rest of the pipeline can work with.

import csv
import os
from typing import Any, Dict, List

def load_data(data_path: str) -> List[Dict[str, Any]]:
  """
  Load and process data from a CSV file.
  
  Expected CSV format: text,category,group (with header)
  
  Args:
    data_path: Path to the CSV file
      
  Returns:
    List of dictionaries containing the data
  """
  if not os.path.exists(data_path):
    raise FileNotFoundError(f"Data file not found: {data_path}")
  
  data = []
  with open(data_path, 'r', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
      data.append({
        'text': row['text'],
        'category': row['category'],
        'group': row.get('group', '')  # optional column
      })
  
  print(f"Loaded {len(data)} samples from {data_path}")
  return data

With the data-loading function in place, we can move on to implementing the remaining DataProcessor methods.

The main purpose of process_task_data is to convert the raw dataset into ACE’s standardised input format.

ACE expects each example to contain three fields: context, question, and target. In our case, the mapping is fairly simple. We map the intent category directly to target, and we leave context empty since there’s no additional background information needed for classification.

The most important part here is the question. We added extra context to make it clear to the LLM that it should classify the query rather than answer the question directly, while also providing the list of available topics to guide the LLM’s response.

def process_task_data(self, raw_data: List[Dict]) -> List[Dict]:
  """
  Convert raw CSV data into standardized format for ACE.
  
  Args:
    raw_data: Raw data loaded from CSV (list of dicts with 'text', 'category')
      
  Returns:
    List of dicts with keys: 'context', 'question', 'target'
  """
  processed_data = []
  
  # Gather the list of topics to include in the question
  topics_list = ", ".join(self.allowed_topics)
  
  for item in raw_data:
    customer_query = item.get('text', '')
    ground_truth_topic = item.get('category', '')
    
    # The question provides the classification task instruction
    question = (
      f"Classify the following banking customer support query into one of the predefined topics.\n\n"
      f"Customer Query: {customer_query}\n\n"
      f"Available Topics: {topics_list}\n\n"
      f"Respond with ONLY the topic name, nothing else."
    )
    
    processed_item = {
      "context": "",  # No additional context needed
      "question": question,
      "target": ground_truth_topic,
      "others": {
        "original_text": customer_query,
        "task": self.task_name,
      }
    }
    
    processed_data.append(processed_item)
  
  return processed_data

The next method, answer_is_correct, checks whether a model’s prediction matches the ground truth label. Since we explicitly instruct the LLM to respond with only the category name, a simple case-insensitive string comparison is sufficient here.

def answer_is_correct(self, predicted: str, ground_truth: str) -> bool:
  """
  Check if the predicted topic matches the ground truth.
  Uses simple case-insensitive comparison.
  
  Args:
    predicted: Model's predicted topic
    ground_truth: Ground truth topic
      
  Returns:
    bool: True if prediction is correct, False otherwise
  """
  return predicted.lower().strip() == ground_truth.lower().strip()
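
If the model occasionally wraps its answer in quotes or adds a trailing full stop, the exact comparison above will count it as a miss. A slightly more forgiving normaliser is one option. The normalise_label helper below is a hypothetical sketch and not part of the workflow described here; mapping spaces and hyphens to underscores is an assumption based on the snake_case labels in this dataset.

```python
import re


def normalise_label(raw: str) -> str:
    """Lower-case a label and strip surrounding whitespace, quotes, backticks and full stops."""
    cleaned = raw.strip(" \t\r\n'\"`.")
    # Assumption: labels are snake_case, so spaces and hyphens become underscores
    return re.sub(r"[\s\-]+", "_", cleaned.lower())


print(normalise_label(' "Change_PIN". '))   # change_pin
print(normalise_label("card arrival"))      # card_arrival
```

Whether this leniency is desirable depends on the goal: strict matching also tests how well the model follows the output format, while a normaliser isolates the classification decision itself.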

The final method we need to implement is evaluate_accuracy, which computes overall classification accuracy across multiple predictions. There’s nothing fancy going on here. We simply calculate the fraction of cases where answer_is_correct(prediction, ground_truth) returns True.

def evaluate_accuracy(self, predictions: List[str], ground_truths: List[str]) -> float:
  """
  Calculate classification accuracy across multiple predictions.
  
  Args:
    predictions: List of model predictions
    ground_truths: List of ground truth topics
      
  Returns:
    Accuracy as a float between 0 and 1
  """
  if len(predictions) != len(ground_truths):
    raise ValueError("Predictions and ground truths must have same length")
  
  if not predictions:
    return 0.0
  
  correct = sum(
    1 for pred, truth in zip(predictions, ground_truths)
    if self.answer_is_correct(pred, truth)
  )
  
  return correct / len(predictions)
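
With 77 intents, a single accuracy number can hide which categories the model confuses. As an optional complement, here is a small sketch of a per-category breakdown using the same case-insensitive comparison. The per_category_accuracy function is hypothetical and is not one of the methods the framework requires.

```python
from collections import defaultdict
from typing import Dict, List


def per_category_accuracy(predictions: List[str], ground_truths: List[str]) -> Dict[str, float]:
    """Return the fraction of correct predictions for each ground-truth category."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for pred, truth in zip(predictions, ground_truths):
        label = truth.lower().strip()
        totals[label] += 1
        if pred.lower().strip() == label:
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}


scores = per_category_accuracy(
    ["change_pin", "card_arrival", "Change_PIN "],
    ["change_pin", "get_physical_card", "change_pin"],
)
print(scores)  # {'change_pin': 1.0, 'get_physical_card': 0.0}
```

Sorting the result by score surfaces the weakest categories, which is useful later when reading the disambiguation bullets in the playbook.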

Putting together the workflow script

With the DataProcessor in place, the next step is to assemble a complete run script for ACE. I created a run_ace_workflow script that accepts several key arguments:

api_provider selects the language model API to use ('anthropic', 'openai', 'together', or 'sambanova'), defaulting to 'anthropic'.
generator_model specifies the model for the Generator agent (default: 'claude-haiku-4-5').
reflector_model specifies the model for the Reflector agent (default: 'claude-sonnet-4-5').
curator_model specifies the model for the Curator agent (default: 'claude-sonnet-4-5').
max_train and max_test are optional limits on the train and test set sizes, useful for quick experiments or debugging.
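
For reference, an argument parser matching the list above could look like the sketch below. The argument names and defaults come from that list; the use of argparse, the build_parser helper, and the help texts are my assumptions about how such a script might be wired up.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a command-line parser for the arguments described above."""
    parser = argparse.ArgumentParser(description="Run the ACE workflow on banking intent data")
    parser.add_argument("--api_provider", default="anthropic",
                        choices=["anthropic", "openai", "together", "sambanova"])
    parser.add_argument("--generator_model", default="claude-haiku-4-5")
    parser.add_argument("--reflector_model", default="claude-sonnet-4-5")
    parser.add_argument("--curator_model", default="claude-sonnet-4-5")
    parser.add_argument("--max_train", type=int, default=None,
                        help="optional cap on the number of training samples")
    parser.add_argument("--max_test", type=int, default=None,
                        help="optional cap on the number of test samples")
    return parser


# Parse an explicit argument list here so the sketch runs outside a shell
args = build_parser().parse_args(["--max_train", "50"])
print(args.api_provider, args.max_train, args.max_test)  # anthropic 50 None
```

In a real script, parse_args() would be called without arguments so that values are read from the command line.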

Let’s discuss how this script actually works. The script starts by loading the banking intent data and initialising the DataProcessor. Here’s the helper function I wrote for this.

import os

def load_banking_data(max_train=None, max_test=None):
  """Load and process banking dataset."""
  from banking.data_processor import DataProcessor, load_data
  
  base_path = os.path.dirname(__file__)
  data_path = os.path.join(base_path, "data")
  
  # Load raw data
  train_raw = load_data(os.path.join(data_path, "train.csv"))
  val_raw = load_data(os.path.join(data_path, "val.csv"))
  test_raw = load_data(os.path.join(data_path, "test.csv"))
  
  # Limit samples if specified
  if max_train:
    train_raw = train_raw[:max_train]
    val_raw = val_raw[:max(max_train // 4, 10)]
  if max_test:
    test_raw = test_raw[:max_test]
  
  # Process data
  processor = DataProcessor(task_name="banking")
  train_samples = processor.process_task_data(train_raw)
  val_samples = processor.process_task_data(val_raw)
  test_samples = processor.process_task_data(test_raw)
  
  return train_samples, val_samples, test_samples, processor

train_samples, val_samples, test_samples, processor = load_banking_data(
  max_train=args.max_train,
  max_test=args.max_test
)

The next step is to define a playbook template. This is important because the current ACE implementation can’t dynamically create new sections, so we predefine the structure to guide the model. Here’s the template I used for the banking domain.

BANKING_PLAYBOOK_TEMPLATE = """
## GENERAL
## CLASSIFICATION PRINCIPLES
## CATEGORY DISAMBIGUATION
## BANKING DOMAIN KNOWLEDGE
## COMMON PATTERNS
## HANDLING AMBIGUOUS QUERIES
## COMMON MISTAKES TO AVOID
## OTHERS
"""

With the data and template ready, we can initialise the ACE object with the main parameters.

ace_system = ACE(
  api_provider=args.api_provider,
  generator_model=args.generator_model,
  reflector_model=args.reflector_model,
  curator_model=args.curator_model,
  max_tokens=4096,
  initial_playbook=BANKING_PLAYBOOK_TEMPLATE,
  use_bulletpoint_analyzer=True, # enabling deduplication of bullet points in the playbook
  generator_temperature=0.1, # prioritising consistency for generator
  reflector_temperature=0.7, # prioritising creativity for reflector and curator
  curator_temperature=0.7,
)

Finally, we define a function to run the ACE training workflow, which includes initial evaluation, iterative reflection, curation, and final evaluation.

def run_ace_training(ace_system, train_samples, val_samples, test_samples, processor, results_dir):
  """Train ACE to improve the playbook (includes initial and final evaluations)."""
  config = {
    'num_epochs': 1,
    'max_num_rounds': 3,  # max reflection rounds per sample
    'curator_frequency': 5,  # run curator every 5 steps
    'eval_steps': max(len(train_samples) // 10, 10),  # evaluate 10 times during training
    'save_steps': max(len(train_samples) // 10, 10),
    'playbook_token_budget': 80000,
    'task_name': 'banking_ace',
    'json_mode': False,
    'no_ground_truth': False,
    'save_dir': os.path.join(results_dir, "training"),
    'test_workers': 10,
  }
  
  results = ace_system.run(
    mode='offline',
    train_samples=train_samples,
    val_samples=val_samples,
    test_samples=test_samples,
    data_processor=processor,
    config=config
  )
  
  # Extract results
  initial_acc = results.get('initial_test_results', {}).get('accuracy', 0)
  final_acc = results.get('final_test_results', {}).get('accuracy', 0)
  training_results = results.get('training_results', {})
  
  return ace_system.best_playbook, results

best_playbook, training_results = run_ace_training(
  ace_system, train_samples, val_samples, test_samples, 
  processor, results_dir
)

And that’s it! That’s all the core logic we need to run ACE. I’ve added some logging on top of the workflow for convenience, but it’s not essential to the main functionality.

Results

Let’s take a look at the results and see how everything comes together. First, check out the best playbook, which you can find at results/banking_{dt}/best_playbook.txt. The playbook is organised into itemised bullets, grouped according to the categories we defined in our initial template. Each bullet contains detailed instructions and strategies, along with metadata showing how often it was marked helpful or harmful. This structure makes it easy to see which topics and strategies the system found most useful during training.

## GENERAL
## CLASSIFICATION PRINCIPLES
[cls-00001] useful=1 dangerous=0 :: Temporal indicators like 'was capable of earlier than', 'labored beforehand', or 'used to work' are sturdy alerts that the problem is restricted to the present transaction somewhat than a common system functionality downside. These phrases counsel a change in standing for a particular entity (beneficiary, card, account) somewhat than total performance.
[cls-00002] useful=18 dangerous=4 :: Apply specificity hierarchy: when a number of classes might apply, select essentially the most particular one which matches the contextual clues. For instance, beneficiary_not_allowed (particular to recipient) is extra particular than declined_transfer (common failure).
[cls-00009] useful=0 dangerous=3 :: Specificity hierarchy works bidirectionally: select particular classes when contextual clues level to a selected transaction kind, however use common classes (like 'extra_charge_on_statement') when the question lacks enough context to find out the precise nature of the transaction. Do not power specificity when the client's question is inherently common.
[cls-00017] useful=5 dangerous=1 :: Course of-oriented vs Standing-tracking distinction: Differentiate between questions on HOW to acquire/purchase one thing (process-oriented) versus questions on WHEN one thing will arrive or WHETHER it has arrived (status-tracking). Course of questions deal with the steps and elements wanted, whereas standing questions deal with timing and supply affirmation. Use this distinction to decide on between acquisition classes and monitoring/arrival classes.
## CATEGORY DISAMBIGUATION
[dis-00003] useful=1 dangerous=0 :: declined_transfer vs beneficiary_not_allowed: If the client mentions they may switch earlier than however all of a sudden can not, this strongly signifies beneficiary_not_allowed (recipient is blocked/restricted) somewhat than declined_transfer (common switch failure on account of funds, limits, or system errors).
[dis-00011] useful=11 dangerous=0 :: pending_* vs failed_* vs declined_*: Transaction state is crucial for classification. 'Hasn't gone via but' or 'taking too lengthy' = pending state. 'Did not work', 'was declined', or 'was rejected' = failed/declined state. 'Cash got here again' or 'was returned' = reverted state. Match the class to the precise transaction state described.
[dis-00012] useful=0 dangerous=1 :: country_support vs supported_cards_and_currencies: Queries about geographic availability ('which international locations', 'the place can I', 'what areas') needs to be categorized as 'country_support'. In distinction, 'supported_cards_and_currencies' is for questions on card varieties (Visa, Mastercard) and foreign money choices, not geographic availability.
[dis-00014] useful=2 dangerous=0 :: Money withdrawal points: Distinguish by transaction state and end result: 'pending_cash_withdrawal' (not accomplished but, nonetheless processing), 'declined_cash_withdrawal' (rejected, no money acquired), 'cash_withdrawal_not_recognised' (buyer does not recall the transaction), and 'wrong_amount_of_cash_received' (transaction accomplished however incorrect quantity disbursed). If money was acquired however the quantity was flawed, use essentially the most particular class: wrong_amount_of_cash_received.
[dis-00015] useful=3 dangerous=3 :: card_arrival vs get_physical_card: Distinguish between status-tracking questions (card_arrival) and process-acquisition questions (get_physical_card). 'card_arrival' is for monitoring present orders ('Has my card arrived?', 'The place is my card?'). 'get_physical_card' encompasses the complete means of acquiring a bodily card together with all elements like PIN ('The place can I discover my PIN?', 'How do I get my card and PIN?'). Questions on lacking PINs with 'have not gotten it but' point out the client is within the acquisition course of, not simply monitoring supply.
[dis-00021] useful=1 dangerous=0 :: card_payment_not_recognised vs extra_charge_on_statement: When a buyer mentions a 'fee' they do not acknowledge or did not make ('fee I by no means submitted', 'fee I did not authorize'), classify as 'card_payment_not_recognised' as a result of 'fee' is a particular transaction kind. Use 'extra_charge_on_statement' solely when the client describes sudden quantities, charges, or expenses WITHOUT specifying the transaction kind (e.g., 'I see an additional $5 on my assertion', 'there is a unusual cost' with out mentioning fee/switch/withdrawal).
[dis-00024] useful=0 dangerous=1 :: Price/cost class specificity: When prospects ask about charges or expenses, prioritize transaction-type-specific payment classes over 'extra_charge_on_statement'. If the question mentions a particular transaction kind (switch, fee, withdrawal, top-up), use the corresponding particular payment class: 'transfer_fee_charged' for switch charges, 'card_payment_fee_charged' for fee charges, 'atm_fee_charged' for withdrawal charges, 'top_up_fee' for top-up charges. Reserve 'extra_charge_on_statement' just for payment queries the place no particular transaction kind is talked about (e.g., 'Why is there an additional $5 cost?' with out context).
[dis-00026] useful=0 dangerous=0 :: receiving_money vs transfer_into_account: Distinguish between passive receipt and lively switch. 'receiving_money' is for queries about receiving funds FROM one other get together (passive, initiated by sender). 'transfer_into_account' is for queries in regards to the buyer initiating a switch TO add funds to their very own account (lively, self-initiated). Context clues: empty/low stability + asking about transfers = doubtless transfer_into_account. Questions on 'can I switch funds' within the context of needing so as to add cash = transfer_into_account, not receiving_money.
[dis-00029] useful=0 dangerous=0 :: beneficiary_not_allowed vs declined_transfer: When a question explicitly mentions 'beneficiary' or 'recipient' mixed with restriction language ('not allowed', 'blocked', 'restricted', 'can not add', 'unable so as to add'), classify as 'beneficiary_not_allowed' even with out temporal indicators. The mixture of the precise banking entity time period (beneficiary/recipient) with restriction language is a powerful direct sign for recipient-level restrictions somewhat than common switch failures.
## BANKING DOMAIN KNOWLEDGE
[bank-00006] useful=0 dangerous=0 :: In banking, when a beforehand profitable switch all of a sudden fails, frequent causes embrace: beneficiary being flagged/blocked by fraud techniques, beneficiary account restrictions, or beneficiary being faraway from allowed checklist. These are distinct from common switch declines on account of inadequate funds or system errors.
[bank-00008] useful=0 dangerous=6 :: Small sudden quantities (like £1, £0.01) showing on statements typically point out authorization holds, verification expenses, or miscellaneous charges. When prospects query these with out further context, they need to be categorized as 'extra_charge_on_statement' somewhat than extra particular transaction varieties.
[bank-00018] useful=0 dangerous=0 :: 'card_swallowed' is the banking trade time period for ATM card retention situations the place the machine retains/retains the client's card. This is applicable when playing cards are caught, will not come out, or are held by the ATM, whatever the particular phrasing utilized by the client.
[bank-00020] useful=10 dangerous=4 :: Banking terminology has a specificity hierarchy for transaction references. Particular transaction kind key phrases embrace: 'fee' (card funds), 'switch' (cash transfers), 'withdrawal' (money withdrawals), 'top-up' (account funding), 'direct debit', 'standing order'. Generic phrases embrace: 'cost', 'quantity', 'transaction', 'payment'. When a buyer makes use of a particular transaction kind key phrase, it offers enough context to categorise into transaction-type-specific classes somewhat than common classes.
## COMMON PATTERNS
[pat-00004] helpful=0 harmful=0 :: Pattern: 'It worked before, now it doesn't' + transfer context = likely beneficiary-level restriction rather than system-level decline. The previous success indicates the account and transfer mechanism are functional, pointing to a specific restriction on the current recipient.
[pat-00007] helpful=3 harmful=6 :: Pattern: Customer describes transaction as 'strange', 'unexpected', 'unexplained', or asks 'what is this charge' on their statement without providing specific transaction type context (transfer, payment, withdrawal, etc.) = classify as 'extra_charge_on_statement'. This is the appropriate general category when the nature of the charge is unclear.
[pat-00010] helpful=8 harmful=1 :: Pattern: Phrases like 'hasn't gone through yet', 'still waiting', 'not completed', or 'still pending' indicate a transaction in PENDING state, not a FAILED state. Choose 'pending_*' categories over 'failed_*' or 'declined_*' categories when these language cues are present.
[pat-00013] helpful=0 harmful=2 :: Pattern: Questions with geographic scope indicators like 'which countries', 'where can I', 'what areas', or 'in what areas' are asking about service availability by geography = classify as 'country_support'. The core intent is understanding geographic reach of services.
[pat-00016] helpful=2 harmful=9 :: Pattern: 'Where can I find' or 'How do I get' phrasing indicates process-oriented questions seeking information about obtaining or acquiring something, not status-tracking questions. These should typically map to acquisition/setup categories (like 'get_physical_card') rather than delivery/tracking categories (like 'card_arrival' or 'card_delivery_estimate').
[pat-00019] helpful=0 harmful=0 :: Pattern: Phrases indicating a card is physically retained by an ATM ('card stuck in ATM', 'card won't come out', 'ATM kept my card', 'get my card out of ATM', 'retrieve card from machine') should be classified as 'card_swallowed'. The key indicator is the card being physically held/retained by the machine rather than other card issues like damage, loss, or functionality problems.
[pat-00022] helpful=1 harmful=0 :: Pattern: Specific transaction type keyword + 'not recognized'/'didn't make'/'never submitted' = use transaction-type-specific 'not_recognised' category. Examples: 'payment I didn't make' → card_payment_not_recognised; 'transfer I don't recognize' → transfer_not_received_by_recipient or related transfer issue; 'withdrawal I never made' → cash_withdrawal_not_recognised. The presence of a specific transaction type keyword (payment, transfer, withdrawal) is sufficient context to avoid general categories.
[pat-00025] helpful=1 harmful=0 :: Pattern: Transaction type keyword + timing question ('how long', 'when will', 'how much time') + geographic mention = prioritize transaction-specific timing category (e.g., 'transfer_timing', 'card_delivery_estimate'). Treat geographic mentions as contextual information about the transaction origin/destination unless the query explicitly asks about service availability ('which countries', 'where can I use', 'is it available in'). Example: 'transfer from China, how long?' → 'transfer_timing' (not 'country_support').
[pat-00027] helpful=0 harmful=0 :: Pattern: Account balance context + transfer inquiry = intent to add funds. When a customer mentions their account is empty/has no funds/needs money AND asks about transferring, they're asking about transferring funds INTO their account (transfer_into_account), not about receiving money from others (receiving_money). The account state provides critical context for disambiguating transfer-related intents.
## HANDLING AMBIGUOUS QUERIES
## COMMON MISTAKES TO AVOID
[err-00005] helpful=2 harmful=0 :: Don't default to general categories (like declined_transfer) when temporal context ('was able to before') suggests a more specific issue. The temporal change is a key discriminator that often points to entity-specific restrictions (beneficiary, card, account) rather than general failures.
[err-00023] helpful=2 harmful=0 :: Don't default to 'extra_charge_on_statement' when the customer mentions a specific transaction type (payment, transfer, withdrawal, top-up) they don't recognize. 'extra_charge_on_statement' should be reserved for truly ambiguous cases where no transaction type is specified. When a customer says 'payment I never made', the word 'payment' provides sufficient context to use 'card_payment_not_recognised' instead of the generic 'extra_charge_on_statement'.
[err-00028] helpful=0 harmful=0 :: Don't apply pattern rules or domain knowledge that are irrelevant to the query. If a query has no geographic indicators, don't apply geographic patterns. If there's no mention of fees, don't apply fee-related rules. Focus on rules that directly match the semantic content and context of the customer's query rather than grasping for any applicable rule. Irrelevant rule application leads to misclassification.
## OTHERS
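
Since every playbook bullet carries helpful and harmful counters, one quick way to inspect a playbook is to parse it and list the bullets that were tagged harmful more often than helpful. The following is a minimal sketch, not part of the ACE codebase: it assumes each bullet follows the `[id] helpful=N harmful=M :: text` line format, and the names `parse_playbook` and `net_harmful` are hypothetical.

```python
import re
from typing import Dict, List

# Assumed bullet format: "[pat-00016] helpful=2 harmful=9 :: some text"
BULLET_RE = re.compile(
    r"^\[(?P<id>[a-z]+-\d+)\]\s+helpful=(?P<helpful>\d+)"
    r"\s+harmful=(?P<harmful>\d+)\s+::\s+(?P<text>.*)$"
)


def parse_playbook(playbook: str) -> List[Dict]:
    """Return one record per bullet line; section headers are skipped."""
    bullets = []
    for line in playbook.splitlines():
        match = BULLET_RE.match(line.strip())
        if match is None:
            continue  # "## SECTION" headers and blank lines
        bullets.append({
            "id": match["id"],
            "helpful": int(match["helpful"]),
            "harmful": int(match["harmful"]),
            "text": match["text"],
        })
    return bullets


def net_harmful(bullets: List[Dict]) -> List[str]:
    """IDs of bullets tagged harmful more often than helpful."""
    return [b["id"] for b in bullets if b["harmful"] > b["helpful"]]


# Counters copied from the banking playbook; bullet texts are shortened here.
sample = """## BANKING DOMAIN KNOWLEDGE
[bank-00008] helpful=0 harmful=6 :: Small unexpected amounts ...
## COMMON PATTERNS
[pat-00007] helpful=3 harmful=6 :: Pattern: Customer describes transaction as 'strange' ...
[pat-00010] helpful=8 harmful=1 :: Pattern: Phrases like 'hasn't gone through yet' ...
[pat-00016] helpful=2 harmful=9 :: Pattern: 'Where can I find' or 'How do I get' ...
"""

bullets = parse_playbook(sample)
print(net_harmful(bullets))
```

Bullets flagged this way are natural candidates for a manual review when reading through the playbook.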

For a deeper look at how each agent operates, you can explore the detailed execution logs at results/banking_{dt}/training/ace_run_{dt}/detailed_llm_logs. I highly recommend browsing these logs. At the very least, skim through the prompts and see how the Generator, Reflector, and Curator interact. It's a great way to understand how ACE evolves the context step by step.

Of course, the most interesting metric is accuracy. You can find the initial and final test results in results/banking_{datetime}/training/initial_test_results.json and results/banking_{datetime}/training/final_test_results.json.

# initial results
{
  "test_results": {
    "accuracy": 0.7512437810945274,
    "correct": 151,
    "total": 201,
    "no_answer": 0
  },
  "error_log": {
    "accuracy": 0.7512437810945274,
    "errors": [
      {
        "index": 2,
        "prediction": "declined_card_payment",
        "ground_truth": "declined_transfer"
      },
      {
        "index": 9,
        "prediction": "top_up_limits",
        "ground_truth": "automatic_top_up"
      },
      {
        "index": 7,
        "prediction": "transfer_not_received_by_recipient",
        "ground_truth": "balance_not_updated_after_cheque_or_cash_deposit"
      },
      ...
    ]
  }
}

# final results
{
  "test_results": {
    "accuracy": 0.736318407960199,
    "correct": 148,
    "total": 201,
    "no_answer": 0
  },
  "error_log": {
    "accuracy": 0.736318407960199,
    "errors": [
      {
        "index": 9,
        "prediction": "top_up_limits",
        "ground_truth": "automatic_top_up"
      },
      {
        "index": 2,
        "prediction": "declined_card_payment",
        "ground_truth": "declined_transfer"
      },
      {
        "index": 7,
        "prediction": "pending_transfer",
        "ground_truth": "balance_not_updated_after_cheque_or_cash_deposit"
      },
      ...
    ]
  }
}

The results, admittedly, are not very impressive. In fact, accuracy dropped slightly after optimisation, from 75.1% to 73.6%. But even negative results can teach us something valuable.
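
To put the drop in perspective, it helps to recompute accuracy from the raw counts and to compare the two error logs by sample. The sketch below is illustrative only: the counts come from the two result files above, the error entries are just the three shown in each excerpt, and treating `index` as a stable sample identifier is an assumption. The helper names are hypothetical.

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of correctly classified test samples."""
    return correct / total


def compare_errors(initial_errors, final_errors):
    """Group misclassified sample indices by how they changed between runs."""
    before = {e["index"]: e for e in initial_errors}
    after = {e["index"]: e for e in final_errors}
    shared = before.keys() & after.keys()
    return {
        "fixed": sorted(before.keys() - after.keys()),
        "introduced": sorted(after.keys() - before.keys()),
        "persistent": sorted(shared),
        # Still wrong, but the model now predicts a different wrong label.
        "changed_prediction": sorted(
            i for i in shared
            if before[i]["prediction"] != after[i]["prediction"]
        ),
    }


initial_acc = accuracy(151, 201)
final_acc = accuracy(148, 201)
print(f"initial={initial_acc:.3f} final={final_acc:.3f} "
      f"lost_answers={151 - 148}")

# Only the entries visible in the excerpts above.
initial_errors = [
    {"index": 2, "prediction": "declined_card_payment"},
    {"index": 9, "prediction": "top_up_limits"},
    {"index": 7, "prediction": "transfer_not_received_by_recipient"},
]
final_errors = [
    {"index": 9, "prediction": "top_up_limits"},
    {"index": 2, "prediction": "declined_card_payment"},
    {"index": 7, "prediction": "pending_transfer"},
]
diff = compare_errors(initial_errors, final_errors)
print(diff)
```

On a 201-sample test set, the whole difference amounts to three answers, so the change is small in absolute terms.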

There are a few likely reasons why ACE didn't provide much benefit in this case:

Limited data per category. We only had 248 training examples, 201 test examples, and 51 validation examples. However, our task involved 77 different categories. With so few examples per class, the model simply may not have had enough data to learn meaningful distinctions.
Small and unrepresentative validation set. With only 51 examples, the validation set might not have captured the full diversity of customer queries, making it difficult for ACE to generate useful reflections and improvements.
Task complexity. Our use case is relatively straightforward. As the authors note, ACE tends to shine in scenarios with large amounts of highly specialised domain knowledge or more complex agentic workflows, where reflection and iterative context refinement can significantly improve performance.
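
The first two points are easy to quantify. A small calculation, using only the split sizes and the class count quoted above, shows the average number of examples per class in each split:

```python
NUM_CLASSES = 77
split_sizes = {"train": 248, "test": 201, "validation": 51}

# Average examples per class; the real distribution may be less even.
examples_per_class = {
    name: size / NUM_CLASSES for name, size in split_sizes.items()
}

for name, avg in examples_per_class.items():
    print(f"{name}: {avg:.2f} examples per class on average")
```

With fewer than one validation example per class on average, the validation set cannot contain every category even once.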

Using ACE for code generation

Inspired by the previous experiment, I decided to give ACE another try, this time on the Mostly Basic Python Problems dataset (available under the cc-by-4.0 license). Hopefully, the results would be more promising with a code generation task.

Data overview

Each example in the dataset contains three key components:

Question, for example, “Write a function to reverse words in a given string.”
Ground truth implementation: Python reference code. For example, for the question above

def reverse_words(s):
  return ' '.join(reversed(s.split()))

Test cases are assertions to validate the generated code, such as

[
    'assert reverse_words("python program")==("program python")',
    'assert reverse_words("java language")==("language java")',
    'assert reverse_words("indian man")==("man indian")'
]

Adding a new task to the ACE framework

We can follow similar steps to extend the ACE framework to handle coding tasks. I won't go into all the implementation details here, since you can find the full code on GitHub. However, it's worth highlighting the key differences compared to the banking intent example.

Coding tasks are inherently more complex. In the banking intent case, the model outputs a single class out of 77, which is easy to compare directly with the ground truth. In code generation, however, the LLM can produce arbitrary code, so we cannot simply check for exact matches. Instead, we need to run tests to determine whether the generated solution is correct.

# banking

def answer_is_correct(self, predicted: str, ground_truth: str) -> bool:
  return predicted.lower() == ground_truth.lower()

# coding
def answer_is_correct(self, predicted: str, ground_truth: str,
                    test_list: List[str], idx: int, save_dir: str) -> bool:
  code = extract_code_from_response(predicted)
  result = execute_code_with_tests(code, test_list, timeout=5)
  return result['success']

Because of this added complexity, I had to implement several enhancements in the DataProcessor for code generation:

Code extraction. LLMs often include extra context around the code, such as Markdown formatting (```python ...```). We need to clean and extract the code to ensure it can compile correctly.
Safe execution. Since we run the generated code to verify correctness, it's important to implement basic safety measures, such as timeouts and isolated execution environments.
Providing full context. It's crucial to include all necessary information in the question. If we just ask the LLM to generate code, it's unlikely to pass the tests because it won't be clear what function name or signature is expected. That's why all of these details are added to the question when standardising the data in the process_task_data function.

question = (
  f"Write a Python function to solve the following problem:\n\n"
  f"Problem: {problem_text}\n\n"
  f"Your code must pass the following test cases:\n"
  f"{test_cases_formatted}\n\n"
  f"Important: The test cases will be executed against your code. "
  f"Make sure your function name and signature match what the tests expect.\n\n"
  f"Respond with ONLY the Python code, no explanations."
)
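
The code extraction step mentioned above can be sketched as follows. This is a minimal illustration rather than the actual helper used in the experiment: the function name `extract_code_sketch` is hypothetical, and it assumes the model either wraps its answer in a single Markdown code fence or returns bare code.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built here to keep this example readable

# Optional language tag after the opening fence, then everything up to the
# closing fence (non-greedy, spanning multiple lines).
FENCED_BLOCK = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_code_sketch(response: str) -> str:
    """Return the first fenced code block, or the whole response if unfenced."""
    match = FENCED_BLOCK.search(response)
    if match:
        return match.group(1).strip()
    return response.strip()


fenced = (
    "Here is the solution:\n"
    + FENCE + "python\n"
    + "def reverse_words(s):\n"
    + "    return ' '.join(reversed(s.split()))\n"
    + FENCE + "\nHope this helps!"
)
print(extract_code_sketch(fenced))
```

Falling back to the raw response keeps the helper usable when the model follows the instruction to reply with code only.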

In the original ACE implementation, the Reflector compared generated code directly with the ground truth, which works for classification tasks. For coding, however, this approach doesn't make sense: multiple correct solutions can exist, and optimising for code that “looks similar” to the reference doesn't guarantee it will pass the tests.

To address this, I implemented a new method, get_test_feedback, which provides the Reflector with actual test execution results and error messages. The test output becomes the primary signal for correctness, giving much more informative feedback than simple code comparison.

def get_test_feedback(self, predicted: str, ground_truth: str, test_list: List[str] = None) -> str:
  """
  Get detailed test execution feedback for the reflector.
  
  This method provides the reflector with actual test results and error messages,
  which is more informative than just comparing generated code with ground truth.
  The test output is the primary signal for correctness in code generation tasks.
  
  Args:
      predicted: Model's predicted code
      ground_truth: Ground truth code (reference only, not used for evaluation)
      test_list: List of test assertions to run
      
  Returns:
      str: Detailed feedback string with test execution results
  """
  if test_list is None:
    return "No test cases provided - cannot evaluate code."
  
  # Extract code from response if needed
  code = extract_code_from_response(predicted)
  
  # Execute code with tests
  result = execute_code_with_tests(code, test_list, timeout=self.timeout)
  
  # Build detailed feedback
  feedback_parts = []
  
  if result['success']:
    feedback_parts.append(f"✓ All {result['total']} tests PASSED")
    feedback_parts.append("\nTest cases executed successfully:")
    for i, test in enumerate(test_list, 1):
      feedback_parts.append(f"  {i}. {test} ✓")
  else:
    feedback_parts.append(f"✗ Tests FAILED: {result['passed']}/{result['total']} tests passed")
    
    if result['timeout']:
      feedback_parts.append("\n⏱ TIMEOUT: Code execution exceeded time limit")
    
    if result['errors']:
      feedback_parts.append("\n--- ERROR DETAILS ---")
      for error in result['errors']:
        feedback_parts.append(f"  • {error}")
    
    # Show which tests passed vs failed
    feedback_parts.append("\n--- TEST RESULTS ---")
    for i, test in enumerate(test_list, 1):
      # Check if this specific test appears in errors
      test_failed = any(f"Test {i}" in err for err in result.get('errors', []))
      status = "✗ FAILED" if test_failed else "✓ passed"
      feedback_parts.append(f"  {i}. {test} - {status}")
  
  # Add extracted code for reference
  feedback_parts.append("\n--- EXTRACTED CODE ---")
  feedback_parts.append(code)
  
  return "\n".join(feedback_parts)
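
The method above reads the keys success, passed, total, timeout, and errors from the execution result and looks for "Test {i}" in the error messages. The article doesn't show how execute_code_with_tests is implemented, so here is a minimal sketch of a helper that returns a result of that shape. The name `run_tests_sketch` and the exact error message format are assumptions. Each assertion runs in a fresh Python subprocess with a timeout, which limits runaway code but is not a real sandbox.

```python
import subprocess
import sys
from typing import Dict, List


def run_tests_sketch(code: str, test_list: List[str], timeout: float = 5.0) -> Dict:
    """Run each assertion against `code` in its own Python subprocess."""
    result = {
        "success": False,
        "passed": 0,
        "total": len(test_list),
        "timeout": False,
        "errors": [],
    }
    for i, test in enumerate(test_list, 1):
        program = f"{code}\n{test}\n"
        try:
            proc = subprocess.run(
                [sys.executable, "-c", program],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            result["timeout"] = True
            result["errors"].append(f"Test {i}: timed out after {timeout}s")
            continue
        if proc.returncode == 0:
            result["passed"] += 1
        else:
            # The last stderr line usually names the exception.
            lines = proc.stderr.strip().splitlines()
            detail = lines[-1] if lines else "unknown error"
            result["errors"].append(f"Test {i}: {detail}")
    result["success"] = result["total"] > 0 and result["passed"] == result["total"]
    return result


good_code = "def reverse_words(s):\n    return ' '.join(reversed(s.split()))"
bad_code = "def reverse_words(s):\n    return s"
tests = ['assert reverse_words("python program")==("program python")']

good_result = run_tests_sketch(good_code, tests)
bad_result = run_tests_sketch(bad_code, tests)
print(good_result["success"], bad_result["errors"])
```

For untrusted code, a container or a restricted user account would be a safer isolation boundary than a plain subprocess.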

Alongside this new method, I created a dedicated Reflector prompt tailored for code generation. Its focus is on test results, not line-by-line code comparison.

You are an expert code reviewer and educator. Your job is to analyze why generated code passed or failed test cases, and identify patterns that lead to correct or incorrect solutions.

**IMPORTANT: Test execution results are the PRIMARY signal for correctness.**
- The code is correct if and only if ALL tests pass
- Do NOT compare implementations line-by-line with the reference - different implementations can be equally correct
- Focus on understanding WHY tests passed or failed based on the code's logic

**Instructions:**
- First, examine the Test Execution Results to determine if the code is correct
- If tests FAILED: Analyze what caused the failure (syntax errors, logic errors, edge cases, wrong algorithm)
- If tests PASSED: Identify what the model did well that led to success
- The "Possible Implementation" is just ONE way to solve the problem - the model's approach may be different but equally valid
- Provide actionable insights for improving code generation in the future
- Tag bulletpoints as helpful/harmful/neutral based on whether they contributed to passing tests

Your output should be a json object, which contains the following fields:
  - reasoning: analyze the test results and the code's logic, explain why tests passed/failed
  - error_identification: if tests failed, what specific issue caused the failure? If tests passed, state "No errors - all tests passed"
  - root_cause_analysis: what underlying concept or pattern led to success or failure?
  - correct_approach: what coding strategy or pattern should be used for similar problems?
  - key_insight: what principle should be remembered for future code generation tasks?
  - bullet_tags: a list of json objects with bullet_id and tag for each bulletpoint

**Question:**
{}

**Model's Reasoning Trace:**
{}

**Model's Generated Code:**
{}

**Possible Implementation (Reference Only - NOT the only correct solution):**
{}

**Test Execution Results (PRIMARY SIGNAL):**
{}

**Part of Playbook that's used by the generator to answer the question:**
{}

**Answer in this exact JSON format:**
{{
  "reasoning": "[Analyze test results and code logic - why did tests pass or fail?]",
  "error_identification": "[What caused test failures? Or 'No errors - all tests passed']",
  "root_cause_analysis": "[What concept/pattern led to success or failure?]",
  "correct_approach": "[What coding strategy works for this type of problem?]",
  "key_insight": "[What principle should be remembered for future code generation?]",
  "bullet_tags": [
    {{"id": "code-00001", "tag": "helpful"}},
    {{"id": "code-00002", "tag": "harmful"}}
  ]
}}

This coding-specific Reflector is automatically used whenever the task name contains "coding".

Outcomes

Finally, I ran the prompt optimisation process on a dataset of 500 samples, split into train, test, and validation sets. This time, the results are much more promising: accuracy improved significantly from 71.1% to 87.1%. In this case, ACE clearly helped optimise the prompts and guide the model toward correct solutions.

Looking at the best playbook, it's quite extensive. Many of the most helpful patterns are general principles, such as:

Write the simplest correct, Pythonic solution first,
Treat test cases as the true specification,
Verify correctness before any further optimisation.

At the same time, the playbook also includes very specific guidance, for example, detailed instructions for tasks like GCD calculations.

Overall, this shows that ACE can effectively capture both high-level strategies and task-specific tips.

## GENERAL
## COMMON MISTAKES TO AVOID
[err-00003] helpful=5 harmful=0 :: Don't add unnecessary complexity to recursive algorithms. For example, in GCD implementations, explicit min/max logic or special cases for checking if a value equals 1 are redundant when using the standard Euclidean algorithm.
[err-00007] helpful=0 harmful=0 :: Don't assume problem constraints match your algorithm's mathematical prerequisites. For example, Fermat's Little Theorem for modular inverse requires a PRIME modulus - verify the problem guarantees this before using pow(a, p-2, p). If constraints aren't specified, choose more general algorithms.
## OTHERS
## CODE GENERATION PRINCIPLES
[cgp-00002] helpful=41 harmful=2 :: Prefer minimal, mathematically sound implementations over complex ones. Avoid adding unnecessary preprocessing logic (like min/max) or special case checks when the core algorithm naturally handles all scenarios.
[cgp-00012] helpful=91 harmful=2 :: Always ensure generated code is syntactically complete before finalizing output. Verify all opened brackets, braces, and parentheses are properly closed, and all statements are fully formed. Incomplete code generation (truncation mid-statement) causes syntax errors that prevent execution regardless of algorithmic correctness.
[cgp-00020] helpful=6 harmful=0 :: When a problem explicitly requires using lambda functions, integrate them naturally with Python's functional programming tools (map, filter, reduce, sorted with key parameter). Don't force lambda usage where it's awkward - these built-in functions are designed to work seamlessly with lambdas for operations like filtering, transformation, and counting.
[cgp-00024] helpful=140 harmful=2 :: Prioritize readable, Pythonic solutions using built-in functions over performance-optimized complex algorithms unless the problem explicitly requires optimization or involves large-scale data. A clear solution using bin(), str methods, or list comprehensions is often preferable to bit manipulation or manual loops. Optimize only when necessary.
[cgp-00047] helpful=56 harmful=2 :: Follow a correctness-first development strategy: (1) implement the straightforward algorithm that correctly solves the problem, even if it's not optimally efficient, (2) verify correctness with test cases, (3) only then consider optimization if performance is inadequate or the problem explicitly requires it. A correct O(n) solution is infinitely better than a buggy O(log n) attempt. Premature optimization often introduces errors in logic, especially for mathematical or algorithmic problems.
[cgp-00050] helpful=0 harmful=0 :: When multiple algorithmically correct solutions exist, prefer the one with better time/space complexity. A correct O(1) formula-based solution is superior to a correct O(n) iterative solution. However, only optimize if you can maintain correctness - a working O(n) solution is infinitely better than a buggy O(1) attempt. Verify the more efficient approach passes all tests before committing to it.
[cgp-00053] helpful=0 harmful=0 :: When implementing mathematical optimizations (especially for pair/combination counting), verify the optimized approach against test cases through manual calculation BEFORE coding. For each test case: (1) apply your mathematical insight to predict the output, (2) confirm it matches expected output, (3) only then implement. This catches errors in mathematical reasoning early, preventing bugs that are harder to debug in code than in arithmetic.
[cgp-00057] helpful=0 harmful=0 :: Avoid shadowing Python built-in names (dict, list, str, int, set, tuple, etc.) when naming variables or parameters. Use descriptive alternatives instead: 'd' or 'data' instead of 'dict', 'lst' or 'items' instead of 'list', 's' or 'text' instead of 'str'. Shadowing built-ins makes them inaccessible in that scope and reduces code clarity, even though it's syntactically valid.
[cgp-00059] helpful=2 harmful=0 :: Include defensive programming practices (input validation, bounds checking, type checking) even when not explicitly tested by visible test cases. For string indexing, validate index bounds before access. For numeric conversions, verify the input is a valid digit. For list operations, check for empty collections. These safeguards increase code robustness and prevent runtime errors on edge cases that may exist in hidden tests, demonstrating production-quality coding practices.
[cgp-00074] helpful=0 harmful=0 :: For operations involving powers of 2, prefer bitwise shift operators over arithmetic operations for clarity and efficiency: use left shift (1 << k) instead of 2**k or pow(2, k) for computing 2^k, use right shift (n >> k) instead of n // (2**k) for dividing by powers of 2. Bitwise operators make the bit-level intent explicit and are the idiomatic approach in bit manipulation contexts. This is especially valuable when working with bit positions and their corresponding values.
[cgp-00081] helpful=0 harmful=0 :: Before using standard library mathematical constants (math.pi, math.e, etc.), validate that test cases expect full-precision values by calculating one test output and comparing to expected. If expected outputs suggest truncated/simplified constants (pi=3.14, pi=3.1415, e=2.718), use hardcoded values matching test precision instead of library constants. Pattern: (1) identify mathematical constant needed, (2) calculate test output with standard constant, (3) if mismatch exists, derive the constant value that produces exact expected outputs, (4) use hardcoded value. Test case expectations override mathematical purity.
## COMMON PYTHON PATTERNS
[cpp-00010] helpful=23 harmful=0 :: For finding elements with maximum/minimum properties based on a criterion, use built-in max()/min() functions with the key parameter. Example: max(list_of_lists, key=len) finds the longest list. This is more Pythonic and readable than manual iteration with comparisons.
[cpp-00013] helpful=17 harmful=0 :: For counting or searching operations in Python collections (tuples, lists, strings), prioritize built-in methods: use .count() for occurrence counting, .index() for finding positions, .find() for strings. These are more reliable, efficient, and Pythonic than manual iteration with counters or loops.
[cpp-00014] helpful=3 harmful=0 :: When working with mixed-type data structures, use isinstance() for type checking to distinguish between different element types. Combine with len() checks to validate structure. Example: isinstance(item, list) and len(item) == 2 reliably identifies 2-element lists in mixed collections.
[cpp-00015] helpful=3 harmful=0 :: Use extend() instead of append() when adding multiple elements from a sequence to a list. extend() adds elements individually to the target list, while append() would add the entire sequence as a single nested element. Example: result.extend([value] * count) vs result.append([value] * count).
[cpp-00016] helpful=2 harmful=0 :: Use list multiplication ([value] * count) to efficiently repeat elements. This is more Pythonic and readable than manual loops for creating repeated elements. Combine with extend() for adding repeated elements to existing lists.
[cpp-00019] helpful=2 harmful=0 :: For counting elements matching a condition with lambda functions, use sum(map(lambda x: 1 if condition else 0, iterable)) as an elegant alternative to len(list(filter(lambda x: condition, iterable))). The sum(map()) approach maps elements to 1/0 and sums them, often more readable and efficient than filtering then counting.
[cpp-00026] helpful=14 harmful=0 :: For converting sequences (tuples, lists) of characters/strings into a single string, use str.join() method: ''.join(sequence) for character concatenation, or 'separator'.join(sequence) for joining with delimiters. This is the idiomatic Python approach - more readable and performant than manual loops with += or accumulation patterns.
[cpp-00030] helpful=1 harmful=0 :: For character classification with regex, use re.findall() with mutually exclusive character class patterns. For 'everything else' categories (like special characters), prefer negation patterns [^...] over enumerating specific characters - e.g., [^A-Za-z0-9] captures all non-alphanumeric characters comprehensively, avoiding the brittleness of lists like [,.!?]. Ensure patterns don't overlap to prevent double-counting.
[cpp-00031] helpful=2 harmful=0 :: For finding global maximum/minimum across nested iterables (list of tuples, list of lists, etc.), use nested generator expressions with built-in max()/min(): `max(element for container in containers for element in container)`. This pattern naturally flattens one level of nesting without creating intermediate lists, making it ideal for finding extremes across tuple records or sublists. More efficient and readable than manual iteration.
[cpp-00033] helpful=2 harmful=0 :: For index-based access to dictionary keys, use the pattern list(dict)[index] or list(dict.keys())[index]. This relies on Python 3.7+ guarantees that dictionaries maintain insertion order. Converting the dictionary to a list extracts keys in order, allowing standard list indexing. This is the idiomatic Python solution for mapping numeric indices to dictionary keys.
[cpp-00036] helpful=27 harmful=2 :: For mathematical operations (GCD, LCM, factorial, prime checking, trigonometry), check Python's math module FIRST before implementing algorithms manually. Built-in functions like math.gcd(), math.factorial(), math.isqrt() are well-tested, optimized, and reduce implementation errors. Pattern: (1) Understand the mathematical definition, (2) Check if math module provides the operation, (3) Use it directly or wrap it with problem-specific logic (e.g., is_coprime = math.gcd(a,b) == 1).
[cpp-00038] useful=0 dangerous=0 :: For checking if a quantity is an ideal sq., use math.isqrt() as an alternative of math.sqrt() to keep away from floating-point precision errors. Sample: b = math.isqrt(n); is_perfect_square = (b * b == n). The isqrt() perform returns the integer sq. root, and squaring it again permits actual integer comparability with out floating-point rounding points.
[cpp-00043] useful=0 dangerous=0 :: For character filtering issues (eradicating/preserving characters based mostly on membership standards), use the set+comprehension+be a part of sample: (1) Convert filter standards right into a set for O(1) lookup (char_set = set(filter_string)), (2) Use checklist comprehension or generator expression to filter (char for char in supply if char not in char_set), (3) Use ''.be a part of() to reconstruct the string. This sample is extra Pythonic, readable, and maintainable than handbook index manipulation or character counting approaches, whereas being equally right and environment friendly.
[cpp-00049] useful=0 dangerous=0 :: When returning tuples or lists with mixed numeric types (integers and floats), use appropriate division operators for each component: integer division (//) for whole number results, regular division (/) for decimal results. Example: for sum and average, return (n*(n+1)//2, n*(n+1)/2/n) to ensure sum is int and average is float. This prevents type mismatches in test assertions.
[cpp-00054] useful=0 dangerous=0 :: For digit-by-digit comparison or manipulation problems (digit distance, digit sum differences, etc.): Use the string conversion pattern: (1) Convert integers to strings with str(), (2) Use zfill(max_length) to pad shorter numbers with leading zeros for equal length, (3) Use zip() to pair corresponding digit positions, (4) Apply operations on paired digits and aggregate results. Example: str(num1).zfill(length) and str(num2).zfill(length) then zip() for pairing. This handles different-length numbers elegantly and provides clean positional access to digits.
[cpp-00056] useful=5 dangerous=0 :: For checking if all/any elements in a collection satisfy a condition, use Python's built-in all() or any() functions with generator expressions. Pattern: all(condition for item in iterable) for universal quantification (all must satisfy), any(condition for item in iterable) for existential quantification (at least one satisfies). This is more Pythonic, readable, and efficient than manual loops with flags. Common use cases: all(v == target for v in dict.values()) for value uniformity, any(x > threshold for x in list) for threshold checking, all(isinstance(x, int) for x in collection) for type validation.
[cpp-00060] useful=0 dangerous=0 :: For whitespace normalization (collapsing multiple spaces/whitespace into single spaces), use the split-join pattern: ' '.join(s.split()). The key insight: str.split() without arguments has special behavior - it splits on ANY whitespace (spaces, tabs, newlines) AND automatically removes empty strings from the result, naturally collapsing consecutive whitespace. Combined with ' '.join(), this creates a clean solution without regex imports. This pattern is more Pythonic and maintainable than regex alternatives like re.sub(r' +', ' ', s) for simple whitespace normalization tasks.
[cpp-00062] useful=0 dangerous=0 :: For complex number operations (polar/rectangular conversion, phase calculation, magnitude), use Python's cmath module functions as the first choice: cmath.polar(z) for conversion to polar form (returns magnitude and angle), cmath.rect(r, phi) for polar to rectangular, cmath.phase(z) for angle extraction. These built-in functions handle edge cases correctly (e.g., treating real numbers as complex with imaginary part 0) and are more reliable than manual trigonometric calculations. Pattern: import cmath → use appropriate function → handle the return type (often tuples).
[cpp-00064] useful=0 dangerous=0 :: For grouping elements by a key while preserving insertion order (critical for tie-breaking in subsequent sorting), use collections.OrderedDict with setdefault pattern: from collections import OrderedDict; grouped = OrderedDict(); for item in items: grouped.setdefault(key, []).append(value). While Python 3.7+ dicts maintain insertion order, OrderedDict makes the intent explicit and is safer when order matters for downstream operations like sorting by aggregated properties where equal values should maintain original encounter order.
[cpp-00065] useful=0 dangerous=0 :: For creating tuples with variable-length unpacked elements, use the * unpacking operator: (first, *middle_elements, last) unpacks a list/tuple into individual tuple positions. Example: (key, *values, count) where values is a list creates a tuple with key, all values unpacked as separate elements, and count at the end. This is essential when output format requires flattening nested structures into single-level tuples with variable element counts.
[cpp-00069] useful=0 dangerous=0 :: For regex pattern matching problems requiring full string matches, choose between re.search(), re.match(), and re.fullmatch() based on matching scope: re.match() matches from the start, re.search() finds patterns anywhere, re.fullmatch() requires the entire string to match. When full string matching is needed, either use re.fullmatch() with the pattern directly, or use re.search()/re.match() with explicit anchors (^ for start, $ for end). Example: re.fullmatch('a.*b', s) is equivalent to re.search('^a.*b$', s). Both approaches are valid - fullmatch() makes the intent explicit, while search() with anchors provides more flexibility. Always analyze test cases to determine if partial or full string matching is required.
[cpp-00072] useful=1 dangerous=0 :: For counting elements in an iterable that match a condition, use the generator expression pattern with sum(): sum(1 for x in iterable if condition). This provides optimal balance of readability, memory efficiency, and Pythonic style compared to alternatives like len([x for x in iterable if condition]) which creates an intermediate list. For character-level string operations, prefer built-in string methods (isdigit(), isalpha(), isalnum(), isupper(), islower()) over manual ASCII range comparisons - they handle edge cases correctly, improve readability, and are more maintainable.
[cpp-00073] useful=0 dangerous=0 :: For bit manipulation problems (finding set bits, MSB/LSB positions, bit counting), check Python's integer bit methods FIRST before implementing manual algorithms: bit_length() returns the number of bits needed to represent the integer (useful for MSB position), bit_count() counts set bits (Python 3.10+), as_integer_ratio() for rational representation. These built-in methods are optimized, handle edge cases (including 0), and often eliminate the need for manual bit-by-bit iteration. Pattern: understand what bit property you need, check if a built-in method provides it directly.
[cpp-00076] useful=0 dangerous=0 :: For grouping consecutive identical elements in a sequence, use itertools.groupby() as the canonical Python solution. Pattern: [list(group) for key, group in itertools.groupby(sequence)]. The groupby function returns (key, group_iterator) tuples where key is the element value and group is an iterator of consecutive occurrences. Convert each group iterator to a list to materialize results. Critical distinction: groupby groups CONSECUTIVE identical elements only - non-consecutive duplicates form separate groups, making it ideal for run-length encoding and consecutive duplicate detection without manual index tracking.
## HANDLING EDGE CASES
[hec-00021] useful=2 dangerous=0 :: When using mathematical operations like modulo (%), division, or exponentiation, verify the solution handles negative numbers correctly. For example, modulo operator works correctly for both positive and negative integers in Python (e.g., -18 % 2 == 0 for even number checking), but behavior may differ from expectations in other languages.
## ALGORITHM DESIGN
[ad-00001] useful=1 dangerous=2 :: For recursive GCD problems, use the Euclidean algorithm: base case is b == 0 (return a), recursive case is gcd(b, a % b). This handles all edge cases naturally including argument ordering, equal numbers, and divisibility.
[ad-00006] useful=0 dangerous=0 :: For bidirectional character swap problems (A↔B) using regex: use re.sub() with a callback function in a single pass. Pattern: (1) Create a character class matching all swap targets (e.g., r'[ _]'), (2) Implement callback that examines each match and returns its counterpart. This avoids ambiguity from sequential replacements where new characters become indistinguishable from originals.
[ad-00008] useful=0 dangerous=0 :: For modular arithmetic problems (nCr mod p, etc.), check if p must be prime. If p can be composite, avoid algorithms requiring modular inverse (like Fermat's Little Theorem). Instead, use approaches that avoid division entirely, such as Pascal's triangle with DP: C[j] = (C[j] + C[j-1]) % p, which works for ANY modulus.
[ad-00009] useful=0 dangerous=0 :: When division is needed in modular arithmetic: (1) If modulus is guaranteed prime, use Fermat's Little Theorem: a/b mod p = a * b^(p-2) mod p. (2) If modulus may be composite, use Extended Euclidean Algorithm for modular inverse, or better yet, redesign to avoid division (e.g., use recurrence relations like Pascal's triangle).
[ad-00017] useful=1 dangerous=0 :: For decoding problems with mixed encoded/non-encoded elements: (1) use type checking to distinguish element types, (2) validate encoded element structure, (3) handle each type appropriately in a single pass. Prioritize simple iterative approaches with explicit conditionals over complex comprehensions for better readability and maintainability.
[ad-00018] useful=4 dangerous=0 :: For max sum problems with non-adjacent element constraints: Use dynamic programming with recurrence dp[i] = max(arr[i] + dp[i-2], dp[i-1]), representing the choice to include current element (add to best from i-2) or exclude it (keep best from i-1). Handle edge cases: empty array returns 0, single element returns that element, initialize dp[0] = arr[0] and dp[1] = max(arr[0], arr[1]). Time: O(n), Space: O(n) or O(1) with optimization.
[ad-00023] useful=0 dangerous=0 :: For bit counting and parity checking problems: Multiple valid approaches exist with different trade-offs. (1) Pythonic approach: bin(n).count('1') - most readable and maintainable, (2) Bit manipulation: repeatedly use x & (x-1) to clear lowest set bit - better performance for large inputs, (3) XOR reduction for parity. Choose the Pythonic approach by default unless performance profiling shows it's a bottleneck.
[ad-00028] useful=1 dangerous=1 :: For bit toggling problems: (1) Create a mask with 1s at positions to be toggled, (2) Use XOR operation (n ^ mask) to toggle those bits. For variable-length numbers, use bit_length() to determine how many bits to process. Example: to toggle bits at positions 1,3,5 up to bit_length, generate mask = sum(1 << i for i in range(1, n.bit_length(), 2)).
[ad-00037] useful=0 dangerous=0 :: For element rearrangement/partitioning problems (move zeros to end, separate by condition, etc.): Use the filter+concatenate pattern: (1) filter elements into separate groups using list comprehensions [x for x in lst if condition], (2) count or collect each group separately, (3) concatenate groups in required order. This Pythonic approach using built-ins (list comprehension, count(), list multiplication) is often clearer and equally correct compared to in-place two-pointer algorithms, especially for small to medium datasets.
[ad-00039] useful=0 dangerous=0 :: For 'sum of two squares' problems (checking if n = a² + b²): Use single-loop optimization O(√n) instead of nested loops O(n). Iterate one variable from 0 to √n, calculate remainder (n - a²), and check if remainder is a perfect square using math.isqrt(). Return True immediately upon finding valid pair. This pattern: (1) reduces time complexity, (2) handles edge cases naturally (a=0, a=√n), (3) avoids floating-point errors with isqrt().
[ad-00041] useful=4 dangerous=1 :: For geometry and formula-based mathematical problems: Follow a structured approach: (1) Identify the correct mathematical formula from problem domain knowledge, (2) Implement the formula as a direct translation into code using math module functions, (3) Avoid reimplementing mathematical functions or constants that exist in standard libraries, (4) Verify the formula with at least one test case before coding. Direct formula translation leads to cleaner, more maintainable code with better numerical precision.
[ad-00042] useful=0 dangerous=0 :: For problems selecting elements from both ends of a collection (k smallest AND k largest), use approaches that handle overlap: (1) Index-based selection: iterate sorted collection and include elements where idx < k OR idx >= len-k, ensuring each element selected once, or (2) Set union: combine subsets with set(min_k + max_k) then sort to eliminate duplicates. Always consider edge cases where k*2 >= collection_size, as this ensures overlap between minimum and maximum selections. Avoid simple list concatenation which creates duplicates when ranges overlap.
[ad-00045] useful=0 dangerous=0 :: For 'find the n-th number with property X' problems: Use the iterative counting pattern: (1) implement a helper function to check if a number satisfies the property, (2) iterate through candidate numbers starting from an appropriate initial value, (3) maintain a counter for numbers that satisfy the property, (4) return the candidate when counter reaches n. This pattern works for prime numbers, perfect squares, numbers with specific factorization properties, etc. It's straightforward to implement correctly and optimize later if needed.
[ad-00046] useful=3 dangerous=0 :: For counting distinct prime factors: Use the standard factorization pattern: (1) iterate potential divisors from 2 to sqrt(n), (2) for each divisor that divides n, increment the distinct factor count, then divide n by that divisor repeatedly until it no longer divides (this ensures each prime is counted once regardless of its power), (3) after the loop, if n > 1, it's a remaining prime factor (count it), (4) optimize by checking divisor 2 separately, then only odd numbers. This correctly distinguishes between distinct primes and their multiplicities.
[ad-00048] useful=1 dangerous=0 :: For mathematical sequence problems (sum of first n numbers, arithmetic/geometric series, factorial-related), check if a closed-form formula exists before implementing iterative solutions. Common formulas: sum(1..n) = n*(n+1)/2, sum of arithmetic series = n*(first+last)/2, sum of geometric series = a*(r^n - 1)/(r-1). Formula-based solutions provide O(1) time complexity vs O(n) for loops, are less error-prone, and demonstrate mathematical insight. Always verify formula correctness with test cases.
[ad-00051] useful=1 dangerous=0 :: For pair-counting problems (count pairs satisfying a condition), look for mathematical properties that eliminate the need for explicit enumeration. Pattern: (1) Identify what makes a pair valid, (2) Find mathematical properties characterizing valid pairs (e.g., for XOR being odd: one number must be even, other odd), (3) Transform into a counting problem (count elements in each category), (4) Use combinatorics to compute result (e.g., odd_count × even_count). This reduces O(n²) pair enumeration to O(n) categorization + O(1) calculation.
[ad-00052] useful=0 dangerous=0 :: For problems involving XOR operations, leverage bit-level properties for optimization: (1) XOR result is odd ⟺ operands have different parities (one even, one odd), because parity depends on the least significant bit, (2) XOR is commutative and associative, allowing reordering, (3) x ^ x = 0 and x ^ 0 = x, useful for cancellation patterns. Analyze the specific XOR property relevant to your problem to find mathematical shortcuts that avoid brute force computation.
[ad-00061] useful=0 dangerous=0 :: For iterative mathematical sequence problems (sum/product of first n terms with specific properties): Use a structured 3-step approach: (1) Identify the formula for generating the k-th element (e.g., 2k-1 for odd numbers, 2k for even numbers, k² for squares), (2) Determine the operation to apply to each element (exponentiation, multiplication, transformation), (3) Aggregate with appropriate function (sum, product, max). Implement using generator expressions with built-ins: sum(operation(formula(i)) for i in range(start, n+1)). Ensure range bounds match the sequence indexing (1-indexed sequences need range(1, n+1)). This pattern provides clarity and correctness for problems where closed-form formulas don't exist or aren't obvious.
[ad-00066] useful=0 dangerous=0 :: For problems requiring grouping, counting, and sorting by aggregated properties: (1) Group elements using dict/OrderedDict with setdefault() or defaultdict, choosing OrderedDict when insertion order affects tie-breaking in sorting, (2) Sort groups using sorted() with key function based on aggregated metric (e.g., key=lambda x: len(x[1]) for count), (3) Transform output to match required format using appropriate unpacking/restructuring. This pattern handles 'group by X, sort by count of Y' problems systematically.
[ad-00068] useful=0 dangerous=0 :: For heap-based 'top k' problems, verify OUTPUT ORDERING against test cases, not just which elements to return. Key distinction: (1) heappop() from a min-heap produces ASCENDING order by the heap key, (2) heapq.nlargest(k, items, key=func) produces DESCENDING order by key, (3) heapq.nsmallest(k, items, key=func) produces ASCENDING order by key. When implementing heap solutions, trace through test cases to determine if results should be ordered ascending or descending by frequency/priority. If ordering is wrong, either reverse the final list or switch between nlargest/nsmallest, or use the heappop pattern. Test case output ordering is authoritative when the problem description doesn't explicitly specify.
[ad-00070] useful=0 dangerous=0 :: For 2D grid problems with adjacency or selection constraints (can't pick adjacent cells/rows/columns): Look for opportunities to reduce dimensionality before applying DP. If constraints allow picking at most one element per column (or row), pre-compute the optimal choice for each column/row (e.g., max of two rows in a column), transforming the problem into a 1D array. Then apply standard 1D DP patterns (like 'house robber' for non-adjacency). This dimensional reduction simplifies state space and makes complex grid problems tractable using well-known DP templates.
[ad-00071] useful=0 dangerous=0 :: Recognize the 'house robber' DP pattern as a fundamental template applicable beyond linear arrays: any problem involving selecting non-adjacent elements to maximize/minimize a sum can use the recurrence dp[i] = max(value[i] + dp[i-2], dp[i-1]). This pattern appears in: linear arrays with spacing constraints, grid problems (after dimensional reduction), tree problems (with parent-child constraints), and sequence optimization. When you see 'maximize sum' + 'can't pick adjacent', immediately consider this template.
[ad-00075] useful=0 dangerous=0 :: For finding the most significant bit (MSB) value or position: Use bit_length() method which returns the number of bits required to represent an integer. For MSB value, use the pattern: 1 << (n.bit_length() - 1), which leverages the relationship that the MSB at position k (0-indexed from right) has value 2^k. The bit_length() approach is cleaner than manual division loops or string conversion methods. Handle edge case: bit_length() returns 0 for n=0, so verify problem constraints or add explicit zero handling if needed.
## TEST CASE INTERPRETATION
[tci-00004] useful=0 dangerous=0 :: Multiple correct implementations can exist for the same problem. Focus on algorithmic correctness verified by passing tests, not on matching a specific reference implementation's style or structure.
[tci-00011] useful=123 dangerous=2 :: Extract the expected OUTPUT FORMAT from test cases, not just the logic. Check if the return should be a single value, tuple, list, or other structure, and ensure your solution matches this exact format.
[tci-00022] useful=0 dangerous=1 :: When analyzing test cases, check if ALL inputs map to the SAME output value or structure. If so, the solution may be trivial - simply return that constant output directly. Don't overcomplicate with unnecessary transformations (like list conversions) when a direct return statement satisfies all requirements. Example: if all test cases expect empty tuple output, return () regardless of input complexity.
[tci-00025] useful=5 dangerous=0 :: Before choosing an implementation approach, deeply understand the CORE REQUIREMENT from the problem description and test cases. For example, 'even parity' means 'even count of 1-bits', not a specific algorithm. Don't lock into a particular approach (like bit manipulation) if simpler alternatives (like string counting) satisfy the requirement equally well.
[tci-00027] useful=17 dangerous=0 :: When problem descriptions use ambiguous terminology (especially in bit manipulation: 'even bits', 'odd positions', etc.), work backward from test cases to discover the actual pattern. Manually trace through examples in their relevant representation (binary for bit problems) to determine the ground truth interpretation. Test cases are authoritative when terminology is unclear.
[tci-00032] useful=0 dangerous=0 :: When problems ask for 'maximum/minimum of all records/groups', clarify whether it means: (1) global extreme across all elements, or (2) per-group extremes returned as a collection. Test cases reveal the distinction: single value output indicates global extreme, list/tuple output suggests per-group analysis. This interpretation affects whether you flatten the structure or preserve grouping.
[tci-00034] useful=0 dangerous=0 :: For dictionary-related problems, carefully distinguish from test cases whether the expected output is: (1) a key (string/int), (2) a value, (3) a key-value pair (tuple), or (4) a collection of any of these. The output type determines whether you need dict.keys(), dict.values(), dict.items(), or direct indexing into converted structures. Test case outputs reveal the exact format required.
[tci-00035] useful=3 dangerous=0 :: When function names or problem descriptions suggest specific behavior (e.g., 'parallelogram_perimeter' implying geometric formula 2*(a+b)), but test cases produce outputs inconsistent with that expectation, trust the test cases as the authoritative specification. Reverse-engineer the actual formula by calculating what operation on inputs produces the given outputs, then verify this derived pattern against ALL test cases before implementing. Test case expectations override semantic meanings and domain knowledge.
[tci-00040] useful=0 dangerous=0 :: Test results are the primary signal of correctness, not line-by-line comparison with reference implementations. If your solution passes all tests with better time complexity (e.g., O(√n) vs O(n)), it's not just correct but algorithmically superior. Different approaches can be equally or more valid - focus on correctness verification through tests, not on matching specific implementation styles.
[tci-00044] useful=2 dangerous=0 :: When encountering undefined or domain-specific mathematical terms (like 'smart number', 'lucky number', etc.), treat test cases as the authoritative specification. Systematically analyze test case outputs to reverse-engineer the mathematical definition: (1) examine the numerical properties of output values (factorization, divisors, digits, etc.), (2) look for patterns or common characteristics across all outputs, (3) formulate a hypothesis about the defining property, (4) verify the hypothesis against ALL test cases. The test cases encode the complete definition when the problem statement is ambiguous.
[tci-00055] useful=2 dangerous=0 :: When problem terminology is totally ambiguous or undefined (like 'digit distance' which could have multiple interpretations), systematically trace through EACH test case manually to identify the actual pattern: (1) Work through inputs and outputs step-by-step in the relevant representation, (2) Formulate a hypothesis about what operation produces these outputs, (3) Verify the hypothesis against ALL remaining test cases, (4) Implement the pattern that satisfies all tests. The test-derived pattern is the correct specification, regardless of what the terminology might suggest in other contexts.
[tci-00058] useful=0 dangerous=0 :: Multiple algorithmically different solutions can be equally valid if they satisfy all test cases. When deriving requirements from ambiguous specifications, use systematic hypothesis testing: (1) analyze each test case to understand input-output relationships, (2) formulate a hypothesis about the underlying rule, (3) validate the hypothesis against ALL test cases, (4) implement the pattern that passes all tests. Your solution is correct by definition if it satisfies all test requirements, even if it differs structurally from reference implementations or uses a different interpretation of ambiguous terms.
[tci-00063] useful=0 dangerous=0 :: In Python, parentheses alone don't create tuples - distinguish between ('value') which is just a string 'value' (parentheses are for grouping/precedence), and ('value',) which is a 1-element tuple (trailing comma required). When analyzing test assertions like assert func()==('Matched!'), recognize this expects a plain string, not a tuple. Only ('Matched!',) with a trailing comma or (a, b) with multiple elements create tuples. This syntax nuance is critical for matching expected return types exactly.
[tci-00067] useful=0 dangerous=0 :: When test cases show complex output structures (tuples with variable-length unpacked elements, nested aggregations), analyze the EXACT structure before coding: (1) Count elements in output tuples/lists, (2) Identify which elements are aggregated vs individual, (3) Determine if nested structures are flattened (unpacked) or preserved, (4) Check if ordering within groups matters. Use this structural analysis to choose appropriate Python constructs (* unpacking, list flattening, tuple construction patterns) that match the expected format precisely.
[tci-00077] useful=0 dangerous=0 :: For counting/aggregation problems involving nested structures (lists of lists, trees, nested dictionaries), when the problem asks to 'count elements' without specifying the level, use test cases to determine the counting scope: (1) Check if test outputs suggest counting only immediate/top-level children (e.g., len(outer_list)) vs recursive counting of all nested elements, (2) Trace through at least one test case with nested structures to see which interpretation produces the expected output, (3) The simplest interpretation (top-level counting) is usually correct unless test cases prove otherwise. Example: 'count lists in [[1,2], [3], [[4,5]]]' could mean 3 (top-level) or 4 (recursive) - test outputs reveal which is expected.
[tci-00078] useful=0 dangerous=0 :: For mathematical problems with infinitely many valid solutions (linear Diophantine equations, modular arithmetic, geometric constructions, etc.), recognize that tests expect ONE PARTICULAR solution, not just any mathematically correct answer. Work through test cases to identify the selection criteria (e.g., smallest non-negative values, specific ordering, canonical form). When choosing algorithms, prefer approaches that naturally produce the expected solution pattern (e.g., iterative search from x=0 upward for smallest non-negative x) over sophisticated algorithms (e.g., Extended Euclidean Algorithm) that require additional adjustment logic to match test expectations. The mathematically elegant solution isn't always the right one for passing tests.
[tci-00079] useful=0 dangerous=0 :: For problems involving mathematical constants (pi, e, sqrt(2), etc.), verify that test case expected outputs match calculations using standard library constants (math.pi, math.e). Calculate at least one test case output manually using the standard constant and compare to the expected value. If there's a mismatch in precision (e.g., your 942.477 vs expected 942.45), the test cases likely expect a simplified/truncated constant value (like pi=3.14 or pi=3.1415) rather than full precision. Check reference implementations for hardcoded constant values and use those exact values to match test expectations, even if they're less mathematically accurate.
## DEBUGGING STRATEGIES
[ds-00005] useful=110 dangerous=2 :: Before generating code, mentally trace through the logic against test cases to verify correctness. This helps catch logical errors early and builds confidence in the solution approach.
[ds-00029] useful=0 dangerous=0 :: For bit manipulation problems with unclear position indexing, test multiple interpretations systematically: (1) 0-indexed vs 1-indexed, (2) counting from right vs left, (3) 'even/odd' referring to position vs bit value. Work through all test cases manually in binary to validate each hypothesis before implementing. The interpretation that satisfies all test cases is correct.
[ds-00080] useful=0 dangerous=0 :: During reasoning phase, manually calculate expected outputs for at least one test case using your proposed approach and compare against the actual expected output. For numerical problems, verify precision matches exactly - discrepancies like 942.477 vs 942.45 indicate constant precision mismatches (e.g., using math.pi instead of a truncated value). This early validation catches precision issues, wrong formulas, and constant value problems before code generation.

These results show that ACE can significantly improve performance on complex tasks like code generation.
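To make a few of the playbook entries above more concrete, here is a minimal Python sketch of three of them: the non-adjacent maximum sum recurrence ([ad-00018]), distinct prime factor counting ([ad-00046]) and split-join whitespace normalisation ([cpp-00060]). The function names are my own and purely illustrative; they don't come from the ACE framework or from the benchmark's reference solutions.

```python
def max_non_adjacent_sum(arr):
    """[ad-00018]: dp[i] = max(arr[i] + dp[i-2], dp[i-1]), kept in O(1) space."""
    if not arr:
        return 0          # edge case: empty array
    if len(arr) == 1:
        return arr[0]     # edge case: single element
    # prev2 plays the role of dp[i-2], prev1 the role of dp[i-1]
    prev2, prev1 = arr[0], max(arr[0], arr[1])
    for value in arr[2:]:
        # either take the current element (plus best from i-2) or skip it
        prev2, prev1 = prev1, max(value + prev2, prev1)
    return prev1


def count_distinct_prime_factors(n):
    """[ad-00046]: count each prime once, regardless of its power."""
    count = 0
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            count += 1
            while n % divisor == 0:   # strip every power of this prime
                n //= divisor
        # check 2 separately, then only odd candidates
        divisor += 1 if divisor == 2 else 2
    if n > 1:
        count += 1                    # whatever is left is itself prime
    return count


def normalize_whitespace(s):
    """[cpp-00060]: split() without arguments collapses any whitespace run."""
    return " ".join(s.split())
```

For instance, `max_non_adjacent_sum([3, 2, 7, 10])` picks 3 and 10, and `count_distinct_prime_factors(12)` counts the primes 2 and 3. Each bullet is specific enough to translate into a few lines of code, which fits the atomised, reusable structure of the playbook.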

Summary

In this article, we’ve explored a lot about context engineering and the ACE approach, so let’s briefly recap the key takeaways:

Context engineering has emerged as a critical field because it allows us to improve LLM performance without lengthy and costly fine-tuning.
ACE (Agentic Context Engineering) is one of the latest approaches to prompt optimisation, leveraging detailed playbooks with atomised bullet points that include both instructions and metadata.
As our examples showed, prompt optimisation is not a silver bullet. It doesn’t improve performance in every case. According to the authors, ACE is most effective for agentic workflows or highly specialised domains. In our experiments, it made a clear difference in code generation, but had limited impact on banking intent classification.
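Since each playbook bullet carries both an instruction and metadata, it is easy to inspect a playbook programmatically. Below is a minimal sketch, assuming the line layout shown in the playbook listing in this article (`[id] useful=N dangerous=N :: text` under `## SECTION` headers). The `Bullet` class, the `parse_playbook` function and the "net score" ranking are hypothetical helpers of mine, not part of the ACE framework, and the sample texts are abbreviated.

```python
import re
from dataclasses import dataclass

# Line layout assumed from the playbook listing in this article:
#   [<id>] useful=<n> dangerous=<n> :: <instruction text>
BULLET_RE = re.compile(
    r"^\[(?P<bullet_id>[a-z]+-\d+)\]\s+"
    r"useful=(?P<useful>\d+)\s+dangerous=(?P<dangerous>\d+)\s+::\s+"
    r"(?P<text>.+)$"
)


@dataclass
class Bullet:
    section: str
    bullet_id: str
    useful: int
    dangerous: int
    text: str

    @property
    def net_score(self) -> int:
        # Illustrative metric only: useful votes minus dangerous votes.
        return self.useful - self.dangerous


def parse_playbook(playbook: str) -> list[Bullet]:
    """Collect bullets, remembering the most recent '## SECTION' header."""
    bullets = []
    section = ""
    for raw_line in playbook.splitlines():
        line = raw_line.strip()
        if line.startswith("## "):
            section = line[3:].strip()
            continue
        match = BULLET_RE.match(line)
        if match:  # lines that are not bullets are silently skipped
            bullets.append(
                Bullet(
                    section=section,
                    bullet_id=match["bullet_id"],
                    useful=int(match["useful"]),
                    dangerous=int(match["dangerous"]),
                    text=match["text"],
                )
            )
    return bullets


# Abbreviated sample based on three bullets from the listing.
SAMPLE = """\
## ALGORITHM DESIGN
[ad-00001] useful=1 dangerous=2 :: For recursive GCD problems, use the Euclidean algorithm.
## TEST CASE INTERPRETATION
[tci-00011] useful=123 dangerous=2 :: Extract the expected OUTPUT FORMAT from test cases.
## DEBUGGING STRATEGIES
[ds-00005] useful=110 dangerous=2 :: Before generating code, mentally trace through the logic.
"""

bullets = parse_playbook(SAMPLE)
ranked = sorted(bullets, key=lambda b: b.net_score, reverse=True)
for bullet in ranked:
    print(bullet.bullet_id, bullet.section, bullet.net_score)
```

Ranking by counters like this is one simple way to see which bullets the model relied on most. In this sample, the output-format bullet ([tci-00011]) and the mental-trace bullet ([ds-00005]) stand out.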

The main takeaway for me is that prompt optimisation won’t solve your task automatically. You still need a holistic understanding of what information the LLM and agents have during the optimisation process and how best to structure and refine it. Context matters, and thoughtful engineering of that context is what makes approaches like ACE effective.

Thank you for reading. I hope this article was insightful. Remember Einstein’s advice: “The important thing is not to stop questioning. Curiosity has its own reason for existing.” May your curiosity lead you to your next great insight.

Reference

This article was based on the paper and research by Zhang et al., published in 2025, “Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models”.
