Tool Masking: The Layer MCP Forgot

By Frank Wittkampf & Lucas Vieira

MCP and similar services were a breakthrough in AI connectivity¹: a huge leap forward when we need to expose services quickly and almost effortlessly to an LLM. But therein also lies the problem: this is bottom-up thinking. “Hey, why don’t we expose everything, everywhere, all at once?”

Raw exposure of APIs comes at a cost: every tool surface pushed straight into an agent bloats prompts, inflates choice entropy, and drags down execution quality. A well-designed AI agent starts use-case down rather than tech-up. If you were to design your LLM call from scratch, you’d never provide the full unfiltered surface of an API. Doing so adds unnecessary tokens, unrelated information, and more failure modes, and generally degrades quality. Empirically, broad tool definitions consume large token budgets: e.g., one 28-parameter tool ≈1,633 tokens; 37 tools ≈6,218 tokens, which degrades accuracy and increases latency/cost⁶.

In our work, building enterprise-scale AI solutions for the largest tech companies (MSFT, AWS, Databricks, and many others), where we send millions of tokens a minute to our AI providers, these nuances matter. If you optimize tool exposure, you optimize your LLM execution context, which means you are improving quality, accuracy, consistency, cost, and latency, all at the same time.

This article defines the novel concept of tool masking. Many people will have already implicitly experimented with this, but it’s a topic not well explored in online publications so far. Tool masking is a crucial, and missing, layer in the current agentic stack. A tool mask shapes what the model actually sees, both before and after execution, so your AI agent isn’t just connected but actually enabled.

So, rounding up our intro: using raw MCP pollutes your LLM execution. How do you optimize the model-facing surface of a tool for a given agent or task? You use tool masking. A simple concept, but as always, the devil is in the details.

What MCP does well, and what it doesn’t

MCP gets a lot right. It’s an open protocol, and Anthropic refers to it as the “USB-C for AI”: a way to connect LLM apps with external tools and data without friction¹. It nails the basics: standardizing how tools, resources, and prompts are described, discovered, and invoked, whether you’re using JSON-RPC over stdio or streaming over HTTP². Auth is handled cleanly at the transport layer³. That’s why you see it landing everywhere from OpenAI’s Agents SDK to Copilot in VS Code, all the way to AWS guidance⁴. MCP is real, and adoption is strong.

But it’s equally important to see what MCP doesn’t do, and that’s where the gaps show up. MCP’s focus is context exchange. It doesn’t care how your app or agent actually uses the context you pass in, or how you manage and shape that context per agent or task. It exposes the full tool surface, but doesn’t shape or filter it for quality or relevance. Per the architecture docs, MCP “focuses solely on the protocol for context exchange. It does not dictate how AI applications use LLMs or manage the provided context².” You get a discoverable catalog and schemas, but no built-in mechanism in the protocol for optimizing how the context is presented.

Note: Some SDKs now add optional filtering. For example, OpenAI’s Agents SDK supports static/dynamic MCP tool filtering⁵. This is a step in the right direction, but it still leaves too much on the table.

1. Anthropic MCP overview
2. Model Context Protocol — Architecture
3. MCP Spec — Authorization
4. OpenAI Agents SDK (MCP); VS Code MCP GA; AWS — Unlocking MCP
5. GitHub — PR #861 (MCP tool filtering)
6. Medium — How many tools/functions can an AI Agent have?

The Problem in Practice

To illustrate this, let’s take the (unofficial) Yahoo Finance API. Like many APIs, it returns a massive JSON object full of dozens of metrics. Powerful for analysis, but overwhelming when your agent simply needs to retrieve one or two key figures. Here’s a snippet of what the agent might receive when calling the API:

yahooResponse = {
  "quoteResponse": {
    "result": [
      {
        "symbol": "AAPL",
        "regularMarketPrice": 172.19,
        "marketCap": ...,
        # … roughly 100 other fields, including:
        # regularMarketChangePercent, currency, marketState, exchange,
        # fiftyTwoWeekHigh/Low, trailingPE, forwardPE, earningsDate,
        # incomeStatementHistory, financialData (with revenue, grossMargins, etc.),
        # summaryProfile, etc.
      }
    ]
  }
}

For an agent, getting 100 fields of data, on top of other tool output, is overwhelming: irrelevant data, bloated prompts, and wasted tokens. Accuracy goes down as tool counts and schema sizes grow; researchers have shown that as the toolset expands, retrieval and invocation reliability drops sharply¹, and inputting every tool into the LLM quickly becomes impractical due to context length and latency constraints². How severe this is depends on the model, but as models grow more capable, tool demands are increasing as well. Even state-of-the-art models still struggle to effectively select tools in large tool libraries³.

The problem is not limited to tool output. The more important problem is the API input schema. Going back to our example, for the Yahoo Finance API, you can request any combination of modules: assetProfile, financialData, price, earningsTrend, and many more. If you expose this schema to your agent raw, through MCP (or FastAPI, etc.), you’ve just massively polluted your agent context. At massive scale, this becomes even more challenging; recent work notes that LLMs operating on very large tool graphs require new approaches such as structured scoping or graph-based methods⁴.

Tool definitions consume tokens in every conversation turn; empirical benchmarks show that large, multi-parameter tools and big toolsets quickly dominate your prompt budget⁵. Without a filtering or rewriting layer, the accuracy and efficiency of your AI agent degrade⁶.

1. Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval. “Identifying the most relevant tools … becomes a key bottleneck as the toolset size grows, hindering reliable tool utilization.”
2. Towards Completeness-Oriented Tool Retrieval for LLMs. “…it is impractical to input all tools into LLMs due to length limitations and latency constraints.”
3. Deciding Whether to Use Tools and Which to Use. “…the majority [of LLMs] still struggle to effectively select tools…”
4. ToolNet: Connecting LLMs with Massive Tools via Tool Graph. “It remains challenging for LLMs to operate on a library of massive tools,” motivating graph-based scoping.
5. How many tools/functions can an AI Agent have? (Feb 2025). Reports that a tool with 28 params consumed 1,633 tokens; a set of 37 tools consumed 6,218 tokens.
6. Benchmarking Tool Retrieval for LLMs (ToolRet). Large-scale benchmark showing tool retrieval is hard even for strong IR models.

Here’s a sample tool definition if you expose the raw API without creating a custom tool for it (this has been shortened for readability):

yahooFinanceTool = {
  "name": "yahoo.quote_summary",
  "parameters": {
    "type": "object",
    "properties": {
      "symbol": {"type": "string"},
      "modules": {
        "type": "array",
        "items": {"type": "string"},
        "description": (
          "Select any of: assetProfile, financialData, price, "
          "earningsHistory, incomeStatementHistory, balanceSheetHistory, "
          "cashflowStatementHistory, summaryDetail, quoteType, "
          "recommendationTrend, secFilings, fundOwnership, "
          "… (and dozens more modules)"
        )
      },
      # … plus more parameters: region, lang, overrides, filters, etc.
    },
    "required": ["symbol"]
  }
}

The Fix

Here’s the real unlock: with tool masking, you’re in control of the surface you present to your agent. You aren’t forced to expose the entire API, and you don’t need to recode your integrations for every new use case.

Want the agent to only ever fetch the latest stock quote? Build a mask that presents just that action as a simple tool.

Need to support several distinct tasks, like fetching a quote, extracting only revenue, or maybe toggling between price types? You can design multiple narrow tools, each with its own mask on top of the same underlying tool handler.

Or, you can combine related actions into a single tool and give the agent an explicit toggle or enum, whatever interface fits the agent’s context and task.

Wouldn’t it be nicer if the agent only saw very simple, purpose-built tools, like these?

# Simple Tool 1: Get latest price and market cap
fetchPriceAndCap = {
  "name": "get_price_and_marketcap",
  "parameters": {
    "type": "object",
    "properties": {
      "symbol": {"type": "string"}
    },
    "required": ["symbol"]
  }
}

# Simple Tool 2: Get company revenue only
fetchRevenue = {
  "name": "get_revenue",
  "parameters": {
    "type": "object",
    "properties": {
      "symbol": {"type": "string"}
    },
    "required": ["symbol"]
  }
}

The underlying code uses the same handler. No need to duplicate logic or force the agent to reason about the full module surface. Just different masks for different jobs: no module lists, no bloat, no recoding*.

* This aligns with guidance to use only essential tools, minimize parameters, and, where possible, activate tools dynamically for a given interaction.
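
As a minimal sketch of “one handler, many masks”, the two simple tools could be thin wrappers around a single handler. Everything here is illustrative: `yahoo_quote_summary` is a hypothetical stub that returns canned data rather than calling the real API, the `totalRevenue` key is an assumed field name, and the market cap and revenue numbers are placeholders.

```python
from typing import Any, Dict, List


def yahoo_quote_summary(symbol: str, modules: List[str]) -> Dict[str, Any]:
    """Hypothetical stand-in for the broad handler; returns canned data."""
    canned = {
        # Placeholder values, not real market data.
        "price": {"regularMarketPrice": 172.19, "marketCap": 1_000_000},
        "financialData": {"totalRevenue": 2_000_000},
    }
    return {"symbol": symbol, **{m: canned[m] for m in modules}}


def get_price_and_marketcap(symbol: str) -> Dict[str, Any]:
    """Mask 1: the agent supplies only a symbol; the module list is fixed."""
    raw = yahoo_quote_summary(symbol, modules=["price"])
    return {
        "symbol": raw["symbol"],
        "price": raw["price"]["regularMarketPrice"],
        "market_cap": raw["price"]["marketCap"],
    }


def get_revenue(symbol: str) -> Dict[str, Any]:
    """Mask 2: same handler, different fixed module and output projection."""
    raw = yahoo_quote_summary(symbol, modules=["financialData"])
    return {"symbol": raw["symbol"], "revenue": raw["financialData"]["totalRevenue"]}
```

Both wrappers delegate to the same handler, so adding a third narrow tool means adding a wrapper and leaving the integration code alone.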

The power of tool masking

The point: tool masking is not just about hiding complexity. It’s about designing the right agent-facing surface for the job at hand.

You can expose an API as one tool or many.
You can tune what’s required, optional, or even fixed (hard-coded values).
You can present different masks to different agents, based on role, context, or business logic.
You can refactor the surface at any time, without rewriting the handler or backend code.

This isn’t just technical hygiene; it’s a strategic design decision. It lets you ship cleaner, leaner, more robust agents that do exactly what’s needed, no more and no less.

This is the power of tool masking:

Start with a broad, messy API surface
Define as many narrow masks as needed, one for each agent use case
Present only what matters (and nothing more) to the model

The result? Smaller prompts, faster responses, fewer misfires, and agents that get it right, every time. Why does this matter so much, especially at enterprise scale?

Choice entropy: When the model is overloaded with options, it’s more likely to misfire or select the wrong fields
Performance: Extra tokens mean higher cost, more latency, lower performance, less accuracy, less consistency
Enterprise scale: When you’re sending millions of tokens per minute, small inefficiencies quickly add up. Precision matters. Fault tolerance is lower. (Large tool outputs may also echo through histories and balloon spend)¹

1. Everything Wrong with MCP
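
To make the token argument concrete, here is a rough sketch that compares the serialized size of a raw tool definition with a masked one. The four-characters-per-token ratio is only a crude rule of thumb assumed for illustration, not a real tokenizer, and the raw definition is the shortened one from this article, so the real gap would be larger.

```python
import json

# Raw surface: the broad tool definition, shortened as in the article.
RAW_TOOL = {
    "name": "yahoo.quote_summary",
    "parameters": {
        "type": "object",
        "properties": {
            "symbol": {"type": "string"},
            "modules": {
                "type": "array",
                "items": {"type": "string"},
                "description": (
                    "Select any of: assetProfile, financialData, price, "
                    "earningsHistory, incomeStatementHistory, balanceSheetHistory, "
                    "cashflowStatementHistory, summaryDetail, quoteType, "
                    "recommendationTrend, secFilings, fundOwnership"
                ),
            },
        },
        "required": ["symbol"],
    },
}

# Masked surface: one purpose, one parameter.
MASKED_TOOL = {
    "name": "get_price_and_marketcap",
    "parameters": {
        "type": "object",
        "properties": {"symbol": {"type": "string"}},
        "required": ["symbol"],
    },
}

CHARS_PER_TOKEN = 4  # assumption: crude heuristic, not a real tokenizer


def rough_token_estimate(tool: dict) -> int:
    """Estimate tokens from the length of the serialized definition."""
    return len(json.dumps(tool)) // CHARS_PER_TOKEN


def tokens_saved(turns: int) -> int:
    """Definitions are resent on every turn, so savings scale with turns."""
    return turns * (rough_token_estimate(RAW_TOOL) - rough_token_estimate(MASKED_TOOL))


print(rough_token_estimate(RAW_TOOL), rough_token_estimate(MASKED_TOOL), tokens_saved(10))
```

For real measurements, swap the heuristic for the tokenizer of the model you actually call.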

The Solution

At the heart of robust tool masking is a clean separation of concerns.

First, you have the tool handler: this is the raw integration, whether it’s a third-party API, internal service, or direct function call. The handler’s job is simply to expose the full capability surface, with all its power and complexity.

Next comes the tool mask. The mask defines the model-facing interface: a narrow schema, tailored input and output, and sensible defaults for the agent’s use case or role. This is where the broad, messy surface of the underlying tool is slimmed down to exactly what’s needed (and nothing more).

In between sits the tooling service. This is the mediator that applies the mask, validates the input, translates agent requests into handler calls, and validates or sanitizes responses before returning them to the model.
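
The three roles can be sketched in a few lines. This is an illustration under stated assumptions rather than our production service: `fake_quote_api` is a hypothetical handler stub with made-up return values, and the mask keys `allowed_inputs`, `fixed_inputs`, and `output_fields` are names invented for this sketch.

```python
from typing import Any, Callable, Dict, List


def fake_quote_api(symbol: str, modules: List[str]) -> Dict[str, Any]:
    """Hypothetical handler stub; a real one would call the upstream API."""
    return {
        "symbol": symbol,
        "regularMarketPrice": 172.19,
        "currency": "USD",
        "marketState": "REGULAR",
        "modules": modules,
    }


# Handler registry: the raw integrations, keyed by name.
HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {"yahoo_api": fake_quote_api}

# Tool mask: plain config describing the model-facing surface.
STOCK_PRICE_MASK: Dict[str, Any] = {
    "tool_name": "stock_price",
    "handler_name": "yahoo_api",
    "allowed_inputs": ["symbol"],            # what the model may supply
    "fixed_inputs": {"modules": ["price"]},  # what the system pins
    "output_fields": {                       # model-facing key -> handler key
        "symbol": "symbol",
        "market_price": "regularMarketPrice",
        "currency": "currency",
    },
}


def call_tool(mask: Dict[str, Any], agent_input: Dict[str, Any]) -> Dict[str, Any]:
    """Tooling service: validate, translate, call the handler, sanitize."""
    # 1. Validate: reject anything outside the masked surface.
    if set(agent_input) != set(mask["allowed_inputs"]):
        expected = ", ".join(mask["allowed_inputs"])
        return {"success": False, "error": "Expected exactly these arguments: " + expected}
    # 2. Translate: fixed values are merged last, so the model cannot override them.
    handler_input = {**agent_input, **mask["fixed_inputs"]}
    # 3. Call the underlying handler.
    raw = HANDLERS[mask["handler_name"]](**handler_input)
    # 4. Sanitize: return only the fields the mask exposes.
    data = {key: raw[source] for key, source in mask["output_fields"].items()}
    return {"success": True, "data": data}
```

Calling `call_tool(STOCK_PRICE_MASK, {"symbol": "AAPL"})` returns only the three masked fields, while an attempt to pass `modules` is rejected before the handler is reached.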

[Figure: High-Level Overview, Tool Mask]

Ideally, you store and manage tool masks in the same place that you store all your other agent/system prompts, because, in practice, presenting a tool to an LLM is a form of prompt engineering.

Let’s review an example of an actual tool mask. Our definition of a tool mask has evolved over the past few years¹, starting as a simple filter and growing into a full enterprise service used by the largest tech companies in the world.

1. Initially (in 2023), we started with simple input/output adapters, but as we worked across multiple companies and many use cases, it evolved into a full prompt engineering surface.

Tool mask example

tool_name: stock_price
description: Retrieve the latest market price for a stock symbol via Yahoo Finance.

handler_name: yahoo_api

handler_input_template:
  session_id: "{{ context.session_id }}"
  symbol: "{{ input.symbol }}"
  modules:
    - price

output_template: |
  {
    "data": {
      "symbol": "{{ result.quoteResponse.result[0].symbol }}",
      "market_price": "{{ result.quoteResponse.result[0].regularMarketPrice }}",
      "currency": "{{ result.quoteResponse.result[0].currency }}"
    }
  }

input_schema:
  type: object
  properties:
    symbol:
      type: string
      description: "The stock ticker symbol (e.g., AAPL, MSFT)"
  required: ["symbol"]
custom_validation_template: |
  {% set symbol_str = input.symbol | string %}
  {% if symbol_str | length > 6 or symbol_str != symbol_str.upper() %}
      { "success": false, "error": "Symbol must be 1–6 uppercase letters." }
  {% endif %}

The example above should speak for itself, but let’s highlight a few characteristics:

The mask translates the input (provided by the AI agent) into a handler_input (what the API will receive)
The handler for this particular tool is an API; it could just as well have been any other service. The service could have other masks on top of it, which pull other data out of the same API
The mask allows for Jinja*. This allows for powerful prompt engineering
A custom validation is very powerful if you want to add specific nudges that steer the AI agent to self-correct its mistakes
The session_id and the module are hard-coded into the template. The AI agent isn’t able to modify these

*Note: if you’re doing this in a Node.js environment, EJS is great for this as well.
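
To show what a validation nudge can look like outside a template, here is a hedged Python sketch of the same symbol check. It is not the tooling service’s actual code: `validate_symbol` and the `hint` field are assumptions made for this example, and the regular expression is stricter than the template’s length and uppercase test because it also rejects digits and empty strings.

```python
import re
from typing import Any, Dict, Optional

# Assumption for this sketch: a ticker is 1 to 6 uppercase ASCII letters.
SYMBOL_PATTERN = re.compile(r"[A-Z]{1,6}")


def validate_symbol(agent_input: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return None when the input is valid, else an error the agent can act on."""
    symbol = str(agent_input.get("symbol", ""))
    if SYMBOL_PATTERN.fullmatch(symbol):
        return None
    return {
        "success": False,
        "error": "Symbol must be 1–6 uppercase letters.",
        # The hint tells the model what it sent and how to retry.
        "hint": f"Received {symbol!r}. Retry with the ticker in uppercase, e.g. AAPL.",
    }
```

Because the error names the rule and echoes the rejected value, the agent has what it needs to correct the call on its next turn instead of retrying blindly.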

With this architecture, you can flexibly add, remove, or modify tool masks without ever touching the underlying handler or agent code. Tool masking becomes a “configurable prompt engineering” layer, supporting rapid iteration, testing, and robust, role- or use-case-specific agent behavior.

Hey, it’s almost as if a tool has become a prompt…

The Overlooked Prompt Engineering Surface

Tools are prompts. It’s interesting that in today’s AI blogs, there’s little mention of it. An LLM receives text and then generates text. Tool names, tool descriptions, and their input schemas are part of the incoming text. Tools are prompts, just with a special flavor.

When your code makes an LLM call, the model reads the full prompt input, and then decides whether and how to call a tool¹²³. If we conclude that tools are essentially prompts, then I hope that while you’re reading this you’re having the following realization:

Tools need to be prompt engineered, and thus any prompt engineering technique I have at my disposal also needs to be applied to my tooling:

Tools are context dependent! Tool descriptions should fit with the rest of the prompt context.
Tool naming matters, a lot!
Tool input surface adds tokens and complexity, and thus needs to be optimized.
The same goes for the tool output surface.
The framing and phrasing of tool error responses matters; an agent will self-correct if you give it the right response.

In practice, I see many examples where engineers provide extensive instructions regarding a specific tool’s use in the main prompt of the agent. This is a practice that we should question. Should the instructions on how to use a tool reside in the larger agent prompt, or with the tool? Some tools need only a short summary; others benefit from richer guidance, examples, or edge-case notes so the model selects them reliably and formats arguments correctly. With masking, you can adapt the same underlying API to different agents and contexts by tailoring the tool description and schema per mask. Keeping that guidance co-located with the tool surface stabilizes the contract and avoids drifting chat prompts (see Anthropic’s Tool use and Best practices for tool definitions). When you also specify output structure, you improve consistency and parse-ability¹. Masks make this editable by prompt engineers instead of burying it in (Python) code.

Operationally, we should treat masks as configurable prompts for tools. Practically, we recommend that you store the masks in the same layer that hosts your prompts. Ideally, this is a config system that supports templating (e.g., Jinja), variables, and evaluation. These concepts are as usable for tool masks as for your regular prompts. Additionally, we recommend you version them, scope them by agent or role, and use these tool masks to fix defaults, hide unused params, or split one broad handler into multiple clean surfaces. Tool masks also have security benefits, allowing specific params to be provided by the system instead of the LLM. (Independent critiques also highlight cost/safety risks from unbounded tool outputs, yet another reason to constrain surfaces⁴.)

Done well, masking extends prompt engineering to the tool boundary where the model actually acts, yielding cleaner behavior and more consistent execution.

1. Anthropic — Tool Use Overview
2. OpenAI — Tools Guide
3. OpenAI Cookbook — Prompting Guide
4. Everything Wrong with MCP

Design Patterns

A few simple patterns cover most masking needs. Start with the smallest surface that works, then expand only when a task truly demands it.

Schema Shrink: Limit parameters to what the task needs; constrain types and ranges; prefill invariants.
Role-Scoped View: Present different masks to different agents or contexts; same handler, tailored surfaces.
Capability Gate: Expose a focused subset of operations; split a mega-tool into single-purpose tools; enforce allowlists.
Defaulted Args: Set smart defaults and hide nonessential options to cut tokens and variance.
System-Provided Args: Inject tenant, account, region, or policy values from the system; the LLM cannot change them, which improves security and consistency.
Toggle/Enum Surface: Combine related actions into one tool with an explicit enum or mode; no free-text switches.
Typed Outputs: Return a small, strict schema; normalize units and keys for reliable parsing and evaluation.
Progressive Disclosure: Ship the minimal mask first; add optional fields via new mask versions only when needed.
Validation: Allow custom input validation at the tool mask level; set constructive validation responses to guide the agent in the right direction.
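
A few of these patterns fit in one short sketch. The tool names, role names, and the `tenant_id` argument below are hypothetical and chosen only to illustrate the shape of each pattern.

```python
from typing import Any, Dict, List

# Toggle/Enum Surface: one tool with an explicit enum instead of a free-text switch.
QUOTE_FIELD_TOOL: Dict[str, Any] = {
    "name": "get_quote_field",
    "parameters": {
        "type": "object",
        "properties": {
            "symbol": {"type": "string"},
            "field": {"type": "string", "enum": ["price", "market_cap", "revenue"]},
        },
        "required": ["symbol", "field"],
    },
}

# Schema Shrink: a single-purpose tool with one parameter.
PRICE_TOOL: Dict[str, Any] = {
    "name": "get_price",
    "parameters": {
        "type": "object",
        "properties": {"symbol": {"type": "string"}},
        "required": ["symbol"],
    },
}

# Role-Scoped View and Capability Gate: an allowlist of masks per role.
MASKS_BY_ROLE: Dict[str, List[Dict[str, Any]]] = {
    "analyst": [QUOTE_FIELD_TOOL],
    "support": [PRICE_TOOL],
}


def tools_for(role: str) -> List[Dict[str, Any]]:
    """Return the masks a role may see; unknown roles see no tools at all."""
    return MASKS_BY_ROLE.get(role, [])


def with_system_args(agent_args: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
    """System-Provided Args: the context value overrides anything the model sent."""
    return {**agent_args, "tenant_id": context["tenant_id"]}
```

Defaulting to an empty tool list for unknown roles keeps the gate closed by default, which is usually the safer failure mode.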

Conclusion

Connectivity solved the what. Execution is the how. Services like MCP connect tools. Tool masking makes them perform by shaping the model-facing surface to fit the task and the agent that is working with it.

Think use case down, not tech up. One handler, many masks. Narrow the inputs and outputs, then experiment and prompt engineer your tool surface to perfection. Keep the description with the tool, not buried in chat text or code. Treat masks as configurable prompts that you can version, test, and assign per agent.

If you expose raw surfaces, you pay for entropy: more tokens, slower latency, lower accuracy, inconsistent behavior. Masks flip that curve. Smaller prompts. Faster responses. Higher pass rates. Fewer misfires. The impact of this approach compounds at enterprise scale. (Even MCP advocates note that discovery lists everything, without curation, and that agents send/consider too much data.)

So, what to do?

Put a masking layer between agents and every broad API
Try multiple masks on one handler, and customize a mask to see how it affects performance
Store masks with your prompts in config; version and iterate
Move tool instructions into the tool surface, and out of system prompts
Provide sensible defaults, and hide what the model should not touch

Stop shipping mega tools. Ship surfaces. That’s the layer MCP forgot. The step that turns an agent from connected into enabled.

Drop us a comment on LinkedIn if you liked this article!

About the authors:
Lucas and Frank have worked closely together on AI infrastructure across multiple companies (and advising a handful of others), from some of the earliest multi-agent teams, to LLM provider management, to document processing, to enterprise AI automation. We work at Databook, a cutting-edge AI automation platform for the world’s largest tech companies (MSFT, AWS, Databricks, Salesforce, and others), which we empower with a range of solutions using passive/proactive/guided AI for real-world, enterprise production applications.
