
    The LLM Gamble | Towards Data Science

By Editor Times Featured, April 21, 2026


When you sit down to prompt an LLM with a query in mind, there's an undeniable sense of chance. You can't be quite sure what the response will be, but there's a decent likelihood that it will impress you with its confidence and specificity, and that it will solve your problem in seconds. When it does, the feeling can be quite delightful!

However, sometimes it fails, whether on general knowledge or on specific tasks like coding. As the TikTok account Alberta Tech illustrates, sometimes the AI makes up its own imaginary functions and methods, building you something that couldn't possibly run. But sometimes it gives you something that works! A lot about this feels like a slot machine, doesn't it?

You don't know what's going to happen when you push the button, but you're hoping for a pleasant outcome, and each time you get a fresh chance at that dopamine hit. Nondeterminism makes every answer a little different, and not knowing what you'll get can frankly be thrilling! It's like your social media feed, too: what's coming up next? It might be an ad, or it might be your favorite creator.

I'm clearly nowhere near the first person to notice this aspect of the experience of using generative AI. In fall 2025, Cory Doctorow made the point that, just like gamblers, we remember the times gen AI worked well far more than the times it failed and we had to push the button again. Wesam Mikhail posted on LinkedIn about how the "wins" are misleading, because the code that works is also introducing bugs and tech debt under the hood. Yet we feel the rush of "oh, wow, look, it did it!" even so. Paul Weimer, Fang-Pen Lin, and many others have written about this same phenomenon in just the last several months.

One thing several of them also gestured at is the financial implications, and that's a big part of what interests me about the metaphor.

    The Chips

We pay for generative AI in units called tokens. These are usually words or parts of words, and they form the unit of measure for the inputs and outputs of LLMs. In a very real sense, the number of tokens is a measure of how much compute is used during the inference process. By paying for tokens, we're paying for all the resources and overhead involved in an inference task. That's why we end up paying both for the amount of text we pass in to the LLM, in the form of prompts, and for the amount of text the LLM returns to us in its responses.
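As a rough illustration, a widely used rule of thumb is that English text averages about four characters per token; this is only a heuristic, not any provider's actual tokenizer, but it lets you estimate token counts from text length:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the common ~4-characters-per-token
    heuristic for English text. Real tokenizers (BPE-based) will differ."""
    return max(1, round(len(text) / chars_per_token))

prompt = "Summarize the plot of Harry Potter in one paragraph."
print(estimate_tokens(prompt))  # 13 (52 characters / 4)
```

A real billing calculation would use the provider's own tokenizer, since the four-characters rule drifts for code, non-English text, and unusual formatting.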

Costs of using LLMs, therefore, are quoted in dollars per tokens, such as $5 per million input tokens and $25 per million output tokens, which are Anthropic's current API rates for Opus 4.6. There are also detailed prices for cache hits and repetitions, but this is the basic rate. For OpenAI, the prices are lower but measured the same way: for GPT 5.4, it's $2.50 per 1 million input tokens and $15 per 1 million output tokens. Older and less sophisticated models generally run cheaper.

So, if you submit 1 million input tokens to Opus 4.6, that will cost you $5, and if the outputs from Opus run to a length of 1 million tokens, that will cost you $25, making your total cost $30. A million tokens sounds like a lot, and it is (1.5 million tokens is roughly the length of the Harry Potter book series), but if you make the LLM part of your regular work, accumulated usage can pile up fast.
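The arithmetic above can be sketched as a small cost function. The defaults below use the Opus rates quoted in this article; any other per-million rates can be passed in:

```python
def inference_cost(input_tokens: int, output_tokens: int,
                   usd_per_m_input: float = 5.0,
                   usd_per_m_output: float = 25.0) -> float:
    """Total cost in USD for one request, given per-million-token rates.

    Defaults match the rates quoted above: $5/M input, $25/M output.
    """
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

# The example from the text: 1M tokens in, 1M tokens out.
print(inference_cost(1_000_000, 1_000_000))  # 30.0
```

Plugging in the GPT rates from the same paragraph ($2.50/M in, $15/M out) gives $17.50 for the same volume, which is the "lower but measured the same way" point in miniature.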

You may have already noticed the first point I want to make: you can ostensibly control how many tokens you submit, and thus control your costs, but that control is limited. You can make your prompts brief, limit extraneous instructions, and keep your input costs down as a result. However, when agentic tools get involved, and the LLM is constructing prompts to pass to other LLMs, you're no longer in control of the length of the prompts. Even more significantly, you have only the most minimal control over the number of tokens any model responds with (such as by asking it to "be concise"). For the most part, the number of output tokens is part of that nondeterministic unknown I described before. And, you'll note, an output token costs 5x the price of an input token.

So, returning to our slot machine metaphor, you put a quarter in the machine, and that pays for your pull. But then you get a response out, and you ALSO have to pay for that, even though you have no warning up front of how much it will cost. If you didn't win on that pull, and the LLM made up its own coding language and nothing runs? You still have to pay for that output, and the cost is predicated only on how long the response was, with no regard for how useful it was. The length could be any size, especially in agentic AI, and you have no way to predict it.
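To make that unpredictability concrete, here is a toy simulation with entirely hypothetical numbers: the prompt length is fixed (the part you control), while the output length is drawn at random for each "pull", and the bill moves with it:

```python
import random

random.seed(0)  # deterministic for the example

USD_PER_M_IN, USD_PER_M_OUT = 5.0, 25.0
input_tokens = 2_000  # fixed prompt length: the part you control

# Hypothetical: output length varies run to run, say 200 to 8,000 tokens.
costs = []
for _ in range(5):
    output_tokens = random.randint(200, 8_000)
    cost = (input_tokens * USD_PER_M_IN
            + output_tokens * USD_PER_M_OUT) / 1_000_000
    costs.append(round(cost, 4))

print(min(costs), max(costs))  # same prompt, different bills every pull
```

The per-pull amounts are small here, but the spread is the point: identical inputs, and the only thing determining your cost is how long the machine decides to talk.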

    Oh well, you may think, this is the price of the product, and anyway, the next pull will surely be better, right? So you pay for that output that didn’t work, and then you put another quarter in the machine, and pull, and hope for better.

    Subscriptions

Regular users of generative AI may be thinking, "oh, but you can just get a subscription and pay a flat rate!" This is true, and it has been instrumental to the adoption of these tools to date. A subscription abstracts away the token-level cost, letting you use the LLM for a flat rate up to a usage limit. A Claude subscription for an individual user starts at $20 a month, and this is the tier that gives you Claude Code, Cowork, research tooling, and extensions into other software such as Excel.

    However, it’s not as transparent as it might seem. None of these plans, on any provider, permit unlimited use, and the details of the limits are deeply obscured in documentation — “Your usage is affected by several factors, including the length and complexity of your conversations, the features you use, and which Claude model you’re chatting with.” This means that you can’t actually plan ahead for how much of your usage budget you’ll consume in any singular situation. At best, you have a cap on the cost you’ll encounter in any given month, so no surprise bills will show up, but you have no real idea when your usage for the month will abruptly cut off.

    Put another way, if your usage budget is based on features, what model you’re using, and the other things they describe, that means your token usage is not a flat limiter. Usage limits aren’t finely tuned to the token numbers. What this means is that many users with subscriptions may in fact use more than $20 worth of the services each month. This is even more true for Max plans, which cost between $100 and $200 a month and offer even more usage, but again, the limits on usage are obscured from users’ view. Deciphering what the limits really are, and what makes your usage take up more of your limit, is an issue that users frequently discuss, for example in Reddit communities or on other social media.

    Conclusion

What does this mean, overall? For one thing, the material cost of running generative AI inference is quite high. Analysts generally agree that for companies like Anthropic and OpenAI to generate significant revenue, much less make a profit and meet the expectations of investors, the prices I've laid out above are below cost. This is why, for example, Anthropic has forced users of OpenClaw onto per-token pricing rather than subscriptions: people are using more of their limits and turning the subscriptions into loss leaders.

However, pay-per-usage is extremely difficult to sell to most customers, because it exposes the fact I outlined at the start: you have to pay for the pull of the slot machine and for the output, even if you don't win. We expect value for money in situations like this, so the business model of gambling doesn't really make sense in the context of software. When we're used to ROI and quality assurance, a business model where you are required to pay for the product even when it doesn't work demands a significant paradigm shift.

There's no choice, though, for the generative AI providers: when the model runs inference and returns tokens, it costs them money, whether the response is any good or not. This is core to the question of how this technology turns from a novelty, or a bubble, into a sustainable business. Will people accept paying for each gamble, when they can't predict how much it will cost them (because the number of output tokens is nondeterministic) and can't predict whether it will actually work for their needs? I have to doubt it, for the general population, and that means a ticking time bomb for the industry.


Read more of my work at www.stephaniekirmer.com. You can also see me speak live in person at ODSC East on April 30 in Boston.


Further Reading

    Cory Doctorow

    https://www.linkedin.com/posts/wesammikhail_llm-coding-feels-like-productivity-but-behaves-activity-7436574539683307520-4cDK

    https://fangpenlin.com/posts/2026/03/19/no-llm-is-not-going-to-replace-software-engineers-heres-why

    https://www.patreon.com/posts/slot-machine-llm-138864761




