or so, it has been impossible to deny that the hype around AI has grown, particularly with the rise of generative AI and agentic AI. As a data scientist working in a consulting firm, I’ve noticed substantial growth in the number of enquiries about how we can leverage these new technologies to make processes more efficient or automated. And while this interest might flatter us data scientists, it sometimes seems as if people expect magic from AI models, as if they could solve every problem with nothing more than a prompt. I personally believe generative and agentic AI has changed (and will continue to change) how we work and live. However, when we change business processes, we must consider its limitations and challenges and see where it proves to be a good tool (we wouldn’t use a fork, for example, to cut food).
As I’m a nerd and understand how LLMs work, I wanted to test their performance in a logic game, the Spanish version of Wordle, against a logic I had built in a couple of hours some years ago (more details on that can be found here). Specifically, I had the following questions:
- Will my algorithm be better than LLM models?
- How will reasoning capabilities in LLM models affect their performance?
Building an LLM-based solution
To get a solution from the LLM model, I built three main prompts. The first one was aimed at getting an initial guess:
Let’s suppose I’m playing WORDLE, but in Spanish. It’s a game where you have to guess a 5-letter word, and only 5 letters, in 6 attempts. Also, a letter can be repeated in the final word.
First, let’s review the rules of the game: Every day the game chooses a five-letter word that players try to guess within six attempts. After the player enters the word they think it is, each letter is marked in green, yellow, or gray: green means the letter is correct and in the correct position; yellow means the letter is in the hidden word but not in the correct position; while gray means the letter is not in the hidden word.
But if you place a letter twice and one shows up green and the other yellow, it means the letter appears twice: once in the green position, and once in another position that’s not the yellow one.
Example: If the hidden word is “PIZZA”, and your first attempt is “PANEL”, the response would look like this: the “P” would be green, the “A” yellow, and the “N”, “E”, and “L” gray.
Since for now we don’t know anything about the target word, give me a good starting word, one that you think will provide useful information to help us figure out the final word.
Then, a second prompt was used to lay out all the game rules (the prompt is not shown in full here due to space, but the full version also had example games and example reasonings):
Now, the idea is that we review the game strategy. I’ll be giving you the game results. The idea is that, given this result, you suggest a new 5-letter word. Remember also that there are only 6 total attempts. I’ll give you the result in the following format:
LETTER -> COLOR
For example, if the hidden word is PIZZA, and the attempt is PANEL, I’ll give the result in this format:
P -> GREEN (it’s the first letter of the final word)
A -> YELLOW (it’s in the word, but not in the second position; instead it’s in the last one)
N -> GRAY (it’s not in the word)
E -> GRAY (it’s not in the word)
L -> GRAY (it’s not in the word)
Let’s remember the rules. If a letter is green, it means it’s in the position where it was placed. If it’s yellow, it means the letter is in the word, but not in that position. If it’s gray, it means it’s not in the word.
If you place a letter twice and one shows green and the other gray, it means the letter only appears once in the word. But if you place a letter twice and one shows green and the other yellow, it means the letter appears twice: once in the green position, and another time in a different position (not the yellow one).
All the information I give you must be used to build your suggestion. At the end of the day, we want to “turn” all the letters green, since that means we guessed the word.
Your final answer must only contain the word suggestion, not your reasoning.
The final prompt was used to get a new suggestion after having the result of our attempt:
Here’s the result. Remember that the word must have 5 letters, that you must use the rules and all the knowledge of the game, and that the goal is to “turn” all the letters green, with no more than 6 attempts to guess the word. Take your time to think through your answer; I don’t need a quick response. Don’t give me your reasoning, only your final result.
Something important here is that I never tried to guide the LLMs or point out mistakes or errors in their logic. I wanted a pure LLM-based result and didn’t want to bias the solution in any shape or form.
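For reference, the coloring rules stated in the prompts above, including the repeated-letter cases, can be written down in a few lines of Python. This is only a minimal sketch and not the code used in the experiment: the function names `score_guess` and `format_result` are hypothetical, and the two-pass handling of repeated letters follows the usual Wordle convention, which I am assuming the Spanish version shares.

```python
from collections import Counter


def score_guess(hidden, guess):
    """Return one color per letter of `guess`: GREEN, YELLOW or GRAY."""
    hidden, guess = hidden.upper(), guess.upper()
    colors = ["GRAY"] * len(guess)

    # Letters of the hidden word that were NOT matched in place;
    # only these copies can still produce a yellow.
    remaining = Counter(h for h, g in zip(hidden, guess) if h != g)

    # First pass: exact matches are green.
    for i, (h, g) in enumerate(zip(hidden, guess)):
        if h == g:
            colors[i] = "GREEN"

    # Second pass: a misplaced letter is yellow only while unmatched
    # copies of it remain in the hidden word.
    for i, g in enumerate(guess):
        if colors[i] != "GREEN" and remaining[g] > 0:
            colors[i] = "YELLOW"
            remaining[g] -= 1

    return colors


def format_result(guess, colors):
    """Render a result in the 'LETTER -> COLOR' format used in the prompts."""
    return "\n".join(f"{letter} -> {color}"
                     for letter, color in zip(guess.upper(), colors))


# The PIZZA / PANEL example from the prompt.
print(format_result("PANEL", score_guess("PIZZA", "PANEL")))
```

With the hidden word PIZZA and the attempt PANEL, this prints the same five lines given as the example in the second prompt. The `remaining` counter is what makes a repeated letter show gray once the hidden word has no unmatched copies left.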
Initial experiments
My initial hypothesis was that, while I expected my algorithm to be better than the LLMs, the generative AI-based solution was going to do a pretty good job without much help. After some days, however, I noticed some “funny” behaviors, like the one below (where the answer was obvious):
The answer was quite obvious: it only had to change two letters. Nonetheless, ChatGPT answered with the same guess as before.
After seeing these kinds of errors, I started to ask about them at the end of games, and the LLMs basically acknowledged their mistakes, but didn’t provide a clear explanation for their answer:

While these are just two examples, this kind of behavior was common when generating the pure LLM solution, showcasing some potential limitations in the reasoning of base models.
Results Analysis
With all this information in mind, I ran an experiment for 30 days. For the first 15 days I compared my algorithm against 3 base LLM models:
- ChatGPT’s 4o/5 model (after OpenAI released the GPT-5 model, I couldn’t toggle between models on the free-tier version of ChatGPT)
- Gemini’s 2.5-Flash model
- Meta’s Llama 4 model
Here, I compared two main metrics: the percentage of wins and a points-system metric (each green letter in the final guess awarded 3 points, each yellow letter 1 point, and each gray letter 0 points):
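The two metrics are simple enough to state as code. The sketch below is illustrative only: the names `game_points` and `summarize` are hypothetical, and the two sample games are made-up inputs, not results from the experiment.

```python
# Points awarded per letter of the final guess, as described in the text.
POINTS = {"GREEN": 3, "YELLOW": 1, "GRAY": 0}


def game_points(final_colors):
    """Points for one game, computed from the colors of its final guess."""
    return sum(POINTS[color] for color in final_colors)


def summarize(games):
    """Win percentage and mean points over a list of final-guess colorings."""
    wins = sum(all(color == "GREEN" for color in game) for game in games)
    points = [game_points(game) for game in games]
    return {
        "win_pct": 100 * wins / len(games),
        "mean_points": sum(points) / len(points),
    }


# Made-up example: one won game and one lost game.
sample_games = [
    ["GREEN", "GREEN", "GREEN", "GREEN", "GREEN"],
    ["GREEN", "GREEN", "YELLOW", "GRAY", "GREEN"],
]
print(summarize(sample_games))
```

Under this scoring a won game is always worth 15 points, so the points metric mainly separates near misses from games that ended far from the hidden word.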

As can be seen, my algorithm (which, while specific to this use case, only took me a day or so to build) is the only approach that wins every day. Among the LLM models, Gemini shows the worst performance, while ChatGPT and Meta’s Llama show similar numbers. However, as can be seen in the figure on the right, there is great variability in the performance of each model, and none of these alternatives is consistent for this particular use case.
However, these results wouldn’t be complete if we didn’t analyze a reasoning LLM model against my algorithm (and against a base LLM model). So, for the following 15 days I also compared the following models:
- ChatGPT’s 4o/5 model using its reasoning capability
- Gemini’s 2.5-Flash model (same model as before)
- Meta’s Llama 4 model (same model as before)
Some important comments here: initially, I planned to use Grok as well, but after Grok 4 was released, the reasoning toggle for Grok 3 disappeared, which made comparisons difficult. I also tried to use Gemini’s 2.5-Pro, but in contrast with ChatGPT’s reasoning option, it is not a toggle but a different model, one that only allowed me to send 5 prompts per day, which was not enough to complete a full game. With this in mind, here are the results for the following 15 days:

The reasoning capability behind LLMs provides a huge boost to performance in this task, which requires understanding which letter can be used in each position, tracking which letters have already been evaluated, remembering all results and understanding all combinations. Not only are the average results better, but performance is also more consistent: in the two games that weren’t won, only one letter was missed. Despite this improvement, the specific algorithm I built is still slightly better in terms of performance, but as I mentioned earlier, it was built for this specific task. Something interesting is that for these 15 games, the base LLM models (Gemini 2.5 Flash and Llama 4) didn’t win once, and their performance was worse than in the first set, which makes me wonder whether the wins achieved before were lucky or not.
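To make that bookkeeping concrete, here is a minimal sketch of a constraint filter that keeps only the words consistent with every result seen so far. It is a generic illustration, not the algorithm I built: the names `score_guess` and `consistent_words` are hypothetical, and the six-word list stands in for a real Spanish dictionary.

```python
from collections import Counter


def score_guess(hidden, guess):
    """Wordle-style coloring of `guess` against `hidden`, as a tuple."""
    colors = ["GRAY"] * len(guess)
    # Hidden letters not matched in place can still yield yellows.
    remaining = Counter(h for h, g in zip(hidden, guess) if h != g)
    for i, (h, g) in enumerate(zip(hidden, guess)):
        if h == g:
            colors[i] = "GREEN"
    for i, g in enumerate(guess):
        if colors[i] != "GREEN" and remaining[g] > 0:
            colors[i] = "YELLOW"
            remaining[g] -= 1
    return tuple(colors)


def consistent_words(candidates, history):
    """Keep the words that would have produced every observed result."""
    return [
        word for word in candidates
        if all(score_guess(word, guess) == colors for guess, colors in history)
    ]


# Tiny illustrative word list (an assumption, not a real dictionary).
words = ["PIZZA", "PLAYA", "PASTA", "PERRO", "PIZCA", "PRIMA"]

# One observed result: the PANEL attempt from the prompt example.
history = [("PANEL", ("GREEN", "YELLOW", "GRAY", "GRAY", "GRAY"))]

print(consistent_words(words, history))
```

After a single attempt, half of this toy list is already ruled out. A deterministic filter of this kind never repeats an eliminated guess, which is precisely the mistake the base models kept making.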
Final Remarks
The aim of this exercise has been to test the performance of LLMs against a purpose-built algorithm on a task that requires applying logic rules to reach a winning result. We’ve seen that base models don’t perform well, but that the reasoning capabilities of LLM solutions provide an important boost, bringing performance close to that of the tailored algorithm I had built. One important thing to keep in mind is that while this improvement is real, in real-world applications and production systems we also need to consider response time (reasoning LLM models take more time to generate an answer than base models or, in this case, the logic I built) and cost (according to the Azure OpenAI pricing page, as of the 30th of August of 2025, the price of 1M input tokens for the general-purpose GPT-4o-mini model is around $0.15, while for the o4-mini reasoning model, the cost of 1M input tokens is $1.10). While I firmly believe that LLMs and generative AI will continue to change the way we work, we can’t treat them as a Swiss Army knife that solves everything, without considering their limitations and without evaluating easy-to-build tailored solutions.
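The price gap quoted above is easy to put into numbers. This is a back-of-the-envelope sketch that uses only the two input-token prices cited in the text; the function name `input_cost` and the monthly token volume are hypothetical, and output-token prices are not included.

```python
# Input-token prices (USD per 1M tokens) as quoted in the text.
PRICE_PER_1M_INPUT = {
    "gpt-4o-mini": 0.15,  # general-purpose model
    "o4-mini": 1.10,      # reasoning model
}


def input_cost(model, n_tokens):
    """Cost in USD of sending `n_tokens` input tokens to `model`."""
    return PRICE_PER_1M_INPUT[model] * n_tokens / 1_000_000


ratio = PRICE_PER_1M_INPUT["o4-mini"] / PRICE_PER_1M_INPUT["gpt-4o-mini"]
print(f"Reasoning model input price is about {ratio:.1f}x the base price")

# Hypothetical volume, only to illustrate the scale of the difference.
monthly_tokens = 50_000_000
for model in PRICE_PER_1M_INPUT:
    print(f"{model}: ${input_cost(model, monthly_tokens):.2f} per month")
```

On input tokens alone, the reasoning model costs roughly seven times as much, before accounting for output tokens or the longer response time.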

