
    When LLMs Try to Reason: Experiments in Text and Vision-Based Abstraction



    Can models learn to reason abstractly from just a few examples? In this piece, I explore this question by testing both text-based (o3-mini) and image-capable (gpt-4.1) models on abstract grid transformation tasks. The experiments reveal the extent to which current models rely on pattern matching, procedural heuristics, and symbolic shortcuts rather than robust generalization. Even with multimodal inputs, reasoning often breaks down in the face of subtle abstraction. The results offer a window into the current capabilities and limitations of in-context meta-learning with LLMs.

    Introduction

    Meta-learning, the ability of a system to learn how to learn, has traditionally been explored through gradient-based optimization, memory-augmented networks, or explicit task embeddings. But with the rise of large language models (LLMs), particularly the o3 family with advanced reasoning capabilities, a new question emerges: can we use LLMs themselves as meta-learners in task-based domains like ARC? The Abstraction and Reasoning Corpus (ARC), introduced by François Chollet, is a benchmark explicitly designed to test broad generalization. It provides input-output transformation puzzles with minimal supervision, few examples per task, and often no shared surface-level structure across tasks. In other words: a playground for meta-learning. To get a feel for typical abstraction and reasoning tasks, the reader can visit the ARC play page.

    Example game from the ARC website. From the demonstration grids, it is clear that the task for the test grid is to turn black areas yellow wherever they are fully enclosed by green boundaries.

    Data and Setup

    To explore whether LLMs like o3-mini can perform meta-learning on abstract reasoning tasks, I used data from the ARC Prize 2025 Kaggle competition. The dataset repository can be found here (Apache 2.0 license). The dataset consists of input-output grid transformations that challenge models to infer abstract rules from just a few examples.

    Each task provides:

    • A few training examples (input and output 2D grids)
    • A single test input grid for which the model must predict the corresponding output

    A second dataset provides the solution grids for each of the test input grids. Here’s a simplified example of the data format:

    # training examples - dictionary of dictionaries.
    # Here is an extracted task
    {'train': [{'input': [[6, 6, 0], [6, 0, 0], [0, 6, 6]],
       'output': [[6, 6, 0, 6, 6, 0, 0, 0, 0],
        [6, 0, 0, 6, 0, 0, 0, 0, 0],
        [0, 6, 6, 0, 6, 6, 0, 0, 0],
        [6, 6, 0, 0, 0, 0, 0, 0, 0],
        [6, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 6, 6, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 6, 6, 0, 6, 6, 0],
        [0, 0, 0, 6, 0, 0, 6, 0, 0],
        [0, 0, 0, 0, 6, 6, 0, 6, 6]]},
      {'input': [[4, 0, 4], [0, 0, 0], [0, 4, 0]],
       'output': [[4, 0, 4, 0, 0, 0, 4, 0, 4],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 4, 0, 0, 0, 0, 0, 4, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 4, 0, 4, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 4, 0, 0, 0, 0]]},...,
       'test': [{'input': [[7, 0, 7], [7, 0, 7], [7, 7, 0]]}]
    }
    
    # example of a solution to a test input grid - dictionary of lists
    # Here is the extracted solution for the single test input grid above
    [[[7, 0, 7, 0, 0, 0, 7, 0, 7],
      [7, 0, 7, 0, 0, 0, 7, 0, 7],
      [7, 7, 0, 0, 0, 0, 7, 7, 0],
      [7, 0, 7, 0, 0, 0, 7, 0, 7],
      [7, 0, 7, 0, 0, 0, 7, 0, 7],
      [7, 7, 0, 0, 0, 0, 7, 7, 0],
      [7, 0, 7, 7, 0, 7, 0, 0, 0],
      [7, 0, 7, 7, 0, 7, 0, 0, 0],
      [7, 7, 0, 7, 7, 0, 0, 0, 0]]]
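
    The variables used in the training loop further below (train_challenges, train_sols, id_train_challenges) come from loading these two JSON files. Here is a minimal loading sketch; the file names are an assumption on my part, based on the Kaggle dataset layout:

    import json

    # File names assumed from the ARC Prize Kaggle dataset layout
    with open('arc-agi_training_challenges.json') as f:
        train_challenges = json.load(f)  # task_id -> {'train': [...], 'test': [...]}
    with open('arc-agi_training_solutions.json') as f:
        train_sols = json.load(f)  # task_id -> list of solution grids, one per test input

    id_train_challenges = list(train_challenges.keys())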

    Each grid is a 2D array of integers from 0–9, representing colored pixels. Grids come in various sizes, and a transformation may also change the grid size from input to output. To visualize the arrays, I used a custom colormap with matplotlib:

    from matplotlib import colors
    import matplotlib.pyplot as plt

    cmap = colors.ListedColormap([
        '#8B00FF',  # Violet
        '#4B0082',  # Indigo
        '#0000FF',  # Blue
        '#FFFF00',  # Yellow
        '#00FF00',  # Green
        '#FF7F00',  # Orange
        '#FF0000',  # Red
        '#964B00',  # Golden
        '#000000',  # Black
        '#FFFFFF',  # White
    ])
    norm = colors.Normalize(vmin=0, vmax=9)

    # Function to visualize an array
    def visualize_matrix(matrix, title='', cmap=cmap, norm=norm):
        plt.imshow(matrix, cmap=cmap, norm=norm)
        plt.title(title)
        plt.axis('off')  # Remove axes
        plt.show()
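
    For example, rendering the first demonstration input from the task above:

    visualize_matrix([[6, 6, 0], [6, 0, 0], [0, 6, 6]], title='Demonstration input')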

    For model interaction, I used OpenAI’s o3-mini model via LangChain. Later on, we will also use gpt-4.1:

    from langchain_openai import ChatOpenAI
    import getpass
    import os

    # Prompt for a secret input
    openai_key = getpass.getpass("Enter your OpenAI API key: ")

    os.environ["OPENAI_API_KEY"] = openai_key

    AGENT_MODEL = "o3-mini"  # reasoning model, https://platform.openai.com/docs/models
    AGENT_LLM = ChatOpenAI(model=AGENT_MODEL)
    # AGENT_LLM = ChatOpenAI(model=AGENT_MODEL, reasoning_effort='low')
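
    As a quick sanity check that the key and model are reachable (a trivial call of mine, not part of the experiment):

    # One trivial call to confirm the client works before running the loop
    print(AGENT_LLM.invoke("Reply with one word: ready").content)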

    To handle LLM responses, especially when the model returns a predicted output grid as Python code inside triple backticks, I wrote a utility:

    import re, ast

    def extract_python_code(response_string):
        match = re.search(r"```python\s*(.*?)```", response_string, re.DOTALL)
        if match:
            return ast.literal_eval(match.group(1).strip())
        return None
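
    For example, given a typical model reply:

    # Example: pull the predicted grid out of a model reply
    reply = "Here is the output:\n```python\n[[1, 2], [3, 4]]\n```"
    print(extract_python_code(reply))  # [[1, 2], [3, 4]]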

    This setup allowed me to structure a full reasoning loop: prompt the model with few-shot examples, extract and apply a generated algorithm, assess its performance on new test inputs, and finally use the assessment to improve the algorithm.

    Testing Reasoning with o3-mini

    To evaluate whether LLMs can “meta-learn” on abstract reasoning tasks, I tested the o3-mini model using a closed-loop reasoning setup inspired by how humans might approach few-shot tasks. For each ARC challenge, I provided the model with a handful of demonstration input-output grid pairs and asked it to derive a single reusable algorithm.

    I defined a series of prompts using LangChain’s ChatPromptTemplate to simulate reasoning, application, assessment, and refinement. The process mimics an inner training loop with limited supervision:

    • PROMPT_REASON: The model is given training examples and asked to infer a general algorithm in pseudocode.
    • PROMPT_SOLVE: The generated algorithm is applied to new inputs (both training and test).
    • PROMPT_ASSESS: When the algorithm fails, the model receives feedback comparing its predicted vs. expected outputs.
    • PROMPT_SUMMARIZE_FEEDBACK: The model summarizes cumulative feedback from failed attempts to iteratively refine its approach.
    from langchain_core.prompts import ChatPromptTemplate

    PROMPT_REASON = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are an expert in solving abstract reasoning tasks. "
                "You will be given several demonstration input-output pairs of 2D arrays. "
                "Your goal is to develop a single algorithm that maps each input array to its corresponding output array.\n\n"

                "Each input and output is a 2-dimensional array of integers between 0 and 9. "
                "Solving the task involves:\n"
                "- Analyzing the demonstration pairs\n"
                "- Identifying abstract patterns or transformations\n"
                "- Formulating a general rule or algorithm that works across all examples\n"
                "- Producing pseudocode that implements the rule\n\n"

                "If prior attempts were made, you will also receive feedback summarizing what went wrong. "
                "Carefully use this feedback to improve your solution.\n\n"

                "Return only the updated algorithm as pseudocode. Do not describe or explain it.\n\n"
                "### Feedback (summary of previous attempts):\n{attempt_history}\n\n"
                "### Demonstration Pairs:\n{train_pairs}\n"
            ),
            (
                "ai",
                "Answer:"
            )
        ]
    )

    PROMPT_SOLVE = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are an expert in abstract reasoning. "
                "Previously, you analyzed demonstration input-output pairs and developed an algorithm "
                "to transform input arrays into output arrays.\n\n"

                "Now, use that algorithm to generate an output array for a new, unseen input array.\n\n"

                "Only return the output array, formatted as valid Python code within a code block. "
                "For example:\n```python\n[[2, 3], [5, 6]]\n```\n"

                "### Developed algorithm:\n{reasoning_template}\n\n"
                "### New input array:\n{test_input}\n"
            ),
            (
                "ai",
                "Answer:"
            )
        ]
    )

    PROMPT_ASSESS = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are an expert in abstract reasoning. "
                "A solution array was generated by applying the algorithm to the input array. "
                "Compare the generated solution to the actual target output. "
                "Analyze why the two arrays differ, and provide **clear and concise feedback** on how to improve the algorithm.\n\n"

                "Only return your feedback; do not repeat the arrays or algorithm.\n\n"

                "### Algorithm:\n{reasoning_template}\n\n"
                "### Input array:\n{test_input}\n\n"
                "### Solution array (generated by algorithm):\n{solved_test_output}\n\n"
                "### Target output array:\n{test_output}\n"
            ),
            (
                "ai",
                "Answer:"
            )
        ]
    )

    PROMPT_SUMMARIZE_FEEDBACK = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are an expert in summarizing feedback on algorithm development. "
                "You will be given a history of past attempts, each containing an algorithm and feedback about its performance.\n\n"

                "Your goal is to produce a **concise summary** of the most important lessons learned, "
                "focusing on how the algorithm should be improved and what mistakes should be avoided in future versions.\n\n"

                "Return only the feedback summary. Do not repeat the original attempts or feedback.\n\n"

                "### Attempt History:\n{attempt_history}\n"
            ),
            (
                "ai",
                "Answer:"
            )
        ]
    )

    These prompts are chained into a simple LangChain pipeline:

    reasoning_chain = PROMPT_REASON | AGENT_LLM
    solve_chain = PROMPT_SOLVE | AGENT_LLM 
    assess_chain = PROMPT_ASSESS | AGENT_LLM 
    summarize_feedback_chain = PROMPT_SUMMARIZE_FEEDBACK | AGENT_LLM

    For each ARC challenge:

    • The model receives the demonstration pairs and prior feedback;
    • The model generates a new algorithm in pseudocode (reasoning_template);
    • The algorithm is tested on all the demonstrations;
    • If it fails, the model: receives detailed feedback on mismatched predictions; summarizes errors across attempts; refines the next version of the algorithm;
    • Once the model gets all demonstrations right, I test it on the unseen test input.

    This process repeats for up to a maximum number of attempts per challenge. A successful algorithm generalizes across the provided examples and applies correctly to the withheld test case. This setup tests whether the model can extract abstract patterns, improve its reasoning over time, and generalize from just a few examples.

    reasoning_templates = {}

    for i, id in enumerate(id_train_challenges):
        print(f"Training on challenge {i} ID: {id}")
        train_pairs = train_challenges[id]['train']
        test_input = train_challenges[id]['test'][0]['input']  # only pick the first test input
        test_output = train_sols[id][0]  # only pick the first test output
        train_pairs_str = ''
        for j, train_pair in enumerate(train_pairs):  # j avoids shadowing the outer index i
            train_pairs_str += f"Demonstration pair {j+1}:\n input grid: {train_pair['input']} \n output grid: {train_pair['output']}\n"
        train_pairs_str = train_pairs_str.strip()

        # keep trying until you figure out how to solve the challenge
        right_wrong = "wrong"
        # Start with an empty reasoning template, which will be refined over time
        reasoning_template = ''
        k = 1
        max_attempts = 5
        attempt_history = []
        attempt_history_summary = ''
        while right_wrong == "wrong":
            print(f"Attempt {k} to solve the challenge...")

            # Build the reasoning message with the current reasoning template and attempt history
            # This message will be used to generate a new reasoning template
            reason_message = {
                "train_pairs": train_pairs_str,
                "attempt_history": attempt_history_summary,
            }
            res = reasoning_chain.invoke(reason_message)
            reasoning_template = res.content

            # Assess the reasoning template
            wrong_pairs = []
            for train_pair in train_pairs:
                demo_input = train_pair['input']
                demo_output = train_pair['output']
                # Test the reasoning template on the demonstration pair
                test_message = {
                    "test_input": demo_input,
                    "reasoning_template": reasoning_template,
                }
                res = solve_chain.invoke(test_message)
                solved_demo_output = extract_python_code(res.content)
                # Compare the output with the demonstration output
                if solved_demo_output != demo_output:
                    wrong_pairs.append((demo_input, demo_output, solved_demo_output))

            if len(wrong_pairs) > 0:
                right_wrong = 'wrong'
                print(f"Reasoning template failed on {len(wrong_pairs)} demonstration pairs.")

                if k >= max_attempts:
                    print(f"Max attempts reached ({max_attempts}). Stopping for challenge {id}.")
                    reasoning_templates[id] = ''
                    break

                print("Assessing the reasoning template...")
                assessment_res = f'Algorithm failed on {len(wrong_pairs)} demonstration pairs. Here is the feedback:\n'
                for demo_input, demo_output, solved_demo_output in wrong_pairs:
                    assess_chain_message = {
                        "reasoning_template": reasoning_template,
                        "test_input": demo_input,
                        "solved_test_output": solved_demo_output,
                        "test_output": demo_output,
                    }
                    res = assess_chain.invoke(assess_chain_message)
                    assessment_res += f" - From input {demo_input} to output {demo_output}, your solution was {solved_demo_output}: {res.content.strip()}\n"

                attempt_history.append({
                    "attempt": k,
                    "reasoning_template": reasoning_template,
                    "feedback": assessment_res
                })

                summary_message = {
                    "attempt_history": attempt_history,
                }
                summary_res = summarize_feedback_chain.invoke(summary_message)
                attempt_history_summary = summary_res.content.strip()
            else:
                print("Solution is correct!")
                right_wrong = "correct"
                reasoning_templates[id] = reasoning_template

                # test it against the test input/output ... but don't give feedback (this is supposed to be unknown)
                test_message = {
                    "test_input": test_input,
                    "reasoning_template": reasoning_template,
                }
                res = solve_chain.invoke(test_message)
                solved_test_output = extract_python_code(res.content)
                if test_output != solved_test_output:
                    print(f"Test output does not match the true output for challenge {id}.")
                else:
                    print(f"Test output matches the true output for challenge {id}.")
                    #visualize_matrix(test_input, "Input grid")
                    #visualize_matrix(test_output, "True output")
                    #visualize_matrix(solved_test_output, "Test Output")

                print("-" * 40)  # Separator between entries

            k += 1
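
    After the loop finishes, a quick tally shows how many challenges ended with a demonstration-consistent template (a small addition of mine, not part of the original loop):

    # Count challenges where a template fit all demonstration pairs
    n_solved = sum(1 for t in reasoning_templates.values() if t)
    print(f"Demonstration-consistent templates: {n_solved}/{len(reasoning_templates)}")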

    Results: When Reasoning Works

    In some cases, o3-mini was able to correctly infer a generalizable algorithm from just a few input-output demonstrations. One such example involved generating a patterned tiling based on a small 2×2 input grid.

    After just one attempt, the model converged on the following pseudocode:

    BEGIN
      Let input be a 2x2 grid, where:
        input[0] = [a, b]
        input[1] = [c, d]

      Initialize output as an empty list.

      FOR each row index r from 0 to 5 DO:
        Let original_row ← input[r mod 2]

        IF (FLOOR(r / 2)) mod 2 = 1 THEN
          Let base_row ← REVERSE(original_row)
        ELSE
          Let base_row ← original_row
        ENDIF

        Initialize new_row as an empty list.
        FOR repeat from 1 to 3 DO:
          Append all elements of base_row to new_row.
        ENDFOR

        Append new_row to output.
      ENDFOR

      RETURN output
    END

    Here is the expected solution (true output) and the solution from the model’s pseudocode (test output).

    True output grid (image by author).
    Test output grid from the pseudocode (image by author).

    This algorithm demonstrates several notable reasoning capabilities:

    • Pattern abstraction: the model inferred a repeating tiling pattern from limited data;
    • Modulo logic: it introduced modulo-based indexing (r mod 2 and (r // 2) mod 2) to alternate row behavior, mimicking visual mirroring;
    • Grid expansion: the solution scales up the 2×2 input into a larger 6×6 grid via repeated copying and reversal.

    The fact that the model discovered this structure without hardcoded rules suggests it was engaging in a form of algorithm synthesis, guided by few-shot abstraction.
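
    To make the rule concrete, here is my own Python rendering of the pseudocode above (a direct translation, not model output):

    # Direct translation of the model's pseudocode: 2x2 -> 6x6
    def tile_2x2(grid):
        output = []
        for r in range(6):
            base_row = grid[r % 2]
            if (r // 2) % 2 == 1:
                base_row = base_row[::-1]  # REVERSE(original_row)
            output.append(base_row * 3)   # repeat the row 3 times
        return output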

    Here is another successful example.

    Example grids and test input grid (image by author).

    Again, after just one attempt, the model converged on the following pseudocode:

    BEGIN
      Let N = 3
      Create output as a 2D array of size (N×N) × (N×N), filled with 0

      FOR each row r from 0 to N–1:
        FOR each column c from 0 to N–1:
          IF input[r][c] ≠ 0 THEN
            FOR each i from 0 to N–1:
              FOR each j from 0 to N–1:
                Set output[(r * N) + i][(c * N) + j] = input[i][j]
      RETURN output
    END

    Here is the expected solution and the solution from the model’s pseudocode.

    True output grid (image by author).
    Test output grid from the pseudocode (image by author).

    This algorithm effectively tiles the full input grid into the output grid at every position where the input cell is non-zero. The tiling is aligned such that each copy of the original grid is placed at an offset determined by (r * N, c * N), the scaled-up coordinates of the non-zero input cell.
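
    The rule is compact enough to translate directly into Python (again my own rendering of the pseudocode, not model output):

    # Paste the whole grid into each NxN block whose input cell is non-zero
    def fractal_tile(grid):
        n = len(grid)
        output = [[0] * (n * n) for _ in range(n * n)]
        for r in range(n):
            for c in range(n):
                if grid[r][c] != 0:
                    for i in range(n):
                        for j in range(n):
                            output[r * n + i][c * n + j] = grid[i][j]
        return output

    # Applied to the test input from the data-format example earlier:
    visualize_matrix(fractal_tile([[7, 0, 7], [7, 0, 7], [7, 7, 0]]))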

    What’s impressive here is that the model:

    • Learns conditional placement. It only pastes the input where the input value is non-zero;
    • Uses coordinate arithmetic to scale placement locations, showing spatial understanding;
    • Treats the input as both control logic and content, combining layout detection with repetition.

    When Text-Based Reasoning Isn’t Enough

    In the task below, the model was given a small number of demonstration input-output grid pairs and asked to infer the transformation rule.

    Example grids and test input grid (image by author).

    The text-based model (o3-mini) produced a detailed pseudocode solution that was structured, plausible, and internally consistent:

    BEGIN
      Let original ← input grid
      Let output ← deep copy of original
      Let R ← number of rows in original
      Let C ← number of columns in original

      // Compute ring index for every cell that is part of a non-zero region.
      // A cell's ring index is defined as:
      //   - 0 if the cell is on the boundary of the grid OR if at least one of its 4-neighbors is 0
      //   - Otherwise, 1 + min(ring index of its 4-neighbors)

      Create grid ring of size R × C, filled with −1

      For each cell (r, c) in original:
        If original[r][c] ≠ 0 then
          If r = 0 OR c = 0 OR r = R−1 OR c = C−1 then
            ring[r][c] ← 0
          Else if any 4-neighbor (up, down, left, right) of (r, c) has value 0 then
            ring[r][c] ← 0

      modified ← true
      While modified is true:
        Set modified ← false
        For each cell (r, c) in original such that original[r][c] ≠ 0 and (ring[r][c] < 0 or not finalized):
          Let candidates ← list of ring indices from any 4-neighbor (r′, c′) with a non-negative ring value
          If candidates is non-empty then
            newRing ← min(candidates) + 1
            If ring[r][c] < 0 OR newRing < ring[r][c] then
              ring[r][c] ← newRing
              modified ← true

      // Process propagation from the outer ring inward.
      // For each cell with ring index ≥ 1, update its value based on a neighbor with ring index one less.
      // Selection is deterministic: "up, left, down, right"

      Let maxRing ← maximum value in ring

      For layer from 1 to maxRing:
        For each cell (r, c) such that ring[r][c] = layer:
          For direction in [(−1,0), (0,−1), (1,0), (0,1)]:
            nr ← r + direction.row
            nc ← c + direction.col
            If (nr, nc) is in bounds AND ring[nr][nc] = layer − 1 then
              Let base ← output[nr][nc]
              If base = 2 then
                newValue ← 8
              Else if base mod 2 = 0 then
                newValue ← base + 2
              Else // base is odd
                newValue ← base + 1
              output[r][c] ← newValue
              Break out of the neighbor loop

    Return output
    END

    At first glance, this looks like a sophisticated strategy: compute a ring index for each non-zero region, then diffuse values inward using deterministic rules. It even includes update logic:

    • If the “base” cell is 2 → assign 8
    • If base is even → add 2
    • If base is odd → add 1

    But this entire algorithm, however coherent, is misguided. It completely fails to match the true underlying transformation shown in the demonstrations.

    Expected test output grid (image by author).
    Test output grid from the pseudocode (image by author).

    Instead of reasoning about blue-bordered regions and their nested structure, the model generated a generic flood-fill algorithm based on distance from the edge and adjacency to zeros. Even the feedback keeps refining the procedural approach identified earlier:

    ['Key lessons are to: • Precisely compute the ring index so that only',
     'true boundary (or external zero) cells get index 0, ensuring that',
     'inner cells receive higher indices for proper propagation. • Use a',
     'reliable, consistent method for selecting the "base" value for',
     'updates-ideally by considering all adjacent lower-ring cells or using',
     'a deterministic order-and use an immutable copy of the original grid',
     'for these lookups. • Apply the parity‐based update rules correctly so',
     'that cells with ring index ≥ 1 get the specified value increments',
     '(especially the special case when the base is 2) rather than remaining',
     'unchanged. • Ensure that the update logic cascades inward, allowing',
     'inner cells to correctly inherit and build upon values from outer',
     'rings.']

    So what went wrong?

    • Topological, not visual. The model focused on connectivity and edge proximity, ignoring the visually defined regions.
    • Procedural, not inferential. The logic was rigid and hand-crafted, not derived from patterns in the examples.
    • Demonstration-agnostic. There is no sign the model meaningfully incorporated the few-shot examples. It likely defaulted to a familiar pattern: spatial progression using layers.

    This isn’t surprising. Text-only LLMs have no visual grounding. They tokenize the grid as symbolic input: rows of digits, not enclosed figures or nested patterns. As a result, their inductive biases lean toward symbolic or graph-like algorithms, not perceptual abstractions.

    In this case, the model fell into a common trap: producing something plausible-sounding but wrong. It produced a spatial propagation scheme that might work for a diffusion task, but not the one at hand. This highlights a key weakness of text-based few-shot prompting for abstract visual reasoning: the model’s “reasoning” is disconnected from perceptual understanding. It invents algorithms based on internal priors, not external cues.

    When Reasoning Fails: Image Models Aren’t Magic Either

    To improve generalization, I transitioned from purely text-based reasoning to image-based prompting, leveraging GPT-4.1’s multimodal capabilities via LangChain. This setup encoded input-output grid examples as base64 images, which were presented alongside a natural-language prompt describing the task.

    from langchain_core.messages import HumanMessage

    import io
    import base64
    import numpy as np  # used to normalize inputs to arrays below

    AGENT_MODEL = "gpt-4.1"
    AGENT_LLM = ChatOpenAI(model=AGENT_MODEL)  # re-instantiate the client with the multimodal model

    # Prompt for image-based reasoning
    PROMPT_REASON_IMG = """You are an expert at solving abstract reasoning tasks.

    These are unique reasoning tasks with limited examples. You are given demonstration input-output 2D grids.
    The colormap used is as follows:

    {{
        'Violet': 0,
        'Indigo': 1,
        'Blue': 2,
        'Yellow': 3,
        'Green': 4,
        'Orange': 5,
        'Red': 6,
        'Golden': 7,
        'Black': 8,
        'White': 9
    }}

    Your goal is to develop a single algorithm that maps each input grid to its corresponding output grid.

    A successful solution involves:
    - Analyzing the demonstration examples carefully
    - Identifying underlying visual or spatial patterns
    - Formulating a general transformation rule
    - Translating this rule into clear pseudocode

    If this is not your first attempt, a summary of previous feedback will be provided. Review it carefully and incorporate it to improve your solution.

    Test your algorithm against the demonstrations to make sure it works.

    Return **only the algorithm pseudocode**, formatted as plain text. Do not explain it or add extra commentary.
    """

    # If your array is 10x10 and you want each cell to be 20x20 pixels (cell_px), the image will be 200x200 pixels.
    # Convert a matrix into an image
    def visualize_grid_fig(matrix, cmap=cmap, norm=norm, cell_px=20, show=False):
        if type(matrix) is not np.ndarray:
            matrix = np.array(matrix)
        h, w = matrix.shape[:2]
        figsize = (w * cell_px / 100, h * cell_px / 100)  # inches
        fig, ax = plt.subplots(figsize=figsize)
        ax.imshow(matrix, cmap=cmap, norm=norm)
        ax.axis('off')
        if show:
            plt.show()
        else:
            plt.close(fig)
        return fig

    # Encode the image for the model
    def fig_to_base64(fig, dpi=100):
        buf = io.BytesIO()
        fig.savefig(buf, format='png', dpi=dpi, bbox_inches='tight')
        buf.seek(0)
        img_base64 = base64.b64encode(buf.read()).decode('utf-8')
        buf.close()
        return img_base64

    # In the loop, replace the reasoning step with this:
    # reasoning with images
    reason_message = [{"type": "text", "text": PROMPT_REASON_IMG}]
    for i, example in enumerate(train_pairs):
        fig_in = visualize_grid_fig(example['input'])
        fig_out = visualize_grid_fig(example['output'])
        fig_in = fig_to_base64(fig_in)
        fig_out = fig_to_base64(fig_out)
        reason_message.append({"type": "text", "text": f"### Input grid {i+1}:"})
        reason_message.append({"type": "image_url", "image_url": {"url": f"data:image/png;base64,{fig_in}"}})
        reason_message.append({"type": "text", "text": f"### Output grid {i+1}:"})
        reason_message.append({"type": "image_url", "image_url": {"url": f"data:image/png;base64,{fig_out}"}})
    reason_message.append({"type": "text", "text": f"### Feedback (summary of previous attempts): {attempt_history_summary}"})
    reason_message = HumanMessage(content=reason_message)
    res = AGENT_LLM.invoke([reason_message])
    reasoning_template = res.content

    The resulting pseudocode marked a clear step forward in expressiveness. The model was able to:

    • Detect blue-bordered squares using visual features rather than purely symbolic structure;
    • Apply rules based on square size and nesting depth to infer interior fill colors;
    • Sort identified squares by size before filling, effectively avoiding overwrite conflicts.

    Here’s the generated pseudocode:

    1. Let grid be the input 2D array.
    2. Create output_grid as a copy of grid.

    3. Identify all blue-bordered squares in the grid:
       a. For each possible top-left corner (i, j):
          i. For each possible square size s (s ≥ 3, up to min(grid height, grid width)):
             - Check if the square of size s starting at (i, j) is fully within bounds.
             - Check if all *border* cells of this square are Blue (value = 2).
             - Check that the *interior* cells (not on the border) do not contain any Blue (2).
             - If all conditions are met, record the square as (i, j, s).

    4. Sort the list of detected blue-bordered squares by size in ascending order (smallest first).

    5. For each detected square (i, j, s), in sorted order:
       a. Determine the fill color:
          - If the square is the smallest (no other blue-bordered square is fully inside it), set fill color = Black (8).
          - If the square is the largest (no other blue-bordered square fully contains it), fill color =
            - If there are exactly 2 blue-bordered squares, set fill color = Green (4).
            - If there are three blue-bordered squares in the grid, fill color = Yellow (3).
          - If the square is nested (not smallest or largest), fill color = Black (8).
          - (More complex rules may generalize beyond these based on demonstrations.)

       b. Fill the interior of the square:
          For each cell (x, y) strictly inside the square (i+1 ≤ x < i+s−1) and (j+1 ≤ y < j+s−1):
             - If output_grid[x][y] is not Blue (2), set it to the chosen fill color.

    6. Return output_grid.

    Special notes:
      - Never overwrite Blue (2) border pixels.
      - When filling, later (larger) squares overwrite earlier (smaller) fills in overlapping regions.
      - Only process valid blue-bordered squares (minimum size 3x3, full border).
      - If there are multiple disjoint blue-bordered squares, handle each independently for fill color assignment as above, matching the demonstration logic.

    The model clearly shows structured reasoning. It develops an internal representation of nested geometric forms and attempts to apply rule-based transformations derived from the examples.
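
    To make the detection step concrete, here is a short Python sketch of step 3 of the pseudocode (my own rendering, not model output, assuming grids are lists of lists of ints):

    # Find all squares whose border is entirely Blue (2) and whose
    # interior contains no Blue, returned smallest-first (steps 3-4)
    def find_blue_squares(grid):
        h, w = len(grid), len(grid[0])
        squares = []
        for i in range(h):
            for j in range(w):
                for s in range(3, min(h - i, w - j) + 1):
                    border = (
                        [grid[i][j + t] for t in range(s)] +          # top edge
                        [grid[i + s - 1][j + t] for t in range(s)] +  # bottom edge
                        [grid[i + t][j] for t in range(s)] +          # left edge
                        [grid[i + t][j + s - 1] for t in range(s)]    # right edge
                    )
                    interior = [grid[x][y]
                                for x in range(i + 1, i + s - 1)
                                for y in range(j + 1, j + s - 1)]
                    if all(v == 2 for v in border) and all(v != 2 for v in interior):
                        squares.append((i, j, s))
        return sorted(squares, key=lambda sq: sq[2])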

    Test output grid from the image-based reasoning model’s pseudocode (image by author).

    However, despite these advances, the model still fails to generalize reliably. In novel configurations, it mis-assigns fill colors, falling back on brittle heuristics like size-based precedence or rigid nesting assumptions. For instance, it might assume the largest square is always filled with yellow, even when that logic no longer holds in a new context. This failure reveals a deeper limitation: even with image input, the model doesn’t “see” in the human sense. It doesn’t build a holistic perceptual representation of spatial relationships. Instead, it converts the image into symbolic patterns and applies deterministic procedures like flood-fill, sorting, or positional indexing.

    In practice, this means the model reasons from internal abstractions, not perceptual grounding. It infers that “smaller squares get black” or “fill based on size rank” without fully understanding why those assignments occurred in the demonstrations. As a result, any deviation from the expected layout can cause it to misfire.

    This suggests that while multimodal prompting extends the expressive range of the model, it doesn’t yet provide the kind of flexible, generalizable visual reasoning that humans display. These tasks may ultimately require stronger forms of program induction, meta-learning, or hybrid systems that integrate perceptual grouping with learned rules.

    Conclusions

    In this study, I explored whether large language models, both text-based and multimodal, can perform meta-learning from examples on abstract reasoning tasks. Specifically, I focused on a class of problems from the ARC dataset, where solutions require identifying visual patterns, learning transformations, and generalizing them to novel test inputs.

    Through direct prompting experiments, I found that:

    • Text-based models (e.g., o3-mini) often hallucinate plausible algorithms that are topologically or procedurally sound but completely disconnected from the task’s visual logic. These models rely on symbolic reasoning over tokenized grids and default to familiar heuristics like flood-fill, ring propagation, or rule-based updates, regardless of the examples provided.
    • Multimodal models (e.g., GPT-4.1 with vision) showed a clear improvement in pattern detection and relational reasoning. They successfully identified blue-bordered regions and adapted behaviors based on relative size or nesting. However, their generalization remained fragile: they still applied brittle rules, such as fixed size-based assignments, and failed in novel layouts that deviated from the demonstrations.

    These findings suggest that, even with visual input, current LLMs don’t “see” as humans do. They process images symbolically, not perceptually. Their reasoning is driven by internally constructed rules, not a flexible, visual understanding of shapes, hierarchy, or affordance.

    The limitations I observed reinforce a central tension: few-shot prompting alone, even with images, is not sufficient for robust abstraction. True generalization likely requires:

    • Program induction: inferring reusable, structured transformations from examples;
    • Perceptual grounding: developing architectures that parse and manipulate visual scenes compositionally;
    • Meta-learning architectures: building models that adapt their reasoning strategies dynamically rather than applying pre-learned heuristics.

    Today’s LLMs are astonishing in their breadth, but they are still guessing from priors, not learning to learn in the human sense. They lack a strong inductive bias for abstraction and transformation. ARC-style tasks expose this gap clearly: success requires more than pattern recognition; it requires reasoning from examples in a structured, compositional way. These results aren’t discouraging; rather, they’re clarifying. We now know where the ceiling is. And the next generation of models, those with hybrid architectures, persistent memory, and explicit meta-learning capabilities, might finally break through it.


