
    LLMs + Pandas: How I Use Generative AI to Generate Pandas DataFrame Summaries

    By Editor Times Featured · June 3, 2025


    If you work with datasets and are looking for quick insights without too much manual grind, you've come to the right place.

    In 2025, datasets often contain millions of rows and hundreds of columns, which makes manual analysis next to impossible. Local Large Language Models can turn your raw DataFrame statistics into polished, readable reports in seconds, minutes at worst. This approach eliminates the tedious process of analyzing data by hand and writing executive reports, especially if the data structure doesn't change.

    Pandas handles the heavy lifting of data extraction while LLMs convert your technical outputs into presentable reports. You'll still need to write functions that pull key statistics out of your datasets, but it's a one-time effort.

    This guide assumes you have Ollama installed locally. If you don't, you can still use third-party LLM vendors, but I won't explain how to connect to their APIs.
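
    Before going any further, it's worth confirming that the local Ollama server is reachable. Here's a minimal sketch, assuming Ollama is running on its default port 11434 (the same address used by get_llm() later in this guide):

    import urllib.request

    # Ping the local Ollama server on its default port
    try:
        with urllib.request.urlopen("http://localhost:11434", timeout=5) as response:
            print(f"Ollama responded with HTTP {response.status}")
    except OSError as err:
        print(f"Could not reach the local Ollama server: {err}")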

    Table of contents:

    • Dataset Introduction and Exploration
    • The Boring Part: Extracting Summary Statistics
    • The Cool Part: Working with LLMs
    • What You Could Improve

    Dataset Introduction and Exploration

    For this guide, I'm using the MBA admissions dataset from Kaggle. Download it if you want to follow along.

    The dataset is licensed under the Apache 2.0 license, which means you can use it freely for both personal and commercial projects.

    To get started, you'll need a few Python libraries installed on your system.

    Image 1 – Required Python libraries and versions (image by author)
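
    The exact versions from the image aren't reproduced here; if you want to check what's installed on your machine, a quick sketch like this works (package names assumed from the imports below):

    import importlib.metadata

    # Print locally installed versions of the assumed dependencies
    for package in ("pandas", "langchain-ollama"):
        print(package, importlib.metadata.version(package))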

    Once you have everything installed, import the necessary libraries in a new script or notebook:

    import pandas as pd
    from langchain_ollama import ChatOllama
    from typing import Literal

    Dataset loading and preprocessing

    Start by loading the dataset with Pandas. This snippet loads the CSV file, prints basic information about the dataset shape, and shows how many missing values exist in each column:

    df = pd.read_csv("knowledge/MBA.csv")
    
    # Fundamental dataset information
    print(f"Dataset form: {df.form}n")
    print("Lacking worth stats:")
    print(df.isnull().sum())
    print("-" * 25)
    df.pattern(5)
    Picture 2 – Fundamental dataset statistics (picture by writer)

    Since data cleaning isn't the main focus of this article, I'll keep the preprocessing minimal. The dataset only has a couple of missing values that need attention:

    df["race"] = df["race"].fillna("Unknown")
    df["admission"] = df["admission"].fillna("Deny")

    That's it! Let's see how to go from this to a meaningful report next.

    The Boring Part: Extracting Summary Statistics

    Even with all the advances in AI capability and availability, you probably don't want to send your entire dataset to an LLM provider. There are a couple of good reasons why.

    It would consume way too many tokens, which translates directly to higher costs. Processing large datasets can take a long time, especially when you're running models locally on your own hardware. You might also be dealing with sensitive data that shouldn't leave your organization.

    Some manual work is still the way to go.

    This approach requires you to write a function that extracts key elements and statistics from your Pandas DataFrame. You'll have to write this function from scratch for different datasets, but the core idea transfers easily between projects.

    The get_summary_context_message() function takes in a DataFrame and returns a formatted multi-line string with a detailed summary. Here's what it includes:

    • Total application count and gender distribution
    • International vs domestic applicant breakdown
    • GPA and GMAT score quartile statistics
    • Admission rates by academic major (sorted by rate)
    • Admission rates by work industry (top 8 industries)
    • Work experience analysis with categorical breakdowns
    • Key insights highlighting top-performing categories

    Here's the complete source code for the function:

    def get_summary_context_message(df: pd.DataFrame) -> str:
        """
        Generate a comprehensive summary report of MBA admissions dataset statistics.

        This function analyzes MBA application data to provide detailed statistics on
        applicant demographics, academic performance, professional backgrounds, and
        admission rates across various categories. The summary includes gender and
        international status distributions, GPA and GMAT score statistics, admission
        rates by academic major and work industry, and work experience impact analysis.

        Parameters
        ----------
        df : pd.DataFrame
            DataFrame containing MBA admissions data with the following expected columns:
            - 'gender', 'international', 'gpa', 'gmat', 'major', 'work_industry', 'work_exp', 'admission'

        Returns
        -------
        str
            A formatted multi-line string containing comprehensive MBA admissions
            statistics.
        """
        # Basic application statistics
        total_applications = len(df)

        # Gender distribution
        gender_counts = df["gender"].value_counts()
        male_count = gender_counts.get("Male", 0)
        female_count = gender_counts.get("Female", 0)

        # International status
        international_count = (
            df["international"].sum()
            if df["international"].dtype == bool
            else (df["international"] == True).sum()
        )

        # GPA statistics
        gpa_data = df["gpa"].dropna()
        gpa_avg = gpa_data.mean()
        gpa_25th = gpa_data.quantile(0.25)
        gpa_50th = gpa_data.quantile(0.50)
        gpa_75th = gpa_data.quantile(0.75)

        # GMAT statistics
        gmat_data = df["gmat"].dropna()
        gmat_avg = gmat_data.mean()
        gmat_25th = gmat_data.quantile(0.25)
        gmat_50th = gmat_data.quantile(0.50)
        gmat_75th = gmat_data.quantile(0.75)

        # Major analysis - admission rates by major
        major_stats = []
        for major in df["major"].unique():
            major_data = df[df["major"] == major]
            admitted = len(major_data[major_data["admission"] == "Admit"])
            total = len(major_data)
            rate = (admitted / total) * 100
            major_stats.append((major, admitted, total, rate))

        # Sort by admission rate (descending)
        major_stats.sort(key=lambda x: x[3], reverse=True)

        # Work industry analysis - admission rates by industry
        industry_stats = []
        for industry in df["work_industry"].unique():
            if pd.isna(industry):
                continue
            industry_data = df[df["work_industry"] == industry]
            admitted = len(industry_data[industry_data["admission"] == "Admit"])
            total = len(industry_data)
            rate = (admitted / total) * 100
            industry_stats.append((industry, admitted, total, rate))

        # Sort by admission rate (descending)
        industry_stats.sort(key=lambda x: x[3], reverse=True)

        # Work experience analysis
        work_exp_data = df["work_exp"].dropna()
        avg_work_exp_all = work_exp_data.mean()

        # Work experience for admitted students
        admitted_students = df[df["admission"] == "Admit"]
        admitted_work_exp = admitted_students["work_exp"].dropna()
        avg_work_exp_admitted = admitted_work_exp.mean()

        # Work experience ranges analysis
        def categorize_work_exp(exp):
            if pd.isna(exp):
                return "Unknown"
            elif exp < 2:
                return "0-1 years"
            elif exp < 4:
                return "2-3 years"
            elif exp < 6:
                return "4-5 years"
            elif exp < 8:
                return "6-7 years"
            else:
                return "8+ years"

        df["work_exp_category"] = df["work_exp"].apply(categorize_work_exp)
        work_exp_category_stats = []

        for category in ["0-1 years", "2-3 years", "4-5 years", "6-7 years", "8+ years"]:
            category_data = df[df["work_exp_category"] == category]
            if len(category_data) > 0:
                admitted = len(category_data[category_data["admission"] == "Admit"])
                total = len(category_data)
                rate = (admitted / total) * 100
                work_exp_category_stats.append((category, admitted, total, rate))

        # Build the summary message
        summary = f"""MBA Admissions Dataset Summary (2025)

    Total Applications: {total_applications:,} people applied to the MBA program.

    Gender Distribution:
    - Male applicants: {male_count:,} ({male_count/total_applications*100:.1f}%)
    - Female applicants: {female_count:,} ({female_count/total_applications*100:.1f}%)

    International Status:
    - International applicants: {international_count:,} ({international_count/total_applications*100:.1f}%)
    - Domestic applicants: {total_applications-international_count:,} ({(total_applications-international_count)/total_applications*100:.1f}%)

    Academic Performance Statistics:

    GPA Statistics:
    - Average GPA: {gpa_avg:.2f}
    - 25th percentile: {gpa_25th:.2f}
    - 50th percentile (median): {gpa_50th:.2f}
    - 75th percentile: {gpa_75th:.2f}

    GMAT Statistics:
    - Average GMAT: {gmat_avg:.0f}
    - 25th percentile: {gmat_25th:.0f}
    - 50th percentile (median): {gmat_50th:.0f}
    - 75th percentile: {gmat_75th:.0f}

    Major Analysis - Admission Rates by Academic Background:"""

        for major, admitted, total, rate in major_stats:
            summary += (
                f"\n- {major}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
            )

        summary += (
            "\n\nWork Industry Analysis - Admission Rates by Professional Background:"
        )

        # Show top 8 industries by admission rate
        for industry, admitted, total, rate in industry_stats[:8]:
            summary += (
                f"\n- {industry}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
            )

        summary += "\n\nWork Experience Impact on Admissions:\n\nOverall Work Experience Comparison:"
        summary += (
            f"\n- Average work experience (all applicants): {avg_work_exp_all:.1f} years"
        )
        summary += f"\n- Average work experience (admitted students): {avg_work_exp_admitted:.1f} years"

        summary += "\n\nAdmission Rates by Work Experience Range:"
        for category, admitted, total, rate in work_exp_category_stats:
            summary += (
                f"\n- {category}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
            )

        # Key insights
        best_major = major_stats[0]
        best_industry = industry_stats[0]

        summary += "\n\nKey Insights:"
        summary += (
            f"\n- Highest admission rate by major: {best_major[0]} at {best_major[3]:.1f}%"
        )
        summary += f"\n- Highest admission rate by industry: {best_industry[0]} at {best_industry[3]:.1f}%"

        if avg_work_exp_admitted > avg_work_exp_all:
            summary += f"\n- Admitted students have slightly more work experience on average ({avg_work_exp_admitted:.1f} vs {avg_work_exp_all:.1f} years)"
        else:
            summary += "\n- Work experience shows minimal difference between admitted and all applicants"

        return summary

    Once you've defined the function, simply call it and print the results:

    print(get_summary_context_message(df))
    Image 3 – Extracted findings and statistics from the dataset (image by author)

    Now let's move on to the fun part.

    The Cool Part: Working with LLMs

    This is where things get interesting and your manual data extraction work pays off.

    Python helper function for working with LLMs

    If you have decent hardware, I strongly recommend using local LLMs for simple tasks like this. I use Ollama and the latest version of the Mistral model for the actual LLM processing.

    Image 4 – Available Ollama models (image by author)

    If you want to use something like ChatGPT through the OpenAI API, you can still do that. You'll just need to modify the function below to set up your API key and return the appropriate instance from LangChain (a sketch follows after the test call).

    Whichever option you choose, a call to get_llm() with a test message shouldn't return an error:

    def get_llm(model_name: str = "mistral:latest") -> ChatOllama:
        """
        Create and configure a ChatOllama instance for local LLM inference.

        This function initializes a ChatOllama client configured to connect to a
        local Ollama server. The client is set up with deterministic output
        (temperature=0) for consistent responses across multiple calls with the
        same input.

        Parameters
        ----------
        model_name : str, optional
            The name of the Ollama model to use for chat completions.
            Must be a valid model name that is available on the local Ollama
            installation. Default is "mistral:latest".

        Returns
        -------
        ChatOllama
            A configured ChatOllama instance ready for chat completions.
        """
        return ChatOllama(
            model=model_name, base_url="http://localhost:11434", temperature=0
        )


    print(get_llm().invoke("test").content)
    Image 5 – LLM test message (image by author)
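
    If you'd rather use OpenAI instead of Ollama, a minimal sketch of an equivalent helper could look like this. It assumes the langchain-openai package is installed, the OPENAI_API_KEY environment variable is set, and that you have access to the model named below; it isn't part of the original pipeline:

    from langchain_openai import ChatOpenAI


    def get_openai_llm(model_name: str = "gpt-4o-mini") -> ChatOpenAI:
        """Return a deterministic ChatOpenAI client that mirrors get_llm()."""
        # The API key is read from the OPENAI_API_KEY environment variable
        return ChatOpenAI(model=model_name, temperature=0)


    print(get_openai_llm().invoke("test").content)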

    Summarization prompt

    This is where you can get creative and write ultra-specific instructions for your LLM. I've decided to keep things light for demonstration purposes, but feel free to experiment here.

    There is no single right or wrong prompt.

    Whatever you do, make sure to include the format arguments using curly brackets – these values will be filled in dynamically later:

    SUMMARIZE_DATAFRAME_PROMPT = """
    You are an expert data analyst and data summarizer. Your task is to take in complex datasets
    and return user-friendly descriptions and findings.

    You are given this dataset:
    - Name: {dataset_name}
    - Source: {dataset_source}

    This dataset was analyzed in a pipeline before it was given to you.
    These are the findings returned by the analysis pipeline:


    {context}


    Based on these findings, write a detailed report in {report_format} format.
    Give the report a meaningful title and separate findings into sections with headings and subheadings.
    Output only the report in {report_format} and nothing else.

    Report:
    """

    Summarization Python function

    With the prompt and the get_llm() function declared, the only thing left is to connect the dots. The get_report_summary() function takes in arguments that fill the format placeholders in the prompt, then invokes the LLM with that prompt to generate a report.

    You can choose between Markdown and HTML formats:

    def get_report_summary(
        dataset: pd.DataFrame,
        dataset_name: str,
        dataset_source: str,
        report_format: Literal["markdown", "html"] = "markdown",
    ) -> str:
        """
        Generate an AI-powered summary report from a pandas DataFrame.

        This function analyzes a dataset and generates a comprehensive summary report
        using a large language model (LLM). It first extracts statistical context
        from the dataset, then uses an LLM to create a human-readable report in the
        specified format.

        Parameters
        ----------
        dataset : pd.DataFrame
            The pandas DataFrame to analyze and summarize.
        dataset_name : str
            A descriptive name for the dataset that will be included in the
            generated report for context and identification.
        dataset_source : str
            Information about the source or origin of the dataset.
        report_format : {"markdown", "html"}, optional
            The desired output format for the generated report. Options are:
            - "markdown" : Generate report in Markdown format (default)
            - "html" : Generate report in HTML format

        Returns
        -------
        str
            A formatted summary report.

        """
        context_message = get_summary_context_message(df=dataset)
        prompt = SUMMARIZE_DATAFRAME_PROMPT.format(
            dataset_name=dataset_name,
            dataset_source=dataset_source,
            context=context_message,
            report_format=report_format,
        )
        return get_llm().invoke(input=prompt).content

    Using the function is straightforward – just pass in the dataset, its name, and source. The report format defaults to Markdown:

    md_report = get_report_summary(
        dataset=df, 
        dataset_name="MBA Admissions (2025)",
        dataset_source="https://www.kaggle.com/datasets/taweilo/mba-admission-dataset"
    )
    print(md_report)
    Image 6 – Final report in Markdown format (image by author)

    The HTML report is just as detailed, but could use some styling. Maybe you could ask the LLM to handle that as well! Generating it only requires a different report_format argument, as shown below.
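
    This call isn't shown in the original article, but it's the straightforward counterpart of the Markdown one:

    html_report = get_report_summary(
        dataset=df,
        dataset_name="MBA Admissions (2025)",
        dataset_source="https://www.kaggle.com/datasets/taweilo/mba-admission-dataset",
        report_format="html",
    )
    print(html_report)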

    Image 7 – Final report in HTML format (image by author)

    What You Could Improve

    I could have easily turned this into a 30-minute read by optimizing every element of the pipeline, but I kept it simple for demonstration purposes. You don't have to (and shouldn't) stop here though.

    Here are the things you can improve to make this pipeline even more powerful:

    • Write a function that saves the report (Markdown or HTML) directly to disk (a sketch follows after this list). This way you can automate the entire process and generate reports on a schedule without manual intervention.
    • In the prompt, ask the LLM to add CSS styling to the HTML report to make it look more presentable. You could even provide your company's brand colors and fonts to maintain consistency across all your data reports.
    • Expand the prompt to follow more specific instructions. You might want reports that focus on specific business metrics, follow a particular template, or include recommendations based on the findings.
    • Expand the get_llm() function so it can connect both to Ollama and to other vendors like OpenAI, Anthropic, or Google. This gives you flexibility to switch between local and cloud-based models depending on your needs.
    • Do literally anything in the get_summary_context_message() function, since it serves as the foundation for all context data provided to the LLM. This is where you can get creative with feature engineering, statistical analysis, and data insights that matter to your specific use case.
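
    As a sketch of the first point, a small save helper might look like this (the file naming and extension mapping are assumptions, not part of the original pipeline):

    from pathlib import Path


    def save_report(report: str, file_name: str, report_format: str = "markdown") -> Path:
        """Write a generated report to disk and return its path."""
        extension = "md" if report_format == "markdown" else "html"
        path = Path(f"{file_name}.{extension}")
        path.write_text(report, encoding="utf-8")
        return path


    save_report(md_report, "mba_admissions_report")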

    I hope this minimal example has set you on the right track to automate your own data reporting workflows.



