Close Menu
    Facebook LinkedIn YouTube WhatsApp X (Twitter) Pinterest
    Trending
    • The Latest Push to Extend Key US Spy Powers Is Still a Mess
    • Anthropic says Google is committing to invest $10B now in cash at a $350B valuation and will invest another $30B if Anthropic hits performance targets (Bloomberg)
    • ‘Apex’ Review: Charlize Theron Netflix Thriller Avoids Rock Bottom, but Barely
    • I Built an AI Pipeline for Kindle Highlights
    • Mud-covered shipping container restaurant solves poor insulation problem in India
    • 1stCRWD CEO Stefania Ditrani Seychell joins the EU-Startups Summit 2026 speaker lineup
    • Dyson PencilVac Review (2026): Limited but Handy
    • Meta announces a deal to use “tens of millions” of Amazon’s Graviton chips to help deliver its next generation of AI models, amid a shortage of Nvidia chips (Ina Fried/Axios)
    Facebook LinkedIn WhatsApp
    Times FeaturedTimes Featured
    Friday, April 24
    • Home
    • Founders
    • Startups
    • Technology
    • Profiles
    • Entrepreneurs
    • Leaders
    • Students
    • VC Funds
    • More
      • AI
      • Robotics
      • Industries
      • Global
    Times FeaturedTimes Featured
    Home»Artificial Intelligence»I Built an AI Pipeline for Kindle Highlights
    Artificial Intelligence

    I Built an AI Pipeline for Kindle Highlights

    Editor Times FeaturedBy Editor Times FeaturedApril 24, 2026No Comments15 Mins Read
    Facebook Twitter Pinterest Telegram LinkedIn Tumblr WhatsApp Email
    Share
    Facebook Twitter LinkedIn Pinterest Telegram Email WhatsApp Copy Link


    I learn, I like to focus on stuff (I exploit a Kindle). I really feel like by studying I don’t get to retain greater than 10% of the data I eat however it’s via re-reading the highlights or summarizing the e book utilizing them is what makes me really perceive what I learn.

    The issue is that, generally, I find yourself highlighting so much.

    And by so much I imply A LOT. We can’t even name them “key notes.”

    So in these instances, after studying the e book, I find yourself both losing loads of time summarizing or simply stop doing it (the latter is the extra frequent).

    I not too long ago learn a e book that I loved quite a bit and want to totally retain what struck me probably the most. However, once more, it was a kind of books I over-highlighted.

    And I didn’t wish to use loads of my scarce free time on it. So I made a decision to automate the method and use my tech/information expertise. As a result of I’m proud of the end result, I believed I’d share it so anybody can benefit from this software as nicely.

    Disclaimer: my Kindle is kind of previous so this could work on new ones as nicely. The truth is, there’s a barely higher method for brand spanking new Kindle variations (defined additionally on this put up).


    The Challenge

    Let’s outline the aim: generate a abstract from our Kindle highlights.

    As I thought of it, I imagined the next easy pipeline for a single e book:

    • Get the e book highlights
    • Create a RAG or one thing comparable
    • Export the abstract

    The result’s totally different on the primary half, however all as a result of preprocessing wanted taking into consideration how the information is structured.

    So I’ll construction this put up into two foremost sections:

    • Knowledge retrieval and processing
    • AI mannequin and output

    1. Knowledge Retrieval and Processing

    My instinct instructed me there was a strategy to extract highlights from my Kindle. In the long run, they’re saved there, so I simply want a strategy to get them out.

    There are a number of methods to do it however I needed an method that labored with each books purchased on the official Kindle retailer but in addition PDFs or information I despatched from my laptop computer.

    And I additionally determined I wouldn’t use any current software program to extract the information. Simply my book and my laptop computer (and a USB connecting them each).

    Fortunately for us, no jailbreak is required and there are two methods of doing so relying in your Kindle model:

    1. All Kindles (presumably) have a file within the paperwork folder named My Clippings.txt. It actually accommodates any clipping you’ve made at any level and any e book.
    2. New Kindles even have a SQLite file within the system listing named annotations.db. This has your highlights in a extra structured means.

    On this put up I’ll be utilizing methodology 1 (My Clippings.txt) primarily as a result of my Kindle doesn’t have the annotations.db database. However if you happen to’re fortunate sufficient to have the DB, use it because it’ll be extra easy and better high quality (a lot of the preprocessing we’ll be seeing subsequent received’t in all probability be wanted).

    So getting the clippings is as straightforward as studying the TXT. Listed below are some key facets and issues I encountered utilizing this methodology:

    • All books are on the identical file.
    • I’m unsure in regards to the precise “clipping” definition on Amazon’s facet however the way in which I’ve seen it’s: something you spotlight at any level. Even if you happen to delete it or broaden it, the unique will stay within the TXT. I assume that is like that as a result of, certainly, we’re working with a TXT file and it’s very laborious to delete stuff that’s not listed in any means.
    • There’s a restrict to clipping: I’m not conscious of the precise threshold however as soon as we cross it, we are able to’t retrieve any extra clippings. That is carried out as a result of somebody might in any other case spotlight the complete e book, extract it and share it illegally.

    And that is the anatomy of a clipping:

    ==========
    E-book Title (Creator Title)
    - Your Spotlight on web page 145 | Location 2212-2212 | Added on Sunday, August 30, 2020 11:25:29 PM
    
    transparency drawback leads to the identical place as
    ==========

    So step one is parsing the highlights, and that is the place we begin seeing Python code:

    def parse_clippings(file_path):
    
        uncooked = Path(file_path).read_text(encoding="utf-8")
        entries = uncooked.cut up("==========")
    
        highlights = []
    
        for entry in entries:
    
            traces = [l.strip() for l in entry.strip().split("n") if l.strip()]
    
            if len(traces) < 3:
                proceed
    
            e book = traces[0]
    
            if "Spotlight" not in traces[1]:
                proceed
    
            location_match = re.search(r"Location (d+)", traces[1])
            if not location_match:
                proceed
    
            location = int(location_match.group(1))
    
            textual content = " ".be part of(traces[2:]).strip()
    
            highlights.append(
                {
                    "e book": e book,
                    "location": location,
                    "textual content": textual content
                }
            )
    
        return highlights

    Given the trail of the clippings file, all this perform does is cut up the textual content into the totally different entries after which loop via them. For every entry, it extracts the title title, the placement and the highlighted textual content.

    This last construction (an inventory of dictionaries) makes it straightforward to filter by e book:

    [
        h for h in highlights
        if book_name.lower() in h["book"].decrease()
    ]

    As soon as filtered, we should order the highlights. Since clippings are appended to the TXT file, the order relies on once we spotlight, not on the textual content’s location.

    And I personally need my outcomes to look as they do within the e book, so ordering is critical:

    sorted(highlights, key=lambda x: x["location"])

    Now, if you happen to examine your clippings file, you may discover duplicated clippings (or duplicated subclippings). This occurs as a result of any time you edit a spotlight (that you just’ve failed to incorporate all of the phrases you geared toward, for instance), it’s accounted as a brand new one. So there can be two very comparable clippings within the TXT. Or much more if you happen to edit it loads of instances.

    We have to deal with this by making use of a deduplication someway. It’s simpler than anticipated:

    def deduplicate(highlights):
    
        clear = []
        for h in highlights:
    
            textual content = h["text"]
            duplicate = False
    
            for c in clear:
    
                if textual content == c["text"]:
                    duplicate = True
                    break
                if textual content in c["text"]:
                    duplicate = True
                    break
                if c["text"] in textual content:
                    c["text"] = textual content
                    duplicate = True
                    break
    
            if not duplicate:
                clear.append(h)
    
        return clear

    It’s quite simple and could possibly be perfected, however we principally examine if there are consecutive clippings with the identical textual content (or a part of it) and preserve the longest.

    Proper now we now have the e book highlights correctly sorted, and we might cease the pre-processing right here. However I can’t do this. I like to focus on titles each time as a result of, when summarizing, I get to correctly assign a piece to every spotlight.

    However our code isn’t capable of differ between an actual spotlight and a piece title… But. See under:

    def is_probable_title(textual content):
    
        textual content = textual content.strip()
        if len(textual content) > 120:
            return False
        if textual content.endswith("."):
            return False
    
        phrases = textual content.cut up()
        if len(phrases) > 12:
            return False
    
        # chapter fashion prefix
        if has_chapter_prefix(textual content):
            return True
    
        # capitalization ratio
        capitalized = sum(
            1 for w in phrases if w[0].isupper()
        )
    
        cap_ratio = capitalized / len(phrases)
    
        # stopword ratio
        stopword_count = sum(
            1 for w in phrases if w.decrease() in STOPWORDS
        )
    
        stop_ratio = stopword_count / len(phrases)
    
        rating = 0
        if cap_ratio > 0.6:
            rating += 1
        if stop_ratio < 0.3:
            rating += 1
        if len(phrases) <= 6:
            rating += 1
    
        return rating >= 2

    It may appear fairly arbitrary, and it’s not the most effective answer to this drawback, however it does work fairly nicely. It makes use of a heuristic based mostly on capitalization, size, stopwords and prefixes.

    This perform known as inside a loop via all of the highlights, as we’ve seen in earlier capabilities, to examine if a spotlight is a title or not. The result’s a “sections” checklist of dictionaries the place the dictionary has two keys:

    • Title: the part title.
    • Highlights: the part’s highlights.

    Proper now, sure, we’re able to summarize.

    AI Mannequin and Output

    I needed this to be a free venture, so we want an AI mannequin that’s open supply.

    I believed that Ollama [1] was among the best choices to run a venture like this (not less than regionally). Plus, our information all the time stay ours with them and we are able to run the fashions offline.

    As soon as put in, the code was straightforward. I’m not a immediate engineer so anybody with the know-how would get even higher outcomes, however that is what works for me:

    def summarize_with_ollama(textual content, mannequin):
    
        immediate = f"""
        You're summarizing a e book from reader highlights.
    
        Produce a structured abstract with:
    
        - Fundamental thesis
        - Temporary abstract
        - Key concepts
        - Necessary ideas
        - Sensible takeaways
    
        Highlights:
    
        {textual content}
        """
    
        end result = subprocess.run(
            ["ollama", "run", model],
            enter=immediate,
            textual content=True,
            capture_output=True
        )
    
        return end result.stdout

    Easy, I do know. Nevertheless it works partly as a result of the information preprocessing has been intense but in addition as a result of we’re already leveraging the fashions constructed on the market.

    However what can we do with the abstract? I like utilizing Obsidian [2] so exporting a Markdown file is what makes extra sense. Right here you have got it:

    def export_markdown(e book, sections, abstract, output):
    
        md = f"# {e book}nn"
        for part in sections:
            md += f"## {part['title']}nn"
            for h in part["highlights"]:
                md += f"- {h}n"
            md += "n"
    
        md += "n---nn"
        md += "## E-book Summarynn"
        md += abstract
    
        output_path = Path(output)
        output_path.guardian.mkdir(dad and mom=True, exist_ok=True)
        output_path.write_text(md, encoding="utf-8")
        print(f"nSaved to {output_path}")

    Et voilà.

    And that is how I am going from highlights to a full Markdown abstract (straight to Obsidian if I wish to) with lower than 300 traces of Python code!

    Full Code and Check

    Right here’s the complete code, simply in case you wish to copy-paste it. It accommodates what we’ve seen plus some helper capabilities and argument parsing:

    import re
    import argparse
    from pathlib import Path
    import subprocess
    
    
    # ---------- PARSE CLIPPINGS ----------
    
    def parse_clippings(file_path):
    
        uncooked = Path(file_path).read_text(encoding="utf-8")
        entries = uncooked.cut up("==========")
    
        highlights = []
    
        for entry in entries:
    
            traces = [l.strip() for l in entry.strip().split("n") if l.strip()]
    
            if len(traces) < 3:
                proceed
    
            e book = traces[0]
    
            if "Spotlight" not in traces[1]:
                proceed
    
            location_match = re.search(r"Location (d+)", traces[1])
            if not location_match:
                proceed
    
            location = int(location_match.group(1))
    
            textual content = " ".be part of(traces[2:]).strip()
    
            highlights.append(
                {
                    "e book": e book,
                    "location": location,
                    "textual content": textual content
                }
            )
    
        return highlights
    
    
    # ---------- FILTER BOOK ----------
    
    def filter_book(highlights, book_name):
    
        return [
            h for h in highlights
            if book_name.lower() in h["book"].decrease()
        ]
    
    
    # ---------- SORT ----------
    
    def sort_by_location(highlights):
    
        return sorted(highlights, key=lambda x: x["location"])
    
    
    # ---------- DEDUPLICATE ----------
    
    def deduplicate(highlights):
    
        clear = []
    
        for h in highlights:
    
            textual content = h["text"]
            duplicate = False
    
            for c in clear:
    
                if textual content == c["text"]:
                    duplicate = True
                    break
    
                if textual content in c["text"]:
                    duplicate = True
                    break
    
                if c["text"] in textual content:
                    c["text"] = textual content
                    duplicate = True
                    break
    
            if not duplicate:
                clear.append(h)
    
        return clear
    
    
    # ---------- TITLE DETECTION ----------
    
    STOPWORDS = {
        "the","and","or","however","of","in","on","at","for","to",
        "is","are","was","have been","be","been","being",
        "that","this","with","as","by","from"
    }
    
    
    def has_chapter_prefix(textual content):
    
        return bool(
            re.match(
                r"^(chapter|half|part)s+d+|^d+[.)]|^[ivxlcdm]+.",
                textual content.decrease()
            )
        )
    
    
    def is_probable_title(textual content):
    
        textual content = textual content.strip()
        if len(textual content) > 120:
            return False
        if textual content.endswith("."):
            return False
    
        phrases = textual content.cut up()
    
        if len(phrases) > 12:
            return False
        # chapter fashion prefix
        if has_chapter_prefix(textual content):
            return True
    
        # capitalization ratio
        capitalized = sum(
            1 for w in phrases if w[0].isupper()
        )
        cap_ratio = capitalized / len(phrases)
    
        # stopword ratio
        stopword_count = sum(
            1 for w in phrases if w.decrease() in STOPWORDS
        )
        stop_ratio = stopword_count / len(phrases)
    
        rating = 0
        if cap_ratio > 0.6:
            rating += 1
        if stop_ratio < 0.3:
            rating += 1
        if len(phrases) <= 6:
            rating += 1
    
        return rating >= 2
    
    
    # ---------- GROUP SECTIONS ----------
    
    def group_by_sections(highlights):
    
        sections = []
        present = {
            "title": "Introduction",
            "highlights": []
        }
    
        for h in highlights:
            textual content = h["text"]
    
            if is_probable_title(textual content):
                sections.append(present)
                present = {
                    "title": textual content,
                    "highlights": []
                }
            else:
                present["highlights"].append(textual content)
        sections.append(present)
        return sections
    
    
    # ---------- SUMMARY ----------
    
    
    
    
    # ---------- EXPORT MARKDOWN ----------
    
    def export_markdown(e book, sections, abstract, output):
    
        md = f"# {e book}nn"
        for part in sections:
            md += f"## {part['title']}nn"
            for h in part["highlights"]:
                md += f"- {h}n"
            md += "n"
    
        md += "n---nn"
        md += "## E-book Summarynn"
        md += abstract
    
        output_path = Path(output)
        output_path.guardian.mkdir(dad and mom=True, exist_ok=True)
        output_path.write_text(md, encoding="utf-8")
        print(f"nSaved to {output_path}")
    
    
    # ---------- MAIN ----------
    
    def foremost():
    
        parser = argparse.ArgumentParser()
    
        parser.add_argument("--book", required=True)
        parser.add_argument("--output", required=False, default=None)
        parser.add_argument(
            "--clippings",
            default="Knowledge/My Clippings.txt"
        )
        parser.add_argument(
            "--model",
            default="mistral"
        )
    
        args = parser.parse_args()
    
        highlights = parse_clippings(args.clippings)
        highlights = filter_book(highlights, args.e book)
        highlights = sort_by_location(highlights)
        highlights = deduplicate(highlights)
        sections = group_by_sections(highlights)
    
        all_text = "n".be part of(
            h["text"] for h in highlights
        )
    
        abstract = summarize_with_ollama(all_text, args.mannequin)
    
        if args.output:
            export_markdown(
                args.e book,
                sections,
                abstract,
                args.output
            )
        else:
            print("n---- HIGHLIGHTS ----n")
            for h in highlights:
                print(f"{h['text']}n")
    
            print("n---- SUMMARY ----n")
            print(abstract)
    
    
    if __name__ == "__main__":
        foremost()

    However let’s see the way it works! The code itself is helpful however I wager you’re keen to see the outcomes. It’s an extended one so I made a decision to delete the primary half as all it does is simply copy-paste the highlights.

    I randomly selected a e book I learn like 6 years in the past (2020) referred to as Speaking to Strangers by Malcolm Gladwell (a bestseller, fairly gratifying learn). See the mannequin’s printed output (not the Markdown):

    $ python3 kindle_summary.py --book "Speaking to Strangers"
    
    ---- HIGHLIGHTS ----
    
    ...
    
    
    ---- SUMMARY ----
    
     Title: Speaking to Strangers: What We Ought to Know About Human Interplay
    
    Fundamental Thesis: The e book explores the complexities and paradoxes of human 
    interplay, significantly in conversations with strangers, and emphasizes 
    the significance of warning, humility, and understanding the context in 
    which these interactions happen.
    
    Temporary Abstract: The writer delves into the misconceptions and shortcomings 
    in our dealings with strangers, specializing in how we regularly make incorrect 
    assumptions about others based mostly on restricted info or preconceived 
    notions. The e book gives insights into why this occurs, its penalties, 
    and methods for bettering our capacity to grasp and talk 
    successfully with individuals we do not know.
    
    Key Concepts:
    1. The transparency drawback and the default-to-truth drawback: Folks typically 
    assume that others are open books, sharing their true feelings and 
    intentions, when in actuality this isn't all the time the case.
    2. Coupling: Behaviors are strongly linked to particular circumstances and 
    circumstances, making it important to grasp the context through which a 
    stranger operates.
    3. Limitations of understanding strangers: There isn't a good mechanism 
    for peering into the minds of these we have no idea, emphasizing the necessity 
    for restraint and humility when interacting with strangers.
    
    Necessary Ideas:
    1. Emotional responses falling exterior expectations
    2. Defaulting to reality
    3. Transparency as an phantasm
    4. Contextual understanding in coping with strangers
    5. The paradox of speaking to strangers (want versus terribleness)
    6. The phenomenon of coupling and its affect on habits
    7. Blaming the stranger when issues go awry
    
    Sensible Takeaways:
    1. Acknowledge that individuals could not all the time seem as they appear, each 
    emotionally and behaviorally.
    2. Perceive the significance of context in decoding strangers' 
    behaviors and intentions.
    3. Be cautious and humble when interacting with strangers, acknowledging 
    our limitations in understanding them totally.
    4. Keep away from leaping to conclusions about strangers based mostly on restricted 
    info or preconceived notions.
    5. Settle for that there'll all the time be some extent of ambiguity and 
    complexity in coping with strangers.
    6. Keep away from penalizing others for defaulting to reality as a protection mechanism.
    7. When interactions with strangers go awry, think about the function one may 
    have performed in contributing to the scenario fairly than solely blaming 
    the stranger.

    And all this inside just a few seconds. Fairly cool in my view.

    Conclusion

    And that’s principally how I’m now saving loads of free time (that I can use to put in writing posts like this one) by leveraging my information expertise and AI.

    I hope you loved the learn and felt motivated to offer it a strive! It received’t be higher than the abstract you’d write with your personal notion of the e book… Nevertheless it received’t be removed from that!

    Thanks to your consideration, be at liberty to remark if in case you have any concepts or ideas!


    Sources

    [1] Ollama. (n.d.). Ollama. https://ollama.com

    [2] Obsidian. (n.d.). Obsidian. https://obsidian.md



    Source link

    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Editor Times Featured
    • Website

    Related Posts

    How to Select Variables Robustly in a Scoring Model

    April 24, 2026

    Your Synthetic Data Passed Every Test and Still Broke Your Model

    April 23, 2026

    Using a Local LLM as a Zero-Shot Classifier

    April 23, 2026

    I Simulated an International Supply Chain and Let OpenClaw Monitor It

    April 23, 2026

    Lasso Regression: Why the Solution Lives on a Diamond

    April 23, 2026

    The Most Efficient Approach to Crafting Your Personal AI Productivity System

    April 23, 2026
    Leave A Reply Cancel Reply

    Editors Picks

    The Latest Push to Extend Key US Spy Powers Is Still a Mess

    April 24, 2026

    Anthropic says Google is committing to invest $10B now in cash at a $350B valuation and will invest another $30B if Anthropic hits performance targets (Bloomberg)

    April 24, 2026

    ‘Apex’ Review: Charlize Theron Netflix Thriller Avoids Rock Bottom, but Barely

    April 24, 2026

    I Built an AI Pipeline for Kindle Highlights

    April 24, 2026
    Categories
    • Founders
    • Startups
    • Technology
    • Profiles
    • Entrepreneurs
    • Leaders
    • Students
    • VC Funds
    About Us
    About Us

    Welcome to Times Featured, an AI-driven entrepreneurship growth engine that is transforming the future of work, bridging the digital divide and encouraging younger community inclusion in the 4th Industrial Revolution, and nurturing new market leaders.

    Empowering the growth of profiles, leaders, entrepreneurs businesses, and startups on international landscape.

    Asia-Middle East-Europe-North America-Australia-Africa

    Facebook LinkedIn WhatsApp
    Featured Picks

    Why grippers and sensors matter for real-world robotics

    April 9, 2026

    OnePlus and Oppo to Raise Smartphone Prices as Memory Costs Climb

    March 11, 2026

    Visible Promo Codes and Deals: Save On Phones and Plans

    January 31, 2025
    Categories
    • Founders
    • Startups
    • Technology
    • Profiles
    • Entrepreneurs
    • Leaders
    • Students
    • VC Funds
    Copyright © 2024 Timesfeatured.com IP Limited. All Rights.
    • Privacy Policy
    • Disclaimer
    • Terms and Conditions
    • About us
    • Contact us

    Type above and press Enter to search. Press Esc to cancel.