Abstract Classes: A Software Engineering Concept Data Scientists Must Know To Succeed



Who should read this article

If you're planning to enter data science, whether you're a graduate, a professional looking for a career change, or a manager responsible for establishing best practices, this article is for you.

Data science attracts people from a variety of different backgrounds. From my professional experience, I've worked with colleagues who were once:

• Nuclear physicists
• Post-docs researching gravitational waves
• PhDs in computational biology
• Linguists

just to name a few.

It's wonderful to meet such a diverse set of backgrounds, and I've seen this variety of minds lead to the growth of a creative and effective data science function.

However, I've also seen one big downside to this variety:

Everyone has had different levels of exposure to key Software Engineering concepts, resulting in a patchwork of coding skills.

As a result, I've seen work done by some data scientists that is brilliant, but is:

• Unreadable — you have no idea what they're trying to do.
• Flaky — it breaks the moment someone else tries to run it.
• Unmaintainable — code quickly becomes obsolete or breaks easily.
• Un-extensible — code is single-use and its behaviour cannot be extended.

which ultimately dampens the impact their work can have and creates all sorts of issues down the line.

So, in a series of articles, I plan to outline some core software engineering concepts that I have tailored into essentials for data scientists.

They're simple concepts, but the difference between knowing them and not knowing them clearly draws the line between novice and professional.

Abstract Art, Photo by Steve Johnson on Unsplash

Today's concept: Abstract classes

Abstract classes are an extension of class inheritance, and they can be a very useful tool for data scientists if used correctly.

If you need a refresher on class inheritance, see my article on it here.

Like we did for class inheritance, I won't bother with a formal definition. Looking back to when I first started coding, I found it hard to decipher the vague and abstract (no pun intended) definitions out there on the Internet.

It's much easier to illustrate it by walking through a practical example.

So, let's go straight into an example that a data scientist is likely to encounter, to demonstrate how abstract classes are used and why they're useful.

Example: Preparing data for ingestion into a feature generation pipeline

Photo by Scott Graham on Unsplash

Let's say we're a consultancy that specialises in fraud detection for financial institutions.

We work with a number of different clients, and we have a set of features that carry a consistent signal across different client projects because they embed domain knowledge gathered from subject matter experts.

So it makes sense to build these features for every project, even if they are ultimately dropped during feature selection or replaced with bespoke features built for that client.

The challenge

We data scientists know that working across different projects/environments/clients means that the input data for each is never the same:

• Clients may provide different file types: CSV, Parquet, JSON, tar, to name a few.
• Different environments may require different sets of credentials.
• Almost certainly, each dataset has its own quirks, and so each requires different data cleaning steps.

Therefore, you may think that we would need to build a new feature generation pipeline for every client.

How else would you handle the intricacies of each dataset?

No, there is a better way

Given that:

• We know we're going to be building the same set of useful features for every client
• We can build one feature generation pipeline that can be reused for every client
• Thus, the only new problem we need to solve is cleaning the input data.

Our problem can therefore be formulated into the following stages:

Image by author. Blue circles are datasets, yellow squares are pipelines.
• Data Cleaning pipeline
  • Responsible for handling any unique cleaning and processing that is required for a given client, in order to format the dataset into a standardised schema dictated by the feature generation pipeline.
• The Feature Generation pipeline
  • Implements the feature engineering logic, assuming the input data follows a fixed schema, to output our useful set of features.

Given a fixed input data schema, building the feature generation pipeline is trivial.

Therefore, we have boiled down our problem to the following:

How do we ensure the quality of the data cleaning pipelines such that their outputs always adhere to the downstream requirements?

The real problem we're solving

Our problem of 'ensuring the output always adheres to downstream requirements' is not just about getting code to run. That's the easy part.

The hard part is designing code that is robust to a myriad of external, non-technical factors such as:

• Human error
  • People naturally forget small details or prior assumptions. They may build a data cleaning pipeline whilst overlooking certain requirements.
• Leavers
  • Over time, your team inevitably changes. Your colleagues may have knowledge that they assumed to be obvious, and therefore they never bothered to document it. Once they've left, that knowledge is lost. Only through trial and error, and hours of debugging, will your team ever recover that knowledge.
• New joiners
  • Meanwhile, new joiners have no knowledge of prior assumptions that were once considered obvious, so their code usually requires a lot of debugging and rewriting.

This is where abstract classes really shine.

Input data requirements

We mentioned that we can fix the schema for the feature generation pipeline's input data, so let's define this for our example.

Let's say that our pipeline expects to read in parquet files containing the following columns:

row_id:
    int, a unique ID for every transaction.
timestamp:
    str, in ISO 8601 format. The timestamp at which the transaction was made.
amount:
    int, the transaction amount denominated in pennies (for our US readers, the equivalent will be cents).
direction:
    str, the direction of the transaction, one of ['OUTBOUND', 'INBOUND'].
account_holder_id:
    str, unique identifier for the entity that owns the account the transaction was made on.
account_id:
    str, unique identifier for the account the transaction was made on.

Let's also add the requirement that the dataset must be ordered by timestamp.

The abstract class

Now it's time to define our abstract class.

An abstract class is essentially a blueprint from which we can inherit to create child classes, otherwise known as 'concrete' classes.

Let's spec out the different methods we will need for our data cleaning blueprint.

import os
from abc import ABC, abstractmethod

import polars as pl  # used by the pre-defined save/validate behaviour below


class BaseRawDataPipeline(ABC):
    def __init__(
        self,
        input_data_path: str | os.PathLike,
        output_data_path: str | os.PathLike
    ):
        self.input_data_path = input_data_path
        self.output_data_path = output_data_path

    @abstractmethod
    def transform(self, raw_data):
        """Transform the raw data.

        Args:
            raw_data: The raw data to be transformed.
        """
        ...

    @abstractmethod
    def load(self):
        """Load in the raw data."""
        ...

    def save(self, transformed_data):
        """Save the transformed data."""
        ...

    def validate(self, transformed_data):
        """Validate the transformed data."""
        ...

    def run(self):
        """Run the data cleaning pipeline."""
        ...

You can see that we have imported the ABC class from the abc module, which is what allows us to create abstract classes in Python.
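As a quick aside (this snippet is illustrative only and not part of the pipeline code), the abstract base class cannot be instantiated directly; Python raises a TypeError because the abstract methods have no implementation yet. The file paths below are hypothetical placeholders.

# Illustrative only: instantiating the abstract base class fails.
BaseRawDataPipeline(
    input_data_path="raw_data.csv",         # hypothetical path
    output_data_path="clean_data.parquet",  # hypothetical path
)
# TypeError: Can't instantiate abstract class BaseRawDataPipeline ...
# (exact wording varies by Python version, but it names `load` and `transform`)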

Image by author. Diagram of the abstract class and concrete class relationships and methods.

Pre-defined behaviour

Image by author. The methods with pre-defined behaviour are circled in purple.

Let's now add some pre-defined behaviour to our abstract class.

Remember, this behaviour will be made available to all child classes which inherit from this class, so this is where we bake in the behaviour that we want to enforce for all future projects.

For our example, the behaviour that needs fixing across all projects relates to how we output the processed dataset.

1. The run method

First, we define the run method. This is the method that will be called to run the data cleaning pipeline.

    def run(self):
        """Run the data cleaning pipeline."""
        inputs = self.load()
        output = self.transform(inputs)
        self.validate(output)
        self.save(output)

The run method acts as a single point of entry for all future child classes.

This standardises how any data cleaning pipeline will be run, which allows us to build new functionality around any pipeline without worrying about the underlying implementation.

You can imagine how incorporating such pipelines into an orchestrator or scheduler becomes much easier if all pipelines are executed through the same run method, versus having to handle many different names such as run, execute, process, fit, transform etc.
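As a small sketch of that idea (illustrative only; it uses the Project1RawDataPipeline defined later in this article, plus a hypothetical second pipeline class and hypothetical file paths), an orchestration loop can treat every pipeline identically because they all expose the same run method:

# Illustrative orchestration sketch: every pipeline shares the same entry point.
pipelines = [
    Project1RawDataPipeline("client1/raw.csv", "client1/clean.parquet"),
    # Project2RawDataPipeline("client2/raw.ndjson", "client2/clean.parquet"),  # hypothetical
]

for pipeline in pipelines:
    pipeline.run()  # identical call regardless of the underlying implementation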

2. The save method

Next, we fix how we output the transformed data.

    def save(self, transformed_data: pl.LazyFrame):
        """Save the transformed data to parquet."""
        transformed_data.sink_parquet(
            self.output_data_path,
        )

We're assuming we will use `polars` for data manipulation, and the output is saved as `parquet` files, as per our specification for the feature generation pipeline.

3. The validate method

Lastly, we populate the validate method, which will check that the dataset adheres to our expected output format before saving it down.

    @property
    def output_schema(self):
        return dict(
            row_id=pl.Int64,
            timestamp=pl.Datetime,
            amount=pl.Int64,
            direction=pl.Categorical,
            account_holder_id=pl.Categorical,
            account_id=pl.Categorical,
        )

    def validate(self, transformed_data):
        """Validate the transformed data."""
        schema = transformed_data.collect_schema()
        assert self.output_schema == schema, \
            f"Expected {self.output_schema} but got {schema}"

We've created a property called output_schema. This ensures that all child classes will have it available, whilst preventing it from being accidentally removed or overridden, as it could be if it were defined in, for example, __init__.
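The validate method above only checks the schema. If we also wanted to enforce the ordering-by-timestamp requirement stated earlier, one possible extension (an illustrative sketch, not part of the article's code) could look like the following. It works whether the column is stored as an ISO 8601 string (which sorts lexicographically in chronological order) or as a datetime:

    def validate(self, transformed_data):
        """Validate the schema and the ordering of the transformed data."""
        schema = transformed_data.collect_schema()
        assert self.output_schema == schema, \
            f"Expected {self.output_schema} but got {schema}"

        # Illustrative extra check: the dataset must be ordered by timestamp.
        # Only this single boolean is materialised by collect().
        is_ordered = transformed_data.select(
            (pl.col("timestamp") == pl.col("timestamp").sort()).all()
        ).collect().item()
        assert is_ordered, "Expected the dataset to be ordered by timestamp"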

Project-specific behaviour

Image by author. Project-specific methods that need to be overridden are circled in purple.

In our example, the load and transform methods are where project-specific behaviour will live, so we leave them blank in the base class – the implementation is deferred to the future data scientist responsible for writing this logic for the project.

You will also notice that we have used the abstractmethod decorator on the transform and load methods. This decorator enforces that these methods are defined by a child class. If a user forgets to define them, an error will be raised to remind them to do so.
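To see that enforcement in action, here is a small illustrative snippet (hypothetical, not from the project code): a child class that forgets to implement transform cannot even be instantiated.

# Illustrative only: forgetting an abstract method.
class IncompletePipeline(BaseRawDataPipeline):
    def load(self):
        return pl.scan_csv(self.input_data_path)
    # transform is missing

IncompletePipeline("raw.csv", "clean.parquet")
# TypeError: Can't instantiate abstract class IncompletePipeline ...
# (exact wording varies by Python version, but it points at the missing `transform`)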

Let's now move on to an example project where we define the transform and load methods.

Example project

The client in this project sends us their dataset as CSV files with the following structure:

event_id: str
unix_timestamp: int
user_uuid: int
wallet_uuid: int
payment_value: float
country: str

We learn from them that:

• Each transaction is uniquely identified by the combination of event_id and unix_timestamp
• The wallet_uuid is the equivalent identifier for the 'account'
• The user_uuid is the equivalent identifier for the 'account holder'
• The payment_value is the transaction amount, denominated in Pound Sterling (or Dollars).
• The CSV file is separated by | and has no header.

The concrete class

Now we implement the load and transform methods to handle the unique complexities outlined above in a child class of BaseRawDataPipeline.

Remember, these methods are all that need to be written by the data scientists working on this project. All the aforementioned methods are pre-defined, so they need not worry about them, reducing the amount of work your team has to do.

1. Loading the data

The load function is quite simple:

class Project1RawDataPipeline(BaseRawDataPipeline):

    def load(self):
        """Load in the raw data.

        Note:
            As per the client's specification, the CSV file is separated
            by `|` and has no header, so we assign the column names ourselves.
        """
        return pl.scan_csv(
            self.input_data_path,
            separator="|",
            has_header=False,
            new_columns=[
                "event_id",
                "unix_timestamp",
                "user_uuid",
                "wallet_uuid",
                "payment_value",
                "country",
            ],
        )

We use polars' scan_csv method to stream the data, with the appropriate arguments to handle the CSV file structure for our client.

2. Transforming the data

The transform method is also simple for this project, since we don't have any complex joins or aggregations to perform, so we can fit it all into a single function.

class Project1RawDataPipeline(BaseRawDataPipeline):

    ...

    def transform(self, raw_data: pl.LazyFrame):
        """Transform the raw data.

        Args:
            raw_data (pl.LazyFrame):
                The raw data to be transformed. Must contain the following columns:
                    - 'event_id'
                    - 'unix_timestamp'
                    - 'user_uuid'
                    - 'wallet_uuid'
                    - 'payment_value'

        Returns:
            pl.LazyFrame:
                The transformed data.

                Operations:
                    1. row_id is constructed by concatenating event_id and unix_timestamp
                    2. account_id and account_holder_id are renamed from wallet_uuid
                       and user_uuid respectively
                    3. amount is converted from payment_value. The source data is
                       denominated in £/$, so we need to convert to p/cents.
        """

        # select only the columns we need
        DESIRED_COLUMNS = [
            "event_id",
            "unix_timestamp",
            "user_uuid",
            "wallet_uuid",
            "payment_value",
        ]
        df = raw_data.select(DESIRED_COLUMNS)

        df = df.select(
            # concatenate event_id and unix_timestamp
            # to get a unique identifier for each row.
            pl.concat_str(
                [
                    pl.col("event_id"),
                    pl.col("unix_timestamp")
                ],
                separator="-"
            ).alias("row_id"),

            # convert unix timestamp to ISO format string
            pl.from_epoch("unix_timestamp", "s").dt.to_string("iso").alias("timestamp"),

            # wallet_uuid identifies the account; user_uuid identifies the account holder
            pl.col("wallet_uuid").alias("account_id"),
            pl.col("user_uuid").alias("account_holder_id"),

            # convert from £ to p
            # OR convert from $ to cents
            (pl.col("payment_value") * 100).alias("amount"),
        )

        return df

Thus, by overriding these two methods, we've implemented all we need for our client project.

We know the output conforms to the requirements of the downstream feature engineering pipeline, so we automatically have assurance that our outputs are compatible.

No debugging required. No hassle. No fuss.
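Putting it all together, running the pipeline for this client is now a two-liner. The file paths below are hypothetical placeholders; the interface is just the run method defined on the base class:

# Hypothetical usage: paths are placeholders for wherever the client data lives.
pipeline = Project1RawDataPipeline(
    input_data_path="data/client1_transactions.csv",
    output_data_path="data/client1_transactions.parquet",
)
pipeline.run()  # load -> transform -> validate -> save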

Final summary: Why use abstract classes in data science pipelines?

Abstract classes offer a powerful way to bring consistency, robustness, and improved maintainability to data science projects. By using abstract classes as in our example, our data science team sees the following benefits:

1. No need to worry about compatibility

By defining a clear blueprint with abstract classes, the data scientist only needs to focus on implementing the load and transform methods specific to their client's data.

As long as these methods conform to the expected input/output types, compatibility with the downstream feature generation pipeline is guaranteed.

This separation of concerns simplifies the development process, reduces bugs, and accelerates development for new projects.

2. Easier to document

The structured format naturally encourages in-line documentation through method docstrings.

This proximity of design decisions and implementation makes it easier to communicate assumptions, transformations, and nuances for each client's dataset.

Well-documented code is easier to read, maintain, and hand over, reducing the knowledge loss caused by team changes or turnover.

3. Improved code readability and maintainability

With abstract classes enforcing a consistent interface, the resulting codebase avoids the pitfalls of unreadable, flaky, or unmaintainable scripts.

Each child class adheres to a standardised method structure (load, transform, validate, save, run), making the pipelines more predictable and easier to debug.

4. Robustness to human factors

Abstract classes help reduce the risks posed by human error, teammates leaving, or onboarding new joiners by embedding essential behaviours in the base class. This ensures that crucial steps are never skipped, even if individual contributors are unaware of all the downstream requirements.

5. Extensibility and reusability

By isolating client-specific logic in concrete classes while sharing common behaviours in the abstract base, it becomes easy to extend pipelines for new clients or projects. You can add new data cleaning steps or support new file formats without rewriting the entire pipeline.
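For instance, onboarding a hypothetical second client who sends newline-delimited JSON could be as small as the sketch below. The class name, file format, and column names are assumptions made purely for illustration; only the base class and the target schema come from the example above.

# Hypothetical second client: same base class, different load/transform logic.
class Project2RawDataPipeline(BaseRawDataPipeline):

    def load(self):
        """Load newline-delimited JSON (the format assumed for this client)."""
        return pl.scan_ndjson(self.input_data_path)

    def transform(self, raw_data: pl.LazyFrame):
        """Map this client's (assumed) column names onto the standard schema."""
        return raw_data.select(
            pl.col("txn_id").alias("row_id"),
            pl.col("txn_time").alias("timestamp"),
            (pl.col("value_gbp") * 100).alias("amount"),
            pl.col("flow").alias("direction"),
            pl.col("holder_id").alias("account_holder_id"),
            pl.col("account_number").alias("account_id"),
        )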

In summary, abstract classes level up your data science codebase from ad-hoc scripts to scalable, maintainable, production-grade code. Whether you're a data scientist, a team lead, or a manager, adopting these software engineering concepts will significantly boost the impact and longevity of your work.

Related articles:

If you enjoyed this article, then check out some of my other related articles.

• Inheritance: A software engineering concept data scientists must know to succeed (here)
• Encapsulation: A software engineering concept data scientists must know to succeed (here)
• The Data Science Tool You Need For Efficient ML-Ops (here)
• DSLP: The data science project management framework that transformed my team (here)
• How to stand out in your data scientist interview (here)
• An Interactive Visualisation For Your Graph Neural Network Explanations (here)
• The New Best Python Package for Visualising Network Graphs (here)


