You need to read this article
If you are planning to enter data science, be it as a graduate or a professional looking for a career change, or as a manager in charge of establishing best practices, this article is for you.
Data science attracts a variety of different backgrounds. From my professional experience, I’ve worked with colleagues who were once:
- Nuclear Physicists
- Post-docs researching Gravitational Waves
- PhDs in Computational Biology
- Linguists
just to name a few.
It’s great to be able to meet such a diverse set of backgrounds, and I’ve seen such a variety of minds lead to the growth of a creative and effective Data Science function.
However, I’ve also seen one big downside to this variety:
Everyone has had different levels of exposure to key Software Engineering concepts, resulting in a patchwork of coding skills.
As a result, I’ve seen work done by some data scientists that is brilliant, but is:
- Unreadable: you have no idea what they are trying to do.
- Flaky: it breaks the moment someone else tries to run it.
- Unmaintainable: code quickly becomes obsolete or breaks easily.
- Un-extensible: code is single-use and its behaviour cannot be extended.
This ultimately dampens the impact their work can have and creates all sorts of issues down the line.
So, in a series of articles, I plan to outline some core software engineering concepts that I’ve tailored to be necessities for data scientists.
They are simple concepts, but the difference between knowing them and not knowing them clearly draws the line between amateur and professional.
Today’s Concept: Inheritance
Inheritance is fundamental to writing clean, reusable code that improves your efficiency and productivity. It can also be used to standardise the way a team writes code, which enhances readability and maintainability.
Looking back at how difficult it was to learn these concepts when I was first learning to code, I’m not going to start off with an abstract, high-level definition that provides no value to you at this stage. There’s plenty on the internet you can google if you want this.
Instead, let’s take a look at a real-life example of a data science project.
We’ll outline the kind of practical problems a data scientist might run into, see what inheritance is, and how it can help a data scientist write better code.
And by better we mean:
- Code that’s easier to read.
- Code that’s easier to maintain.
- Code that’s easier to re-use.
Example: Ingesting data from multiple different sources

The most tedious and time-consuming part of a data scientist’s job is figuring out where to get data, how to read it, how to clean it, and how to save it.
Let’s say you have labels provided in CSV files submitted from five different external sources, each with its own unique schema.
Your task is to clean each one of them and output them as a parquet file. For this file to be compatible with downstream processes, it must conform to a schema:
- label_id: Integer
- label_value: Integer
- label_timestamp: String timestamp in ISO format.
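To make the target concrete, here is a minimal sketch of what one conforming record could look like and how it might be checked using only the standard library. The helper name `conforms_to_schema` and the sample values are hypothetical and are not part of the pipeline described in this article.

```python
from datetime import datetime


def conforms_to_schema(record: dict) -> bool:
    """Return True if a record matches the three-column label schema."""
    try:
        # fromisoformat raises ValueError for strings that are not ISO timestamps.
        datetime.fromisoformat(record["label_timestamp"])
    except (KeyError, TypeError, ValueError):
        return False
    return (
        isinstance(record.get("label_id"), int)
        and isinstance(record.get("label_value"), int)
    )


# Hypothetical sample records, for illustration only.
good = {"label_id": 1, "label_value": 0, "label_timestamp": "2024-01-31T12:00:00"}
bad = {"label_id": 1, "label_value": 0, "label_timestamp": "31/01/2024"}
```

Here `good` passes the check while `bad` fails because its timestamp is not in ISO format.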
The Quick & Dirty Approach
In this case, the quick and dirty approach would be to write a separate script for each file.
# clean_source1.py
import polars as pl

if __name__ == '__main__':
    # Column names such as 'some-metadata1' are placeholders.
    df = pl.scan_csv('source1.csv')
    # OR together the boolean flags within each group.
    overall_label_value = df.group_by('some-metadata1').agg(
        overall_label_value=pl.col('some-metadata2').any()
    )
    df = df.join(overall_label_value, on='some-metadata1')
    df = df.drop(['some-metadata1', 'some-metadata2', 'some-metadata3'])
    df = df.select(
        pl.col('primary_key').alias('label_id'),
        pl.col('overall_label_value').cast(pl.Int64).alias('label_value'),
        pl.col('some-metadata6').alias('label_timestamp'),
    )
    # scan_csv returns a LazyFrame, so the result is written with sink_parquet.
    df.sink_parquet('output/source1.parquet')
and each script would be unique.
So what’s wrong with this? It gets the job done, right?
Let’s go back to our criteria for good code and evaluate why this one is bad:
1. It’s hard to read
There’s no organisation or structure to the code.
All the logic for loading, cleaning, and saving is in the same place, so it’s difficult to see where the line is between each step.
Keep in mind, this is a contrived, simple example. In the real world, the code you’d write would be much longer and more complex.
When you have hard-to-read code, and five different versions of it, it leads to longer-term problems:
2. It’s hard to maintain
The lack of structure makes it hard to add new features or fix bugs. If the logic had to be changed, the entire script would likely need to be overhauled.
If there was a common operation that needed to be applied to all outputs, then someone would need to go and modify all five scripts separately.
Each time, they would need to decipher the purpose of lines and lines of code. Because there’s no clear distinction between
- where data is loaded,
- where data is used,
- which variables are dependent on downstream operations,
it becomes hard to know whether the changes you make will have any unknown impact on downstream code, or violate some upstream assumption.
Ultimately, it becomes very easy for bugs to creep in.
3. It’s hard to re-use
This code is the definition of a one-off.
It’s hard to read: you don’t know what’s happening where unless you invest a lot of time to make sure you understand every line of code.
If someone wanted to reuse logic from it, the only option they would have is to copy-paste the whole script and modify it, or rewrite their own from scratch.
There are better, more efficient ways of writing code.
The Better, Professional Approach
Now, let’s look at how we can improve our situation by using inheritance.

1. Identify the commonalities
In our example, every data source is unique. We know that each file will require:
- A number of cleaning steps
- A saving step, where we already know all files will be saved into a single parquet file.
We also know each file needs to conform to the same schema, so it’s best that we have some validation of the output data.
These commonalities tell us what functionality we could write once and then reuse.
2. Create a base class
Now comes the inheritance half.
We write a base class, or parent class, which implements the logic for handling the commonalities we identified above. This class will become the template from which other classes will ‘inherit’.
Classes which inherit from this class (called child classes) will have the same functionality as the parent class, but will also be able to add new functionality, or change what is already available.
import polars as pl

class BaseCSVLabelProcessor:
    REQUIRED_OUTPUT_SCHEMA = {
        "label_id": pl.Int64,
        "label_value": pl.Int64,
        "label_timestamp": pl.Datetime
    }

    def __init__(self, input_file_path, output_file_path):
        self.input_file_path = input_file_path
        self.output_file_path = output_file_path

    def load(self):
        """Load the data from the file."""
        return pl.scan_csv(self.input_file_path)

    def clean(self, data: pl.LazyFrame):
        """Clean the input data."""
        ...

    def save(self, data: pl.LazyFrame):
        """Save the data to parquet file."""
        data.sink_parquet(self.output_file_path)

    def validate_schema(self, data: pl.LazyFrame):
        """
        Check that the data conforms to the expected schema.
        """
        for colname, expected_dtype in self.REQUIRED_OUTPUT_SCHEMA.items():
            actual_dtype = data.schema.get(colname)
            if actual_dtype is None:
                raise ValueError(f"Column {colname} not found in data")
            if actual_dtype != expected_dtype:
                raise ValueError(
                    f"Column {colname} has incorrect type. Expected {expected_dtype}, got {actual_dtype}"
                )

    def run(self):
        """Run data processing on the specified file."""
        data = self.load()
        data = self.clean(data)
        self.validate_schema(data)
        self.save(data)
3. Define child classes
Now we define the child classes:
class Source1LabelProcessor(BaseCSVLabelProcessor):
    def clean(self, data: pl.LazyFrame):
        # bespoke logic for source 1
        ...

class Source2LabelProcessor(BaseCSVLabelProcessor):
    def clean(self, data: pl.LazyFrame):
        # bespoke logic for source 2
        ...

class Source3LabelProcessor(BaseCSVLabelProcessor):
    def clean(self, data: pl.LazyFrame):
        # bespoke logic for source 3
        ...
Since all the common logic is already implemented in the parent class, all the child class needs to be concerned with is the bespoke logic that is unique to each file.
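One optional hardening, which is not part of the code in this article, is to declare `clean` as abstract so that a child class which forgets to implement it fails loudly instead of silently returning nothing. The sketch below is a simplified stand-in that uses plain lists instead of polars; `BaseProcessor`, `StripProcessor`, and `ForgetfulProcessor` are hypothetical names.

```python
from abc import ABC, abstractmethod


class BaseProcessor(ABC):
    """Hypothetical stand-in for the base class, using lists instead of polars."""

    def load(self):
        # Stand-in for reading a CSV file.
        return [" a ", "b ", " c"]

    @abstractmethod
    def clean(self, data):
        """Each child class must supply its own cleaning logic."""

    def run(self):
        # Common workflow, written once in the parent.
        return self.clean(self.load())


class StripProcessor(BaseProcessor):
    def clean(self, data):
        # Bespoke logic for this hypothetical source.
        return [item.strip() for item in data]


class ForgetfulProcessor(BaseProcessor):
    pass  # No clean() defined.


result = StripProcessor().run()

try:
    ForgetfulProcessor()
    instantiated = True
except TypeError:
    # A class with unimplemented abstract methods cannot be instantiated.
    instantiated = False
```

The child class only writes `clean`, yet `run` still works because it is inherited from the parent. The forgetful child cannot even be created, which surfaces the mistake early.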
So the code we wrote for the bad example can now be turned into:
import polars as pl
from <module> import BaseCSVLabelProcessor  # module name is elided in the original

class Source1LabelProcessor(BaseCSVLabelProcessor):
    def get_overall_label_value(self, data: pl.LazyFrame):
        """Get overall label value."""
        # OR together the boolean flags within each group and attach the result to every row.
        return data.with_columns(
            overall_label_value=pl.col('some-metadata2').any().over('some-metadata1')
        )

    def conform_to_output_schema(self, data: pl.LazyFrame):
        """Drop unnecessary columns and conform required columns to output schema."""
        data = data.drop(['some-metadata1', 'some-metadata2', 'some-metadata3'])
        data = data.select(
            pl.col('primary_key').alias('label_id'),
            pl.col('overall_label_value').cast(pl.Int64).alias('label_value'),
            pl.col('some-metadata6').alias('label_timestamp'),
        )
        return data

    def clean(self, data: pl.LazyFrame) -> pl.LazyFrame:
        """Clean label data from Source 1.

        The following steps are necessary to clean the data:
        1.
        2.
        3. Renaming columns and data types to conform to the expected output schema.
        """
        data = self.get_overall_label_value(data)
        data = self.conform_to_output_schema(data)
        return data
and in order to run our code, we can do it in a centralised location:
# label_preparation_pipeline.py
# The module name is elided in the original.
from <module> import Source1LabelProcessor, Source2LabelProcessor, Source3LabelProcessor

INPUT_FILEPATHS = {
    'source1': '/path/to/file1.csv',
    'source2': '/path/to/file2.csv',
    'source3': '/path/to/file3.csv',
}
OUTPUT_FILEPATH = '/path/to/output.parquet'

def main():
    """Label processing pipeline.

    The label processing pipeline ingests data sources 1, 2, 3 which are from
    external vendors .
    The output is written to a parquet file, ready for ingestion by .

    The code assumes the following:
    -

    The user needs to specify the following inputs:
    -
    """
    processors = [
        Source1LabelProcessor(INPUT_FILEPATHS['source1'], OUTPUT_FILEPATH),
        Source2LabelProcessor(INPUT_FILEPATHS['source2'], OUTPUT_FILEPATH),
        Source3LabelProcessor(INPUT_FILEPATHS['source3'], OUTPUT_FILEPATH)
    ]
    for processor in processors:
        processor.run()
Why is this better?
1. Good encapsulation
You shouldn’t need to look under the hood to know how to drive a car.
Any colleague who needs to re-run this code will only need to run the main() function. You would have provided sufficient docstrings in the respective functions to explain what they do and how to use them.
But they don’t need to know how every single line of code works.
They should be able to trust your work and run it. Only when they need to fix a bug or extend its functionality will they need to go deeper.
This is called encapsulation: strategically hiding the implementation details from the user. It’s another programming concept that’s essential for writing good code.

In a nutshell, it should be sufficient for the reader to rely on the docstrings to understand what the code does and how to use it.
How often do you go into the scikit-learn source code to learn how to use their models? You never do. scikit-learn is an ideal example of good coding design through encapsulation.
I’ve already written an article dedicated to encapsulation, so if you want to know more, check it out.
2. Better extensibility
What if the label outputs now had to change? For example, downstream processes that ingest the labels now require them to be stored in a SQL table.
Well, it becomes very simple to do this: we simply need to modify the save method in the BaseCSVLabelProcessor class, and then all the child classes will inherit this change automatically.
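As a rough illustration of that point, the sketch below swaps the storage backend in one place and lets two child classes pick it up unchanged. It is deliberately simplified and makes several assumptions: rows are plain tuples rather than a polars LazyFrame, the database is an in-memory SQLite instance from the standard library, and the class names, the `labels` table name, and the sample rows are hypothetical.

```python
import sqlite3


class BaseLabelProcessor:
    """Hypothetical simplified parent: rows are plain tuples, not a LazyFrame."""

    def __init__(self, connection):
        self.connection = connection

    def clean(self, rows):
        return rows

    def save(self, rows):
        # The only method that changed: write to a SQL table instead of parquet.
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS labels "
            "(label_id INTEGER, label_value INTEGER, label_timestamp TEXT)"
        )
        self.connection.executemany("INSERT INTO labels VALUES (?, ?, ?)", rows)
        self.connection.commit()

    def run(self, rows):
        self.save(self.clean(rows))


class Source1Processor(BaseLabelProcessor):
    def clean(self, rows):
        # Bespoke logic: convert boolean labels to integers.
        return [(label_id, int(value), ts) for label_id, value, ts in rows]


class Source2Processor(BaseLabelProcessor):
    def clean(self, rows):
        # Bespoke logic: drop rows with a missing label.
        return [row for row in rows if row[1] is not None]


conn = sqlite3.connect(":memory:")
Source1Processor(conn).run([(1, True, "2024-01-01T00:00:00")])
Source2Processor(conn).run(
    [(2, 0, "2024-01-02T00:00:00"), (3, None, "2024-01-03T00:00:00")]
)
saved = conn.execute(
    "SELECT label_id, label_value FROM labels ORDER BY label_id"
).fetchall()
```

Neither child class mentions SQL anywhere, yet both now write to the table because `save` is inherited from the parent.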
What if you find an incompatibility between the label outputs and some process downstream? Perhaps a new column is needed?
Well, you would need to change the respective clean methods to account for this. But you can also extend the checks in the validate_schema method in the BaseCSVLabelProcessor class to account for this new requirement.
You can even take this one step further and add many more checks to always make sure that the outputs are as expected. You may even want to define a separate validation module for doing this, and plug its checks into the validate_schema method.
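One way this could look is sketched below, under the assumption that each check is a plain function that raises on failure. The check functions, the `CHECKS` attribute, and the use of lists of dictionaries instead of a polars LazyFrame are all hypothetical simplifications.

```python
# These two functions could live in a separate, hypothetical validation module.
def check_required_columns(rows):
    required = {"label_id", "label_value", "label_timestamp"}
    for row in rows:
        missing = required - set(row)
        if missing:
            raise ValueError(f"Missing columns: {sorted(missing)}")


def check_binary_labels(rows):
    for row in rows:
        if row["label_value"] not in (0, 1):
            raise ValueError(f"Unexpected label_value: {row['label_value']}")


class BaseLabelProcessor:
    # Checks are defined outside the class and plugged in here.
    CHECKS = [check_required_columns, check_binary_labels]

    def validate(self, rows):
        # Every child class inherits the full list of checks.
        for check in self.CHECKS:
            check(rows)
```

Adding a new requirement then means writing one more check function and appending it to the list, without touching any child class.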
You can see how extending the behaviour of our label processing code becomes very simple.
In comparison, if the code lived in separate bespoke scripts, you would be copying and pasting these checks over and over. Even worse, maybe each file requires some bespoke implementation. This means the same problem needs to be solved five times, when it could be solved properly just once.
It’s rework, it’s inefficiency, it’s wasted resources and time.
Final Remarks
So, in this article, we’ve covered how the use of inheritance greatly enhances the quality of our codebase.
By correctly applying inheritance, we’re able to solve common problems across different tasks, and we’ve seen first hand how this leads to:
- Code that’s easier to read (readability)
- Code that’s easier to debug and maintain (maintainability)
- Code that’s easier to add to and extend (extensibility)
However, some readers will still be sceptical of the need to write code like this.
Perhaps they’ve been writing one-off scripts for their entire career, and everything has been fine so far. Why bother writing code in a more complicated way?

Well, that’s a very good question, and there’s a very clear reason why it’s necessary.
Up until very recently, data science has been a new, niche industry where proof-of-concepts and research were the main focus of work. Coding standards didn’t matter then, as long as we got something out through the doors and it worked.
But data science is fast approaching maturity, where it’s no longer enough to just build models.
We now have to maintain, fix, debug, and retrain not only models, but also all the processes required to create the model, for as long as they are used.
This is the reality that data science needs to face: building models is the easy part, whilst maintaining what we have built is the hard part.
Meanwhile, software engineering has been doing this for decades, and has through trial and error built up all the best practices we discussed today so that the code it builds is easy to maintain.
Therefore, data scientists will need to know these best practices going forwards.
Those who know this will inevitably be at an advantage compared to those who don’t.