Why you should read this article
Data scientists whip up a Jupyter Notebook, play around in some cells, and then keep entire data processing and model training pipelines in that same notebook.
The code is tested once when the notebook is first written, and then it's neglected for some undetermined amount of time – days, weeks, months, years – until:
- The notebook needs to be rerun to regenerate outputs that were lost.
- The notebook needs to be rerun with different parameters to retrain a model.
- Something needs to change upstream, and the notebook needs to be rerun to refresh downstream datasets.
Many of you will have felt shivers down your spine reading this…
Why?
Because you instinctively know that this notebook isn't going to run.
You feel it in your bones: the code in that notebook will need to be debugged for hours at best, rewritten from scratch at worst.
In both cases, it will take you a long time to get what you need.
Why does this happen?
Is there any way of avoiding it?
Is there a better way of writing and maintaining code?
These are the questions we will be answering in this article.
The Solution: Automated Testing
What is it?
As the name suggests, automated testing is the process of running a predefined set of tests on your code to ensure that it's working as expected.
These tests verify that your code behaves as expected (especially after changes or additions) and alert you when something breaks. It removes the need for a human to manually test your code, and there's no need to run it on actual data.
Convenient, isn't it?
Types of Automated Testing
There are many different types of testing, and covering all of them is beyond the scope of this article.
Let's just focus on the two main types most relevant to a data scientist:
- Unit Tests
- Integration Tests
Unit Tests
These test the smallest pieces of code in isolation (e.g., a function).
The function should do one thing only to make it easy to test. Give it a known input, and check that the output is as expected.
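As a minimal sketch (the function and values here are hypothetical, not taken from the pipeline we'll see later):

```python
def clip_value(value: float, lower: float, upper: float) -> float:
    """Clamp a value into the range [lower, upper]."""
    return max(lower, min(value, upper))

def test_clip_value():
    # Known inputs -> check the outputs are as expected
    assert clip_value(150.0, 0.0, 100.0) == 100.0
    assert clip_value(-5.0, 0.0, 100.0) == 0.0
    assert clip_value(42.0, 0.0, 100.0) == 42.0
```

A test runner like pytest will discover and run any function prefixed with test_; more on running tests later in the article.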
Integration Tests

These test how multiple components work together.
For us data scientists, it means checking whether data loading, merging, and preprocessing steps produce the expected final dataset, given a known input dataset.
A Practical Example
Enough with the theory; let's see how it works in practice.
We will go through a simple example where a data scientist has written some code in a Jupyter notebook (or script), one that many data scientists will have seen in their jobs.
We will pick apart why the code is bad. Then, we'll attempt to make it better.
By better, we mean:
- Easy to test
- Easy to read
which ultimately means easy to maintain, because in the long run, good code is code that works, keeps working, and is easy to maintain.
We will then design some unit tests for our improved code, highlighting why the changes are beneficial for testing. To keep this article from becoming too long, I'll defer examples of integration testing to a future article.
Then, we will go through some rules of thumb for deciding what code to test.
Finally, we will cover how to run tests and how to structure projects.

Example Pipeline
We will use the following pipeline as an example:
# bad_pipeline.py
import pandas as pd

# Load data
df1 = pd.read_csv("data/users.csv")
df2 = pd.read_parquet("data/transactions.parquet")
df3 = pd.read_parquet("data/products.parquet")

# Preprocessing
# Merge user and transaction data
df = df2.merge(df1, how='left', on='user_id')
# Merge with product data
df = df.merge(df3, how='left', on='product_id')
# Filter for recent transactions
df = df[df['transaction_date'] > '2023-01-01']
# Calculate total price
df['total_price'] = df['quantity'] * df['price']
# Create customer segment
df['segment'] = df['total_price'].apply(lambda x: 'high' if x > 100 else 'low')
# Drop unnecessary columns
df = df.drop(['user_email', 'product_description', 'price'], axis=1)
# Group by user and segment to get total amount spent
df = df.groupby(['user_id', 'segment']).agg({'total_price': 'sum'}).reset_index()
# Save output
df.to_parquet("data/final_output.parquet")
In real life, we'd see hundreds of lines of code crammed into a single notebook. But this script exemplifies all the problems that need fixing in typical data science notebooks.
This code does the following:
- Loads user, transaction, and product data.
- Merges them into a unified dataset.
- Filters recent transactions.
- Adds calculated fields (total_price, segment).
- Drops irrelevant columns.
- Aggregates total spending per user and segment.
- Saves the result as a Parquet file.
Why is this pipeline bad?
Oh, there are so many reasons coding in this fashion is bad, depending on which lens you look at it through. It's not the content that's the problem, but how it's structured.
While there are many angles from which we could discuss the disadvantages of writing code this way, for this article we will focus on testability.
1. Tightly coupled logic (in other words, no modularity)
All operations are crammed into a single script and run at once. It's unclear what each part does unless you read every line. Even for a script this simple, that is difficult to do. In real-life scripts, it only gets worse as the code reaches hundreds of lines.
This makes it impossible to test.
The only way to do so would be to run the entire thing from start to finish, probably on the actual data you're going to use.
If your dataset is small, then perhaps you can get away with this. But typically, data scientists work with truck-loads of data, so it's infeasible to run any kind of test or sanity check quickly.
We want to be able to break the code up into manageable chunks that do one thing only, and do it well. Then, we can control what goes in, and make sure that what we expect comes out.
2. No Parameterization
Hardcoded file paths and values like 2023-01-01 make the code brittle and inflexible. Again, it's hard to test with anything but the live/production data.
There's no flexibility in how we can run the code; everything is fixed.
What's worse, as soon as you change something, you have no assurance that nothing's broken further down the script.
For example, how many times have you made a change you thought was benign, only to run the code and find a completely unexpected part of it breaking?
How to improve?
Now, let's see step by step how we can improve this code.
Please note, we will assume we're using the pytest module for our tests going forward.
1. A clear, configurable entry point
def run_pipeline(
    user_path: str,
    transaction_path: str,
    product_path: str,
    output_path: str,
    cutoff_date: str = '2023-01-01'
):
    # Load data
    ...
    # Process data
    ...
    # Save result
    ...
We start off by creating a single function that we can run from anywhere, with clear arguments that can be changed.
What does this achieve?
It allows us to run the pipeline under specific test conditions.
# GIVEN SOME TEST DATA
test_args = dict(
    user_path="/fake_users.csv",
    transaction_path="/fake_transaction.parquet",
    product_path="/fake_products.parquet",
    output_path="/fake_output.parquet",
    cutoff_date="",
)
# RUN THE PIPELINE THAT'S TO BE TESTED
run_pipeline(**test_args)
# TEST THE OUTPUT IS AS EXPECTED
output = ...
expected_output = ...
assert output == expected_output
Immediately, you can start passing in different inputs and parameters, depending on the edge case you want to test for.
It gives you the flexibility to run the code in different settings by making it easier to control the inputs and outputs of your code.
Writing your pipeline in this way paves the way for integration testing it. More on this in a later article.
2. Group code into meaningful chunks that do one thing, and do it well
Now, this is where a bit of art comes in – different people will organise code differently depending on which parts they find important.
There is no right or wrong answer, but the common-sense rule is to make sure a function does one thing and does it well. Do this, and it becomes easy to test.
One way we could group our code is as below:
def load_data(user_path: str, transaction_path: str, product_path: str):
    """Load data from specified paths"""
    df1 = pd.read_csv(user_path)
    df2 = pd.read_parquet(transaction_path)
    df3 = pd.read_parquet(product_path)
    return df1, df2, df3

def create_user_product_transaction_dataset(
    user_df: pd.DataFrame,
    transaction_df: pd.DataFrame,
    product_df: pd.DataFrame
):
    """Merge user, transaction, and product data into a single dataset.

    The dataset identifies which user bought what product at what time and price.

    Args:
        user_df (pd.DataFrame):
            A dataframe containing user information. Must have column
            'user_id' that uniquely identifies each user.
        transaction_df (pd.DataFrame):
            A dataframe containing transaction information. Must have
            columns 'user_id' and 'product_id' that are foreign keys
            to the user and product dataframes, respectively.
        product_df (pd.DataFrame):
            A dataframe containing product information. Must have
            column 'product_id' that uniquely identifies each product.

    Returns:
        A dataframe that merges the user, transaction, and product data
        into a single dataset.
    """
    df = transaction_df.merge(user_df, how='left', on='user_id')
    df = df.merge(product_df, how='left', on='product_id')
    return df

def drop_unnecessary_date_period(df: pd.DataFrame, cutoff_date: str):
    """Drop transactions that occurred before the cutoff date.

    Note:
        Anything before the cutoff date can be dropped because
        of .

    Args:
        df (pd.DataFrame): A dataframe with a column `transaction_date`
        cutoff_date (str): A date in the format 'yyyy-MM-dd'

    Returns:
        A dataframe with the transactions that occurred after the cutoff date
    """
    df = df[df['transaction_date'] > cutoff_date]
    return df

def compute_secondary_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute secondary features.

    Args:
        df (pd.DataFrame): A dataframe with columns `quantity` and `price`

    Returns:
        A dataframe with columns `total_price` and `segment`
        added to it.
    """
    df['total_price'] = df['quantity'] * df['price']
    df['segment'] = df['total_price'].apply(lambda x: 'high' if x > 100 else 'low')
    return df
What does the grouping achieve?
Better documentation
Well, first of all, you end up with some natural real estate in your code for adding docstrings. Why is this important? Well, have you ever tried reading your own code a month after writing it?
People forget details very quickly, and even code *you've* written can become undecipherable within just a few days.
It's vital to document what the code does, what it expects as input, and what it returns, at the very least.
Including docstrings in your code gives context and sets expectations for how a function should behave, making it easier to understand and debug failing tests in the future.
Better Readability
By 'encapsulating' the complexity of your code into smaller functions, you make it easier to read and understand the overall flow of a pipeline without having to read every single line of code.
def run_pipeline(
user_path: str,
transaction_path: str,
product_path: str,
output_path: str,
cutoff_date: str
):
user_df, transaction_df, product_df = load_data(
user_path,
transaction_path,
product_path
)
df = create_user_product_transaction_dataset(
user_df,
transaction_df,
product_df
)
df = drop_unnecessary_date_period(df, cutoff_date)
df = compute_secondary_features(df)
df.to_parquet(output_path)
You've provided the reader with a hierarchy of information, and it gives them a step-by-step breakdown of what's happening in the run_pipeline function through meaningful function names.
The reader then has the choice of looking at the function definition and the complexity within, depending on their needs.
The act of combining code into 'meaningful' chunks like this demonstrates the concepts of 'Encapsulation' and 'Abstraction'.
For more details on encapsulation, you can read my article on this here
Smaller units of code to test
Next, we have a very specific, well-defined set of functions that each do one thing. This makes them easier to test and debug, since we only have one thing to worry about.
See below for how we construct a test.
Constructing a Unit Test
1. Follow the AAA Pattern
def test_create_user_product_transaction_dataset():
# GIVEN
# RUN
# TEST
...
Firstly, we define a test function, appropriately named with a test_ prefix.
Then, we divide it into three sections:
- GIVEN: the inputs to the function, and the expected output. Set up everything required to run the function we want to test.
- RUN: run the function given the inputs.
- TEST: check the output of the function against the expected output.
This is a generic pattern that unit tests should follow. The standard name for this design pattern is the 'AAA pattern', which stands for Arrange, Act, Assert.
I don't find this naming intuitive, which is why I use GIVEN, RUN, TEST.
2. Arrange: set up the test
# GIVEN
user_df = pd.DataFrame({
    'user_id': [1, 2, 3], 'name': ["John", "Jane", "Bob"]
})
transaction_df = pd.DataFrame({
'user_id': [1, 2, 3],
'product_id': [1, 1, 2],
'extra-column1-str': ['1', '2', '3'],
'extra-column2-int': [4, 5, 6],
'extra-column3-float': [1.1, 2.2, 3.3],
})
product_df = pd.DataFrame({
'product_id': [1, 2], 'product_name': ["apple", "banana"]
})
expected_df = pd.DataFrame({
'user_id': [1, 2, 3],
'product_id': [1, 1, 2],
'extra-column1-str': ['1', '2', '3'],
'extra-column2-int': [4, 5, 6],
'extra-column3-float': [1.1, 2.2, 3.3],
    'name': ["John", "Jane", "Bob"],
'product_name': ["apple", "apple", "banana"],
})
Secondly, we define the inputs to the function, and the expected output. This is where we bake in our expectations about what the inputs will look like, and what the output should look like.
As you can see, we don't need to define every single column we expect to be present in production, only the ones that matter for the test.
For example, transaction_df defines the user_id and product_id columns properly, whilst also adding three extra columns of different types (str, int, float) to simulate the fact that there will be other columns.
The same goes for product_df and user_df, though these tables are expected to be dimension tables, so just defining the name and product_name columns will suffice.
3. Act: Run the function under test
# RUN
output_df = create_user_product_transaction_dataset(
user_df, transaction_df, product_df
)
Thirdly, we run the function with the inputs we defined, and capture the output.
4. Assert: Check the outcome is as expected
# TEST
pd.testing.assert_frame_equal(
output_df,
expected_df
)
and finally, we check whether the output matches the expected output.
Note, we use the pandas testing module since we're comparing pandas dataframes. For non-pandas dataframes, you can use the assert statement instead.
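As a side note, assert_frame_equal accepts flags that loosen the comparison when exact equality is stricter than you need. Two I reach for often are check_like (ignore column/row order) and check_dtype (tolerate dtype differences); a small sketch with made-up frames:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
right = pd.DataFrame({'b': [3.0, 4.0], 'a': [1, 2]})  # same data, columns reordered

# check_like=True ignores the order of columns and index
pd.testing.assert_frame_equal(left, right, check_like=True)

# check_dtype=False compares values but tolerates int64 vs float64
pd.testing.assert_frame_equal(
    pd.DataFrame({'a': [1, 2]}),
    pd.DataFrame({'a': [1.0, 2.0]}),
    check_dtype=False,
)
```

Without these flags, both comparisons above would raise an AssertionError.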
The full testing code will look like this:
import pandas as pd
def test_create_user_product_transaction_dataset():
# GIVEN
user_df = pd.DataFrame({
        'user_id': [1, 2, 3], 'name': ["John", "Jane", "Bob"]
})
transaction_df = pd.DataFrame({
'user_id': [1, 2, 3],
'product_id': [1, 1, 2],
'transaction_date': ["2021-01-01", "2021-01-01", "2021-01-01"],
'extra-column1': [1, 2, 3],
'extra-column2': [4, 5, 6],
})
product_df = pd.DataFrame({
'product_id': [1, 2], 'product_name': ["apple", "banana"]
})
expected_df = pd.DataFrame({
'user_id': [1, 2, 3],
'product_id': [1, 1, 2],
'transaction_date': ["2021-01-01", "2021-01-01", "2021-01-01"],
'extra-column1': [1, 2, 3],
'extra-column2': [4, 5, 6],
        'name': ["John", "Jane", "Bob"],
'product_name': ["apple", "apple", "banana"],
})
# RUN
output_df = create_user_product_transaction_dataset(
user_df, transaction_df, product_df
)
# TEST
pd.testing.assert_frame_equal(
output_df,
expected_df
)
To organise your tests better and make them cleaner, you can start using a combination of classes, fixtures, and parametrisation.
It's beyond the scope of this article to delve into each of these concepts in detail, so for those who are interested I provide the pytest How-To guide as a reference for these concepts.
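As a brief taste of parametrisation, here is how pytest.mark.parametrize would let us check several quantity/price combinations against compute_secondary_features, including the segment boundary, without duplicating the test body (the function is reproduced here so the snippet is self-contained):

```python
import pandas as pd
import pytest

def compute_secondary_features(df: pd.DataFrame) -> pd.DataFrame:
    """Same function as earlier in the article."""
    df['total_price'] = df['quantity'] * df['price']
    df['segment'] = df['total_price'].apply(lambda x: 'high' if x > 100 else 'low')
    return df

@pytest.mark.parametrize(
    "quantity, price, expected_segment",
    [
        (10, 20.0, 'high'),  # total_price = 200
        (1, 100.0, 'low'),   # boundary: 100 is not > 100
        (2, 10.0, 'low'),    # total_price = 20
    ],
)
def test_segment_assignment(quantity, price, expected_segment):
    # GIVEN a single-row frame, RUN the function, TEST the segment label
    df = pd.DataFrame({'quantity': [quantity], 'price': [price]})
    result = compute_secondary_features(df)
    assert result.loc[0, 'segment'] == expected_segment
```

pytest reports each parameter tuple as its own test case, so a failing boundary case is flagged individually.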

What to Test?
Now that we've created a unit test for one function, we turn our attention to the remaining functions. Astute readers will now be thinking:
"Wow, do I have to write a test for everything? That's a lot of work!"
Yes, it's true. It's extra code that you need to write and maintain.
But the good news is, you don't need to test absolutely everything; you just need to know what's important in the context of what your work is doing.
Below, I'll give you a few rules of thumb and considerations I weigh up when deciding what to test, and why.
1. Is the code critical to the outcome of the project?
There are critical junctures in a data science project that are pivotal to its success, many of which usually come at the data-preparation and model evaluation/explanation stages.
The example test we saw above for the create_user_product_transaction_dataset function is a good illustration.
This dataset will form the basis of all downstream modelling activity.
If the user -> product join is incorrect in any way, it will impact everything we do downstream.
Thus, it's worth taking the time to ensure this code works correctly.
At a bare minimum, the test we've established makes sure that after every code change, the function still behaves in exactly the same way as it used to.
Example
Suppose the join needs to be rewritten to improve memory efficiency.
After making the change, the unit test ensures the output stays the same.
If something were inadvertently altered such that the output started to look different (missing rows or columns, different datatypes), the test would immediately flag the issue.
2. Is the code primarily using third-party libraries?
Take the load_data function for example:
def load_data(user_path: str, transaction_path: str, product_path: str):
    """Load data from specified paths"""
df1 = pd.read_csv(user_path)
df2 = pd.read_parquet(transaction_path)
df3 = pd.read_parquet(product_path)
return df1, df2, df3
This function encapsulates the process of reading data from different files. Under the hood, all it does is call three pandas load functions.
The main value of this code is the encapsulation.
Meanwhile, it contains no business logic, and in my opinion, the function scope is so specific that you wouldn't expect any logic to be added in the future.
If it were, then the function name should be changed, since it would do more than just load data.
Therefore, this function does not require a unit test.
A unit test for this function would just be testing that pandas works properly, and we should be able to trust that pandas has tested their own code.
3. Is the code likely to change over time?
This point has already been implied in 1 & 2. For maintainability, this is perhaps the most important consideration.
You should be asking:
- How complex is the code? Are there many ways to achieve the same output?
- What could cause someone to modify this code? Is the data source susceptible to changes in the future?
- Is the code clear? Are there behaviours that could easily be missed during a refactor?
Take create_user_product_transaction_dataset for example.
- The input data may undergo schema changes in the future.
- Perhaps the dataset becomes larger, and we need to split the merge into multiple steps for performance reasons.
- Perhaps a dirty hack needs to go in temporarily to handle nulls caused by an issue with the data source.
In each case, a change to the underlying code may be necessary, and each time we need to ensure the output doesn't change.
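The date filter is a concrete example of such an easily-missed behaviour: drop_unnecessary_date_period uses a strict >, so transactions on the cutoff date itself are dropped, and a refactor could silently flip this to >=. A small test pins the behaviour down (the function is reproduced here so the snippet is self-contained):

```python
import pandas as pd

def drop_unnecessary_date_period(df: pd.DataFrame, cutoff_date: str):
    """Keep only transactions strictly after the cutoff date."""
    return df[df['transaction_date'] > cutoff_date]

def test_cutoff_is_exclusive():
    # GIVEN dates before, on, and after the cutoff
    df = pd.DataFrame({
        'transaction_date': ['2022-12-31', '2023-01-01', '2023-01-02'],
    })
    # RUN
    result = drop_unnecessary_date_period(df, '2023-01-01')
    # TEST: only dates strictly after the cutoff survive
    assert result['transaction_date'].tolist() == ['2023-01-02']
```

(Comparing ISO-formatted date strings lexicographically, as the pipeline does, gives the same order as comparing the dates themselves.)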
In contrast, load_data does nothing but load data from files.
I don't see this changing much in the future, apart from perhaps a change in file format. So I'd defer writing a test for it until a significant change to the upstream data source occurs (something like that would most likely require changing much of the pipeline anyway).
Where to Put Tests and How to Run Them
So far, we've covered how to write testable code and how to create the tests themselves.
Now, let's look at how to structure your project to include tests, and how to run them effectively.
Project Structure
Typically, a data science project can follow the structure below:
|-- data            # where data is stored
|-- conf            # where config files for your pipelines are stored
|-- src             # all the code needed to reproduce your project lives here
|-- notebooks       # all the code for one-off experiments, explorations, etc. lives here
|-- tests           # all the tests live here
|-- pyproject.toml
|-- README.md
|-- requirements.txt
The src folder should contain all the code that is critical for the delivery of your project.
General rule of thumb
If it's code you expect to run multiple times (with different inputs or parameters), it should go in the src folder.
Examples include:
- data processing
- feature engineering
- model training
- model evaluation
Meanwhile, anything that's a one-off piece of analysis can live in Jupyter notebooks, stored in the notebooks folder.
This mainly consists of:
- EDA
- ad-hoc model experimentation
- analysis of local model explanations
Why?
Because Jupyter notebooks are notoriously flaky, difficult to manage, and hard to test. We don't want to be rerunning critical code via notebooks.
The Test Folder Structure
Let's say your src folder looks like this:
src
|-- pipelines
|-- data_processing.py
|-- feature_engineering.py
|-- model_training.py
|-- __init__.py
Each file contains functions and pipelines, similar to the example we saw above.
The tests folder should then look like this:
tests
|-- pipelines
|-- test_data_processing.py
|-- test_feature_engineering.py
|-- test_model_training.py
where the tests directory mirrors the structure of the src directory and each file starts with the test_ prefix.
The reasoning for this is simple:
- It's easy to find the tests for a given file, since the tests folder structure mirrors the src folder.
- It keeps test code cleanly separated from source code.
Running Tests
Once you have your tests set up like above, you can run them in a variety of ways:
1. Via the terminal
pytest -v
2. Via a code editor
I use this for all my projects.
Visual Studio Code is my editor of choice; it auto-discovers the tests for me, and it's super easy to debug them.
After having a read of the docs, I don't think there's any point in me re-iterating their contents since they're quite self-explanatory, so here's the link:
Similarly, most code editors have comparable capabilities, so there's no excuse for not writing tests.
It really is easy; read the docs and get started.
3. Via a CI pipeline (e.g. GitHub Actions, GitLab, etc.)
It's easy to set up tests to run automatically on pull requests via GitHub.
The idea is that whenever you open a PR, it will automatically discover and run the tests for you.
This means that even if you forget to run the tests locally via option 1 or 2, they will always be run for you whenever you want to merge your changes.
Again, no point in me re-iterating the docs; here's the link
The End Goal We Want To Achieve
Following on from the instructions above, I think it's a better use of both of our time to highlight some important points about what we want to achieve through automated tests, rather than regurgitating instructions you can find in the links above.
First and foremost, automated tests are written to establish trust in your code, and to minimise human error.
This is for the benefit of:
- Yourself
- Your team
- and the business as a whole.
Therefore, to truly get the most out of the tests you've written, you need to get round to setting up a CI pipeline.
It makes a world of difference being able to forget to run the tests locally, and still have the assurance that they will be run when you create a PR or push some changes.
You don't want to be the person responsible for a bug that causes a production incident because you forgot to run the tests, or the one who missed a bug during a PR review.
So please, if you write some tests, invest a little time in setting up a CI pipeline. Read the GitHub docs, I implore you. It's trivial to set up, and it will do you wonders.
Final Remarks
After reading this article, I hope I've impressed upon you:
- The importance of writing tests, especially within the context of data science
- How easy it is to write and run them
But there is one final reason why you should know how to write automated tests.
That reason is that
Data Science is changing.
Data science used to be largely proof-of-concept work: building models in Jupyter notebooks and handing them to engineers for deployment. Along the way, data scientists built up a notoriety for writing terrible code.
But now, the industry has matured.
It's becoming easier to quickly build and deploy models as ML-Ops and ML engineering mature.
Thus,
- model building
- deployment
- retraining
- maintenance
are becoming the responsibility of machine learning engineers.
At the same time, the data wrangling we used to do has become so complex that it is now being specialised out to dedicated data engineering teams.
As a result, data science sits in a very narrow space between these two disciplines, and quite soon the lines between data scientist and data analyst will blur.
The trajectory is that data scientists will no longer be building cutting-edge models, but will become more business- and product-focused, producing insights and MI reports instead.
If you want to stay closer to the model building, it no longer suffices just to code.
You need to learn how to write code properly, and how to maintain it well. Machine learning is no longer a novelty; it's no longer just PoCs; it's becoming software engineering.
If You Want To Learn More
If you want to learn more about software engineering skills applied to Data Science, here are some related articles:
You can also become a Team Member on Patreon here!
We have dedicated discussion threads for all articles; ask me questions about automated testing, discuss the topic in more detail, and share experiences with other data scientists. The learning doesn't have to stop here.
You can find the dedicated discussion thread for this article here.

