science workflows, teams often need access to a shared dataset that stays fully synchronized and can't be modified, e.g., in distributed machine learning environments where multiple teams rely on the exact same feature set.
In this article, I'll walk through a simple, fee-free method for cryptographically hashing a dataset of any size and storing its hash immutably on the Ethereum blockchain, creating a permanent and verifiable record of the dataset's integrity.
This method can also easily be extended to model weights, specific transformations that must be applied in a consistent manner, source code, or other data that needs to be immutable and verifiable.
🤔 Why Integrity Matters
If you're at least somewhat familiar with data science as a practice, you're already aware of the importance of data integrity. Even small changes or errors in the input data can collapse a project.
Modern machine learning models are extremely sensitive to their training data. Missing normalization steps, a modified CSV file, shuffled rows, corrupted features, or mismatches between training and validation datasets can produce dramatically different results.
Integrity failures are difficult to detect and can derail a project.
Models may still train and appear to function normally, but metrics degrade slowly, drift accrues, or experiments become impossible to reproduce. Integrity is doubly important when the team is distributed, possibly across different organizations, and has to work on different versions of the same problem.
🔐 Using a Cryptographic Hash as a "Source of Truth"
A cryptographic hash gives us a simple and very useful mechanism for verifying data integrity.
A brief primer on cryptographic hashes
A hash function takes any amount of input data (bytes) and deterministically produces a fixed-length output called a hash or digest. Cryptographic hashes are foundational in computer science, as you're most likely already aware.
The key is determinism:
Same data in → same hash out
Even a single byte changed in the input data produces a completely different hash.
Because of this property, hashes act as effectively unique fingerprints for data and are extremely useful for verifying integrity. There are many flavors of hash functions, and some are more useful for this task, as I'll describe.
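To see the determinism property in action, here is a minimal sketch using Python's standard hashlib. The two byte strings are made-up sample data, and BLAKE2b is simply the algorithm used later in this article.

```python
import hashlib

# Two made-up inputs that differ in a single byte (0.75 vs 0.76)
original = b"feature_1,feature_2\n0.25,0.75\n"
tampered = b"feature_1,feature_2\n0.25,0.76\n"

digest_a = hashlib.blake2b(original).hexdigest()
digest_b = hashlib.blake2b(original).hexdigest()
digest_c = hashlib.blake2b(tampered).hexdigest()

print(digest_a == digest_b)  # True: same data in, same hash out
print(digest_a == digest_c)  # False: one changed byte, different hash
print(len(digest_a))         # 128 hex characters (a 64-byte digest)
```

The digest length depends only on the algorithm, never on the input size.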
How does this apply to datasets?
Because of the hash function's determinism, once applied to a dataset, we can quickly and reliably test whether the dataset is identical to what we expect.
This is exceptionally useful with large datasets that are used by multiple teams and multiple companies, and that move from one version to the next. Team 1 at Research Group Alpha creates features 1-10, Team 2 at Research Group Zeta creates features 10-100, System X consumes version Y, and so on.
We no longer need to question the data. We simply compute the hash function over the dataset and compare it to the hash computed at a reference point. If it matches, OK. If not, something has changed.
Hashing is extremely efficient. A hash function streams over the data in a single pass, so whether the dataset is 10MB or 10TB, we end up with a small, fixed-size string that can be shared, stored or published.
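As a rough illustration of why size isn't a problem, the sketch below hashes a stream in fixed-size chunks so memory use stays constant. The helper name hash_stream and the in-memory 5 MiB stand-in are assumptions for the example, not part of the workflow described in this article.

```python
import hashlib
import io

def hash_stream(stream, chunk_size=1024 * 1024):
    """Hash a binary stream chunk by chunk (hypothetical helper)."""
    h = hashlib.blake2b()
    while chunk := stream.read(chunk_size):
        h.update(chunk)
    return h.hexdigest()

small = b"x" * 10
large = b"x" * (5 * 1024 * 1024)  # 5 MiB stand-in for a large dataset

streamed = hash_stream(io.BytesIO(large))
one_shot = hashlib.blake2b(large).hexdigest()

print(streamed == one_shot)  # chunked and one-shot hashing agree
print(len(hash_stream(io.BytesIO(small))), len(streamed))  # same length
```

Because the chunked result equals the one-shot result, a file never needs to fit in memory to be fingerprinted.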
🧐 Why Use Ethereum as an Immutable Store?
This is the really useful piece of this article.
Ethereum, again, as you're already aware, is a blockchain. This gives us:
- Immutability: a transaction can never be modified
- Distributed availability: always available without central authority
- Permanent: once written, it's permanently accessible
But Ethereum is for transactions? Don't we need to write a complicated smart contract for this specialized purpose?
You could indeed. But we don't have to.
The clever bit is the use of a rarely used input data field in an Ethereum transaction, known as "calldata."
But Ethereum transactions cost real money (gas, fees, and so on)?
Also true. On Ethereum, you're charged "gas" for each byte in the input data. On the mainnet, with a price of $2,000 per ETH, this would cost us between $0.04 and $0.10 per hash. This doesn't include the gas required for the transaction itself to be included by a block validator, which can be hefty depending on the network's current load.
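For a feel of where numbers like these come from, here is a back-of-the-envelope sketch. It assumes the long-standing calldata schedule of 16 gas per non-zero byte and 4 gas per zero byte (later protocol upgrades may price data-heavy transactions differently) and gas prices of 20 and 50 gwei. Those gas prices are my own illustrative assumptions, chosen because they land in the range quoted above.

```python
# Assumed calldata gas schedule (not taken from this article)
GAS_PER_NONZERO_BYTE = 16
GAS_PER_ZERO_BYTE = 4
WEI_PER_GWEI = 10**9
WEI_PER_ETH = 10**18

def calldata_gas(payload: bytes) -> int:
    """Gas charged for the calldata bytes alone, excluding the base fee."""
    zero_bytes = payload.count(0)
    nonzero_bytes = len(payload) - zero_bytes
    return zero_bytes * GAS_PER_ZERO_BYTE + nonzero_bytes * GAS_PER_NONZERO_BYTE

def cost_usd(gas: int, gas_price_gwei: float, eth_price_usd: float) -> float:
    """Convert a gas amount to USD at the given gas and ETH prices."""
    return gas * gas_price_gwei * WEI_PER_GWEI / WEI_PER_ETH * eth_price_usd

payload = bytes([0xAB]) * 64       # worst case: 64-byte digest, no zero bytes
gas = calldata_gas(payload)        # 64 * 16 = 1024 gas
low = cost_usd(gas, 20, 2000)      # assumed 20 gwei gas price
high = cost_usd(gas, 50, 2000)     # assumed 50 gwei gas price

print(gas, round(low, 4), round(high, 4))
```

Under these assumptions the calldata for one 64-byte digest costs roughly $0.04 to $0.10, with the transaction's base gas charged on top.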
Let's make this more clever. 🦊
By offloading everything to the "testnet", which nearly every blockchain has, we can make this entirely free.
Sepolia (an Ethereum testnet) isn't used unless you're a developer of smart contracts. Sepolia ETH is free and publicly available from faucets.
This means we can create an unlimited number of transactions on the publicly available testnet for free!
As long as our input data is reasonably sized, Sepolia provides a way to use the blockchain for unlimited data storage, with mostly the same properties as the mainnet*
* Testnets like Sepolia aren't permanent, but can mostly be trusted for several years. If you need absolute permanence, you'll have to pay for it using the mainnet.
Remember, we're not storing the actual data on-chain. Just the fingerprint.
⚙️ The process
First, we need a way to reliably create transactions on Ethereum.
Despite seeming complex, this is actually very easy. We don't need any additional software or wallet tech. A wallet is nothing more than an address, paired with a private key used to sign transactions.
To create an Ethereum transaction, we create a Python object with the required keys and format, sign it with our key, and broadcast it to the network. A validator then picks up our transaction from the "mempool" and includes it in a block.
As long as we include all the required fields and the transaction checks out, it is typically included in a block within ~12 seconds and becomes a permanent part of the blockchain.
Step 1: Create the address and private key with a few lines of code (eth_account is installed with web3.py)
from eth_account import Account

# Generate a new random private key and its matching address
account = Account.create()
print("Address:", account.address)
# Keep the private key secret: anyone holding it can sign as this account
print("Private Key:", account.key.hex())
Step 2: Get some ETH on Sepolia. Plug in your address here and wait about 12 seconds. Thanks Google!
Step 3: Hash the dataset
As I mentioned, some hashes are better suited to this process. We could use a SHA256 hash, but Blake2b is actually better for throughput. Really, any cryptographic hash function will work.
Use this function to hash the data.
import hashlib
from pathlib import Path

def hash_dataset(dataset, algorithm="blake2b", chunk_size=1024 * 1024):
    h = hashlib.new(algorithm)

    def update(obj):
        # A path to an existing file: stream its bytes in chunks
        if isinstance(obj, (str, Path)) and Path(obj).is_file():
            with open(obj, "rb") as f:
                while chunk := f.read(chunk_size):
                    h.update(chunk)
        elif isinstance(obj, bytes):
            h.update(obj)
        # Any other string (including a path that does not exist) is hashed as text
        elif isinstance(obj, str):
            h.update(obj.encode("utf-8"))
        # Dicts are walked in sorted key order so the result is deterministic
        elif isinstance(obj, dict):
            for k in sorted(obj.keys()):
                update(k)
                update(obj[k])
        elif isinstance(obj, (list, tuple)):
            for item in obj:
                update(item)
        # Sets have no order, so sort them (by string form if items don't compare)
        elif isinstance(obj, set):
            try:
                for item in sorted(obj):
                    update(item)
            except TypeError:
                for item in sorted(obj, key=str):
                    update(item)
        elif hasattr(obj, "__iter__"):
            for item in obj:
                update(item)
        else:
            h.update(repr(obj).encode("utf-8"))

    update(dataset)
    return h.hexdigest()

digest = hash_dataset("hugedataset.parquet", algorithm="blake2b")
Step 4: Write, sign and publish a transaction with the hash of our dataset.
Using the web3.py library, we can structure our transaction as a Python dict, and then publish it to the network.
We need a provider to broadcast our transaction (we don't have a node). Here we use Infura, but there are others, like Alchemy.
Just note that we add the hex prefix "0x" to the hash calculated on our dataset. We need to account for it when we validate our hash.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://sepolia.infura.io/v3/YOUR_KEY"))
dataset_hash = "0x" + digest
account = w3.eth.account.from_key("YOUR_PRIVATE_KEY")
tx = {
    "to": account.address,  # self-send (no contract required)
    "value": 0,  # no ETH transfer
    "gas": 50_000,
    "maxFeePerGas": w3.to_wei("20", "gwei"),
    "maxPriorityFeePerGas": w3.to_wei("2", "gwei"),
    "nonce": w3.eth.get_transaction_count(account.address),
    "chainId": 11155111,  # Sepolia testnet
    "data": dataset_hash
}
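As a side note on the "0x" prefix used for dataset_hash: it only marks the string as hexadecimal and is not part of the digest itself. A small offline sketch of adding and stripping it, where the sample input bytes are made up for illustration:

```python
import hashlib

# Made-up input standing in for the real dataset
digest = hashlib.blake2b(b"example dataset bytes").hexdigest()
dataset_hash = "0x" + digest  # the form placed in the transaction's data field

# The bytes the hex string represents: 64 bytes for a default BLAKE2b digest
calldata_bytes = bytes.fromhex(dataset_hash[2:])

# Stripping the prefix recovers the original hex digest for comparison
recovered = dataset_hash[2:] if dataset_hash.startswith("0x") else dataset_hash

print(len(calldata_bytes))  # 64
print(recovered == digest)  # True
```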
Sign it and send it off. Here, we wait until the transaction is included in a block.
# Sign locally with the private key, then broadcast the raw signed bytes
signed_tx = account.sign_transaction(tx)
# Note: newer web3.py / eth-account releases name this attribute raw_transaction
tx_hash = w3.eth.send_raw_transaction(signed_tx.rawTransaction)
print("Broadcast tx hash:", tx_hash.hex())
# Wait for mining / inclusion in a block
tx_receipt = w3.eth.wait_for_transaction_receipt(tx_hash)
print("Transaction mined in block:", tx_receipt["blockNumber"])
print("Status:", tx_receipt["status"])  # 1 means success
Be sure to retain the transaction hash.
Step 5: Create a metadata record to store alongside our dataset
Here, we create a simple piece of metadata, which can be stored in a database (DynamoDB, MongoDB) or alongside our data object directly (S3, Google Cloud Storage).
The metadata might look something like this:
{
    "dataset_id": "feature_set_v42",
    "dataset_uri": "s3://ml-bucket/features/v42.parquet",
    "dataset_hash": "0x9f3c...ab21",
    "tx_hash": "0x7c1a...e91d",
    "timestamp_unix": 1730000000,
    "hash_algorithm": "blake2b",
    "creator": "0xabc123...",
    "notes": "normalized features"
}
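One way to assemble such a record in Python is sketched below. The helper name build_metadata and every value passed to it are placeholders invented for illustration, not real digests, transactions or addresses.

```python
import json
import time

def build_metadata(dataset_id, dataset_uri, dataset_hash, tx_hash, creator, notes=""):
    """Assemble the metadata record as a plain dict (hypothetical helper)."""
    return {
        "dataset_id": dataset_id,
        "dataset_uri": dataset_uri,
        "dataset_hash": dataset_hash,
        "tx_hash": tx_hash,
        "timestamp_unix": int(time.time()),
        "hash_algorithm": "blake2b",
        "creator": creator,
        "notes": notes,
    }

record = build_metadata(
    dataset_id="feature_set_v42",
    dataset_uri="s3://ml-bucket/features/v42.parquet",
    dataset_hash="0x" + "ab" * 64,  # placeholder, not a real digest
    tx_hash="0x" + "cd" * 32,       # placeholder, not a real transaction
    creator="0x" + "ef" * 20,       # placeholder, not a real address
    notes="normalized features",
)

# Serialize with sorted keys so the stored record is stable across runs
serialized = json.dumps(record, indent=2, sort_keys=True)
print(serialized)
```

Recording hash_algorithm in the metadata matters: a verifier has to recompute the digest with the same algorithm that was used at publish time.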
Step 6: Whenever reading the dataset, validate that its hash matches the original hash stored alongside our data
The final step of the process combines three actions:
- Fetch the Ethereum transaction
- Extract the dataset hash from calldata
- Compare it to a locally recomputed hash
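from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://sepolia.infura.io/v3/YOUR_KEY"))

def verify_dataset(dataset_path, tx_hash):
    # Fetch the transaction and read its calldata (the "input" field)
    tx = w3.eth.get_transaction(tx_hash)
    raw_input = tx["input"]
    # Normalize to lowercase hex without "0x": whether .hex() includes the
    # prefix depends on the installed library version
    onchain_hash = (raw_input.hex() if hasattr(raw_input, "hex") else str(raw_input)).lower()
    onchain_hash = onchain_hash[2:] if onchain_hash.startswith("0x") else onchain_hash
    # Recompute the hash locally with the same algorithm used when publishing
    computed_hash = hash_dataset(dataset_path).lower()
    if computed_hash != onchain_hash:
        raise ValueError(f"Integrity FAILED: Local {computed_hash} != On-chain {onchain_hash}")
    print("Integrity check PASSED. Dataset matches the blockchain record.")
    return True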
That's it!
An important note: this doesn't prevent anyone from rewriting our metadata object. However, there are many ways to prevent modification of a small piece of metadata internally, like audit databases or S3 Object Lock.
Wrapping up
Ultimately, using a cryptographic hash to verify dataset integrity is a lightweight approach to a heavy problem.
Some natural extensions include using this method to verify model weights, or even hashing pieces of source code to ensure preprocessing is consistent.
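As a sketch of what such an extension could look like, the snippet below folds several artifacts (say, a weights file and a preprocessing script) into one digest by hashing each file and combining the results in sorted path order. The helper names, file names and file contents are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def hash_file(path, chunk_size=1024 * 1024):
    """BLAKE2b digest of a single file, read in chunks."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def hash_artifacts(root):
    """Combine relative paths and per-file digests, in sorted order, into one digest."""
    root = Path(root)
    combined = hashlib.blake2b()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        combined.update(path.relative_to(root).as_posix().encode("utf-8"))
        combined.update(b"\0")  # separator between path and digest
        combined.update(hash_file(path).encode("ascii"))
        combined.update(b"\n")  # separator between entries
    return combined.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "weights.bin").write_bytes(b"\x00\x01\x02")
    Path(tmp, "preprocess.py").write_text("def scale(x):\n    return x / 255\n")
    before = hash_artifacts(tmp)
    unchanged = hash_artifacts(tmp)
    # Simulate a silent change to the preprocessing code
    Path(tmp, "preprocess.py").write_text("def scale(x):\n    return x / 256\n")
    after = hash_artifacts(tmp)

print(before == unchanged)  # True: nothing changed
print(before == after)      # False: the preprocessing script was edited
```

The combined digest can then be published and verified exactly like a single dataset hash.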
Whether you're collaborating across distributed, open-source teams, building reproducible research, or simply creating an audit trail for compliance, the blockchain is a nice, impartial notary for your data. You don't have to trust the infrastructure; you just have to trust the math.

