During the telecommunication boom, Claude Shannon, in his seminal 1948 paper¹, posed a question that would revolutionise technology:
How can we quantify communication?
Shannon’s findings remain fundamental to expressing information quantification, storage, and communication. These insights made major contributions to the creation of technologies ranging from signal processing and data compression (e.g., Zip files and compact discs) to the Internet and artificial intelligence. More broadly, his work has significantly impacted diverse fields such as neurobiology, statistical physics and computer science (e.g., cybersecurity, cloud computing, and machine learning).
[Shannon’s paper is the] Magna Carta of the Information Age
This is the first article in a series that explores information quantification – an essential tool for data scientists. Its applications range from enhancing statistical analyses to serving as a go-to decision heuristic in cutting-edge machine learning algorithms.
Broadly speaking, quantifying information is assessing uncertainty, which may be phrased as: “how surprising is an outcome?”.
This article idea quickly grew into a series since I found this topic both fascinating and diverse. Most researchers, at one stage or another, come across commonly used metrics such as entropy, cross-entropy/KL-divergence and mutual information. Diving into this topic I found that in order to fully appreciate these one needs to learn a bit about the basics, which we cover in this first article.
By reading this series you will gain an intuition and tools to quantify:
- Bits/Nats – Unit measures of information.
- Self-Information – The amount of information in a specific event.
- Pointwise Mutual Information – The amount of information shared between two specific events.
- Entropy – The average amount of information of a variable’s outcome.
- Cross-entropy – The misalignment between two probability distributions (also expressed by its derivative, KL-divergence, a distance-like measure).
- Mutual Information – The co-dependency of two variables by their conditional probability distributions. It expresses the information gain of one variable given another.
No prior knowledge is required – just a basic understanding of probabilities.
I demonstrate using common statistics such as coin and dice 🎲 tosses as well as machine learning applications such as supervised classification, feature selection, model monitoring and clustering assessment. As for real-world applications I’ll discuss a case study of quantifying DNA diversity 🧬. Finally, for fun, I also apply these tools to the popular brain twister commonly known as the Monty Hall problem 🚪🚪 🐐 .
Throughout I provide Python code 🐍 , and try to keep formulas as intuitive as possible. If you have access to an integrated development environment (IDE) 🖥 you might want to plug 🔌 and play 🕹 around with the numbers to gain a better intuition.
This series is divided into four articles, each exploring a key aspect of information theory:
- 😲 Quantifying Surprise: 👈 👈 👈 YOU ARE HERE. In this opening article, you’ll learn how to quantify the “surprise” of an event using _self-information_ and understand its units of measurement, such as _bits_ and _nats_. Mastering self-information is essential for building intuition about the subsequent concepts, as all later heuristics are derived from it.
- 🤷 Quantifying Uncertainty: Building on self-information, this article shifts focus to the uncertainty – or “average surprise” – associated with a variable, known as entropy. We’ll dive into entropy’s wide-ranging applications, from machine learning and data analysis to solving fun puzzles, showcasing its adaptability.
- 📏 Quantifying Misalignment: Here, we’ll explore how to measure the distance between two probability distributions using entropy-based metrics like cross-entropy and KL-divergence. These measures are particularly valuable for tasks like comparing predicted versus true distributions, as in classification loss functions and other alignment-critical scenarios.
- 💸 Quantifying Gain: Expanding from single-variable measures, this article investigates the relationships between two. You’ll discover how to quantify the information gained about one variable (e.g., target Y) by knowing another (e.g., predictor X). Applications include assessing variable associations, feature selection, and evaluating clustering performance.
Each article is crafted to stand alone while offering cross-references for deeper exploration. Together, they provide a practical, data-driven introduction to information theory, tailored for data scientists, analysts and machine learning practitioners.
Disclaimer: Unless otherwise mentioned, the formulas analysed are for categorical variables with c≥2 classes (2 meaning binary). Continuous variables will be addressed in a separate article.
🚧 Articles (3) and (4) are currently under construction. I’ll share links once available. Follow me to be notified 🚧
Quantifying Surprise with Self-Information
Self-information is considered the building block of information quantification.
It is a way of quantifying the amount of “surprise” of a specific outcome.
Formally, self-information, also called Shannon information or information content, quantifies the surprise of an event x occurring based on its probability, p(x). Here we denote it as hₓ:
hₓ = -log₂(p(x))
The units of measure are called bits. One bit (binary digit) is the amount of information for an event x that has a probability of p(x)=½. Let’s plug in to verify: hₓ=-log₂(½)=log₂(2)=1 bit.
This heuristic serves as an alternative to probabilities, odds and log-odds, with certain mathematical properties which are advantageous for information theory. We discuss these below when learning about Shannon’s axioms behind this choice.
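As a quick check of the definition, here is a minimal sketch using only Python’s standard library. The function name `self_information` is my own choice for illustration, and it is written as log₂(1/p), which is the same quantity as -log₂(p).

```python
import math


def self_information(p: float) -> float:
    """Return the self-information of an event with probability p, in bits."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    # log2(1/p) equals -log2(p); this form avoids printing -0.0 when p == 1
    return math.log2(1 / p)


print(self_information(0.5))  # 1.0 -> a fair coin flip carries one bit
print(self_information(1.0))  # 0.0 -> a certain event carries no information
```

Probabilities of zero are rejected here because the logarithm is undefined at p=0.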
It’s always informative to explore how an equation behaves with a graph:
[Figure: self-information hₓ in bits as a function of the probability p(x), with highlighted points at p=1, p=½ and p=⅛.]
To deepen our understanding of self-information, we’ll use this graph to explore the said axioms that justify its logarithmic formulation. Along the way, we’ll also build intuition about key features of this heuristic.
To emphasise the logarithmic nature of self-information, I’ve highlighted three points of interest on the graph:
- At p=1 an event is guaranteed, yielding no surprise and hence zero bits of information. A useful analogy is a trick coin (where both sides show HEAD).
- Reducing the probability by a factor of two (p=½) increases the information to hₓ=1 bit. This, of course, is the case of a fair coin.
- Further reducing it by a factor of four results in hₓ(p=⅛)=3 bits.
If you are interested in coding the graph, here is a Python script:
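A minimal sketch of such a script, assuming matplotlib is installed. The function names `self_information_curve` and `plot_self_information` are illustrative choices, and the styling is not meant to reproduce the figure exactly.

```python
import math


def self_information_curve(n_points: int = 200):
    """Return probabilities on (0, 1] and their self-information in bits."""
    ps = [i / n_points for i in range(1, n_points + 1)]
    hs = [math.log2(1 / p) for p in ps]  # same as -log2(p)
    return ps, hs


def plot_self_information():
    """Plot the curve and mark p = 1, 1/2 and 1/8 (requires matplotlib)."""
    import matplotlib.pyplot as plt

    ps, hs = self_information_curve()
    plt.plot(ps, hs)
    for p in (1, 1 / 2, 1 / 8):
        plt.scatter([p], [math.log2(1 / p)])  # highlighted points of interest
    plt.xlabel("probability p(x)")
    plt.ylabel("self-information (bits)")
    plt.title("Self-information as a function of probability")
    plt.show()
```

Calling `plot_self_information()` draws the curve. The data itself comes from `self_information_curve()`, so you can inspect the numbers without plotting.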
To summarise this section:
Self-information hₓ=-log₂(p(x)) quantifies the amount of “surprise” of a specific outcome x.
Three Axioms
Referencing prior work by Ralph Hartley, Shannon chose -log₂(p) so as to satisfy three axioms. We’ll use the equation and graph to examine how these are manifested:
- An event with probability 100% is not surprising and hence does not yield any information. In the trick coin case this is evident by p(x)=1 yielding hₓ=0.
- Less probable events are more surprising and provide more information. This is apparent by self-information decreasing monotonically with increasing probability.
- The property of additivity: the total self-information of two independent events equals the sum of individual contributions. This will be explored further in the upcoming fourth article on Mutual Information.
There are mathematical proofs (which are beyond the scope of this series) that show that only the log function adheres to all three².
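The additivity axiom is easy to check numerically. Below is a minimal sketch that assumes two independent events of my own choosing: a fair coin landing heads and a fair eight-sided die landing on a given side. Because they are independent, the joint probability is the product of the two.

```python
import math


def self_information(p: float) -> float:
    """Self-information in bits, written as log2(1/p)."""
    return math.log2(1 / p)


p_heads = 1 / 2  # fair coin
p_side = 1 / 8   # fair eight-sided die

# Independence: the joint probability is the product of the two probabilities
p_joint = p_heads * p_side

h_joint = self_information(p_joint)
h_sum = self_information(p_heads) + self_information(p_side)

print(h_joint, h_sum)  # 4.0 4.0
```

The product of probabilities turns into a sum of bits, which is the behaviour the logarithm provides.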
The application of these axioms reveals several intriguing and practical properties of self-information:
Important properties:
- Minimal bound: The first axiom, hₓ(p=1)=0, together with monotonicity, establishes that self-information is non-negative, with zero as its lower bound. This is highly practical for many applications.
- Monotonically decreasing: The second axiom ensures that self-information decreases monotonically with increasing probability.
- No maximum bound: At the extreme where p→0, monotonicity leads to self-information growing without bound, hₓ(p→0) → ∞, a feature that requires careful consideration in some contexts. However, when averaging self-information – as we will later see in the calculation of entropy – probabilities act as weights, effectively limiting the contribution of highly improbable events to the overall average. This relationship will become clearer when we explore entropy in detail.
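To get a feeling for the last property, here is a small sketch with probabilities picked purely for illustration. It shows the surprise growing as p shrinks while the probability-weighted term p·hₓ, the kind of term that gets averaged in entropy, shrinks towards zero.

```python
import math


def self_information(p: float) -> float:
    """Self-information in bits, written as log2(1/p)."""
    return math.log2(1 / p)


probabilities = [0.5, 1e-2, 1e-4, 1e-8]
surprises = [self_information(p) for p in probabilities]
weighted = [p * h for p, h in zip(probabilities, surprises)]

for p, h, w in zip(probabilities, surprises, weighted):
    print(f"p={p:g}: h={h:6.2f} bits, weighted contribution p*h={w:.2e}")
```

The rarer the event, the more bits it carries, yet its weighted contribution becomes negligible.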
It’s useful to understand the close relationship to log-odds. To do so we define p(x) as the probability of event x happening and p(¬x)=1-p(x) as the probability of it not happening. Then log-odds(x) = log₂(p(x)/p(¬x)) = h(¬x) – h(x).
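This identity can be verified with a few lines of Python. It is a minimal sketch in which the helper names and the example probability p=0.75 are assumptions for illustration.

```python
import math


def self_information(p: float) -> float:
    """Self-information in bits, written as log2(1/p)."""
    return math.log2(1 / p)


def log_odds(p: float) -> float:
    """Base-2 log-odds of an event with probability p."""
    return math.log2(p / (1 - p))


p = 0.75
lhs = log_odds(p)
rhs = self_information(1 - p) - self_information(p)  # h(not x) - h(x)

print(f"log-odds = {lhs:.3f}, h(not x) - h(x) = {rhs:.3f}")
```

Both values agree (about 1.585 bits for p=0.75), so log-odds can be read as the difference in surprise between the event not happening and happening.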
The main takeaways from this section are:
Axiom 1: An event with probability 100% is not surprising.
Axiom 2: Less probable events are more surprising and, when they occur, provide more information.
Self-information (1) monotonically decreases (2) with a minimal bound of zero and (3) no upper bound.
In the next two sections we further discuss units of measure and choice of normalisation.
Information Units of Measure
Bits or Shannons?
A bit, as mentioned, represents the amount of information associated with an event that has a 50% probability of occurring.
The term is also sometimes referred to as a Shannon, a naming convention proposed by mathematician and physicist David MacKay to avoid confusion with the term ‘bit’ in the context of digital processing and storage.
After some deliberation, I decided to use ‘bit’ throughout this series for several reasons:
- This series focuses on quantifying information, not on digital processing or storage, so ambiguity is minimal.
- Shannon himself, inspired by mathematician and statistician John Tukey, used the time period ‘bit’ in his landmark paper.
- ‘Bit’ is the standard term in much of the literature on information theory.
- For convenience – it’s more concise.
Normalisation: Log Base 2 vs. Natural
Throughout this series we use base 2 for logarithms, reflecting the intuitive notion of a 50% chance of an event as a fundamental unit of information.
An alternative commonly used in machine learning is the natural logarithm, which introduces a different unit of measure called nats (short for natural units of information). One nat corresponds to the information gained from an event occurring with a probability of 1/e, where e is Euler’s number (≈2.71828). In other words, 1 nat = -ln(p=(1/e)).
The relationship between bits (base 2) and nats (natural log) is as follows:
1 bit = ln(2) nats ≈ 0.693 nats.
Think of it as similar to a monetary currency exchange or converting centimeters to inches.
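The conversion can be sketched in a few lines of Python. The helper names `bits_to_nats` and `nats_to_bits` are hypothetical and used here only for illustration.

```python
import math

LN2 = math.log(2)  # nats per bit, approximately 0.693


def bits_to_nats(bits: float) -> float:
    """Convert an amount of information from bits to nats."""
    return bits * LN2


def nats_to_bits(nats: float) -> float:
    """Convert an amount of information from nats to bits."""
    return nats / LN2


p = 0.25
h_bits = math.log2(1 / p)  # 2 bits
h_nats = math.log(1 / p)   # the same surprise measured in nats

print(f"{h_bits:.3f} bits = {bits_to_nats(h_bits):.3f} nats")
print(f"{h_nats:.3f} nats = {nats_to_bits(h_nats):.3f} bits")
```

The amount of surprise is the same in both cases; only the unit changes, by the constant factor ln(2).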
In his seminal publication Shannon explained that the optimal choice of base depends on the specific system being analysed (paraphrased slightly from his original work):
- “A device with two stable positions […] can store one bit of information” (bit as in binary digit).
- “A digit wheel on a desk computing machine that has ten stable positions […] has a storage capacity of one decimal digit.”³
- “In analytical work where integration and differentiation are involved the base e is sometimes useful. The resulting units of information will be called natural units.”
Key aspects of machine learning, such as popular loss functions, often rely on integrals and derivatives. The natural logarithm is a practical choice in these contexts because it can be differentiated and integrated without introducing additional constants. This likely explains why the machine learning community frequently uses nats as the unit of information – it simplifies the mathematics by avoiding the need to account for factors like ln(2).
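To see where the extra constant comes from, here is a minimal numerical sketch. It compares the slope of -ln(p) with that of -log₂(p) using a simple central difference; the helper `derivative`, the step size and the point p=0.25 are assumptions for illustration.

```python
import math


def derivative(f, p: float, eps: float = 1e-6) -> float:
    """Central-difference approximation of df/dp at p."""
    return (f(p + eps) - f(p - eps)) / (2 * eps)


p = 0.25
d_nats = derivative(lambda q: -math.log(q), p)   # analytically -1/p
d_bits = derivative(lambda q: -math.log2(q), p)  # analytically -1/(p*ln 2)

print(f"slope in nats: {d_nats:.4f} (expected {-1 / p:.4f})")
print(f"slope in bits: {d_bits:.4f} (expected {-1 / (p * math.log(2)):.4f})")
print(f"ratio bits/nats: {d_bits / d_nats:.4f} (1/ln 2 = {1 / math.log(2):.4f})")
```

With base e the slope is simply -1/p, while base 2 carries an additional factor of 1/ln(2) through every derivative.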
As shown earlier, I personally find base 2 more intuitive for interpretation. In cases where normalisation to another base is more convenient, I will make an effort to explain the reasoning behind the choice.
To summarise this section on units of measure:
bit = amount of information to distinguish between two equally likely outcomes.
Now that we are familiar with self-information and its unit of measure, let’s examine a few use cases.
Quantifying Event Information with Coins and Dice
In this section, we’ll explore examples to help internalise the self-information axioms and key features demonstrated in the graph. Gaining a solid understanding of self-information is essential for grasping its derivatives, such as entropy, cross-entropy (or KL divergence), and mutual information – all of which are averages over self-information.
The examples are designed to be simple, approachable, and lighthearted, accompanied by practical Python code to help you experiment and build intuition.
Note: If you feel comfortable with self-information, feel free to skip these examples and go straight to the Quantifying Uncertainty article.
To further explore self-information and bits, I find analogies like coin flips and dice rolls particularly effective, as they are often useful analogies for real-world phenomena. Formally, these can be described as multinomial trials with n=1 trial. Specifically:
- A coin flip is a Bernoulli trial, where there are c=2 possible outcomes (e.g., heads or tails).
- Rolling a die represents a categorical trial, where c≥3 outcomes are possible (e.g., rolling a six-sided or eight-sided die).
As a use case we’ll use simplistic weather reports limited to featuring sun 🌞 , rain 🌧 , and snow ⛄️.
Now, let’s flip some virtual coins 👍 and roll some funky-looking dice 🎲 …
Fair Coins and Dice
We’ll start with the simplest case of a fair coin (i.e., 50% chance for success/Heads or failure/Tails).
Imagine an area for which on any given day there is a 50:50 chance of sun or rain. We can write the probability of each event as: p(🌞 )=p(🌧 )=½.
As seen above, according to the self-information formulation, when 🌞 or 🌧 is reported we are provided with h(🌞 )=h(🌧 )=-log₂(½)=1 bit of information.
We will continue to build on this analogy later on, but for now let’s turn to a variable that has more than two outcomes (c≥3).
Before we address the standard six-sided die, to simplify the maths and intuition, let’s assume an eight-sided one (c=8) as in Dungeons & Dragons and other tabletop games. In this case each event (i.e., landing on each side) has a probability of p(🔲 ) = ⅛.
When a die lands on one side facing up, e.g., value 7️⃣, we are provided with h(🔲 =7️⃣)=-log₂(⅛)=3 bits of information.
For the standard six-sided fair die: p(🔲 ) = ⅙ → an event yields h(🔲 )=-log₂(⅙)=2.58 bits.
Comparing the amount of information from the fair coin (1 bit), six-sided die (2.58 bits) and eight-sided die (3 bits) we identify the second axiom: the less probable an event is, the more surprising it is and the more information it yields.
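For fair coins and dice every outcome has probability 1/c, so the self-information reduces to log₂(c). A minimal sketch (the function name `fair_outcome_information` is my own, chosen for illustration):

```python
import math


def fair_outcome_information(c: int) -> float:
    """Bits conveyed by any single outcome of a fair c-sided coin or die."""
    # Each outcome has probability 1/c, so -log2(1/c) = log2(c)
    return math.log2(c)


for name, c in [("coin", 2), ("six-sided die", 6), ("eight-sided die", 8)]:
    print(f"{name}: {fair_outcome_information(c):.2f} bits")

# coin: 1.00 bits
# six-sided die: 2.58 bits
# eight-sided die: 3.00 bits
```

More sides means each outcome is less probable and therefore more informative.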
Self-information becomes even more interesting when probabilities are skewed to prefer certain events.
Loaded Coins and Dice
Let’s assume a region where p(🌞 ) = ¾ and p(🌧 )= ¼.
When rain is reported the amount of information conveyed is not 1 bit but rather h(🌧 )=-log₂(¼)=2 bits.
When sun is reported less information is conveyed: h(🌞 )=-log₂(¾)=0.42 bits.
As per the second axiom, a rarer event, like p(🌧 )=¼, reveals more information than a more likely one, like p(🌞 )=¾, and vice versa.
To further drive this point home, let’s now assume a desert region where p(🌞 ) =99% and p(🌧 )= 1%.
If sunshine is reported – that is kind of expected – nothing much is learnt (“nothing new under the sun” 🥁) and this is quantified as h(🌞 )=0.01 bits. If rain is reported, however, you can imagine being quite surprised. This is quantified as h(🌧 )=6.64 bits.
In the following Python scripts you can examine all the above examples, and I encourage you to play with your own to get a feeling.
First let’s define the calculation and printout function:
import numpy as np

def print_events_self_information(probs):
    # probs: a list of dicts, each mapping an event label to its probability
    for ps in probs:
        print(f"Given distribution {ps}")
        for event in ps:
            if ps[event] != 0:
                self_information = -np.log2(ps[event])  # same as: -np.log(ps[event])/np.log(2)
                text_ = f'When `{event}` occurs {self_information:0.2f} bits of information is communicated'
                print(text_)
            else:
                print(f'a `{event}` event cannot happen p=0 ')
        print("=" * 20)
Next we’ll set a few example distributions of weather frequencies
# Setting multiple probability distributions (each sums to 100%)
# Fun fact - 🐍 💚 Emojis!
probs = [{'🌞 ': 0.5, '🌧 ': 0.5}, # half-half
{'🌞 ': 0.75, '🌧 ': 0.25}, # more sun than rain
{'🌞 ': 0.99, '🌧 ': 0.01} , # mostly sunshine
]
print_events_self_information(probs)
This yields the printout
Given distribution {'🌞 ': 0.5, '🌧 ': 0.5}
When `🌞 ` occurs 1.00 bits of information is communicated
When `🌧 ` occurs 1.00 bits of information is communicated
====================
Given distribution {'🌞 ': 0.75, '🌧 ': 0.25}
When `🌞 ` occurs 0.42 bits of information is communicated
When `🌧 ` occurs 2.00 bits of information is communicated
====================
Given distribution {'🌞 ': 0.99, '🌧 ': 0.01}
When `🌞 ` occurs 0.01 bits of information is communicated
When `🌧 ` occurs 6.64 bits of information is communicated
Let’s examine a case of a loaded three-sided die, e.g., weather information for an area that reports sun, rain and snow at uneven probabilities: p(🌞 ) = 0.2, p(🌧 )=0.7, p(⛄️)=0.1.
Running the following
print_events_self_information([{'🌞 ': 0.2, '🌧 ': 0.7, '⛄️': 0.1}])
yields
Given distribution {'🌞 ': 0.2, '🌧 ': 0.7, '⛄️': 0.1}
When `🌞 ` occurs 2.32 bits of information is communicated
When `🌧 ` occurs 0.51 bits of information is communicated
When `⛄️` occurs 3.32 bits of information is communicated
What we saw for the binary case applies to higher dimensions.
To summarise – we clearly see the implications of the second axiom:
- When a highly expected event occurs – we don’t learn much, the bit count is low.
- When an unexpected event occurs – we learn a lot, the bit count is high.
Event Information Summary
In this article we embarked on a journey into the foundational concepts of information theory, defining how to measure the surprise of an event. The notions introduced serve as the bedrock of many tools in information theory, from assessing data distributions to unraveling the inner workings of machine learning algorithms.
Through simple yet insightful examples like coin flips and dice rolls, we explored how self-information quantifies the unpredictability of specific outcomes. Expressed in bits, this measure encapsulates Shannon’s second axiom: rarer events convey more information.
While we’ve focused on the information content of specific events, this naturally leads to a broader question: what is the average amount of information associated with all possible outcomes of a variable?
In the next article, Quantifying Uncertainty, we build on the foundation of self-information and bits to explore entropy – the measure of average uncertainty. Far from being just a beautiful theoretical construct, it has practical applications in data analysis and machine learning, powering tasks like decision tree optimisation, estimating diversity and more.
About This Series
Even though I have twenty years of experience in data analysis and predictive modelling I always felt quite uneasy about using concepts in information theory without truly understanding them.
The purpose of this series was to put me more at ease with concepts of information theory and hopefully provide for others the explanations I needed.
Footnotes
¹ A Mathematical Theory of Communication, Claude E. Shannon, Bell System Technical Journal, 1948.
It was later republished as a book, The Mathematical Theory of Communication, in 1949.
[Shannon’s “A Mathematical Theory of Communication”] the blueprint for the digital era – Historian James Gleick
² See the Wikipedia page on Information Content (i.e., self-information) for a detailed derivation that only the log function meets all three axioms.
³ The decimal digit was later renamed to a hartley (symbol Hart), a ban or a dit. See the Hartley (unit) Wikipedia page.
Credits
Unless otherwise noted, all images were created by the author.
Many thanks to Will Reynolds and Pascal Bugnion for their useful comments.