This is a chapter of the in-progress book on linear algebra, “A bird's eye view of linear algebra”. Stay tuned for future chapters.
Here, we will describe operations we can do with two matrices, while keeping in mind that they are just representations of linear maps.
I) Why care about matrix multiplication?
Almost any information can be embedded in a vector space: images, video, language, speech, biometric information and whatever else you can imagine. And all the applications of machine learning and artificial intelligence (like the recent chat-bots, text to image, etc.) work on top of these vector embeddings. Since linear algebra is the science of dealing with high dimensional vector spaces, it is an indispensable building block.
A lot of the techniques involve taking some input vectors from one space and mapping them to other vectors in some other space.
But why the focus on “linear” when most interesting functions are non-linear? It's because the problem of making our models high dimensional and that of making them non-linear (general enough to capture all kinds of complex relationships) turn out to be orthogonal to each other. Many neural network architectures work by using linear layers with simple one dimensional non-linearities in between them. And there is a theorem (the universal approximation theorem) that says this kind of architecture can approximate any continuous function as closely as we like.
Since the way we manipulate high-dimensional vectors is primarily matrix multiplication, it isn't a stretch to say it is the bedrock of the modern AI revolution.

II) Algebra on maps
In chapter 2, we learnt how to quantify linear maps with determinants. Now, let's do some algebra with them. We'll need two linear maps and a basis.

II-A) Addition
If we can add matrices, we can add linear maps, since matrices are the representations of linear maps. And matrix addition is not very interesting if you know scalar addition. Just as with vectors, it is only defined if the two matrices are the same size (same number of rows and columns) and involves lining them up and adding element by element.

So, we're just doing a bunch of scalar additions. Which means that the properties of scalar addition logically extend.
Commutative: if you switch, the result won't twitch
A+B = B+A
But commuting to work might not be commutative since going from A to B might take longer than B to A.
Associative: in a chain, don't refrain, take any 2 and proceed
A+(B+C) = (A+B)+C
Identity: And here I am where I began! That's no way to treat a man!
The presence of a special element that when added to anything results in the same thing. In the case of scalars, it is the number 0. In the case of matrices, it is a matrix full of zeros.
A + 0 = A or 0 + A = A
Also, it is possible to start at any element and end up at any other via addition. So it must be possible to start at A and end up at the additive identity, 0. The thing that must be added to A to achieve this is the additive inverse of A and it is called -A.
A + (-A) = 0
For matrices, you just go to each scalar element in the matrix and replace it with its additive inverse (switching the signs if the scalars are numbers) to get the additive inverse of the matrix.
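The properties above are easy to check on small examples. The sketch below is only an illustration: it stores matrices as plain Python lists of rows, and the helper names `mat_add` and `mat_neg` are made up for this example.

```python
def mat_add(A, B):
    """Add two matrices (lists of rows) element by element."""
    same_shape = len(A) == len(B) and all(len(ra) == len(rb) for ra, rb in zip(A, B))
    if not same_shape:
        raise ValueError("matrices must have the same number of rows and columns")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def mat_neg(A):
    """Additive inverse: flip the sign of every element."""
    return [[-a for a in row] for row in A]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, -1], [2, 3]]
ZERO = [[0, 0], [0, 0]]  # additive identity for 2x2 matrices

print(mat_add(A, B) == mat_add(B, A))                          # commutative
print(mat_add(A, mat_add(B, C)) == mat_add(mat_add(A, B), C))  # associative
print(mat_add(A, ZERO) == A)                                   # identity
print(mat_add(A, mat_neg(A)) == ZERO)                          # additive inverse
```

Each line prints True for these integer matrices. Subtraction, A-B, is then just `mat_add(A, mat_neg(B))`.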
II-B) Subtraction
Subtraction is just addition with the additive inverse of the second matrix instead.
A-B = A+(-B)
II-C) Multiplication
We could have defined matrix multiplication just as we defined matrix addition. Just take two matrices that are the same size (rows and columns) and then multiply the scalars element by element. There is a name for that kind of operation, the Hadamard product.
But no, we defined matrix multiplication as a far more convoluted operation, more “exotic” than addition. And it isn't complex just for the sake of it. It's the most important operation in linear algebra by far.
It enjoys this special status because it is the means by which linear maps are applied to vectors, building on top of dot products.
The way it actually works requires a dedicated section, so we'll cover that in section III. Here, let's list some of its properties.
Commutative
Unlike addition, matrix multiplication is not commutative in general. Which means that the order in which you apply linear maps to your input vector matters.
A.B != B.A
Associative
It is still associative
A.B.C = A.(B.C) = (A.B).C
And there is a lot of depth to this property, as we will see in section IV.
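A small numeric check makes both properties concrete. The sketch below assumes the usual row-by-column rule (the one section III motivates) and a hypothetical helper named `mat_mul` that works on lists of rows; the matrices are arbitrary examples.

```python
def mat_mul(A, B):
    """Row-by-column product of A (n x k) and B (k x m), both lists of rows."""
    if len(A[0]) != len(B):
        raise ValueError("columns of A must equal rows of B")
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]  # swaps the two coordinates of a vector
C = [[2, 0], [0, 3]]  # scales the two coordinates differently

AB = mat_mul(A, B)  # [[2, 1], [4, 3]]: the columns of A swapped
BA = mat_mul(B, A)  # [[3, 4], [1, 2]]: the rows of A swapped

print(AB == BA)                                          # False: not commutative
print(mat_mul(mat_mul(A, B), C) == mat_mul(A, mat_mul(B, C)))  # True: associative
```

Swapping the order changes the answer, while regrouping the brackets does not.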
Identity
Just like addition, matrix multiplication also has an identity element, I, an element that leaves any matrix unchanged when multiplied with it. The big caveat being that this element only exists for square matrices and is itself square.
Now, because of the importance of matrix multiplication, “the identity matrix” in general is defined as the identity element of matrix multiplication (not that of addition or the Hadamard product for example).
The identity element for addition is a matrix composed of 0's and that of the Hadamard product is a matrix composed of 1's. The identity element of matrix multiplication is:
$$
I = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}
$$
So, 1's on the main diagonal and 0's everywhere else. What kind of definition for matrix multiplication would lead to an identity element like this? We'll need to describe how it works to see, but first let's go to the final operation.
II-D) Division
Just as with addition, the presence of an identity matrix suggests that a matrix, A can be multiplied with another matrix, A^-1 and taken to the identity. This is called the inverse. Unlike the additive inverse, it does not always exist: A has to be square and its determinant (chapter 2) has to be non-zero. Since matrix multiplication is not commutative, there are two ways to do this. Luckily, both lead to the identity matrix.
A.(A^-1) = (A^-1).A = I
So, “dividing” a matrix by another is simply multiplication with the second one's inverse, A.B^-1. If matrix multiplication is important, then this operation is as well since it is the inverse. It is also related to how we historically developed (or maybe stumbled upon) linear algebra. But more on that in the next chapter (4).
Another property we'll be using, a combined property of addition and multiplication, is the distributive property. It applies to all kinds of matrix multiplication, from the traditional one to the Hadamard product:
A.(B+C) = A.B + A.C
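The identity, inverse and distributive properties can also be checked numerically. The sketch below is restricted to 2x2 matrices so that the inverse has a simple closed form. All helper names (`mat_mul`, `mat_add`, `hadamard`, `inverse_2x2`) are assumptions made for this example, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def mat_mul(A, B):
    """Row-by-column product of two matrices given as lists of rows."""
    if len(A[0]) != len(B):
        raise ValueError("columns of A must equal rows of B")
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def mat_add(A, B):
    """Element-wise sum of two same-sized matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def hadamard(A, B):
    """Element-wise (Hadamard) product of two same-sized matrices."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def inverse_2x2(A):
    """Inverse of a 2x2 matrix; raises if the determinant is zero."""
    (a, b), (c, d) = A
    det = Fraction(a * d - b * c)
    if det == 0:
        raise ValueError("singular matrix: no inverse exists")
    return [[d / det, -b / det], [-c / det, a / det]]


IDENTITY = [[1, 0], [0, 1]]
ONES = [[1, 1], [1, 1]]
A = [[2, 1], [5, 3]]
B = [[1, 2], [3, 4]]
C = [[0, 1], [1, 0]]

A_inv = inverse_2x2(A)  # equals [[3, -1], [-5, 2]]

print(mat_mul(A, IDENTITY) == A and mat_mul(IDENTITY, A) == A)        # identity of matrix multiplication
print(hadamard(A, ONES) == A)                                         # identity of the Hadamard product
print(mat_mul(A, A_inv) == IDENTITY and mat_mul(A_inv, A) == IDENTITY)  # both orders give I
print(mat_mul(A, mat_add(B, C)) == mat_add(mat_mul(A, B), mat_mul(A, C)))     # distributive
print(hadamard(A, mat_add(B, C)) == mat_add(hadamard(A, B), hadamard(A, C)))  # also for Hadamard
```

Every line prints True. A matrix with zero determinant, such as [[1, 2], [2, 4]], makes `inverse_2x2` raise instead of returning an inverse.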
III) Why is matrix multiplication defined this way?
We have arrived at last at the section where we will answer the question in the title, the meat of this chapter.
Matrix multiplication is the way linear maps act on vectors. So, we get to motivate it that way.
III-A) How are linear maps applied in practice?
Consider a linear map that takes m dimensional vectors (from R^m) as input and maps them to n dimensional vectors (in R^n). Let's call the m dimensional input vector, v.
At this point, it might be helpful to think of yourself actually coding up this linear map in some programming language. It should be a function that takes the m-dimensional vector, v as input and returns the n dimensional vector, u.
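A first, deliberately naive version of such a function might look like the sketch below. The name `linear_map_attempt` and the default output dimension n=3 are assumptions made for illustration. It has the right signature, but it simply returns a random vector.

```python
import random


def linear_map_attempt(v, n=3):
    """First attempt: return some n-dimensional vector, ignoring v entirely."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]


v = [1.0, 2.0, 3.0, 4.0]   # an m = 4 dimensional input
u = linear_map_attempt(v)  # an n = 3 dimensional output, unrelated to v
print(len(u))
```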
The linear map has to take this vector and turn it into an n dimensional vector somehow. A first attempt that just generates some vector at random has the right signature, but it completely ignores the input vector, v. That's unreasonable, v should have some say.
Now, v is just an ordered list of m scalars v = [v1, v2, v3, …, vm]. What do scalars do? They scale vectors. And the output vector we need should be n dimensional. How about we take some (fixed) m vectors (pulled out of thin air, each n dimensional), w1, w2, …, wm. Then, scale w1 by v1, w2 by v2 and so on and add them all up. This leads to an equation for our linear map (with the output on the left).
$$
u = f(v) = v_1 w_1 + v_2 w_2 + \dots + v_m w_m \tag{1}
$$
Make a note of equation (1) above since we'll be using it again.
Since w1, w2, … are all n dimensional, so is u. And all the elements of v=[v1, v2, …, vm] have an influence on the output, u. The idea in equation (1) is straightforward to implement: take some randomly generated vectors for the w's but with fixed seeds (ensuring that the vectors are the same across every call of the function).
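One possible sketch of that implementation follows. The function name `linear_map`, the default n=3 and the choice of seeding the i-th vector with the integer i are all assumptions made for this example, not necessarily what the original code did.

```python
import random


def linear_map(v, n=3):
    """Equation (1): u = v1*w1 + v2*w2 + ... + vm*wm with fixed vectors w_i."""
    u = [0.0] * n
    for i, v_i in enumerate(v):
        rng = random.Random(i)  # fixed seed, so w_i is the same on every call
        w_i = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Scale w_i by v_i and accumulate into the output.
        u = [u_j + v_i * w_j for u_j, w_j in zip(u, w_i)]
    return u


a = [1.0, 2.0, -1.0, 0.5]
b = [0.0, 3.0, 4.0, -2.0]
a_plus_b = [x + y for x, y in zip(a, b)]

lhs = linear_map(a_plus_b)
rhs = [x + y for x, y in zip(linear_map(a), linear_map(b))]
# f(a+b) and f(a)+f(b) agree up to floating point rounding.
print(all(abs(x - y) < 1e-9 for x, y in zip(lhs, rhs)))
```

Because the seeds are fixed, calling `linear_map` twice on the same input returns the same output, which is what makes it a well-defined function of v.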
We now have a way to “map” m dimensional vectors (v) to n dimensional vectors (u). But does this “map” satisfy the properties of a linear map? Recall from chapter 1, section II the properties of a linear map, f (here, a and b are vectors and c is a scalar):
f(a+b) = f(a) + f(b)
f(c.a) = c.f(a)
The map specified by equation (1) satisfies both of the above properties of a linear map.
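For completeness, here is a short derivation of both properties, written out from equation (1). The only notation assumed beyond the text is a_i and b_i for the i-th elements of the vectors a and b.

```latex
\begin{aligned}
f(a+b) &= \sum_{i=1}^{m} (a_i + b_i)\, w_i
        = \sum_{i=1}^{m} a_i w_i + \sum_{i=1}^{m} b_i w_i
        = f(a) + f(b) \\
f(c \cdot a) &= \sum_{i=1}^{m} (c\, a_i)\, w_i
        = c \sum_{i=1}^{m} a_i w_i
        = c \cdot f(a)
\end{aligned}
```

Both lines only use the fact that scaling and adding vectors distribute over each other, which is why the choice of the w vectors does not matter.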


The m vectors, w1, w2, …, wm are arbitrary and no matter what we choose for them, the function, f defined in equation (1) is a linear map. So, different choices for these w vectors result in different linear maps. Moreover, for any linear map you can imagine, there will be some vectors w1, w2, … that can be used in conjunction with equation (1) to represent it.
Now, for a given linear map, we can collect the vectors w1, w2, … into the columns of a matrix. Such a matrix will have n rows and m columns. This matrix represents the linear map, f and its multiplication with an input vector, v represents the application of the linear map, f to v. And this application is where the definition of matrix multiplication comes from.
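To connect this with the rule most of us learn first, the sketch below computes the same matrix-vector product in two ways: as the scaled sum of columns from equation (1), and as one dot product per row. The function names `apply_by_columns` and `apply_by_rows` and the example numbers are made up for illustration.

```python
def apply_by_columns(W, v):
    """Equation (1): scale column i of W by v[i] and add the scaled columns."""
    if len(W[0]) != len(v):
        raise ValueError("number of columns of W must equal number of elements in v")
    u = [0] * len(W)
    for v_i, w_i in zip(v, zip(*W)):  # w_i is the i-th column of W
        u = [u_j + v_i * w_ij for u_j, w_ij in zip(u, w_i)]
    return u


def apply_by_rows(W, v):
    """Textbook rule: entry j of u is the dot product of row j of W with v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


W = [[1, 0, 2],
     [0, 3, 1]]  # n = 2 rows, m = 3 columns
v = [4, 5, 6]

print(apply_by_columns(W, v))  # [16, 21]
print(apply_by_rows(W, v))     # [16, 21]
```

The two views always agree; the column view explains why the rule is what it is, while the row view is usually how it is computed by hand.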

We can now see why the identity element for matrix multiplication is the way it is. If the columns of the matrix are w1 = [1, 0, …, 0], w2 = [0, 1, …, 0] and so on, equation (1) scales the i-th of these by vi and adds them up, which gives back exactly [v1, v2, …, vm] = v.

We start with a column vector, v and end with a column vector, u (so just one column for each of them). And since the elements of v must align with the column vectors of the matrix representing the linear map, the number of columns of the matrix must equal the number of elements in v. More on this in section III-C.
III-B) Matrix multiplication as a composition of linear maps
Now that we have described how a matrix is multiplied with a vector, we can move on to multiplying a matrix with another matrix.
The definition of matrix multiplication is much more natural when we consider the matrices as representations of linear maps.
Linear maps are functions that take a vector as input and produce a vector as output. Let's say the linear maps corresponding to two matrices are f and g. How would you think of adding these maps (f+g)?
(f+g)(v) = f(v)+g(v)
This is reminiscent of the distributive property of addition, where the argument goes inside the bracket to both the functions and we add the results. And if we fix a basis, this corresponds to applying both linear maps to the input vector and adding the result. By the distributive property of matrix and vector multiplication, this is the same as adding the matrices corresponding to the linear maps and applying the result to the vector.
Now, let's think of multiplication (f.g).
(f.g)(v) = f(g(v))
Since linear maps are functions, the most natural interpretation of multiplication is to compose them (apply them one at a time, in sequence, to the input vector).
When two matrices are multiplied, the resulting matrix represents the composition of the corresponding linear maps. Consider matrices A and B; the product AB embodies the transformation achieved by applying the linear map represented by B to the input vector first and then applying the linear map represented by A.
So we have a linear map corresponding to the matrix, A and a linear map corresponding to the matrix, B. We'd like to know the matrix, C corresponding to the composition of the two linear maps. So, applying B to any vector first and then applying A to the result should be equivalent to just applying C.
A.(B.v) = C.v = (A.B).v
In the last section, we learnt how to multiply a matrix and a vector. Let's do that twice for A.(B.v). Say the columns of B are the column vectors, b1, b2, …, bm. From equation (1) in the previous section (and using the fact that A is a linear map),
$$
A.(B.v) = A.(v_1 b_1 + v_2 b_2 + \dots + v_m b_m) = v_1 (A.b_1) + v_2 (A.b_2) + \dots + v_m (A.b_m)
$$
And what if we applied the linear map corresponding to C=A.B directly to the vector, v? The column vectors of the matrix C are c1, c2, …, cm (one for each element of v).
$$
C.v = v_1 c_1 + v_2 c_2 + \dots + v_m c_m
$$
Since the two equations above must agree for every v, comparing them we get,
$$
c_i = A.b_i \quad \text{for } i = 1, 2, \dots, m
$$
So, the columns of the product matrix, C=AB are obtained by applying the linear map corresponding to matrix A to each of the columns of the matrix B. And collecting those resulting vectors into a matrix gives us C.
We have just extended our matrix-vector multiplication result from the previous section to the multiplication of two matrices. We just break the second matrix into a collection of vectors, multiply the first matrix with each of them and collect the resulting vectors into the columns of the result matrix.
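The column-by-column recipe translates directly into code. In the sketch below, `mat_vec` and `mat_mul_by_columns` are hypothetical helper names and the matrices are arbitrary small examples; the last two lines check that applying B and then A matches applying C once.

```python
def mat_vec(A, v):
    """Apply the matrix A (list of rows) to the vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def mat_mul_by_columns(A, B):
    """Build C = A.B by applying A to each column of B."""
    columns_of_C = [mat_vec(A, b_j) for b_j in zip(*B)]
    # columns_of_C holds the columns of C; transpose to get a list of rows.
    return [list(row) for row in zip(*columns_of_C)]


A = [[1, 2], [3, 4], [5, 6]]  # 3 x 2
B = [[1, 0, 2], [0, 1, 3]]    # 2 x 3
v = [1, 1, 1]

C = mat_mul_by_columns(A, B)
print(C)                          # [[1, 2, 8], [3, 4, 18], [5, 6, 28]]
print(mat_vec(A, mat_vec(B, v)))  # [11, 25, 39]
print(mat_vec(C, v))              # [11, 25, 39]
```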

So the entry in the first row and first column of the result matrix, C is the dot product of the first row of A and the first column of B. And in general the entry in the i-th row and j-th column of C is the dot product of the i-th row of A and the j-th column of B. This is the definition of matrix multiplication most of us first learn.

Associative proof
We can also show that matrix multiplication is associative now. Instead of the single vector, v, let's apply the product C=AB individually to a group of vectors, w1, w2, …, wl. Let's say the matrix that has these as column vectors is W. We can use the exact same trick as above to show:
(A.B).W = A.(B.W)
It's because (A.B).w1 = A.(B.w1) and the same for all the other w vectors.
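Sum of outer products
Say we're multiplying two matrices A (with n rows and k columns) and B (with k rows and m columns). The entry in the first row and first column of the result, C is the dot product of the first row of A and the first column of B:
$$
c_{1,1} = a_{1,1} b_{1,1} + a_{1,2} b_{2,1} + \dots + a_{1,k} b_{k,1} \tag{3}
$$
Equation (3) can be generalized to show that the i,j element of the resulting matrix, C is:
$$
c_{i,j} = \sum_{l=1}^{k} a_{i,l}\, b_{l,j} \tag{4}
$$
We have a sum over k terms. What if we took each of those terms and created k individual matrices out of them? For example, the first matrix will have as its i,j-th entry: a_{i,1}.b_{1,j}. The k matrices and their relationship to C:
$$
C = \sum_{l=1}^{k} C^{(l)}, \qquad C^{(l)}_{i,j} = a_{i,l}\, b_{l,j}
$$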
This process of summing over k matrices can be visualized as follows (reminiscent of the animation in section III-A that visualized a matrix multiplied with a vector):

We see here the sum over k matrices, all of the same size (n x m), which is the same size as the result matrix, C. Notice in equation (4) how, within each term, the column index of the first matrix, A stays the same while the row index of the second matrix, B stays the same. So the k matrices we're getting are the matrix products of the l-th column of A and the l-th row of B (for l = 1, …, k).

Matrix multiplication as a sum of outer products. Image by author.
Inside the summation, two vectors are multiplied to produce matrices. It's a special case of matrix multiplication when applied to vectors (special cases of matrices) and is called the “outer product”. Here is yet another animation to show this sum of outer products process:

This tells us why the number of row vectors in B should be the same as the number of column vectors in A: they have to be paired up to get the individual matrices.
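The sum of outer products can be sketched in a few lines. The helper names `outer` and `mat_mul_outer` are assumptions made for this example, with matrices again stored as lists of rows.

```python
def outer(col, row):
    """Outer product of a column (length n) and a row (length m): an n x m matrix."""
    return [[c * r for r in row] for c in col]


def mat_mul_outer(A, B):
    """C = A.B as the sum over l of (l-th column of A) outer (l-th row of B)."""
    n, k, m = len(A), len(B), len(B[0])
    if len(A[0]) != k:
        raise ValueError("number of columns of A must equal number of rows of B")
    C = [[0] * m for _ in range(n)]
    for l in range(k):
        col_l = [A[i][l] for i in range(n)]  # l-th column of A
        term = outer(col_l, B[l])            # an n x m matrix
        C = [[c + t for c, t in zip(c_row, t_row)] for c_row, t_row in zip(C, term)]
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

print(outer([1, 3], B[0]))  # first term:  [[5, 6], [15, 18]]
print(outer([2, 4], B[1]))  # second term: [[14, 16], [28, 32]]
print(mat_mul_outer(A, B))  # their sum:   [[19, 22], [43, 50]]
```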
We've seen a lot of visualizations and some math. The same thing can also be expressed in code for the special case where A and B are square matrices, based on section 4.2 of the book “Introduction to Algorithms”, [2].
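A sketch in the spirit of the straightforward triple-loop procedure for square matrices is given below. The function name `square_matrix_multiply` and the use of Python lists are choices made for this example rather than a transcription of the book's pseudocode.

```python
def square_matrix_multiply(A, B):
    """Multiply two n x n matrices with three nested loops, per equation (4)."""
    n = len(A)
    if any(len(row) != n for row in A) or len(B) != n or any(len(row) != n for row in B):
        raise ValueError("both matrices must be n x n")
    C = [[0] * n for _ in range(n)]
    for i in range(n):          # row of the result
        for j in range(n):      # column of the result
            for k in range(n):  # index summed over in the dot product
                C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(square_matrix_multiply(A, B))  # [[19, 22], [43, 50]]
```

The three nested loops each run n times, so this approach performs n^3 scalar multiplications.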
III-C) Matrix multiplication: the structural choices

Matrix multiplication seems to be structured in a weird way. It's clear that we need to take a bunch of dot products. So, one of the dimensions has to match. But why make the number of columns of the first matrix equal to the number of rows of the second?
Won't it make things more straightforward if we redefine it in a way that the number of rows of the two matrices should be the same (or the number of columns)? This would make it much easier to identify when two matrices can be multiplied.
The traditional definition, where we require the rows of the first matrix to align with the columns of the second, has more than one advantage. Let's go first to matrix-vector multiplication. Animation (1) in section III-A showed us how the traditional version works. Let's visualize what it would look like if we required the rows of the matrix to align with the elements of the vector instead. Now, the n rows of the matrix will need to align with the n elements of the vector.

We see that we'd have to start with a column vector, v with n rows and one column and end up with a row vector, u with 1 row and m columns. This is awkward and makes defining an identity element for matrix multiplication challenging, since the input and output vectors can never have the same shape. With the traditional definition, this is not an issue since the input is a column vector and the output is also a column vector (see animation (1)).
Another consideration is multiplying a chain of matrices. In the traditional method, it is easy to see, first of all, that the chain of matrices below can be multiplied together based on their dimensionalities.

Further, we can tell that the output matrix will have l rows and p columns.
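This bookkeeping is simple enough to automate. The sketch below uses a hypothetical helper, `chain_shape`, and made-up dimensions: it checks that each matrix has as many rows as the previous one has columns, and returns the rows of the first matrix together with the columns of the last.

```python
def chain_shape(shapes):
    """Return (rows, columns) of the product of a chain of matrices, given their shapes."""
    rows, cols = shapes[0]
    for position, (next_rows, next_cols) in enumerate(shapes[1:], start=1):
        if cols != next_rows:
            raise ValueError(
                f"matrix {position} has {next_rows} rows, expected {cols}"
            )
        cols = next_cols  # the inner dimensions cancel; only the last columns survive
    return rows, cols


# A chain of four matrices with shapes 2x3, 3x5, 5x4 and 4x7.
print(chain_shape([(2, 3), (3, 5), (5, 4), (4, 7)]))  # (2, 7)
```

Only neighbouring pairs ever need to be compared, which is what makes the traditional convention easy to reason about.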
In the framework where the rows of the two matrices should line up, this quickly becomes a mess. For the first two matrices, we can tell that the rows should align and that the result will have n rows and l columns. But visualizing how many rows and columns the result will have and then reasoning about whether it'll be compatible with C, etc. becomes a nightmare.

And that is why we require the rows of the first matrix to align with the columns of the second matrix. But maybe I missed something. Maybe there is an alternate definition that is “cleaner” and manages to side-step these two challenges. Would love to hear ideas in the comments 🙂
III-D) Matrix multiplication as a change of basis
So far, we've thought of matrix multiplication with vectors as a linear map that takes a vector as input and returns some other vector as output. But there is another way to think of matrix multiplication: as a way to change perspective.
Let's consider two-dimensional space, R². We represent any vector in this space with two numbers. What do those numbers represent? The coordinates along the x-axis and y-axis. A unit vector that points just along the x-axis is [1,0] and one that points along the y-axis is [0,1]. These are our basis for the space. Every vector now has an address. For example, the vector [2,3] means we scale the first basis vector by 2 and the second by 3.
But this isn't the only basis for the space. Someone else (say, he who shall not be named) might want to use two other vectors as their basis. For example, the vectors e1=[3,2] and e2=[1,1]. Any vector in the space R² can also be expressed in their basis. The same vector would have different representations in our basis and their basis. Like different addresses for the same house (perhaps based on different postal systems).
When we're in the basis of he who shall not be named, the vector e1 = [1,0] and the vector e2 = [0,1] (which are the basis vectors from his perspective, by definition of basis vectors). And the functions that translate vectors from our basis system to that of he who shall not be named and vice-versa are linear maps. And so the translations can be represented as matrix multiplications. Let's call the matrix that takes vectors from our basis to that of he who shall not be named, M1 and the matrix that does the opposite, M2. How do we find these matrices?

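We know that the vectors we call e1=[3,2] and e2=[1,1], he who shall not be named calls e1=[1,0] and e2=[0,1]. Let's collect our version of the vectors into the columns of a matrix.
$$
\begin{bmatrix} 3 & 1 \\ 2 & 1 \end{bmatrix}
$$
And also collect the vectors, e1 and e2 of he who shall not be named into the columns of another matrix. This is just the identity matrix.
$$
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
$$
Since matrix multiplication operates independently on the columns of the second matrix,
$$
M_1 \begin{bmatrix} 3 & 1 \\ 2 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
$$
Multiplying both sides on the right by the inverse of our matrix gives us M1:
$$
M_1 = \begin{bmatrix} 3 & 1 \\ 2 & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 1 & -1 \\ -2 & 3 \end{bmatrix}
$$
Doing the same thing in reverse gives us M2:
$$
M_2 \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 3 & 1 \\ 2 & 1 \end{bmatrix} \quad \Rightarrow \quad M_2 = \begin{bmatrix} 3 & 1 \\ 2 & 1 \end{bmatrix}
$$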
This can all be generalized into the following statement: a matrix with column vectors w1, w2, …, wn translates vectors expressed in a basis where w1, w2, …, wn are the basis vectors to our basis.
And the inverse of that matrix translates vectors from our basis to the one where w1, w2, …, wn are the basis.
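The example with e1=[3,2] and e2=[1,1] can be checked with a short sketch. The helper names `mat_vec` and `inverse_2x2` are assumptions made for illustration, and exact fractions are used so that the round trip comes back without rounding error.

```python
from fractions import Fraction


def mat_vec(M, v):
    """Apply the matrix M (list of rows) to the vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def inverse_2x2(M):
    """Inverse of a 2x2 matrix; raises if the determinant is zero."""
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    if det == 0:
        raise ValueError("singular matrix: its columns do not form a basis")
    return [[d / det, -b / det], [-c / det, a / det]]


M2 = [[3, 1], [2, 1]]  # columns are e1=[3,2] and e2=[1,1], written in our basis
M1 = inverse_2x2(M2)   # equals [[1, -1], [-2, 3]]

print(mat_vec(M1, [3, 2]) == [1, 0])  # our e1 is [1,0] in their basis
print(mat_vec(M1, [1, 1]) == [0, 1])  # our e2 is [0,1] in their basis

v_ours = [2, 3]
v_theirs = mat_vec(M1, v_ours)        # equals [-1, 5]: -1*e1 + 5*e2 = [2, 3]
print(mat_vec(M2, v_theirs) == v_ours)  # translating back recovers the vector
```

All three lines print True: the same vector simply has two different addresses, [2,3] in our basis and [-1,5] in theirs.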
All invertible square matrices can hence be thought of as “basis changers”.
Note: In the special case of an orthonormal matrix (where every column is a unit vector and orthogonal to every other column; such matrices are more commonly called orthogonal matrices), the inverse becomes the same as the transpose. So, changing to the basis formed by the columns of such a matrix becomes equivalent to taking the dot product of a vector with each of those columns (which are the rows of the transpose).
For more on this, see the 3B1B video, [1].
Conclusion
Matrix multiplication is arguably one of the most important operations in modern computing and in almost any data science field. Understanding deeply how it works is important for any data scientist. Most linear algebra textbooks describe the “what” but not why it's structured the way it is. Hopefully this blog filled that gap.
[1] 3B1B video on change of basis: https://www.youtube.com/watch?v=P2LTAUO1TdA&t=2s
[2] Introduction to Algorithms by Cormen et al., third edition
[3] Matrix multiplication as sum of outer products: https://math.stackexchange.com/questions/2335457/matrix-at-a-as-sum-of-outer-products
[4] Catalan numbers, Wikipedia article: https://en.wikipedia.org/wiki/Catalan_number

