Non-Negative Matrix Factorization (NMF)

Hello everyone!

In this post, I will talk about the Non-Negative Matrix Factorization (NMF) algorithm, a very interesting unsupervised learning algorithm. The first part of this post gives a brief overview and explanation of the algorithm; the second half focuses on practical examples of NMF in action, so you can build your intuition about how the algorithm works.

TL;DR: If you are searching for an NMF implementation in Python, check out my GitHub repo.

What is Non-Negative Matrix Factorization?

Non-Negative Matrix Factorization (NMF) is a technique for linear dimensionality reduction and data analysis that yields a parts-based, sparse, non-negative representation of non-negative input data. Essentially, NMF is an unsupervised learning algorithm rooted in linear algebra that not only reduces data dimensionality but also performs clustering at the same time.

It has found a wide variety of applications, including text analysis, document clustering, face/image recognition, language modeling, speech processing and many others.

Understanding how NMF works

Before we go any further, let’s understand the intuition of Non-Negative Matrix Factorization.

Given a non-negative data matrix containing a set of items with a fixed number of features, the NMF algorithm produces a factorization of the data matrix that reveals the interesting latent factors underlying the interactions between items and their features. Each item can be described using these latent factors, and, using the information about how strongly each latent factor is expressed for each item, we can cluster the items that share the same latent features. If you are familiar with statistical factor analysis, you will find that one of the resulting matrices looks just like the one you would get by running a factor analysis, only all factor weights will be non-negative.

Let’s formalize what NMF actually does.

Given a set of multivariate n-dimensional data vectors, the vectors are placed in the columns of an n x m matrix V, where m is the number of examples in the data set. This matrix is then approximately factorized into an n x r matrix W (the weights matrix) and an r x m matrix H (the features matrix), so that V ≈ W × H, where r is the number of factors defined by the user. Usually r is chosen to be smaller than n or m, so that W and H are smaller than the original matrix V. The result is a compressed version of the original data matrix.

[Figure: Multiplying the weights matrix by the features matrix]
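
To make these shapes concrete, here is a minimal sketch using scikit-learn’s NMF implementation (the data is random, so the snippet only illustrates the dimensions and the compression, not a meaningful factorization):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative data: n = 4 features, m = 6 examples, r = 2 factors.
n, m, r = 4, 6, 2
rng = np.random.default_rng(0)
V = rng.random((n, m))  # data vectors in the columns, as described above

model = NMF(n_components=r, init="random", random_state=0)
W = model.fit_transform(V)  # n x r weights matrix
H = model.components_       # r x m features matrix

print(W.shape, H.shape)           # (4, 2) (2, 6)
print(np.linalg.norm(V - W @ H))  # reconstruction error of the compressed form
```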

The columns of W and the rows of H represent the latent factor bundles that are believed to be responsible for the observed data in V. The building blocks are not individual factors but weighted bundles of factors that serve a common purpose.

The reason it’s called non-negative matrix factorization is that it returns factors and weights with no negative values, which in turn requires every entry of the data matrix to be non-negative. This constraint also makes the representation purely additive: latent factors can only add to the reconstruction, never subtract from or cancel out other latent factors, so NMF will not find results that explicitly exclude certain factors from the original data matrix.

Use cases

If you are new to linear algebra or machine learning, I suppose the concepts described in the previous part might not be entirely clear to you. By going through several NMF use cases, I hope you will get a better understanding of how the algorithm works.

Use Case 1: Topic Modeling

Let’s suppose we have a collection of transcripts of parliament speeches and let’s set a goal of uncovering common topics of these speeches.

To start, we need to define our data matrix. For topic modeling we use a document-term matrix, which records the frequency (number of occurrences) of each word in each document.

For illustration purposes, we will construct a matrix that is only 5 x 4. In a real-life scenario, the numbers of rows and columns would be much higher. We construct the data matrix such that rows correspond to documents and columns to words:

           labour   energy   market   employment
Speech 1       36        3       45           54
Speech 2        4       34       23           31
Speech 3        9       65       11            0
Speech 4       17        3        3            0
Speech 5        0       14        7            4
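
In practice you would not fill in such a matrix by hand. Here is a minimal sketch of building it from raw text with scikit-learn’s CountVectorizer, assuming the transcripts are plain strings (the speeches below are hypothetical placeholders, and the vocabulary is fixed to our four example words):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical transcripts; in reality each entry is a full speech.
speeches = [
    "labour market employment labour ...",
    "energy market employment ...",
    "energy energy market ...",
    "labour market ...",
    "energy market employment ...",
]

vectorizer = CountVectorizer(vocabulary=["labour", "energy", "market", "employment"])
V = vectorizer.fit_transform(speeches).toarray()  # 5 x 4 document-term matrix
```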

Now that we have defined our data matrix, we proceed to running the algorithm. The algorithm requires one user-defined parameter: the number of factors to be extracted (r in the previous section). Suppose we want to extract 2 factors from our data. In that case, after running the algorithm we get the resulting W (weights) and H (factors) matrices.
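
With scikit-learn, this step looks roughly as follows. Note that NMF is initialized randomly, so the exact numbers you get will differ from the tables below:

```python
import numpy as np
from sklearn.decomposition import NMF

# The 5 x 4 document-term matrix from the table above.
V = np.array([
    [36,  3, 45, 54],
    [ 4, 34, 23, 31],
    [ 9, 65, 11,  0],
    [17,  3,  3,  0],
    [ 0, 14,  7,  4],
])

model = NMF(n_components=2, init="random", random_state=0)
W = model.fit_transform(V)  # 5 x 2 document-by-factor matrix
H = model.components_       # 2 x 4 factor-by-word matrix
```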

Matrix W (the weights matrix) is a document-by-factor matrix, with dimensions of 5 x 2:

           Factor 1      Factor 2
Speech 1   0.021135218   0.63411542
Speech 2   0.26893587    0.24248544
Speech 3   0.56521061    2.2204e-16
Speech 4   0.028056074   0.088332775
Speech 5   0.11666223    0.035066365

Matrix H (the factors matrix) is a factor-by-word matrix, with dimensions of 2 x 4:

           labour      energy       market      employment
Factor 1   10.975128   118.16503    21.246259   2.2204e-16
Factor 2   55.024872   0.83496782   67.753741   89

Let’s interpret the result. First, we will take a look at the factors matrix H. What we want to identify here is which words have the strongest weight for each factor.
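
Continuing the snippet from above, ranking the words for each factor is a simple argsort over the rows of H:

```python
words = ["labour", "energy", "market", "employment"]

# List the words for each factor, strongest weight first.
for f, row in enumerate(H, start=1):
    ranking = [words[j] for j in row.argsort()[::-1]]
    print(f"Factor {f}: {ranking}")
```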

For Factor 1, the word “energy” has the highest overall weight, followed by “market”. The other two words have lower weights; the weight for “employment” (2.2204e-16) is numerically zero.

Factor 2 has high weights for words “employment”, “market” and “labour”, while it has low weight for the word “energy”.

With this in mind, we can say that factor 1 represents a set of words pertaining to an energy topic, while factor 2 represents a set of words pertaining to a labour topic.

The final step is to go back to matrix W. By finding the highest weight for each speech, we can determine the main topic of each speech. For example, the main topic of Speech 1 was labour, while the main topic of Speech 3 was energy.
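
In code, this is an argmax over the rows of W. Keep in mind that the order of factors is arbitrary across runs, so inspect H before attaching topic labels:

```python
# Our interpretation of the factors from the tables above; on your run
# the factor order may be swapped, so check H first.
topics = {0: "energy", 1: "labour"}

for i, weights in enumerate(W, start=1):
    print(f"Speech {i}: main topic is {topics[weights.argmax()]}")
```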

Use Case 2: Stock Trading Data Analysis

As well as handling count data like word frequencies, NMF is suited to problems involving continuous numerical data. This use case shows how the NMF algorithm can be applied to stock market trading volume data. Such data may reveal patterns tied to important trading days, or ways that underlying latent factors drive the volume of multiple stocks. The trading volume of a specific stock is the number of shares bought and sold within a given period, usually one day.

[Chart: P&G trading data]

This chart from Yahoo! Finance shows trading data for The Procter & Gamble Company stock. The line at the top is the closing price, the price of the last transaction of the day. The bar chart below it shows the trading volume.

You’ll notice that the volume tends to be much higher on days when there is a big change in the stock price. This often happens when a company makes a major announcement or releases financial results. Spikes can also occur due to news about the company or its industry. In the absence of external events, trading volume is usually, but not always, relatively stable for a given stock.

For modeling stock trading we will use volume. The reason behind this decision is not only that volume describes changes quite well, but also that it can never be negative, which is exactly what we need in order to use Non-Negative Matrix Factorization (NMF).

Again, the first step is to construct the data matrix. In this case, columns will correspond to the different stocks we want to analyze and rows will correspond to dates. Compare this with the previous use case: the speeches are now days, the words are now stocks, and the frequencies are now trading volumes.

In this example we will analyze the stocks of Google and Yahoo!

             GOOG        YHOO
2015-08-21   4,227,300   18,271,700
2015-08-20   2,624,600   15,525,800
2015-08-19   2,131,600    8,974,100
2015-08-18   1,452,600   11,422,700

For this purpose, I will use a tool I have developed. We will analyze the trading volume of Google and Yahoo! stocks from 2014-12-12 to 2015-08-21 using 10 factors (in this tool they are called features).
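
You don’t need my tool to reproduce the basic idea. Here is a minimal sketch with scikit-learn on a day-by-stock volume matrix; it includes only the four example dates from the table above and uses 2 factors so the tiny example stays sensible, whereas the full analysis covers the whole date range with 10 factors:

```python
import numpy as np
from sklearn.decomposition import NMF

# Day-by-stock trading volumes (GOOG, YHOO) for the four example dates;
# a real run would include every trading day from 2014-12-12 to 2015-08-21.
volumes = np.array([
    [4227300, 18271700],
    [2624600, 15525800],
    [2131600,  8974100],
    [1452600, 11422700],
])

model = NMF(n_components=2, init="random", random_state=0)
W = model.fit_transform(volumes)  # day-by-factor: which days express each factor
H = model.components_             # factor-by-stock: which stocks each factor drives
```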

Most of the factors have only one stock highly expressed, meaning that a single stock’s trading event was strong enough to become a factor by itself, with no underlying cause reflected across multiple stocks.

On the other hand, one of the factors (Feature 2) is equally present for both stocks, as you can see in the bar charts below.

[Charts: Stock analysis]

The left chart is generated from data in the factors matrix, while the right chart is generated from data in the weights matrix.

To find out whether there is any connection between the stock movements in this factor, we take a look at the dates most related to it. The most related date is 2015-07-17. Searching for news articles about Google and Yahoo! published on that date gives us the answer: Google’s stock skyrocketed after quarterly earnings topped analyst estimates, while Yahoo! filed papers to spin off its Alibaba assets into a separate corporate entity, raising Yahoo!’s stock price that day.

Our analysis has shown that although both companies saw a significant change in trading on the same day, the underlying causes were different. For other companies, or in other industries, the underlying cause can be the same; think of regulatory changes, for example.

Use Case 3: Recommendation Systems

In collaborative filtering recommendation systems, such as those used by Amazon, Netflix and many others, there is a group of users and a set of items: movies, books, products. Given that each user has rated some items, we would like to predict how each user would rate the items they haven’t rated yet, so that we can make recommendations.

Imagine an online bookstore where users can review books and grade them on a scale from 1 to 5. If we create a data matrix to represent this case, we will get something like this:

         Book 1   Book 2   Book 3   Book 4
User 1        5        3        /        1
User 2        4        /        /        1
User 3        1        1        /        5
User 4        1        /        /        4
User 5        /        1        5        4

A slash marks a book that the user has not reviewed. These are the values that we will try to predict.

The intuition behind using NMF to solve this problem is that there should be some latent factors that determine how a user rates an item. For example, two users would both give high ratings to a certain movie if they like its actors, or if it is an action movie, a genre they both prefer. If we can discover these latent factors, we can predict the missing ratings.

To feed the data matrix to the NMF algorithm, all the missing values are replaced with zeros.

The idea is to let the algorithm find the W and H matrices, and then reconstruct the original matrix by multiplying W by H. After running the algorithm with 2 factors, the reconstructed matrix looks something like this:

         Book 1   Book 2   Book 3   Book 4
User 1     4.97     2.98     2.18     0.98
User 2     3.97     2.40     1.97     0.99
User 3     1.02     0.93     5.32     4.93
User 4     1.00     0.85     4.59     3.93
User 5     1.36     1.07     4.89     4.12

The reconstructed known values are not far off from the original data matrix, and we also get predictions for the missing ones. We can act on these new insights by, for example, recommending products for which the predicted rating is higher than a defined threshold. For our bookstore, we might recommend books with a predicted rating higher than 3.
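
Here is a minimal sketch of the whole pipeline with scikit-learn. The exact predictions depend on initialization, so they will only roughly match the table above:

```python
import numpy as np
from sklearn.decomposition import NMF

# User-by-book ratings; 0 stands in for a missing review.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
])

model = NMF(n_components=2, init="random", random_state=0)
W = model.fit_transform(R)
H = model.components_
R_hat = W @ H  # reconstructed matrix, including predictions for the zeros

# Recommend unrated books whose predicted rating exceeds the threshold of 3.
recommend = (R == 0) & (R_hat > 3)
for user, book in zip(*np.nonzero(recommend)):
    print(f"Recommend Book {book + 1} to User {user + 1}")
```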

Further reading

If you would like to read more about Non-Negative Matrix Factorization, I recommend starting with these two papers:

  1. Algorithms for Non-Negative Matrix Factorization
  2. The Why and How of Non-Negative Matrix Factorization

If you want to play with the NMF algorithm and have a basic knowledge of Python, you can use the off-the-shelf implementation in scikit-learn, an open-source machine learning library written in Python.

You can also take a look at my Java implementation of the algorithm using multiplicative update rules on GitHub.