Can a Computer Replace Telegraf's Journalists? (FOTO)

 

I have been living outside Serbia for a while now, and my main way of keeping up with events back home is through online news portals.

Unfortunately, it seems to me that it is getting harder and harder to find good, well-written news. Even traditional media outlets have become prone to publishing so-called "clickbait" articles whose primary goal is not the news or its quality, but purely driving traffic to the web portal and spreading the story further on social networks through bombastic headlines.

About a year ago, the Dnevnjak crew released an excellent video satirizing how news headlines are made.

Looking at various headlines and inspired by that video, I asked myself the following question:

Is human intelligence really needed to write clickbait headlines?

Problem setup

To answer my question, I decided to build a model capable of writing clickbait headlines.

As with any other task, the first step is collecting the data to be used for modeling: in our case, Serbian clickbait headlines.

It did not take me long to decide which media portal to use for building the dataset: Telegraf, the archetype of clickbait journalism, imposed itself as the logical choice.

The second step is to build a model on the collected data and use it to generate headlines. In this case, the model will be a recurrent neural network, specifically a Long Short-Term Memory (LSTM) network.

Data collection

The dataset has to be fairly large for the model built on top of it to produce meaningful output. A larger dataset generally means the model will learn the domain better.

Collecting the data manually would be pointless, since building a dataset of adequate size would take far too long.

How can we automate the collection of headlines?

Let's take a look at what the news pages on Telegraf look like. A quick visit to the Vesti (news) section and an inspection of the page source shows that the headlines sit inside HTML span elements. Furthermore, at the bottom of the page there is a Starije (older) button that leads to older news, with a very convenient URL structure: http://www.telegraf.rs/vesti/page/2.

What can we conclude from this?

  1. By extracting the text inside the span elements we can get all the headlines from a single page
  2. By changing the category and page number in the URL we can collect headlines from a large number of pages

Let's turn that into code. For collecting the headlines I decided to use Scrapy, a Python framework for extracting content from web pages.

The framework is excellent, and within 10-15 minutes I had a working solution for downloading the headlines.

You can see just how simple it is to build your own scraper below:


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import scrapy

_URL_TEMPLATE = 'http://www.telegraf.rs/{}/page/{}'
_MAX_PAGE_COUNT = 999
_CATEGORIES = 'vesti', 'sport', 'jetset', 'zanimljivosti', 'zivot-i-stil'
_SKIPPED_SPAN_VALUES = '0.20', 'starije', 'novije', u'©Telegraf 2016', u'Sva prava zadržana.'


class TelegrafSpider(scrapy.Spider):
    name = 'naslovi'

    def start_requests(self):
        # One request per (category, page number) combination.
        urls = (_URL_TEMPLATE.format(cat, i) for cat in _CATEGORIES for i in range(_MAX_PAGE_COUNT))

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Every headline sits inside a span element; skip navigation and footer spans.
        for title in response.css('span::text').extract():
            if title in _SKIPPED_SPAN_VALUES:
                continue
            yield {
                'title': title.strip()
            }
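Assuming the spider is saved as telegraf_spider.py (a hypothetical file name; the output file name below is also just an example), Scrapy can run it and dump the scraped items to a file with a single command:

scrapy runspider telegraf_spider.py -o naslovi.json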

Combining the headlines from the Vesti, JetSet, Zanimljivosti and Život i stil categories produced a dataset of 63,119 headlines that we will use to train the model.

Below are 10 randomly chosen headlines (in the original Serbian) that illustrate the dataset:

Stigla je nova beba Kardašijan! Da li biste detetu IKADA dali OVO IME?! (FOTO) (VIDEO)
Milica Pavlović, Aleksandra Bursać, Ivana… Pogledajte ko je bio na venčanju zvezde “Granda”! (FOTO)
BEZOBRAZNI KUĆNI SNIMAK: Srpska pevačica se milovala u krevetu pa prste stavila… (VIDEO)
Evo kako je Filip prokomentarisao priče o raskidu sa Marijom Anom! (FOTO)
Pre 30 godina je bila bomba, gola se mazala šlagom, nestala, ušla na Farmu, a ovako sada izgleda seksi Suzana! (FOTO) (VIDEO)
POKUŠAO DA SILUJE MAJKU I ĆERKU: Drama u porodičnoj kući u Ivanjici
ŠUMADIJSKI RULET: Srbi se PREKRSTE, pa krenu preko ovog mosta! (FOTO)
OGROMAN USPEH SRPSKIH LEKARA: Ugrađen prvi implantabilni monitor srca!
(UZNEMIRUJUĆI VIDEO) KAKAV LUDAK: Pucao sam sebi u glavu zbog opklade i PREŽIVEO (VIDEO)
Fenomenalan trik da se ZAUVEK otarasite MUVA, i to bukvalno PREKO NOĆI, a potrošićete samo 75 dinara

You can download the combined dataset, as well as the individual per-category datasets, for your own research purposes. The links are at the bottom of this post.

Building the model

Language modeling is a very interesting machine learning task from several angles, from the nature of the input data to the complexity of the relationships between characters, words, sentences and their meaning.

In this case we will use a Long Short-Term Memory (LSTM) recurrent neural network, which we will train to generate clickbait headlines.

LSTM recurrent neural network. Source: colah.github.io

The LSTM architecture has proven to be excellent at modeling sequential data. Today it is one of the most commonly used architectures for modeling speech, handwriting, translation, image and video captioning…

The main characteristic that makes LSTM networks powerful compared to classic neural networks is their ability to "remember" context.

For example, suppose the model's task is to predict the last word of a passage such as: "I'm craving something sweet. Chocolate is sweet. I'm going to buy some chocolate." A classic neural network would have trouble predicting that last word correctly. For an LSTM network, however, predicting "chocolate" is significantly easier, because through its state it can learn and remember the relationships between entities, and it learns the entities themselves without any human help.

For those who would like to learn more about LSTM networks, I recommend reading Christopher Olah's post.
Beyond that, if you want to get serious about the subject, follow Andrej Karpathy. Andrej is a brilliant researcher from Stanford University who shares his work on state-of-the-art neural network models with the community.

The LSTM model we will train is based on his implementation published on GitHub. It is a model that predicts the next character (letter) in a sequence of characters.

This might not sound like much at first, but remember that the network does not predict the next character from the previous character alone; it also relies on its internal state, and that state provides broader context: several previous characters, the first character, all the text seen so far, and who knows what other implicit attributes the network learns on its own while training on the dataset.

The model was trained as a recurrent network with 2 layers and 128 units per layer. Training the network took about 12 hours in total, and different network checkpoints were saved along the way so the learning progress could be tracked.
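The actual training used Karpathy's char-rnn implementation; purely for illustration, a roughly comparable character-level setup could be sketched in Keras like this (the sequence length and vocabulary size are assumed values, and this is not the code that produced the results below):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Assumed values for illustration: a 40-character context window and ~80 distinct characters.
SEQ_LEN, N_CHARS = 40, 80

# Two LSTM layers with 128 units each, mirroring the configuration described above.
model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(SEQ_LEN, N_CHARS)),
    LSTM(128),
    Dense(N_CHARS, activation='softmax'),  # probability distribution over the next character
])
model.compile(loss='categorical_crossentropy', optimizer='adam')

# X: one-hot encoded character windows, y: one-hot encoded next characters.
# model.fit(X, y, batch_size=128, epochs=50)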

Results

Let's see what our pupil has learned at Telegraf's school of journalism:

After the first pass through the dataset, the network generated the following headlines:

JEDNI NARADINE U SRBI: Pogolita izgledao na izgradi PODINIRE PREDSED! (FOTO)
VAO NA REKLA PUCA U MINA: Klaca Srpinti zgredi! (VIDEO)
ČLIŠAVA TREGNA NEVEZA: Pampunjaši svaku “Zoruovanom pita! (VIDEO)

We can see that the network has already learned that Telegraf headlines end with (FOTO) and (VIDEO), and that the part of the sentence before the colon is conventionally written in all caps. Still, the headlines are not exactly readable yet.

After about five passes, the generated headlines already start to make more (Telegraf) sense:

ZA DOBRO JUTRO: Nikad žena preće? Evo kako!
ZVAĆINOVO: Hočin Tačić rado! (18+) (VIDEO)
Viking donacija zbog guza hoće na državljaju
HAOS U DAMCI: Pevačica otkrio Karadžiću u kišu (FOTO)

Around the eighth pass, the model has learned a bit more about the darker side of Telegraf headlines:

SILOVAO DA SE ZAMALO PROSTITETNIM BEOGRADA OGROMNIM ALBANSKO
Putinski seks-ulom idem sa živeti!
Vučić: (VIDEO)

36 passes later:

PINKOVE GLAS PRIČA: Karan Kalifi Marija Foksabi, ali ga napravili PUCALE na Instagramu?
EKSKLUZIVNO: Aleksandar Švercina Milijara Šabanova beba u nju (VIDEO)
NIĆI RAZBILA: Sloboda Gagijen, pomislio klopa u Telegraf (FOTO) (VIDEO)
TRUDNICA IZVEDENIH KRIZA U BEOGRADU: Na zamaklu, Srbiji u ritnu na izborima i kafića vakcinise (VIDEO)


After 42 passes, the model seems to be implicitly warning us that Telegraf is evil:

ŽIVI SE PLATI: Svetski život BEZ TELEGRAFA SE KOJI ĆE VAM SE VAM SE PREŽBUNJA!
BIĆA LAZI: Milan Hramrić zabranio Barara? Sodina izazvala protestu! (FOTO)

50 passes later:

ZBOGODISLI MIS PUTINA: Napravite dualja (FOTO)
SILOVAO DA JE OVO! Ovako zatvorile smuvao dete o maju!
OSTAO TE: Hrvati otrov na svetu, ključjusa i banane, za jednim četvrtivno! (FOTO)

And my favorite:

HRVATSKA NUKLEARNA GLAVA: Tamara ispala Malomečević u ćansi detalja

 

Conclusion

The first training run of the network produced quite interesting results. Although grammar is definitely not its strong suit, parts of the headlines do not sound completely absurd and random, especially the first part of the headline written in all caps. The complexity of Serbian grammar also has to be taken into account; if we translated the headlines into English, which is not sensitive to grammatical cases, gender and the like, the results would look considerably better.

Looking at the value of the loss function across the iterations, it seems to have converged very quickly. It is possible that the function got stuck and could not reach the minimum because the learning rate was too high, even though the rate was decreased iteratively.

Loss function

The loss function is, put simply, the difference between the true value and what the model predicted. We want to minimize its value through the learning process. The learning rate can be thought of as the size of the step taken on the way towards the minimum: if the step is too large, we will not be able to descend to the minimum because we will keep overshooting it. An illustration helps to grasp the concept. The example concerns finding the minimum in a two-dimensional space; in modern neural networks the number of dimensions is measured in tens and hundreds of millions.
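To make the step-size intuition concrete, here is a toy sketch (unrelated to the actual network training) of gradient descent on the one-dimensional function f(x) = x², whose minimum is at 0:

def descend(learning_rate, x=5.0, steps=20):
    # The gradient of f(x) = x^2 is 2x; repeatedly step against the gradient.
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x

print(descend(0.1))  # small steps: ends very close to the minimum at 0
print(descend(1.1))  # steps too large: every update overshoots and |x| keeps growing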

Changes to the model configuration that could potentially improve performance include adjusting the learning rate and the number of units per layer. Each of these changes would inevitably make training, which already takes 12 hours, even longer, and that is a limiting factor.

Still, I hope some of these headlines made you laugh as much as they made me laugh when I first read them. And I hope you learned something new as well.

Disclaimer

This is my personal blog. All opinions expressed here are solely my own and do not represent those of my employer.
The headline scraping, model training and writing of this article were done in my free time, outside working hours.
The results generated by the model were learned from the original headlines and have not been edited (they are the raw outputs of the model).

All of this work was done for research purposes and with no ill intent.

Datasets

All the datasets used can be downloaded as text files from my GitHub account.

Acknowledgements

The LSTM implementation used here was written by Andrej Karpathy. Once again, I warmly recommend following his work.

Monte Carlo Simulations and Connecting Flights

Introduction

This blog post was mostly written while waiting at Gate B1 at Sofia International Airport (SOF).

I decided to return to Dublin from Sofia with Austrian Airlines, which includes a connection with only a 50-minute layover at Vienna airport.

I don’t like layovers this short – it can easily happen that I miss my next flight and fall into the limbo of finding alternative flights.

So, to amuse myself, I asked: What is the chance of me making it to the next flight?

Situation modelling

Let’s define the situation:

My flight from Sofia to Vienna is scheduled to depart at 13:40 (GMT+2). My next flight, from Vienna to Frankfurt, takes off at 15:10 (GMT+1). In between, I have to get to the gate on time.

So, what are my chances?

To answer my main question, I needed to find answers to the following questions:

  1. What is the average delay of the flight from Sofia to Vienna?
  2. How long does it take to get from my arrival gate to my next flight's departure gate?
  3. How often does the flight from Vienna to Frankfurt get delayed (and by how much)?

Once we have them, we want to calculate the total time it will take to get to the gate of my next flight – my time of arrival at the gate.

We will then calculate when my next flight might take off by taking into account its delay.

By subtracting one of these time values from the other, we get the time difference – this is the outcome we are interested in.

We are interested in values greater than 0, because this means I will make it on time. If the value is below 0, it means I’m late.

To model this problem, I first checked out a website called flightstats.com. There, I found out what the average delay time is and approximately what the distribution of delays looks like. We cannot describe the delays with a Gaussian distribution: the most common scenario is the flight arriving more or less on time, and while longer delays are possible, they are far less frequent.

We will use the (negative) exponential probability distribution, as the data from FlightStats.com follows its shape. This distribution takes only one parameter, called the rate, which is equal to 1/mean. You can read more about it on Wikipedia. One feature of the exponential distribution that is particularly useful in this case is that it cannot produce negative values, as it is supported on the interval [0, ∞). A flight cannot have a negative delay, right?

FlightStats.com tells us that the average delay (the mean) is 35 minutes. Alright, so now we know our rate parameter will be 1/35.
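As a quick sanity check of that parameterization (a Python sketch; the actual simulation below is written in R, and note that numpy parameterizes the exponential by its mean, i.e. 1/rate):

import numpy as np

# Delays with rate 1/35, i.e. a mean of 35 minutes.
delays = np.random.default_rng(0).exponential(scale=35, size=100_000)
print(delays.mean())  # close to 35
print(delays.min())   # never negative, as expected for a delay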

How do we proceed? We would like to simulate many different outcomes so we know what to expect. The idea is to run many trials with different outcomes and combine them all, so we can get an idea of what our outcome distribution looks like.

Monte Carlo to the rescue!

With Monte Carlo simulations, we will sample 100,000 values from the chosen distributions to get 100,000 different outcomes, combining values from our distributions of delay times, terminal transfer time and so on. I chose the sample size of 100,000 arbitrarily; you can choose a different one. The bigger the sample, the better the approximation gets.

When can we expect our first flight to arrive? Let's combine the flight departure time, the sampled delay times and the flight duration.

Milestone 1: We can calculate 100,000 possible arrival times by summing the departure time and 100,000 values sampled from the exponential distribution of delay times, and adding the flight duration.

For the transfer at Vienna airport, the airport's website states it takes approximately 20 minutes. Alright, let's assume that the time needed to move from gate A to gate B is normally distributed with a mean of 20 minutes and a standard deviation of 5 minutes.

Milestone 2: We can calculate the expected time of arrival at the gate of my next flight by summing the values from milestone 1 with the sampled values of the time needed to get from the arrival gate to the departure gate.

Before we move on to the next step, we have to take two time corrections into consideration:

  1. Timezone difference – Vienna is GMT+1 and Sofia is GMT+2, so we will subtract one hour
  2. Gate closing time – gates close 20 minutes before departure; we will add 20 minutes to the total time to adjust

Milestone 3: The total time to get to the gate has been corrected for the timezone difference and the gate closing time

Let's also check how often the flight from Vienna to Frankfurt is delayed, as any delay there gives me bonus time to make it! 🙂

On average, it's 9 minutes, so we will again model this delay with an exponential distribution, this time with a rate of 1/9.

Milestone 4: We calculate 100,000 possible departure times of the Vienna to Frankfurt flight by adding sampled delay times to the scheduled departure time

Our last step is to subtract the sampled total time to get to the gate from the sampled departure times of my next flight.

Analysis outcome: a distribution of sampled values telling us the difference between the departure time and the time needed to reach the gate

Now, let’s combine everything together.
For this purpose, we will use R, a programming language for statistical computing.

We will model everything using minutes and absolute time.

 sofiaViennaDepartureTime = 13 * 60 + 40                    # 13:40, expressed in minutes
 sofiaViennaDelayTime = rexp(100000, 1/35)                  # sampled delays, mean 35 minutes
 sofiaViennaFlightTime = 100                                # flight duration in minutes
 sofiaViennaArrivalTime = sofiaViennaDepartureTime + sofiaViennaDelayTime + sofiaViennaFlightTime
 timezoneCorrection = 60                                    # Sofia is GMT+2, Vienna is GMT+1
 sofiaViennaArrivalTime = sofiaViennaArrivalTime - timezoneCorrection
 viennaAirportTransfer = rnorm(100000, 20, 5)               # gate-to-gate transfer time
 gateClosingCorrection = 20                                 # gate closes 20 minutes before departure
 totalTimeToFlight = sofiaViennaArrivalTime + viennaAirportTransfer + gateClosingCorrection
 viennaFrankfurtDepartureTime = 15 * 60 + 10                # 15:10, expressed in minutes
 viennaFrankfurtDelay = rexp(100000, 1/9)                   # sampled delays, mean 9 minutes
 viennaFrankfurtTotalTime = viennaFrankfurtDepartureTime + viennaFrankfurtDelay
 result.set = viennaFrankfurtTotalTime - totalTimeToFlight  # buffer time: positive means I make it
 sum(result.set > 0) / 100000                               # estimated probability of catching the flight
[1] 0.3963

After adding and subtracting all the different random variable samples, we got our final result and stored it in result.set. To calculate my probability, I just need to count the number of values greater than 0 and divide it by the total number of values, which is 100,000.

So, I have a 0.3963 chance of arriving on time to catch my next flight, or, in other words, I should make my connection about 4 out of 10 times!

If we plot the resulting values as a histogram, it looks something like this:

Flight buffer time histogram
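For readers who prefer Python, here is a rough numpy equivalent of the R simulation above (an illustrative sketch; the 0.3963 figure quoted earlier comes from the R run):

import numpy as np

n = 100_000
rng = np.random.default_rng()

sofia_vienna_departure = 13 * 60 + 40                       # 13:40, in minutes
sofia_vienna_delay = rng.exponential(scale=35, size=n)      # mean delay 35 minutes
sofia_vienna_flight = 100                                   # flight duration, minutes
timezone_correction = 60                                    # Sofia is GMT+2, Vienna is GMT+1
arrival = sofia_vienna_departure + sofia_vienna_delay + sofia_vienna_flight - timezone_correction

transfer = rng.normal(loc=20, scale=5, size=n)              # gate-to-gate transfer time
gate_closing = 20                                           # gate closes 20 minutes before departure
total_time_to_gate = arrival + transfer + gate_closing

vienna_frankfurt_departure = 15 * 60 + 10                   # 15:10, in minutes
vienna_frankfurt_delay = rng.exponential(scale=9, size=n)   # mean delay 9 minutes
buffer_minutes = vienna_frankfurt_departure + vienna_frankfurt_delay - total_time_to_gate

print((buffer_minutes > 0).mean())  # estimated probability of making the connection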

Conclusion

Is this model accurate? Probably not. There are two main reasons: first of all, I made a lot of assumptions about the probability distributions – maybe the exponential and normal distributions I used don't fit this data so well.

Also, there are only 74 observations of delay time for the flight from Sofia to Vienna – maybe our sample mean is biased and a single outlier is heavily affecting it. With a larger sample, the mean delay might be smaller, giving the distribution a different shape and resulting in more positive outcomes.

Was this a nice brain exercise? It most definitely was and it helped me kill some time at the airport! 🙂

I hope you enjoyed this post and found out something new from it. As always, feedback is highly appreciated and I would love to hear your opinion!

For your information, I made it to my next flight and published this post from Frankfurt International Airport! 🙂

Now, off to my flight to Dublin!

Non-Negative Matrix Factorization (NMF)

Hello everyone!

In this post, I will talk about the Non-Negative Matrix Factorization algorithm, a very interesting unsupervised learning algorithm. In the first part of this post I will give you a brief overview and explanation of the algorithm, and in the second half the focus will be on practical examples of NMF in action, so you can build your intuition about how this algorithm works.

TL;DR: If you are searching for NMF algorithm implementation in Python, check out my GitHub repo.

What is Non-Negative Matrix Factorization?

Non-Negative Matrix Factorization (NMF) is a recent technique for linear dimensionality reduction and data analysis that yields a parts-based, sparse, non-negative representation of non-negative input data. Essentially, NMF is an unsupervised learning algorithm coming from linear algebra that not only reduces data dimensionality, but also performs clustering at the same time.

It has found a wide variety of applications, including text analysis, document clustering, face/image recognition, language modeling, speech processing and many others.

Understanding how NMF works

Before we go any further, let’s understand the intuition of Non-Negative Matrix Factorization.

Given a non-negative data matrix containing a set of items with a fixed number of features, the NMF algorithm produces a factorization of the data matrix and reveals the interesting latent factors underlying the interactions between the items and their features. Each item can be described in terms of these latent factors, and, using the information about how strongly each latent factor is expressed for each item, we can cluster the items that share the same latent features. If you are familiar with statistical factor analysis, you will find that one of the resulting matrices looks just like the one you would get by running a factor analysis – only all factor weights will be non-negative.

Let’s formalize what NMF actually does.

Given a set of multivariate n-dimensional data vectors, the vectors are placed in the columns of an n x m matrix V, where m is the number of examples in the data set. This matrix is then approximately factorized into an n x r matrix W (the weights matrix) and an r x m matrix H (the features matrix), where r is the number of factors defined by the user. Usually r is chosen to be smaller than n or m, so that W and H are smaller than the original matrix V. This results in a compressed version of the original data matrix.
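In the common least-squares formulation (one of several objective functions used in the NMF literature), the factorization is obtained by solving

$$\min_{W \ge 0,\; H \ge 0} \; \lVert V - WH \rVert_F^2$$

that is, W and H are chosen so that their product approximates V as closely as possible while both remain element-wise non-negative.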

Multiplying the weights matrix by the features matrix

The columns of W and the rows of H represent the latent factor bundles that are believed to be responsible for the observed data in V. The building blocks are not individual factors but weighted bundles of factors that serve a common purpose.

The reason why it's called non-negative matrix factorization is that it returns factors and weights with no negative values. This means that all the features in the data matrix must have non-negative values. It also prevents latent factors from cancelling out pieces of other latent factors, as NMF cannot produce results that explicitly subtract certain factors from the original data matrix.

Use cases

If you are new to linear algebra or machine learning, I suppose the concepts described in the previous part might not be entirely clear to you. By going through several NMF use cases, I hope you will get a better understanding of the NMF algorithm.

Use Case 1: Topic Modeling

Let’s suppose we have a collection of transcripts of parliament speeches and let’s set a goal of uncovering common topics of these speeches.

For a start, we need to define our data matrix. For topic modeling we use a document-term matrix, which describes the frequency (number of occurrences) of each word in each document.

For illustration purposes, we will construct a matrix that is only 5 x 4. In a real-life scenario, the number of columns and rows would be much higher. We construct the data matrix so that rows correspond to documents and columns to words:

 

           labour  energy  market  employment
Speech 1       36       3      45          54
Speech 2        4      34      23          31
Speech 3        9      65      11           0
Speech 4       17       3       3           0
Speech 5        0      14       7           4

 

Now that we have defined our data matrix, we proceed to running the algorithm. The algorithm requires one user-defined parameter: the number of factors to be extracted (r in the previous section). Suppose we want to extract 2 factors from our data. In that case, after running the algorithm we get the resulting W (weights) and H (factors) matrices.

Matrix W (weights matrix) is a document by factor matrix, with dimensions of 5 x 2:

 

           Factor 1     Factor 2
Speech 1   0.021135218  0.63411542
Speech 2   0.26893587   0.24248544
Speech 3   0.56521061   2.2204e-16
Speech 4   0.028056074  0.088332775
Speech 5   0.11666223   0.035066365

 

Matrix H (factors matrix) is a factor by word matrix, with dimensions of 2 x 4:

 

           labour     energy      market     employment
Factor 1   10.975128  118.16503   21.246259  2.2204e-16
Factor 2   55.024872  0.83496782  67.753741  89

 

Let’s interpret the result. First, we will take a look at the factors matrix. What we want to identify here is which words have the strongest weight for each factor.

For Factor 1, the word “energy” has the highest overall weight, followed by “market”. The other two words have lower weights.

Factor 2 has high weights for words “employment”, “market” and “labour”, while it has low weight for the word “energy”.

Having this in mind, we can say that Factor 1 captures a set of words pertaining to the energy topic, while Factor 2 captures a set of words pertaining to the labour topic.

The final step is to go back to matrix W. By finding the highest weight for each speech we can conclude what the main topic of that speech was. For example, the main topic of Speech 1 was labour, while the main topic of Speech 3 was energy.
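If you would like to reproduce this use case, a minimal sketch using scikit-learn's NMF implementation could look as follows (the exact W and H values depend on the initialization and solver, so they will be close to, but not identical to, the tables above):

import numpy as np
from sklearn.decomposition import NMF

# The 5 x 4 document-term matrix from above
# (rows: speeches; columns: labour, energy, market, employment).
V = np.array([
    [36,  3, 45, 54],
    [ 4, 34, 23, 31],
    [ 9, 65, 11,  0],
    [17,  3,  3,  0],
    [ 0, 14,  7,  4],
])

nmf = NMF(n_components=2, init='random', random_state=0)
W = nmf.fit_transform(V)   # 5 x 2 document-by-factor weights
H = nmf.components_        # 2 x 4 factor-by-word matrix

print(np.round(W, 3))
print(np.round(H, 3))
print(W.argmax(axis=1))    # dominant factor (topic) for each speech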

 

Use Case 2: Stock Trading Data Analysis

As well as dealing with somewhat nominal data like word counts, NMF is suited to problems involving truly numerical data. This use case shows how the NMF algorithm can be applied to stock market trading volume data. Such data may reveal patterns of important trading days or ways in which underlying latent factors drive the volume of multiple stocks. The trading volume for a specific stock is the number of shares that are bought and sold within a given period, usually one day.

P&G trading data

This chart from Yahoo! Finance shows trading data for The Procter & Gamble Company stock. The line at the top is the closing price, the price of the last transaction of the day. The bar chart below shows the trading volume.
You’ll notice that the volume tends to be much higher on days when there is a big change in the stock price. This often happens when a company makes a major announcement or releases financial results. Spikes can also occur due to news about the company or the industry. In the absence of external events, trading volume is usually, but not always, constant for a given stock.

For modeling stock trading we will use volume. The reason behind this decision is not only that it describes changes quite well, but also that volume can never be negative, which is exactly what we need for Non-Negative Matrix Factorization (NMF).

Again, the first step is to construct the data matrix. In this case, columns will correspond to the different stocks we want to analyze and rows will correspond to dates. Compare this with the previous use case: the speeches are now days, the words are now stocks and the word frequencies are now trading volumes.

In this example we will analyze stocks of Google and Yahoo!

 

             GOOG       YHOO
2015-08-21   4,227,300  18,271,700
2015-08-20   2,624,600  15,525,800
2015-08-19   2,131,600   8,974,100
2015-08-18   1,452,600  11,422,700

 

For this purpose, I will use the tool I have developed. We will analyze trading volume of Google and Yahoo! stocks from 2014-12-12 to 2015-08-21 using 10 factors (in this tool they are called features).

Most of the factors have only one stock highly expressed, meaning that the event behind that stock was so strong that it became a factor in itself, with no underlying cause reflected across multiple stocks.

On the other hand, one of the factors (Feature 2) is equally present for both stocks, as you can see in this bar chart.

Stock analysis

The left chart is generated from the factors matrix, while the right chart is generated from the weights matrix.

To find out whether there is any connection between the stock movements in this factor, we take a look at the dates most related to it. The most related date is 2015-07-17. Searching for news articles about Google and Yahoo! published on that date gives us the answer: Google's stock skyrocketed after quarterly earnings topped analyst estimates, while Yahoo! filed the papers to spin off its Alibaba assets into a separate corporate entity, raising Yahoo!'s stock price that day.

Our analysis has shown that although both companies had a significant change in trading on the same day, the underlying causes were different. For some other companies, or in some other industries, the underlying cause can be the same. Think of regulatory changes, for example.

 

Use Case 3: Recommendation Systems

In collaborative filtering recommendation systems such as the ones used by Amazon, Netflix and many others, there is a group of users and a set of items – movies, books, products. Given that each user has rated some items, we would like to predict how each user would rate the items they haven’t rated yet so we can make recommendations.

Imagine an online bookstore where users can review books and grade them on a scale from 1 to 5. If we create a data matrix to represent this case, we will get something like this:

 

         Book 1  Book 2  Book 3  Book 4
User 1   5       3       /       1
User 2   4       /       /       1
User 3   1       1       /       5
User 4   1       /       /       4
User 5   /       1       5       4

A slash marks a book that the user has not reviewed. These are the values that we will try to predict.

The intuition behind using NMF to solve this problem is that there should be some latent factors that determine how a user rates an item. For example, two users will both give a movie a high rating if they both like its actors, or if it is an action movie and both users prefer that genre. If we can discover these latent factors, we can predict the missing ratings.

To feed the data matrix to the NMF algorithm, all the missing values are replaced with zeros.

The idea is to let the algorithm find the W and H matrices, and then reconstruct the original matrix by multiplying W by H. After running the algorithm with 2 features, the reconstructed matrix looks something like this:

 

         Book 1  Book 2  Book 3  Book 4
User 1   4.97    2.98    2.18    0.98
User 2   3.97    2.40    1.97    0.99
User 3   1.02    0.93    5.32    4.93
User 4   1.00    0.85    4.59    3.93
User 5   1.36    1.07    4.89    4.12

The reconstructed known values are not too far off from the original data matrix, and we also get predictions for the missing ones. We can act on these new insights by, for example, recommending products whose predicted rating is higher than a defined threshold. For our bookstore, we might recommend books with a predicted rating higher than 3.
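As a sketch of how this reconstruction could look in practice (again using scikit-learn; the reconstructed numbers depend on initialization and the solver, so they will differ somewhat from the table above):

import numpy as np
from sklearn.decomposition import NMF

# The bookstore ratings matrix; 0 stands in for a missing rating.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
])

nmf = NMF(n_components=2, init='random', random_state=0, max_iter=1000)
W = nmf.fit_transform(R)
H = nmf.components_
R_hat = W @ H                       # reconstructed matrix with predicted ratings

print(np.round(R_hat, 2))
print((R == 0) & (R_hat > 3))       # missing entries with a predicted rating above 3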

 

Further reading

If you would like to read more about Non-Negative Matrix Factorization, I recommend starting with these two papers:

  1. Algorithms for Non-Negative Matrix Factorization
  2. The Why and How of Non-Negative Matrix Factorization

If you want to play with the NMF algorithm and you have a basic knowledge of Python, you can use the off-the-shelf implementation available in scikit-learn, an open-source machine learning library written in Python.

You can also take a look at my Java implementation of the algorithm using multiplicative update rules on GitHub.