Next in Line! Analyzing Sequential Data Using a Markov Chain Approach
Busy or tired while texting a friend? You have most likely experienced making a glaring mistake at least once. After receiving lots of laughing emojis, you recheck what you just sent: “I’m getting Pringles tonight!” turned into “I’m getting pregnant tonight!” Oops… But even though automated spelling correction - or “autocorrect”, as it’s more widely known - can result in quite hilarious or downright embarrassing moments, it usually helps us type faster.
Based on the context of other words in the message and the first letters typed, our mobile phones attempt to predict which word is being typed next. This function is built on the mathematical notion of Markov chains. Based on its most recent values – in this case, the words that are already included in the message – a Markov chain predicts the next value. In addition to autocorrect, Markov chains have been successfully applied in a wide range of other domains, including economics and finance (e.g., predicting asset prices), sports (e.g., baseball analysis), games (e.g., snakes and ladders), search engine algorithms (e.g., Google’s PageRank algorithm), and speech recognition. There is even an online tool that generates Donald Trump tweets based on a dataset of more than 11,000 of his tweets. (Feel free to try it out yourself: https://filiph.github.io/markov/)
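The core idea can be sketched in a few lines of Python. The snippet below is a minimal, hypothetical illustration (the toy corpus and function names are our own, not part of any real autocorrect system): it counts which word follows which in a small set of messages, and then predicts the most frequent follower of a given word – a first-order Markov chain.

```python
from collections import Counter, defaultdict

# Toy corpus; a real system would use a large text collection
# or the user's own typing history.
corpus = [
    "i am getting pringles tonight",
    "i am getting pizza tonight",
    "i am running late",
]

# Count word-to-word transitions (the "memory" of the chain
# is only the single most recent word).
transitions = defaultdict(Counter)
for message in corpus:
    words = message.split()
    for current_word, next_word in zip(words, words[1:]):
        transitions[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequent word following `word`, or None if unseen."""
    followers = transitions.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

print(predict_next("i"))  # "am" follows "i" in every toy message
```

In this toy corpus, `predict_next("i")` returns `"am"`, because “am” is the only word ever observed after “i”.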
Despite their mathematical simplicity, Markov chains have not yet been widely implemented in communication research. In a recent study, we demonstrate how Markov chains can be used to model the behaviour of online news users. The volume of clickstream and user data collected by news organizations has reached enormous proportions. As a result, news organizations - as well as journalism scholars - face novel methodological challenges in describing and analyzing this wealth of information. We advocate the use of Markov chains, which provide an effective and compact way to represent Web usage data and are especially suited to detecting Web pages that are often viewed in sequence. To do so, we used online tracking data from approximately 350 users, covering 1 million Web page visits across 175 different websites (news websites, search engines, social media) collected over 8 months in 2017/18. Using this data set, we show in a step-by-step way how raw data can be transformed into relevant and meaningful information.
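To give a flavour of what such a transformation looks like, here is a simplified sketch (the session data and category labels below are invented for illustration and do not come from the study’s dataset): consecutive page views are counted as transitions, and each row of counts is normalized into transition probabilities – the Markov chain’s compact representation of browsing behaviour.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified clickstream: one browsing session per user,
# with each page view reduced to a site category.
sessions = [
    ["search", "news", "news", "social"],
    ["social", "news", "search", "news"],
    ["news", "news", "social"],
]

# Count transitions between consecutive page views.
counts = defaultdict(Counter)
for session in sessions:
    for src, dst in zip(session, session[1:]):
        counts[src][dst] += 1

# Normalize each row into transition probabilities:
# P(dst | src) = count(src -> dst) / total transitions out of src.
matrix = {
    src: {dst: n / sum(followers.values()) for dst, n in followers.items()}
    for src, followers in counts.items()
}

for src, row in sorted(matrix.items()):
    print(src, row)
```

In this toy example, every observed visit to a search engine is followed by a news page (`P(news | search) = 1.0`), while from a news page users move on to another news page 40% of the time, to social media 40% of the time, and to a search engine 20% of the time. Pages that are often viewed in sequence show up as high-probability entries in such a matrix.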
Using Markov chains to describe and analyze patterns of news use is not only useful for journalism scholars, but also essential for journalism practice. In particular, the sequentiality of news has gained importance. The behaviour of news users can be an important source of support and inspiration for editorial teams. Moreover, such insights help to determine whether a certain news article should be offered only to premium members. They could also help to develop and improve personalization and content recommendation.
Besides journalism scholars, Markov chains could also be highly relevant for other communication scholars working with sequential data: mobile communication scholars (e.g., analyzing transitions between mobile application usages), marketing communication scholars (e.g., analyzing consumer behaviour in online shopping), and health communication scholars (e.g., examining usage of healthcare information technologies).
Interested in using Markov chains for your own data set? In collaboration with Damian Trilling, Susan Vermeer has published this work in Journalism Studies: “Toward a better understanding of news user journeys: A Markov chain approach”. Additionally, they provide a Python module (df2markov) that automatically performs all necessary calculations and can be used by anyone for their own data sets (https://github.com/uvacw/df2markov).