Dynamic Topic Modeling on XMPP Literature – Part 1

By now, we have gathered over 200 research papers in which our favorite real-time protocol features prominently, with even more to come via the ACM Digital Library (see Daniel’s previous post on that). Reading all these papers clearly shows that XMPP is applied in many different and highly relevant research domains such as IoT, cloud computing, and e-health. With such a comprehensive collection of works spanning more than a decade in our hands, the researcher in us cries for what? Correct: SURVEY PAPER!

The most widely cited survey paper, “From Instant Messaging to Cloud Computing, an XMPP review” by Hornsby & Walsh, is now five years old. Since then, a lot more research has been published – roughly 70% of the papers in our collection were published after 2010! (Well, let’s see if Daniel’s treasure hunt in the ACM Digital Library can push this percentage even higher!)

For a survey paper, it is particularly interesting to find out which major topics were covered over time and which topics trended when. As computer scientists, we tend to answer such questions by analyzing larger text corpora with natural language processing and statistical techniques. In particular, dynamic topic modeling is a suitable technique for our purpose, since it tracks how the topics found in a corpus change from one time slice to the next.

In collaboration with his students, our dear colleague Dr. Michael Derntl has developed the D-VITA tool, which realizes dynamic topic modeling using Latent Dirichlet Allocation (LDA). For our survey paper, we decided to apply the tool to our data set. Unfortunately, most of the work happens *before* the actual application of the algorithm – data cleaning. For each paper, we need to fetch the PDF version, extract plain text, and remove artifacts such as running heads and page numbers until the only thing we have left is a lower-case sequence of words, separated by spaces. Since we include papers from many different publication outlets with many different paper templates, this is quite a messy and hard-to-automate job. Until now, we have managed to create our first topic models with Michael’s help for about 40 papers from our collection, but we’ll continue and discuss intermediate results. Just like Number 5, the topic modeling algorithm needs more input to work better. Stay tuned!
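
For the curious, the cleaning steps described above could look roughly like the following Python sketch. This is not our actual pipeline and not part of D-VITA; it is a minimal illustration under a few assumptions: the PDF text has already been extracted into one string per page, a line that repeats on at least half of the pages is treated as a running head, and only the ASCII letters a-z are kept. The function names and the `min_share` parameter are made up for this example.

```python
import re
from collections import Counter


def strip_repeated_lines(pages, min_share=0.5):
    """Drop bare page numbers and lines that recur on many pages.

    `pages` is a list of strings, one per page of already extracted text.
    A line seen on at least `min_share` of the pages (and on at least two
    pages) is assumed to be a running head or footer.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})

    threshold = max(2, int(len(pages) * min_share))
    kept = []
    for page in pages:
        for line in page.splitlines():
            stripped = line.strip()
            if not stripped or stripped.isdigit():
                continue  # empty line or bare page number
            if counts[stripped] >= threshold:
                continue  # likely running head or footer
            kept.append(stripped)
    return kept


def normalize(lines):
    """Turn text lines into lower-case words separated by single spaces."""
    text = "\n".join(lines)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = text.lower()
    # Assumption: anything that is not a-z (digits, punctuation,
    # non-ASCII letters) is treated as a word separator.
    text = re.sub(r"[^a-z]+", " ", text)
    return " ".join(text.split())


def clean_paper(pages):
    """Run both cleaning steps on the pages of one paper."""
    return normalize(strip_repeated_lines(pages))
```

Called with a list of page strings, `clean_paper` returns a single string of lower-case words. Real papers need more care than this (two-column layouts, reference lists, formulas, author names with non-ASCII characters), which is exactly why the job is so hard to automate across templates.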