Topic modeling of newspapers
As the Covid-19 pandemic affects so many lives, it has become imperative for governments, individuals and businesses to track how it is affecting society: they need to foresee the consequences of the disease and of lockdown measures, and adapt accordingly. Tracking the discussions that take place in the media and on news websites is one way to monitor the current situation, detect the socio-economic effects of Covid-19, and see whether different regions are returning to normal life or moving toward a new normal.
This work shows how news content changed over the period in which the outbreak erupted in the UK, and how coverage evolved as Covid-19 cases rose and the UK government imposed restrictions on society. The analysis helps us understand the tone and focus of the media during a given period. It also demonstrates how natural language processing can be used to digest information from different sources and discern trends.
To better understand how news has evolved with the outbreak of Covid-19, we defined society-related indicators, such as the number of publications per topic and their sentiment, and monitored them over time. For instance, we looked into the diversity of topics discussed in the news and how their distribution and sentiment changed over time. Our analysis uses two NLP techniques, topic modelling and sentiment analysis, following the NLP workflow shown below.
In the NLP workflow, from left to right, we first ingest data obtained from Socialgist containing UK news articles. We also obtained indicators from Oxford University's Government Response Tracker, such as containment and closure policies, economic policies and income support, and health system policies. Finally, we added the confirmed Covid-19 cases in the UK, so that we could analyse the trend of topics alongside the trends of government measures and the spread of the disease.
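As a sketch of this data-joining step, the snippet below aligns daily article counts with government-response indicators and case counts by date. The column names and values here are purely illustrative stand-ins, not the actual Socialgist or OxCGRT schema:

```python
import pandas as pd

# Hypothetical daily article counts (stand-in for the Socialgist news feed)
articles = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"]),
    "n_articles": [120, 135, 150],
})

# Stand-in for OxCGRT indicators and confirmed UK case counts
indicators = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"]),
    "stringency_index": [11.1, 11.1, 16.7],
    "confirmed_cases": [36, 40, 51],
})

# Align news volume with government measures and the spread of the disease
merged = articles.merge(indicators, on="date", how="left")
print(merged.shape)  # (3, 4)
```

Joining on the date gives one row per day, so topic trends, policy stringency and case counts can be plotted on a shared timeline.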
The result is a clean dataset ready for topic modelling and sentiment analysis. Finally, we created plots to visualize the evolution of news content over time.
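The cleaning step can be sketched as below: lowercase the text, keep only alphabetic tokens, and drop stopwords and very short words. The stopword list and function name are illustrative; a real pipeline would use a full stopword list, e.g. from NLTK or Gensim:

```python
import re

# Illustrative stopword list (a real pipeline would use a much larger one)
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "has"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stopwords and short words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

doc = "The UK government has imposed new lockdown measures in March."
print(preprocess(doc))
# ['government', 'imposed', 'new', 'lockdown', 'measures', 'march']
```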
For this analysis, we focused on the following UK-based news providers, covering the period from 1 January to the end of May 2020:
The first two providers target the general public and cover a wide variety of topics. We also analysed financial news to give particular attention to financial topics, since businesses have been widely affected by Covid-19.
In this blog post, we share our topic modelling results. You can continue with our sentiment analysis work in "Sentiment analysis of newspapers". Additionally, all the code for this analysis is publicly available on GitHub.
Results of Topic Modeling
Topic modelling is an unsupervised learning method in natural language processing in which a collection of texts is clustered into topics. One of the best-known algorithms is LDA (latent Dirichlet allocation), which discovers abstract topics by estimating the distribution of words per topic and the distribution of topics per document. We used the Gensim library to discover topics and pyLDAvis to visualize them.
This is an example of an LDA plot (produced with pyLDAvis) in which ten topics are clustered. On the left, the clusters are shown, with the size of each circle indicating the marginal topic distribution. On the right, the most important words of a selected topic are shown, with their estimated frequency within that topic (red bars) plotted against their overall frequency in the entire corpus (blue bars).