Media Text Mining

What could we learn about Europe if we analysed hundreds of thousands of headlines from the biggest newspapers in the most influential EU countries? We have developed text mining software that uses data science techniques to answer this question. The interactive dashboard below shows the first results of our analysis, which we presented at the EuroPCom 2019 conference at the European Committee of the Regions.


Details on our analysis and method:

Our text mining software analyses the headlines of more than 2,000 news articles per day from 25 of the biggest newspapers in the 6 most influential EU countries: Germany, France, Italy, Spain, Poland and the UK.

Why newspapers?

National newspapers (both in print and online) play a crucial role in shaping public opinion. Research shows that newspapers and their websites are the second most important source of information on national and European politics for European citizens, after television (Eurobarometer 2017). We have selected the 3 to 6 biggest national newspapers in each of the six countries, based on their total circulation.

Why these countries?

These six countries are known as ‘the big 6’ because they are the most powerful countries in the EU. Together they account for 70% of the EU’s population and 73% of its GDP, and they rank among the most influential EU countries in a survey of 877 EU decision makers (ECFR 2018).

How does our software work?

We use the programming languages R and Python to analyse and visualise the data. We only analyse the headlines and short descriptions of news articles, which we access via RSS feeds. We overcome the language barrier by using cloud-based translation services from Amazon Web Services and Google Cloud Platform: some headlines are translated into English to enable comparative analysis. Research shows that analysis based on machine-translated text can produce results highly similar to analysis based on human-translated text (de Vries et al. 2018).
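
As an illustration of the RSS step, the following minimal Python sketch extracts the title and lead of each item from an RSS 2.0 feed using only the standard library. It is not our production code: the sample feed, the function name extract_items and the output layout are assumptions made for this example, and a real pipeline would download the feeds and pass non-English headlines to a translation service afterwards.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed used only for illustration; real feeds would be
# downloaded from each newspaper's RSS endpoint.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Newspaper</title>
    <item>
      <title>EU election: parties present their programmes</title>
      <description>A short lead describing the article.</description>
    </item>
    <item>
      <title>Weather forecast for the weekend</title>
    </item>
  </channel>
</rss>"""


def extract_items(rss_xml):
    """Return the title and lead (description) of every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        # A missing or empty element is normalised to an empty string.
        title = (item.findtext("title", default="") or "").strip()
        lead = (item.findtext("description", default="") or "").strip()
        items.append({"title": title, "lead": lead})
    return items


if __name__ == "__main__":
    for entry in extract_items(SAMPLE_RSS):
        print(entry)
```

Only the title and description elements are read, which mirrors the restriction to headlines and short descriptions described above.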



– We filter our data conservatively in order to avoid false positives. For example, we define a ‘headline on the EU elections’ as an article whose title or lead contains one or more filter strings such as ‘European election|EU election|…’. These filter strings may miss some articles on the European elections, but we prefer a limited number of filter strings to avoid false positives.

– Our software does not analyse the full text of articles, only their titles and short descriptions (‘leads’). This means, for example, that our count of articles mentioning the lead candidates excludes articles that mention a lead candidate only in their full text. We therefore only count articles in which the lead candidate is truly the subject of the article.

– Depending on the country and its newspaper landscape, we analyse different numbers of leading national newspapers: 5 in Germany, 4 in France, 4 in Italy, 3 in Spain, 3 in Poland and 6 in the UK. We are confident that our selection gives a good overview of each country’s newspaper landscape, except for Poland. Polish newspapers provide only limited machine-readable access to their articles, and two important newspapers provide no access at all. Our data on Poland is therefore limited.

– The analysis started on 24 March 2019.
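
The conservative filtering described in the points above can be sketched in Python as follows. This is an illustration under stated assumptions, not our actual implementation: the pattern contains only the two filter strings quoted above (the full list is not reproduced here), case-insensitive matching is an assumption, and the function names and sample articles are hypothetical.

```python
import re
from collections import Counter

# Only the two filter strings quoted in the text; the real list is longer.
# Case-insensitive matching is an assumption made for this sketch.
EU_ELECTION_PATTERN = re.compile(r"European election|EU election", re.IGNORECASE)


def is_eu_election_headline(title, lead=""):
    """True if the title or the lead contains at least one filter string."""
    return bool(
        EU_ELECTION_PATTERN.search(title) or EU_ELECTION_PATTERN.search(lead)
    )


def count_by_country(articles):
    """Count matching articles per country, looking at title and lead only."""
    counts = Counter()
    for article in articles:
        if is_eu_election_headline(article["title"], article.get("lead", "")):
            counts[article["country"]] += 1
    return counts


# Hypothetical sample data for illustration.
ARTICLES = [
    {"country": "Germany", "title": "EU election campaign begins", "lead": ""},
    {"country": "France", "title": "Budget debate continues",
     "lead": "Ahead of the European elections, ministers disagree."},
    {"country": "Spain", "title": "Football results", "lead": "Weekend round-up"},
    # On topic, but contains no filter string, so it is deliberately missed.
    {"country": "Germany", "title": "Europe votes in May", "lead": ""},
]

if __name__ == "__main__":
    print(count_by_country(ARTICLES))
```

The last sample article shows the trade-off: it is about the elections but contains none of the filter strings, so it is not counted. A short, strict list misses some relevant articles in exchange for fewer false positives.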


Note: Our text and data mining analyses are conducted in a non-commercial way for research purposes only. Our software only produces aggregate data analyses based on headlines and short article descriptions/leads.