Media text mining [Archive]

Media text mining [beta]:

What do Europeans think about the EU? To what extent do national perspectives on controversial topics such as Brexit or immigration differ?

To answer these questions, we have developed text mining software that analyses the headlines of more than 2,000 news articles per day from 25 of the biggest newspapers in the six most influential EU countries: Germany, France, Italy, Spain, Poland and the UK.


First insights:

1. EU elections interest index: Our initial finding is how little press attention the European elections received. In the two weeks before the elections, only 2.4% of headlines in the most influential European newspapers were about the European elections. In April 2019, only around 1% of headlines on average were about the elections.

2. Lead candidate analysis: The EU party group lead candidates are mentioned in only 251 of the 18,978 EU-related headlines analysed since 24 March 2019 (about 1.3%). Some of them are not mentioned in a single headline.

3. Brexit sentiment analysis: UK and continental EU newspapers use remarkably similar emotional terms to describe Brexit. Negative words such as ‘chaos’, ‘reject’ or ‘crisis’ dominate Brexit coverage.

Note: Our software is still in beta. See more details on our method below the dashboard.

Details on our analysis and method:


Why newspapers?

National newspapers (both in print and online) play a crucial role in shaping public opinion. Research shows that newspapers and their websites are the second most important source of information on national and European politics for European citizens, after television (Eurobarometer 2017). We have selected the 3 to 6 biggest national newspapers per country, based on their total circulation, in Germany, France, Italy, Spain, Poland and the UK.

Why these countries?

These six countries are known as ‘the big 6’, because they are the most powerful countries in the EU. They represent 70% of the EU’s population and 73% of the EU’s GDP, and are ranked among the most influential countries in the EU in a survey of 877 EU decision makers (ECFR 2018).

How does our software work?

We use the programming language R to analyse and visualise data. We only analyse headlines and short descriptions of news articles, accessed via RSS feeds. We overcome the language barrier by using cloud-based translation services from Amazon Web Services and Google Cloud Platform. Some headlines are translated into English to enable comparative analysis. Research shows that analysis based on machine-translated text can produce results highly similar to those of analysis based on human-translated text (de Vries et al. 2018).



– We filter our data conservatively in order to avoid false positives. This means, for example, that an article counts as a ‘headline on the EU elections’ only if its title or lead contains one or more filter strings such as ‘European election|EU election|…’. These filter strings may miss certain articles on the European elections, but we prefer a limited number of filter strings to avoid false positives.

– Our software does not analyse the full text of articles, but only their titles and short descriptions (‘leads’). This means, for example, that our count of articles mentioning the lead candidates does not include articles that mention a lead candidate only in their full text. We therefore only count articles in which a lead candidate is prominent enough to be named in the title or short description.

– Depending on the country and its specific newspaper landscape, we analyse different numbers of leading national newspapers: 5 in Germany, 4 in France, 4 in Italy, 3 in Spain, 3 in Poland and 6 in the UK. We are confident that our selection provides a good overview of each country’s newspaper landscape, except for Poland. Polish newspapers provide only limited machine-readable access to their articles, and two important newspapers do not provide access at all. Our data on Poland is therefore limited.

– The analysis started on 24 March 2019.
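
To illustrate the conservative filtering described above, here is a minimal sketch in base R. It is not our production code. The sample headlines, the column names, the helper `is_election_headline()` and the case-insensitive matching are assumptions made for illustration, and the pattern contains only the two filter strings named above, not the full list.

```r
# Hypothetical sample data; real titles and leads come from RSS feeds.
headlines <- data.frame(
  country = c("DE", "DE", "FR", "UK", "UK"),
  title = c(
    "EU election: parties present their programmes",
    "Football: cup final preview",
    "European elections see rising turnout forecasts",
    "Brexit talks stall again",
    "Weather warning for the weekend"
  ),
  lead = c(
    "Campaigns enter the final week.",
    "The two teams meet on Saturday.",
    "Pollsters expect more voters than in 2014.",
    "No agreement ahead of the EU election deadline.",
    "Heavy rain is expected."
  ),
  stringsAsFactors = FALSE
)

# Only the two filter strings quoted in the text; the remaining ones are omitted.
election_pattern <- "European election|EU election"

# An article counts if the pattern occurs in its title OR its lead.
# ignore.case = TRUE is an assumption, not a documented setting.
is_election_headline <- function(title, lead, pattern) {
  grepl(pattern, title, ignore.case = TRUE) |
    grepl(pattern, lead, ignore.case = TRUE)
}

headlines$about_election <- is_election_headline(
  headlines$title, headlines$lead, election_pattern
)

# Share of election headlines overall (in percent) and per country (as a fraction).
share_pct <- round(100 * mean(headlines$about_election), 1)
share_by_country <- tapply(headlines$about_election, headlines$country, mean)

print(share_pct)
print(share_by_country)
```

In this toy sample, three of the five articles match: two through their title and one only through its lead. An article that discusses the elections only in its full text, or with wording outside the filter strings, would not be counted, which is the trade-off described above.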


Note: Our text and data mining analyses are non-commercial and conducted for research purposes only. Our software only produces aggregate data analyses based on headlines and short article descriptions (leads).