Published on 11 July 2017
The era of "fake news" is upon us. Navigating social media is a constant exercise of judgement, but data science can help distinguish real from fabricated trending topics. In EPJ Data Science, Emilio Ferrara and team set out to determine, from very early on, whether information is being disseminated organically or artificially on social media.
Guest post by Emilio Ferrara, originally published on SpringerOpen blog
Every day, billions of individuals participate in online social media platforms. These digital ecosystems expose their users to tailored information based on individual interests, friendship networks, and the news from the offline world. Each “story”, which in concert with related ones forms a “meme” or information campaign, can emerge organically from grassroots activity or, in some cases, be sustained by advertisement or other coordinated efforts.
Most information campaigns are genuine and benign; however, we recently witnessed the emergence of “bad actors” exploiting social media to alter public opinion, with the intent to deceive, or just create chaos. For example, our research showed that before the 2016 US presidential elections fake news became the vehicle to spread disinformation, attack candidates, and generate confusion online. Similarly, we demonstrated how ISIS and other extremist groups exploited Twitter for terrorist propaganda and recruitment purposes.
It is therefore of paramount importance to be able to detect, at an early stage, memes and information campaigns that are artificially sustained, and to separate them from the organic ones. This problem has important social implications and poses numerous technical challenges, in part due to the scarcity of large-scale annotated datasets with examples of both types of information campaigns.
In EPJ Data Science, we make progress in the direction of discriminating between trending memes that are either organic or promoted by means of advertisement. This classification proves very challenging: ads usually cause bursts of collective attention that can easily be mistaken for those yielded by organic trends. Fortunately, we can rely on Twitter for labeled examples: when a hashtag is promoted by an advertiser, Twitter clearly states so. This feature allowed us to collect a dataset of millions of tweets belonging to promoted information campaigns, as well as millions of tweets belonging to organic trends.
We propose a machine-learning framework and new techniques to classify such memes. Our algorithm exploits hundreds of time-varying features to capture changing network and diffusion patterns, content and sentiment information, timing signals, and user metadata.
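As a rough illustration of what "time-varying features" could look like in practice, the sketch below buckets tweets into fixed-length time windows and computes a few simple per-window statistics. It is not the code used in the paper: the `Tweet` class, its fields, the `windowed_features` function, and the three features chosen are all hypothetical and stand in for the much larger feature set described above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Tweet:
    # Hypothetical minimal record; real tweets carry far more metadata.
    timestamp: float  # seconds since an arbitrary reference time
    user_id: str
    is_retweet: bool


def windowed_features(
    tweets: List[Tweet], start: float, window: float, n_windows: int
) -> List[Dict[str, float]]:
    """Compute one feature row per time window of length `window` seconds."""
    buckets: List[List[Tweet]] = [[] for _ in range(n_windows)]
    for t in tweets:
        idx = int((t.timestamp - start) // window)
        if 0 <= idx < n_windows:  # ignore tweets outside the observed span
            buckets[idx].append(t)

    rows = []
    for bucket in buckets:
        volume = len(bucket)
        retweets = sum(1 for t in bucket if t.is_retweet)
        rows.append({
            "volume": float(volume),
            "unique_users": float(len({t.user_id for t in bucket})),
            "retweet_fraction": retweets / volume if volume else 0.0,
        })
    return rows


if __name__ == "__main__":
    sample = [
        Tweet(0.0, "a", False),
        Tweet(5.0, "a", True),
        Tweet(12.0, "b", True),
    ]
    for row in windowed_features(sample, start=0.0, window=10.0, n_windows=2):
        print(row)
```

Stacking these rows in time order gives a small multivariate time series per hashtag, which is the kind of input a classifier could consume to track how a campaign's behavior changes over time.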
We conceptualize two different prediction problems. The first is the early detection of promoted information campaigns right at trending time, which poses significant challenges because only a minimal volume of activity data is available for prediction before trending. The second is campaign detection after trending, which is easier thanks to the large volume of activity data generated by the many users joining the conversation.
Our framework achieves 75% accuracy for early detection, increasing to above 95% after trending. We evaluate the robustness of the algorithm by introducing several factors, such as random temporal shifts on trend time-series, to reproduce situations that may occur in the real world. We finally explore which features predict promoted campaigns best, finding that content cues provide consistently useful signals; user features are more informative for early detection, while network and timing features are more helpful once more data is available.
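To make the idea of a random temporal shift concrete, here is one minimal way such a perturbation could be applied to a trend time-series. The exact procedure used in the paper is not described in this post, so the `shift_series` function, its zero-padding behavior, and the `max_shift` parameter are assumptions made purely for illustration.

```python
import random
from typing import List, Optional


def shift_series(
    series: List[float], max_shift: int, rng: Optional[random.Random] = None
) -> List[float]:
    """Shift a series by a random offset in [-max_shift, max_shift].

    The output has the same length as the input. Positions that the
    shift leaves empty are filled with zeros (an assumption of this sketch).
    """
    if rng is None:
        rng = random.Random()
    k = rng.randint(-max_shift, max_shift)
    n = len(series)
    if k > 0:
        # Delay the series: pad the front, drop values pushed past the end.
        pad = min(k, n)
        return [0.0] * pad + list(series[: n - pad])
    if k < 0:
        # Advance the series: drop early values, pad the end.
        drop = min(-k, n)
        return list(series[drop:]) + [0.0] * drop
    return list(series)


if __name__ == "__main__":
    counts = [1.0, 4.0, 9.0, 3.0, 2.0]
    rng = random.Random(0)  # seeded only to make the demo repeatable
    for _ in range(3):
        print(shift_series(counts, max_shift=2, rng=rng))
```

A robustness check along these lines would train on the original series and evaluate on shifted copies, to see how much accuracy depends on knowing the exact trending time.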
In the future, we will extend this framework to monitor social media to detect coordinated information efforts such as fake news, conspiracy theories, anti-vaccination campaigns, etc.