Massive amounts of misinformation have been observed to spread uncontrolled across social media, including rumors, hoaxes, fake news, and conspiracy theories. The resulting information cascades contain both accurate and inaccurate information, unfold over multiple time scales, and often reach audiences of considerable size. According to Shao et al.’s “Hoaxy: A Platform for Tracking Online Misinformation”, the sharing of fact-checking content typically lags that of misinformation by 10-20 hours. Moreover, fake news is dominated by very active users, while fact checking is a more grassroots activity. Platforms such as Hoaxy collect data from two main sources: news websites and social media. The first yields data about the origin and evolution of both fake news stories and their fact checking; from the second, the authors collect instances of these news stories (i.e., URLs) being shared online. However, such platforms do not yet track online misinformation automatically. Other authors base their research on identifying the sources promoting fake news, using collaborative analysis to find users who consistently upload and/or promote fake information on social networks, together with credibility classification of both the tweet and the user.
The ambition of the FANDANGO project is to improve the efficiency of the techniques mentioned above. This will be achieved by concentrating diverse data sources in a common system that fuses the information from those sources, as well as the results provided by the individual techniques, so as to overcome the disadvantages of each technique in isolation.
Content-based analysis to detect fake news
There exists a wide range of assessment methods for content-based analysis, falling into two major categories: linguistic and network approaches. In the former category, several types of analysis are identified, such as syntax, semantics and discourse. Analyses of liars’ language suggest that there exist some hard-to-detect language “leakages”: negative-emotion word usage, pronoun patterns, and conjunctions, among others. The main goal of linguistic approaches is to identify such “predictive deception cues”. In Rubin et al.’s “Deception Detection for News: Three Types of Fakes”, syntax analysis is performed by creating advanced knowledge bases, and their integration into personalization models reaches up to 85% accuracy in fake news detection. Moreover, in Rubin et al.’s “Towards News Verification: Deception Detection Methods for News Discourse”, a system using syntax analysis is implemented through Probabilistic Context-Free Grammars (PCFG). Sentences are transformed into a set of rewrite rules (a parse tree) that describes syntactic structure, for example noun and verb phrases, which are in turn rewritten into their syntactic constituent parts. Furthermore, at the discourse level, deception cues present themselves both in computer-mediated communication (CMC) and in news content. A description of discourse can be achieved through the Rhetorical Structure Theory (RST) analytic framework, which identifies instances of rhetorical relations between linguistic elements. Systematic differences between deceptive and truthful messages in terms of their coherence and structure have been combined with a Vector Space Model (VSM) that assesses each message’s position in multidimensional RST space with respect to its distance from truth and deception centers (Rubin and Lukoianova, “Truth and Deception at the Rhetorical Structure Level”). At this level of linguistic analysis, the prominent use of certain rhetorical relations can be indicative of deception.
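To illustrate the cue-based linguistic approach, the following minimal sketch counts a few leakage features in a text. The cue lexicons here are invented for illustration; real systems would use validated resources (e.g. LIWC-style dictionaries) and many more feature classes.

```python
# Sketch of linguistic "leakage" cue extraction.
# The cue lists below are illustrative, not validated lexicons.
import re

FIRST_PERSON = {"i", "me", "my", "mine", "we", "our"}
NEGATIVE_EMOTION = {"hate", "afraid", "terrible", "awful", "angry"}
NEGATIONS = {"not", "no", "never", "n't"}

def cue_features(text: str) -> dict:
    """Return normalised counts of simple deception cues."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = max(len(tokens), 1)
    return {
        "first_person_rate": sum(t in FIRST_PERSON for t in tokens) / n,
        "neg_emotion_rate": sum(t in NEGATIVE_EMOTION for t in tokens) / n,
        "negation_rate": sum(t in NEGATIONS for t in tokens) / n,
    }

feats = cue_features("I never said that awful thing, I swear we are not lying.")
```

Feature vectors of this kind would then feed a downstream classifier rather than being used as detection rules on their own.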
Moreover, network approaches, which use network properties and behavior, complement content-based approaches that rely on deceptive language and leakage cues to predict deception. As real-time content on current events increasingly proliferates through micro-blogging applications such as Twitter, deception analysis tools are all the more important. The use of knowledge graphs supports a significant step towards scalable computational fact-checking methods. Queries based on extracted fact statements are assigned a semantic proximity as a function of the transitive relationship between subject and object via other nodes. The closer the nodes, the higher the likelihood that a particular subject-predicate-object statement is true. Several so-called ‘network effect’ variables can be exploited to derive truth probabilities (Ciampaglia et al., “Computational fact checking from knowledge networks”), so the outlook for exploiting structured data repositories for fact-checking remains promising. In the short list of existing published work in this area, results using sample facts from four different subject areas range from 61% to 95%. Success was measured by whether the machine assigned higher truth values to true statements than to false ones.
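The path-based proximity idea can be sketched on a toy knowledge graph. The entities and the degree-penalised score below are illustrative, loosely following the intuition of Ciampaglia et al.: a statement scores higher when subject and object are connected by a short path through specific (low-degree) nodes rather than generic hubs.

```python
# Sketch of path-based semantic proximity on a toy knowledge graph.
import math
from collections import deque

# Illustrative undirected graph as an adjacency dict.
GRAPH = {
    "Barack Obama": {"Honolulu", "United States"},
    "Honolulu": {"Barack Obama", "Hawaii"},
    "Hawaii": {"Honolulu", "United States"},
    "United States": {"Barack Obama", "Hawaii"},
    "Canada": {"Ottawa"},
    "Ottawa": {"Canada"},
}

def shortest_path(graph, src, dst):
    """Breadth-first shortest path; None if the nodes are disconnected."""
    prev, queue, seen = {}, deque([src]), {src}
    while queue:
        node = queue.popleft()
        if node == dst:
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        for nb in graph[node]:
            if nb not in seen:
                seen.add(nb)
                prev[nb] = node
                queue.append(nb)
    return None

def truth_proximity(graph, subject, obj):
    """Short paths through low-degree intermediate nodes score high;
    the log-degree penalty down-weights generic hub nodes."""
    path = shortest_path(graph, subject, obj)
    if path is None:
        return 0.0
    penalty = sum(math.log(len(graph[v])) for v in path[1:-1])
    return 1.0 / (1.0 + penalty)

score_true = truth_proximity(GRAPH, "Barack Obama", "Hawaii")
score_false = truth_proximity(GRAPH, "Barack Obama", "Ottawa")
```

On this toy graph the connected statement scores well above the disconnected one, mirroring how the published method separates true from false sample facts.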
Currently, content-based techniques are still very popular; however, there is also a large set of model-based techniques that are increasingly useful in real systems, thanks to new techniques in parallelization, cloud computing and big data frameworks. These techniques are especially important in scenarios that deal with a very specific domain or with a set of users with specific features. Some of the most representative model-based collaborative filtering (CF) techniques are Bayesian belief net CF, clustering CF, MDP-based CF, latent semantic CF, sparse factor analysis, and CF using dimensionality reduction techniques such as Singular Value Decomposition (SVD) or Principal Component Analysis (PCA).
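As a minimal sketch of the SVD-based variant, the following fills in an unobserved rating by replacing a toy user-by-item rating matrix with its rank-k approximation. The matrix values are invented for illustration; production systems factorise only the observed entries and at much larger scale.

```python
# Sketch of SVD-based collaborative filtering on a toy rating matrix.
import numpy as np

# Rows: users, columns: items. 0.0 marks an unobserved rating.
R = np.array([
    [5.0, 4.0, 1.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Truncated SVD: keep k latent factors for a low-rank estimate.
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Predicted rating for user 0 on item 3 (unobserved above).
pred = R_hat[0, 3]
```

The low-rank estimate smooths the block structure of the matrix, so the predicted rating for user 0 on item 3 reflects that similar users rated that item low.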
The ambition of FANDANGO is to develop a set of content-based algorithms for rating news with respect to fake news detection. Comparing user/profile vectors against item feature vectors through similarity measures will allow predicting whether a news item is deceptive or not. Additionally, content-based techniques can be significantly improved by combining their prediction results with the extraction of additional features, such as demographics, to create ratings on the data/user sources.
Text, including Natural Language Processing in a multi-lingual environment
Modern machine learning for natural language processing can do things like translate from one language to another because everything it needs to know is in the sentence it is processing. By contrast, identifying claims, tracing information through potentially hundreds of sources, and judging how truthful a claim could be based on a diversity of ideas all rely on a holistic understanding of the world: the ability to bridge concepts that are not connected by exact words or semantic meaning.
For now, AIs that can simply succeed at question-and-answer games are considered state of the art. As recently as 2014, it was bleeding edge when Facebook’s AI could read a short passage about the plot of the Lord of the Rings, and tell if Frodo had the Ring or not.
The Stanford Question Answering Dataset (SQuAD) is a benchmarking competition that measures how good AIs are at this sort of task. But parsing a few paragraphs of text for factuality is nowhere near the complex fact-checking machines AI designers are after. It is incredibly hard to know the whole state of the world in order to identify whether a fact is true or not. Even if there were a perfect way to encompass and encode all the knowledge of the world, the whole point of news is that we are adding to that knowledge.
The novelty of news stories means the information needed to verify something newly published as fact might not be available online yet. A small but credible source could publish something true that the AI marks as false simply because there is no other corroboration on the internet—even if that AI is powerful enough to constantly read and understand all the information ever published.
Some of the technologies for automated fact checking already exist in some form. ClaimBuster uses natural language processing (NLP) techniques to try to identify factual claims within a text. It will not automatically fact-check them, but it can assist journalists by pointing them to the most “checkable” statements.
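The claim-spotting step can be sketched with a naive check-worthiness score. The cue lists and weights below are invented for illustration; ClaimBuster itself uses a trained classifier over much richer NLP features.

```python
# Naive sketch of "check-worthiness" scoring (illustrative heuristics only).
import re

# Declarative verbs that often introduce a factual claim (illustrative).
CLAIM_VERBS = {"is", "are", "was", "were", "has", "have",
               "increased", "decreased", "rose", "fell"}

def check_worthiness(sentence: str) -> float:
    """Score in [0, 1]: higher means more likely worth fact-checking."""
    tokens = re.findall(r"[A-Za-z0-9%.]+", sentence)
    score = 0.0
    if any(re.search(r"\d", t) for t in tokens):
        score += 0.5                      # numbers suggest a verifiable fact
    if any(t.lower() in CLAIM_VERBS for t in tokens):
        score += 0.3                      # declarative claim verb present
    if sentence.rstrip().endswith("?"):
        score -= 0.5                      # questions are rarely claims
    return max(0.0, min(1.0, score))

factual = check_worthiness("Unemployment fell to 3.9% in 2018.")
opinion = check_worthiness("What a wonderful day it is?")
```

Ranking sentences by such a score lets a journalist focus first on the statements most likely to be checkable.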
We also have knowledge bases that provide structured data to query statements against. Wikidata, a Wikimedia Foundation project, provides such data free of charge to anyone who wants to use it.
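As a sketch of how a statement could be checked against such a knowledge base, the helper below builds a SPARQL ASK query for Wikidata’s public query service. The entity and property identifiers shown (Q64 Berlin, P1376 “capital of”, Q183 Germany) are real Wikidata IDs, but the helper itself is illustrative and performs no network access; a real pipeline would submit the query to the endpoint and inspect the boolean result.

```python
# Sketch: build a SPARQL ASK query for the Wikidata Query Service.
# No network access is performed here; this only constructs the query text.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def ask_statement(subject_qid: str, property_pid: str, object_qid: str) -> str:
    """Build an ASK query that is true iff the triple exists in Wikidata."""
    return "ASK { wd:%s wdt:%s wd:%s . }" % (
        subject_qid, property_pid, object_qid)

# "Berlin (Q64) -- capital of (P1376) -- Germany (Q183)"
query = ask_statement("Q64", "P1376", "Q183")
```

Submitting this query (e.g. via an HTTP GET with `format=json`) would return a boolean that can serve as one corroboration signal among several.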
FANDANGO’s goal is to explore how artificial intelligence technologies, particularly machine learning and natural language processing in a multilingual setting, might be leveraged to combat the fake news problem across all EU countries. We believe these NLP technologies hold promise for significantly automating parts of the procedure human fact checkers use today to determine whether a story is real or a hoax.
Assessing the veracity of a news story is a complex and cumbersome task. Fortunately, the process can be broken down into steps or stages. A helpful first step towards identifying fake news is to understand what other news organizations across Europe are saying about the topic. This includes a deep NLP architecture that can match the semantic analysis of different languages. We believe automating this process could serve as a useful building block in an AI-assisted fact-checking pipeline.
Image analysis based algorithms
Image analysis, in the context of FANDANGO, refers mainly to techniques for image forgery detection and semantic analysis. While the former is an important task to detect whether an image has been artificially manipulated, the latter will help to correlate the content of an image with its context.
Many methods have been developed for image forgery detection thus far, the most common attack being the copy-move one (see Fridrich et al., “Detection of copy-move forgery in digital images”), where the attacker adds to or removes from the image an object of interest. To detect such attacks, algorithms have been developed (see Zhili et al., Amerini et al., Jian et al. or Cozzolino et al.) that look for the slight changes such operations produce in an image.
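The core intuition of copy-move detection can be sketched with naive block matching: a duplicated region leaves identical pixel blocks at two locations in the same image. This is a deliberately simplified sketch; the cited detectors match on robust features (DCT coefficients, Zernike moments, keypoints) so that recompression and post-processing do not break the match.

```python
# Sketch of block-matching copy-move detection on a toy grayscale image.
# Real detectors use robust block features, not exact pixel equality.
import numpy as np
from collections import defaultdict

def find_duplicate_blocks(img: np.ndarray, block: int = 4):
    """Return groups of top-left coordinates of identical pixel blocks."""
    h, w = img.shape
    seen = defaultdict(list)
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            key = img[y:y + block, x:x + block].tobytes()
            seen[key].append((y, x))
    return [locs for locs in seen.values() if len(locs) > 1]

# Toy image with one region copied to another location.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(16, 16), dtype=np.uint8)
img[8:12, 8:12] = img[0:4, 0:4]          # simulate a copy-move forgery
dupes = find_duplicate_blocks(img)
```

The duplicated block pair is recovered; in practice overlapping blocks, feature quantisation and geometric consistency checks are needed to handle realistic forgeries.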
On the other hand, semantic analysis tries to develop algorithms able to extract high-level semantic information from an image, such as identifying the people within it. In the context of FANDANGO, it is important to verify which person or people appear in an image in order to correlate them with the given context. To do so, techniques such as face verification will be used. Face verification aims to identify a person from his or her facial characteristics, mapping a facial image to a specific person. Thus far, a multitude of algorithms and tools have been developed to cope with this problem (see Taigman et al., Sun et al., Hu et al., Chen et al. or Goswami et al.), with a wide range of methods dealing mainly with the vast variations a person’s face may exhibit under different illumination conditions and poses.
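Most of the cited systems reduce verification to a distance test between learned face embeddings. The sketch below shows only that final step; the embedding model itself (a DeepFace/FaceNet-style CNN) is assumed, and the vectors and threshold here are mock values for illustration.

```python
# Sketch of the verification decision given face embeddings.
# The embeddings below are mock vectors, not real CNN outputs.
import numpy as np

def same_person(emb_a: np.ndarray, emb_b: np.ndarray,
                threshold: float = 0.6) -> bool:
    """Declare a match when the cosine distance between the
    L2-normalised embeddings falls below the threshold."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    cosine_distance = 1.0 - float(a @ b)
    return cosine_distance < threshold

# Mock embeddings: two near-identical vectors vs. an unrelated one.
anchor = np.array([0.2, 0.9, 0.4])
same = np.array([0.22, 0.88, 0.41])
other = np.array([0.9, -0.1, 0.3])
```

The threshold is tuned on a labelled pair set to trade off false accepts against false rejects; “in the wild” conditions mainly stress the embedding model, not this decision rule.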
FANDANGO will go beyond the state of the art in both domains by developing tools able to tackle the above-mentioned problems. For copy-move detection, FANDANGO will develop novel deep learning algorithms able to detect such attacks even when they are performed with sophisticated image processing tools. The idea is to train a convolutional neural network (CNN) to detect differentiations on the edges of objects of interest within the same image. By correlating the edges of different objects in the same image, the artificially created ones will be detected. Regarding face verification, FANDANGO will take a leap forward by developing techniques for face verification “in the wild”. To do so, deep learning techniques that take the context into account will be developed in order to achieve better face verification accuracy.
Video analysis based algorithms
As with image analysis, video analysis in the context of FANDANGO will be used to:
- detect forgery in videos, and
- to semantically analyse the video so as to correlate it to its context.
Forgery detection in videos is a long-standing and still active research area. Besides methods already used in the context of image-based copy-move detection, many video methods use motion as a feature to detect forgery (Hsu et al., Subramanyam et al., Zhang et al. or Su et al.). The idea in these methods is to detect ghosts of the missing information (in the case of subtractions) and thereby decide whether a video has been forged. Another category of algorithms for forgery detection in videos models the noise and then detects key frames where the noise is altered, which is a telltale sign of forgery (Ravi et al., Wahab et al.). Such methods have proven to work well for CCTV footage, but they fail to do so in common videos where the context is much more complicated (such as news videos).
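The noise-modelling idea can be sketched as follows: estimate a crude per-frame noise residual and flag frames whose noise level deviates from the video’s baseline. The residual estimator and z-score rule below are illustrative simplifications; the cited methods model sensor noise far more carefully.

```python
# Sketch of noise-based key-frame anomaly detection in a toy "video".
import numpy as np

def noise_level(frame: np.ndarray) -> float:
    """Crude noise estimate: mean absolute Laplacian-like residual."""
    f = frame.astype(float)
    residual = f[1:-1, 1:-1] - 0.25 * (f[:-2, 1:-1] + f[2:, 1:-1]
                                       + f[1:-1, :-2] + f[1:-1, 2:])
    return float(np.abs(residual).mean())

def flag_anomalous_frames(frames, z_thresh: float = 2.5):
    """Flag frames whose noise level is a z-score outlier."""
    levels = np.array([noise_level(f) for f in frames])
    mu, sigma = levels.mean(), levels.std() + 1e-9
    return [i for i, lv in enumerate(levels)
            if abs(lv - mu) / sigma > z_thresh]

# Toy video: smooth frames plus one frame with injected noise.
rng = np.random.default_rng(1)
frames = [np.full((32, 32), 128.0) + rng.normal(0, 1, (32, 32))
          for _ in range(10)]
frames[6] += rng.normal(0, 25, (32, 32))   # simulate a tampered frame
suspects = flag_anomalous_frames(frames)
```

The tampered frame stands out because its residual noise level breaks the otherwise stable baseline; complex natural content (as in news footage) erodes exactly this stability, which is why such methods degrade outside CCTV settings.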
Finally, video semantic analysis refers to techniques able to extract high-level semantic entities from a video. Also known as video summarisation (Lee et al., Mundur et al.), these techniques can provide high-level semantic labelling of videos that can then be used to correlate a video with its context. Moreover, the use of metadata in video summarisation has been investigated with great success, mainly in web videos where metadata is abundant due to user interactions (Wang et al.).
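A common first step in summarisation pipelines is selecting candidate key frames at abrupt visual changes. The sketch below does this by colour-histogram difference between consecutive frames on a toy two-shot “video”; the bin count and threshold are illustrative choices.

```python
# Sketch of key-frame selection by histogram difference between frames.
import numpy as np

def histogram(frame: np.ndarray, bins: int = 16) -> np.ndarray:
    """Normalised intensity histogram of one grayscale frame."""
    h, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return h / h.sum()

def key_frames(frames, threshold: float = 0.5):
    """Indices where the histogram changes sharply (candidate key frames)."""
    keys = [0]                                  # always keep the first frame
    for i in range(1, len(frames)):
        diff = np.abs(histogram(frames[i]) - histogram(frames[i - 1])).sum()
        if diff > threshold:
            keys.append(i)
    return keys

# Toy video: a dark shot followed by a bright shot.
dark = [np.full((8, 8), 40, dtype=np.uint8) for _ in range(5)]
bright = [np.full((8, 8), 220, dtype=np.uint8) for _ in range(5)]
keys = key_frames(dark + bright)
```

The selected key frames would then be passed to higher-level labelling (objects, faces, scenes) to build the semantic summary that is correlated with the video’s context.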
FANDANGO will work intensively in both areas to cope with these issues. The main idea in forgery detection is to develop a recurrent neural network (RNN) able to correlate features along the temporal dimension so as to detect forged videos. By doing so, we implicitly integrate both image-based techniques and video-based ones. Regarding video summarisation, a CNN will be developed to extract high-level semantics from videos within their context. To do so, we will train the network in a context-dependent manner through transfer learning techniques that will enhance its summarisation capabilities.