25 March 2019

Fake News Detection

The onslaught of fake news (from firstamendmentwatch.org/countering-fake-news/).

Welcome back. So, tell me. What do you think about fake news? I mean news that’s purposely false or misleading, not news that Mr. Trump doesn’t like. Wouldn’t it be nice if fake news could be detected and removed automatically before it gets out or, at least, before it’s spread?

Well, collaborators from the University of Michigan and the University of Amsterdam in the Netherlands have gone a long way toward that goal. They demonstrated that linguistic-based models can discriminate real from fake news about as well as humans can.


Association for Computational Linguistics’ definition of computational linguistics (from www.aclweb.org/portal/).

Linguistic Approach
The researchers examined a variety of linguistic elements as features for their fake-news detection models.


Two of the elements would be familiar to anyone (i.e., you and me):

- Punctuation, of which 12 types were considered.

- Readability, measured by content features, such as the number of characters, complex words, long words, number of syllables, word types and number of paragraphs, as well as different readability metrics.
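To make those two familiar elements concrete, here is a minimal Python sketch of how such counts might be pulled from an article. It is not the researchers' code. The punctuation set, the seven-letter cutoff for a "long word" and the function name are all assumptions made for illustration, since the post does not list the 12 punctuation types or the exact readability metrics.

```python
import re

# Assumption: illustrative punctuation marks only, not the 12 types used in the study.
PUNCTUATION = set(".,;:!?\"'()-")

# Assumption: a "long word" here means seven or more letters.
LONG_WORD_LEN = 7


def stylistic_features(text: str) -> dict:
    """Count a few punctuation and readability-style features in a text."""
    words = re.findall(r"[A-Za-z']+", text)
    # Paragraphs are taken to be blocks separated by a blank line.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return {
        "punctuation": sum(ch in PUNCTUATION for ch in text),
        "characters": len(text),
        "words": len(words),
        "long_words": sum(len(w) >= LONG_WORD_LEN for w in words),
        "word_types": len({w.lower() for w in words}),
        "paragraphs": len(paragraphs),
    }


sample = "Breaking news! Is it real?\n\nNobody knows, yet."
features = stylistic_features(sample)
```

Each article becomes a small dictionary of numbers, which is the kind of input a detection model can work with.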

Other linguistic elements considered would not be as familiar:

- Ngrams, contiguous sequences of syllables, letters, words, phonemes or other linguistic items in a text, where the “n” in ngram signifies the number of items in the sequence (e.g., unigram = 1, bigram = 2).

- Psycholinguistic features, measured by words that relate linguistic behavior to psychological processes. The researchers extracted the proportions of words in different psycholinguistic categories, guided by the Linguistic Inquiry and Word Count (LIWC), the gold standard for computerized text analysis.

- Syntax features, the sequence in which words or linguistic elements are put together to form meaningful sentences. The researchers used a natural language parser, the Stanford Parser, to extract a set of features derived from rules based on context-free grammars.
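For the less familiar elements, a rough Python sketch may help. The first function counts word ngrams. The second computes the proportion of words falling in each category of a word list, which is the general idea behind the psycholinguistic features. The tiny lexicon below is made up for illustration and is not LIWC's, and both function names are hypothetical. Syntax features are left out because they require a full parser.

```python
from collections import Counter


def word_ngrams(text: str, n: int) -> Counter:
    """Count contiguous runs of n lowercase, whitespace-separated words."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Assumption: a toy stand-in lexicon. LIWC's actual categories and word lists
# are far larger and are not reproduced here.
TOY_LEXICON = {
    "positive_emotion": {"love", "great", "happy"},
    "certainty": {"always", "never", "definitely"},
}


def category_proportions(text: str, lexicon: dict) -> dict:
    """Return the share of words in the text that belong to each category."""
    tokens = text.lower().split()
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return {
        category: sum(token in words for token in tokens) / total
        for category, words in lexicon.items()
    }


text = "fans always love great news"
bigrams = word_ngrams(text, 2)
proportions = category_proportions(text, TOY_LEXICON)
```

In this example, two of the five words are in the made-up positive-emotion list and one is in the certainty list, so the proportions come out to 0.4 and 0.2.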

Fake News Data Sets
To test algorithms with the different linguistic elements, they constructed two data sets.

One data set began with 240 verified news articles from mainstream news websites covering six domains (sports, business, entertainment, politics, technology, and education). Crowdsourcing was used to prepare shorter fake versions of the articles for which the writers tried to emulate a journalistic style.

The second data set covered celebrities and was obtained directly from the web as 250 pairs of news articles, one legitimate, the other fake. Claims made in the legitimate articles were evaluated on gossip-checking sites and other online news sources.

Excerpts from an example of legitimate and fake celebrity news (from web.eecs.umich.edu/~mihalcea/papers/perezrosas.coling18.pdf).
Detection Testing
The researchers tested the fake-news detection capability of the different linguistic features separately and in combination.

Fake-news detection performance by two humans (A1, A2) and the automatic linguistic system (Sys) on the fake news data sets (from web.eecs.umich.edu/~mihalcea/papers/perezrosas.coling18.pdf).

They achieved the highest accuracy, 74% on the multidomain data set and 76% on the celebrity data set, when all features were included. These results were slightly better on the multidomain data set and slightly worse on the celebrity data set than those obtained by two humans.

The best performing algorithms with the multidomain data set relied on stylistic features (i.e., punctuation and readability), followed by those that used psycholinguistic features. For the celebrity data set, the best performance was obtained using the LIWC features, followed by the ngrams and syntactic features.
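Testing features "separately and in combination" and scoring by accuracy can be sketched in a few lines of Python. This is only the bookkeeping around a classifier, not the classifier itself, which the post does not describe. The function names and the choice to test each feature set alone plus all sets together are assumptions for illustration.

```python
def accuracy(predicted: list, actual: list) -> float:
    """Fraction of articles whose predicted label matches the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def feature_configurations(names: list) -> list:
    """Each feature set on its own, followed by all feature sets together."""
    return [(name,) for name in names] + [tuple(names)]


def combine(features_by_set: dict, chosen: tuple) -> list:
    """Concatenate one article's feature vectors for the chosen feature sets."""
    vector = []
    for name in chosen:
        vector.extend(features_by_set[name])
    return vector


configs = feature_configurations(["punctuation", "readability", "ngrams"])
score = accuracy(["fake", "real", "fake", "real"], ["fake", "real", "real", "real"])
```

Here three of the four made-up predictions match the true labels, so the score is 0.75. A reported accuracy of 74% means the same thing on a larger scale: about three articles in four were labeled correctly.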

Wrap Up
To improve automatic fake-news detection further, the researchers recommend incorporating meta features (e.g., number of links to and from an article, comments on the article) and features from different modalities (e.g., visual makeup of a website).

At the outset of the study, they opted for a linguistic rather than a fact-checking approach, given that automatic fact-checking against information from other sources is not straightforward, particularly for just-published news. Nevertheless, they recommend improving fact-checking approaches and integrating them with linguistic approaches.

Fake news is clearly a serious problem as evidenced by its probable effect on the last presidential election. Three cheers for any actions toward its demise. Thanks for stopping by.

P.S.
Paper on study presented at 27th International Conference on Computational Linguistics, Santa Fe, N.M., 20-26 Aug 2018: web.eecs.umich.edu/~mihalcea/papers/perezrosas.coling18.pdf
Article on study on ScienceDaily website: www.sciencedaily.com/releases/2018/08/180821112007.htm
27th International Conference on Computational Linguistics and study abstract:
coling2018.org/
arxiv.org/abs/1708.07104
Linguistic Inquiry and Word Count, Version 1.3.1, 2015: liwc.wpengine.com/
Stanford Parser (syntax): nlp.stanford.edu/software/lex-parser.shtml

A version of this blog post appeared earlier on www.warrensnotice.com.
