The onslaught of fake news (image from firstamendmentwatch.org/countering-fake-news/).
Well, collaborators from the University of Michigan and University of Amsterdam in the Netherlands have gone a long way toward that goal. They demonstrated that the ability to discriminate real from fake news with linguistic-based models was comparable to that of humans.
Association for Computational Linguistics’ definition of computational linguistics (from www.aclweb.org/portal/).
The researchers examined a variety of linguistic features as potential inputs to fake-news detection algorithms.
Two of the elements would be familiar to anyone (i.e., you and me):
- Punctuation, of which 12 types were considered.
- Readability, measured by content features, such as the number of characters, complex words, long words, number of syllables, word types and number of paragraphs, as well as different readability metrics.
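As a rough illustration, stylistic features of this kind can be computed in a few lines of code. The sketch below is not the researchers’ implementation: the punctuation set and the syllable heuristic are stand-ins, and only one readability metric (Flesch Reading Ease) is shown.

```python
import re

# Illustrative set of punctuation marks (the study counted 12 types;
# this particular set is a stand-in, not the authors' list).
PUNCTUATION = list(".,;:!?\"'()-")

def count_syllables(word):
    """Rough syllable estimate: count runs of vowels (a heuristic,
    not a dictionary lookup)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def stylistic_features(text):
    """Punctuation counts plus simple readability statistics."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    syllables = sum(count_syllables(w) for w in words)
    features = {f"punct_{p}": text.count(p) for p in PUNCTUATION}
    features.update({
        "n_chars": len(text),
        "n_words": len(words),
        "n_long_words": sum(len(w) > 6 for w in words),
        "n_complex_words": sum(count_syllables(w) >= 3 for w in words),
        # Flesch Reading Ease, one standard readability metric
        "flesch": 206.835
                  - 1.015 * len(words) / max(1, len(sentences))
                  - 84.6 * syllables / max(1, len(words)),
    })
    return features
```

Each article is thereby reduced to a fixed-length vector of counts and scores that a classifier can consume.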
Other linguistic elements considered would not be as familiar:
- Ngrams, sequences of linguistic items (syllables, letters, words, phonemes, etc.) in a text, where the “n” in ngram signifies the number of items in the sequence (e.g., unigram, n = 1; bigram, n = 2)
- Psycholinguistic features, measured by words that relate linguistic behavior to psychological processes. The researchers extracted the proportions of words in different psycholinguistic categories, guided by the Linguistic Inquiry and Word Count (LIWC), the gold standard for computerized text analysis.
- Syntax features, the sequence in which words or linguistic elements are put together to form meaningful sentences. The researchers used a natural language parser, the Stanford Parser, to extract a set of features derived from rules based on context-free grammars.
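To make the first two of these concrete, here is a minimal sketch of ngram extraction and LIWC-style category proportions. The lexicon passed in is a toy stand-in; the actual LIWC dictionaries are proprietary and far larger.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the sequence of n-grams: tuples of n consecutive items."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def category_proportions(tokens, lexicon):
    """Proportion of tokens falling into each word category,
    LIWC-style. `lexicon` maps category name -> set of words
    (a toy stand-in here, not LIWC itself)."""
    counts = Counter()
    for tok in tokens:
        for category, words in lexicon.items():
            if tok.lower() in words:
                counts[category] += 1
    return {c: counts[c] / len(tokens) for c in lexicon}
```

For example, `ngrams("the cat sat".split(), 2)` yields the bigrams `("the", "cat")` and `("cat", "sat")`, and the category proportions become one number per psycholinguistic category.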
Fake News Data Sets
To test algorithms with the different linguistic elements, they constructed two data sets.
One data set began with 240 verified news articles from mainstream news websites covering six domains (sports, business, entertainment, politics, technology, and education). Crowdsourcing was used to prepare shorter fake versions of the articles for which the writers tried to emulate a journalistic style.
The second data set covered celebrities and was obtained directly from the web as 250 pairs of news articles, one legitimate, the other fake. Claims made in the articles were checked against gossip-checking sites and other online news sources to verify which member of each pair was legitimate.
Excerpts from an example of legitimate and fake celebrity news (from web.eecs.umich.edu/~mihalcea/papers/perezrosas.coling18.pdf).
The researchers tested the fake-news detection capability of the different linguistic features separately and in combination.
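Schematically, testing feature sets separately and in combination amounts to merging per-feature-set dictionaries into one vector and training a linear classifier on the result. The sketch below uses a simple perceptron as a stand-in for the classifier (the paper reports results with linear SVM classifiers); the function names are illustrative, not the authors’ code.

```python
def combine(*feature_dicts):
    """Merge feature dicts from separate extractors (e.g., stylistic,
    ngram, psycholinguistic) into one combined feature vector."""
    merged = {}
    for d in feature_dicts:
        merged.update(d)
    return merged

def train_perceptron(examples, epochs=20):
    """Train a linear classifier on sparse feature dicts.
    `examples` is a list of (feature_dict, label) with label in {-1, +1}."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            score = sum(weights.get(k, 0.0) * v for k, v in feats.items())
            if label * score <= 0:  # misclassified: nudge weights toward label
                for k, v in feats.items():
                    weights[k] = weights.get(k, 0.0) + label * v
    return weights

def predict(weights, feats):
    score = sum(weights.get(k, 0.0) * v for k, v in feats.items())
    return 1 if score > 0 else -1
```

Running the same training loop on each feature set alone, and then on their `combine`d union, is what lets the contribution of each linguistic element be compared.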
Fake-news detection performance by two humans (A1, A2) and the automatic linguistic system (Sys) on the fake news data sets (from web.eecs.umich.edu/~mihalcea/papers/perezrosas.coling18.pdf).
The best performing algorithms with the multidomain data set relied on stylistic features (i.e., punctuation and readability), followed by those that used psycholinguistic features. For the celebrity data set, the best performance was obtained using the LIWC features, followed by the ngrams and syntactic features.
Wrap Up
To improve automatic fake-news detection further, the researchers recommend incorporating meta features (e.g., number of links to and from an article, comments on the article) and features from different modalities (e.g., visual makeup of a website).
At the outset of the study, they opted for a linguistic rather than a fact-checking approach, given that automatic fact-checking against information from other sources is not straightforward, particularly for just-published news. Nevertheless, they recommend improving fact-checking approaches and integrating them with linguistic approaches.
Fake news is clearly a serious problem as evidenced by its probable effect on the last presidential election. Three cheers for any actions toward its demise. Thanks for stopping by.
P.S.
Paper on study presented at 27th International Conference on Computational Linguistics, Santa Fe, N.M., 20-26 Aug 2018: web.eecs.umich.edu/~mihalcea/papers/perezrosas.coling18.pdf
Article on study on ScienceDaily website: www.sciencedaily.com/releases/2018/08/180821112007.htm
27th International Conference on Computational Linguistics: coling2018.org/
Study abstract: arxiv.org/abs/1708.07104
Linguistic Inquiry and Word Count, Version 1.3.1, 2015: liwc.wpengine.com/
Stanford Parser (syntax): nlp.stanford.edu/software/lex-parser.shtml
A version of this blog post appeared earlier on www.warrensnotice.com.