Preprint / Version 1

A Minimal Approach to Fake News Detection

Authors

  • Daniel Markusson, Crestwood Preparatory College

DOI:

https://doi.org/10.58445/rars.1454

Keywords:

fake news, xgboost, machine learning, journalism

Abstract

The need for efficient categorization of fake and real media grows as the ubiquity of generative AI and motivated bad actors make producing fake news ever easier. Researchers have estimated that in 2021, $2.6 billion in ad revenue flowed to misinformation-publishing sites (Skibinski, 2021), giving those bad actors ample financial motivation to fabricate stories. This paper develops an effective machine learning solution that lets readers classify the articles they want to read as fake or real, so that they can consume only accurate news. Because users tend to prefer simple solutions, we provide a parsimonious model of only five features that nonetheless achieves 71% testing accuracy. Among the most effective predictors of a real article is “perceived effort”, reflected in an article’s length, number of authors, and readability.
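To make the approach concrete, the sketch below trains an XGBoost classifier on three of the “perceived effort” signals named above (article length, number of authors, and a readability score). It is a minimal illustration only: the feature extraction, the readability proxy, the article dictionary layout, the train/test split, and the hyperparameters are assumptions made for this example, not the paper’s actual pipeline.

# Minimal sketch of a parsimonious fake-news classifier in the spirit of the
# abstract. All feature choices and hyperparameters are illustrative
# assumptions, not the paper's configuration.
import re
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def readability_proxy(text):
    # Crude readability stand-in (average words per sentence); the references
    # point to Dale-Chall, which would replace this proxy in practice.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(text.split()) / max(len(sentences), 1)

def to_features(article):
    # article is assumed to be {"text": str, "authors": list of str}.
    words = article["text"].split()
    return [len(words), len(article["authors"]), readability_proxy(article["text"])]

def train_and_evaluate(articles, labels):
    # labels: 1 = real, 0 = fake (assumed encoding).
    X = np.array([to_features(a) for a in articles])
    y = np.array(labels)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = xgb.XGBClassifier(n_estimators=100, max_depth=3, eval_metric="logloss")
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

A gradient-boosted tree ensemble suits this setting because it handles a handful of heterogeneous numeric features with little preprocessing, which keeps the model as simple as the feature set.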

References

Allcott, H., & Gentzkow, M. (2017). Social Media and Fake News in the 2016 Election. Journal of Economic Perspectives, 31(2), 211–236. https://doi.org/10.1257/jep.31.2.211

Azzimonti, M., & Fernandes, M. (2023). Social media networks, fake news, and polarization. European Journal of Political Economy, 76, 102256. https://doi.org/10.1016/j.ejpoleco.2022.102256

Banic, V. & Smith, A. (2016). Fake News: How a Partying Macedonian Teen Earns Thousands Publishing Lies. NBC News. Retrieved from https://www.nbcnews.com/news/world/fake-news-how-partying-macedonian-teen-earns-thousands-publishing-lies-n692451

Burgess, J. (2022). The ‘digital town square’? What does it mean when billionaires own the online spaces where we gather? The Conversation. https://theconversation.com/the-digital-town-square-what-does-it-mean-when-billionaires-own-the-online-spaces-where-we-gather-18204

Butcher, S. (2024). 2024 may be the year online disinformation finally gets the better of us. Politico. Retrieved from https://www.politico.eu/article/eu-elections-online-disinformation-politics/

Chall, J. S., & Dale, E. (1995). Readability revisited. Brookline Books.

Conradi, P. (2023). Was Slovakia election the first swung by deepfakes? The Times. Retrieved from https://www.thetimes.com/world/russia-ukraine-war/article/was-slovakia-election-the-first-swung-by-deepfakes-7t8dbfl9b

Dale, E., & Chall, J. S. (1948). A formula for predicting readability. Educational Research Bulletin, 27, 11–20, 28.

David, A. (2024, June 18). Misinformation might sway elections — but not in the way that you think. Nature. https://www.nature.com/articles/d41586-024-01696-z

Dawber, A. & Tomlinson H. (2023). Deepfakes of Donald Trump ‘arrest’ spread on social media. The Times. Retrieved from https://www.thetimes.com/business-money/technology/article/donald-trump-deepfakes-ai-twitter-g50n7vnbm

DeVoe, K. M. (2009). Bursts of Information: Microblogging. The Reference Librarian, 50(2), 212–214. https://doi.org/10.1080/02763870902762086

Editors at Sky News. (2023, October 9). Deepfake audio of Sir Keir Starmer released on first day of Labour conference. Sky News. https://news.sky.com/story/labour-faces-political-attack-after-deepfake-audio-is-posted-of-sir-keir-starmer-12980181

Gao, Y., Liu, F., & Gao, L. (2023). Echo chamber effects on short video platforms. Scientific Reports, 13(1), 6282. https://doi.org/10.1038/s41598-023-33370-1

Gottfried, J. & Shearer, E. (2017). News Use Across Social Media Platforms 2017. Pew Research Center. Retrieved from https://www.pewresearch.org/journalism/2017/09/07/news-use-across-social-media-platforms-2017/

Hermida, A. (2010). Twittering the news: The emergence of ambient journalism. Journalism Practice, 4(3), 297–308. https://doi.org/10.1080/17512781003640703

Hooi, B., Shah, N., Beutel, A., Gunnemann, S., Akoglu, L., Kumar, M., Makhija, D., & Faloutsos, C. (2015). BIRDNEST: Bayesian Inference for Ratings-Fraud Detection. https://doi.org/10.48550/arXiv.1511.06030

Jindal, N., & Liu, B. (2008). Opinion spam and analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining, 219–230. https://doi.org/10.1145/1341531.1341560

Kaplan, A. & Haenlein, M. (2011). The early bird catches the news: Nine things you should know about micro-blogging. Business Horizons, 54, 105-113. https://doi.org/10.1016/j.bushor.2010.09.004

Kumar, S., West, R., & Leskovec, J. (2016). Disinformation on the Web: Impact, Characteristics, and Detection of Wikipedia Hoaxes. In Proceedings of the 25th International Conference on World Wide Web, 591–602. https://doi.org/10.1145/2872427.2883085

Kumar, S., & Shah, N. (2018). False Information on Web and Social Media: A Survey.

Kumar, S., Hooi, B., Makhija, D., Kumar, M., Faloutsos, C., & Subrahmanian, V. (2018). REV2: Fraudulent User Prediction in Rating Platforms. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (pp. 333–341). Association for Computing Machinery. https://dl.acm.org/doi/10.1145/3159652.315972

Levendusky, M. (2013). Partisan Media Exposure and Attitudes Toward the Opposition. Political Communication, 30(4), 565–581. https://doi.org/10.1080/10584609.2012.737435

Dahl, M., Magesh, V., Suzgun, M., & Ho, D. E. (2024). Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models. https://doi.org/10.48550/arXiv.2401.01301

Pérez-Rosas, V., Kleinberg, B., Lefevre, A., & Mihalcea, R. (2017). Automatic Detection of Fake News. https://doi.org/10.48550/arXiv.1708.07104

Sandulescu, V., & Ester, M. (2015). Detecting Singleton Review Spammers Using Semantic Similarity. In Proceedings of the 24th International Conference on World Wide Web. ACM. https://doi.org/10.48550/arXiv.1609.02727

Shah, N., Beutel, A., Hooi, B., Akoglu, L., Gunnemann, S., Makhija, D., Kumar, M., & Faloutsos, C. (2015). EdgeCentric: Anomaly Detection in Edge-Attributed Networks. https://doi.org/10.48550/arXiv.1510.05544

Shu, K., Mahudeswaran, D., Wang, S., Lee, D., & Liu, H. (2018). FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media. arXiv preprint arXiv:1809.01286. https://doi.org/10.48550/arXiv.1809.01286

Shu, K., Sliva, A., Wang, S., Tang, J., & Liu, H. (2017). Fake News Detection on Social Media: A Data Mining Perspective. ACM SIGKDD Explorations Newsletter, 19(1), 22–36. https://doi.org/10.48550/arXiv.1708.01967

Shu, K., Wang, S., & Liu, H. (2017). Exploiting Tri-Relationship for Fake News Detection. arXiv preprint arXiv:1712.07709. https://doi.org/10.48550/arXiv.1712.07709

Skibinski, M. (2021). Special Report: Top brands are sending $2.6 billion to misinformation websites each year. NewsGuard. Retrieved from https://www.newsguardtech.com/special-reports/brands-send-billions-to-misinformation-websites-newsguard-comscore-report/

Subrahmanian, V., Azaria, A., Durst, S., Kagan, V., Galstyan, A., Lerman, K., Zhu, L., Ferrara, E., Flammini, A., & Menczer, F. (2016). The DARPA Twitter Bot Challenge. Computer, 49(6), 38–46. https://doi.org/10.48550/arXiv.1601.05140

Vasist, P. N., Chatterjee, D., & Krishnan, S. (2023). The Polarizing Impact of Political Disinformation and Hate Speech: A Cross-country Configural Narrative. Information Systems Frontiers. Advance online publication. https://doi.org/10.1007/s10796-023-10390-w

Posted

2024-08-10