Comparison of the Accuracies of Methods Based on Language and Graph Neural Network Models for Determining Author Profile Features from Russian Texts
https://doi.org/10.56304/S2304487X21060109
Abstract
This study analyzes the effectiveness of author profiling methods: determining the gender and age group of a text's author, and detecting deliberate distortion of the text's features with respect to gender, age, or style. The corpus used in this work combines several crowdsourced datasets and is the most representative corpus of Russian-language texts annotated for this task. In addition to classification algorithms based on support vector machines and deep language models, the study explores graph neural networks in which a text is represented by a set of morphosyntactic features. The results make it possible to estimate the current level of accuracy in determining author profile features. The resulting accuracy figures and the collected dataset can serve as a benchmark for testing new machine learning methods.
About the Authors
A. G. Sboev
Russian Federation
123182
115409
Moscow
R. B. Rybka
Russian Federation
123182
Moscow
I. A. Moloshnikov
Russian Federation
123182
Moscow
A. V. Naumov
Russian Federation
123182
Moscow
A. A. Selivanov
Russian Federation
123182
Moscow
For citations:
Sboev A.G., Rybka R.B., Moloshnikov I.A., Naumov A.V., Selivanov A.A. Comparison of the Accuracies of Methods Based on Language and Graph Neural Network Models for Determining Author Profile Features from Russian Texts. Vestnik natsional'nogo issledovatel'skogo yadernogo universiteta "MIFI". 2021;10(6):529-539. (In Russ.) https://doi.org/10.56304/S2304487X21060109