Comparison of the Accuracies of Methods Based on Language and Graph Neural Network Models for Determining Author Profile Features from Russian Texts

A. G. Sboev; R. B. Rybka; I. A. Moloshnikov; A. V. Naumov; A. A. Selivanov

doi:10.56304/S2304487X21060109


Abstract

This paper analyzes the effectiveness of author profiling methods: determining the gender and age group of a text's author, and detecting whether the text shows signs of deliberate distortion of gender, age, or style. The study uses a corpus collected through crowdsourcing in the course of this research, currently the most representative Russian-language corpus annotated for detecting imitation of author profile characteristics. In addition to classification algorithms based on support vector machines and on deep neural networks trained as language models, we investigate graph neural networks in which a text is represented by a set of morpho-syntactic features. As a result, we estimate the current level of accuracy on tasks of determining author profile features; together with the dataset on which these results were obtained, this estimate may serve as a baseline for evaluating new machine learning methods.
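To make the idea of representing a text as a graph of morpho-syntactic features more concrete, here is a minimal sketch. It is an illustration only, not the model evaluated in the paper: the parse is written by hand rather than produced by a parser, the tag inventory `POS_TAGS` is a toy assumption, and the propagation step has no trainable weights or nonlinearity, which a real graph neural network would have. All function names are hypothetical.

```python
# Hypothetical hand-written parse of "Мама мыла раму" (not produced by a real parser).
# Each token: (word form, part-of-speech tag, index of syntactic head; -1 marks the root).
TOKENS = [
    ("Мама", "NOUN", 1),
    ("мыла", "VERB", -1),
    ("раму", "NOUN", 1),
]

POS_TAGS = ["NOUN", "VERB"]  # toy tag inventory, an assumption for this sketch


def one_hot(tag):
    """Encode a part-of-speech tag as a one-hot feature vector."""
    return [1.0 if tag == t else 0.0 for t in POS_TAGS]


def build_adjacency(tokens):
    """Undirected dependency edges plus self-loops, stored as neighbor sets."""
    neighbors = {i: {i} for i in range(len(tokens))}
    for i, (_, _, head) in enumerate(tokens):
        if head >= 0:
            neighbors[i].add(head)
            neighbors[head].add(i)
    return neighbors


def propagate(features, neighbors):
    """One message-passing step: each node takes the mean over its neighborhood."""
    out = []
    for i in range(len(features)):
        nbrs = sorted(neighbors[i])
        dim = len(features[i])
        out.append([sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)])
    return out


def readout(features):
    """Mean pooling over nodes yields one fixed-size vector per text."""
    n = len(features)
    return [sum(f[d] for f in features) / n for d in range(len(features[0]))]


node_features = [one_hot(pos) for _, pos, _ in TOKENS]
graph = build_adjacency(TOKENS)
hidden = propagate(node_features, graph)
text_vector = readout(hidden)
print(text_vector)
```

The resulting `text_vector` could then be passed to any classifier (for example, one predicting author gender or age group); in a trained graph network, each propagation step would additionally apply a learned linear map and an activation function.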

Keywords

author profiling, text distortion, text analysis, machine learning, neural networks, graph neural networks

About the authors

A. G. Sboev
National Research Center “Kurchatov Institute”, 123182 Moscow, Russia; National Research Nuclear University MEPhI, 115409 Moscow, Russia

R. B. Rybka
National Research Center “Kurchatov Institute”, 123182 Moscow, Russia

I. A. Moloshnikov
National Research Center “Kurchatov Institute”, 123182 Moscow, Russia

A. V. Naumov
National Research Center “Kurchatov Institute”, 123182 Moscow, Russia

A. A. Selivanov
National Research Center “Kurchatov Institute”, 123182 Moscow, Russia

For citation:

Sboev A.G., Rybka R.B., Moloshnikov I.A., Naumov A.V., Selivanov A.A. Comparison of the Accuracies of Methods Based on Language and Graph Neural Network Models for Determining Author Profile Features from Russian Texts. Vestnik natsional'nogo issledovatel'skogo yadernogo universiteta "MIFI". 2021;10(6):529-539. (In Russ.) https://doi.org/10.56304/S2304487X21060109

Content is available under the Creative Commons Attribution 4.0 License.

ISSN 2304-487X (Print)
