References

vestnikmephi

Вестник НИЯУ МИФИ

Vestnik natsional'nogo issledovatel'skogo yadernogo universiteta "MIFI"

2304-487X

National Research Nuclear University "MEPhI"

10.1134/S2304487X19060130

vestnikmephi-61

Research Article

ПРИКЛАДНАЯ МАТЕМАТИКА И ИНФОРМАТИКА

APPLIED MATHEMATICS AND COMPUTER SCIENCE

Модель нейронной сети для включения синтаксической структуры предложения в задачу классификации пола автора русского текста

Neural Network Model for Classification of Text’s Author Gender with Including Sentence Dependency Structure

Сбоев

А. Г.

Sboev

A. G.

123098

115409

Москва

123098

115409

Moscow

sag111@mail.ru

Селиванов

А. А.

Selivanov

A. A.

123098

Москва

123098

Moscow

Рыбка

Р. Б.

Rybka

R. B.

123098

Москва

123098

Moscow

Молошников

И. А.

Moloshnikov

I. A.

123098

Москва

123098

Moscow

Богачев

Д. С.

Bogachev

D. S.

123098

141701

Москва

123098

141701

Moscow

National Research Center “Kurchatov Institute”; National Research Nuclear University “MEPhI” (Moscow Engineering Physics Institute), Russian Federation

National Research Center “Kurchatov Institute”, Russian Federation

National Research Center “Kurchatov Institute”; Moscow Institute of Physics and Technology (MIPT), Russian Federation

2019

12.02.2023

V. 8. № 6. P. 569–576.

2023

Сбоев А.Г., Селиванов А.А., Рыбка Р.Б., Молошников И.А., Богачев Д.С.

Sboev A.G., Selivanov A.A., Moloshnikov I.A., Rybka R.B., Bogachev D.S.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://vestnikmephi.elpub.ru/jour/article/view/61

This research proposes neural network methods for incorporating the dependency tree structure of a text into classification tasks for Russian texts. The author profiling task of gender identification was chosen to test the models, and two corpora were used in the experiments: one collected by crowdsourcing and one collected by in-person polling. The first approach is based on long short-term memory (LSTM) layers and a graph embedding algorithm developed by the authors. The second is based on a graph convolutional network and LSTM. Two syntactic parsers were used to obtain dependency trees from the texts. Input data were represented in different forms: morphological binary vectors, FastText vectors, and their combination. The results of the developed models were compared with the state of the art, a neural network model based on convolutional and LSTM layers. Finally, we demonstrate that including the dependency tree structure in the input feature space improves the F1-score of gender classification on average by 4% for the RusPersonality dataset and by 7% for the crowdsourced dataset. The resulting F1-scores of the developed models are 84% and 83%, respectively.

machine learning; artificial neural networks; natural language processing; automated text analysis; graph neural networks; author profiling; author gender identification

The research was supported by the Russian Foundation for Basic Research (RFBR), research project No. 18-29-10084 “mk”.

References

Mikolov T., Sutskever I., Chen K., Corrado G. S., Dean J. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems. MIT Press. 2013. V. 2. P. 3111–3119.

Greff K., Srivastava R. K., Koutník J., Steunebrink B. R., Schmidhuber J. LSTM: A search space odyssey. IEEE transactions on neural networks and learning systems. IEEE. 2016. V. 28. № 10. P. 2222–2232.

Hassan A., Mahmood A. Deep learning approach for sentiment analysis of short texts. Proceedings of 2017 3rd international conference on control, automation and robotics (ICCAR). IEEE. 2017. P. 705–710.

Tai K. S., Socher R., Manning C. D. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075. 2015.

Miyazaki R., Komachi M. Japanese Sentiment Classification using a Tree-Structured Long Short-Term Memory with Attention. arXiv preprint arXiv:1704.00924. 2017.

Sboev A., Moloshnikov I., Gudovskikh D., Rybka R. A comparison of Data Driven models of solving the task of gender identification of author in Russian language texts for cases without and with the gender deception. Journal of Physics: Conference Series. IOP Publishing. 2017. V. 937. № 1. P. 012046.

Sboev A., Moloshnikov I., Gudovskikh D., Selivanov A., Rybka R., Litvinova T. Automatic gender identification of author of Russian text by machine learning and neural net algorithms in case of gender deception. Procedia computer science. 2018. № 123. P. 417–423.

Sboev A., Moloshnikov I., Gudovskikh D., Selivanov A., Rybka R., Litvinova T. Deep Learning neural nets versus traditional machine learning in gender identification of authors of RusProfiling texts. Procedia computer science. 2018. № 123. P. 424–431.

LeCun Y., Bengio Y. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks. 1995. № 3361 (10).

Grover A., Leskovec J. node2vec: Scalable feature learning for networks. Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining 2016. ACM. 2016. P. 855–864.

Narayanan A., Chandramohan M., Venkatesan R., Chen L., Liu Y., Jaiswal S. graph2vec: Learning distributed representations of graphs. arXiv preprint arXiv:1707.05005. 2017.

Kipf T.N., Welling M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. 2016.

Veličković P., Cucurull G., Casanova A., Romero A., Liò P., Bengio Y. Graph attention networks. arXiv preprint arXiv:1710.10903. 2017.

Xinyi Z., Chen L. Capsule graph neural network, 2018.

Mikolov T., Sutskever I., Chen K., Corrado G. S., Dean J. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 2013. P. 3111–3119.

Shervashidze N., Schweitzer P., van Leeuwen E. J., Mehlhorn K., Borgwardt K. M. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research. 2011. P. 2539–2561.

Goldberg Y., Levy O. Word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722. 2014.

Straka M., Straková J. Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics. Vancouver, Canada. 2017. P. 88–99.

Rybka R., Sboev A., Moloshnikov I., Gudovskikh D. Morpho-syntactic parsing based on neural networks and corpus data. Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT Conference (AINL-ISMW FRUCT). St. Petersburg. 2015. P. 89–95.

Springenberg J. T., Dosovitskiy A., Brox T., Riedmiller M. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806. 2014.

Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research. 2014. № 15 (1). P. 1929–1958.

Smith L. N. Cyclical learning rates for training neural networks. IEEE Proceedings of the Winter Conference on Applications of Computer Vision (WACV). IEEE. 2017. P. 464–472.

The authors declare that there are no conflicts of interest present.