<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">vestnikmephi</journal-id><journal-title-group><journal-title xml:lang="ru">Вестник НИЯУ МИФИ</journal-title><trans-title-group xml:lang="en"><trans-title>Vestnik natsional'nogo issledovatel'skogo yadernogo universiteta "MIFI"</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">2304-487X</issn><publisher><publisher-name>National Research Nuclear University "MEPhI"</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.1134/S2304487X19060129</article-id><article-id custom-type="elpub" pub-id-type="custom">vestnikmephi-68</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>ПРИКЛАДНАЯ МАТЕМАТИКА И ИНФОРМАТИКА</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="en"><subject>APPLIED MATHEMATICS AND COMPUTER SCIENCE</subject></subj-group></article-categories><title-group><article-title>Генеративно-дискриминативная нейросетевая модель для задачи авторского профилирования</article-title><trans-title-group xml:lang="en"><trans-title>Generative-Discriminative Neural Network Model for the Task of Author Profiling</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Сбоев</surname><given-names>А. Г.</given-names></name><name name-style="western" xml:lang="en"><surname>Sboev</surname><given-names>A. 
G.</given-names></name></name-alternatives><bio xml:lang="ru"><p>123098</p><p>115409</p><p>Москва</p></bio><bio xml:lang="en"><p>123098</p><p>115409</p><p>Moscow</p></bio><email xlink:type="simple">sag111@mail.ru</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Рыбка</surname><given-names>Р. Б.</given-names></name><name name-style="western" xml:lang="en"><surname>Rybka</surname><given-names>R. B.</given-names></name></name-alternatives><bio xml:lang="ru"><p>123098</p><p>Москва</p></bio><bio xml:lang="en"><p>123098</p><p>Moscow</p></bio><xref ref-type="aff" rid="aff-2"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Грязнов</surname><given-names>А. В.</given-names></name><name name-style="western" xml:lang="en"><surname>Gryaznov</surname><given-names>A. V.</given-names></name></name-alternatives><bio xml:lang="ru"><p>123098</p><p>Москва</p></bio><bio xml:lang="en"><p>123098</p><p>Moscow</p></bio><xref ref-type="aff" rid="aff-2"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Молошников</surname><given-names>И. А.</given-names></name><name name-style="western" xml:lang="en"><surname>Moloshnikov</surname><given-names>I. 
A.</given-names></name></name-alternatives><bio xml:lang="ru"><p>123098</p><p>Москва</p></bio><bio xml:lang="en"><p>123098</p><p>Moscow</p></bio><xref ref-type="aff" rid="aff-2"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru">Национальный исследовательский центр “Курчатовский институт”; Национальный исследовательский ядерный университет “МИФИ”<country>Россия</country></aff><aff xml:lang="en">National Research Center “Kurchatov Institute”; National Research Nuclear University “MEPhI” (Moscow Engineering Physics Institute)<country>Russian Federation</country></aff></aff-alternatives><aff-alternatives id="aff-2"><aff xml:lang="ru">Национальный исследовательский центр “Курчатовский институт”<country>Россия</country></aff><aff xml:lang="en">National Research Center “Kurchatov Institute”<country>Russian Federation</country></aff></aff-alternatives><pub-date pub-type="collection"><year>2020</year></pub-date><pub-date pub-type="epub"><day>12</day><month>02</month><year>2023</year></pub-date><volume>9</volume><issue>1</issue><fpage>50</fpage><lpage>57</lpage><permissions><copyright-statement>Copyright &#x00A9; Сбоев А.Г., Рыбка Р.Б., Грязнов А.В., Молошников И.А., 2023</copyright-statement><copyright-year>2023</copyright-year><copyright-holder xml:lang="ru">Сбоев А.Г., Рыбка Р.Б., Грязнов А.В., Молошников И.А.</copyright-holder><copyright-holder xml:lang="en">Sboev A.G., Rybka R.B., Gryaznov A.V., Moloshnikov I.A.</copyright-holder><license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://vestnikmephi.elpub.ru/jour/article/view/68">https://vestnikmephi.elpub.ru/jour/article/view/68</self-uri><abstract/><trans-abstract xml:lang="en"><p>The paper considers the generative-discriminative model (GAN) as applied to the task of 
analyzing text data, in particular, determining the gender of the author of a Russian-language text. The developed approach belongs to the class of semi-supervised learning algorithms, in which both labeled and unlabeled samples take part in model fitting. The GAN model is implemented as a deep neural network consisting of fully connected, recurrent, and convolutional layers. The generative part of the GAN model is based on a variational autoencoder, which encodes the input sample into a space of latent variables and then decodes the latter back into the original representation. During decoding, the class label of the input example is used: it is known for the labeled set and is predicted by the classifier for unlabeled samples. The model input is a sequence of words, each encoded by a vector of the principal components of its morphological features. To enable the reconstruction of texts longer than 50 words, the operating principles of language models are employed. The discriminative part is trained to determine whether a given sample was generated by the autoencoder or taken from the original set. The quality of the GAN model was assessed on a set of texts from LiveJournal blogs. It is shown that the use of the generative-discriminative model improves classification quality by 2% in the F1 metric and reduces the standard deviation by a factor of 2–3 when training on a small number of labeled examples. 
Various training modes and variations of the GAN model topology are investigated, and the most effective operating modes of models of this type for text classification are demonstrated.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>машинное обучение</kwd><kwd>искусственные нейронные сети</kwd><kwd>обработка естественного языка</kwd><kwd>автоматизированный анализ текстов</kwd><kwd>генеративно-дискриминативные нейронные сети</kwd><kwd>авторское профилирование</kwd><kwd>определение пола автора текста</kwd></kwd-group><kwd-group xml:lang="en"><kwd>machine learning</kwd><kwd>artificial neural networks</kwd><kwd>natural language processing</kwd><kwd>automated text analysis</kwd><kwd>generative-discriminative neural networks</kwd><kwd>author profiling</kwd><kwd>author gender identification</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Guo J. et al. Long text generation via adversarial training with leaked information // Thirty-Second AAAI Conference on Artificial Intelligence. 2018.</mixed-citation><mixed-citation xml:lang="en">Guo J. et al. Long text generation via adversarial training with leaked information // Thirty-Second AAAI Conference on Artificial Intelligence. 2018.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Nie W., Narodytska N., Patel A. RelGAN: Relational generative adversarial networks for text generation. 2018.</mixed-citation><mixed-citation xml:lang="en">Nie W., Narodytska N., Patel A. RelGAN: Relational generative adversarial networks for text generation. 2018.</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Radford A. et al. Language models are unsupervised multitask learners // OpenAI Blog. 2019. Vol. 1. 
№ 8.</mixed-citation><mixed-citation xml:lang="en">Radford A. et al. Language models are unsupervised multitask learners // OpenAI Blog. 2019. Vol. 1. No. 8.</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Bao J. et al. CVAE-GAN: fine-grained image generation through asymmetric training // Proceedings of the IEEE International Conference on Computer Vision. 2017. С. 2745–2754.</mixed-citation><mixed-citation xml:lang="en">Bao J. et al. CVAE-GAN: fine-grained image generation through asymmetric training // Proceedings of the IEEE International Conference on Computer Vision. 2017. P. 2745–2754.</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Xu W. et al. Variational autoencoder for semi-supervised text classification // Thirty-First AAAI Conference on Artificial Intelligence. 2017.</mixed-citation><mixed-citation xml:lang="en">Xu W. et al. Variational autoencoder for semi-supervised text classification // Thirty-First AAAI Conference on Artificial Intelligence. 2017.</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Semeniuta S., Severyn A., Barth E. A hybrid convolutional variational autoencoder for text generation // arXiv preprint arXiv:1702.02390. 2017.</mixed-citation><mixed-citation xml:lang="en">Semeniuta S., Severyn A., Barth E. A hybrid convolutional variational autoencoder for text generation // arXiv preprint arXiv:1702.02390. 2017.</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Shen D. et al. Hierarchically-Structured Variational Autoencoders for Long Text Generation. 2018.</mixed-citation><mixed-citation xml:lang="en">Shen D. et al. Hierarchically-Structured Variational Autoencoders for Long Text Generation. 
2018.</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Chung J. et al. Empirical evaluation of gated recurrent neural networks on sequence modeling // arXiv preprint arXiv:1412.3555. 2014.</mixed-citation><mixed-citation xml:lang="en">Chung J. et al. Empirical evaluation of gated recurrent neural networks on sequence modeling // arXiv preprint arXiv:1412.3555. 2014.</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Chollet F. et al. Keras. 2015.</mixed-citation><mixed-citation xml:lang="en">Chollet F. et al. Keras. 2015.</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Abadi M. et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems // arXiv preprint arXiv:1603.04467. 2016.</mixed-citation><mixed-citation xml:lang="en">Abadi M. et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems // arXiv preprint arXiv:1603.04467. 2016.</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Arjovsky M., Shah A., Bengio Y. Unitary evolution recurrent neural networks // International Conference on Machine Learning. 2016. P. 1120–1128.</mixed-citation><mixed-citation xml:lang="en">Arjovsky M., Shah A., Bengio Y. Unitary evolution recurrent neural networks // International Conference on Machine Learning. 2016. P. 1120–1128.</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Kingma D. P., Ba J. Adam: A method for stochastic optimization // arXiv preprint arXiv:1412.6980. 2014.</mixed-citation><mixed-citation xml:lang="en">Kingma D. P., Ba J. Adam: A method for stochastic optimization // arXiv preprint arXiv:1412.6980. 
2014.</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Litvinova T. A., Sboev A. G., Panicheva P. V. Profiling the Age of Russian Bloggers // Conference on Artificial Intelligence and Natural Language. Springer, Cham, 2018. P. 167–177.</mixed-citation><mixed-citation xml:lang="en">Litvinova T. A., Sboev A. G., Panicheva P. V. Profiling the Age of Russian Bloggers // Conference on Artificial Intelligence and Natural Language. Springer, Cham, 2018. P. 167–177.</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Straka M., Hajic J., Straková J. UDPipe: trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, pos tagging and parsing // Proceedings of the tenth international conference on language resources and evaluation (LREC 2016). 2016. P. 4290–4297.</mixed-citation><mixed-citation xml:lang="en">Straka M., Hajic J., Straková J. UDPipe: trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, pos tagging and parsing // Proceedings of the tenth international conference on language resources and evaluation (LREC 2016). 2016. P. 4290–4297.</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Tipping M. E., Bishop C. M. Probabilistic principal component analysis // Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1999. Vol. 61. № 3. P. 611–622.</mixed-citation><mixed-citation xml:lang="en">Tipping M. E., Bishop C. M. Probabilistic principal component analysis // Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1999. Vol. 61. No. 3. P. 611–622.</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest.</p></fn></fn-group></back></article>
