
Vestnik natsional'nogo issledovatel'skogo yadernogo universiteta "MIFI"


Application of a Multitasking Model for Practical Tasks of Heading Generation, Definition of Lemmas and Keywords

https://doi.org/10.1134/S2304487X20030074

Abstract

The efficiency of multitask deep learning methods for combined neural network models is studied on a selected set of tasks: headline generation, lemmatization, and keyword extraction. The multitask model is built from multi-head attention layers and serves as the basis for the headline generation models and for an LSTM-based lemmatization model. Two open corpora are used: RIA Novosti, which contains news texts with their headlines, and SynTagRus from the Universal Dependencies project, which provides morphological and syntactic markup and lemmas for word forms. For keyword extraction, we assembled a new corpus of news texts using a crowdsourcing platform. For lemmatization, features from the multitask model raise the F1 score by 1% compared with using morphological features alone. For headline generation, the model reaches state-of-the-art accuracy (ROUGE F1 score of 0.42). Building on the headline generation model obtained in this work, we propose a keyword extraction algorithm that requires no additional network training.
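To make the reported headline metric concrete, the following is a minimal sketch of a ROUGE-style F1 score. The abstract does not say which ROUGE variant or tokenization was used, so the unigram (ROUGE-1) overlap, the lowercase whitespace tokenization, the function name `rouge1_f1`, and the example headlines are all assumptions made for illustration only.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference headline and a generated one.

    Assumption: lowercase whitespace tokenization; the paper's actual
    preprocessing and ROUGE variant are not stated in the abstract.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts only as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example pair, not taken from the RIA Novosti corpus.
reference = "central bank raises key interest rate"
candidate = "central bank raises rate"
score = rouge1_f1(reference, candidate)
print(f"ROUGE-1 F1 = {score:.2f}")
```

In this toy pair every generated token appears in the reference (precision 1.0), while the reference has two tokens the candidate lacks (recall 4/6), so the F1 score balances the two.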

About the Authors

I. A. Moloshnikov
National Research Center Kurchatov Institute
123182 Moscow, Russian Federation

A. V. Gryaznov
National Research Center Kurchatov Institute
123182 Moscow, Russian Federation

D. S. Vlasov
National Research Center Kurchatov Institute
123182 Moscow, Russian Federation

R. B. Rybka
National Research Center Kurchatov Institute
123182 Moscow, Russian Federation

A. G. Sboev
National Research Center Kurchatov Institute, 123182 Moscow
National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), 115409 Moscow
Russian Federation



For citations:


Moloshnikov I.A., Gryaznov A.V., Vlasov D.S., Rybka R.B., Sboev A.G. Application of a Multitasking Model for Practical Tasks of Heading Generation, Definition of Lemmas and Keywords. Vestnik natsional'nogo issledovatel'skogo yadernogo universiteta "MIFI". 2020;9(3):236-244. (In Russ.) https://doi.org/10.1134/S2304487X20030074



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2304-487X (Print)