Vestnik natsional'nogo issledovatel'skogo yadernogo universiteta "MIFI"

Processing Unstructured Text Information Using by Machine Learning Algorithms

https://doi.org/10.56304/S2304487X20040057

Abstract

   The number of documents processed in an organization's operations grows substantially from year to year, creating strong demand for automatic processing of this information. A significant share of these documents, however, is unstructured: formats vary widely from one document to the next. Automating extraction through rule-based coding and parsing templates is therefore complex and hard to maintain, while manual analysis is time-consuming and resource-intensive. This paper presents the general architecture of an information extraction system based on modern approaches to Natural Language Processing (NLP) and machine learning. It also reports statistical results of experiments carried out on a number of practical problems in processing unstructured documents. The presented solution makes it possible to process large volumes of unstructured information without writing code or preparing parsing templates.

About the Authors

I. A. Lisenkov
National Research Nuclear University MEPhI (Moscow Engineering Physics Institute)
Russian Federation
115409, Moscow

V. A. Kuznetsov
National Research Nuclear University MEPhI (Moscow Engineering Physics Institute)
Russian Federation
115409, Moscow

N. M. Leonova
National Research Nuclear University MEPhI (Moscow Engineering Physics Institute)
Russian Federation
115409, Moscow

For citations:


Lisenkov I.A., Kuznetsov V.A., Leonova N.M. Processing Unstructured Text Information Using by Machine Learning Algorithms. Vestnik natsional'nogo issledovatel'skogo yadernogo universiteta "MIFI". 2020;9(4):376-384. (In Russ.) https://doi.org/10.56304/S2304487X20040057

This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2304-487X (Print)