Processing Unstructured Text Information Using Machine Learning Algorithms

I. A. Lisenkov; V. A. Kuznetsov; N. M. Leonova

doi:10.56304/S2304487X20040057

Abstract

The number of documents processed in an organization's operations grows dramatically from year to year, creating strong demand for automatic processing of this information. Unfortunately, a substantial share of these documents is unstructured: their formats vary widely from one document to the next. In this setting, automation through rule-based coding and parsing templates is complex to build and hard to maintain. The documents could be analyzed manually, but this is time-consuming and resource-intensive. This paper presents the general architecture of an information extraction system based on modern approaches to natural language processing (NLP) and machine learning. It also reports statistical results of experiments carried out to solve a number of practical problems in processing unstructured documents. The presented solution makes it possible to process large volumes of unstructured information without writing code or preparing parsing templates.

Keywords

information extraction, natural language processing, deep learning, unstructured information

About the Authors

I. A. Lisenkov

National Research Nuclear University MEPhI (Moscow Engineering Physics Institute)
115409, Moscow, Russian Federation

V. A. Kuznetsov

National Research Nuclear University MEPhI (Moscow Engineering Physics Institute)
115409, Moscow, Russian Federation

N. M. Leonova

National Research Nuclear University MEPhI (Moscow Engineering Physics Institute)
115409, Moscow, Russian Federation

For citations:

Lisenkov I.A., Kuznetsov V.A., Leonova N.M. Processing Unstructured Text Information Using by Machine Learning Algorithms. Vestnik natsional'nogo issledovatel'skogo yadernogo universiteta "MIFI". 2020;9(4):376-384. (In Russ.) https://doi.org/10.56304/S2304487X20040057

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 2304-487X (Print)
