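Success of Deep Learning Methods in Determining the End of Sentence with POS Tag Information in Turkish Texts (Türkçe Metinler İçin POS Etiket Bilgisi İle Cümle Sonu Belirlenmesinde Derin Öğrenme Yöntemlerinin Başarısı)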

Yasin Bektaş; Selma Ayşe Özel

doi:10.5281/zenodo.8233545

Authors

Yasin Bektaş, Mersin University, Vocational School of Erdemli, Department of Computer Technologies, Mersin, Türkiye, https://orcid.org/0000-0002-2761-5780
Selma Ayşe Özel, Cukurova University, Faculty of Engineering, Department of Computer Engineering, Adana, Türkiye, https://orcid.org/0000-0001-9201-6349

Keywords:

Deep Learning, Natural Language Processing, Sentence Boundary Detection, Corpus

Abstract

Technological developments have led to a rapid increase in written and spoken text in the digital world. As a result, Natural Language Processing (NLP) applications have gained great importance. The first and most important problem to solve in NLP applications is determining the sentence boundaries of a text correctly. Punctuation marks such as periods, exclamation points, and question marks, which generally appear at the end of sentences, are not used only to mark sentence boundaries. Disambiguating the purpose of each punctuation mark is therefore a problem in its own right. In previous studies, the effect of POS (Part-Of-Speech) tag information on end-of-sentence detection was examined, and successful results were obtained with classical classifiers. In this study, varying numbers of POS tags were added to 9 rule-based attributes, and experimental evaluations were carried out with two deep learning methods, Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM). The experiments used the Turkish National Corpus (TNC) and the SETimes parallel corpus. TNC is a 50-million-word corpus spanning many fields and genres and covering the period 1990-2009. SETimes is a parallel corpus of 10 languages, 9 of which are southeast European languages and one of which is English. Balanced sub-datasets of 30,000 samples, containing instances with and without sentence endings, were randomly selected from these corpora and used for testing. The experiments compared classical classifiers (Back Propagation Neural Network, RBF (Radial Basis Function) Network, Naive Bayes, Decision Tree, and Support Vector Machine) with the deep learning methods LSTM and BiLSTM. The deep learning methods were observed to be significantly more successful than the classical classifiers.
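
The feature construction and balanced sampling described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not list the 9 rule-based attributes, so the four rules below are hypothetical stand-ins, and the function names, the padding symbol, the POS tag labels, and the window size are not taken from the paper.

```python
import random
from typing import Dict, List, Tuple

PAD = "<PAD>"  # hypothetical padding symbol for positions outside the text


def candidate_features(
    tokens: List[str], pos_tags: List[str], index: int, window: int = 2
) -> Tuple[Dict[str, bool], List[str]]:
    """Describe the punctuation mark at tokens[index].

    Returns a few illustrative rule-based attributes (NOT the paper's 9 rules)
    and the POS tags of `window` tokens on each side of the mark.
    """
    prev_tok = tokens[index - 1] if index > 0 else ""
    next_tok = tokens[index + 1] if index + 1 < len(tokens) else ""
    rules = {
        "next_is_capitalized": next_tok[:1].isupper(),
        "prev_is_digit": prev_tok.isdigit(),
        "prev_is_short": 0 < len(prev_tok) <= 3,
        "is_last_token": index == len(tokens) - 1,
    }
    # POS tags to the left and right of the mark, padded at the text edges.
    left = [pos_tags[i] if i >= 0 else PAD for i in range(index - window, index)]
    right = [
        pos_tags[i] if i < len(pos_tags) else PAD
        for i in range(index + 1, index + 1 + window)
    ]
    return rules, left + right


def balanced_sample(
    examples: List[Tuple[object, int]], per_class: int, seed: int = 0
) -> List[Tuple[object, int]]:
    """Randomly draw `per_class` sentence-end (label 1) and non-end (label 0) examples."""
    rng = random.Random(seed)
    ends = [e for e in examples if e[1] == 1]
    non_ends = [e for e in examples if e[1] == 0]
    sample = rng.sample(ends, per_class) + rng.sample(non_ends, per_class)
    rng.shuffle(sample)
    return sample


# Toy input: the first period follows an abbreviation, the second ends the sentence.
tokens = ["Dr", ".", "Ali", "geldi", "."]
pos_tags = ["Abbr", "Punc", "Noun", "Verb", "Punc"]  # hypothetical tag set
rules_mid, pos_mid = candidate_features(tokens, pos_tags, 1)
rules_end, pos_end = candidate_features(tokens, pos_tags, 4)
```

In this toy input, capitalization alone does not separate the two periods, because the token after "Dr." is also capitalized. The POS window is the extra signal: the first period is preceded by an abbreviation tag and the second by a verb tag. Varying `window` corresponds loosely to adding different numbers of POS tags to the rule-based attributes, and the resulting vectors would then be encoded numerically before being passed to a classifier such as an LSTM or BiLSTM.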
