Dzongkha Word Segmentation Using Deep Learning

Please use this identifier to cite or link to this item: http://nuir.lib.nu.ac.th/dspace/handle/123456789/1467

Full metadata record

DC Field	Value	Language
dc.contributor	YESHI JAMTSHO	en
dc.contributor	Yeshi Jamtsho	th
dc.contributor.advisor	Paisarn Muneesawang	en
dc.contributor.advisor	ไพศาล มุณีสว่าง	th
dc.contributor.other	Naresuan University. Faculty of Engineering	en
dc.date.accessioned	2020-10-12T08:37:01Z	-
dc.date.available	2020-10-12T08:37:01Z	-
dc.date.issued	2019	en_US
dc.identifier.uri	http://nuir.lib.nu.ac.th/dspace/handle/123456789/1467	-
dc.description	Master of Engineering (M.Eng.)	en
dc.description	วิศวกรรมศาสตรมหาบัณฑิต (วศ.ม.)	th
dc.description.abstract	Dzongkha is the national language of Bhutan. Preserving and promoting the national language is of the utmost importance because the language represents the identity of the country. Advancing the field of Natural Language Processing (NLP) and its applications can be a technological step toward that goal. However, no advancement has been seen in the field of Dzongkha language processing and its research. The development of NLP applications is also challenging because Dzongkha is written as a string of syllables without an explicit word delimiter. For such a language, word segmentation is the first and fundamental step toward building NLP applications. The word is the basic constituent for NLP tasks such as translation, and the role of a word in a given sentence or phrase determines the meaning. In this thesis, Dzongkha word segmentation is formulated as a syllable tagging problem because a word is formed from one or more syllables. The tag of a syllable represents the position of that syllable in a word. Tagging techniques range from dictionary-based to modern approaches. Deep learning algorithms, specifically a Deep Neural Network (DNN) and a Bi-directional Long Short-Term Memory (Bi-LSTM) network, were proposed. Using deep learning algorithms avoids the need for manual feature engineering. In our experiments for the DNN model, the window approach was implemented to incorporate contextual information around the target syllable. Context sizes ranging from 0 to 3 were considered to determine the most suitable context size for the Dzongkha language. Two experimental sets were designed according to whether pretrained syllable embeddings were used. Each set comprises four models with different context sizes. 
Amongst the eight models, the model with context size 2 using pretrained syllable embeddings achieved the highest accuracy of 94.35% and an F1-score of 94.40%, with 94.47% precision and 94.35% recall. There is no rule of thumb for determining the optimal hyperparameters for deep learning algorithms. We designed 24 Bi-LSTM models with different configurations, which can be broadly classified into two experimental sets based on the neuron size: 256 and 512. Amongst these models, the model with a neuron size of 256, an embedding dimension of 128, a learning rate of 0.01, and no dropout achieved the highest accuracy of 95.25%, which is 0.90 percentage points higher than the DNN-based model. Further, the proposed deep learning models were compared with traditional machine learning algorithms such as CRF and SVM, and the proposed models outperformed these traditional approaches. Out-of-vocabulary (OOV) syllables are the most prominent issue to be considered in language processing. Both models were designed to handle OOV syllables. My work is the first of its kind to apply deep learning algorithms in the field of Dzongkha language processing, and I consider the performance achieved by both models to be significant.	en
dc.description.abstract	-	th
dc.language.iso	en	en_US
dc.publisher	Naresuan University	en_US
dc.rights	Naresuan University	en_US
dc.subject	Dzongkha word segmentation	en
dc.subject	Deep Learning	en
dc.subject	Natural Language Processing	en
dc.subject	Window approach	en
dc.subject	Deep Neural Network	en
dc.subject	Bi-LSTM RNN	en
dc.subject	Syllable tagging	en
dc.subject.classification	Computer Science	en
dc.title	Dzongkha Word Segmentation Using Deep Learning	en
dc.title	-	th
dc.type	Thesis	en
dc.type	วิทยานิพนธ์	th
Appears in Collections:	Faculty of Engineering
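
The abstract frames segmentation as syllable tagging and describes a window approach for the DNN input. The following is a minimal sketch of both ideas, not the thesis implementation. The B/I/E/S tag names, the `<PAD>` token, and all function names are assumptions made for illustration; the record only states that a tag encodes a syllable's position in a word and that context sizes 0 to 3 were tried. Latin placeholders stand in for Dzongkha syllables.

```python
from typing import List

PAD = "<PAD>"  # hypothetical padding token for positions outside the sentence


def words_to_tags(words: List[List[str]]) -> List[str]:
    """Assign one position tag per syllable (assumed scheme: B, I, E, S)."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-syllable word
        else:
            # first syllable, any middle syllables, last syllable
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(syllables: List[str], tags: List[str]) -> List[List[str]]:
    """Rebuild words from a syllable sequence and its predicted tags."""
    words: List[List[str]] = []
    current: List[str] = []
    for syllable, tag in zip(syllables, tags):
        current.append(syllable)
        if tag in ("E", "S"):  # a word ends at this syllable
            words.append(current)
            current = []
    if current:  # keep any unfinished trailing word
        words.append(current)
    return words


def context_windows(syllables: List[str], k: int) -> List[List[str]]:
    """Return, for each syllable, the window of k neighbours on each side."""
    padded = [PAD] * k + list(syllables) + [PAD] * k
    return [padded[i:i + 2 * k + 1] for i in range(len(syllables))]


if __name__ == "__main__":
    words = [["a", "b"], ["c"], ["d", "e", "f"]]
    syllables = [s for w in words for s in w]
    tags = words_to_tags(words)
    print(tags)
    print(tags_to_words(syllables, tags))
    print(context_windows(syllables, 2)[0])
```

With context size `k`, each target syllable is presented to the classifier together with `2k` neighbours, so `k = 0` reduces to tagging each syllable in isolation.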

Files in This Item:

File	Description	Size	Format
61062939.pdf		2.09 MB	Adobe PDF
