|Title:||Dzongkha Word Segmentation Using Deep Learning|
Naresuan University. Faculty of Engineering
|Keywords:||Dzongkha word segmentation|
Natural Language Processing
Deep Neural Network
|Abstract:||Dzongkha is the national language of Bhutan. Preserving and promoting the national language is of the utmost importance because the language represents the identity of the country. Advancing Natural Language Processing (NLP) and its applications is one technological route toward that goal. However, no such advancement has yet been seen in Dzongkha language processing and its research. Developing NLP applications is also challenging because Dzongkha is written as a string of syllables without an explicit word delimiter.
For such a language, word segmentation is the first and fundamental step towards building NLP applications. Words are the basic constituents for NLP tasks such as machine translation, and the role a word plays in a given sentence or phrase determines the meaning. In this thesis, Dzongkha word segmentation is formulated as a syllable tagging problem because a word is formed from one or more syllables. The tag of a syllable represents the position of that syllable within a word. Tagging techniques range from dictionary-based methods to modern approaches. Here, deep learning algorithms are proposed, specifically a Deep Neural Network (DNN) and a Bi-directional Long Short-Term Memory (Bi-LSTM) network. Using deep learning algorithms avoids the need for manual feature engineering.
In our experiments with the DNN model, a window approach was implemented to incorporate contextual information around the target syllable. Context sizes ranging from 0 to 3 were considered to determine the most suitable context size for the Dzongkha language. Two experimental sets were designed, distinguished by whether pretrained syllable embeddings were used. Each set comprises four models with different context sizes. Among the eight models, the model with context size 2 using pretrained syllable embeddings achieved the highest accuracy of 94.35% and an F1-score of 94.40%, with 94.47% precision and 94.35% recall.
There is no rule of thumb for determining the optimal hyperparameters of deep learning algorithms. We therefore designed 24 Bi-LSTM models with different configurations, which can be broadly classified into two experimental sets based on the neuron size: 256 and 512. Among these models, the configuration with a neuron size of 256, an embedding dimension of 128, a learning rate of 0.01, and no dropout achieved the highest accuracy of 95.25%, which is 0.90 percentage points higher than the DNN-based model. Further, the proposed deep learning models were compared with traditional machine learning algorithms such as CRF and SVM; the comparison shows that the proposed models outperform the traditional machine learning approaches.
Handling out-of-vocabulary (OOV) items is the most prominent issue to be considered in language processing. Both models were designed to handle OOV syllables. This work is the first of its kind to apply deep learning algorithms in the field of Dzongkha language processing, and the performance achieved by both models is considered significant.|
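The syllable tagging formulation, the window approach, and OOV handling described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the B/M/E/S tag set, the `<PAD>` and `<UNK>` symbols, the function names, and the placeholder syllables `s1`..`s6` are all assumptions made for illustration, since the abstract does not state which tag scheme or padding strategy was used.

```python
from typing import Dict, List, Tuple

PAD = "<PAD>"  # hypothetical padding symbol for positions outside the sentence
UNK = "<UNK>"  # hypothetical symbol standing in for out-of-vocabulary syllables


def words_to_tags(words: List[List[str]]) -> Tuple[List[str], List[str]]:
    """Flatten segmented words into syllables plus position tags.

    Assumed tag set: B (begin), M (middle), E (end), S (single-syllable word).
    """
    syllables: List[str] = []
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            word_tags = ["S"]
        else:
            word_tags = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        syllables.extend(word)
        tags.extend(word_tags)
    return syllables, tags


def tags_to_words(syllables: List[str], tags: List[str]) -> List[List[str]]:
    """Rebuild words from predicted tags: a word closes on E or S."""
    words: List[List[str]] = []
    current: List[str] = []
    for syllable, tag in zip(syllables, tags):
        current.append(syllable)
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:  # keep a trailing unfinished word instead of dropping it
        words.append(current)
    return words


def context_windows(syllables: List[str], context: int) -> List[List[str]]:
    """Return, for each target syllable, itself plus `context` neighbours per side."""
    padded = [PAD] * context + syllables + [PAD] * context
    width = 2 * context + 1
    return [padded[i:i + width] for i in range(len(syllables))]


def encode(window: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map syllables to integer ids, sending unseen syllables to the UNK id."""
    return [vocab.get(syllable, vocab[UNK]) for syllable in window]


# Placeholder syllables; a real input would be Dzongkha syllables.
segmented = [["s1", "s2"], ["s3"], ["s4", "s5", "s6"]]
syllables, tags = words_to_tags(segmented)
windows = context_windows(syllables, context=2)
```

With context size 2, each target syllable is represented by a window of five syllables, which is the kind of input a window-based DNN tagger would embed and classify. A Bi-LSTM tagger would instead read the whole encoded syllable sequence.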
|Description:||Master of Engineering (M.Eng.)|
|Appears in Collections:||Faculty of Engineering|
Items in NU Digital Repository are protected by copyright, with all rights reserved, unless otherwise indicated.