Statistical learning of sub-words in Vietnamese language
D. Q. Nguyen, T. H. Le
Faculty of Engineering
Category: Proceedings
Sub-words have recently attracted much attention and have been employed to improve many natural language processing applications. In this paper, we propose a procedure for extracting sub-word units from a text collection. The sub-word units are evaluated on two Vietnamese databases in order to analyze and discuss their statistics and characteristics in Vietnamese, including sub-word types, sub-word frequency, the distribution of the top sub-words, and unknown sub-words in different text types. The experimental results also reveal several problems with the splitting of training and testing data in a current Vietnamese language processing example, Optical Character Recognition (OCR) error correction.
Keywords: Sub-word, bigram, trigram, statistics.
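The abstract does not spell out the extraction procedure, so the following is only a minimal sketch of the kind of statistics it mentions (sub-word types, frequencies, top sub-words, and unknown sub-words between a training and a testing text). It assumes that a sub-word is a character n-gram (bigram or trigram) taken inside each whitespace-separated token and that text is normalized to Unicode NFC; the function names `extract_subwords` and `unknown_rate` and the sample strings are hypothetical and not taken from the paper.

```python
import unicodedata
from collections import Counter


def extract_subwords(text, n):
    """Count character n-grams inside each whitespace-separated token.

    Assumption: sub-words are character n-grams; the paper's actual
    definition may differ.
    """
    # NFC keeps each Vietnamese letter with its diacritics as one code point
    # where a precomposed form exists.
    text = unicodedata.normalize("NFC", text).lower()
    counts = Counter()
    for token in text.split():
        for i in range(len(token) - n + 1):
            counts[token[i:i + n]] += 1
    return counts


def unknown_rate(train_counts, test_counts):
    """Fraction of test sub-word occurrences whose type is absent from training."""
    total = sum(test_counts.values())
    if total == 0:
        return 0.0
    unknown = sum(c for sw, c in test_counts.items() if sw not in train_counts)
    return unknown / total


if __name__ == "__main__":
    # Hypothetical toy data standing in for the two databases.
    train_text = "tiếng việt là ngôn ngữ"
    test_text = "tiếng anh"
    train_bigrams = extract_subwords(train_text, 2)
    test_bigrams = extract_subwords(test_text, 2)
    print("bigram types in training:", len(train_bigrams))
    print("top bigrams:", train_bigrams.most_common(3))
    print("unknown bigram rate on test:", unknown_rate(train_bigrams, test_bigrams))
```

A high unknown rate on the testing text, or a rate near zero, is the sort of signal that could point to a problematic training/testing split, although the paper's own criteria are not given in the abstract.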