Statistical learning of sub-words in Vietnamese language

D. Q. Nguyen, T. H. Le

Faculty of Engineering

Type: Conference proceedings

Abstract

Sub-words have recently attracted much attention and have been employed to improve many natural language processing applications. In this paper, we propose a procedure for extracting sub-word units from a text collection. The sub-word units are evaluated on two Vietnamese databases to analyze and discuss their statistics and characteristics in the Vietnamese language, including sub-word types, sub-word frequency, the distribution of top sub-words, and unknown sub-words in different text types. The experimental results also point out several problems with how training and testing data are split in a current Vietnamese language processing example, Optical Character Recognition (OCR) error correction.

Keywords: Sub-word, bigram, trigram, statistics.

General information

Type

Conference proceedings

Year of publication

05 Nov 2021

Original language

English

Published in

AIP Conference Proceedings

Issue

Vol. 2420, No. 010001 (2021)

Journal category

Scopus-indexed

ISBN

978-0-7354-4137-8

Quality ranking

No quartile (Q) ranking

References

To read the full text of this paper, you can request a complete copy directly from the authors.