Candidate word generation for OCR errors using optimization algorithm
D. T. Pham, D. Q. Nguyen, A. D. Le, M. N. Phan, P. Kromer
Category: Proceedings
OCR post-processing is an important step for improving the accuracy of OCR text. It comprises two main tasks: error detection and error correction. Hill climbing is a heuristic search method for solving optimization problems. In this paper, we present a novel OCR error correction approach based on an adapted version of the hill climbing algorithm. Correction candidates for OCR errors are generated by random character edits and then evolved with hill climbing. The character edit patterns are learned from the training data. The proposed model is evaluated on the benchmark dataset of the OCR post-correction competition held at the International Conference on Document Analysis and Recognition 2017. The results show that our model outperforms various baseline approaches in the competition. In addition, the randomness of the proposed algorithm is analyzed to verify its stability under different parameter configurations.
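The idea of exploring candidates with random character edits and keeping only improvements can be illustrated with a minimal sketch. This is not the authors' implementation: the edit patterns, the toy lexicon, the fitness function `score`, and the parameters `steps` and `seed` are all hypothetical stand-ins chosen for illustration. In the paper, the edit patterns are learned from training data and the adaptation of hill climbing may differ.

```python
import random

# Hypothetical edit patterns (OCR output -> intended text).
# The paper derives such patterns from training data; these are made up.
EDIT_PATTERNS = [("rn", "m"), ("1", "l"), ("0", "o"), ("c", "e"), ("vv", "w")]

# Hypothetical lexicon with toy word frequencies, used only for scoring.
LEXICON = {"modern": 120, "model": 300, "hello": 80, "world": 150}


def random_edit(word, rng):
    """Return `word` with one randomly chosen applicable edit applied.

    If no pattern matches anywhere in the word, the word is returned unchanged.
    """
    options = []
    for src, dst in EDIT_PATTERNS:
        start = word.find(src)
        while start != -1:
            options.append(word[:start] + dst + word[start + len(src):])
            start = word.find(src, start + 1)
    return rng.choice(options) if options else word


def score(candidate, original):
    """Toy fitness (an assumption): lexicon frequency minus a length-change penalty."""
    return LEXICON.get(candidate, 0) - abs(len(candidate) - len(original))


def hill_climb(error_word, steps=50, seed=0):
    """Basic hill climbing: accept a random neighbor only if it scores strictly higher."""
    rng = random.Random(seed)
    best = error_word
    best_score = score(best, error_word)
    for _ in range(steps):
        neighbor = random_edit(best, rng)
        neighbor_score = score(neighbor, error_word)
        if neighbor_score > best_score:
            best, best_score = neighbor, neighbor_score
    return best
```

Calling `hill_climb("w0rld")` with this toy setup would be expected to reach `"world"`, since the only applicable edit produces a lexicon word. Because the sketch accepts only strict improvements, it can stall on errors that need two edits with no intermediate gain, which is one reason an adapted variant and an analysis of randomness, as the abstract describes, are of interest.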