Authors
Myriam C Traub, Thaer Samar, Jacco Van Ossenbruggen, Lynda Hardman
Publication date
2018/5/23
Book
Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries
Pages
29-36
Description
Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using a mixed corpus, we assess how this mix affects the retrievability of the corrected and uncorrected documents. The correction of …
Total citations
[Chart: citations per year, 2018-2024]
Scholar articles
MC Traub, T Samar, J Van Ossenbruggen, L Hardman - Proceedings of the 18th ACM/IEEE on Joint …, 2018