Authors
Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy, Gully APC Burns
Publication date
2012/12
Source
Source Code for Biology and Medicine
Volume
7
Pages
1-10
Publisher
BioMed Central
Description
Background
The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.
Results
Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the …
Total citations
[Bar chart: citations per year, 2012-2024]
Scholar articles
C Ramakrishnan, A Patnia, E Hovy, GAPC Burns - Source Code for Biology and Medicine, 2012