Authors

Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy, Gully APC Burns

Publication date

2012/12

Source

Source Code for Biology and Medicine

Volume

Pages

1-10

Publisher

BioMed Central

Description

Background

The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.

Results

Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the …

Total citations

Cited by 176

Citations per year: 2012: 5, 2013: 6, 2014: 16, 2015: 8, 2016: 13, 2017: 15, 2018: 16, 2019: 13, 2020: 20, 2021: 19, 2022: 16, 2023: 19, 2024: 7

Scholar articles

Layout-aware text extraction from full-text PDF of scientific articles

C Ramakrishnan, A Patnia, E Hovy, GAPC Burns - Source Code for Biology and Medicine, 2012