High Performance Document Layout Analysis

Thomas M. Breuel
PARC, Palo Alto, CA, USA
tmb@parc.com

Abstract

In this paper, I summarize research in document layout analysis carried out over the last few years in our laboratory. Correct document layout analysis is a key step in document capture and conversion into electronic formats, optical character recognition (OCR), information retrieval from scanned documents, appearance-based document retrieval, and reformatting of documents for on-screen display. We have developed a number of novel geometric algorithms and statistical methods. Layout analysis systems built from these algorithms are applicable to a wide variety of languages and layouts, and have proven to be robust to the presence of noise and spu-

... like proximity, texture, or whitespace. Segmentation into regions is often carried out using heuristic methods based on morphology or “smearing”, projection profiles (recursive X-Y cuts), texture-based analysis, analysis of the background structure, and others (for a review and references, see [7]). Each individual region is then considered separately for tasks like text line finding and OCR.

The problem with this approach is that a complete and reliable segmentation of a document into separate regions is difficult to achieve in general. Some decisions about which regions to combine may well involve semantic constraints on the output of an OCR system. However, in order to be able to pass the ...