Large quantities of historical newspapers are being digitized and OCR'd. We describe a framework for processing the OCR'd text to identify articles and extract metadata for them. We describe the article schema and provide examples of features that facilitate automatic indexing of the articles. For this processing, we employ lexical semantics, structural models, and community content. Furthermore, we describe visualization and summarization techniques that can be used to present the extracted events.
ABBYY FineReader 10 Professional Edition - Intelligent, professional-level OCR software for recognition of scanned paper documents, PDFs and digital images
ALTO (Analyzed Layout and Text Object) is an XML Schema that details technical metadata for describing the layout and content of physical text resources, such as pages of a book or a newspaper. It most commonly serves as an extension schema used within the Metadata Encoding and Transmission Schema (METS) administrative metadata section. However, ALTO instances can also exist as standalone documents used independently of METS.
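As a rough illustration of how a standalone ALTO document can be read, the following Python sketch pulls the recognized words out of each text line. The sample document, the namespace version, and the helper names `local_name` and `text_lines` are assumptions made for this example; real ALTO files produced by an OCR engine carry more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed ALTO document used only for illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine ID="TL1">
            <String CONTENT="Large" HPOS="100" VPOS="200" WIDTH="90" HEIGHT="30"/>
            <SP/>
            <String CONTENT="quantities" HPOS="200" VPOS="200" WIDTH="160" HEIGHT="30"/>
          </TextLine>
          <TextLine ID="TL2">
            <String CONTENT="of" HPOS="100" VPOS="240" WIDTH="30" HEIGHT="30"/>
            <SP/>
            <String CONTENT="newspapers" HPOS="140" VPOS="240" WIDTH="170" HEIGHT="30"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def text_lines(alto_xml):
    """Return one string per TextLine, joining its String CONTENT values."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in text_lines(SAMPLE_ALTO):
        print(line)
```

Matching on local names rather than a fixed namespace is a design choice for this sketch, since different ALTO versions use different namespace URIs.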
Audiveris is an Optical Music Recognition (OMR) module. Starting from the image of a music sheet, it provides high-level logical music information compliant with the MusicXML definition. Other tools, such as a MIDI sequencer or a composition editor, can then read and update this standard data.
There are already commercial tools in this area, but Audiveris is, to our knowledge, the first Java open-source OMR tool. It is a cross-platform tool, written entirely in Java, and tested on Windows, Solaris, Linux and Mac OS.
Audiveris works with printed music sheets only, the task of recognizing hand-written scores being significantly harder.
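To give a sense of how another tool might read MusicXML output of this kind, here is a minimal Python sketch that lists the pitches in a score. The sample score and the function name `pitches` are assumptions for illustration only, and the code is not part of Audiveris. It also assumes whole-number `alter` values, although MusicXML permits fractional ones for microtones.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-measure MusicXML score used only for illustration.
SAMPLE_MUSICXML = """<score-partwise version="3.1">
  <part-list>
    <score-part id="P1"><part-name>Music</part-name></score-part>
  </part-list>
  <part id="P1">
    <measure number="1">
      <note>
        <pitch><step>C</step><octave>4</octave></pitch>
        <duration>4</duration>
      </note>
      <note>
        <rest/>
        <duration>4</duration>
      </note>
      <note>
        <pitch><step>F</step><alter>1</alter><octave>4</octave></pitch>
        <duration>4</duration>
      </note>
    </measure>
  </part>
</score-partwise>"""


def pitches(musicxml):
    """Return pitch names such as 'C4' or 'F#4', skipping rests."""
    root = ET.fromstring(musicxml)
    names = []
    for note in root.iter("note"):
        pitch = note.find("pitch")
        if pitch is None:  # rests have no <pitch> child
            continue
        step = pitch.findtext("step")
        octave = pitch.findtext("octave")
        # Assumption: alter is a whole number of semitones (no microtones).
        alter = int(pitch.findtext("alter", default="0"))
        accidental = "#" * alter if alter > 0 else "b" * -alter
        names.append(f"{step}{accidental}{octave}")
    return names


if __name__ == "__main__":
    print(pitches(SAMPLE_MUSICXML))
```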
breaks down digital electronic, paper, microfilm, or microfiche documents into their components and creates searchable content while simultaneously