Text Recognition

For simple, everyday recognition tasks, such as modern printed publications, the CATS IT offers several optical character recognition (OCR) services. You can use Adobe Acrobat Pro on your workstation, with its “Edit PDF” functionality, to convert individual PDF files to full text in a single recognition language. The CATS copy machines also offer an OCR service, which is based on Abbyy OCR Server. It can recognize texts containing letters or characters in multiple modern languages and scripts. In addition, you can install other tools on your own computer, such as the free PDF-XChange, which provides OCR language packs, e.g., for Japanese or Chinese.

All these solutions work very well with contemporary (usually digitally printed) texts. However, if you wish to recognize text from historic documents (and “historic” usually starts around the 1970s), this can become quite a challenging task. To go deeper into text recognition, you have various options. One is to use the MediaLab at the HCTS, which provides the advanced Abbyy Finereader software. With this tool, you can not only freely combine recognition languages, but also train the recognition algorithm on your own material. You can also define your own set of characters (your own “language”) and use it to train the recognition. In addition, you can define the areas you want to recognize, and mark tables or lists separately. Finereader also offers a correction interface and various output formats, including plain text, docx/odt, and XML. For example, we used Abbyy Finereader for the digitization of the “Turkologischer Anzeiger”.

Another option is the Transkribus platform. This cooperative uses artificial intelligence to recognize even handwritten texts, and offers powerful transcription and annotation services. While its focus lies on texts in Latin script, a number of initiatives have successfully shown how Transkribus can be used for languages in non-Latin scripts. At the CATS, the digital project Naval Kishore Press successfully used the platform to recognize Hindi and Sanskrit texts written in Devanagari.

A number of other platforms deal with text recognition, using optical character recognition (OCR), handwritten text recognition (HTR), or computational text recognition (CTR) approaches. Examples include eScriptorium (currently being tested by the University Library) and OCR4all, to name just a few.

Even more advanced approaches use dedicated recognition pipelines, like those of the German OCR-D initiative. Unlike Adobe Acrobat, these pipelines cannot simply be installed on your computer. They consist of multiple steps that often require more advanced technical skills and more powerful hardware, such as that provided by high-performance computing centers. One example of a CATS project developing such a text recognition pipeline, in this case for newspapers from Republican China, is the Early Chinese Periodicals Online (ECPO) project.

If you are interested in learning more about computational text recognition, or are thinking about implementing such algorithms in your research project, please contact Matthias Arnold.