Overview

Tesseract is an open-source optical character recognition (OCR) engine that converts images containing text into machine-readable text. Originally developed at Hewlett-Packard, it was later developed by Google and is now maintained by a community of contributors. Tesseract 4 introduced a neural network (LSTM) based engine focused on line recognition, while still supporting the legacy engine. It reads common image formats such as PNG, JPEG, and TIFF, and supports several output formats, including plain text, hOCR (HTML), PDF, TSV, ALTO, and PAGE. Developers can integrate it into applications through its C or C++ API. It relies on the Leptonica library for image handling. It can also be trained to recognize additional languages and custom character sets.
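
As a rough illustration of the engine and output-format choices described above, the sketch below drives the tesseract command-line program from Python rather than the C or C++ API. The helper names build_tesseract_command and run_ocr are hypothetical, and the sketch assumes a tesseract executable with English language data is installed and on the PATH.

```python
import shutil
import subprocess

# Output formats from the overview, mapped to the config name that is
# appended to the tesseract command line. Plain text is the default
# output, so it needs no config name.
OUTPUT_CONFIGS = {"text": None, "hocr": "hocr", "pdf": "pdf", "tsv": "tsv", "alto": "alto"}


def build_tesseract_command(image, output_base, lang="eng", fmt="text", use_lstm=True):
    """Return the argument list for one tesseract run (hypothetical helper)."""
    if fmt not in OUTPUT_CONFIGS:
        raise ValueError(f"unsupported output format: {fmt}")
    # --oem 1 selects the LSTM engine; --oem 0 selects the legacy engine,
    # which only works with traineddata files that include legacy models.
    oem = "1" if use_lstm else "0"
    cmd = ["tesseract", str(image), str(output_base), "-l", lang, "--oem", oem]
    config = OUTPUT_CONFIGS[fmt]
    if config is not None:
        cmd.append(config)
    return cmd


def run_ocr(image, lang="eng"):
    """Return the recognized plain text for an image (hypothetical helper).

    Uses the special output base "stdout" so no file is written.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    cmd = build_tesseract_command(image, "stdout", lang=lang)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, build_tesseract_command("scan.png", "out", fmt="pdf") yields the arguments for "tesseract scan.png out -l eng --oem 1 pdf", which would write the result to out.pdf. The file name scan.png is a placeholder.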

Common tasks

Optical Character Recognition, Text Extraction, Image to Text Conversion