1: Unlocking Historical Documents: A Practical Guide to ATR

What You Will Learn

By the end of this introduction and the subsequent pages, you will confidently be able to:

  • Define OCR, HTR, and ATR, and trace their evolution from punch-cards to neural networks
  • Explain why ATR transforms archival research—saving time and revealing hidden patterns
  • Map the modern ATR workflow on a conceptual level: from scanning to model training, transcription, and correction
  • Decide when to use off-the-shelf models versus custom-trained ones for your specific collection

Why You Are Here

Historical documents—be they 12th-century charters, 16th-century letters, or 19th-century diaries—hold treasures of insight. Yet their handwritten scripts, aging inks, and paper wear make them difficult to read and analyze at scale. Automated Text Recognition (ATR) unlocks these texts, turning images of pages into machine-readable transcripts so you can:

⚙️ Boost Efficiency: Replace manual transcription with automated pipelines

🔍 Accelerate Discovery: Search for people, places, and phrases across collections

📊 Expand Analysis: Combine close reading of individual pages with corpus-wide statistics

A Quick Story: At the University of Zürich, ATR has turned what once would have been decades of manual transcription into projects that can be completed within months. For instance, the Bullinger Digital team used a Transformer-based TrOCR model to automatically transcribe almost 3,000 Reformation-era letters. The PARES project will employ ATR to digitize the archives of French philologist Gaston Paris (1839–1903), and the Heinrich Wölfflin Gesammelte Werke initiative is applying automated transcription to the collected works of Swiss art historian Heinrich Wölfflin (1864–1945). Thanks to ATR, these landmark projects can devote their time to interpretation, text mining, and deeper epistemological questions—rather than the painstaking work of character-by-character transcription.

Who This Module Is For

No matter your background—historian, archivist, librarian, digital humanist, or curious learner—you will find this module approachable. We assume no prior AI, programming, or deep-learning knowledge. Familiarity with reading and interpreting historical scripts is helpful, but not required.

Roadmap at a Glance

ATR Overview: Why it matters now and what makes it different from OCR/HTR
Benefits for Historical Research: Real-world case studies and quick wins
History in Brief: From early OCR to today’s transformer-based models
Under the Hood: Key components of an ATR pipeline (optional deep dive)
Resources: Definitions, further reading, and useful links

Learning Objectives

By the end of this part, you will be able to:

  • Define Automated Text Recognition (ATR) as a comprehensive field.
  • Distinguish between Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR).
  • Trace the evolution from early recognition techniques to modern neural networks.
  • Explain why new approaches have dramatically improved the accuracy and feasibility of transcribing historical documents.

What is Automated Text Recognition (ATR)?

Automated Text Recognition (ATR) is the general term for all techniques that digitally recognize characters in documents. It serves as an umbrella term for technologies that convert images of text—whether printed or handwritten—into a machine-readable format that can be searched, edited, and analyzed on a computer. This process unlocks vast historical archives for new kinds of research, making large-scale analysis more time-efficient and cost-effective.

While it is common practice to distinguish between Optical Character Recognition (OCR) for printed text and Handwritten Text Recognition (HTR) for manuscripts, a look at the historical usage of these terms reveals a significant disparity. As the Google Books Ngram Viewer illustrates, the term "Optical Character Recognition" has a much longer history and is far more prevalent in published literature than "Handwritten Text Recognition" (as far as we can trust the Google Books corpus, of course).


Distribution of the terms "optical character recognition" (blue) and "handwritten text recognition" (red) in the Google Books Ngram Viewer corpus (1930-2022).

Despite this difference in usage, the distinction has served as a useful shorthand in the research community. "OCR" typically refers to extracting text from clean, printed sources, while "HTR" is used for the more complex task of transcribing handwritten lines (Hodel et al., 2014). More recently, ATR has emerged as a unifying term, particularly as the underlying technologies for both print and script have converged. Its recent appearance in specialized academic contexts means it has not yet achieved the widespread currency of OCR and is therefore not prominent enough to register significantly on the Ngram Viewer.

The Two Historical Pillars: OCR and HTR

To understand the power of modern ATR, it's helpful to look at its two foundational pillars and their conceptual origins. Long before digital computers, the challenge of automatically finding information in large document archives existed. One of the earliest and most ingenious solutions was Emanuel Goldberg's "Statistical Machine," developed in the 1920s to search through rolls of microfilm (Buckland, 1992). Goldberg's device used a form of analog pattern matching: a light beam passed through both the microfilm and a pre-made template mask of a specific character or word. When the character on the film aligned perfectly with the template, the light reaching a photocell would dip, signaling a match. This principle of comparing a document's patterns against a known template is the direct conceptual ancestor of the template-matching techniques used in early Optical Character Recognition.

Optical Character Recognition (OCR): The Standard for Print
OCR is a well-established technology designed specifically for the retro-digitization of printed text. It became increasingly popular in the 1960s and 1970s, especially in the postal sector, to recognize addresses on letters. The 1970s were formative years when computer science embraced OCR as a valid field of research (Mori, Suen and Yamamoto, 1992). Most modern scanners come with built-in OCR capabilities, allowing you to turn a physical book or document into a searchable digital file.

A Cautionary Tale: The Scanner Bug
While we often assume a scanned image is a perfect replica of the original, the digitization process itself can introduce subtle but critical errors. A famous example is the Xerox scanner bug, where the machine's compression algorithm would silently alter numbers in documents.
This video explains the issue and serves as a crucial reminder that the quality of the initial scan is the foundation for all subsequent text recognition.

Follow this LINK for the video.

The classic OCR process works in a series of steps:

  • Distinguishing Elements: The software first differentiates between graphic and text elements on the page.
  • Line and Character Segmentation: It then identifies the structure of the text, breaking it down into lines, words, and finally, individual characters.
  • Pattern Matching: Each character, now a small pixel grid (a raster graphic), is compared against a large index of existing character patterns. The most likely match, or "twin," from the index is then assigned to the character from the document (Fink, 2014).
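To give a feel for the pattern-matching step, here is a toy Python sketch (not taken from any OCR product) that compares a segmented character, represented as a small binary pixel grid, against a minimal template index and picks the closest "twin" by counting agreeing pixels.

import numpy as np

# Tiny 3x3 binary templates for two "characters" (1 = ink, 0 = background).
templates = {
    "I": np.array([[0, 1, 0],
                   [0, 1, 0],
                   [0, 1, 0]]),
    "L": np.array([[1, 0, 0],
                   [1, 0, 0],
                   [1, 1, 1]]),
}

# A segmented character from the scan, slightly noisy.
unknown = np.array([[0, 1, 0],
                    [0, 1, 0],
                    [0, 1, 1]])

# Classic template matching: pick the template with the most agreeing pixels.
best = max(templates, key=lambda name: (templates[name] == unknown).sum())
print(best)   # -> "I" (8 of 9 pixels agree, versus 4 for "L")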

This method proved highly effective for standardized, printed fonts. However, it struggled significantly with handwritten text, older fonts, or inconsistent printing, often producing unusable, illegible results (Su, Huang and Su, 2014).

Handwritten Text Recognition (HTR): The Challenge of Script
Handwritten text, with its cursive connections, inconsistent letterforms, and individual quirks, presented a much greater challenge (Fischer et al., 2012). Early attempts to apply OCR-like segmentation techniques to handwriting were not very successful and resulted in exceedingly high character error rates (Anwar, Kalra and Roy, 2022).

What's a "Character Error Rate" (CER)?

The Character Error Rate is a key metric for measuring the accuracy of a transcription model. It counts the minimum number of character edits (insertions, deletions, and substitutions) required to change the machine-generated text into the correct text, normalized by the length of the reference text (Morris, Maier and Green, 2004). A CER of 12% means that, on average, 12 out of every 100 characters in the transcript are incorrect. The lower the CER, the better the model's performance.
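To make the definition concrete, here is a minimal Python sketch that computes the CER with the standard edit-distance (Levenshtein) dynamic program; the function name and example strings are illustrative and not part of any particular ATR toolkit.

def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = (substitutions + deletions + insertions) / length of the reference."""
    # Dynamic-programming table for the Levenshtein (edit) distance.
    m, n = len(reference), len(hypothesis)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # i deletions
    for j in range(n + 1):
        dist[0][j] = j          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n] / max(m, 1)

# Example: two wrong characters in a 20-character reference -> CER = 0.10 (10%).
print(character_error_rate("the quick brown fox.", "the qvick brovn fox."))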
The breakthrough for HTR came with a fundamental change in technology, moving away from rigid character matching to more flexible methods.

The Modern Revolution: How Computers Learned to Read Context

Today's ATR technology has largely moved beyond the separate approaches of OCR and HTR, thanks to the power of deep learning and advanced neural networks (NN).

Instead of just matching the pixels of a single letter, modern systems analyze characters as part of a sequence, allowing them to understand context. This is accomplished through a combination of sophisticated components, often layered together:

  • Convolutional Neural Networks (CNNs): These networks scan the image to extract key visual features like edges, curves, and textures that define characters.
  • Recurrent Neural Networks (like BLSTM): These networks, specifically Bidirectional Long Short-Term Memory (BLSTM) layers, process the features sequentially, allowing them to consider the characters that came before and after. This helps the model make more accurate predictions—for example, it can learn that the letter 'u' often follows the letter 'q' in English.
  • Transformer Models & Self-Attention: The most recent breakthrough is the use of transformer architectures, which feature a "self-attention" mechanism. This allows the model to weigh the importance of all other characters in a sequence when interpreting a specific character, giving it a powerful understanding of context.

Because these modern models are trained on language context, their performance is language-specific. A model trained on German-language manuscripts will not perform as well on Italian texts, even if the script looks similar, because the underlying letter-prediction patterns are different (Puigcerver, 2017). This shift from segmented pixel-matching to contextual sequence prediction is why modern ATR is so powerful.


Early OCR (top) relied on matching the visual shape of isolated characters with Hidden Markov Models, which often failed with ambiguous or poorly preserved letters. Modern ATR with self-attention (bottom) examines a character in relation to its entire context, dramatically improving accuracy (illustrations taken from Ströbel (2023)).

A Closer Look: The READ Project

A pivotal moment for HTR arrived with the EU-funded READ (Recognition and Enrichment of Archival Documents) project, which ran from 2016 to 2019. By implementing neural networks, researchers dramatically improved accuracy. For some datasets, they cut the character error rate in half and thus significantly improved the intelligibility of historical documents (Michael, Weidemann and Labahn, 2018). This achievement was a turning point, demonstrating that correcting a machine-generated transcript was more economical and time-efficient than transcribing a document manually from scratch. The project also furthered the development of the online platform Transkribus, which has become one of the leading tools for digital text transcription (Colutto et al., 2019).

The Modern ATR Pipeline

Regardless of the specific terminology used or whether we deal with printed or handwritten documents, the fundamental goal remains the same: to transform a static image of a page into structured, machine-readable text. Modern ATR systems achieve this through a multi-stage process, often referred to as a pipeline or workflow. Although the technical details can vary, most platforms follow a series of core conceptual steps, beginning with preparing the image and identifying the text's structure, before moving on to the recognition itself and a final clean-up phase. The following diagram illustrates this standard pipeline, which forms the basis of most modern text recognition platforms.


The modern ATR pipeline turns a document image into searchable text. This process generally involves analyzing the page layout, recognizing the characters, and then outputting a digital transcript that can be corrected and used for research (image taken and adapted from Ströbel (2023)).

This pipeline can be broken down into four main phases, each with a specific objective:

  1. Image Preprocessing: Before any text can be read, the initial document image is optimized for the machine. This step involves automated tasks like correcting the orientation, reducing digital "noise" or background interference, and enhancing the contrast between the ink and the paper. The goal is to produce a clean, standardized image that is easier for the subsequent stages to analyze.
  2. Layout Analysis and Segmentation: In this phase, the system distinguishes between textual and non-textual elements, such as illustrations or page decorations. It then identifies the structure of the text, determining the layout of columns, paragraphs, and, most importantly, individual lines. These detected lines of text are then isolated and passed one by one to the recognition engine.
  3. Text Recognition: This is the core of the pipeline where the computer "reads" the text. Using, e.g., a trained neural network model, the system analyzes the image of each text line. It extracts visual features like curves and edges with components like Convolutional Neural Networks. It then processes these features sequentially, often using Recurrent Neural Networks, to understand the context of each character in relation to others in the line. The final output of this stage is a string of characters—the initial, raw transcript.
  4. Post-Processing and Output: The raw transcript is rarely perfect. This final stage aims to refine the output. It can involve automated methods, such as checking the transcribed words against a dictionary of the document's language or using (large) language models to correct common errors. It also includes preparing the text for human review, where a user can perform post-correction to achieve a highly accurate final transcript, which can then be exported in various formats for research.

It is crucial to understand that the Text Recognition phase (Step 03) is where the most significant technological variations occur. The methods used here have evolved dramatically over time (see Part 4: Brief History of ATR for a more comprehensive overview).

Learning Objectives

By the end of this part, you will be able to:

  • Explain how ATR accelerates historical discovery by improving the efficiency of transcription and indexing.
  • Evaluate the practical value of imperfect transcriptions and describe how tools like "fuzzy search" can be used to analyze them.
  • Articulate the long-term benefits of creating sustainable, reusable digital knowledge repositories through ATR.
  • Recognize the role of ATR projects in the digital preservation of endangered historical documents.

Now that we have explored what Automated Text Recognition is and traced its technological development, we can ask the most important question: Why should a historian or archivist care? Beyond the technical marvels of deep learning, ATR fundamentally changes the scale and scope of what is possible in historical research. It offers three key promises: accelerating discovery, enabling new ways of searching, and creating sustainable knowledge for the future.

💡 From Manual Labor to Accelerated Discovery

The most immediate benefit of ATR is a dramatic increase in efficiency. For centuries, accessing the contents of large archival collections required painstaking manual transcription—a process that could take years or even decades for a single project.

ATR pipelines replace this manual labor, making the indexing and searching of large historical text corpora significantly more time-efficient and cost-effective. This shift has a democratizing effect on research; projects with relatively low resources can now access and analyze vast, unedited text collections that were previously out of reach, allowing them to consult more sources to answer their specific questions.

At institutions like the University of Zürich, this is already happening in practice. Projects that once would have required decades of work are now completed in a matter of months. The Bullinger Digital project, for example, used ATR to automatically transcribe thousands of Reformation-era letters, while the PARES project will use it to digitize the extensive archives of philologist Gaston Paris. By automating transcription, these landmark projects can devote their valuable time to interpretation and analysis rather than manual data entry.

🔎 Embracing Imperfection: Finding Needles in Haystacks

A common concern with automated methods is that the transcriptions are not always 100% perfect. However, a key insight of modern historical research is that even "imperfect" transcriptions are incredibly useful. While a project focused on the close reading of a single document requires near-perfect accuracy, a project examining thousands of documents has different needs. For large-scale discovery, finding a transcription where 90-95% of the text is correctly recognized is already a massive advantage, as it almost always allows for the reliable identification of relevant passages, names, or places.


To work effectively with these imperfect transcripts, researchers can use powerful search tools that account for potential errors. The most common of these is "fuzzy search".

What is fuzzy search?
A fuzzy search is based on the Levenshtein Distance, which measures the number of character changes (insertions, removals, or substitutions) needed to get from one word to another.
How it works: If you search for the word "big" with a Levenshtein Distance of 1, the search will also return results for "bag," "bit," or "dig". 
Why it helps: This allows researchers to find terms of interest even if the ATR model has made a small error, such as misreading a "g" as a "q" or missing a character in a faded manuscript.
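As a small illustration, the sketch below implements the Levenshtein distance described above and uses it for a toy fuzzy search over a word list; the word list and the distance threshold are invented for the example.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, removals, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                       # removal
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b))) # substitution
        previous = current
    return previous[-1]

def fuzzy_search(query: str, words, max_distance: int = 1):
    """Return every word whose Levenshtein distance to the query is within the threshold."""
    return [w for w in words if levenshtein(query, w) <= max_distance]

# The example from above: searching for "big" with a distance of 1.
print(fuzzy_search("big", ["big", "bag", "bit", "dig", "bishop", "pig"]))
# -> ['big', 'bag', 'bit', 'dig', 'pig']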

🏛️ Building Sustainable Knowledge for the Future

Perhaps the most profound impact of ATR is its contribution to the creation of sustainable knowledge repositories. When a document is transcribed for a single project, that work is often siloed. By converting historical documents into machine-readable, computer-encoded formats, we create a resource that can be preserved and re-analyzed by future researchers with entirely new questions.

 
This creates two important responsibilities:

  1. Technical Sustainability: We must ensure the digital environments where this data is stored are stable and maintained for the long term. 
  2. Common Standards: For data to be truly reusable, it is crucial to adhere to common standards for transcription, data collection, and indexing. 

Finally, ATR complements the essential duties of preserving historical documents for posterity. The process of creating high-quality digital images for ATR simultaneously addresses the problem of paper decay, ensuring that a durable surrogate of the document survives long after the physical object has degraded.

Learning Objectives

By the end of this part, you will be able to:

  • Identify key historical milestones in the development of HTR since the late 1990s.
  • Explain the role of foundational methods like CTC and standardized datasets like the IAM database in advancing the field.
  • Describe the evolution of deep learning architectures from early Deep NNs and CRNNs to modern Transformers.
  • Distinguish between line-level HTR and more recent "beyond-line" or full-page recognition methods.

     

While ATR feels like a cutting-edge technology, its conceptual roots go back nearly a century. This brief history traces the key milestones, from early analogue concepts and the rise of OCR to the deep learning revolution that unlocked the potential of handwritten text recognition. For an overview, we orient ourselves on the following key events:

Timeline of milestones in Handwritten Text Recognition (HTR) from Garrido-Munoz et al.'s survey (2025). They categorize the main events into four levels: datasets and competitions (green), general methods/architectures (yellow), up-to-line models (red), and beyond-line models (blue).

Precursors: The Limits of Classic OCR

To appreciate the modern ATR revolution, one must first understand the limitations it overcame. The earliest concepts of automated recognition, like Emanuel Goldberg's analogue machine from the 1920s, were based on template matching: comparing an image of a character against a known, perfect template. This principle was digitized in the 1970s and became the foundation of classic Optical Character Recognition.

This approach, which involves segmenting a line into individual characters and matching each one to a font library, worked well for clean, machine-printed text. However, it failed catastrophically when applied to historical documents and especially handwriting. The reasons were fundamental: template matching cannot handle the immense variability of handwriting, cursive scripts where characters are connected, inconsistent layouts, or document degradation like ink bleed and stains. This created a technological wall, meaning a new, more flexible approach was required.

The Foundations (c. 1999–2008): Setting the Stage for Machine Learning

The modern HTR era began not with an algorithm, but with data. For machine learning models to be trained and evaluated fairly, they need high-quality, standardized datasets. The release of the IAM Handwriting Database around 2002 (Marti and Bunke, 2002) was a critical milestone, providing thousands of handwritten English sentences that gave researchers a common benchmark to measure their progress against.

With a benchmark in place, the next step was a new algorithm. The major breakthrough came in 2006 from Alex Graves et al. with Connectionist Temporal Classification (CTC). CTC is a type of loss function that allows a neural network to be trained on sequence data (like a line of text) without needing to be told exactly where each character begins and ends. It solves the segmentation problem by calculating the probability of all possible transcriptions for a given sequence. This invention, applied specifically to HTR in 2008, was the key that unlocked segmentation-free, line-based handwriting recognition.
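For readers curious about what this looks like in code, the snippet below attaches a CTC loss to per-timestep character probabilities using PyTorch's torch.nn.CTCLoss; the tensor shapes and the alphabet size are arbitrary toy values rather than settings from any published HTR system.

import torch
import torch.nn as nn

# Toy setup: 50 timesteps from the network, a batch of 2 line images,
# an alphabet of 26 characters plus the CTC "blank" symbol at index 0.
T, N, C = 50, 2, 27
log_probs = torch.randn(T, N, C).log_softmax(dim=2)   # network output per timestep

# Target transcriptions as label indices (1..26), 12 characters per line.
targets = torch.randint(low=1, high=C, size=(N, 12))
input_lengths = torch.full((N,), T, dtype=torch.long)    # all 50 timesteps are valid
target_lengths = torch.full((N,), 12, dtype=torch.long)  # each target is 12 characters

# CTC sums over all possible alignments, so no per-character segmentation is needed.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())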

The Deep Learning Takeover (c. 2009–2017): Building the Modern Architecture

The 2010s saw an explosion of progress, fuelled by the general deep learning wave and driven by academic competitions like the International Conference on Document Analysis and Recognition (ICDAR). The dominant architecture to emerge during this period was the Convolutional Recurrent Neural Network (CRNN), a powerful two-part system first popularized by Shi et al. in 2015 and refined for HTR by Puigcerver in 2017.

  • The "Eyes" (CNN): The first part is a Convolutional Neural Network. The input image of a text line is fed through a series of CNN layers, which act as a sophisticated feature extractor. These layers learn to identify relevant visual patterns—curves, edges, loops, and textures—at a local level, turning the raw pixels into a rich sequence of feature maps.
  • The "Brain" (RNN): This sequence of features is then fed into the second part, a Recurrent Neural Network (often a Bidirectional LSTM). The RNN processes this information sequentially, considering the context of features that came before and after, to predict the most likely character at each step.

This CRNN+CTC combination became the gold standard for HTR for several years and was also the backbone of the widely used and, at the time, much-hyped Transkribus system (Weidemann et al., 2018). Just as it was being perfected, however, the entire field of machine learning was about to be upended by the introduction of the Transformer architecture in 2017 (Vaswani et al.).

The Transformer Era (c. 2017–Present): Attention and the Push Beyond the Line

The Transformer architecture's key innovation is the self-attention mechanism. Unlike an RNN that processes information one step at a time, self-attention allows the model to look at all parts of the input sequence at once and calculate how relevant each part is to every other part. This parallel processing is not only faster, but often better at capturing complex, long-range dependencies within the data.

Researchers quickly adapted this powerful new architecture for text recognition. Models like Transformer for HTR (Kang et al., 2022) and TrOCR (Li et al., 2023) replaced the RNN component with a transformer, setting a new state-of-the-art for accuracy on line-level HTR.

Simultaneously, the power of these new architectures enabled researchers to pursue a more ambitious goal: Beyond-Line Recognition. This is the frontier of HTR, aiming to create end-to-end systems that can recognize entire paragraphs or full pages without needing a machine or a human to segment the text into lines first. Models like "Scan, Attend and Read" (Bluche et al., 2016) are pushing toward this goal, creating systems that can understand the 2D layout of a page in a much more holistic way.

The field, however, does not stand still. Most recently, research has begun to explore the potential of Large Language Models (LLMs) for HTR tasks. As demonstrated by Humphries et al. (2024), commercially available multimodal LLMs, originally designed for general language understanding and generation, are showing remarkable capabilities in transcribing historical handwritten documents. These models approach the task by leveraging their vast pre-existing knowledge and their ability to process and interpret both visual information (the document image) and textual information (the transcribed content) in a more integrated way. This emerging area suggests that the next wave of ATR tools might combine the specialized architectures developed for HTR with the broad contextual understanding and zero-shot or few-shot learning capabilities of LLMs, potentially offering new levels of accuracy and accessibility, especially for diverse and challenging historical manuscripts.

Learning Objectives

By the end of this optional part, you will be able to:

  • Differentiate between how humans and computers process the visual information of a document.
  • Explain how images are represented digitally using pixels and color channels (grayscale and RGB).
  • Describe the fundamental challenge a computer faces when trying to interpret a grid of pixel data.
  • Define early pattern-matching techniques and explain why they struggled with historical documents.

This chapter is an advanced, optional deep dive into the foundational concepts of how computers process images. Understanding these building blocks will provide a clearer picture of the challenges that early text recognition systems faced and clarify why the modern deep learning methods discussed elsewhere are so revolutionary.

Human vs. Machine: A Fundamental Difference in "Reading"

When humans read a script, the process feels automatic and intuitive. Our eyes follow a linear path—for Latin scripts, this is typically left-to-right—and we process letter combinations into words and sentences seamlessly.

A computer does not "read" in this way at all. It is a machine that executes processes based on binary code. To a computer, a scanned image of a historical document is not text; it is simply a vast collection of data. Before any recognition can happen, that image must be translated into a format the computer can process: a grid of pixels.

The World as a Grid: Pixels and Channels

Computers represent images digitally as a grid of tiny squares called pixels. Each pixel is assigned a numerical value that represents its specific brightness and color, with most standards using a range from 0 to 255. These values, arranged in a massive grid, create the final image. This can be done in two primary ways:

  • Grayscale: In the simplest model, each pixel has a single value representing its brightness, from black (0) to white (255). This creates an image with one "channel" of data.
  • RGB Color: To produce a color image, the computer uses three separate channels: one for Red, one for Green, and one for Blue (RGB). Each pixel has a value for each of the three channels. These three grids are then superimposed on top of each other to create the full-color image we see. 
An illustration of how a computer sees the letter "u" from a scan of a manuscript. On top, a single grayscale channel represents brightness. On the bottom, three separate Red, Green, and Blue (RGB) channels are combined to produce a color image.
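A tiny NumPy example makes this concrete: the arrays below stand in for a scanned patch, first as a single grayscale channel and then as three stacked RGB channels (the pixel values are invented for illustration).

import numpy as np

# Grayscale: one channel, values from 0 (black) to 255 (white).
gray_patch = np.array([[255, 250, 252],
                       [240,  30, 245],    # the dark pixel could be a stroke of ink
                       [250, 248, 251]], dtype=np.uint8)
print(gray_patch.shape)        # (3, 3) -> height x width, a single channel

# RGB: three separate channels stacked along a third axis.
red   = np.full((3, 3), 200, dtype=np.uint8)
green = np.full((3, 3), 180, dtype=np.uint8)
blue  = np.full((3, 3), 150, dtype=np.uint8)
rgb_patch = np.stack([red, green, blue], axis=-1)
print(rgb_patch.shape)         # (3, 3, 3) -> height x width x channels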

From a Grid of Numbers to Meaningful Patterns

After digitizing a document, the computer is left with a massive grid of numbers. At this stage, it remains completely clueless as to what this information represents. The computer has no inherent understanding of reading direction, nor can it distinguish between a line of text, a decorative border, or a coffee stain.

The fundamental task of all text recognition, therefore, is to teach the computer how to recognize meaningful patterns within this pixel grid and then map those patterns to specific characters.

The Core Problem

A computer doesn't see a "page"; it sees a matrix (i.e., the grid of pixels mentioned before) of numerical values. The challenge of ATR is to develop algorithms that can answer questions like: "Where on this grid is the text located?", "How are these pixels grouped into lines, words, and characters?", "Does this specific cluster of pixels represent an 'a', a 'b', or a 'c'?"

Early Attempts at Pattern-Matching

Until a few years ago, engineers attempted to solve this problem with two main approaches, neither of which was very successful for complex historical documents.

  • Segmentation and Template Matching: The most intuitive approach was to try and replicate classic OCR methods. The algorithm would first attempt to segment the image, breaking it down into what it believed were individual characters, e.g., based on pixel densities (see figure below). Then, it would compare the pixel pattern of each segmented character to a pre-defined library of character templates, looking for the best match. While this worked for clean, printed text, the immense variation in handwriting made this approach unreliable.
    Different segmentation strategies based on minima and maxima of pixel densities (orange curve) of the handwritten Latin word literas (Eng. letters) and the blackletter print Monat (Eng. month). Illustration taken from Ströbel (2023).
  • Statistical Models (HMMs): Another approach used statistical models, most notably Hidden Markov Models (HMMs). An HMM attempts to find the most probable sequence of hidden states (the actual letters of the text) that would result in the sequence of observed states (the pixel features on the page). While more flexible than rigid template matching, HMMs also struggled to model the complexity of handwritten documents, leading to exceedingly high character error rates that were not practical for real-world use.
    An illustration of the HMM process for OCR (given correct segments). In blue: the observed layer (i.e., the pixel grids of the different segments). In orange: the hidden layer, i.e., the characters the model needs to infer based on the observed layer. The transition probabilities are the inherent language model of an HMM, stating how probable it is that an 'i' follows an 'l', etc.

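A minimal sketch of the pixel-density (projection-profile) idea behind the segmentation strategies described in the first bullet above: sum the ink in each pixel column and treat empty columns as candidate cut points between characters. The binary "image" is a toy array, not a real scan.

import numpy as np

# Toy binarized line image: 1 = ink, 0 = background (two "characters" with a gap).
line = np.array([
    [0, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 1, 0],
])

# Vertical projection profile: amount of ink in each pixel column.
profile = line.sum(axis=0)
print(profile)                 # [0 3 2 1 0 0 3 2 3 0]

# Columns without ink are candidate segmentation points between characters.
cut_candidates = np.where(profile == 0)[0]
print(cut_candidates)          # column indices of possible cuts: [0 4 5 9]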

The limitations of these early methods made it clear that a more powerful approach was needed—one that could learn the patterns of handwriting directly from examples rather than relying on pre-defined templates or simpler statistical models. This need paved the way for the deep learning revolution.

Learning Objectives

By the end of this optional part, you will be able to:

  • Explain the fundamental role of deep learning and neural networks in modern ATR.
  • Describe the function and interplay of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs/LSTMs) in typical HTR models.
  • Define the "self-attention" mechanism in transformer models.
  • Compare and contrast the sequential processing of RNNs with the contextual processing of transformers.

Having explored how computers digitize documents and the limitations of early pattern-matching, this advanced chapter dives into the core technologies that power today's sophisticated ATR systems: neural networks and their evolution to transformer architectures. These learning-based systems have revolutionized the field.

The Deep Learning Revolution in Text Recognition

The foundations of newer Automated Text Recognition algorithms are neural networks and deep learning. Deep learning is a branch of machine learning based on artificial neural networks with multiple layers, enabling the recognition of complex patterns and correlations in large amounts of data. Instead of relying on predefined rules or templates, these models learn to recognize text directly from examples.

Key Neural Network Components: The CRNN Era

For several years, the dominant architecture for ATR was the Convolutional Recurrent Neural Network (CRNN), which combines the strengths of two specialized types of neural networks, often linked in a cascade. This approach marked a significant step forward from earlier attempts.

  1. Convolutional Neural Networks (CNNs) – The "Eyes" of the System
    CNNs are designed to process grid-like data, such as images. In an HTR model, CNNs extract local visual features, like edges and textures, from the input images of text lines with the help of so-called kernels. You can think of kernels as different glasses you put on. Each pair of glasses lets you discover different features in the image. These features are crucial for distinguishing characters. Often, max-pooling layers are used after CNN layers to reduce the dimensions of the feature maps, which helps highlight the most salient features and reduce computational load.
    Inside a CNN – Kernels at Work
    Convolutional Neural Networks use specialized filters called kernels to scan an image and detect specific features. Here is a simplified look at how this fundamental operation, known as convolution, works:
    - The Kernel: A kernel is a small matrix of numbers, often 3x3 pixels. Each kernel is designed to detect a very specific low-level feature, like a vertical edge, a horizontal edge, or a particular texture.
    - The Convolution: The kernel systematically "slides" across the input image (which, as established in the previous chapter, is a grid of pixel values). At each position, the kernel overlays a small patch of the image.
    - Feature Calculation: An element-wise multiplication is performed between the kernel's values and the values of the image patch it is currently covering. All the results of these multiplications are then summed up to produce a single numerical value.
    - The Feature Map: This single value becomes one pixel in a new image called a feature map. As the kernel scans the entire input image, it generates a complete feature map that highlights where the specific feature detected by that kernel is most prominent.
    - Multiple Perspectives: A CNN employs many different kernels, each creating its own feature map. This allows the network to extract a diverse set of features from the input image. Deeper layers in the CNN then learn to combine these elemental features to recognize more complex patterns essential for character identification.

    To see interactive examples of how different kernels affect an image, you can explore this LINK.
  2. Recurrent Neural Networks (RNNs/LSTMs) – Understanding Sequence and Context
    While CNNs identify visual features, Recurrent Neural Networks are designed to handle sequential data. Specifically, Bidirectional Long Short-Term Memory Networks (BLSTM) layers process the sequence of features extracted by the CNNs. BLSTMs consider long-term dependencies and can capture the context of each character based on the characters that come both before and after it in the sequence. The BLSTM layers work with a vectorized version of the features extracted by the CNN (see figure below).
  3. The CRNN Architecture – Combining Strengths
    In a CRNN model, the features extracted by the CNNs and max-pooling layers serve as sequential input for the BLSTM layers. This interplay of CNNs (for feature extraction) and LSTMs (for sequence processing) enables the HTR system to accurately recognize handwritten texts by combining both local image features and broader sequential information to interpret characters and words correctly. Historically, Bidirectional Recurrent Neural Networks (BRNN) were first used, which were then often superseded by combinations involving CNNs. 
A typical Convolutional Recurrent Neural Network (CRNN) architecture (taken from Granell et al. (2018)). CNNs extract visual features from the text line, and LSTMs then process these features sequentially to determine the most likely character sequence.
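As a rough illustration of this cascade, here is a deliberately small PyTorch sketch of a CRNN for text-line images: a few convolution and max-pooling layers extract feature maps, each feature-map column is turned into a vector, and a bidirectional LSTM produces per-timestep character scores. The layer sizes and alphabet size are arbitrary and do not correspond to any published model.

import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Minimal CRNN: CNN feature extractor + bidirectional LSTM + per-step classifier."""
    def __init__(self, num_chars: int = 80):
        super().__init__()
        # The "eyes": convolutions + max-pooling turn the line image into feature maps.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # The "brain": a BLSTM reads the feature columns left to right and right to left.
        self.blstm = nn.LSTM(input_size=64 * 8, hidden_size=128,
                             bidirectional=True, batch_first=True)
        # Per-timestep character scores (including a CTC blank symbol).
        self.classifier = nn.Linear(2 * 128, num_chars + 1)

    def forward(self, line_image: torch.Tensor) -> torch.Tensor:
        # line_image: (batch, 1, 32, width), e.g. a grayscale text line of height 32.
        features = self.cnn(line_image)                             # (batch, 64, 8, width/4)
        b, c, h, w = features.shape
        seq = features.permute(0, 3, 1, 2).reshape(b, w, c * h)     # one vector per column
        seq, _ = self.blstm(seq)                                    # contextualized columns
        return self.classifier(seq)                                 # (batch, steps, chars+1)

model = TinyCRNN()
dummy_line = torch.randn(1, 1, 32, 128)   # one fake line image, height 32, width 128
print(model(dummy_line).shape)            # torch.Size([1, 32, 81])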

The Transformer Era – A Paradigm Shift with Self-Attention

Nowadays, most state-of-the-art models are based on transformer architectures. While CRNNs marked a significant advance, transformers offer a distinct approach to understanding context within a sequence.

The key advantage of transformers is their use of self-attention mechanisms.

  • What is Self-Attention? Instead of processing a sequence step-by-step, like an RNN (see above), a self-attention mechanism allows the model to consider the specific context of a character (or sign) by simultaneously examining its relationship to all other characters in the entire sequence. For each character it processes, the model can weigh how much importance it should assign to every other character in the line, regardless of their distance.

Understanding Self-Attention
Imagine the model trying to decipher a tricky character. A self-attention mechanism allows it to ask: "To understand this specific character, which other characters in this entire line are most informative?" It might "pay more attention" to characters far away if they provide crucial context, something traditional RNNs struggle with over very long sequences. This allows for a more global understanding of the line, helping to calculate the next probable letter more accurately.

For a more detailed explanation, please follow this LINK.
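To show the mechanism itself, here is a minimal NumPy sketch of scaled dot-product self-attention: every position in the sequence is compared with every other position, and the resulting weights determine how much each position contributes to the new representation. The vectors are random toy data, and real models additionally learn separate query, key, and value projections.

import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention over a sequence of feature vectors.

    x has shape (sequence_length, d); for simplicity the same vectors serve
    as queries, keys, and values.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)              # similarity of every position to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: attention weights per position
    return weights @ x                         # each output is a weighted mix of all positions

sequence = np.random.rand(10, 16)   # 10 "characters", each a 16-dimensional feature vector
print(self_attention(sequence).shape)   # (10, 16)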

This ability to weigh the relevance of all parts of the sequence makes transformers highly effective. Building on the original Transformer architecture (Vaswani et al., 2017), several key developments led to its successful application in image-based text recognition. A common "recipe" for creating powerful models like TrOCR (Li et al., 2021) involves the following "ingredients":

  1. The Base Transformer: This is the original sequence-to-sequence architecture designed for tasks like machine translation, excelling at understanding context through self-attention.
  2. The Vision Transformer (ViT): The first crucial adaptation was enabling transformers to "see." The ViT basically replaces the CNN in the traditional CRNN models for ATR. The ViT model achieves this by breaking an input image into a series of fixed-size, non-overlapping patches. These patches are then treated like words in a sentence—flattened into vectors and fed as a sequence into a Transformer encoder. This allows the model to learn relationships between different parts of the image.
    The Vision Transformer (ViT) architecture divides an input image into patches, which are then processed as a sequence by a Transformer encoder (image taken from Dosovitskiy et al. (2020)).
  3. BEiT (BERT Pre-training of Image Transformers): To make Vision Transformers even more powerful, a pre-training strategy called BEiT (Bao et al., 2021) was introduced. Inspired by BERT's masked language modeling for text, BEiT uses masked image modeling. During the so-called pre-training phase, some patches of an input image are randomly masked (hidden), and the model is trained to predict these missing patches based on the surrounding unmasked ones. This forces the model to learn rich and robust visual representations from vast amounts of unlabeled image data.
    BEiT pre-trains Vision Transformers using a "masked image modeling" approach, where the model learns to predict missing patches of an image, forcing it to learn general visual representations (image taken from Bao et al. (2021)).
  4. The Combination (e.g., TrOCR): Finally, effective models like TrOCR combine these elements. They typically use a Vision Transformer (often pre-trained with a BEiT-like strategy) as an image encoder. This encoder processes the image of a text line and converts it into a contextualized representation. This representation is then fed to a standard Transformer text decoder, which generates the recognized text sequence character by character.

    The TrOCR model architecture combines a pre-trained Vision Transformer as an image encoder and a Transformer as a text decoder to achieve end-to-end text recognition (image taken from Li et al. (2021)).
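In practice, such a model can be tried out in a few lines with the Hugging Face transformers library. The snippet below sketches line-level inference with a publicly available TrOCR checkpoint; the image path is a placeholder, and checkpoint names and library versions may change over time.

from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load a pre-trained TrOCR checkpoint (ViT-style encoder + Transformer text decoder).
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# A single text-line image; replace the path with one of your own scans.
line_image = Image.open("my_text_line.png").convert("RGB")

# The processor resizes the line and prepares it for the patch-based encoder;
# the decoder then generates the transcription token by token.
pixel_values = processor(images=line_image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)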

This sophisticated combination of architectures and pre-training strategies allows modern ATR models to understand not just the pixels but also the deeper contextual and semantic relationships in the text, leading to significant gains in accuracy. Consequently, semantic relationships become relevant; for instance, a model trained primarily on texts in a certain language will not perform as well on the same script when it is used for a different language, because the character predictions and contextual relationships operate differently across languages. However, experiments have shown that only a small amount of data is necessary to adapt, e.g., TrOCR to different manuscripts and different languages (see, e.g., Ströbel et al., 2022).

 
The journey from early neural networks to sophisticated CRNNs and now to powerful transformer models demonstrates the rapid innovation in the ATR field. These advanced architectures are what enable modern systems to tackle the immense variability of historical documents with ever-increasing accuracy.

Resources

Software

Datasets

Bibliography

Bao, H., Dong, L., Piao, S. and Wei, F., 2021. BEiT: BERT Pre-Training of Image Transformers. arXiv preprint arXiv:2106.08254. Available at: https://doi.org/10.48550/arXiv.2106.08254.

Bluche, T., Louradour, J. and Messina, R., 2016. Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention. arXiv preprint arXiv:1604.03286. Available at: https://doi.org/10.48550/arXiv.1604.03286.

Colutto, S., Kahle, P., Guenter, H. and Mühlberger, G., 2019. Transkribus. A Platform for Automated Text Recognition and Searching of Historical Documents. In: 15th International Conference on eScience, pp. 463-466. Available at: https://doi.org/10.1109/eScience.2019.00060.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S. and Uszkoreit, J., 2020. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929. Available at: https://doi.org/10.48550/arXiv.2010.11929.

Garrido-Munoz, C., Rios-Vila, A. and Calvo-Zaragoza, J., 2025. Handwritten Text Recognition: A Survey. arXiv preprint arXiv:2502.08417. Available at: https://doi.org/10.48550/arXiv.2502.08417.

Granell, E., Chammas, E., Likforman-Sulem, L., Martínez-Hinarejos, C.D., Mokbel, C. and Cîrstea, B.I., 2018. Transcription of spanish historical handwritten documents with deep neural networks. Journal of Imaging, 4(1), p.15. Available at: https://doi.org/10.3390/jimaging4010015.

Graves, A., Fernández, S., Gomez, F. and Schmidhuber, J., 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369-376. Available at: https://doi.org/10.1145/1143844.114389.

Hodel, T.M., Schoch, D.S., Schneider, C. and Purcell, J., 2021. General Models for Handwritten Text Recognition: Feasibility and State-of-the-Art. German Kurrent as an Example. In: Journal of Open Humanities Data, 7(13), pp.1-10. Available at: https://doi.org/10.5334/johd.46.

Humphries, M., Leddy, L.C., Downton, Q., Legace, M., McConnell, J., Murray, I. and Spence, E., 2024. Unlocking the Archives: Large Language Models Achieve State-of-the-Art Performance on the Transcription of Handwritten Historical Documents. Available at: https://dx.doi.org/10.2139/ssrn.5006071.

Kang, L., Riba, P., Rusiñol, M., Fornés, A. and Villegas, M., 2022. Pay Attention to What You Read: Non-Recurrent Handwritten Text-Line Recognition. In: Pattern Recognition, 129, p.108766. Available at: https://doi.org/10.1016/j.patcog.2022.108766

Li, M., Lv, T., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z. and Wei, F., 2021. TrOCR: Transformer-Based Optical Character Recognition with Pre-Trained Models. arXiv preprint arXiv:2109.10282. Available at: https://doi.org/10.48550/arXiv.2109.10282.

Marti, U.V. and Bunke, H., 2002. The IAM-Database: An English Sentence Database for Offline Handwriting Recognition. In: International Journal on Document Analysis and Recognition, 5, pp.39-46. Available at: https://doi.org/10.1007/s100320200071.

Puigcerver, J., 2017. Are Multidimensional Recurrent Layers Really Necessary for Handwritten Text Recognition?. In: 2017 14th IAPR International Conference on Document Analysis and Recognition, pp. 67-72. Available at: https://doi.org/10.1109/ICDAR.2017.20.

Shi, B., Bai, X. and Yao, C., 2016. An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and its Application to Scene Text Recognition. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11), pp. 2298-2304. Available at: https://doi.org/10.1109/TPAMI.2016.2646371.

Ströbel, P.B., Clematide, S., Volk, M. and Hodel, T., 2022. Transformer-Based HTR for Historical Documents. arXiv preprint arXiv:2203.11008. Available at: https://doi.org/10.48550/arXiv.2203.11008.

Ströbel, P.B. (2023) Flexible Techniques for Automatic Text Recognition of Historical Documents. Dissertation, University of Zurich. Available at: https://doi.org/10.5167/uzh-234886.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention Is All You Need. In: Advances in Neural Information Processing Systems, 30. Available at: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

Weidemann, M., Michael, J., Grüning, T. and Labahn, R., 2018. HTR Engine Based on NNs P2: Building Deep Architectures with Tensorflow. Technical report. Available at: https://read.transkribus.eu/wp-content/uploads/2018/12/D7.9_HTR_NN_final.pdf.