5: ATR Models

What You Will Learn

By the end of this introduction and the subsequent pages, you will be able to:

  • Understand the difference between small and large models and how this influences the quality of an ATR-transcription
  • Define the Character Error Rate (CER) metric and use it to decide if a model works for your corpus
  • Understand how ATR models are trained from scratch or re-trained so that an existing model fits your own corpus better
  • Decide when to use off-the-shelf models versus custom-trained ones for your specific collection

This chapter takes a closer look at ATR models. The aim is to show you how to select and/or train your models, and how you can decide whether a model works well for your documents, taking your specific project goals into account.

ATR models are what enable an ATR platform or tool to transcribe documents into machine-readable text. Put simply, they translate the pixel values of an image into characters. Since handwritten scripts vary greatly between periods, geographical regions, and languages, but also between levels of expertise and writing purposes, it is important to select models trained on data that is as close as possible to your own documents, such as your historical source material.

During the selection process, we therefore look for models that…
… were trained with textual data from a similar period.
… originate from the same language area.
… if specified, are based on a similar script type (sometimes models are also accompanied by an example image of the script used for training).
… if available, are based on data from the same scribe/scribal school from which the sources to be examined originate.
… report a low Character Error Rate (CER, more on that below).

Here you can find a site with digital transcription tools that could be useful for your project.

In general, it is advisable to look for and, if found, prioritize existing models that are based on as much writing data as possible from the period of interest. The most reliable way to check whether a big model fits a specific text corpus is to apply it to samples and see to what extent the text passages are recognized correctly. Some providers even supply tools for picking random sample lines to check for accuracy in recognition.

Some big models have been trained with handwritten material from several centuries and large geographical areas. It is worth testing these, although it is not always clearly stated which scripts and languages – or which sources – they were trained on. They are therefore not always guaranteed to be a good match for your sources, but they can be a good starting point. They are often called "Giant", "Titan", or similar. If a bigger model does not provide good enough results, smaller, more specific models can be tested and compared. Often, a well-fitting small model works more precisely than a larger model. Larger models, however, have the advantage of covering more scripts, or at least more hands, and are therefore a good starting point for many corpora.

It is important to note that, no matter how well a model fits a script, the ability to read manuscripts without the help of a transcription program remains essential for historians. The skill of reading is irreplaceable and necessary to evaluate the accuracy of existing models, create high-quality transcriptions for solid ground truth (the basis of every successful ATR model), or manually correct the results of ATR transcription after recognition.

When testing a model, pay attention to the Character Error Rate (CER) metric. The CER measures the percentage of incorrectly recognized characters in relation to the total number of characters in the reference transcription. A low CER indicates that the model recognizes the handwriting accurately. For a more precise analysis, for example to determine where in the ATR output the errors occur, use tools such as CERberus.
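
To make the metric concrete, here is a minimal sketch of how a CER value can be computed by comparing an ATR output line against its reference transcription. The example lines are invented, and the plain edit-distance implementation is only meant to illustrate the calculation, not to replace dedicated tools such as CERberus.

```python
# Minimal sketch: computing the Character Error Rate (CER) for one line.
# The reference and hypothesis strings below are invented examples.

def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or word lists):
    the minimum number of substitutions, deletions, and insertions needed
    to turn `hyp` into `ref`."""
    m, n = len(ref), len(hyp)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

def cer(reference: str, hypothesis: str) -> float:
    """CER = character edit distance / number of characters in the reference."""
    return edit_distance(reference, hypothesis) / len(reference)

reference = "In the yeare of our Lord"   # manually verified ground truth (invented)
hypothesis = "In the ycare of our Lord"  # ATR output with one misread character
print(f"CER: {cer(reference, hypothesis):.2%}")  # -> 4.17% (1 error in 24 characters)
```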

A further metric is the Word Error Rate (WER). It measures the percentage of incorrectly recognized words in relation to the total number of words. A low WER indicates that the model correctly transcribes the handwriting into typed words. The problem with the WER, however, is that there is no exact, fixed definition of a word, and that it does not show how wrong a word is: a word may still be readable if only one character in it is recognized incorrectly, yet the WER in that case would be the same as for a word that is completely illegible. It is therefore more accurate to evaluate the success of a model with the CER.
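
The difference between the two metrics can be illustrated with a short sketch that reuses the edit_distance and cer functions from the CER example above, this time applied to whitespace-separated words. The sample lines are again invented: a word with a single misread character and a completely garbled word produce the same WER, while the CER still distinguishes them.

```python
# Minimal sketch: Word Error Rate on whitespace-separated tokens, reusing
# edit_distance() and cer() from the CER sketch above. Lines are invented.

def wer(reference: str, hypothesis: str) -> float:
    """WER = word edit distance / number of words in the reference."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

reference   = "anno domini millesimo"   # ground-truth line (invented)
hyp_minor   = "anno domini milesimo"    # one missing character: still readable
hyp_garbled = "anno domini xxxxxxx"     # last word completely illegible

print(wer(reference, hyp_minor))     # 0.33 – one of three words is wrong
print(wer(reference, hyp_garbled))   # 0.33 – same WER despite a much worse error
print(cer(reference, hyp_minor))     # ~0.05 – the CER tells the two cases apart
print(cer(reference, hyp_garbled))   # ~0.43
```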

These metrics help you decide whether a model works well on your documents. Keep in mind that, depending on your reading skills, you may also be able to judge for yourself how well a model works.

Many current ATR models work best with documents in the same language they were trained on, because they rely on built-in language models that estimate the probability of the next character in a sequence (the 'predicted next character'). It is therefore most promising if the model's training data and the sources to be transcribed are in the same language. There are also multilingual ATR models, trained on large amounts of text data in multiple languages, that work quite well. However, new problems arise with these models, such as overcorrection and/or a worsening of results when the language model does not work well with the language in your corpus. This problem arises most often with pre-modern languages whose spelling varies widely.
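
As a toy illustration of the 'predicted next character' idea (not the actual mechanism of any particular ATR platform, which typically uses far more sophisticated neural language models), the following sketch estimates next-character probabilities from a small, invented training text. It only shows the principle that some continuations are much more probable than others, which is also why a language model trained on the wrong language or spelling can overcorrect.

```python
# Toy character-level bigram model: which character tends to follow which?
# The training text is an invented example, not real training data.
from collections import Counter, defaultdict

training_text = "the yeare of our lord the thirde daye of maye"

# Count, for every character, which characters follow it and how often.
following = defaultdict(Counter)
for prev, nxt in zip(training_text, training_text[1:]):
    following[prev][nxt] += 1

def next_char_probability(prev: str, nxt: str) -> float:
    counts = following[prev]
    total = sum(counts.values())
    return counts[nxt] / total if total else 0.0

# In this training data an 'e' after 'h' is far more likely than an 'x',
# so a recognizer guided by such a model would prefer 'the' over 'thx'.
print(next_char_probability("h", "e"))  # 0.67
print(next_char_probability("h", "x"))  # 0.0
```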

Generally, ATR models struggle with names of people and places, as these vary the most. The more standardized the spelling in your documents and in the training data of the model you use, the fewer errors will be made during text recognition. The opposite is equally true, which means that many pre-modern (especially non-Latin) documents will produce higher CERs.

The design and scope of your project determine what constitutes a good (enough) model. Even 'imperfect' transcriptions often allow for reliable identification of relevant text passages in large text corpora, at least to a certain extent. Therefore, if your goal is to find specific concepts in a big corpus, a considerably higher CER may be acceptable than in a project that examines only a few documents, where you want a transcription that is accurate in terms of characters and content. To understand this better, remember the four dimensions with the methodological aspects of a broad versus narrow research question and of distant versus close reading. More often than not, a transcription in which 90% of the text has been correctly recognized already constitutes a great find.

Different tools have been developed to address the problem that text recognition is never entirely error-free. In contrast to a conventional, literal 'text search', such search aids are based on calculated probabilities, which indicate the likelihood that a term in the document corresponds to the term we have searched for. A well-known example of such a search aid is the "fuzzy search". A fuzzy search is based on the so-called Levenshtein distance, which counts how many characters must be replaced, removed, or added to turn one word into another. If you search for the term "big" with a Levenshtein distance of 1, the results will also show the term "bag". If you then look in the original document, you can check whether it actually reads "bag" or "big". The number 1 indicates that one character had to be changed to get from "big" to "bag". A similar example in German would be "Kind" and "Rind", whereas the change between the English "pear" and "peel" corresponds to a Levenshtein distance of 2.
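
As a rough sketch of how such a fuzzy search could work in practice, the following reuses the edit_distance function from the CER example above and returns every word of a made-up transcription that lies within a chosen Levenshtein distance of the search term.

```python
# Minimal sketch of a fuzzy search over an ATR transcription: return every
# word whose Levenshtein distance to the search term is at most `max_distance`.
# Reuses edit_distance() from the CER sketch above; the transcription is invented.

def fuzzy_search(text: str, query: str, max_distance: int = 1) -> list[str]:
    return [word for word in text.split()
            if edit_distance(query, word) <= max_distance]

print(edit_distance("big", "bag"))    # 1 – one character substituted
print(edit_distance("Kind", "Rind"))  # 1
print(edit_distance("pear", "peel"))  # 2

transcription = "the bag was big but the peel was small"
print(fuzzy_search(transcription, "big"))  # ['bag', 'big'] – both worth checking in the original
```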

To summarize, when deciding if a model works for your documents, consider the following key factors:

  • Character Error Rate (CER)
  • your preexisting reading and comprehension skills
  • the quantity of documents to be transcribed
  • your goals for the transcription

Keep time efficiency in mind and ask yourself: what is good enough for your project?

Since training your own model is not always an option, it is best to first test different pre-existing models to determine which one (if any) best suits the documents, scripts, and language being examined. Test your ATR model for accuracy, speed, and adaptability to your project's requirements.

To train an ATR model, you first need images of the pages in your corpus that you want to transcribe, together with a sample of already transcribed text from the same corpus; this pairing of images and correct transcriptions forms the so-called ground truth. The ground truth is the data basis on which an ATR model is trained by means of a neural network (NN). During training, the model learns to recognize and interpret patterns and features within the digitized handwriting and to match the detected pixels with the given transcription.

When working with neural networks, we do not teach the 'machine' how to process the data; instead, we specify the input and the desired result, and the 'machine' trains itself how to get from the one to the other.

Generally, a maximum of 80% of the manually transcribed text corpus (the ground truth) is used to train the model, while the remaining 20% is used to validate and evaluate the trained model. This data should cover all fonts, styles, and hands present in your corpus, in order to improve the robustness of your model. A minimal sketch of such an 80/20 split is shown below.
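
In the sketch, the file names and the (image, transcription) pairing are invented placeholders, since the actual ground-truth format depends on the ATR platform you use.

```python
# Minimal sketch: splitting manually transcribed ground-truth lines into an
# 80% training set and a 20% validation set. File names are invented.
import random

ground_truth = [
    ("page_001_line_01.png", "In the yeare of our Lord"),
    ("page_001_line_02.png", "the thirde daye of Maye"),
    # ... one (image, transcription) pair per manually transcribed line
]

random.seed(42)                    # fixed seed so the split is reproducible
shuffled = ground_truth[:]
random.shuffle(shuffled)

split = int(0.8 * len(shuffled))
train_set = shuffled[:split]       # at most 80% used to train the model
validation_set = shuffled[split:]  # remaining lines held back for validation

print(len(train_set), "training lines,", len(validation_set), "validation lines")
```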

Through the validation process, the neural network compares its predictions with the ground truth data in order to minimize errors. Generally, the more training lines are provided, the better a model becomes. This process often requires several iterations to improve the model's performance: new material is added and tested, errors are corrected, and the model is trained again.

After training, the model is validated and tested to ensure that it achieves high accuracy on new, unseen data. Examples of such models can be found on the Transkribus platform, which can be used to automatically transcribe handwriting after being trained on a variety of digitized manuscripts. These models can also serve as a foundation for a model of your own through fine-tuning, which is usually more time-efficient than building a model from the ground up.

For more detailed instructions on how to train a model within a specific ATR environment, please refer to the instructions of the respective providers, which are linked here.

So far, we have only considered models that transform pixels into characters of a computer script, the so-called text recognition models. 

One level deeper, and as a kind of preliminary step to this text recognition, different models handle layout recognition. Put simply, this feature recognizes text regions and the so-called 'base line' of each line, which then serve as reference points for the subsequent text recognition.

These layout recognition models are already integrated into some platforms, while others require the user to carry out this step themselves. On some platforms, users can simply start a text recognition request, and in that process text regions and base lines are first added before the actual text recognition is carried out. On other platforms, the user needs to run a layout recognition first, using dedicated layout recognition models (similar to the text recognition models); only afterwards can the usual text recognition process begin.

Expert Client, Transkribus:
As an example, the so-called "Expert Client", the desktop version of Transkribus, used to work with this two-step process: the user needed to run a layout recognition before being able to use the text recognition models. The Expert Client was discontinued around 2024, and the team behind Transkribus now only promotes the online version, "App Transkribus". As this online version currently (summer 2025) has layout recognition built in, the two-step process is no longer needed. However, platforms change quickly, which is why we recommend checking the platform-specific instruction manuals and their Frequently Asked Questions (FAQ) sections.

For many documents, the layout recognition models usually provided work quite well. Depending on the tool, you can correct individual poorly recognized text fields or base lines by hand (how to do this depends on the specific platform; please look for instructions there). If your documents have unusual formats or different text orientations, and especially if these challenges occur repeatedly across many documents in your corpus, it may be worth training your own layout recognition models. For the appropriate procedure, it is best to look for instructions from the relevant platforms.