
Use of ATR: Overview

What You Will Learn

By the end of this introduction and the subsequent pages, you will confidently be able to:

Define the four key dimensions when working with ATR
Explain how these four key dimensions influence approaches to working with ATR, especially considering time efficiency

This chapter outlines the requirements for successfully applying Automated Text Recognition (ATR) to different kinds of text corpora in historical research. We establish four dimensions that should be taken into consideration even before starting with text recognition. These dimensions are:

Heterogeneity of Hands
Amount of Text
Research Question
Method

After these four dimensions have been explained, the last section adds some further aspects that should be considered when working with ATR.

One key point we would like to convey is time efficiency. It is technically possible to recognize a very large corpus of thousands or millions of pages, correct all of these pages manually and then look only at specific sections. In practice, however, and especially for shorter projects such as seminar papers or even Master's or PhD projects, we strongly recommend approaching your corpus differently. The four dimensions mentioned above and explained below are our way of approaching this complex topic: they give you some questions to ask yourself before starting and some tips on how to go about your project.

Four Key Dimensions When Working with ATR

Visual representation of the four key dimensions when working with ATR.
Image: Lorenz Dändliker, 2024.

After considering various research scenarios in which ATR was applied to different text corpora, the following four criteria proved to be the most vital:

heterogeneity of hands
amount of text
research question
method

The following passages outline each criterion in more detail, highlight its importance when using ATR and relate the criteria to one another. Please note that overlaps may occur in theory as well as in practice.

The figure above shows these four dimensions on a scale from low to high (heterogeneity of hands), small to large (amount of text), narrow to broad (research question) and close to distant (method). We use red dots to visualize the approximate value on each scale when evaluating specific projects and their corpora (you can find examples of this in the following chapter).

1: Heterogeneity of Hands

As the name implies, 'heterogeneity of hands' refers to the number of different 'hands' (that is, the number of people who wrote the text) and thus the number of writing styles found in a corpus. The more regular the writing styles in a corpus are, the easier it is for humans, as well as for ATR models, to recognize the recurring characters correctly and transcribe them into machine-readable text in a modern script. Conversely, the greater the variety of hands in a text corpus, the more demanding it is to read, for humans and computers alike. The difference between printed (regular) and handwritten (more varied) script is therefore a key distinction, with ATR models for printed scripts reaching very good results with very little effort.

When using ATR, we must consider the following: the greater the variety of hands in the corpus, the larger an ATR model needs to be (in the sense of having been trained on more material) to achieve suitable results. It is crucial to bear in mind that the difficulty of finding a suitable ATR model depends not only on the heterogeneity of the individual characters in the corpus, but also on the type of characters. This is because the same character can be written very differently in handwriting, even when it is produced by the same hand. It is precisely because of these variations that a lot of training material is required for an ATR model to learn to read handwriting as accurately as possible.

How to Select, Refine and Train ATR-Models
In the fifth chapter of this module, you will learn more about ATR models and how to choose them. If the heterogeneity of hands in your corpus is high, for example because your documents vary widely in origin, age or language, it is best to start with a bigger model or to break your corpus down into smaller, more homogeneous parts and choose a fitting ATR model for each of them. Corpora can also have a low heterogeneity of hands, either because all your documents are in a similar script, written by the same hand or printed, or because you have a small corpus overall. Such corpora can benefit from models that are adjusted to them by incorporating training material from these documents.

Please note that an ATR model is rarely trained from scratch. In most cases, an existing model is selected and improved. There are often built-in tools, such as application tests, which can be used to test models on your specific script. If a model works reasonably well, it can be optimized through further training, so-called fine-tuning, with added training material from your documents. Here too, models for print are usually less complex.

2: Amount of Text

The exact definition of 'small' or 'large' amounts of text differs significantly from one project to the other.
Important aspects to consider are:

Duration of the project (time available for exploring sources)
Number of people involved
Level of education of the researchers (e.g., is it a seminar paper or a dissertation?)
Competence in transcribing and understanding sources (content and language)

Generally, the more pages there are, the more time-consuming transcription is, and the more likely it is that ATR can be helpful for an initial approach. However, 'amount of text' and 'heterogeneity of hands' are always closely linked: a volume of, say, 100 printed pages exhibits far less heterogeneity than 100 handwritten pages, even if only one writer was at work. Researchers therefore need far less training material to train a suitable model for printed script.

In other words, a printed text is easier for both researchers and the ATR algorithm to learn to read than a handwritten manuscript. However, the researcher's ability to read and comprehend historical scripts and languages is also a key factor. This means that, depending on your project goal, a semantic understanding of the corpus is just as important as being able to decipher individual words.

Considering time efficiency, it is important to think about what you want from your corpora before doing text recognition on all your sources. This is especially true if you work with a large amount of text. The next steps to take will be explained in the next two sections.

3: Research Question

This criterion refers to the research questions in a project, which guide a project’s focus or interest.

First and foremost: a research question always includes broader and narrower aspects. This passage aims to raise awareness of how the research question posed to a text corpus influences the different scenarios (that is, procedures) for transcribing with ATR.

For example, if a research question examines the text corpus with a 'broad' focus, the main interest is the content of the source as a whole. At this stage, the researcher is generally interested in what a source discusses, which topics it addresses, and so on. If, on the other hand, the main interest of the research question is 'narrow', the goal is to find a specific phenomenon, concept and/or term in the text corpus.

In practice, it is very time-consuming, or perhaps even impossible, to work with a very large amount of text on a microscopic level, that is, with very specific and narrow research questions. This is where ATR can be used to recognize large amounts of text; the somewhat imperfect automated transcription then serves to find the specific extracts we want to focus on. These parts can then be corrected manually and studied in detail.
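The idea of searching an imperfect transcription for relevant passages can be sketched in a few lines of Python. This is only a minimal illustration, not a feature of any particular ATR platform: the function name `find_passages`, the similarity threshold of 0.8 and the sample lines are assumptions made for this example. It uses the standard-library `difflib` module to tolerate small recognition errors such as a long s read as "f".

```python
import re
from difflib import SequenceMatcher


def find_passages(lines, term, threshold=0.8):
    """Return (line_number, matched_word, score) for words similar to `term`.

    `lines` is a list of transcribed text lines, `term` the search word.
    The threshold is an assumed value and should be tuned on your own corpus.
    """
    term = term.lower()
    hits = []
    for number, line in enumerate(lines, start=1):
        # Split the line into lower-cased words, ignoring punctuation.
        for word in re.findall(r"\w+", line.lower()):
            score = SequenceMatcher(None, term, word).ratio()
            if score >= threshold:
                hits.append((number, word, score))
    return hits


# Hypothetical ATR output in which "parish" was misread as "parifh".
transcription = [
    "The parifh records were kept by the clerk",
    "Nothing relevant here",
]
for number, word, score in find_passages(transcription, "parish"):
    print(f"line {number}: {word} (similarity {score:.2f})")
```

Only the lines found this way would then need to be corrected manually. A lower threshold finds more misread words but also returns more false hits, so it is worth checking the results on a small sample first.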

Which aspect of a research question a user chooses to focus on is directly linked to how they plan to read the sources. A distinction is therefore made between close and distant reading. The exact meaning of these two concepts is addressed in the following section, 'Method', which also highlights how the dimensions of 'research question' and 'method' are interlinked.

4: Method

In the context of the following discussion, 'method' is to be understood as 'way of reading' – close or distant.

Close and distant readings often alternate and converge, depending on the project phase. Once a text corpus has been compiled in relation to a research question and transcribed with ATR, the subsequent steps usually aim at one of the following: evaluating the content of the text as a whole (close reading); identifying individual passages in a large volume of text and then evaluating them (first distant, then close reading, e.g., with named entity recognition); or evaluating the text using quantitative methods (distant reading, e.g., topic modeling).

Please also note the practical remark on this matter in the last section.
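To give a rough impression of the distant-reading end of the scale, the following sketch counts word frequencies across transcribed pages. Word counting is a much simpler stand-in for the quantitative methods named above (it is neither topic modeling nor named entity recognition), and the function name `term_frequencies`, the stop word list and the sample pages are assumptions made for this example.

```python
import re
from collections import Counter


def term_frequencies(pages, stopwords=frozenset()):
    """Count how often each word occurs across all pages.

    `pages` is a list of transcribed page texts; words listed in
    `stopwords` (assumed to be lower-case) are ignored.
    """
    counts = Counter()
    for page in pages:
        for word in re.findall(r"\w+", page.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts


# Hypothetical transcribed pages.
pages = ["The mill and the tithe", "The tithe was paid"]
frequencies = term_frequencies(pages, stopwords={"the", "and", "was"})
print(frequencies.most_common(3))
```

The most frequent terms can point to pages that deserve a closer reading, which is one way in which distant and close reading alternate within a project.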

The Four Dimensions of ATR Brought Together

Figure 1: Visual representation of the four key dimensions when working with ATR.
Image: Lorenz Dändliker, 2024.

Figure 2: Visual representation of the four key dimensions when working with ATR, each with added red dots representing an approximate value on the four scales.
Image: Lorenz Dändliker, 2024.

The four criteria discussed above only gain explanatory power when they are considered together. In other words, together they help to assess realistically the benefits and effort involved in applying ATR to one's own research corpus.

Consequently, all dimensions were joined together and displayed in this diagram. The purpose of the diagram is twofold. Firstly, it is intended to illustrate different scenarios that can occur when using ATR on different text corpora. This should build competence in using automatic transcription tools or, more specifically, in deciding when and how ATR is best used. Secondly, the diagram should help users of our module evaluate the use of ATR for their own text corpora and projects.

Before using ATR, consider the nature of your corpus (amount of text, heterogeneity of hands) and how this relates to the selected approach (research question, method). Figure 2 illustrates what this could look like. The example shows a low heterogeneity of hands and a small amount of text as the nature of the corpus, combined with a broad research question and close reading as the method.

More detailed application examples can be found in the next chapter, where we explain four use scenarios with different types of text corpora. These can also help you think about your own project and where you would place it on these scales.

The subsequent chapter shows all possible combinations of corpus type and research method, with some tips on how to approach each of them.

Further Considerations on the Use of ATR

As stated in the introduction to this chapter, time efficiency is a very important factor when working with ATR. Just because something is technically possible does not mean that it is also a good (that is, time-efficient) solution that balances cost and benefit. What counts as a 'good' transcription depends on the research requirements: if only individual sections (e.g., mentions of certain people or concepts) in a larger corpus are of interest, a transcription with a character error rate (CER) of up to 10% may be sufficient to identify those passages. Being able to quickly filter relevant passages out of imperfectly transcribed volumes of text promises a considerable expansion of the source base and could even be used in smaller research projects. The comprehensibility of the transcribed passages, moreover, depends on the researcher's ability to read and analyze handwritten texts, especially when it comes to understanding the meaning, checking for accuracy, identifying relevant passages and indexing shorter texts for close reading.
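To make the figure of 10% more tangible: the CER is commonly calculated as the number of character insertions, deletions and substitutions needed to turn the automatic transcription into a correct reference transcription, divided by the length of the reference. The following Python sketch shows this calculation on a single line. The function names and the sample line are assumptions made for this illustration; ATR platforms may normalize whitespace or special characters differently before counting.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the correct reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong character in a line of 23 characters.
reference = "in the year of our lord"
hypothesis = "in the yeer of our lord"
print(f"CER: {character_error_rate(reference, hypothesis):.1%}")  # CER: 4.3%
```

At a CER of 10%, roughly one character in ten is wrong. Most words therefore remain recognizable, which is why such a transcription can still be searched for names or concepts, ideally with a search that tolerates small spelling deviations.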

Introduction to reading historical writings

Transcription Exercises in German, English, Latin, Romance and Scandinavian Languages