3: Use-Scenarios of ATR with Different Types of Text-Corpora

What You Will Learn

By the end of this introduction and the subsequent pages, you will confidently be able to:

  • Understand how different project corpora can be represented on the scales of the four key dimensions of working with ATR
  • Explain why and how these values on the scales influence possible approaches

This chapter shows four examples with different types of corpora.

In the last chapter, we established four dimensions of working with ATR. Here, we use them to show how different types of corpora can be tackled with Automated Text Recognition (ATR), depending on the heterogeneity of hands in the corpus, the amount of text, the breadth of your research question, and whether you want to use close reading or distant reading as your method.

In each section, we present one example with the following information:

  • On the right, you will see a description of a corpus and a related project, with some information on what we (hypothetically) would like to do with said corpus after text recognition.
  • On the left, you will see a visual representation of the four dimensions established in the previous chapter, with red dots representing the value on each scale assigned to this specific example.
  • Underneath the visual representation, you will find a short summary of the project description.

In this chapter, we sometimes talk about ATR models you can use for text recognition. You will find more information about them in the last chapter. Here, it is only important to understand that models are either trained with a very narrow focus on a specific script or even a specific hand, or with a larger set of training data consisting of various scripts, script types or even languages.

Small models with a narrow focus offer better recognition if your script matches their narrow training data well, but they are often weak with scripts that differ even slightly. Bigger models aim at broader corpora and therefore match more scripts, but they tend to produce worse results for any individual hand. The lower the heterogeneity of your corpus, the better a specific small model will work, provided it is selected or trained well. The higher the heterogeneity, the bigger your model needs to be to cover all the scripts in your corpus.

Whereas this chapter focuses on four examples, in the next chapter you will find a tool with which you can explore all possible combinations of the previously established four key dimensions of working with ATR.

Visual representation of the four key dimensions when working with ATR, each with added red dots representing an approximated value on the four scales.
Image: Lorenz Dändliker, 2024.

Amount of text: 20 books of 800 pages each
Heterogeneity of hands: 50 different writers, maybe more
Research question: broad (e.g. what are the key themes that recur in the text corpus over the entire period?)
Method: distant reading (e.g., a topic modeling algorithm that identifies which groups of words occur together; each identified group of words (bag of words) must then be labeled by the researcher with an overarching theme)

In this example, we have a corpus consisting of 20 books with over 800 pages each, written by around 50 different scribes in total. This means that we have a high heterogeneity of hands and a large amount of text. For the hypothetical project, we have a broad research question and want to read the text distantly, using topic modeling as a next step. This example will not discuss topic modeling itself, but rather the preparation of a large corpus for use in a subsequent tool. With large corpora, a perfect transcription is often not needed.

As the amount of text is large, ATR is very useful as a first step. Because the heterogeneity of hands is high, we need a model trained on a wider variety of material, so that it can recognize the varied forms of the characters used.

To find a text recognition model that will work with your corpus, pay attention to the following:

  • Does a model exist that has been trained with text data from a temporally and linguistically close context?
  • Is there one based on text data from the same archive or at least an archive close to the region of the text corpus in question?
  • Do you know the name or script type of the scripts used in your corpus? If so, you can also search for the script type on which a model was trained.

Please note: Although some models may produce better results when trained with more material, this might not always be the case. The efficiency and quality of a model can ultimately only be determined by manually checking the results for the number of correctly transcribed passages.

In this project, the goal is to use topic modeling as a next step after text recognition. Topic modeling filters out various topics as word groups from the corpus. How low the Character Error Rate (CER) must be for topic modeling to produce meaningful results is a research desideratum, but we estimate approximately 5% (for more on the CER, see the last chapter).
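To make the CER more tangible: it is the edit distance (insertions, deletions and substitutions) between the automatic transcription and a manually corrected reference, divided by the length of the reference. The following minimal Python sketch shows how it can be computed; the example lines are invented purely for illustration and do not come from any real source.

    def character_error_rate(reference: str, hypothesis: str) -> float:
        """Edit distance between the two strings, divided by the length
        of the reference: the usual definition of the CER."""
        # Dynamic-programming edit distance (insertions, deletions, substitutions).
        prev = list(range(len(hypothesis) + 1))
        for i, ref_char in enumerate(reference, start=1):
            curr = [i]
            for j, hyp_char in enumerate(hypothesis, start=1):
                cost = 0 if ref_char == hyp_char else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1] / len(reference)

    # Invented example: one dropped letter and two substituted letters.
    reference = "Item ich bekenne offenlich mit disem brieve"
    hypothesis = "Item ih bekonne offenlich mit disen brieve"
    print(f"CER: {character_error_rate(reference, hypothesis):.1%}")

In this toy case the script prints a CER of 7.0%, which would lie above the rough 5% orientation value mentioned above and suggest further correction or fine-tuning.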

Precisely because the evaluation of imperfect transcriptions with topic modeling is a desideratum, it can still make sense to try it. Moreover, the description of imperfect results can generate valuable insights, at the methodological level as well as at the content level. At the content level, for example, it could be tested whether incorrectly transcribed terms, once they have been identified and corrected in the original text, still fit into the bag of words into which they were sorted (or not).
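If you do try this experimental route, the step from ATR output to bags of words could, purely as an illustration, look like the following Python sketch. It assumes one plain-text file of recognized text per book in a folder called atr_output and uses the scikit-learn library; the folder name, the number of topics and all parameters are assumptions for illustration, not recommendations of this module.

    from pathlib import Path
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Hypothetical layout: one plain-text file of ATR output per book.
    docs = [p.read_text(encoding="utf-8") for p in sorted(Path("atr_output").glob("*.txt"))]

    # Bag-of-words representation; the frequency threshold is illustrative only.
    vectorizer = CountVectorizer(lowercase=True, min_df=5)
    counts = vectorizer.fit_transform(docs)

    # Ten topics is an arbitrary starting point, not a recommendation.
    lda = LatentDirichletAllocation(n_components=10, random_state=0)
    lda.fit(counts)

    # Print the most strongly weighted words per topic; naming the topics
    # (the overarching themes) remains the task of the researcher.
    vocab = vectorizer.get_feature_names_out()
    for topic_id, weights in enumerate(lda.components_):
        top_words = [vocab[i] for i in weights.argsort()[-10:][::-1]]
        print(f"Topic {topic_id}: {', '.join(top_words)}")

Misrecognized character strings simply appear here as (usually rare) words, which is exactly why the frequency threshold and a later inspection of the bags of words matter.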

If this experimental approach is not intended to be part of a project, it makes sense to bring the text to as low a CER as possible. It can therefore be helpful to train (in the sense of fine-tuning) an existing ATR model with writing data from the researcher's own corpus to optimize the results further (for fine-tuning a model, also see the last chapter). It is advisable to include as many of the 50 different hands as possible in the training sample, to make sure all of them will be recognized well.
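How exactly such a training sample is assembled depends on the ATR tool used. Purely as a tool-independent illustration, the following Python sketch draws a fixed number of already transcribed lines per hand from a hypothetical table of ground-truth lines (the file name ground_truth_lines.csv and its columns are invented), so that every hand is represented without any single scribe dominating the fine-tuning data.

    import csv
    import random
    from collections import defaultdict

    random.seed(0)  # reproducible sampling

    # Hypothetical ground-truth table: one row per transcribed line,
    # with columns "scribe", "image" and "text".
    lines_by_scribe = defaultdict(list)
    with open("ground_truth_lines.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            lines_by_scribe[row["scribe"]].append(row)

    # Up to 30 lines per hand; the number is an assumption for illustration.
    LINES_PER_SCRIBE = 30
    training_sample = []
    for scribe, rows in lines_by_scribe.items():
        training_sample.extend(random.sample(rows, min(LINES_PER_SCRIBE, len(rows))))

    print(f"{len(lines_by_scribe)} hands, {len(training_sample)} lines selected for fine-tuning")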

Visual representation of the four key dimensions when working with ATR, each with added red dots representing an approximated value on the four scales.
Image: Lorenz Dändliker, 2024.

Amount of text: small page with 20 lines
Heterogeneity of hands: low (one writer)
Research question: narrow (e.g., what function did concept X fulfill in Document Y in period Z? The research question for this individual source can also be part of a broader research question that is pursued with the help of a source corpus and/or literature that goes beyond this source, for example an edited source corpus and/or specialist literature. The larger question, to which the narrower question about the individual source is connected, could be: how did concept X function in discourse Y in period Z?)
Method: close reading (reading the source as a whole)

The present example depicts a text corpus that is comparatively very manageable: it consists of only one charter with about 20 lines. With a corpus this small, it would be perfectly acceptable to transcribe it manually. However, depending on the researcher's reading skills, it may make more sense to take the first steps with the help of ATR before editing the machine-generated transcription.

This example clearly benefits from using an existing ATR model instead of training your own (for ATR models and how to select or train them, see the last chapter). Since only one page is to be transcribed, the training material would not be extensive enough and training would be too time-consuming. Hence, use an existing ATR model and remember to look for already trained models that use data from the same period, the same language region and (if possible) the same scribal school and/or even the same archive as the source being analyzed. A further possibility is to check whether the names of the scripts used to train the models are mentioned in the descriptions of the individual models.

Moreover, the question of whether a large or a small model is more useful cannot be answered conclusively; it always depends on the circumstances. Larger models cover a more comprehensive range of scripts and scribal hands, while smaller ones are often based on only one scribal hand. But if even one of those scripts matches the script of the source closely, the smaller model may still produce good results. However, for a single late medieval charter, it might not be possible to determine the scribal school or the name of the scribe, especially if you are not trained in this field. If so, a larger model might be more useful, because the chances are higher that it encompasses scripts similar to yours.

As you can see, trying out various models and evaluating the results based on the proportion of correctly recognized passages is a valid process for obtaining the best results.
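One way to make this comparison slightly more systematic, without leaving the realm of a quick check, is sketched below in Python: the outputs of a few candidate models are compared against a short hand-corrected reference transcription of the charter. The file names and model labels are invented, and difflib's similarity ratio is only a rough proxy for "correctly recognized passages", not a replacement for reading the text yourself.

    import difflib
    from pathlib import Path

    # Invented file names: a hand-corrected reference transcription of the charter
    # and the raw output of three candidate ATR models.
    reference = Path("charter_reference.txt").read_text(encoding="utf-8")
    candidates = {
        "broad model": "output_broad_model.txt",
        "regional model": "output_regional_model.txt",
        "single-hand model": "output_single_hand_model.txt",
    }

    # ratio() yields a character-level similarity between 0 and 1 (1.0 = identical).
    for name, path in candidates.items():
        output = Path(path).read_text(encoding="utf-8")
        similarity = difflib.SequenceMatcher(None, reference, output).ratio()
        print(f"{name}: {similarity:.1%} similar to the manual transcription")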

Visual representation of the four key dimensions when working with ATR, each with added red dots representing an approximated value on the four scales.
Image: Lorenz Dändliker, 2024.

Amount of text: 20 handwritten letters, 40 folios
Heterogeneity of hands: 10 letters from person X and 10 from different respondents (11 writers in total)
Research question: Broad (e.g., what were the main subjects that person X communicated about in his letters in the period Y?)
Method: close reading (the researcher is interested in understanding the source as a whole)

This corpus comprises a total of 20 letters, 10 of which originate from the hand of person X and 10 from respondents. In total, the letters amount to 40 densely written pages with a high heterogeneity of scripts in relation to the size of the corpus. In this hypothetical project, we take a broad approach to our research question but want to read the text closely, reading through the whole text.

Here, it is valuable to start with ATR, using a model that fits the following requirements:

  • Is there a model based on data from a similar scribal school? Has a model already been trained with the scripts in your corpus (or some of them)?
  • If not: is there a model trained with data from a similar period and geographical area? Do you know the name of the script(s) in your corpus and can you find a model based on this script?

As we have here a corpus with several different hands (that is, several people wrote the documents), you might want to start with larger models and check whether the recognition is already good enough for your purpose. With 11 different writers, this corpus shows a high degree of writing heterogeneity, and larger models might cover this variance better. If the results are not satisfactory, try smaller models. Since the corpus is not too large, it should be possible to correct imperfect transcriptions manually in a reasonably economical way.

Generally, a close reading that aims to examine the text corpus with a broad range of questions should be quite feasible, as the amount of text is rather small. In the end, however, this must be assessed by the users themselves and depends on their reading and comprehension skills when reviewing the accuracy of the transcribed passages.

See also: Introduction to reading historical scripts.

Visual representation of the four key dimensions when working with ATR, each with added red dots representing an approximated value on the four scales.
Image: Lorenz Dändliker, 2024.

Amount of text: 31 daily newspapers
Heterogeneity of hands: 2 different typefaces (e.g., Latin and black letter)
Research question: narrow (which horoscopes were written for the star sign Libra in the month of October in the year 1935?)
Method: close reading

Unlike the other types of text corpora, this example consists of printed material. The fact that the text is printed is significant insofar as there is less variability in the typefaces than with handwritten material. Compared to handwriting, it is also much easier to identify the names of typefaces, which can usually be found in the descriptions of the models. In this particular daily-newspaper example, two different typefaces were used (e.g., Latin and black letter). Here, the approach would not change much between having a broad or narrow research question or wanting to read distantly or closely. With printed material and a well-suited model, the automated recognition will be close to perfect.

Therefore, the best-case scenario would be an already existing ATR model trained on at least these two typefaces. Normally, well-functioning ATR models are already available for printed documents. If possible, choose a model from the same language area, since many ATR models now incorporate language models that predict characters based on the preceding ones. These character sequences obviously change between languages, but also between stages of language development over time.
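To illustrate the idea behind such language models (this is a didactic toy in Python, not how any specific ATR tool implements it): even a simple character bigram model, trained on a small invented sample standing in for period- and language-appropriate text, will prefer a plausible reading over a visually similar but implausible one.

    from collections import Counter

    # Invented sample text standing in for period- and language-appropriate training data.
    sample = "the horoscope for libra promises a calm and fortunate october"

    # Count how often each character follows each other character (bigram counts).
    bigram_counts = Counter(zip(sample, sample[1:]))
    char_counts = Counter(sample)
    vocabulary_size = len(set(sample))

    def score(reading: str) -> float:
        """Rough probability of a reading under the bigram counts,
        with add-one smoothing so unseen character pairs do not score zero."""
        p = 1.0
        for a, b in zip(reading, reading[1:]):
            p *= (bigram_counts[(a, b)] + 1) / (char_counts[a] + vocabulary_size)
        return p

    # Two candidate readings of the same smudged word: the model prefers
    # the sequence that looks like real text in this language.
    print(score("fortunate"), score("fortunatc"))

In an actual ATR model the mechanism is more sophisticated, but the effect is the same: readings that fit the language (and its historical stage) are ranked higher, which is why a model from the same language area tends to help.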

If there is no existing model for the specific printed sources, a suitable model can be created just as you would for handwritten scripts. However, much less training material is needed because of the much lower variety in characters. If there is only a well-functioning model for one of the typefaces, it is possible to fine-tune that model with data from the other typeface. This way, it should eventually work for both.

It is also possible to try to access the text using Large Language Models (LLMs, such as ChatGPT 4.0); however, this goes beyond the scope of the present module.