Translating Sumerian Texts

1. Introduction

The first vignette (“Getting Started with sumer”) introduced the basic functions of the package: sign conversion, dictionary lookup, and translation templates for individual lines. This vignette describes the complete workflow for working with entire texts.

The workflow consists of the following steps:

  1. Load a transliterated or cuneiform text
  2. Analyze the text using n-gram analysis and grammar probabilities
  3. Translate line by line interactively
  4. Generate a custom dictionary from the translations

The generated dictionary can be used as an additional source for future translations. This creates a cycle: each new translation improves the dictionary, and the improved dictionary facilitates the next translation.

library(sumer)
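As a roadmap, the four steps correspond to the functions introduced in the remaining sections. The following sketch shows them in compressed form; it is not meant to be run as-is (the output file name is illustrative, and translate() opens an interactive gadget):

path   <- system.file("extdata", "enki_and_the_world_order.txt", package = "sumer")
text   <- readLines(path, encoding = "UTF-8")      # step 1: load the text
freq   <- ngram_frequencies(text)                  # step 2: n-gram analysis
result <- translate(1, text = text)                # step 3: translate interactively
writeLines(result, "line_001.txt")
my_dic <- make_dictionary("line_001.txt")          # step 4: build a dictionary
save_dictionary(dic = my_dic, file = "my_dictionary.txt")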

2. Loading a Text

The package includes the example text “Enki and the World Order”, a Sumerian myth. The text is stored as a text file:

path <- system.file("extdata", "enki_and_the_world_order.txt", package = "sumer")
text <- readLines(path, encoding = "UTF-8")

The first few lines look like this:

cat(text[1:5], sep = "\n")
#> 
#> 1)   ๐’‚—๐’ˆค๐’ฒ๐’€ญ๐’† ๐’‰ช๐’……๐’…Ž๐’‹ผ๐’ˆพ
#> 2)   ๐’€€๐’€€๐’€ญ๐’‚—๐’† ๐’„ž๐’ฎ๐’€€๐’Š‘๐’€€๐’„ ๐’ƒฒ๐’‚Š๐’Œ…๐’•
#> 3)   ๐’Šฉ๐’…—๐’‚ต๐’†ณ๐’ƒฒ๐’€ญ๐’‚—๐’†ค๐’‡ท๐’† ๐’‰˜๐’€ญ๐’†ฌ๐’‚ต
#> 4)   ๐’ˆ—๐’„‘๐’ˆฉ๐’ช๐’€Š๐’€€๐’†•๐’€€๐’†ณ๐’†ณ๐’‹ซ๐’…๐’†ท

Each line can optionally begin with a line number (e.g., 1)\t...). Lines starting with # are treated as comments and ignored during analysis. The text can be in cuneiform or transliteration – in the latter case, it can be automatically converted.
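This preprocessing can be illustrated with a few lines of base R. The package performs it internally; the exact regular expression used here is an illustrative assumption:

# Illustration only: drop comment lines and optional line-number prefixes
raw  <- c("# a comment line", "1)\tan-en-ki", "2)\tlugal-e")
body <- raw[!startsWith(raw, "#")]          # discard comments
body <- sub("^[0-9]+\\)[ \t]*", "", body)   # strip "1)\t"-style prefixes
body
#> [1] "an-en-ki" "lugal-e"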

3. N-gram Analysis

Finding frequent sign combinations

A good first step when working with a new text is to search for frequently recurring sign combinations (n-grams). Such patterns are valuable clues: if a certain sequence of cuneiform signs appears repeatedly, it is likely a fixed term, a compound word, or an idiomatic expression.

freq <- ngram_frequencies(text, min_freq = c(6, 4, 2))
head(freq, 10)
#>    frequency length          combination
#> 1          2     20 ๐’€ญ๐’‚—๐’† ๐’ˆ—๐’ช๐’€Š๐’†ค๐’…Ž๐’ƒฒ๐’ˆพ๐’†ณ๐’†ช๐’ฒ๐’ฃ๐’‰ˆ๐’Œ‹๐’Œ‹๐’Œ‹๐’ˆพ๐’‚Š
#> 2          2     16     ๐’€ญ๐’‚—๐’†ค๐’ˆ—๐’†ณ๐’†ณ๐’Š๐’Š๐’‚—๐’†ค๐’† ๐’‚ ๐’ƒถ๐’ˆพ๐’€Š๐’บ
#> 3          2     15      ๐’€Š๐’ˆฌ๐’Œง๐’ƒป๐’€Š๐’†ฌ๐’‚ต๐’€€๐’€ญ๐’Šฎ๐’‰๐’ƒด๐’†“๐’€€๐’€ญ
#> 4          2     15      ๐’€€๐’ˆพ๐’€€๐’Š๐’€ญ๐’‡ฒ๐’€€๐’ˆพ๐’€€๐’Š๐’€Š๐’ˆญ๐’‚Š๐’‰ˆ๐’‚—
#> 5          3     14       ๐’‚๐’†ณ๐’Š‘๐’‚๐’€ญ๐’‚—๐’†ค๐’‡ฒ๐’†ค๐’ƒป๐’……๐’†ท๐’‰†๐’‹›
#> 6          2     14       ๐’‰ฃ๐’† ๐’† ๐’†ฌ๐’† ๐’†—๐’†—๐’†ท๐’€ธ๐’ˆจ๐’ˆค๐’‹—๐’‹พ๐’€€
#> 7          2     14       ๐’ˆฅ๐’Œ…๐’ˆง๐’€ฒ๐’Š•๐’‚Š๐’Œ‹๐’Œ‹๐’Œ‹๐’ˆฌ๐’‰Œ๐’‰บ๐’„ธ๐’บ
#> 8          2     12         ๐’†ค๐’ˆฌ๐’Œง๐’•๐’„พ๐’‚—๐’†ค๐’† ๐’…—๐’‰Œ๐’€€๐’€ญ
#> 9         11     10           ๐’€ญ๐’‚—๐’† ๐’†ค๐’ ๐’€๐’‰†๐’ˆช๐’…”๐’บ
#> 10         2     10           ๐’†ฌ๐’€ญ๐’ˆน๐’ˆจ๐’‚—๐’ˆฅ๐’๐’ˆฌ๐’ˆจ๐’€€

The min_freq parameter controls the minimum frequency for different n-gram lengths. The default value c(6, 4, 2) means: single signs must occur at least 6 times, pairs at least 4 times, and all longer combinations at least 2 times. Depending on the length of the text, these thresholds can be adjusted.

The result is a data frame with three columns: frequency, length (number of signs), and combination (cuneiform characters).
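For a shorter text, the defaults may be too strict. The thresholds can simply be lowered; the values here are arbitrary examples:

freq_loose <- ngram_frequencies(text, min_freq = c(3, 2, 2))
head(freq_loose)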

The analysis works from the longest combinations down to the shortest. When a long combination is identified as frequent, its occurrences are masked so that shorter sub-combinations are not falsely counted as frequent just because they are part of the longer combination.
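The masking idea can be illustrated with plain strings, using letters as stand-ins for signs (base R, not package code):

s <- "abcxabcyab"                      # letters stand in for cuneiform signs
# the trigram "abc" is frequent (2 hits); mask it before counting pairs
masked <- gsub("abc", "___", s, fixed = TRUE)
masked
#> [1] "___x___yab"
# the pair "ab" is now counted only where it occurs outside the trigram
lengths(regmatches(masked, gregexpr("ab", masked, fixed = TRUE)))
#> [1] 1

Without the masking step, "ab" would have been counted three times, purely because it is a prefix of the frequent trigram.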

Marking n-grams in the text

With mark_ngrams(), the identified patterns are marked in the text with curly braces:

text_marked <- mark_ngrams(text, freq)
cat(text_marked[1:5], sep = "\n")
#> 
#> 1    ๐’‚—๐’ˆค๐’ฒ {๐’€ญ๐’† } ๐’‰ช๐’……๐’…Ž๐’‹ผ๐’ˆพ
#> 2     { {๐’€€ {๐’€€๐’€ญ} ๐’‚—} ๐’† } ๐’„ž๐’ฎ๐’€€๐’Š‘๐’€€๐’„  {๐’ƒฒ๐’‚Š๐’Œ…๐’•} 
#> 3     {๐’Šฉ {๐’…—๐’‚ต} }  {๐’†ณ๐’ƒฒ { { {๐’€ญ๐’‚—} ๐’†ค} ๐’‡ท} }  {๐’† ๐’‰˜}  {๐’€ญ {๐’†ฌ๐’‚ต} } 
#> 4    ๐’ˆ—๐’„‘ {๐’ˆฉ {๐’ช๐’€Š} }  {๐’€€๐’†• {๐’€€๐’†ณ๐’†ณ} ๐’‹ซ๐’…๐’†ท}

In the output, recurring sign combinations are highlighted with {...}. This makes patterns visible that are easily overlooked when reading the raw cuneiform text.

Searching for patterns in the text

You can also search for a specific pattern in the annotated text. To do this, convert the search term with mark_ngrams() into the same format and then search with grepl():

term    <- "IGI.DIB.TU"
pattern <- mark_ngrams(term, freq)
pattern
#> [1] " { {๐’…†๐’ณ} ๐’Œ…} "
result  <- text_marked[grepl(pattern, text_marked, fixed = TRUE)]
cat(result, sep = "\n")
#> 12   ๐’„‹ { {๐’…†๐’ณ} ๐’Œ…} ๐’‡ป๐’…† { { {๐’…†๐’ณ} ๐’Œ…} ๐’•} 
#> 13   ๐’Šพ { {๐’…†๐’ณ} ๐’Œ…} ๐’Šพ๐’‡ { { {๐’…†๐’ณ} ๐’Œ…} ๐’•} 
#> 53    {๐’€ญ๐’‰ก๐’ถ๐’„ท๐’„ญ} ๐’‡‡ {๐’ฃ⟨ĜA2⟩ { {๐’Œ“๐’บ} ๐’‰Œ} } ๐’ƒข {๐’ฃ๐’ƒถ { {๐’…†๐’ณ} ๐’Œ…} } 
#> 54   ๐’€–๐’†ฐ { {๐’Œ“๐’บ} ๐’‰Œ} ๐’€ซ {๐’ฃ๐’ƒถ { {๐’…†๐’ณ} ๐’Œ…} } 
#> 55   ๐’š {๐’ฃ⟨ĜA2⟩ { {๐’Œ“๐’บ} ๐’‰Œ} } ๐’ˆง {๐’ฃ๐’ƒถ { {๐’…†๐’ณ} ๐’Œ…} } 
#> 80    { { {๐’…†๐’ณ} ๐’Œ…} ๐’•} ๐’Œ‰๐’Š• {๐’€ญ {๐’†ฌ๐’‚ต} }  {๐’ˆจ๐’‚—} 
#> 196  ๐’Œฃ๐’ฃ๐’† ๐’€ญ { {๐’…†๐’ณ} ๐’Œ…} ๐’๐’€ญ๐’ถ๐’‹—๐’‰ก๐’‹ผ๐’‚ท
#> 197   { {๐’ˆ— { {๐’…†๐’ณ} ๐’Œ…} } ๐’ˆฝ๐’ฃ๐’†Ÿ๐’‰ˆ} 
#> 198   { {๐’‚— { {๐’…†๐’ณ} ๐’Œ…} } ๐’Š•๐’ƒž {๐’‚ท๐’‚ท} } 
#> 258   { {๐’€€๐’‡‰} ๐’ˆฆ๐’„˜๐’ƒผ} ๐’„ ๐’ƒฒ๐’ถ๐’Šฎ๐’…Ž๐’„พ { {๐’…†๐’ณ} ๐’Œ…} ๐’€ {๐’ˆฌ๐’‰Œ} …
#> 280  ๐’ƒป๐’†Ÿ๐’•๐’‰Œ { {๐’…†๐’ณ} ๐’Œ…}  {๐’‰ก {๐’Œ“๐’บ} } 
#> 296   {๐’‰ฃ๐’ƒฒ} XXX { {๐’…†๐’ณ} ๐’Œ…} …
#> 298  ๐’€Š๐’ƒป๐’„ญ๐’€Š๐’ƒป { {๐’…†๐’ณ} ๐’Œ…} …
#> 402   {๐’ˆ— { {๐’…†๐’ณ} ๐’Œ…} }  {๐’‚— { {๐’…†๐’ณ} ๐’Œ…} } ๐’‰ {๐’‹—๐’‰Œ๐’€€ { {๐’ƒถ๐’‚—} ๐’……} } 
#> 410   { {๐’ˆ— { {๐’…†๐’ณ} ๐’Œ…} } ๐’ˆฝ๐’ฃ๐’†Ÿ๐’‰ˆ} 
#> 411   { {๐’‚— { {๐’…†๐’ณ} ๐’Œ…} } ๐’Š•๐’ƒž {๐’‚ท๐’‚ท} } ๐’‹—๐’ˆพ { {๐’ƒถ๐’‚—} ๐’……}

This finds all lines in which the pattern IGI.DIB.TU occurs – including occurrences embedded within larger n-grams.
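Since the conversion and the search always go together, they can be wrapped in a small helper (find_pattern is a hypothetical name, not part of the package):

# hypothetical convenience wrapper around the two-step search above
find_pattern <- function(term, text_marked, freq) {
  pattern <- mark_ngrams(term, freq)
  text_marked[grepl(pattern, text_marked, fixed = TRUE)]
}
result <- find_pattern("IGI.DIB.TU", text_marked, freq)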

4. Grammar Probabilities

Grammatical types from the dictionary

To understand the structure of a sentence, it is helpful to know which grammatical role each individual sign is likely to play. The function sign_grammar() looks up each sign of a string in the dictionary and counts how often it occurs with each grammatical type:

dic <- read_dictionary()
#>  ###---------------------------------------------------------------
#>  ###                Sumerian Dictionary
#>  ###
#>  ### Author:  Robin Wellmann
#>  ### Year:    2026
#>  ### Version: 0.5
#>  ### Watch for Updates: https://founder-hypothesis.com/en/sumerian-mythology/downloads/
#>  ###---------------------------------------------------------------
sg  <- sign_grammar("a-ma-ru ba-ur3 ra", dic)

The result is a data frame with one row per sign per grammatical type. The n column indicates how often this sign is attested with the respective type in the dictionary.

Bayesian probabilities

The raw frequencies from the dictionary can be refined into probabilities using a Bayesian model. First, compute the prior distribution of types across all signs in the dictionary:

prior <- prior_probs(dic, sentence_prob = 0.25)

The sentence_prob parameter corrects a systematic bias: if a dictionary was primarily built from noun phrases (rather than complete sentences), verbs are underrepresented in it. A value of sentence_prob = 0.25 means that an estimated 25% of the dictionary entries come from complete sentences. Verb probabilities are then upweighted accordingly.

Next, grammar_probs() computes the posterior probabilities for each sign:

gp <- grammar_probs(sg, prior, dic)

For signs with many dictionary entries, the observed frequencies dominate; for rare signs, the result falls back to the prior distribution.
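This shrinkage behaviour can be illustrated with a generic posterior-mean calculation using prior pseudo-counts. This is a sketch of the general idea, not the package's exact formula; the type labels and the value of alpha are made up:

# generic Bayesian shrinkage: posterior mean with prior pseudo-counts
posterior_mean <- function(counts, prior, alpha = 5) {
  (counts + alpha * prior) / (sum(counts) + alpha)
}
prior <- c(S = 0.5, V = 0.3, OP = 0.2)
# well-attested sign: 40 observations dominate the prior
round(posterior_mean(c(S = 35, V = 3, OP = 2), prior), 2)  # S 0.83, V 0.10, OP 0.07
# rare sign: a single observation, the result stays near the prior
round(posterior_mean(c(S = 1, V = 0, OP = 0), prior), 2)   # S 0.58, V 0.25, OP 0.17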

Visualization

The function plot_sign_grammar() presents the results as a stacked bar chart:

plot_sign_grammar(gp, sign_names = FALSE)

Each bar represents a sign position in the sentence. The colours represent grammatical types: green for nouns (S), red shades for verbs (V) and verb operators, blue shades for attribute operators, and orange for other operators. A tall bar in a particular colour indicates that the sign likely has that grammatical function.

The chart can also be saved to a file:

plot_sign_grammar(gp, output_file = "grammar.png")

5. Translating Line by Line

Using the translate gadget with a text

The function translate() reaches its full potential when used together with an entire text. Instead of a string, you pass a line number and the text:

result <- translate(3, text = text)

The gadget opens with the third line of the text and has access to the full text, which provides additional sources of context beyond the single line being translated.

Adjusting the bracket structure

In the input field of the translation section, you can adjust the bracket structure of the sentence. This is particularly important when you have identified fixed terms or compound words.

After clicking “Update Skeleton”, the template is regenerated while preserving all previously entered translations.

Saving the result

When you click “Done”, translate() returns a skeleton object. This can be saved as a text file:

result <- translate(3, text = text)
writeLines(result, "line_003.txt")

The saved result is a text file in skeleton format (pipe format) that can be used as input for dictionary creation.

Using multiple dictionaries

If you have already created your own dictionaries (see next chapter), they can be passed as additional sources:

result <- translate(3, text = text,
              dic = c(system.file("extdata", "sumer-dictionary.txt", package = "sumer"),
                      "my_dictionary.txt"))

The dictionaries are searched in the order specified. When automatically pre-filling the translation template, the first dictionary that contains an entry for a given substring wins. In the dictionary panel of the gadget, entries from all dictionaries are displayed side by side.

6. Building a Dictionary from Translations

The annotation format

The skeleton files generated by translate() use the pipe format, which serves directly as input for dictionary creation. Every line starting with | is used as a dictionary entry:

|reading=SIGN_NAME=cuneiform:type:translation

A typical file looks like this:

an-en-ki-ki-a-ig-e2-kur-ra: SEN: The god Enki transforms the Earth. The one who establishes sustenance of human existence utilizes a supplier of energy from a distant place (the E-Kur temple). 
|an-en-ki=AN.EN.KI=๐’€ญ๐’‚—๐’† : S: god Enki
|ki-a=KI.A=๐’† ๐’€€: V: to transform the Earth
|   ki=KI=๐’† : S: Earth
|   a=A=๐’€€: S☒->V: to transform S
|ig=IG=๐’……: S: one who establishes the sustenance of human existence
|e2-kur-ra=E2.KUR.RA=๐’‚๐’†ณ๐’Š: V: to utilize a supplier of energy from a distant place
|   e2-kur=E2.KUR=๐’‚๐’†ณ: S: supplier of energy from a distant place
|       e2=E2=๐’‚: ☒S->S: supplier of energy from S
|       kur=KUR=๐’†ณ: S: distant place
|   ra=RA=๐’Š: S☒->V: to utilize S

The header line (the line without a leading |) and blank lines are ignored when the file is read. Only lines starting with | become dictionary entries.
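Extracting the entries from such a file takes only a few lines of base R. This is an illustration of the format; read_translated_text() handles the parsing properly:

lines   <- readLines("line_003.txt", encoding = "UTF-8")
entries <- sub("^\\|[ ]*", "", lines[startsWith(lines, "|")])
# each entry: reading=SIGN_NAME=cuneiform : type : translation
# (naive split -- would break if a translation itself contained a colon)
fields  <- strsplit(entries, "[ ]*:[ ]*")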

Creating the dictionary

Once you have translated several lines of a text and saved them to files, the files can be combined into a dictionary with make_dictionary(). The function accepts a vector of file paths:

dictionary <- make_dictionary("line_003.txt")

The function reads the translation files, aggregates entries (identical sign-type-translation combinations are counted together), and adds cuneiform characters and readings. The result is a data frame in dictionary format.

Internally, make_dictionary() performs two steps that can also be called individually:

# Step 1: Read translation files
translations <- read_translated_text("line_003.txt")

# Step 2: Convert to dictionary format
dictionary <- convert_to_dictionary(translations)

The intermediate step is useful if you want to edit the translations before conversion – for example, to unify spelling conventions.
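For example, a spelling could be unified in the intermediate result before conversion. The column name translation used below is an assumption about the intermediate format:

translations <- read_translated_text("line_003.txt")
# hypothetical cleanup pass; the column name 'translation' is an
# assumption about the intermediate format
translations$translation <- gsub("utilise", "utilize", translations$translation)
dictionary <- convert_to_dictionary(translations)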

Saving and loading the dictionary

The completed dictionary can be saved with metadata:

save_dictionary(
  dic     = dictionary,
  file    = "my_dictionary.txt",
  author  = "My Name",
  year    = "2026",
  version = "1.0",
  url     = "https://example.com/dictionary"
)

And loaded again later:

my_dic <- read_dictionary("my_dictionary.txt")
look_up("ki", my_dic)

The cycle

The custom dictionary can now be used in further translation work:

# For lookup
look_up("lugal", my_dic)

# For interactive translation
result <- translate(4, text = text, dic = "my_dictionary.txt")

With each translated line, the dictionary grows. Frequent signs and expressions accumulate higher counts over time, and the automatic pre-filling of translation templates becomes increasingly accurate. In this way, you gradually build a comprehensive dictionary database based on your own texts.