The first vignette (“Getting Started with sumer”) introduced the basic functions of the package: sign conversion, dictionary lookup, and translation templates for individual lines. This vignette describes the complete workflow for working with entire texts.
The workflow consists of the following steps:

1. Load the text and identify frequently recurring sign combinations (n-grams).
2. Mark the identified n-grams in the text.
3. Analyse the likely grammatical role of each sign.
4. Translate individual lines interactively with translate() and save the results.
5. Combine the saved translations into a dictionary with make_dictionary().
The generated dictionary can be used as an additional source for future translations. This creates a cycle: each new translation improves the dictionary, and the improved dictionary facilitates the next translation.
The package includes the example text “Enki and the World Order”, a Sumerian myth. The text is stored as a text file:
path <- system.file("extdata", "enki_and_the_world_order.txt", package = "sumer")
text <- readLines(path, encoding = "UTF-8")

The first few lines look like this:
cat(text[1:5], sep = "\n")
#>
#> 1) ๐๐ค๐ฒ๐ญ๐ ๐ช๐
๐
๐ผ๐พ
#> 2) ๐๐๐ญ๐๐ ๐๐ฎ๐๐๐๐ ๐ฒ๐๐
๐
#> 3) ๐ฉ๐
๐ต๐ณ๐ฒ๐ญ๐๐ค๐ท๐ ๐๐ญ๐ฌ๐ต
#> 4) ๐๐๐ฉ๐ช๐๐๐๐๐ณ๐ณ๐ซ๐
๐ท

Each line can optionally begin with a line number (e.g., 1)\t...). Lines starting with # are treated as comments and ignored during analysis. The text can be in cuneiform or transliteration; in the latter case, it can be converted automatically.
A good first step when working with a new text is to search for frequently recurring sign combinations (n-grams). Such patterns are valuable clues: if a certain sequence of cuneiform signs appears repeatedly, it is likely a fixed term, a compound word, or an idiomatic expression.
freq <- ngram_frequencies(text, min_freq = c(6, 4, 2))
head(freq, 10)
#> frequency length combination
#> 1 2 20 ๐ญ๐๐ ๐๐ช๐๐ค๐
๐ฒ๐พ๐ณ๐ช๐ฒ๐ฃ๐๐๐๐๐พ๐
#> 2 2 16 ๐ญ๐๐ค๐๐ณ๐ณ๐๐๐๐ค๐ ๐ ๐ถ๐พ๐๐บ
#> 3 2 15 ๐๐ฌ๐ง๐ป๐๐ฌ๐ต๐๐ญ๐ฎ๐๐ด๐๐๐ญ
#> 4 2 15 ๐๐พ๐๐๐ญ๐ฒ๐๐พ๐๐๐๐ญ๐๐๐
#> 5 3 14 ๐๐ณ๐๐๐ญ๐๐ค๐ฒ๐ค๐ป๐
๐ท๐๐
#> 6 2 14 ๐ฃ๐ ๐ ๐ฌ๐ ๐๐๐ท๐ธ๐จ๐ค๐๐พ๐
#> 7 2 14 ๐ฅ๐
๐ง๐ฒ๐๐๐๐๐๐ฌ๐๐บ๐ธ๐บ
#> 8 2 12 ๐ค๐ฌ๐ง๐๐พ๐๐ค๐ ๐
๐๐๐ญ
#> 9 11 10 ๐ญ๐๐ ๐ค๐ ๐๐๐ช๐
๐บ
#> 10 2 10 ๐ฌ๐ญ๐น๐จ๐๐ฅ๐๐ฌ๐จ๐

The min_freq parameter controls the minimum frequency
for different n-gram lengths. The default value c(6, 4, 2)
means: single signs must occur at least 6 times, pairs at least 4 times,
and all longer combinations at least 2 times. Depending on the length of
the text, these thresholds can be adjusted.
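For a short text, for example, lower thresholds keep rarer combinations visible. The values below are illustrative only, not a recommendation:

```r
# Illustrative thresholds for a short text: single signs must occur
# at least 3 times, pairs and all longer combinations at least 2 times
freq_short <- ngram_frequencies(text, min_freq = c(3, 2, 2))
```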
The result is a data frame with three columns:
frequency, length (number of signs), and
combination (cuneiform characters).
The analysis works from the longest combinations down to the shortest. When a long combination is identified as frequent, its occurrences are masked so that shorter sub-combinations are not falsely counted as frequent just because they are part of the longer combination.
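The masking principle can be illustrated with plain string operations. This is a simplified sketch with invented sign names, not the package's actual implementation:

```r
# The trigram "A B C" occurs twice; the pair "A B" also occurs twice,
# but only inside the trigram. Masking the trigram's occurrences first
# prevents "A B" from being counted as an independently frequent pair.
txt <- "A B C x A B C y"
masked <- gsub("A B C", "_ _ _", txt, fixed = TRUE)
masked
#> [1] "_ _ _ x _ _ _ y"
```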
With mark_ngrams(), the identified patterns are marked
in the text with curly braces:
text_marked <- mark_ngrams(text, freq)
cat(text_marked[1:5], sep = "\n")
#>
#> 1 ๐๐ค๐ฒ {๐ญ๐ } ๐ช๐
๐
๐ผ๐พ
#> 2 { {๐ {๐๐ญ} ๐} ๐ } ๐๐ฎ๐๐๐๐ {๐ฒ๐๐
๐}
#> 3 {๐ฉ {๐
๐ต} } {๐ณ๐ฒ { { {๐ญ๐} ๐ค} ๐ท} } {๐ ๐} {๐ญ {๐ฌ๐ต} }
#> 4 ๐๐ {๐ฉ {๐ช๐} } {๐๐ {๐๐ณ๐ณ} ๐ซ๐
๐ท}

In the output, recurring sign combinations are highlighted with {...}. This makes patterns visible that are easily overlooked when reading the raw cuneiform text.
You can also search for a specific pattern in the annotated text. To
do this, convert the search term with mark_ngrams() into
the same format and then search with grepl():
term <- "IGI.DIB.TU"
pattern <- mark_ngrams(term, freq)
pattern
#> [1] " { {๐
๐ณ} ๐
} "
result <- text_marked[grepl(pattern, text_marked, fixed = TRUE)]
cat(result, sep = "\n")
#> 12 ๐ { {๐
๐ณ} ๐
} ๐ป๐
{ { {๐
๐ณ} ๐
} ๐}
#> 13 ๐พ { {๐
๐ณ} ๐
} ๐พ๐ { { {๐
๐ณ} ๐
} ๐}
#> 53 {๐ญ๐ก๐ถ๐ท๐ญ} ๐ {๐ฃโจฤA2โฉ { {๐๐บ} ๐} } ๐ข {๐ฃ๐ถ { {๐
๐ณ} ๐
} }
#> 54 ๐๐ฐ { {๐๐บ} ๐} ๐ซ {๐ฃ๐ถ { {๐
๐ณ} ๐
} }
#> 55 ๐ {๐ฃโจฤA2โฉ { {๐๐บ} ๐} } ๐ง {๐ฃ๐ถ { {๐
๐ณ} ๐
} }
#> 80 { { {๐
๐ณ} ๐
} ๐} ๐๐ {๐ญ {๐ฌ๐ต} } {๐จ๐}
#> 196 ๐ฃ๐ฃ๐ ๐ญ { {๐
๐ณ} ๐
} ๐๐ญ๐ถ๐๐ก๐ผ๐ท
#> 197 { {๐ { {๐
๐ณ} ๐
} } ๐ฝ๐ฃ๐๐}
#> 198 { {๐ { {๐
๐ณ} ๐
} } ๐๐ {๐ท๐ท} }
#> 258 { {๐๐} ๐ฆ๐๐ผ} ๐ ๐ฒ๐ถ๐ฎ๐
๐พ { {๐
๐ณ} ๐
} ๐ {๐ฌ๐} โฆ
#> 280 ๐ป๐๐๐ { {๐
๐ณ} ๐
} {๐ก {๐๐บ} }
#> 296 {๐ฃ๐ฒ} XXX { {๐
๐ณ} ๐
} โฆ
#> 298 ๐๐ป๐ญ๐๐ป { {๐
๐ณ} ๐
} โฆ
#> 402 {๐ { {๐
๐ณ} ๐
} } {๐ { {๐
๐ณ} ๐
} } ๐ {๐๐๐ { {๐ถ๐} ๐
} }
#> 410 { {๐ { {๐
๐ณ} ๐
} } ๐ฝ๐ฃ๐๐}
#> 411 { {๐ { {๐
๐ณ} ๐
} } ๐๐ {๐ท๐ท} } ๐๐พ { {๐ถ๐} ๐
}

This finds all lines where the pattern IGI.DIB.TU occurs, including where it is embedded within larger n-grams.
To understand the structure of a sentence, it is helpful to know
which grammatical role each individual sign is likely to play. The
function sign_grammar() looks up each sign of a string in
the dictionary and counts how often it occurs with each grammatical
type:
dic <- read_dictionary()
#> ###---------------------------------------------------------------
#> ### Sumerian Dictionary
#> ###
#> ### Author: Robin Wellmann
#> ### Year: 2026
#> ### Version: 0.5
#> ### Watch for Updates: https://founder-hypothesis.com/en/sumerian-mythology/downloads/
#> ###---------------------------------------------------------------
sg <- sign_grammar("a-ma-ru ba-ur3 ra", dic)

The result is a data frame with one row per sign per grammatical
type. The n column indicates how often this sign is
attested with the respective type in the dictionary.
The raw frequencies from the dictionary can be refined into probabilities using a Bayesian model. The first step is to compute the prior distribution of types across all signs in the dictionary.
The sentence_prob parameter corrects a systematic bias:
if a dictionary was primarily built from noun phrases (rather than
complete sentences), verbs are underrepresented in it. A value of
sentence_prob = 0.25 means that an estimated 25% of the
dictionary entries come from complete sentences. Verb probabilities are
then upweighted accordingly.
Next, grammar_probs() computes the posterior
probabilities for each sign:
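The exact signature of grammar_probs() is not shown here; a plausible call, with the argument names assumed, might look like this:

```r
# Assumed arguments; consult ?grammar_probs for the actual interface
probs <- grammar_probs(sg, dic, sentence_prob = 0.25)
```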
For signs with many dictionary entries, the observed frequencies dominate; for rare signs, the result falls back to the prior distribution.
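The principle can be sketched as Dirichlet-style smoothing. This illustrates the idea only; the prior, the counts, and the prior strength are made up and this is not the package's exact model:

```r
prior <- c(S = 0.5, V = 0.3, Op = 0.2)  # hypothetical prior over types
n     <- c(S = 8,   V = 1,   Op = 0)    # hypothetical counts for one sign
alpha <- 2                              # assumed prior strength
posterior <- (n + alpha * prior) / (sum(n) + alpha)
round(posterior, 2)
#>    S    V   Op
#> 0.82 0.15 0.04
```

With no attestations at all (n entirely zero), the posterior equals the prior; with many attestations, the observed frequencies dominate.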
The function plot_sign_grammar() presents the results as
a stacked bar chart:
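A sketch of the call, where probs stands in for the posterior probabilities computed by grammar_probs():

```r
# probs: posterior grammar probabilities (name assumed)
plot_sign_grammar(probs)
```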
Each bar represents a sign position in the sentence. The colours represent grammatical types: green for nouns (S), red shades for verbs (V) and verb operators, blue shades for attribute operators, and orange for other operators. A tall bar in a particular colour indicates that the sign likely has that grammatical function.
The chart can also be saved to a file:
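Without assuming a dedicated file argument, a standard graphics device does the job. This assumes the chart uses base graphics; a ggplot2-based chart would additionally need print():

```r
# Hypothetical file name; probs as above
png("grammar_line_003.png", width = 800, height = 400)
plot_sign_grammar(probs)
dev.off()
```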
The function translate() reaches its full potential when
used together with an entire text. Instead of a string, you pass a line
number and the text:
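For example, to open the gadget on the third line of the loaded text:

```r
result <- translate(3, text = text)
```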
The gadget opens with the third line of the text and has access to the full text. This provides three additional information sources:
N-grams: Frequent sign combinations computed from the entire text that appear in the current line. Additionally, n-grams that appear in both the current line and the neighbouring lines are marked with a checkmark in the Theme column; these are thematic connections.
Context: The neighbouring lines (up to 2 before and 2 after the current line), with marked n-grams. This shows at a glance which patterns repeat across line boundaries.
Grammar: The bar chart of grammar probabilities for the current line.
In the input field of the translation section, you can adjust the bracket structure of the sentence. This is particularly important when you have identified fixed terms or compound words:
<d-en-ki>(e2-gal)

After clicking “Update Skeleton”, the template is regenerated while preserving all previously entered translations.
When you click “Done”, translate() returns a skeleton object. This can be saved as a text file:
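If the skeleton prints as plain text lines, base R is sufficient; whether the package offers a dedicated save function is not assumed here:

```r
# Assumes the skeleton object is (coercible to) a character vector
writeLines(as.character(result), "line_003.txt")
```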
The saved result is a text file in skeleton format (pipe format) that can be used as input for dictionary creation.
If you have already created your own dictionaries (see next chapter), they can be passed as additional sources:
result <- translate(3, text = text,
dic = c(system.file("extdata", "sumer-dictionary.txt", package = "sumer"),
"my_dictionary.txt"))The dictionaries are searched in the order specified. When automatically pre-filling the translation template, the first dictionary that contains an entry for a given substring wins. In the dictionary panel of the gadget, entries from all dictionaries are displayed side by side.
The skeleton files generated by translate() use the pipe
format, which serves directly as input for dictionary creation. Every
line starting with | is used as a dictionary entry:
|reading=SIGN_NAME=cuneiform:type:translation
A typical file looks like this:
an-en-ki-ki-a-ig-e2-kur-ra: SEN: The god Enki transforms the Earth. The one who establishes sustenance of human existence utilizes a supplier of energy from a distant place (the E-Kur temple).
|an-en-ki=AN.EN.KI=๐ญ๐๐ : S: god Enki
|ki-a=KI.A=๐ ๐: V: to transform the Earth
| ki=KI=๐ : S: Earth
| a=A=๐: S’->V: to transform S
|ig=IG=๐: S: one who establishes the sustenance of human existence
|e2-kur-ra=E2.KUR.RA=๐๐ณ๐: V: to utilize a supplier of energy from a distant place
| e2-kur=E2.KUR=๐๐ณ: S: supplier of energy from a distant place
| e2=E2=๐: ’S->S: supplier of energy from S
| kur=KUR=๐ณ: S: distant place
| ra=RA=๐: Sโ->V: to utilize S
Header lines (those without a leading |) and blank lines are ignored when reading the file. Only lines starting with | become dictionary entries.
Once you have translated several lines of a text and saved them to
files, these can be combined into a dictionary with
make_dictionary(). The function accepts a vector of file
paths:
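For example, combining several saved translation files (the file names are illustrative):

```r
dictionary <- make_dictionary(c("line_003.txt", "line_004.txt"))
```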
The function reads the translation files, aggregates entries (identical sign-type-translation combinations are counted together), and adds cuneiform characters and readings. The result is a data frame in dictionary format.
Internally, make_dictionary() performs two steps that
can also be called individually:
# Step 1: Read translation files
translations <- read_translated_text("line_003.txt")
# Step 2: Convert to dictionary format
dictionary <- convert_to_dictionary(translations)

The intermediate step is useful if you want to edit the translations before conversion, for example to unify spelling conventions.
The completed dictionary can be saved with metadata:
save_dictionary(
dic = dictionary,
file = "my_dictionary.txt",
author = "My Name",
year = "2026",
version = "1.0",
url = "https://example.com/dictionary"
)

And loaded again later:
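Assuming read_dictionary() also accepts a file path (earlier it is called without arguments to load the bundled dictionary):

```r
my_dic <- read_dictionary("my_dictionary.txt")
```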
The custom dictionary can now be used in further translation work:
# For lookup
look_up("lugal", my_dic)
# For interactive translation
result <- translate(4, text = text, dic = "my_dictionary.txt")

With each translated line, the dictionary grows. Frequent signs and expressions accumulate higher counts over time, and the automatic pre-filling of translation templates becomes increasingly accurate. In this way, you gradually build a comprehensive dictionary database based on your own texts.