The first vignette (“Getting Started with sumer”) introduced the basic functions of the package: sign conversion, dictionary lookup, and translation templates for individual lines. This vignette describes the complete workflow for working with entire texts.
The workflow consists of the following steps:

1. Load the text and identify frequently recurring sign combinations (n-grams).
2. Mark the identified n-grams in the text.
3. Analyse the likely grammatical role of each sign.
4. Translate individual lines interactively with translate() and save the results.
5. Combine the saved translations into a dictionary with make_dictionary().
The generated dictionary can be used as an additional source for future translations. This creates a cycle: each new translation improves the dictionary, and the improved dictionary facilitates the next translation.
The package includes the example text “Enki and the World Order”, a Sumerian myth. The text is stored as a text file:
path <- system.file("extdata", "enki_and_the_world_order.txt", package = "sumer")
text <- readLines(path, encoding = "UTF-8")

The first few lines look like this:
cat(text[1:5], sep = "\n")
#>
#> 1) ๐๐ค๐ฒ๐ญ๐ ๐ช๐
๐
๐ผ๐พ
#> 2) ๐๐๐ญ๐๐ ๐๐ฎ๐๐๐๐ ๐ฒ๐๐
๐
#> 3) ๐ฉ๐
๐ต๐ณ๐ฒ๐ญ๐๐ค๐ท๐ ๐๐ญ๐ฌ๐ต
#> 4) ๐๐๐ฉ๐ช๐๐๐๐๐ณ๐ณ๐ซ๐
๐ท

Each line can optionally begin with a line number (e.g., 1)\t...). Lines starting with # are treated as comments and ignored during analysis. The text can be in cuneiform or transliteration; in the latter case, it can be converted automatically.
A good first step when working with a new text is to search for frequently recurring sign combinations (n-grams). Such patterns are valuable clues: if a certain sequence of cuneiform signs appears repeatedly, it is likely a fixed term, a compound word, or an idiomatic expression.
freq <- ngram_frequencies(text, min_freq = c(6, 4, 2))
head(freq, 10)
#> frequency length combination
#> 1 2 20 ๐ญ๐๐ ๐๐ช๐๐ค๐
๐ฒ๐พ๐ณ๐ช๐ฒ๐ฃ๐๐๐๐๐พ๐
#> 2 2 16 ๐ญ๐๐ค๐๐ณ๐ณ๐๐๐๐ค๐ ๐ ๐ถ๐พ๐๐บ
#> 3 2 15 ๐๐ฌ๐ง๐ป๐๐ฌ๐ต๐๐ญ๐ฎ๐๐ด๐๐๐ญ
#> 4 2 15 ๐๐พ๐๐๐ญ๐ฒ๐๐พ๐๐๐๐ญ๐๐๐
#> 5 3 14 ๐๐ณ๐๐๐ญ๐๐ค๐ฒ๐ค๐ป๐
๐ท๐๐
#> 6 2 14 ๐ฃ๐ ๐ ๐ฌ๐ ๐๐๐ท๐ธ๐จ๐ค๐๐พ๐
#> 7 2 14 ๐ฅ๐
๐ง๐ฒ๐๐๐๐๐๐ฌ๐๐บ๐ธ๐บ
#> 8 2 12 ๐ค๐ฌ๐ง๐๐พ๐๐ค๐ ๐
๐๐๐ญ
#> 9 11 10 ๐ญ๐๐ ๐ค๐ ๐๐๐ช๐
๐บ
#> 10 2 10 ๐ฌ๐ญ๐น๐จ๐๐ฅ๐๐ฌ๐จ๐

The min_freq parameter controls the minimum frequency
for different n-gram lengths. The default value c(6, 4, 2)
means: single signs must occur at least 6 times, pairs at least 4 times,
and all longer combinations at least 2 times. Depending on the length of
the text, these thresholds can be adjusted.
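For a short text, for example, lower thresholds keep rarer combinations visible. The values below are illustrative only, not a recommendation:

```r
# Illustrative thresholds for a short text: single signs must occur
# at least 3 times, pairs and all longer combinations at least 2 times
freq_short <- ngram_frequencies(text, min_freq = c(3, 2, 2))
```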
The result is a data frame with three columns:
frequency, length (number of signs), and
combination (cuneiform characters).
The analysis works from the longest combinations down to the shortest. When a long combination is identified as frequent, its occurrences are masked so that shorter sub-combinations are not falsely counted as frequent just because they are part of the longer combination.
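The masking principle can be illustrated with plain string operations. This is a simplified sketch with invented sign names, not the package's actual implementation:

```r
# The trigram "A B C" occurs twice; the pair "A B" also occurs twice,
# but only inside the trigram. Masking the trigram's occurrences first
# prevents "A B" from being counted as an independently frequent pair.
txt <- "A B C x A B C y"
masked <- gsub("A B C", "_ _ _", txt, fixed = TRUE)
masked
#> [1] "_ _ _ x _ _ _ y"
```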
With mark_ngrams(), the identified patterns are marked
in the text with curly braces:
text_marked <- mark_ngrams(text, freq)
cat(text_marked[1:5], sep = "\n")
#>
#> 1 ๐๐ค๐ฒ {๐ญ๐ } ๐ช๐
๐
๐ผ๐พ
#> 2 { {๐ {๐๐ญ} ๐} ๐ } ๐๐ฎ๐๐๐๐ {๐ฒ๐๐
๐}
#> 3 {๐ฉ {๐
๐ต} } {๐ณ๐ฒ { { {๐ญ๐} ๐ค} ๐ท} } {๐ ๐} {๐ญ {๐ฌ๐ต} }
#> 4 ๐๐ {๐ฉ {๐ช๐} } {๐๐ {๐๐ณ๐ณ} ๐ซ๐
๐ท}

In the output, recurring sign combinations are highlighted with {...}. This makes patterns visible that are easily overlooked when reading the raw cuneiform text.
You can also search for a specific pattern in the annotated text. To
do this, convert the search term with mark_ngrams() into
the same format and then search with grepl():
term <- "IGI.DIB.TU"
pattern <- mark_ngrams(term, freq)
pattern
#> [1] " { {๐
๐ณ} ๐
} "
result <- text_marked[grepl(pattern, text_marked, fixed = TRUE)]
cat(result, sep = "\n")
#> 12 ๐ { {๐
๐ณ} ๐
} ๐ป๐
{ { {๐
๐ณ} ๐
} ๐}
#> 13 ๐พ { {๐
๐ณ} ๐
} ๐พ๐ { { {๐
๐ณ} ๐
} ๐}
#> 53 {๐ญ๐ก๐ถ๐ท๐ญ} ๐ {๐ฃโจฤA2โฉ { {๐๐บ} ๐} } ๐ข {๐ฃ๐ถ { {๐
๐ณ} ๐
} }
#> 54 ๐๐ฐ { {๐๐บ} ๐} ๐ซ {๐ฃ๐ถ { {๐
๐ณ} ๐
} }
#> 55 ๐ {๐ฃโจฤA2โฉ { {๐๐บ} ๐} } ๐ง {๐ฃ๐ถ { {๐
๐ณ} ๐
} }
#> 80 { { {๐
๐ณ} ๐
} ๐} ๐๐ {๐ญ {๐ฌ๐ต} } {๐จ๐}
#> 196 ๐ฃ๐ฃ๐ ๐ญ { {๐
๐ณ} ๐
} ๐๐ญ๐ถ๐๐ก๐ผ๐ท
#> 197 { {๐ { {๐
๐ณ} ๐
} } ๐ฝ๐ฃ๐๐}
#> 198 { {๐ { {๐
๐ณ} ๐
} } ๐๐ {๐ท๐ท} }
#> 258 { {๐๐} ๐ฆ๐๐ผ} ๐ ๐ฒ๐ถ๐ฎ๐
๐พ { {๐
๐ณ} ๐
} ๐ {๐ฌ๐} โฆ
#> 280 ๐ป๐๐๐ { {๐
๐ณ} ๐
} {๐ก {๐๐บ} }
#> 296 {๐ฃ๐ฒ} XXX { {๐
๐ณ} ๐
} โฆ
#> 298 ๐๐ป๐ญ๐๐ป { {๐
๐ณ} ๐
} โฆ
#> 402 {๐ { {๐
๐ณ} ๐
} } {๐ { {๐
๐ณ} ๐
} } ๐ {๐๐๐ { {๐ถ๐} ๐
} }
#> 410 { {๐ { {๐
๐ณ} ๐
} } ๐ฝ๐ฃ๐๐}
#> 411 { {๐ { {๐
๐ณ} ๐
} } ๐๐ {๐ท๐ท} } ๐๐พ { {๐ถ๐} ๐
}

This finds all lines where the pattern IGI.DIB.TU occurs, including where it is embedded within larger n-grams.
To understand the structure of a sentence, it is helpful to know
which grammatical role each individual sign is likely to play. The
function sign_grammar() looks up each sign of a string in
the dictionary and counts how often it occurs with each grammatical
type:
dic <- read_dictionary()
#> ###---------------------------------------------------------------
#> ### Sumerian Dictionary
#> ###
#> ### Author: Robin Wellmann
#> ### Year: 2026
#> ### Version: 0.5
#> ### Watch for Updates: https://founder-hypothesis.com/en/sumerian-mythology/downloads/
#> ###---------------------------------------------------------------
sg <- sign_grammar("a-ma-ru ba-ur3 ra", dic)

The result is a data frame with one row per sign per grammatical
type. The n column indicates how often this sign is
attested with the respective type in the dictionary.
The raw frequencies from the dictionary can be refined into probabilities using a Bayesian model. The first step is to compute the prior distribution of types across all signs in the dictionary.
The sentence_prob parameter corrects a systematic bias:
if a dictionary was primarily built from noun phrases (rather than
complete sentences), verbs are underrepresented in it. A value of
sentence_prob = 0.25 means that an estimated 25% of the
dictionary entries come from complete sentences. Verb probabilities are
then upweighted accordingly.
Next, grammar_probs() computes the posterior
probabilities for each sign:
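The exact signature of grammar_probs() is not shown here; a plausible call, with the argument names assumed, might look like this:

```r
# Assumed arguments; consult ?grammar_probs for the actual interface
probs <- grammar_probs(sg, dic, sentence_prob = 0.25)
```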
For signs with many dictionary entries, the observed frequencies dominate; for rare signs, the result falls back to the prior distribution.
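The principle can be sketched as Dirichlet-style smoothing. This illustrates the idea only; the prior, the counts, and the prior strength are made up and this is not the package's exact model:

```r
prior <- c(S = 0.5, V = 0.3, Op = 0.2)  # hypothetical prior over types
n     <- c(S = 8,   V = 1,   Op = 0)    # hypothetical counts for one sign
alpha <- 2                              # assumed prior strength
posterior <- (n + alpha * prior) / (sum(n) + alpha)
round(posterior, 2)
#>    S    V   Op
#> 0.82 0.15 0.04
```

With no attestations at all (n entirely zero), the posterior equals the prior; with many attestations, the observed frequencies dominate.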
The function plot_sign_grammar() presents the results as
a stacked bar chart:
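A sketch of the call, where probs stands in for the posterior probabilities computed by grammar_probs():

```r
# probs: posterior grammar probabilities (name assumed)
plot_sign_grammar(probs)
```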
Each bar represents a sign position in the sentence. The colours represent grammatical types: green for nouns (S), red shades for verbs (V) and verb operators, blue shades for attribute operators, and orange for other operators. A tall bar in a particular colour indicates that the sign likely has that grammatical function.
The chart can also be saved to a file:
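Without assuming a dedicated file argument, a standard graphics device does the job. This assumes the chart uses base graphics; a ggplot2-based chart would additionally need print():

```r
# Hypothetical file name; probs as above
png("grammar_line_003.png", width = 800, height = 400)
plot_sign_grammar(probs)
dev.off()
```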
The function translate() reaches its full potential when
used together with an entire text. Instead of a string, you pass a line
number and the text:
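For example, to open the gadget on the third line of the loaded text:

```r
result <- translate(3, text = text)
```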
The gadget opens with the third line of the text and has access to the full text. This provides three additional information sources:
N-grams: Frequent sign combinations computed from the entire text that appear in the current line. Additionally, n-grams that appear in both the current line and the neighbouring lines are marked with a checkmark in the Theme column; these are thematic connections.
Context: The neighbouring lines (up to 2 before and 2 after the current line), with marked n-grams. This shows at a glance which patterns repeat across line boundaries.
Grammar: The bar chart of grammar probabilities for the current line.
In the input field of the translation section, you can adjust the bracket structure of the sentence. This is particularly important when you have identified fixed terms or compound words:
<d-en-ki>(e2-gal)

After clicking “Update Skeleton”, the template is regenerated while preserving all previously entered translations.
When you click “Done”, translate() returns a skeleton object. This can be saved as a text file:
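If the skeleton prints as plain text lines, base R is sufficient; whether the package offers a dedicated save function is not assumed here:

```r
# Assumes the skeleton object is (coercible to) a character vector
writeLines(as.character(result), "line_003.txt")
```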
The saved result is a text file in skeleton format (pipe format) that can be used as input for dictionary creation.
If you have already created your own dictionaries (see next chapter), they can be passed as additional sources:
result <- translate(3, text = text,
dic = c(system.file("extdata", "sumer-dictionary.txt", package = "sumer"),
"my_dictionary.txt"))The dictionaries are searched in the order specified. When automatically pre-filling the translation template, the first dictionary that contains an entry for a given substring wins. In the dictionary panel of the gadget, entries from all dictionaries are displayed side by side.
The skeleton files generated by translate() use the pipe
format, which serves directly as input for dictionary creation. Every
line starting with | is used as a dictionary entry:
|reading=SIGN_NAME=cuneiform:type:translation
A typical file looks like this:
an-en-ki-ki-a-ig-e2-kur-ra: SEN: The god Enki transforms the Earth. The one who establishes sustenance of human existence utilizes a supplier of energy from a distant place (the E-Kur temple).
|an-en-ki=AN.EN.KI=๐ญ๐๐ : S: god Enki
|ki-a=KI.A=๐ ๐: V: to transform the Earth
| ki=KI=๐ : S: Earth
| a=A=๐: S’->V: to transform S
|ig=IG=๐: S: one who establishes the sustenance of human existence
|e2-kur-ra=E2.KUR.RA=๐๐ณ๐: V: to utilize a supplier of energy from a distant place
| e2-kur=E2.KUR=๐๐ณ: S: supplier of energy from a distant place
| e2=E2=๐: ’S->S: supplier of energy from S
| kur=KUR=๐ณ: S: distant place
| ra=RA=๐: Sโ->V: to utilize S
Header lines (those without a leading |) and blank lines are ignored when reading the file. Only lines starting with | become dictionary entries.
Once you have translated several lines of a text and saved them to
files, these can be combined into a dictionary with
make_dictionary(). The function accepts a vector of file
paths:
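For example, combining several saved translation files (the file names are illustrative):

```r
dictionary <- make_dictionary(c("line_003.txt", "line_004.txt"))
```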
The function reads the translation files, aggregates entries (identical sign-type-translation combinations are counted together), and adds cuneiform characters and readings. The result is a data frame in dictionary format.
Internally, make_dictionary() performs two steps that
can also be called individually:
# Step 1: Read translation files
translations <- read_translated_text("line_003.txt")
# Step 2: Convert to dictionary format
dictionary <- convert_to_dictionary(translations)

The intermediate step is useful if you want to edit the translations before conversion, for example to unify spelling conventions.
The completed dictionary can be saved with metadata:
save_dictionary(
dic = dictionary,
file = "my_dictionary.txt",
author = "My Name",
year = "2026",
version = "1.0",
url = "https://example.com/dictionary"
)

And loaded again later:
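Assuming read_dictionary() also accepts a file path (earlier it is called without arguments to load the bundled dictionary):

```r
my_dic <- read_dictionary("my_dictionary.txt")
```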
The custom dictionary can now be used in further translation work:
# For lookup
look_up("lugal", my_dic)
# For interactive translation
result <- translate(4, text = text, dic = "my_dictionary.txt")

With each translated line, the dictionary grows. Frequent signs and expressions accumulate higher counts over time, and the automatic pre-filling of translation templates becomes increasingly accurate. In this way, you gradually build a comprehensive dictionary database based on your own texts.