Grimpy101 / slovene-english-dictionary-TEI

A Slovene-to-English dictionary for FreeDict and other uses

Slovene-English Dictionary

This is an attempt at a Slovene-English dictionary, intended for the FreeDict project and other similar uses.

NOTE: This is still heavily in development. I have yet to write the code that converts the current XML files into a TEI dictionary.

Project structure

The project's main content files are stored inside the xml folder. It holds multiple files, each representing one section of the dictionary. For example, slv_eng-a.xml contains all entries that start with the letter "a", and so on. The files are written in TEI format but are kept as plain XML files for ease of use and editing.

XML file structure

The text below is a "crash course" rather than a detailed or fully accurate reference. For much more on how TEI files are structured, I suggest this documentation: https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html#collocates. Alternatively, the FreeDict project has a wiki that explains a few things as well: https://github.com/freedict/fd-dictionaries/wiki.

NOTE: The values and types I mention below are my own restrictions. The lists can be extended and changed.

Each XML file contains entries, which hold information on words and phrases. This is not a complete TEI structure yet, since some information is missing, but that will be added later programmatically.

In XML and similar languages for storing information, data is wrapped in tags. These can also have attributes that define additional information. Tags can contain plain-text data and/or other tags, which have their own data.

Below is an example of a TEI dictionary entry as used in this project:

<entry xml:id="a">
    <form type="lemma">
        <orth>a</orth>
    </form>
    <gramGrp>
        <gram type="pos">conj.</gram>
    </gramGrp>
    <sense xml:id="a.1">
        <usg type="dom">Lit.</usg>
        <cit type="trans">
            <quote>but</quote>
            <quote>however</quote>
        </cit>
        <cit type="example">
            <quote>Iščejo dom, a ga ne najdejo.</quote>
            <cit type="trans">
                <quote>They are searching for home but cannot find it.</quote>
            </cit>
        </cit>
    </sense>
</entry>
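
As a rough illustration of how such an entry could be read by a program, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It is not part of this project; the variable names are made up for the example, and the entry is a shortened copy of the one above.

```python
import xml.etree.ElementTree as ET

# xml:id lives in the predefined XML namespace, so ElementTree exposes it
# under this expanded attribute name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened copy of the example entry (the usage example is left out).
ENTRY = """
<entry xml:id="a">
    <form type="lemma">
        <orth>a</orth>
    </form>
    <gramGrp>
        <gram type="pos">conj.</gram>
    </gramGrp>
    <sense xml:id="a.1">
        <usg type="dom">Lit.</usg>
        <cit type="trans">
            <quote>but</quote>
            <quote>however</quote>
        </cit>
    </sense>
</entry>
"""

entry = ET.fromstring(ENTRY)
entry_id = entry.get(XML_ID)
headword = entry.findtext("form[@type='lemma']/orth")
pos = entry.findtext("gramGrp/gram[@type='pos']")
translations = [q.text for q in entry.findall("sense/cit[@type='trans']/quote")]

print(entry_id, headword, pos, translations)
# a a conj. ['but', 'however']
```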

Entry

The entry tag marks a new entry in the dictionary. It usually has only the xml:id attribute, which acts as a unique ID for the entry. The ID is usually just the word or phrase of the entry.

If there are two entries with the same word, the IDs should be written with a .x suffix, where x represents an integer. For example, there are two entries with the name atika, so we add a suffix, and the resulting IDs of the entries are atika.1 and atika.2.
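
The suffix rule can be sketched in Python as follows. This is only an illustration under the assumption that the base IDs are already known; the function name add_duplicate_suffixes is hypothetical and not part of the project.

```python
from collections import Counter


def add_duplicate_suffixes(base_ids):
    """Append .1, .2, ... to IDs that occur more than once; leave unique IDs alone."""
    totals = Counter(base_ids)  # how often each base ID occurs overall
    seen = Counter()            # how many copies of each ID we have numbered so far
    result = []
    for base in base_ids:
        if totals[base] > 1:
            seen[base] += 1
            result.append(f"{base}.{seen[base]}")
        else:
            result.append(base)
    return result


print(add_duplicate_suffixes(["atika", "a", "atika"]))
# ['atika.1', 'a', 'atika.2']
```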

There are some suggested conventions to follow when writing IDs:

  • always use lowercase letters,
  • replace any spaces with underscores (example: abonirati se -> abonirati_se),
  • replace non-English letters with their ASCII representations where possible (example: užaloščen -> uzaloscen).
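
The three conventions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the project; the function name make_entry_id is hypothetical, and Unicode decomposition is just one possible way to get the ASCII representation.

```python
import unicodedata


def make_entry_id(headword: str) -> str:
    """Turn a headword into an ID following the conventions listed above."""
    # 1. Always use lowercase letters.
    text = headword.strip().lower()
    # 2. Replace spaces with underscores.
    text = text.replace(" ", "_")
    # 3. Replace non-English letters with ASCII where possible:
    #    NFKD splits "ž" into "z" plus a combining caron, and the
    #    ASCII encode step then drops the combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(make_entry_id("abonirati se"))  # abonirati_se
print(make_entry_id("užaloščen"))     # uzaloscen
```

Letters that have no Unicode decomposition are dropped entirely by this approach, so such cases would need a hand-written replacement table.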

An entry usually contains form, gramGrp, and sense tags.

Form

The form tag contains much of the information on the original word or phrase. Its type attribute tells what kind of form it describes; see the table below.

A form can contain an orth tag, which holds the actual word or phrase, as well as a gramGrp tag for any additional grammatical properties that hold true only for this particular form.

A table of some type values:

| Value | Meaning |
| --- | --- |
| lemma | The headword - main word that represents the entry |
| inflected | Word in other than usual dictionary form |
| variant | A variant form |
| simple | A single free lexical item |
| compound | Word formed from simple lexical items |
| derivative | Word derived from headword |
| phrase | Multiple-word lexical item |
| paradigm | A collection of inflected forms |

Grammatical Group

The gramGrp tag groups together grammatical properties that define the word or phrase in question. The tag can sit directly in the entry (see the example above), in which case it holds true for all forms in the entry, or inside a form tag, in which case it applies only to that particular form.

A gramGrp contains one or more gram tags. Each gram tag has a type attribute that specifies what kind of grammatical property it holds. Below is a table of some of these types and values.

| Type | About | Values |
| --- | --- | --- |
| pos | Defines the type of word (noun, verb...) | n. (noun), v. (verb), adj. (adjective), conj. (conjunction), adv. (adverb), int. (interjection), prep. (preposition), pron. (pronoun), art. (article), num. (numeral), pref. (prefix) |
| case | Defines the case of the word | nom. (nominative), gen. (genitive), dat. (dative), acc. (accusative), loc. (locative), instr. (instrumental) |
| gender | Defines the gender of the word | m. (masculine), f. (feminine), n. (neuter) |
| mood | Defines the mood of the verb | indic. (indicative), imper. (imperative), condit. (conditional) |
| number | Defines the number of the word | sg. (singular), pl. (plural), du. (dual) |
| per | Defines the person of the verb | 1st, 2nd, 3rd |
| tns | Tense | Present, Future, Past |
| colloc | A collocate: any sequence of words that co-occurs with the headword with significant frequency | example: [+ conj.] |
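
To show how entry-wide and form-specific gramGrp tags could be combined, here is a small Python sketch. It assumes that a property given inside a form overrides the entry-wide value of the same type; that reading is my interpretation of the rule above, not something the project defines. The helper names are hypothetical, and the XML is a shortened copy of the aktuar entry shown later in this document.

```python
import xml.etree.ElementTree as ET

ENTRY = """
<entry xml:id="aktuar">
    <form type="lemma">
        <orth>aktuar</orth>
    </form>
    <form type="variant">
        <orth>aktuarka</orth>
        <gramGrp>
            <gram type="gender">f.</gram>
        </gramGrp>
    </form>
    <gramGrp>
        <gram type="pos">n.</gram>
        <gram type="gender">m.</gram>
        <gram type="number">sg.</gram>
    </gramGrp>
</entry>
"""


def grams(parent):
    """Collect {type: value} from the gramGrp that is a direct child of parent."""
    return {g.get("type"): g.text for g in parent.findall("gramGrp/gram")}


def grammar_by_form(entry):
    """Map each orth to the entry-wide properties overlaid with form-specific ones."""
    shared = grams(entry)
    return {
        form.findtext("orth"): {**shared, **grams(form)}  # assumption: form wins
        for form in entry.findall("form")
    }


result = grammar_by_form(ET.fromstring(ENTRY))
print(result["aktuar"])    # {'pos': 'n.', 'gender': 'm.', 'number': 'sg.'}
print(result["aktuarka"])  # {'pos': 'n.', 'gender': 'f.', 'number': 'sg.'}
```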

Sense

The sense tag contains information on the English counterpart of the Slovene word or phrase. It has its own ID, which is the entry ID with .x added at the end (where x is an integer).

There can be multiple senses in an entry if the word/phrase has many meanings.
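
The ID rule for senses can be checked mechanically. The Python sketch below is only an illustration; the function name bad_sense_ids is hypothetical and the two tiny entries are made up for the test.

```python
import re
import xml.etree.ElementTree as ET

# Expanded name under which ElementTree exposes the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def bad_sense_ids(entry):
    """Return the sense IDs that are not of the form '<entry id>.<integer>'."""
    entry_id = entry.get(XML_ID)
    pattern = re.compile(re.escape(entry_id) + r"\.\d+")
    return [
        sense.get(XML_ID)
        for sense in entry.findall("sense")
        if not pattern.fullmatch(sense.get(XML_ID) or "")
    ]


good = ET.fromstring('<entry xml:id="ah"><sense xml:id="ah.1"/><sense xml:id="ah.2"/></entry>')
bad = ET.fromstring('<entry xml:id="ah"><sense xml:id="ah-1"/></entry>')
print(bad_sense_ids(good))  # []
print(bad_sense_ids(bad))   # ['ah-1']
```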

A list of tags that can be found in a sense:

| Tag | Description |
| --- | --- |
| usg | Defines a type of usage: for example, where the word is used, what kind of situation it is used in, etc. |
| cit | Can contain an actual translation or an example of usage (all of these are stored in quote tags) |
| quote | Holds the data |
| def | Holds any definitions of words; can be used for extra explanation of the word or when there is no proper translation |
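
As a sketch of how these tags fit together, the Python snippet below splits one sense into its translations, its usage examples (each paired with its own translation), and its definitions. The function name read_sense is hypothetical; the XML is the second sense of the ah entry shown later in this document.

```python
import xml.etree.ElementTree as ET

SENSE = """
<sense xml:id="ah.2">
    <cit type="trans">
        <quote>ah</quote>
        <quote>oh</quote>
    </cit>
    <cit type="example">
        <quote>Ah, ti si.</quote>
        <cit type="trans">
            <quote>Oh, it's you.</quote>
        </cit>
    </cit>
    <def>Expresses regret, tiredness.</def>
</sense>
"""


def read_sense(sense):
    """Split a sense into translations, (example, translation) pairs, and definitions."""
    # Only cit tags directly under the sense count as translations of the headword.
    translations = [q.text for q in sense.findall("cit[@type='trans']/quote")]
    # Each example holds its own quote plus a nested translation.
    examples = [
        (ex.findtext("quote"), ex.findtext("cit[@type='trans']/quote"))
        for ex in sense.findall("cit[@type='example']")
    ]
    definitions = [d.text for d in sense.findall("def")]
    return translations, examples, definitions


translations, examples, definitions = read_sense(ET.fromstring(SENSE))
print(translations)  # ['ah', 'oh']
print(examples)      # [('Ah, ti si.', "Oh, it's you.")]
print(definitions)   # ['Expresses regret, tiredness.']
```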

Types of usage:

| Type | Description | Values |
| --- | --- | --- |
| dom | Domain | Adm. (administration), Aero. (aeronautics), Agr. (agriculture), Anat. (anatomy), Antr. (anthropology), Arch. (architecture), Archae. (archaeology), Art, Astr. (astronomy), Bibl. (bibliography), Biol. (biology), Bot. (botany), Buil. (building trade), Chem. (chemistry), Chess, Comp. (computation), Craft. (craftsmanship), Econ. (economy), Engin. (engineering), Film, Fin. (finances), For. (forestry), Gast. (gastronomy), Geol. (geology), Geog. (geography), Hist. (history), Hunt. (hunting), Law, Lit. (literature), Ling. (linguistics), Math. (mathematics), Med. (medicine), Meteo. (meteorology), Milit. (military), Mus. (music), Myth. (mythology), Naut. (nautical), Pedag. (pedagogics), Pharm. (pharmacy), Phil. (philosophy), Phys. (physics), Psych. (psychiatry), Rail. (rail transport), Rel. (religion), Sci. (science), Sport, Tech. (technology), Text. (textile), Theat. (theatre), Vet. (veterinary), War, Zoo. (zoology) |
| plev | Preference level | rare, occas. (occasional) |
| geo | Geographic data | dial. (dialect), Inner Carniola (Notranjska), Upper Carniola (Gorenjska), Lower Carniola (Dolenjska), Littoral Region (Primorje), Styria (Štajerska), Prekmurje, Carinthia (Koroška), White Carniola (Bela krajina) |
| time | Usage by time | archaic, old |
| register | | child. (childlike), slang, lingo, vulgar, formal, casual, affect. (affectionate), colloq. (colloquial), pejor. (pejorative), iron. (ironically) |
| style | | fig. (figurative), lit. (literal) |
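
Since the usage types listed above are my own restriction, a simple check can catch typos in the type attribute. The Python sketch below is an illustration only; the names KNOWN_USG_TYPES and unknown_usg_types are hypothetical, the set would have to be updated whenever the list changes, and the second usg tag in the sample is deliberately wrong.

```python
import xml.etree.ElementTree as ET

# Usage types taken from the list above; the list may be extended later.
KNOWN_USG_TYPES = {"dom", "plev", "geo", "time", "register", "style"}


def unknown_usg_types(root):
    """Return the set of usg type values that are not in KNOWN_USG_TYPES."""
    return {u.get("type") for u in root.iter("usg")} - KNOWN_USG_TYPES


sample = ET.fromstring(
    '<sense xml:id="a.1"><usg type="dom">Lit.</usg><usg type="domain">Lit.</usg></sense>'
)
print(unknown_usg_types(sample))  # {'domain'}
```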

Some more examples of entries

<entry xml:id="ah">
    <form type="lemma">
        <orth>ah</orth>
    </form>
    <gramGrp>
        <gram type="pos">int.</gram>
    </gramGrp>
    <sense xml:id="ah.1">
        <cit type="trans">
            <quote>ah</quote>
            <quote>oh</quote>
        </cit>
        <cit type="example">
            <quote>Ah, seveda!</quote>
            <cit type="trans">
                <quote>Oh, right!</quote>
            </cit>
        </cit>
        <def>Expresses awe, contentment, or when getting an idea or thought.</def>
    </sense>
    <sense xml:id="ah.2">
        <cit type="trans">
            <quote>ah</quote>
            <quote>oh</quote>
        </cit>
        <cit type="example">
            <quote>Ah, ti si.</quote>
            <cit type="trans">
                <quote>Oh, it's you.</quote>
            </cit>
        </cit>
        <def>Expresses regret, tiredness.</def>
    </sense>
</entry>
<entry xml:id="aktuar">
    <form type="lemma">
        <orth>aktuar</orth>
    </form>
    <form type="variant">
        <orth>aktuarka</orth>
        <gramGrp>
            <gram type="gender">f.</gram>
        </gramGrp>
    </form>
    <gramGrp>
        <gram type="pos">n.</gram>
        <gram type="gender">m.</gram>
        <gram type="number">sg.</gram>
    </gramGrp>
    <sense xml:id="aktuar.1">
        <cit type="trans">
            <quote>actuary</quote>
        </cit>
    </sense>
</entry>
<entry xml:id="amortizirati_se">
    <form type="lemma">
        <orth>amortizirati se</orth>
    </form>
    <gramGrp>
        <gram type="pos">v.</gram>
    </gramGrp>
    <sense xml:id="amortizirati_se.1">
        <usg type="dom">Econ.</usg>
        <cit type="trans">
            <quote>to be depreciated</quote>
        </cit>
        <cit type="example">
            <quote>Avto se amortizira v petih letih.</quote>
            <cit type="trans">
                <quote>The car is depreciated in five years.</quote>
            </cit>
        </cit>
    </sense>
</entry>

About

License: MIT License