Slovene-English Dictionary

This is an attempt at a Slovene-English dictionary, intended for the FreeDict project and similar uses.

NOTE: This project is still under heavy development. The code that converts the current XML files into a TEI dictionary has not been written yet.

Project structure

The project's main content files are stored in the xml folder. Each file there holds one section of the dictionary: for example, slv_eng-a.xml contains all entries that start with the letter "a". The files follow the TEI format but are stored as plain XML files for ease of use and editing.
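
As a rough illustration of this layout, the Python sketch below walks the xml folder and counts the entries in each section file. It assumes that every file is well-formed XML with a single root element and no default namespace, which the text above does not guarantee. The function name count_entries is made up for this example and is not part of the project.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def count_entries(xml_dir):
    """Return {file name: number of <entry> elements} for each section file."""
    counts = {}
    # The file name pattern follows the slv_eng-a.xml example above.
    for path in sorted(Path(xml_dir).glob("slv_eng-*.xml")):
        root = ET.parse(path).getroot()
        # iter() also matches the root itself, so this works whatever
        # element the entries happen to be wrapped in (an assumption).
        counts[path.name] = sum(1 for _ in root.iter("entry"))
    return counts
```

Calling count_entries("xml") from the project root would then give a quick overview of how many entries each letter has so far.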

XML file structure

The text below is a crash course rather than a detailed or fully accurate reference. For more on how TEI files are structured, I suggest the TEI Lex-0 documentation: https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html#collocates. Alternatively, the FreeDict project has a wiki that explains a few things as well: https://github.com/freedict/fd-dictionaries/wiki.

NOTE: The values and types mentioned below are my own restrictions. The lists may grow and change.

Each XML file contains entries, which hold information on words and phrases. The files are not complete TEI documents yet, since some information is still missing; it will be added programmatically later.

In XML and similar languages for storing information, data is wrapped in tags. These can also have attributes that define additional information. Tags can contain plain-text data and/or other tags, which have their own data.

Below is an example of a TEI dictionary entry as used in this project:

<entry xml:id="a">
    <form type="lemma">
        <orth>a</orth>
    </form>
    <gramGrp>
        <gram type="pos">conj.</gram>
    </gramGrp>
    <sense xml:id="a.1">
        <usg type="dom">Lit.</usg>
        <cit type="trans">
            <quote>but</quote>
            <quote>however</quote>
        </cit>
        <cit type="example">
            <quote>Iščejo dom, a ga ne najdejo.</quote>
            <cit type="trans">
                <quote>They are searching for home but cannot find it.</quote>
            </cit>
        </cit>
    </sense>
</entry>
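
To show how this structure can be read programmatically, here is a minimal Python sketch that parses the entry above with the standard xml.etree.ElementTree module and pulls out the headword, part of speech, and translations. The function name summarize_entry and the shape of the returned dictionary are assumptions made for illustration; this is not the project's conversion code.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound to this namespace by the XML specification,
# so ElementTree exposes xml:id under the expanded name below.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

ENTRY = """
<entry xml:id="a">
    <form type="lemma">
        <orth>a</orth>
    </form>
    <gramGrp>
        <gram type="pos">conj.</gram>
    </gramGrp>
    <sense xml:id="a.1">
        <usg type="dom">Lit.</usg>
        <cit type="trans">
            <quote>but</quote>
            <quote>however</quote>
        </cit>
        <cit type="example">
            <quote>Iščejo dom, a ga ne najdejo.</quote>
            <cit type="trans">
                <quote>They are searching for home but cannot find it.</quote>
            </cit>
        </cit>
    </sense>
</entry>
"""


def summarize_entry(entry):
    """Collect the headword, part of speech, and translations of one entry."""
    senses = []
    for sense in entry.findall("sense"):
        # Only <cit type="trans"> tags directly under the sense translate the
        # headword; the ones nested in an example translate that example.
        translations = [
            quote.text
            for cit in sense.findall("cit[@type='trans']")
            for quote in cit.findall("quote")
        ]
        senses.append({"id": sense.get(XML_ID), "translations": translations})
    return {
        "id": entry.get(XML_ID),
        "headword": entry.findtext("form[@type='lemma']/orth"),
        "pos": entry.findtext("gramGrp/gram[@type='pos']"),
        "senses": senses,
    }


summary = summarize_entry(ET.fromstring(ENTRY))
print(summary["headword"], summary["pos"], summary["senses"][0]["translations"])
```

Note that the translation of the example sentence is deliberately left out of the translations list, because it sits one level deeper, inside the example's own cit tag.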

Entry

The entry tag marks a new entry in the dictionary. It usually has only the xml:id attribute, which acts as a unique ID for the entry. The ID is usually just the word or phrase of the entry.

If two entries share the same word, their IDs should get a .x suffix, where x is an integer. For example, there are two entries for atika, so the resulting IDs are atika.1 and atika.2.

There are some suggested conventions to follow when writing IDs:

always use lowercase letters,
replace any spaces with underscores (example: abonirati se -> abonirati_se),
replace non-English letters with their ASCII representations where possible (example: užaloščen -> uzaloscen).
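
The conventions above, together with the .x suffix rule for duplicates, can be sketched in Python roughly as follows. Both function names are hypothetical. The sketch relies on Unicode decomposition to strip diacritics, so letters that have no decomposition (for example "đ") are left unchanged, which is one possible reading of "where possible".

```python
import unicodedata


def make_entry_id(headword):
    """Build an ID candidate from a headword using the conventions above."""
    # NFKD splits letters such as "ž" into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", headword.lower())
    # Drop the combining marks; letters with no decomposition stay unchanged.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.replace(" ", "_")


def number_duplicates(ids):
    """Append .1, .2, ... to IDs that occur more than once, in input order."""
    totals = {}
    for entry_id in ids:
        totals[entry_id] = totals.get(entry_id, 0) + 1
    seen = {}
    result = []
    for entry_id in ids:
        if totals[entry_id] == 1:
            result.append(entry_id)
        else:
            seen[entry_id] = seen.get(entry_id, 0) + 1
            result.append(f"{entry_id}.{seen[entry_id]}")
    return result


print(make_entry_id("abonirati se"))  # abonirati_se
print(make_entry_id("užaloščen"))     # uzaloscen
print(number_duplicates(["atika", "atlas", "atika"]))  # ['atika.1', 'atlas', 'atika.2']
```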

An entry usually contains form, gramGrp, and sense tags.

Form

The form tag holds most of the information on the original word or phrase. Its type attribute states what kind of form it describes; some possible values are listed in the table below.

A form can contain an orth tag, which holds the actual word or phrase, as well as a gramGrp tag for any grammatical properties that apply only to this particular form.

A table of some type values:

Value	Meaning
lemma	The headword - main word that represents the entry
inflected	A word in a form other than the usual dictionary form
variant	A variant form
simple	A single free lexical item
compound	Word formed from simple lexical items
derivative	Word derived from headword
phrase	Multiple-word lexical item
paradigm	A collection of inflected forms

Grammatical Group

The gramGrp tag groups together the grammatical properties of the word or phrase in question. The tag can sit directly inside the entry (see the example above), in which case it applies to all forms in the entry, or inside a form tag, in which case it applies only to that particular form.

A gramGrp tag contains one or more gram tags. Each gram tag has a type attribute that specifies what kind of grammatical property it holds. Below is a table of some of these types and values.

Type	About	Values
pos	Defines the type of word (noun, verb...)	n. (noun) v. (verb) adj. (adjective) conj. (conjunction) adv. (adverb) int. (interjection) prep. (preposition) pron. (pronoun) art. (article) num. (numeral) pref. (prefix)
case	Defines the case of the word	nom. (nominative) gen. (genitive) dat. (dative) acc. (accusative) loc. (locative) instr. (instrumental)
gender	Defines the gender of the word	m. (masculine) f. (feminine) n. (neuter)
mood	Defines the mood of the verb	indic. (indicative) imper. (imperative) condit. (conditional)
number	Defines the number of the word	sg. (singular) pl. (plural) du. (dual)
per	Defines the person of the verb	1st 2nd 3rd
tns	Tense	Present Future Past
colloc	A collocate - any sequence of words that co-occur with the headword with significant frequency	example: [+ conj.]
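
The two possible places for gramGrp can be combined when reading an entry. The Python sketch below is one way to do it, using a trimmed copy of the aktuar entry from the examples at the end of this document. The text above does not spell out what happens when both levels define the same type, so the sketch assumes that the value inside the form wins. The function names read_grams and grams_per_form are made up for illustration.

```python
import xml.etree.ElementTree as ET

ENTRY = """
<entry xml:id="aktuar">
    <form type="lemma">
        <orth>aktuar</orth>
    </form>
    <form type="variant">
        <orth>aktuarka</orth>
        <gramGrp>
            <gram type="gender">f.</gram>
        </gramGrp>
    </form>
    <gramGrp>
        <gram type="pos">n.</gram>
        <gram type="gender">m.</gram>
        <gram type="number">sg.</gram>
    </gramGrp>
</entry>
"""


def read_grams(parent):
    """Map gram type -> value for the gramGrp directly under parent, if any."""
    return {
        gram.get("type"): gram.text
        for gram in parent.findall("gramGrp/gram")
    }


def grams_per_form(entry):
    """Return {orth: properties} for every form in the entry."""
    shared = read_grams(entry)
    # Assumption: a value given inside a form replaces the entry-level
    # value of the same type; everything else is inherited from the entry.
    return {
        form.findtext("orth"): {**shared, **read_grams(form)}
        for form in entry.findall("form")
    }


properties = grams_per_form(ET.fromstring(ENTRY))
print(properties["aktuar"]["gender"], properties["aktuarka"]["gender"])  # m. f.
```

Both forms inherit the part of speech and number from the entry, while aktuarka gets its own gender.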

Sense

A sense contains information on the English counterpart of the Slovene word or phrase. It has its own ID, which is the entry ID with .x appended (where x is an integer).

An entry can contain multiple senses if the word or phrase has several meanings.
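
Since sense IDs are derived from the entry ID, they are easy to check automatically. The Python sketch below shows one possible check. The function name check_sense_ids and the wording of the messages are assumptions, and the requirement that x be a positive integer without leading zeros is my own tightening of the rule above.

```python
import re
import xml.etree.ElementTree as ET

# xml:id as ElementTree sees it (the xml: prefix has a fixed namespace).
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def check_sense_ids(entry):
    """Return a list of problems found in the sense IDs of one entry."""
    entry_id = entry.get(XML_ID)
    if entry_id is None:
        return ["entry has no xml:id"]
    # Expected shape: the entry ID, a dot, then a positive integer.
    pattern = re.compile(re.escape(entry_id) + r"\.[1-9][0-9]*")
    problems = []
    seen = set()
    for sense in entry.findall("sense"):
        sense_id = sense.get(XML_ID)
        if sense_id is None or not pattern.fullmatch(sense_id):
            problems.append(f"{sense_id!r} does not match {entry_id}.<integer>")
        elif sense_id in seen:
            problems.append(f"{sense_id!r} is used more than once")
        seen.add(sense_id)
    return problems


good = ET.fromstring('<entry xml:id="ah"><sense xml:id="ah.1"/><sense xml:id="ah.2"/></entry>')
bad = ET.fromstring('<entry xml:id="ah"><sense xml:id="ah.x"/><sense xml:id="a.1"/></entry>')
print(check_sense_ids(good))       # []
print(len(check_sense_ids(bad)))   # 2
```

The check also works for entries that already carry a duplicate suffix: for the entry atika.1, the expected sense IDs are atika.1.1, atika.1.2, and so on.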

A list of tags that can be found in a sense:

Tag	Description
usg	Defines a type of usage - for example, where the word is used, what kind of situation it is used in, etc.
cit	Contains an actual translation or an example of usage (all of these are stored in quote tags)
quote	Holds the text of a translation or an example
def	Holds any definitions of words - can be used for extra explanation of the word or when there is no proper translation

Types of usage:

Type	Description	Values
dom	Domain	Adm. (administration) Aero. (aeronautics) Agr. (agriculture) Anat. (anatomy) Antr. (anthropology) Arch. (architecture) Archae. (archaeology) Art Astr. (astronomy) Bibl. (bibliography) Biol. (biology) Bot. (botany) Buil. (building trade) Chem. (chemistry) Chess Comp. (computation) Craft. (craftsmanship) Econ. (economy) Engin. (engineering) Film Fin. (finances) For. (forestry) Gast. (gastronomy) Geol. (geology) Geog. (geography) Hist. (history) Hunt. (hunting) Law Lit. (literature) Ling. (linguistics) Math. (mathematics) Med. (medicine) Meteo. (meteorology) Milit. (military) Mus. (music) Myth. (mythology) Naut. (nautical) Pedag. (pedagogics) Pharm. (pharmacy) Phil. (philosophy) Phys. (physics) Psych. (psychiatry) Rail. (rail transport) Rel. (religion) Sci. (science) Sport Tech. (technology) Text. (textile) Theat. (theatre) Vet. (veterinary) War Zoo. (zoology)
plev	Preference level	rare occas. (occasional)
geo	Geographic data	dial. (dialect) Inner Carniola (Notranjska) Upper Carniola (Gorenjska) Lower Carniola (Dolenjska) Littoral Region (Primorje) Styria (Štajerska) Prekmurje Carinthia (Koroška) White Carniola (Bela krajina)
time	Usage by time	archaic old
register		child. (childlike) slang lingo vulgar formal casual affect. (affectionate) colloq. (colloquial) pejor. (pejorative) iron. (ironic)
style		fig. (figurative) lit. (literal)

Some more examples of entries

<entry xml:id="ah">
    <form type="lemma">
        <orth>ah</orth>
    </form>
    <gramGrp>
        <gram type="pos">int.</gram>
    </gramGrp>
    <sense xml:id="ah.1">
        <cit type="trans">
            <quote>ah</quote>
            <quote>oh</quote>
        </cit>
        <cit type="example">
            <quote>Ah, seveda!</quote>
            <cit type="trans">
                <quote>Oh, right!</quote>
            </cit>
        </cit>
        <def>Expresses awe or contentment, or accompanies a sudden idea or thought.</def>
    </sense>
    <sense xml:id="ah.2">
        <cit type="trans">
            <quote>ah</quote>
            <quote>oh</quote>
        </cit>
        <cit type="example">
            <quote>Ah, ti si.</quote>
            <cit type="trans">
                <quote>Oh, it's you.</quote>
            </cit>
        </cit>
        <def>Expresses regret, tiredness.</def>
    </sense>
</entry>

<entry xml:id="aktuar">
    <form type="lemma">
        <orth>aktuar</orth>
    </form>
    <form type="variant">
        <orth>aktuarka</orth>
        <gramGrp>
            <gram type="gender">f.</gram>
        </gramGrp>
    </form>
    <gramGrp>
        <gram type="pos">n.</gram>
        <gram type="gender">m.</gram>
        <gram type="number">sg.</gram>
    </gramGrp>
    <sense xml:id="aktuar.1">
        <cit type="trans">
            <quote>actuary</quote>
        </cit>
    </sense>
</entry>

<entry xml:id="amortizirati_se">
    <form type="lemma">
        <orth>amortizirati se</orth>
    </form>
    <gramGrp>
        <gram type="pos">v.</gram>
    </gramGrp>
    <sense xml:id="amortizirati_se.1">
        <usg type="dom">Econ.</usg>
        <cit type="trans">
            <quote>to be depreciated</quote>
        </cit>
        <cit type="example">
            <quote>Avto se amortizira v petih letih.</quote>
            <cit type="trans">
                <quote>The car is depreciated in five years.</quote>
            </cit>
        </cit>
    </sense>
</entry>