EDIFACT parsing - Clojure library

The library parses Edifact data into a predictable data strcuture while at the same time verifying both semantics and data. Transforming the data into "pretty" hashmaps is a job left to you, at least it should be manageble now that all the error cases are handled by the library.

The following usage examples will show the happy path (when nothing goes wrong), followed by how error states are handled.

Usage examples

Happy path

First include the library:

(ns my.app.edifact
  (:require [dk.emcken.edifact.tokenizer :refer [tokenize]]
            [dk.emcken.edifact.schema :refer [schema-based-on-unh schemafy]]))

Lets play with one of the simple interchanges first. Prior knowledge to EDIFACT is recommended.

The following example is constructed from reading Nordea's Implementation Guideline for CONTRL. Notice how one of the seperator charaters are escaped and used in the text NORDEASPECIAL+TEST:

(def contrl "
UNA:+.? '
UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'
UNH+1001+CONTRL:D:3:UN'
UCI+131+333666999+NORDEASPECIAL?+TEST:ZZ+7'
UCM+1+PAYMUL:D:96A:UN+4+13+UNH+2:4'
UCS+17+16'
UNT+5+1'
UNZ+1+1
")

The tokenizer respects release (escape) characters :

(clojure.pprint/pprint (tokenize contrl))
((["UNB"]
  ["IATB" "1"]
  ["6XPPC" "ZZ"]
  ["LHPPC" "ZZ"]
  ["940101" "0950"]
  ["1"])
 (["UNH"] ["1001"] ["CONTRL" "D" "3" "UN"])
 (["UCI"] ["131"] ["333666999"] ["NORDEASPECIAL+TEST" "ZZ"] ["7"])
 (["UCM"]
  ["1"]
  ["PAYMUL" "D" "96A" "UN"]
  ["4"]
  ["13"]
  ["UNH"]
  ["2" "4"])
 (["UCS"] ["17"] ["16"])
 (["UNT"] ["5"] ["1"])
 (["UNZ"] ["1"] ["1"]))
nil

Now that the data has been tokenized it is time to apply a schema.

The following was taken from the documentaiton linked above, which describes how this particular EDIFACT message is expected to look:

UNH Message header                            M1 M1
UCI Interchange response                      M1 M1
    ------ Segment group 1  ------------ C999999 O999999------+
UCM Message response                          M1 M1           ¦
                                                              ¦
    ------ Segment group 2  --------------- C999 O999 -------+¦
UCS Segment error indication                  M1 M1          ¦¦
UCD Data element error indication            C99 O99 ---------+
    ------ Segment group 3  ------------ C999999  0 ----------+
UCF Functional group response                 M1  0           ¦
                                                              ¦
    ------ Segment group 4  ------------ C999999  0 ---------+¦
UCM Message response                          M1  0          ¦¦
                                                             ¦¦
    ------ Segment group 5  --------------- C999  0 --------+¦¦
UCS Segment error indication                  M1  0         ¦¦¦
UCD Data element error indication            C99  0 ----------+
UNT Message trailer                           M1 M1

Now we take this documentation and turn it into a configuration describing a valid CONTRL message (for this example segment group 3,4 and 5 are ignored):

(defmethod schema-based-on-unh ["CONTRL" "D" "3" "UN"]
  [_]
  '(["UCI" [1 1]]
    [:sg1 [0 999999]
     (["UCM" [1 1]]
      [:sg2 [0 999]
       (["UCS" [1 1]]
        ["UCD" [0 99]])])]))

Now that schema-based-on-unh has been extended to know of the CONTRL message, it is possible to apply the schema using the following:

(clojure.pprint/pprint (schemafy (tokenize contrl) schema-based-on-unh))
[nil
 [(["UNB"]
   ["IATB" "1"]
   ["6XPPC" "ZZ"]
   ["LHPPC" "ZZ"]
   ["940101" "0950"]
   ["1"])
  [:sg0
   [(["UNH"] ["1001"] ["CONTRL" "D" "3" "UN"])
    (["UCI"] ["131"] ["333666999"] ["NORDEASPECIAL+TEST" "ZZ"] ["7"])
    [:sg1
     [(["UCM"]
       ["1"]
       ["PAYMUL" "D" "96A" "UN"]
       ["4"]
       ["13"]
       ["UNH"]
       ["2" "4"])]]
    (["UNT"] ["5"] ["1"])]]
  (["UNZ"] ["1"] ["1"])]]

The first value in the returned list is the remaining segments (nil means none), while the second value is the schemafied version of the tokenized EDIFACT representation.

Validation

The library can validate an edifact interchange in two ways:

Semantics
Data

The semantics describe the structure of the interchange like segment grouping, order, repetition and whether or not those are mandatory or optional. Basically it describes "where" to find the "what" (a.k.a. the data).

The validation will be able to tell exactly where in the edifact interchange garbage was detected.

Data

The above example didn't include any data validation. Lets make the validaiton a bit more strict. Looking at the UCI segment:

UCI+131+333666999+NORDEASPECIAL?+TEST:ZZ+7'

The first component in the first element is a "control reference to an interchange number" (131). For the sake of this example lets assume that the control reference can only be numbers.

Each component is validated in isolation. The nature of the validation is provided by a 2-dimensional vector representing elements and their components.

Redefine the schema to enable validation on the "control reference to an interchange number":

(defmethod schema-based-on-unh ["CONTRL" "D" "3" "UN"]
  [_]
  '(["UCI" [1 1] [[#"^\d+$"]]]
    [:sg1 [0 999999]
     (["UCM" [1 1]]
      [:sg2 [0 999]
       (["UCS" [1 1]]
        ["UCD" [0 99]])])]))

Being given the following UCI segment:

UCI+ABC+333666999+NORDEASPECIAL?+TEST:ZZ+7'
     └─ non numbers not allowed

Would cause the following exception to be thrown:

Unhandled clojure.lang.ExceptionInfo
Unable to validate segment
{:component 1,
 :element 2,
 :error "Must match regular expression",
 :type :dk.emcken.edifact.schema/validation,
 :segment 5}

Lets make the schema even stricter. The first component in the second element is a "sender" (333666999). A sender can be up to 35 alpha-numeric characters but never empty.

For this a helper function would be nice:

(defn an..35m
  "Ensures a maximum of 35 alpha-numeric characters that is mandatory (not empty)"
  [component]
  (and
    (not-empty component)
    (< (count component) 35)))

Now redefine the schema again:

(defmethod schema-based-on-unh ["CONTRL" "D" "3" "UN"]
  [_]
  '(["UCI" [1 1] [[#"^\d+$"] [an..35m]]]
    [:sg1 [0 999999]
     (["UCM" [1 1]]
      [:sg2 [0 999]
       (["UCS" [1 1]]
        ["UCD" [0 99]])])]))

Being given either of the following 2 UCI segments an execption would be thrown:

UCI+131+333666999012345678901234567890123456+NORDEASPECIAL?+TEST:ZZ+7'
        └────── longer than 35 chars ──────┘

UCI+131++NORDEASPECIAL?+TEST:ZZ+7'
        └─ empty sender not allowed

Semantics

Lets start out easy by inserting a wrong segment where a mandatory segment is expected. After a UCI segment either a UCM (in segment group 1) or UNT (the message trailer) is expected. Addind a madeup UCN segment there will break that assumption:

(def contrl "
UNA:+.? '
UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'
UNH+1001+CONTRL:D:3:UN'
UCI+131+333666999+NORDEASPECIAL?+TEST:ZZ+7'
UCN'
UCM+1+PAYMUL:D:96A:UN+4+13+UNH+2:4'
UCS+17+16'
UNT+5+1'
UNZ+1+1
")

This will cause an exception to be trown:

Unhandled clojure.lang.ExceptionInfo
Unable to find mandatory segment: :sg0
{:type :dk.emcken.edifact.schema/structure,
 :segment 7,
 :error "Unable to find mandatory segment: :sg0"}

The count (7) is the segment number from the bottom and up. Which means that at UNH the paser wasn't able to find a valid segment group. The rules for :sg0 states that it must end with a UNT which it never found.

✓ UNA:+.? '
✓ UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'
1 UNH+1001+CONTRL:D:3:UN'
2 UCI+131+333666999+NORDEASPECIAL?+TEST:ZZ+7'
3 UCN'
4 UCM+1+PAYMUL:D:96A:UN+4+13+UNH+2:4'
5 UCS+17+16'
6 UNT+5+1'
7 UNZ+1+1

If you deal with very simple messages i.e. interchanges without multiple messages or messages without optional segment groups the schema above can be simplified (which can cause the error segment to come out different).

UNA - Service String advice

The library respects a non-standard UNA segment (different delimiter and escape charaters than: UNA:+.? ').

License

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.

jacobemcken / edifact