OCR support for single articles

Question

Dielee opened this issue 4 years ago

First of all, the script works very well, thanks for that.
Is it possible to read the individual articles from the receipt via OCR?

This would be very nice!

Matthias Endler · Answer 1 · Sun Nov 15 2020 20:59:43 GMT+0800 (China Standard Time)

Hey there,
parsing single articles might be tricky because from my tests OCR is quite error-prone. That's because the receipt scans might not be of the highest quality (old receipts, low resolution...) and the article names are usually not well-known words that can get recognized easily by OCR.
That said, it would be great to run some more tests and see if those limitations are still the case. I personally don't have time to do that, but if you or anybody wants to give it a shot then I'd be happy to accept PRs or discuss our options.

Linus Dietz · Answer 2 · Mon Nov 16 2020 00:28:49 GMT+0800 (China Standard Time)

Thanks for your fast answer.

I tried a few things with tesseract and found something interesting.

Using the --psm 4 flag gives good results.
My example receipt:

and OCR output:

47806 Neukirchen-Vluyn, NR

k-1JF?
7083481 Bio-Steinofenpi 224 2,49 N
703481 Bio-Steinoferipizza 2,49 A

31483 Lauchzwiebeln 0,39 A

3384 Lachsfi lets 4,23 RK
10386 Bio-Brötchen 1,59 A
465990 Bio-Teigwaren 1,49 A

50206 05 Mini -Roma-Rispe 1,39 A
43320 Funny Frisch Chips 1,39 A

46675 Toastbrötchen 0,59 A
47771 Bio-Kefir Drink 0,99 A
365479 Bio-Aufschnitt-Sor 1,39 A
702380 Bio-Brote 1,/9 A
703360 Edel stählpflege 2,00 B
703184 Kinder Bueno ber 1,99 A
701553 Speisestärke 0,09 A
2538 Bio-Mozzarella 0,89 A
+KUNDENBELEG+
Terminalnummer | 65443234
Datum 23.09.2020
Uhrzeit 13:30:29
Beleg-Nr. 2304

In my opinion, this could work well.
But there should be a possibility for a manual check of the data inside the app.

Matthias Endler · Answer 3 · Mon Nov 16 2020 06:11:18 GMT+0800 (China Standard Time)

That looks quite promising!
Can you try -l deu as well? Wonder if we can also detect words like Transakticens-Nr or Kartenfo] denummer with that or some similar tweaking.
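
For anyone who wants to reproduce this, here is a minimal sketch of the tesseract call with the two options mentioned in this thread (--psm 4 and -l deu). The helper names tesseract_command and run_ocr are made up for illustration and are not part of the Parser code. Running it assumes the tesseract binary and the German language data are installed.

```python
import subprocess
from typing import List


def tesseract_command(image_path: str, lang: str = "deu", psm: int = 4) -> List[str]:
    """Build a tesseract CLI call that prints the recognized text to stdout."""
    # "stdout" as the output base makes tesseract print instead of writing a file.
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def run_ocr(image_path: str) -> str:
    """Run tesseract on one image and return the raw OCR text."""
    result = subprocess.run(
        tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout
```

Calling run_ocr("receipt.jpg") should then return the raw text, one receipt line per text line, ready to be fed into a line parser.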

Answer 4 · Mon Nov 16 2020 17:07:35 GMT+0800 (China Standard Time)

This is fantastic! I did similar work using Python and pytesseract on very noisy images. Good on you, mate.

Seyhan

Linus Dietz · Answer 5 · Mon Nov 16 2020 17:36:48 GMT+0800 (China Standard Time)

@mre Yes, this is with -l deu and --psm 4.

So, I played around a bit and got a useful result:

Added a new function in parse.py:

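    # Needs at module level: import re; from collections import namedtuple
    def parse_lineitems(self):
        item = namedtuple("item", ("article", "sum"))
        items = []

        for line in self.lines:
            # Article text, an optional minus sign, then a price such as "2,49".
            # Only one digit before the comma is matched, so prices of
            # 10,00 or more are skipped.
            match = re.search(r"(...+)\s(-|)(\d,\d\d)\s", line)

            if match:
                items.append(item(match.group(1), match.group(3).replace(",", ".")))

        for i in items:
            print("Artikel: " + i.article + " Summe: " + i.sum)

        return items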

This works pretty well:

Artikel: 2155 thunfisch in wasse Summe: 1.19
Artikel: 2155 thunfisch in wasse Summe: 1.19
Artikel: 46002 bio rinderhackfl. Summe: 3.59
Artikel: 46002 bio rinderhackfl. Summe: 3.59
Artikel: 49036 bio-reibekäse Summe: 1.59
Artikel: 46932 bio-käsegenuss Summe: 1.49
Artikel: 44011 hähnshenbkrustfilet Summe: 2.99
Artikel: 49036 bio-reibekäse Summe: 1.59
Artikel: 49036 bio-reibekäse Summe: 1.59
Artikel: 4659 bio-semüse-sort. Summe: 1.99
Artikel: 2538 bio-mozzarella Summe: 0.89
Artikel: 2538 bio-mozzarella Summe: 0.89
Artikel: 49934 bio-nudel-speziali Summe: 1.79
Artikel: 7246 premium cornichons Summe: 0.99
Artikel: 9813 sonnenmais Summe: 0.49
Artikel: 702255 gourmet chutney Summe: 0.99
Artikel: 44514 bio brühen im glas Summe: 0.99
Artikel: 49534 bio-nudel -speziali Summe: 1.79
Artikel: 1187 basmati reis Summe: 1.99
Artikel: 704796 bio backzutaten Summe: 0.89
Artikel: 12602 bio-kakao Summe: 1.49
Artikel: 31297 schl angengurke Summe: 0.99
Artikel: 9559 direktsaft premium Summe: 1.39
Artikel: 705912 bio- brotaufstrich Summe: 1.79
Artikel: 60373 avocado Summe: 1.19
Artikel: 707634 bio hülsenfrüchte Summe: 0.79
Artikel: 6862 schokostreusel Summe: 1.29
Artikel: 31801 mini -romana-salath Summe: 0.89
Artikel: 53783 toppits gefrierb. Summe: 1.99

Unfortunately, I'm not quite sure how the server and the parser communicate, so I don't know how to build the function into the server. Can anyone help me?

Linus Dietz · Answer 6 · Mon Nov 16 2020 20:37:58 GMT+0800 (China Standard Time)

I have found out that the server does not talk to this parser yet, but uses a PyPI package. Should this be changed?

Matthias Endler · Answer 7 · Mon Nov 16 2020 21:18:32 GMT+0800 (China Standard Time)

Great progress!
Can you try and test your code on the test receipts as well?
E.g. https://github.com/ReceiptManager/Parser/blob/master/data/img/IMG0003.jpg, which does not have numbers in front of the article names. The regex might need some tweaking.
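
As a starting point for that tweaking, here is a rough sketch of a pattern that treats the leading article number as optional and also accepts prices with more than one digit before the comma. It is not the regex used in the Parser repo. The names LineItem, LINE_RE and parse_line are hypothetical, and the assumption that a line ends with an optional one- or two-letter tax code is taken from the sample output earlier in this thread.

```python
import re
from typing import NamedTuple, Optional


class LineItem(NamedTuple):
    number: Optional[str]  # leading article number, if the receipt prints one
    article: str
    price: float


LINE_RE = re.compile(
    r"^\s*(?:(?P<number>\d{3,})\s+)?"  # optional leading article number
    r"(?P<article>.+?)\s+"             # article name (lazy, at least one char)
    r"(?P<price>-?\d+,\d{2})"          # price with a decimal comma
    r"(?:\s+[A-Za-z]{1,2})?\s*$"       # optional tax code such as "A" or "RK"
)


def parse_line(line: str) -> Optional[LineItem]:
    """Return the parsed item, or None if the line does not look like one."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    price = float(match.group("price").replace(",", "."))
    return LineItem(match.group("number"), match.group("article"), price)
```

A line whose price was garbled by OCR, such as "702380 Bio-Brote 1,/9 A", is rejected instead of being guessed at, which fits the idea of a manual check in the app.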

Matthias Endler · Answer 8 · Mon Nov 16 2020 21:21:07 GMT+0800 (China Standard Time)

> I have found out that the server does not talk to this parser yet, but uses a PyPI package. Should this be changed?

Yes, @monolidth and I talked about it, and it makes sense to use the parser code in the server.
I'm thinking we could keep the PyPI package name that the server is using at the moment, publish the package from this Parser repo, and then fix the client to make use of the new package. Help welcome.

Matthias Endler · Answer 9 · Mon Nov 16 2020 21:28:05 GMT+0800 (China Standard Time)

Looks like @monolidth is already on it, improving the code over at dev. 😀

Linus Dietz · Answer 10 · Mon Nov 16 2020 22:09:25 GMT+0800 (China Standard Time)

Output from your IMG0003 sample looks like this:

Artikel: saftorangen Summe: 1.49
Artikel: banane golden v. Summe: 1.19
Artikel: 0,410 kgx 1,99 eur/kg Summe: 0.82
Artikel: pflaume 7509 Summe: 0.99
Artikel: orto mio oliven Summe: 0.69
Artikel: hautklar 3ini Summe: 5.98
Artikel: 2stkx Summe: 2.99
Artikel: today nachfüllbe Summe: 0.65
Artikel: spül-/hh- tücher Summe: 0.75
Artikel: leergut einweg Summe: 2.00
Artikel: brstkkex: Summe: 0.20
Artikel: rückgeld bar eur Summe: 9.44
Artikel: a= 19,0% 4,52 0,86 Summe: 5.38
Artikel: b= 7,0% 4,84 0,34 Summe: 5.18
Artikel: gesantbetrag 9,36 Summe: 1.20

Looks good, but can be improved

Linus Dietz · Answer 11 · Mon Nov 16 2020 23:31:34 GMT+0800 (China Standard Time)

I implemented some small improvements: I changed the sharpness to 0.8, and parsing now stops at "rückgeld" or "summe". The output now looks like this:

Item: sap orange Total: 1.49
Item: golden banana v. Total: 1.19
Item: 0.410 kgx 1.99 eur/kg Total: 0.82
Item: plum 7509 Total: 0.99
Item: orto mio olives Total: 0.69
Item: clear 3ini Sum: 5.98
Item: 2pcsx Total: 2.99
Item: today refill Total: 0.65
Item: washing-up/high wipes Total: 0.75
Item: disposable empties Total: 2.00
Item: brstkkex: Total: 0.20

Unfortunately I am not a good python developer, but I am happy to help if I can.

William · Answer 12 · Wed Nov 18 2020 02:59:44 GMT+0800 (China Standard Time)

@Dielee
I removed sensitive image information. The original image contained exif gps coordinates.
I hope that's fine.

Regards,
William

William · Answer 13 · Wed Nov 18 2020 07:46:39 GMT+0800 (China Standard Time)

See commit: ad78ad0.

Linus Dietz · Answer 14 · Wed Nov 18 2020 14:19:16 GMT+0800 (China Standard Time)

@monolidth yes, thank you very much.
Commit looks good! We should stop parsing if we find something like sum_keys in config.yaml.

I don't know how to do this, though...

Linus Dietz · Answer 15 · Wed Nov 18 2020 17:43:28 GMT+0800 (China Standard Time)

I built something ugly:

    # Needs at module level: import fnmatch, re, yaml; from collections import namedtuple
    def parse_items(self):
        item = namedtuple("item", ("article", "sum"))
        articleName = ""
        parseStop = False
        items = []

        # Load the stop words from the sum_keys entry; "with" closes the file again
        with open("config.yml") as config:
            parsed_config = yaml.load(config, Loader=yaml.FullLoader)
        stopWords = parsed_config.get("sum_keys")

        for line in self.lines:
            match = re.search(r"(...+)\s(-|)(\d,\d\d)\s", line)
            if match is None:
                continue
            articleName = match.group(1)

            # Stop parsing as soon as an article name starts with a stop word
            for word in stopWords:
                parseStop = fnmatch.fnmatch(articleName, f"{word}*")
                if parseStop == True:
                    return items

            items.append(item(articleName, match.group(3).replace(",", ".")))

        return items

This works pretty well
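
To try the idea without a config.yml, a standalone variant could look like the sketch below. It keeps the regex and the fnmatch check from the function above, but takes the lines and the stop words as parameters. The Item type, its total field and the sample lines are made up for illustration and are not taken from the repo.

```python
import re
from fnmatch import fnmatch
from typing import List, NamedTuple


class Item(NamedTuple):
    article: str
    total: str


# Same pattern as in the thread: only one digit before the comma is matched.
ITEM_RE = re.compile(r"(...+)\s(-|)(\d,\d\d)\s")


def parse_items(lines: List[str], stop_words: List[str]) -> List[Item]:
    """Collect items until an article name starts with one of the stop words."""
    items: List[Item] = []
    for line in lines:
        match = ITEM_RE.search(line)
        if match is None:
            continue  # not an item line
        article = match.group(1)
        if any(fnmatch(article, f"{word}*") for word in stop_words):
            break  # reached the totals block, ignore everything after it
        items.append(Item(article, match.group(3).replace(",", ".")))
    return items


if __name__ == "__main__":
    sample = [
        "saftorangen 1,49 a",
        "banane golden v. 1,19 a",
        "rückgeld bar eur 9,44 a",
        "gesamtbetrag 9,36 a",
    ]
    for entry in parse_items(sample, ["rückgeld", "summe"]):
        print(entry.article, entry.total)
```

Passing the stop words in as an argument keeps the function easy to test. The caller can still read them from the sum_keys entry of the config.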

William · Answer 16 · Wed Nov 18 2020 18:03:15 GMT+0800 (China Standard Time)

Yeah, agree. I was tired yesterday night, but this is a feature which is required.
Thanks for sharing, I will take a look at this as soon as possible.

Regards,
William

Linus Dietz · Answer 17 · Wed Nov 18 2020 18:04:22 GMT+0800 (China Standard Time)

Very nice, thanks a lot!

William · Answer 18 · Wed Nov 18 2020 18:07:42 GMT+0800 (China Standard Time)

Some friendly notes:
Use the snake case naming convention. Additionally, you can simplify statements like

    if parse_stop == True:
        return items

to:

    if parse_stop:
        return items

Regards,
William

Linus Dietz · Answer 19 · Wed Nov 18 2020 18:08:57 GMT+0800 (China Standard Time)

Thanks, such tips are always good. These are my first lines of python code.

William · Answer 20 · Wed Nov 18 2020 18:49:13 GMT+0800 (China Standard Time)

@Dielee
Please take a look at: e7a29fc

Linus Dietz · Answer 21 · Wed Nov 18 2020 19:38:46 GMT+0800 (China Standard Time)

Looks good, thank you 👍