ReceiptManager / receipt-parser-legacy

A supermarket receipt parser written in Python using tesseract OCR

Home Page: https://tech.trivago.com/2015/10/06/python_receipt_parser/

OCR support for single articles

Dielee opened this issue

First of all, the script works very well, thanks for that.
Is it possible to read the individual articles from the receipt via OCR?

This would be very nice!

Hey there,
parsing single articles might be tricky because from my tests OCR is quite error-prone. That's because the receipt scans might not be of the highest quality (old receipts, low resolution...) and the article names are usually not well-known words that can get recognized easily by OCR.
That said, it would be great to run some more tests and see if those limitations are still the case. I personally don't have time to do that, but if you or anybody wants to give it a shot then I'd be happy to accept PRs or discuss our options.

Thanks for your fast answer.

I tried a few things with tesseract and found something interesting.

Using the --psm 4 flag gives good results.
My example receipt:

and OCR output:

47806 Neukirchen-Vluyn, NR

k-1JF?
7083481 Bio-Steinofenpi 224 2,49 N
703481 Bio-Steinoferipizza 2,49 A

31483 Lauchzwiebeln 0,39 A

3384 Lachsfi lets 4,23 RK
10386 Bio-Brötchen 1,59 A
465990 Bio-Teigwaren 1,49 A

50206 05 Mini -Roma-Rispe 1,39 A
43320 Funny Frisch Chips 1,39 A

46675 Toastbrötchen 0,59 A
47771 Bio-Kefir Drink 0,99 A
365479 Bio-Aufschnitt-Sor 1,39 A
702380 Bio-Brote 1,/9 A
703360 Edel stählpflege 2,00 B
703184 Kinder Bueno ber 1,99 A
701553 Speisestärke 0,09 A
2538 Bio-Mozzarella 0,89 A
+KUNDENBELEG+
Terminalnummer | 65443234
Datum 23.09.2020
Uhrzeit 13:30:29
Beleg-Nr. 2304

In my opinion, this could work well.
But there should be a possibility for a manual check of the data inside the app.

That looks quite promising!
Can you try -l deu as well? Wonder if we can also detect words like Transakticens-Nr or Kartenfo] denummer with that or some similar tweaking.

This is fantastic! I did similar work using Python and pytesseract on very noisy images. Good on you, mate.

Seyhan

@mre Yes, this is with -l deu and --psm 4.
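
For anyone who wants to reproduce this, here is a minimal sketch of calling the tesseract command line tool with those two flags from Python. The function names and the file name are only placeholders of mine, not part of the parser, and it assumes that tesseract and the German language data are installed.

```python
import subprocess


def build_tesseract_command(image_path, psm=4, lang="deu"):
    """Build the tesseract CLI call; "stdout" makes tesseract print the text."""
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "-l", lang]


def ocr_receipt(image_path, psm=4, lang="deu"):
    """Run tesseract on one image and return the recognized text line by line.

    Assumes the tesseract binary is on PATH; raises CalledProcessError otherwise
    when tesseract exits with an error.
    """
    result = subprocess.run(
        build_tesseract_command(image_path, psm, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # "receipt.jpg" is a hypothetical file name
    print(" ".join(build_tesseract_command("receipt.jpg")))
```

Page segmentation mode 4 treats the image as a single column of text of variable sizes, which is probably why it suits receipts.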

So, I played around a bit and got a useful result:

Added a new function in parse.py:

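    # needs at module level: import re; from collections import namedtuple
    def parse_lineitems(self):
        item = namedtuple("item", ("article", "sum"))
        items = []

        for line in self.lines:
            # article text, optional minus sign, then a price such as "2,49"
            # (\d,\d\d only matches prices below 10,00)
            match = re.search(r"(...+)\s(-|)(\d,\d\d)\s", line)
            if match:
                # store the price with a decimal point instead of the comma
                items.append(item(match.group(1), match.group(3).replace(",", ".")))

        for i in items:
            print("Artikel: " + i.article + " Summe: " + i.sum)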

This works pretty well:

Artikel: 2155 thunfisch in wasse Summe: 1.19
Artikel: 2155 thunfisch in wasse Summe: 1.19
Artikel: 46002 bio rinderhackfl. Summe: 3.59
Artikel: 46002 bio rinderhackfl. Summe: 3.59
Artikel: 49036 bio-reibekäse Summe: 1.59
Artikel: 46932 bio-käsegenuss Summe: 1.49
Artikel: 44011 hähnshenbkrustfilet Summe: 2.99
Artikel: 49036 bio-reibekäse Summe: 1.59
Artikel: 49036 bio-reibekäse Summe: 1.59
Artikel: 4659 bio-semüse-sort. Summe: 1.99
Artikel: 2538 bio-mozzarella Summe: 0.89
Artikel: 2538 bio-mozzarella Summe: 0.89
Artikel: 49934 bio-nudel-speziali Summe: 1.79
Artikel: 7246 premium cornichons Summe: 0.99
Artikel: 9813 sonnenmais Summe: 0.49
Artikel: 702255 gourmet chutney Summe: 0.99
Artikel: 44514 bio brühen im glas Summe: 0.99
Artikel: 49534 bio-nudel -speziali Summe: 1.79
Artikel: 1187 basmati reis Summe: 1.99
Artikel: 704796 bio backzutaten Summe: 0.89
Artikel: 12602 bio-kakao Summe: 1.49
Artikel: 31297 schl angengurke Summe: 0.99
Artikel: 9559 direktsaft premium Summe: 1.39
Artikel: 705912 bio- brotaufstrich Summe: 1.79
Artikel: 60373 avocado Summe: 1.19
Artikel: 707634 bio hülsenfrüchte Summe: 0.79
Artikel: 6862 schokostreusel Summe: 1.29
Artikel: 31801 mini -romana-salath Summe: 0.89
Artikel: 53783 toppits gefrierb. Summe: 1.99

Unfortunately, I'm not quite sure how the server and the parser communicate, so I don't know how to build the function into the server. Can anyone help me?

I have found out that the server does not talk to this parser yet but uses a PyPI package instead. Should this be changed?

Great progress!
Can you try and test your code on the test receipts as well?
E.g. https://github.com/ReceiptManager/Parser/blob/master/data/img/IMG0003.jpg, which does not have numbers in front of the article names. The regex might need some tweaking.
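
One possible direction for that tweaking, as a rough sketch only: make the leading article number optional and capture it separately from the name. The LineItem type, the parse_line helper and the "three or more digits" rule for article numbers are assumptions of mine and are not taken from the parser code.

```python
import re
from typing import NamedTuple, Optional


class LineItem(NamedTuple):
    number: Optional[str]  # leading article number, None if the line has none
    article: str
    price: float


# Optional article number (assumed: 3+ digits), article text, then a price
# like "1,59" or "-0,25" followed by whitespace or the end of the line.
LINE_RE = re.compile(
    r"^\s*(?:(?P<number>\d{3,})\s+)?"
    r"(?P<article>.+)\s+"
    r"(?P<price>-?\d+,\d\d)(?:\s|$)"
)


def parse_line(line: str) -> Optional[LineItem]:
    """Return a LineItem for lines that end in a price, otherwise None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    price = float(match.group("price").replace(",", "."))
    return LineItem(match.group("number"), match.group("article"), price)


if __name__ == "__main__":
    for sample in ("10386 Bio-Brötchen 1,59 A", "Saftorangen 1,49 A"):
        print(parse_line(sample))
```

Because the article group is greedy, the last price on the line wins, which matches the behaviour of the regex posted above.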

I have found out that the server does not talk to this parser yet but uses a PyPI package instead. Should this be changed?

Yes, @monolidth and I talked about it, and it makes sense to use the parser code in the server.
I'm thinking we could keep the PyPI package name that the server is using at the moment, publish the package from this Parser repo, and then fix the client to make use of the new package. Help welcome.

Looks like @monolidth is already on it, improving the code over at dev. 😀

Output from your IMG0003 sample looks like this:

Artikel: saftorangen Summe: 1.49
Artikel: banane golden v. Summe: 1.19
Artikel: 0,410 kgx 1,99 eur/kg Summe: 0.82
Artikel: pflaume 7509 Summe: 0.99
Artikel: orto mio oliven Summe: 0.69
Artikel: hautklar 3ini Summe: 5.98
Artikel: 2stkx Summe: 2.99
Artikel: today nachfüllbe Summe: 0.65
Artikel: spül-/hh- tücher Summe: 0.75
Artikel: leergut einweg Summe: 2.00
Artikel: brstkkex: Summe: 0.20
Artikel: rückgeld bar eur Summe: 9.44
Artikel: a= 19,0% 4,52 0,86 Summe: 5.38
Artikel: b= 7,0% 4,84 0,34 Summe: 5.18
Artikel: gesantbetrag 9,36 Summe: 1.20

Looks good, but can be improved.

I implemented some small improvements: I changed the sharpness to 0.8 and stop parsing at "rückgeld" or "summe". It looks like this now:

Item: sap orange Total: 1.49
Item: golden banana v. Total: 1.19
Item: 0.410 kgx 1.99 eur/kg Total: 0.82
Item: plum 7509 Total: 0.99
Item: orto mio olives Total: 0.69
Item: clear 3ini Total: 5.98
Item: 2pcsx Total: 2.99
Item: today refill Total: 0.65
Item: washing-up/high wipes Total: 0.75
Item: disposable empties Total: 2.00
Item: brstkkex: Total: 0.20
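
The cut-off at "rückgeld" or "summe" could be sketched roughly like this. It is only an illustration with made-up helper names and a hard-coded stop word tuple, not the code that produced the output above.

```python
from itertools import takewhile

# Stop words taken from the comment above; hard-coded here only for the sketch.
STOP_WORDS = ("rückgeld", "summe")


def is_item_line(line, stop_words=STOP_WORDS):
    """True as long as the line does not start with one of the stop words."""
    text = line.strip().lower()
    return not any(text.startswith(word) for word in stop_words)


def lines_before_totals(lines, stop_words=STOP_WORDS):
    """Keep receipt lines up to, but not including, the first stop word line."""
    return list(takewhile(lambda line: is_item_line(line, stop_words), lines))


if __name__ == "__main__":
    receipt = [
        "Saftorangen 1,49 A",
        "Leergut Einweg 2,00 A",
        "Summe EUR 9,36",
        "Rückgeld BAR EUR 9,44",
    ]
    print(lines_before_totals(receipt))
```

Everything after the first matching line is dropped, so tax and payment lines never reach the item regex.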

Unfortunately, I am not a good Python developer, but I am happy to help if I can.

@Dielee
I removed sensitive image information: the original image contained EXIF GPS coordinates.
I hope that's fine.

Regards,
William

See commit: ad78ad0.

@monolidth yes, thank you very much.
Commit looks good! We should stop parsing as soon as a line matches one of the sum_keys entries in config.yaml.

I don't know how to do this...

I built something ugly:

    def parse_items(self):
        item = namedtuple("item", ("article", "sum"))
        articleName = ""
        parseStop = False
        items = []

        # load the stop words (sum_keys) from the config; the with block closes the file
        with open("config.yml") as config:
            parsed_config = yaml.load(config, Loader=yaml.FullLoader)
        stopWords = parsed_config.get("sum_keys")

        for line in self.lines:

            match = re.search(r"(...+)\s(-|)(\d,\d\d)\s", line)
            if hasattr(match, 'group'):
                articleName = match.group(1)
            else:
                continue

            # stop at the first article name that starts with a stop word
            for word in stopWords:
                parseStop = fnmatch.fnmatch(articleName, f"{word}*")
                if parseStop == True:
                    return items

            items.append(item(match.group(1), match.group(3).replace(",", ".")))

        return items

This works pretty well

Yeah, agreed. I was tired last night, but this is a feature which is required.
Thanks for sharing, I will take a look at this as soon as possible.

Regards,
William

Very nice, thanks a lot!

Some friendly notes:
Use the snake case naming convention. Additionally, you can simplify statements like

    if parse_stop == True:
        return items

to:

    if parse_stop:
        return items

Regards,
William

Thanks, such tips are always good. These are my first lines of Python code.

@Dielee
Please take a look at: e7a29fc

Looks good, thank you 👍