Textract Analyze document for Tables issue
krishna-vicky96 opened this issue · comments
Hi,
I found an issue while extracting tables from a document with Textract's Analyze Document. The Textract OCR output identified the table properly, with the correct bounding box (Bbox). However, when I use that same Bbox to crop the table and then try to extract its text, some of the text is missing. Samples are below.
The image below is the crop I got using the Bbox info from the Textract OCR output.
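For anyone trying to reproduce the crop, here is a minimal sketch of that step. It assumes the Bbox is Textract's normalized `BoundingBox` (`Left`, `Top`, `Width`, `Height` given as fractions of the page size) and that the page is already loaded as a NumPy array. The helper name `crop_by_bbox` and the sample values are made up for illustration; this is not necessarily the exact code I ran.

```python
import numpy as np

def crop_by_bbox(page, bbox):
    """Crop a page image (H x W [x C] array) with a Textract-style bounding box.

    Assumption: bbox holds Left, Top, Width, Height as fractions of the page size.
    """
    page_h, page_w = page.shape[:2]
    # Convert the normalized coordinates to pixel coordinates.
    left = int(round(bbox["Left"] * page_w))
    top = int(round(bbox["Top"] * page_h))
    right = int(round((bbox["Left"] + bbox["Width"]) * page_w))
    bottom = int(round((bbox["Top"] + bbox["Height"]) * page_h))
    # Clamp to the page so rounding cannot push the crop out of bounds.
    left, top = max(left, 0), max(top, 0)
    right, bottom = min(right, page_w), min(bottom, page_h)
    return page[top:bottom, left:right]

# Hypothetical page and bounding box, only to show the call.
page = np.full((1000, 800, 3), 255, dtype=np.uint8)
bbox = {"Left": 0.1, "Top": 0.2, "Width": 0.5, "Height": 0.25}
crop = crop_by_bbox(page, bbox)
```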
Analyze Document output (after some postprocessing, such as adding Markdown formatting): "\n\n | n pricing | ¥10,000/ton of CO2, utilized in our investment decision-making, awards program, etc. |\n|---|---|\n| of climate change issues into of executives | Attainment of "promoting sustainability," including climate change-relat initiatives. reflected in performance-linked remuneration |"
If you compare the image with the output, text is missing from the first cell of each row: "Internal Carbo" in the first row, and "Incorporation" and "Remuneration" in the second row.
To work around this, I tried pasting the cropped table onto a canvas with the size of the page it came from, which produced the image below. It still gives me the same output.
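The canvas step can be sketched roughly as follows. This assumes a white background and that the crop is pasted back at its original pixel offset on the page; both are assumptions made for the sketch, and `paste_on_canvas` is a hypothetical helper name.

```python
import numpy as np

def paste_on_canvas(crop, page_shape, top, left, fill=255):
    """Place a cropped table on a blank canvas that has the full page size.

    Assumptions: white background (fill=255), and (top, left) is the pixel
    offset where the crop sat on the original page.
    """
    canvas = np.full(page_shape, fill, dtype=crop.dtype)
    h, w = crop.shape[:2]
    canvas[top:top + h, left:left + w] = crop
    return canvas

# Hypothetical sizes: a 250 x 400 crop placed on a 1000 x 800 page.
crop = np.zeros((250, 400, 3), dtype=np.uint8)
canvas = paste_on_canvas(crop, (1000, 800, 3), top=200, left=80)
```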
Next, I applied thresholding to the image and got the image below.
Interestingly, this produced the proper output: "\n\n | Internal carbon pricing | ¥10,000/ton of CO2, utilized in our investment decision-making, awards program, etc. |\n|---|---|\n| Incorporation of climate change issues into remuneration of executives | Attainment of "promoting sustainability," including climate change-related initiatives, reflected in performance-linkeo remuneration |"
Here is how I applied the threshold:
_, binary_image = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
However, this is a problem for colored images: the thresholding I proposed won't work there, as it makes things worse.
This is the issue I found, along with a hack that works around it. If anyone has a better approach, please feel free to post the solution.
I ask the AWS team to take a look and fix this issue.