UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)

Home Page: https://github.com/UglyToad/PdfPig/wiki

Text extracted from a region of a PDF varies depending on the operating system

fpisarello-dawa opened this issue

Hi, I get inconsistent results when I extract the text inside a region of a page with the code below.

When I run this script on Windows 10 (LINQPad):

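// Namespaces needed when running outside LINQPad (for example with dotnet-script)
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Core;      // PdfPoint, PdfRectangle
using UglyToad.PdfPig.Geometry;  // IntersectsWith extension method
using UglyToad.PdfPig.Util;      // DefaultWordExtractor

using (var document = PdfDocument.Open("FACTURA_AFIP.pdf"))
{
    // Only every fourth page is processed (1, 5, 9, ...)
    for (var i = 1; i <= document.NumberOfPages; i += 4)
    {
        // For PDF coordinates the y-axis runs from the bottom of the page up
        var bottomLeft = new PdfPoint(479, 149);
        var topRight = new PdfPoint(559, 159);
        var square = new PdfRectangle(bottomLeft, topRight);
        var page = document.GetPage(i);

        // Keep only the letters whose glyph box touches the region
        var letters = page.Letters.Where(x => square.IntersectsWith(x.GlyphRectangle)).ToList();

        var wordsInRegion = DefaultWordExtractor.Instance.GetWords(letters);
        var textInRegion = string.Join(" ", wordsInRegion.Select(x => x.Text).ToList());

        textInRegion.Dump();
    }
}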

Result:
72515176735833

But the same script on Ubuntu 20.04 (dotnet-script) gives:
N°: 72515176735833

Why do Windows and Linux show different results?

I have attached the PDF file for more detail.
FACTURA_AFIP.pdf

@fpisarello-dawa I'm guessing this comes from different default fonts being used on different operating systems. I'd expect the fonts in your document are not embedded, and PdfPig uses the OS ones to get the bounding boxes. These will differ by OS, which would change which letters intersect your rectangle.
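One way to check this guess is to print, on each machine, every letter on the same line as the region together with its font name and glyph box, then compare the two outputs. The sketch below is only a diagnostic under assumptions: it reuses the file name and coordinates from the script above, looks at page 1 only, and assumes the PdfPig package is already referenced. The "vertical band" filter is an illustrative choice, not something PdfPig prescribes.

```csharp
using System;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Core;      // PdfPoint, PdfRectangle
using UglyToad.PdfPig.Geometry;  // IntersectsWith extension method

using (var document = PdfDocument.Open("FACTURA_AFIP.pdf"))
{
    var page = document.GetPage(1);
    var square = new PdfRectangle(new PdfPoint(479, 149), new PdfPoint(559, 159));

    // Letters whose box overlaps the region vertically, wherever they sit horizontally.
    // This also shows letters just outside the left or right edge of the region.
    var sameLine = page.Letters.Where(l =>
        l.GlyphRectangle.Top >= square.Bottom &&
        l.GlyphRectangle.Bottom <= square.Top);

    foreach (var letter in sameLine)
    {
        var box = letter.GlyphRectangle;
        var inside = square.IntersectsWith(box);
        Console.WriteLine(
            $"'{letter.Value}' font={letter.FontName} " +
            $"left={box.Left:F2} right={box.Right:F2} intersects={inside}");
    }
}
```

If the left and right values of the letters in "N°:" differ between Windows and Linux, that would support the explanation that the glyph widths come from different system fonts.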

@EliotJones this is not the first time we have had this kind of question. I think we should try to ship default fonts like other PDF readers do, so that PdfPig always uses the same ones.

Doing so would also make it easier to write unit tests across different operating systems, as people will expect consistent results. Let me know what you think.

Also see https://askubuntu.com/questions/599915/what-is-the-closest-font-to-helvetica-available-on-ubuntu

And https://stackoverflow.com/questions/6383511/font-metrics-for-the-base-14-fonts-in-the-pdf-specification#6506818

@BobLd it's a reasonable suggestion, I'm just not sure what the licensing situation for that looks like. I'd expect you need some kind of payment to redistribute most fonts from foundries.

@EliotJones you nailed the main issue with fonts... I'll come back with fonts that have a license compatible with the project. Let's see then what's doable.

Looking at the table below, we have open source equivalents (table from https://wiki.archlinux.org/title/Metric-compatible_fonts):
[image: table of metric-compatible fonts]

Liberation fonts are available under the SIL Open Font License, Version 1.1, which, from what I understand, is as open source as you can get for a font. See https://github.com/liberationfonts/liberation-fonts/tree/main/src

Using Liberation fonts, we cover 12 of the 14 base fonts (Symbol and ZapfDingbats are missing). I'll look into the rest. Also, they are already referenced in the SystemFontFinder class.
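To make the 12-of-14 count concrete, here is a small sketch of how the Standard 14 font names line up with the Liberation families (Times with Liberation Serif, Helvetica with Liberation Sans, Courier with Liberation Mono). The `liberation` dictionary and `uncovered` array are hypothetical names used only for illustration; this is not how PdfPig or its SystemFontFinder is implemented.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical lookup: Standard 14 font name -> metric-compatible Liberation face
var liberation = new Dictionary<string, string>
{
    ["Times-Roman"]           = "Liberation Serif Regular",
    ["Times-Bold"]            = "Liberation Serif Bold",
    ["Times-Italic"]          = "Liberation Serif Italic",
    ["Times-BoldItalic"]      = "Liberation Serif Bold Italic",
    ["Helvetica"]             = "Liberation Sans Regular",
    ["Helvetica-Bold"]        = "Liberation Sans Bold",
    ["Helvetica-Oblique"]     = "Liberation Sans Italic",
    ["Helvetica-BoldOblique"] = "Liberation Sans Bold Italic",
    ["Courier"]               = "Liberation Mono Regular",
    ["Courier-Bold"]          = "Liberation Mono Bold",
    ["Courier-Oblique"]       = "Liberation Mono Italic",
    ["Courier-BoldOblique"]   = "Liberation Mono Bold Italic",
};

// The two Standard 14 fonts with no Liberation counterpart
var uncovered = new[] { "Symbol", "ZapfDingbats" };

var total = liberation.Count + uncovered.Length;
Console.WriteLine($"{liberation.Count} of {total} Standard 14 fonts have a Liberation substitute");
foreach (var name in uncovered)
{
    Console.WriteLine($"No Liberation substitute for {name}");
}
```

Symbol and ZapfDingbats would need separate sources, such as the Symbol font linked below.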

Symbol font: https://github.com/powerline/fonts/tree/master/SymbolNeu (Apache License, Version 2.0)

@BobLd thanks for the response. I installed the Helvetica font on the Linux server and got the same behavior. Do I need to install another font on the server to get the same result as on Windows?