AngleSharp / AngleSharp.Css

:angel: Library to enable support for cascading stylesheets in AngleSharp.

Home Page:https://anglesharp.github.io

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

GetInnerText Paragraphs do add too much linebreaks before and after the tag

Seyden opened this issue · comments

Bug Report

Prerequisites

  • Can you reproduce the problem in a MWE?
  • Are you running the latest version of AngleSharp?
  • Did you check the FAQs to see if that helps you?
  • Are you reporting to the correct repository? (there are multiple AngleSharp libraries, e.g., AngleSharp.Css for CSS support)
  • Did you perform a search in the issues?

For more information, see the CONTRIBUTING guide.

Description

GetInnerText of the latest AngleSharp.CSS version renders too many line breaks at the start and end of a paragraph

Steps to Reproduce

using AngleSharp;
using AngleSharp.Css;
using AngleSharp.Dom;
using System.Diagnostics;

var content = "<div class=\"entry-content entry-content-single\" itemprop=\"description\"><p><em><strong>[By the studio that brought you &lt;Solo Leveling&gt;, &lt;Reaper of the Drifting Moon&gt;, and many more!]</strong></em></p>\n<p>He was the hound of the Baskerville family: Vikir.</p>\n<p>Yet his loyalty was rewarded by the blade of a guillotine dirtied by slander.</p>\n<p>“I will never live the life of a hound slaughtered after the rabbit is caught.”</p>\n<p>In place of death, an unexpected opportunity awaits him.</p>\n<p>Vikir’s eyes glowed red as he sharpened his canines in the dark.</p>\n<p>“Just you wait, Hugo. I will rip out your throat this time.”</p>\n<p>It’s time for the hound to exact bloody revenge on his owner.</p>\n</div>";

var context = BrowsingContext.New(Configuration.Default
    .WithCss());

var doc = await context.OpenAsync(req => req.Content(content));
var description = doc.QuerySelector(("div[itemprop=\"description\"]"))?.GetInnerText().Trim();
Console.WriteLine(description);
Console.ReadKey();

Expected behavior: Outputs

"[By the studio that brought you <Solo Leveling>, <Reaper of the Drifting Moon>, and many more!]\n\nHe was the hound of the Baskerville family: Vikir.\n\nYet his loyalty was rewarded by the blade of a guillotine dirtied by slander.\n\n“I will never live the life of a hound slaughtered after the rabbit is caught.”\n\nIn place of death, an unexpected opportunity awaits him.\n\nVikir’s eyes glowed red as he sharpened his canines in the dark.\n\n“Just you wait, Hugo. I will rip out your throat this time.”\n\nIt’s time for the hound to exact bloody revenge on his owner."

Pasting this code into the browser consoles also outputs the same
document.querySelector('div[itemprop=\"description\"]').innerText

Actual behavior: Outputs

"[By the studio that brought you <Solo Leveling>, <Reaper of the Drifting Moon>, and many more!]\n\n \n\nHe was the hound of the Baskerville family: Vikir.\n\n \n\nYet his loyalty was rewarded by the blade of a guillotine dirtied by slander.\n\n \n\n“I will never live the life of a hound slaughtered after the rabbit is caught.”\n\n \n\nIn place of death, an unexpected opportunity awaits him.\n\n \n\nVikir’s eyes glowed red as he sharpened his canines in the dark.\n\n \n\n“Just you wait, Hugo. I will rip out your throat this time.”\n\n \n\nIt’s time for the hound to exact bloody revenge on his owner."

Environment details: .NET 7

Possible Solution

Adjust the RequiredLineBreakCounts in ElementExtension.cs for the Start from 2 to 1 and for the end from 2 to 1

Adding a IsEmpty check for Text Nodes did fix the issue, it was actually empty "\n" line break Text Nodes between the elements (nodes which aren't visible in the frontend anyway, so they shouldn't be rendered).

grafik

The current behaviour is actually adding 2 line breaks before the element and 2 line breaks after the element, but Paragraphs should actually render 2 line breaks in total, 1 at the start and 1 at the end.

Adding a IsEmpty check for Text Nodes did fix the issue, it was actually empty "\n" line break Text Nodes between the elements (nodes which aren't visible in the frontend anyway, so they shouldn't be rendered).

That is, unfortunately, a wrong assumption. This is fully depending on the styling - in most settings it will render a space, but if its literal it might be a line break. In block display it will not render anything.

I don't see a clear reproducible (external sites are not regarded as a reproducible as their content might change, they have too much noise, and its unclear if the source is IP / user dependent).

Please provide a short and concise MWE to illustrate and show the problem. Look at other bug reports on how to do it. Thanks!

Just as an example consider:

<!DOCTYPE html>
<html>
<style>
.a * {
  display: block;
}
.b * {
  display: inline-block;
}
.c * {
  white-space: pre;
}
</style>
<body>
<div class="a">
<p>a</p><span>
</span><p>b</p>
</div>
<div class="b">
<p>a</p><span>
</span>
<p>b</p>
</div>
<div class="c">
<p>a</p><span>
</span>
<p>b</p>
</div>
<script>
console.log(document.querySelector('.a').innerText);
console.log(document.querySelector('.b').innerText);
console.log(document.querySelector('.c').innerText);
</script>
</body>

</html>

The outcome is quite different:

image

And, for good reasons, innerText also follows here:

image

So long story short: The text node has to be respected, but its interpretation is depending on the sheet. We need a clear reproducible to see if its a bug and if so where the bug originates.

I updated the MWE, i think you also might have misunderstood what the possible solution code is supposed to do.

as an example, i added a test case

Working case:

[TestCase("<p>test1</p><p>test2</p>", "test1\n\ntest2")]

Broken case:

[TestCase("<p>test1</p>\n<p>test2</p>", "test1\n\ntest2")]

grafik

I updated the MWE, i think you also might have misunderstood what the possible solution code is supposed to do.

No - my answer is independent of the possible solution section.

My answer had two parts; one to show how a reproducible should be done (thanks for the update, that qualifies) and one part to illustrate that your assumption that text nodes should be removed does not apply here / in general. It might be effectively the case in the given scenario, but in general text nodes contribute to the rendering and (almost) every line break in an HTML source code ends up as either part of or an individual text node. That's just how HTML parsing works.

So the "Possible Solution" that you provided does not work - see my example. You cannot just remove text nodes - the way that innerText works (and what makes it different from innerHTML) is that is respects the styling. Therefore, if you omit the information from a text node fully depends on the styling context.

Do you might have an CSS where this case is considered ?

And do you perhaps already have an idea on how to approach this bug?

So this fix basically involves the requiredLineBreakCounts because based on what i read in the Specs, every white space has to be considered and because we handled the required line breaks at the end, all the whitespace css rules couldn't be applied to these breaks

${{\color{Green}\Huge{\textsf{Passing\ tests:\ }}}}$

[TestCase("<div><p>test1 test1 test1 ! test1</p>\\n<p>test2 test2  test2 test2</p></div>", "test1 test1 test1 ! test1\n\n\\n\n\ntest2 test2 test2 test2")]
[TestCase("<div style=\"white-space: normal;\"><p>Test</p>&#010;<p>Test</p></div>", "Test\n\nTest")]
[TestCase("<div style=\"white-space:pre;\"><p>Test</p>&#010;                                          <p>Test</p></div>", "Test\n\n\n                                          \n\nTest")]
[TestCase("<div style=\"white-space: pre-wrap;\"><p>Test</p>&#010;<p>Test</p></div>", "Test\n\n\n\n\nTest")]

${{\color{Red}\Huge{\textsf{Broken\ tests:\ }}}}$

[TestCase("<div style=\"white-space: pre-line;\"><p>Test</p>\n<p>Test</p></div>", "Test\n\nTest")]
[TestCase("<div style=\"white-space: break-spaces;\"><p>Test</p>\n<p>Test</p></div>", "Test\n\n\n\n\nTest")]

${{\color{Aqua}\Huge{\textsf{Fix\ patch:\ }}}}$

diff --git a/src/AngleSharp.Css/Extensions/ElementExtensions.cs b/src/AngleSharp.Css/Extensions/ElementExtensions.cs
index 7e687f2..5eb65c3 100644
--- a/src/AngleSharp.Css/Extensions/ElementExtensions.cs
+++ b/src/AngleSharp.Css/Extensions/ElementExtensions.cs
@@ -196,7 +196,7 @@ private static void ItcInCssBox(ICssStyleDeclaration elementStyle, ICssStyleDecl
                     var lastLine = node.NextSibling is null ||
                         String.IsNullOrEmpty(node.NextSibling.TextContent) ||
                         node.NextSibling is IHtmlBreakRowElement;
-                    ProcessText(textElement.Data, sb, parentStyle, lastLine);
+                    ProcessText(textElement.Data, sb, parentStyle, lastLine, requiredLineBreakCounts);
                 }
                 else if (node is IHtmlBreakRowElement)
                 {
@@ -390,12 +390,15 @@ private static Boolean IsBlockLevel(INode node)
             }
         }
 
-        private static void ProcessText(String text, StringBuilder sb, ICssStyleDeclaration style, Boolean lastLine)
+        private static void ProcessText(String text, StringBuilder sb, ICssStyleDeclaration style, Boolean lastLine, Dictionary<Int32, Int32> requiredLineBreakCounts)
         {
             var startIndex = sb.Length;
             var whiteSpace = style?.GetWhiteSpace();
             var textTransform = style?.GetTextTransform();
-            var isWhiteSpace = startIndex > 0 ? Char.IsWhiteSpace(sb[startIndex - 1]) && sb[startIndex - 1] != Symbols.NoBreakSpace : true;
+            bool IsWhiteSpace(Char c) { return Char.IsWhiteSpace(c) && c != Symbols.NoBreakSpace; }
+            var isWhiteSpace = startIndex <= 0
+                || (IsWhiteSpace(sb[startIndex - 1])
+                || (requiredLineBreakCounts.ContainsKey(startIndex) && IsWhiteSpace(text[0])));
 
             for (var i = 0; i < text.Length; i++)
             {

Fix for it with additional tests landed in devel.