voku / simple_html_dom

📜 Modern Simple HTML DOM Parser for PHP

Parsing introduces new line breaks when outputting the HTML of an element with multiple children

mangei opened this issue

What is this feature about (expected vs actual behaviour)?

When parsing a file, I would like to get the original HTML of an element so that I can search and replace a specific part of the document without changing the rest.

The issue is that $el->html does not return the original string if any of the element's children has more than one child: the output contains additional line breaks.

How can I reproduce it?

Script: (it shows my full use-case; the notable part is highlighted)

<?php

use voku\helper\HtmlDomParser;

require_once '../composer/autoload.php';

$fileContent = file_get_contents('./test.html');
$dom = HtmlDomParser::str_get_html($fileContent);

foreach($dom->find('.mydiv') as $myDivEl) {
    $currentHtml = $myDivEl->html;
    echo $currentHtml;                                           // <---- here you can see the wrong output (you can skip the rest)

    $newContent = "";
    foreach($myDivEl->find('.mydiv-item') as $childEl) {
        $childEl->class = 'replaced';

        $newContent .= $childEl;
    }

    $myDivEl->outerhtml = '<div class="myreplacement">' . $newContent . '</div>';
    
    $fileContent = str_replace($currentHtml, $myDivEl->html, $fileContent);
}

file_put_contents('./test-out.html', $fileContent);

Input HTML file:

<html>
<body>
        <div class="mydiv">
        </div>
        <div class="mydiv">
            <div class="mydiv-item"><span>A1</span></div>
        </div>
        <div class="mydiv">
            <div class="mydiv-item"><span>B1</span><span>B2</span></div>
        </div>
</body>
</html>

Actual output: (B is not replaced)

<html>
<body>
        <div class="myreplacement"></div>
        <div class="myreplacement"><div class="replaced"><span>A1</span></div></div>
        <div class="mydiv">
            <div class="mydiv-item"><span>B1</span><span>B2</span></div>
        </div>
</body>
</html>

Expected output:

<html>
<body>
        <div class="myreplacement"></div>
        <div class="myreplacement"><div class="replaced"><span>A1</span></div></div>
        <div class="myreplacement"><div class="replaced"><span>B1</span><span>B2</span></div></div>
</body>
</html>

The issue is that the HTML returned for a selected element differs from the source text whenever one of its children has more than one child, so the search-replace finds nothing to replace:

<div class="mydiv">
        </div>

A:
<div class="mydiv">
            <div class="mydiv-item"><span>A1</span></div>
        </div>

B:
<div class="mydiv">
            <div class="mydiv-item">
<span>B1</span><span>B2</span>
</div>
        </div>

B should be:

<div class="mydiv">
            <div class="mydiv-item"><span>B1</span><span>B2</span></div>
        </div>
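
The mismatch can be reproduced without the parser. The following is a minimal sketch that uses the inner part of the two B strings above as plain PHP literals (the variable names follow the script, the replacement string is only illustrative). `str_replace` returns its input unchanged when the needle is not found, and its optional fourth argument reports how many replacements were made, which makes the silent failure detectable.

```php
<?php

// Source text as it appears in the input file.
$fileContent = "<div class=\"mydiv-item\"><span>B1</span><span>B2</span></div>";

// What the element serializes to: same markup, but with extra line breaks.
$currentHtml = "<div class=\"mydiv-item\">\n<span>B1</span><span>B2</span>\n</div>";

// The fourth argument receives the number of replacements performed.
$count = 0;
$result = str_replace($currentHtml, '<div class="replaced"></div>', $fileContent, $count);

if ($count === 0) {
    // The needle does not occur in the haystack, so nothing was replaced.
    echo "no match: serialized HTML differs from the source text\n";
}
```

Checking `$count` after each replacement would at least turn the unreplaced B block into a visible error instead of an unchanged output file.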

Does it take minutes, hours or days to fix?

Minutes?

Any additional information?

.

Thanks for your help!

It would also help if I could get the original parsed text so that I can search and replace it, or perhaps the indices (from-to) of the element within the original string.
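
As a possible workaround for the from-to idea, the offsets can be recovered from the source string itself. This is only a sketch under stated assumptions: `findIgnoringTagWhitespace` is a hypothetical helper, not part of the library, and it assumes the serialized HTML differs from the source only in whitespace between adjacent tags. It builds a pattern from the serialized string that tolerates such whitespace, locates the match with `PREG_OFFSET_CAPTURE`, and replaces that byte range with `substr_replace`.

```php
<?php

/**
 * Hypothetical helper (not a library function): find $serialized inside
 * $source while tolerating whitespace differences between adjacent tags.
 * Returns [byte offset, byte length] of the match, or null if not found.
 */
function findIgnoringTagWhitespace(string $source, string $serialized): ?array
{
    // Split at every tag boundary: ">" and "<" with optional whitespace between.
    $pieces = preg_split('~>\s*<~', $serialized);

    // Quote each piece literally, then allow any whitespace at the boundaries.
    $quoted = array_map(
        static function (string $piece): string {
            return preg_quote($piece, '~');
        },
        $pieces
    );
    $pattern = '~' . implode('>\s*<', $quoted) . '~';

    if (preg_match($pattern, $source, $match, PREG_OFFSET_CAPTURE) !== 1) {
        return null;
    }

    return [$match[0][1], strlen($match[0][0])];
}

$source = "<body>\n  <div class=\"mydiv-item\"><span>B1</span><span>B2</span></div>\n</body>";
$serialized = "<div class=\"mydiv-item\">\n<span>B1</span><span>B2</span>\n</div>";

$range = findIgnoringTagWhitespace($source, $serialized);
if ($range !== null) {
    [$from, $length] = $range;
    // Replace only the matched byte range; the rest of the source is untouched.
    $source = substr_replace($source, '<div class="replaced"></div>', $from, $length);
}

echo $source;
```

The sketch does not cover other differences, such as re-quoted attributes or changed entities, and it replaces only the first match.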

It's much simpler to use the HtmlDom object instead of string replacements; here is an example: 7571bee
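
The referenced commit is not reproduced here. As a rough illustration of the general idea (change the nodes in the tree and serialize the document once, with no string search), here is a sketch that deliberately uses PHP's built-in DOM extension (`DOMDocument`, `DOMXPath`) rather than this library's API, so that it stays self-contained. The class names are taken from the report; everything else is an assumption made for the example.

```php
<?php

$html = '<html><body>'
    . '<div class="mydiv"><div class="mydiv-item"><span>B1</span><span>B2</span></div></div>'
    . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Rename the classes directly on the nodes instead of replacing strings.
foreach ($xpath->query('//div[@class="mydiv"]') as $div) {
    $div->setAttribute('class', 'myreplacement');

    // Query relative to the current wrapper so only its own items are touched.
    foreach ($xpath->query('.//div[@class="mydiv-item"]', $div) as $item) {
        $item->setAttribute('class', 'replaced');
    }
}

// Serialize the whole document once, after all modifications.
$out = $doc->saveHTML();
echo $out;
```

The trade-off is that the whole document is re-serialized, so untouched parts may not come out byte-for-byte identical to the input, which is what the original request wanted to avoid.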

This is a shortcoming of the library. With multiple parent and multiple child selectors, it becomes a big problem.

Example:

<html>
<body>
        <div class="mydiv">
        </div>
        <div class="mydiv_a">
            <div class="mydiv-item"><span>A1</span></div>
        </div>
        <div class="mydiv_b">
            <div class="mydiv-item"><span>B1</span><span>B2</span></div>
        </div>

<div class="mydiv_c">
            <div class="mydiv-item-next"><span>B1</span><span>B2</span></div>
        </div>
</body>
</html>