Encoding problem while writing to database

Question

SpaceShaman opened this issue 4 years ago · comments

Hi, I'm trying to build a web scraper with this library in Laravel, and everything works great until I want to save the result to the database.
There seems to be an encoding problem. For example, when HtmlDomParser reads "Léon" into $film['title'], "L&eacute;on" is saved in the database, but echo displays "Léon".

Do you have any idea how to fix this problem?

code snippet:

    $dom = HtmlDomParser::file_get_html('url');
    $film['title'] = $dom->find('selector', 0)->innertext;
    ...
    $film_db = new Movie_info;
    foreach ($film as $k => $v) {
        $film_db->$k = $v;
        echo $k .": ". $v ."<br>";
    }
    $film_db->save();

my database settings:

    'mysql' => [
        'driver' => 'mysql',
        'url' => env('DATABASE_URL'),
        'host' => env('DB_HOST', '127.0.0.1'),
        'port' => env('DB_PORT', '3306'),
        'database' => env('DB_DATABASE', 'forge'),
        'username' => env('DB_USERNAME', 'forge'),
        'password' => env('DB_PASSWORD', ''),
        'unix_socket' => env('DB_SOCKET', ''),
        'charset' => 'utf8mb4',
        'collation' => 'utf8mb4_unicode_ci',
        'prefix' => '',
        'prefix_indexes' => true,
        'strict' => true,
        'engine' => null,
        'options' => extension_loaded('pdo_mysql') ? array_filter([
            PDO::MYSQL_ATTR_SSL_CA => env('MYSQL_ATTR_SSL_CA'),
        ]) : [],
    ],

Lars Moelleken · Answer 2 · Wed Nov 11 2020 09:09:39 GMT+0800 (China Standard Time)

Please check the output via "var_dump()" and see whether the string on the website is perhaps already HTML-encoded.

There are already tests for Unicode support, so I am pretty sure that this DOM parser has nothing to do with your problem. Maybe you can show the original HTML so that we can add one more test, thanks. 👍

SpaceShaman · Answer 3 · Wed Nov 11 2020 19:37:49 GMT+0800 (China Standard Time)

After checking with "var_dump($film)", I got:

'title' => string 'L&eacute;on' (length=11)

The page I'm trying to scrape is on filmweb.pl.
Now that I look at the page source, I can see that the title is already encoded this way there xD

<h2 class="filmCoverSection__orginalTitle">L&eacute;on</h2>
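For context, here is a small sketch of why echo looked correct while the database held the entity. The scraped string really contains the eight-character entity "&eacute;", and it is the browser that turns it into "é" when rendering the echoed output. The sample string is copied from the markup above; escaping it with htmlspecialchars() is just one possible way to see the raw value in a browser, not something the library requires.

```php
<?php
// The value as scraped: 'L' + '&eacute;' (8 chars) + 'on' = 11 bytes.
$title = 'L&eacute;on';

echo strlen($title), "\n"; // 11

// Escaping the ampersand makes the browser show the stored text literally
// instead of rendering the entity as "é".
echo htmlspecialchars($title, ENT_QUOTES, 'UTF-8'), "\n"; // L&amp;eacute;on
```

The database driver does no such decoding, so it stores exactly the 11 bytes it receives.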

SpaceShaman · Answer 4 · Wed Nov 11 2020 20:03:53 GMT+0800 (China Standard Time)

OK, I managed to fix the problem with a function from your repository voku/portable-utf8.
Using the page below, I checked which decoding method would work, and I chose UTF8::rawurldecode():
encoder.suckup.de
Right now my code looks like this:

    $dom = HtmlDomParser::file_get_html('url');
    $film['title'] = $dom->find('selector', 0)->innertext;
    $film['title'] = UTF8::rawurldecode($film['title']);
    ...
    $film_db = new Movie_info;
    foreach ($film as $k => $v) {
        $film_db->$k = $v;
        echo $k .": ". $v ."<br>";
    }
    $film_db->save();

Lars Moelleken · Answer 5 · Wed Nov 11 2020 20:12:42 GMT+0800 (China Standard Time)

I think you only need UTF8::html_entity_decode() but you can test it here: https://encoder.suckup.de/index.php

Happy Coding! :)
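A minimal sketch of the entity-decoding step, using PHP's built-in html_entity_decode() rather than the UTF8:: wrapper so that it runs without any library. The $film array and its 'genre' key are hypothetical sample data, not taken from the real scraper. Note that the built-in rawurldecode() only handles percent-encoding and leaves entities untouched; this says nothing about how the library's UTF8::rawurldecode() behaves.

```php
<?php
// Hypothetical scraped values (illustrative only).
$film = [
    'title' => 'L&eacute;on',    // named entity
    'genre' => 'Krymina&#322;',  // numeric entity for "ł"
];

// Decode named and numeric HTML entities into UTF-8 characters.
// ENT_QUOTES also decodes quote entities; ENT_HTML5 selects the HTML5 entity table.
$decoded = array_map(
    static fn (string $v): string => html_entity_decode($v, ENT_QUOTES | ENT_HTML5, 'UTF-8'),
    $film
);

var_dump($decoded['title'] === "L\u{00E9}on");     // bool(true)
var_dump($decoded['genre'] === "Krymina\u{0142}"); // bool(true)

// The built-in rawurldecode() finds no %XX sequences here, so nothing changes.
var_dump(rawurldecode($film['title']) === $film['title']); // bool(true)
```

Decoding once, just before the values are assigned to the model, keeps plain UTF-8 text in the utf8mb4 columns.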

SpaceShaman · Answer 6 · Wed Nov 11 2020 20:17:32 GMT+0800 (China Standard Time)

Thanks for your help and for your very useful libraries.
The whole function that downloads movie information from filmweb.pl is at this link:
Filmweb.php