Encoding problem while writing to database
SpaceShaman opened this issue
Hi, I'm trying to build a web scraper with this library in Laravel, and everything works great until I want to save the result to the database.
This looks like an encoding problem. For example, when HtmlDomParser reads "Léon" into $film['title'], "L&eacute;on" is saved in the database, but echo displays "Léon".
Do you have any idea how to fix this problem?
code snippet:
$dom = HtmlDomParser::file_get_html('url');
$film['title'] = $dom->find('selector', 0)->innertext;
...
$film_db = new Movie_info;
foreach ($film as $k => $v) {
    $film_db->$k = $v;
    echo $k .": ". $v ."<br>";
}
$film_db->save();
my database settings:
'mysql' => [
    'driver' => 'mysql',
    'url' => env('DATABASE_URL'),
    'host' => env('DB_HOST', '127.0.0.1'),
    'port' => env('DB_PORT', '3306'),
    'database' => env('DB_DATABASE', 'forge'),
    'username' => env('DB_USERNAME', 'forge'),
    'password' => env('DB_PASSWORD', ''),
    'unix_socket' => env('DB_SOCKET', ''),
    'charset' => 'utf8mb4',
    'collation' => 'utf8mb4_unicode_ci',
    'prefix' => '',
    'prefix_indexes' => true,
    'strict' => true,
    'engine' => null,
    'options' => extension_loaded('pdo_mysql') ? array_filter([
        PDO::MYSQL_ATTR_SSL_CA => env('MYSQL_ATTR_SSL_CA'),
    ]) : [],
],
Please check the output via "var_dump()" and see whether the string on the website is already HTML-encoded.
There are already tests for Unicode support, so I am pretty sure that this DOM parser has nothing to do with your problem. Maybe you can show the original HTML, so that we can add one more test, thanks. 👍
After running "var_dump($film)" on the output I got:
'title' => string 'L&eacute;on' (length=11)
This is the page I'm trying to scrape: filmweb.pl
Now, when I look at the source of the page, I can see that it is already encoded in this way xD
<h2 class="filmCoverSection__orginalTitle">L&eacute;on</h2>
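The diagnosis can be reproduced with a minimal sketch in plain PHP. It assumes the scraped value is the literal entity-encoded string from the page source; the variable names are illustrative only. It shows why var_dump() reports 11 bytes and how to confirm that a string carries HTML entities.

```php
<?php
// Assumed raw value: the title as it appears in the page source.
$raw = 'L&eacute;on';

// strlen() counts bytes: 1 ("L") + 8 ("&eacute;") + 2 ("on") = 11.
var_dump(strlen($raw));

// If decoding changes the string, the input was entity-encoded.
// A browser performs this decoding when rendering, which is why echo "looks" correct.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
var_dump($decoded !== $raw);
```

The database stores the bytes exactly as given, so the entity survives there even though the browser hides it.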
OK, I managed to fix the problem with a function from your repository voku/portable-utf8.
Using the page below, I checked which decoding method would work and chose UTF8::rawurldecode():
encoder.suckup.de
Right now my code looks like this:
$dom = HtmlDomParser::file_get_html('url');
$film['title'] = $dom->find('selector', 0)->innertext;
$film['title'] = UTF8::rawurldecode($film['title']);
...
$film_db = new Movie_info;
foreach ($film as $k => $v) {
    $film_db->$k = $v;
    echo $k .": ". $v ."<br>";
}
$film_db->save();
I think you only need UTF8::html_entity_decode()
but you can test it here: https://encoder.suckup.de/index.php
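As a rough sketch of that suggestion, the values can be decoded before they are assigned to the model. This example uses PHP's built-in html_entity_decode() rather than the UTF8:: wrapper, on the assumption that the two behave the same for a simple case like this one. The helper name decode_scraped_text() is hypothetical.

```php
<?php
// Hypothetical helper: turn entity-encoded scraped text into plain UTF-8.
function decode_scraped_text(string $value): string
{
    return trim(html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8'));
}

// Stand-in for the scraped data; in the real code this comes from HtmlDomParser.
$film = ['title' => 'L&eacute;on'];

// Decode every field before it is copied onto the model and saved.
foreach ($film as $k => $v) {
    $film[$k] = decode_scraped_text($v);
}

echo $film['title'], PHP_EOL;

// For contrast: PHP's built-in rawurldecode() only handles percent-encoding,
// so it leaves HTML entities untouched.
$unchanged = rawurldecode('L&eacute;on');
```

Decoding once, at the point where the data enters the application, keeps the stored values free of markup and avoids double-decoding later.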
Happy Coding! :)
Thanks for your help and for your very useful libraries.
The whole function that downloads information about a movie from filmweb.pl is at this link:
Filmweb.php