IonicaBizau / scrape-it

🔮 A Node.js scraper for humans.

Home Page: http://ionicabizau.net/blog/30-how-to-write-a-web-scraper-in-node-js

Encoding issue with Spanish accents

Hi, I'm a fairly new programmer and I don't know whether this issue is related to this library, but I'm trying to scrape a website whose content has accented characters. The output in my console looks like this:

     { place: 'C�rtama',
       title: 'IV Torneo de F�tbol 7 Miguel Gonz�lez Santos \'Milli\'',
       unqueriableDate: 'Fecha: Todo el a�o',
       event_img: 'img_contenido/agenda/2019/08/365993/234100240__130x130.jpg',
       location: 'Lugar: Campo Municipal Joaqu�n Mart�n D�az - C�rtama' } ] }

Is there a way to solve this? It looks like an encoding issue.

Thanks in advance

I have the same problem

I found a solution for my case.
I realized that the page I wanted to scrape didn't use standard UTF-8 encoding but an ISO encoding, as its response header showed:
Content-Type: text/html; charset="iso-8859-15"
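If you'd rather read the charset from a header like this than hard-code it, something along these lines might work. This is only a rough sketch: `charsetFromContentType` is a hypothetical helper, not part of scrape-it or axios, and it assumes the header value is available as a plain string.

```javascript
// Hypothetical helper (not part of scrape-it or axios): extract the charset
// from a Content-Type header value, falling back to utf-8 if none is declared.
function charsetFromContentType(headerValue) {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(headerValue || '');
  return match ? match[1].toLowerCase() : 'utf-8';
}

const declared = charsetFromContentType('text/html; charset="iso-8859-15"');
console.log(declared);
```

Pages can also declare their charset in a meta tag inside the HTML, which this sketch does not look at.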

I decided to decode the HTML myself and pass it to the scrapeIt.scrapeHTML function instead of calling scrapeIt directly.
Here's my code:

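const axios = require('axios');
// Use the decoder package that matches the charset your page declares
// (iso-8859-2 here; pick a different one if the header says otherwise)
const iso88592 = require('iso-8859-2');
const scrapeIt = require('scrape-it');

// Placeholder: replace with your own scrape-it options object
const mappingConfig = { title: 'h1' };

run();

async function run() {
  // Send the request and get the raw bytes instead of a pre-decoded string
  const axiosResponse = await axios.request({
    method: 'GET',
    url: 'YOUR_URL',
    responseType: 'arraybuffer',
    responseEncoding: 'binary'
  });

  // Turn the bytes into a binary string, then decode it with the charset-specific library
  const htmlString = iso88592.decode(axiosResponse.data.toString('binary'));
  const scrapedJson = await scrapeIt.scrapeHTML(htmlString, mappingConfig);

  // And here we are:
  console.log(scrapedJson);
}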
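As a possible alternative to a charset-specific package, Node's built-in TextDecoder may be able to do the same decoding. The sketch below is only an illustration: the byte values are hand-written sample data for the word "Cártama", and it assumes a Node build with full ICU (the default in official builds since Node 13), since smaller ICU builds only know a few encodings.

```javascript
// Sample data: the bytes of "Cártama" as an ISO-8859-15 server would send them
// (the accented "á" is the single byte 0xE1).
const bytes = Uint8Array.from([0x43, 0xe1, 0x72, 0x74, 0x61, 0x6d, 0x61]);

// Decoding as UTF-8 cannot interpret the lone 0xE1 byte, so it becomes U+FFFD,
// the replacement character shown in the console output in this issue.
const wrong = new TextDecoder('utf-8').decode(bytes);

// Decoding with the charset the page declares keeps the accent intact.
const right = new TextDecoder('iso-8859-15').decode(bytes);

console.log(wrong);
console.log(right);
```

In the axios flow above, the arraybuffer response arrives as a Buffer, which is a Uint8Array, so it could presumably be passed to decode() directly and the resulting string handed to scrapeIt.scrapeHTML.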

Thanks a lot for sharing, @marcellkiss