medialab / sandcrawler

sandcrawler.js - the server-side scraping companion.

Home Page: http://medialab.github.io/sandcrawler/

Handle nasty charset polymorphism

Yomguithereal opened this issue

@boogheta: do you know this way of indicating the charset?

en_US.iso885915

mmm it's a mix of locale and encoding I guess?

I guess so, but I didn't even know this was standard.

This format is described in POSIX Base Definitions, 8.2 Internationalization Variables:

If the locale value has the form:

language[_territory][.codeset]

it refers to an implementation-provided locale, where settings of language, territory, and codeset are implementation-defined.

LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME are defined to accept an additional field @ modifier, which allows the user to select a specific instance of localization data within a single category (for example, for selecting the dictionary as opposed to the character ordering of data). The syntax for these environment variables is thus defined as:

[language[_territory][.codeset][@modifier]]

Thanks @eric-brechemier. I guess we'll have to fix the parser's implementation to make this work.
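
For illustration, here is a minimal sketch of how a charset parser could split a value following the `[language[_territory][.codeset][@modifier]]` syntax quoted above. The names `parsePosixLocale` and `extractCharset` are hypothetical and not part of sandcrawler; the regular expression is one possible reading of the grammar, not the fix that was actually applied.

```javascript
// Hypothetical helpers, not part of sandcrawler's API.
// Pattern for: language[_territory][.codeset][@modifier]
const LOCALE_PATTERN =
  /^([A-Za-z]+)(?:_([A-Za-z0-9]+))?(?:\.([^@]+))?(?:@(.+))?$/;

// Split a POSIX-style locale string into its parts.
// Returns null when the value does not follow the pattern.
function parsePosixLocale(value) {
  const match = LOCALE_PATTERN.exec(String(value).trim());
  if (!match) return null;

  const [, language, territory, codeset, modifier] = match;
  return {
    language,
    territory: territory || null,
    codeset: codeset || null,
    modifier: modifier || null
  };
}

// Return the codeset when the value is a locale that carries one,
// otherwise hand back the trimmed value untouched (e.g. a plain "utf-8").
function extractCharset(value) {
  const parsed = parsePosixLocale(value);
  return parsed && parsed.codeset ? parsed.codeset : String(value).trim();
}

console.log(extractCharset('en_US.iso885915'));
console.log(extractCharset('utf-8'));
```

With this sketch, `en_US.iso885915` yields the codeset `iso885915`, while a plain charset such as `utf-8` passes through unchanged. Mapping `iso885915` to a canonical name like `ISO-8859-15` would still need a separate alias table, which is left out here.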