hse-aml / natural-language-processing

Resources for "Natural Language Processing" Coursera course.

Home Page: https://www.coursera.org/learn/language-processing

main_bot struggles if you have non-ascii characters in your name

loeiten opened this issue

If your Telegram name contains any non-ASCII characters, the bot will crash when it prints the update:

Ready to talk!
An update received.
Traceback (most recent call last):
  File "main_bot.py", line 111, in <module>
    main()
  File "main_bot.py", line 103, in main
    print("Update content: {}".format(update))
UnicodeEncodeError: 'ascii' codec can't encode character '\xf8' in position 153: ordinal not in range(128)
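
For anyone wanting to reproduce the error outside the bot, here is a minimal sketch. It assumes Python 3, and the name `J\u00f8rgen` is a made-up example containing the same character ('\xf8') as in the traceback. Encoding it with the ASCII codec, which is what print() effectively does when stdout is ASCII-only, fails in the same way:

```python
# Hypothetical display name containing U+00F8, the character from the traceback.
name = "J\u00f8rgen"

try:
    # This is the step that fails when stdout only accepts ASCII.
    name.encode("ascii")
    error_message = None
except UnicodeEncodeError as err:
    error_message = str(err)

print(error_message)

# UTF-8 can represent the same character (as two bytes), so this succeeds.
utf8_bytes = name.encode("utf-8")
print(utf8_bytes)
```

The failure depends on the encoding of the output stream, not on the bot logic itself.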

Although it adds some computational overhead, adding the following function

def cast_to_utf_8(old_dict):
    """
    Encodes the string content of a dict to utf-8

    Parameters
    ----------
    old_dict : dict
        The dict to encode

    Returns
    -------
    new_dict : dict
        The encoded dict
    """

    def walk(node):
        """
        Recursively traverses a node and encodes all strings to utf-8

        Parameters
        ----------
        node : dict
            The node to traverse

        Returns
        -------
        node : dict
            The node where the strings are encoded to utf-8
        """
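        for key, item in node.items():
            if isinstance(item, dict):
                # Nested dict: encode its strings in place
                walk(item)
            elif isinstance(item, list):
                for i, elem in enumerate(item):
                    if isinstance(elem, dict):
                        # Dicts inside lists also need to be traversed
                        walk(elem)
                    elif isinstance(elem, str):
                        item[i] = elem.encode('utf-8')
            elif isinstance(item, str):
                # Replacing the value of an existing key is safe while iterating
                node[key] = item.encode('utf-8')
        return node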

    new_dict = walk(old_dict)

    return new_dict

and calling it like this in main()

                    if is_unicode(text):
                        update = cast_to_utf_8(update)
                        print("Update content: {}".format(update))
                        bot.send_message(chat_id, bot.get_answer(update["message"]["text"]))
                    else:
                        bot.send_message(chat_id, "Hmm, you are sending some weird characters to me...")

was a remedy for me
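
A lighter-weight variant, sketched here under the assumption that only the log line needs protecting and that the bot runs on Python 3: the built-in `ascii()` escapes every non-ASCII character in the printed representation, so the update itself is left untouched and its strings stay `str`. The helper name `format_update` and the sample update are hypothetical, not part of main_bot.py:

```python
def format_update(update):
    """Return a log line that contains only ASCII characters."""
    # ascii() behaves like repr() but escapes non-ASCII characters (e.g. \xf8).
    return "Update content: {}".format(ascii(update))


# Hypothetical update with a non-ASCII first name.
update = {"message": {"from": {"first_name": "J\u00f8rgen"}, "text": "hi"}}

line = format_update(update)
print(line)

# The original data is unchanged, so the text can still be passed on as a str.
text = update["message"]["text"]
```

This keeps the print from crashing on an ASCII-only terminal, at the cost of showing escapes instead of the real characters.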

Hello loeiten,
It looks like your terminal is not configured to output Unicode characters. Have you tried setting a locale that supports Unicode? E.g.:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8

See this article for more details.

We can probably detect such situations in the code, though it feels a bit opinionated - some users might want to use Unicode characters.
Whatever the final solution, we will reflect this in the docs for the assignment.
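
To see that Python picks its stdout encoding from the environment, here is a rough sketch. Note the assumption: instead of the locale variables above, it uses `PYTHONIOENCODING`, a CPython environment variable that overrides the stdio encoding directly, because its effect is the same on every platform. The helper name `child_stdout_encoding` is made up for this example:

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Start a child interpreter with PYTHONIOENCODING set and
    return the canonical name of its stdout encoding."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    # Normalize spellings such as "UTF-8" / "utf8" to one canonical name.
    return codecs.lookup(result.stdout.strip()).name


ascii_encoding = child_stdout_encoding("ascii")
utf8_encoding = child_stdout_encoding("utf-8")
print(ascii_encoding, utf8_encoding)
```

With an ASCII stdout, printing a name like the one in the traceback fails; with UTF-8 it works, which is why exporting a UTF-8 locale resolves the crash.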

Thanks @akashin, I was not aware of this. Your suggestion worked :).

Indeed, maybe writing it in the docs would be best if it affects several people.
If not, a simple

import sys
if 'UTF-8' not in (sys.stdout.encoding or '').upper():
    # Suggest exporting UTF-8
    print("stdout is not UTF-8; try: export LC_ALL=en_US.UTF-8")

would also do