pelias / api

HTTP API for Pelias Geocoder

Home Page:http://pelias.io

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Long query with special characters causes 500 error

orangejulius opened this issue · comments

We recently saw some 500 errors in Geocode Earth production that initially gave us a bit of a scare! It turned out to be some fairly wild input from one of our users that triggered a 500 error in the Pelias API.

We've solved a few problems with special characters, emoji, long inputs, etc recently but this one is especially confusing to me.

Here's a URL that demonstrates the problem:

http://localhost:3100/v1/search?text=ASDFJK%20%D0%A1%D0%A2%D0%A0%D0%90%D0%9D%D0%AB%20%D0%91%D0%AB%D0%9B%D0%98%20%D0%9A%D0%A2%D0%9E%20%D0%A1%D0%9B%D0%A3%D0%A8%D0%90%D0%95%D0%A2%20%D0%9C%D0%95%D0%9D%D0%AF%D0%A41%F0%9F%92%92%F0%9F%8C%8B%F0%9F%9A%AD%F0%9F%9A%B1%E2%AC%86%EF%B8%8F%E2%86%97%EF%B8%8F%E2%9E%A1%EF%B8%8F%E2%86%98%EF%B8%8F%E2%AC%87%EF%B8%8F%E2%86%99%EF%B8%8F%E2%AC%85%EF%B8%8F%E2%86%96%EF%B8%8F%E2%86%95%EF%B8%8F%E2%86%94%EF%B8%8F%E2%86%A9%E2%86%AA%E2%A4%B4%EF%B8%8F%F0%9F%91%AF%E2%80%8D%E2%99%82%EF%B8%8F%F0%9F%87%B0%F0%9F%87%BF%E2%A4%B5%EF%B8%8F%F0%9F%94%83%F0%9F%94%84%F0%9F%9B%90%E2%9A%9B%F0%9F%95%89%E2%9C%A1%E2%98%B8%E2%99%8D%E2%98%AF%EF%B8%8F%E2%98%B8%E2%9C%A1%F0%9F%95%89%F0%9F%9B%90%E2%9A%9B%F0%9F%94%85%F0%9F%94%86%F0%9F%94%B4%F0%9F%9F%A0%F0%9F%9F%A1%F0%9F%9F%A2%D0%9D%D0%97%D0%95%D0%92%D0%98%20%D0%9B%20%D0%9A%20%D0%92%D0%9E%20%D0%A3%D0%9D%D0%A2%D0%92%D0%90854%D0%A332%D0%991%F0%9F%94%B5%F0%9F%9F%A3%F0%9F%9F%A4%E2%9A%AA%E2%9A%AB%E2%9A%AB%F0%9F%9F%A5%F0%9F%9F%A7%F0%9F%9F%A8%F0%9F%94%B6%EF%B8%8F%F0%9F%9F%A9%F0%9F%9F%A6%F0%9F%9F%AA%F0%9F%9F%AB%E2%AC%9B%E2%AC%9C%E2%97%BC%E2%97%BB%E2%97%BE%E2%97%BD%F0%9F%92%A0

Oddly, this URL only appears to cause issues for me when using it on the command line with curl. Pasting it into a browser works fine.

This is the error printed by the API in that case:

URIError: URI malformed
    at encodeURI (<anonymous>)
    at synthesizeUrl (/home/julian/repos/pelias/api/node_modules/pelias-microservice-wrapper/service.js:21:12)
    at /home/julian/repos/pelias/api/node_modules/pelias-microservice-wrapper/service.js:75:25
    at controller (/home/julian/repos/pelias/api/controller/libpostal.js:73:5)
    at Layer.handle [as handle_request] (/home/julian/repos/pelias/api/node_modules/express/lib/router/layer.js:95:5)
    at trim_prefix (/home/julian/repos/pelias/api/node_modules/express/lib/router/index.js:317:13)
    at /home/julian/repos/pelias/api/node_modules/express/lib/router/index.js:284:7
    at Function.process_params (/home/julian/repos/pelias/api/node_modules/express/lib/router/index.js:335:12)
    at next (/home/julian/repos/pelias/api/node_modules/express/lib/router/index.js:275:10)
    at setQuerySize (/home/julian/repos/pelias/api/middleware/sizeCalculator.js:22:5)

From the browser I get a reasonable response. I assume the browser does some cleaning that helps fix things:

{
  "geocoding": {
    "version": "0.2",
    "attribution": "http://192.168.1.191:3100/attribution",
    "query": {
      "text": "ASDFJK СТРАНЫ БЫЛИ КТО СЛУШАЕТ МЕНЯФ1🇰🇿🟠🟡🟢НЗЕВИ Л К ВО УНТВА854У32Й1🟣🟤🟥🟧🟨🟩🟦🟪🟫",
      "size": 10,
      "private": false,
      "lang": {
        "name": "English",
        "iso6391": "en",
        "iso6393": "eng",
        "via": "header",
        "defaulted": false
      },
      "querySize": 20,
      "parser": "pelias",
      "parsed_text": {
        "subject": "ASDFJK СТРАНЫ БЫЛИ КТО СЛУШАЕТ МЕНЯФ1🇰🇿🟠🟡🟢НЗЕВИ Л К ВО УНТВА854У32Й1🟣🟤🟥🟧🟨🟩🟦🟪🟫"
      }
    },
    "warnings": [
      "param 'text' truncated to 140 characters"
    ],
    "engine": {
      "name": "Pelias",
      "author": "Mapzen",
      "version": "1.0"
    },
    "timestamp": 1635525115912
  },
  "type": "FeatureCollection",
  "features": []
}

The really strange part is that I've been having trouble coming up exactly what is causing the problem. I've tried taking shorter subsets of the input, they all work. I can make the query work by removing only the last URLEncoded character (%A0, corresponding to a non-breaking space). But a simple query like foo%A0 works fine, so it's not just that character.

Ideally we would figure out what part of this input is causing issues and, assuming it's truly invalid in some way, either prune out the invalid characters and return an HTTP 200, or return an HTTP 400 if we can't reasonably do that.

Could this possibly due to how your terminal is handling the encoding?
I find the way my Mac handles copy->pasting URLs unintuitive sometimes compared to Linux.

You probably tried this already, but does the behaviour change when you wrap your url in single quotes on the CLI?

Okay, we think we tracked down the problem here. It looks like, because the code we use to trim the text parameter is not unicode aware, it can trim the input in the middle of UTF-8 multibyte characters:

text = text.substring(0, MAX_TEXT_LENGTH);

There are packages like unicode-substring that can handle this in a safer way, and we might end up using one of those.

Some other details:

  • it doesn't look like a terminal encoding issue
  • Apparently Firefox in particular does something that makes this query work in the browser. Chrome behaves the same as curl 🤷

it seems that this bug is the cause of the issue.

I can confirm that #1577 fixed this and the query shown above no longer causes a 500 error with API version v5.34.0.

Also, the issue with browsers modifying the request seems to be limited to Firefox only. Chrome was able to reproduce the error just fine. Interesting.