String ~ vec[i8] comparisons Python3

radujica opened this issue 6 years ago · comments

I am attempting, in baloo, to encode strings to Weld so that e.g. sr[sr != 'abc'] works; however, there seems to be a bug somewhere. Are vec[i8] <comparison> vec[i8] operations expected to work correctly at the Weld level?

For example:

// _inp2 here is the index associated with the _inp0 strings data
|_inp0: vec[vec[i8]], _inp1: vec[i8], _inp2: vec[i64]| let obj100 = (_inp0);
let obj101 = (map(
    obj100,
    |a: vec[i8]| 
        a != _inp1
));
result(
    for(
        zip(_inp2, obj101),
        appender[i64],
        |b: appender[i64], i: i64, e: {i64, bool}| 
            if (e.$1, 
                merge(b, e.$0), 
                b)
    )
)

This only seems to work when _inp1 is of length 1. So for:

sr = Series(np.array(['abc', 'Burgermeister', 'b'], dtype=np.bytes_))
sr[sr != 'b']  # will correctly return the first 2 elements
sr[sr != 'abc']  # does not; (returns all elements)
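
For reference, the behavior expected from the Weld program above can be mirrored in plain Python. This is only a sketch of the intended semantics, not baloo or Weld code; the function name filter_index_not_equal is made up for illustration.

```python
def filter_index_not_equal(strings, needle, index):
    """Return the index entries whose string differs from `needle`."""
    # Counterpart of map(obj100, |a| a != _inp1): one bool per string.
    mask = [s != needle for s in strings]
    # Counterpart of the for/appender loop: keep index values where the mask is True.
    return [i for i, keep in zip(index, mask) if keep]


data = [b"abc", b"Burgermeister", b"b"]
index = [0, 1, 2]
print(filter_index_not_equal(data, b"b", index))    # [0, 1]
print(filter_index_not_equal(data, b"abc", index))  # [1, 2]
```

The second call is the case that misbehaves through Weld: it should drop index 0, not return all elements.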

The most likely culprit is the encoding with Python3. The only changes I made are essentially moving from PyString_AsString and PyString_Size to the PyBytes_* equivalents (in the .cpp file) and encoding the str to utf-8, e.g. abc.encode('utf-8') (in the encoders.py file):

extern "C"
weld::vec<uint8_t> str_to_weld_char_arr(PyObject* in) {
  int64_t dimension = (int64_t) PyBytes_Size(in);
  weld::vec<uint8_t> t;
  t.size = dimension;
  t.ptr = (uint8_t*) PyBytes_AsString(in);
  return t;
}

...
if isinstance(obj, str):
        numpy_to_weld = self.utils.str_to_weld_char_arr
        numpy_to_weld.restype = WeldVec(WeldChar()).ctype_class
        numpy_to_weld.argtypes = [py_object]

        return numpy_to_weld(obj.encode('utf-8'))

Note that:

- En-/decoding numpy arrays of bytes works fine with the grizzly encoders (and using PyBytes_FromStringAndSize instead of PyString_FromStringAndSize).
- Also toyed around with modifying WeldChar.ctype_class to c_char_p as opposed to c_wchar_p, which seemed more appropriate yet produces the same result.
- Encoding as ascii would probably be more appropriate, since Weld can't handle unicode from what I can tell. Nevertheless, the tested data is ascii.
- This is with the master branch Weld.
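
On the ascii point, a minimal sketch of what a stricter encoding step could look like before the bytes are handed to the C++ helper. The helper name to_weld_bytes is hypothetical and not part of grizzly or baloo; it only illustrates rejecting non-ascii input early instead of passing multi-byte utf-8 sequences on to Weld.

```python
def to_weld_bytes(text):
    """Encode `text` as ascii bytes, failing loudly on non-ascii input."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as exc:
        # Hypothetical policy: refuse anything Weld would see as multi-byte.
        raise ValueError("only ascii strings are supported: {!r}".format(text)) from exc


# For pure ascii input, ascii and utf-8 encodings produce the same bytes.
print(to_weld_bytes("abc") == "abc".encode("utf-8"))
```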

Any feedback/idea on what the issue might be?

Shoumik Palkar · Answer 1 · Wed Oct 31 2018 05:23:02 GMT+0800 (China Standard Time)

Interesting -- so are you observing that this works when using PyBytes_FromStringAndSize (with comparisons)? Comparisons should work properly at the Weld level, since it's just doing a byte-by-byte comparison.

Radu · Answer 2 · Wed Oct 31 2018 19:40:23 GMT+0800 (China Standard Time)

What definitely works is encoding and decoding numpy arrays of bytes. This uses the original en/decoders only changed to use PyBytes_FromStringAndSize during decoding (through Python 3). Note the numpy_to_weld_char_arr_arr encoder uses only numpy methods, no PyString/Bytes*.

What I want is to encode a Python3 str such that comparisons like sr[sr != 'abc'] work. Restricting this to ascii string and passing it to the encoder as a str.encode('ascii') should then work using the encoder pasted above.

However, something goes wrong after encoding. It's as if Weld receives only the first byte correctly, hence only working for cases such as sr[sr != 'a']. Note that printing the contents of t right before returning from the encoder shows the correct ptr data and size.

Related question: what does Weld do if it ends up comparing 'a' with 'abc'?

For reference, have managed to pass a string (utf-8 even) to C++ and get it back in Python as seen in the answer here. Only seems to work in Python 3 though.

Shoumik Palkar · Answer 3 · Thu Nov 01 2018 01:58:31 GMT+0800 (China Standard Time)

I see - when you compare 'a' and 'abc' in Weld, it will treat them both as vec[i8] and do a byte-by-byte comparison of the two strings, just like it would with any other vector. It's strange that it's only picking up the first byte... is the length field of the vector being populated correctly? For a single string, you also probably want to call numpy_to_weld_char_arr (single _arr) since the other one will create a vec[vec[i8]].

Radu · Answer 4 · Thu Nov 01 2018 03:48:51 GMT+0800 (China Standard Time)

That's the one I am using. str_to_weld_char_arr above is numpy_to_weld_char_arr renamed and using PyBytes_* equivalents. PyString_* versions don't compile with Python 3 libs.

Shoumik Palkar · Answer 5 · Thu Nov 01 2018 04:21:09 GMT+0800 (China Standard Time)

Got it -- one thing to try could be to print out the raw u8 bytes after the encode and before the decode, to make sure that the correct lengths are being set in the vector. If those also look as expected (e.g., the raw bytes correspond to the correct ASCII values and the lengths are also correct), I'll spend some time digging into the generated code for vector comparisons.

Radu · Answer 6 · Thu Nov 01 2018 18:40:55 GMT+0800 (China Standard Time)

For the comparison example, there is no str to decode since all the code needs to do is encode the 'abc' in sr[sr != 'abc'] and use it in the comparison. Adding cout << t.ptr << t.size; as follows returns the right values (even going byte-by-byte through [0], ...):

extern "C"
weld::vec<uint8_t> str_to_weld_char_arr(PyObject* in) {
  int64_t dimension = (int64_t) PyBytes_Size(in);
  weld::vec<uint8_t> t;
  t.size = dimension;
  t.ptr = (uint8_t*) PyBytes_AsString(in);
  cout << t.ptr << t.size;
  return t;
}

Eliminating the comparison from the equation, tried to just encode a str, pass it through Weld, and return it. That code is really just:

data = 'abc'
weld_obj = WeldObject(NumPyEncoder(), NumPyDecoder())
obj_id = weld_obj.update(data)
weld_obj.weld_code = '{}'.format(obj_id)
res = LazyResult(weld_obj, WeldChar(), 1)
print(res.evaluate())

Presumably for the reason above, grizzly does not contain a weld_to_numpy_char_arr. However, it should be just:

extern "C"
PyObject* weld_to_str(weld::vec<uint8_t> inp) {
  Py_Initialize();
  cout << inp.ptr;
  cout << inp.size;
  PyObject* out = PyBytes_FromStringAndSize((const char*) inp.ptr, inp.size);
  return out;
}

# and this in the NumPyDecoder:
elif restype == WeldVec(WeldChar()):
        weld_to_numpy = self.utils.weld_to_str
        weld_to_numpy.restype = py_object
        weld_to_numpy.argtypes = [restype.ctype_class]
        result = ctypes.cast(data, ctypes.POINTER(restype.ctype_class)).contents
        result = weld_to_numpy(result)
        return result.decode('ascii')

Running with data='a' works correctly, i.e. both the encoder and the decoder print the correct values and the data shows correctly during evaluation. With anything longer than 1 character, the decoder shows rubbish data. More precisely, the data in inp.ptr in the decoder (for a str of len > 1) is non-deterministic ~ rubbish: doing int a = inp.ptr[0]; cout << a; shows different values with each execution. The size, however, is correct!

As tested in the answer here by myself, just encoding and decoding the str (without Weld) works fine.

So. Is there maybe some issue with the data ~ str being freed from memory in some way, something that wouldn't happen with NumPy? It's still baffling why when the length of the str is 1, it does work.

Radu · Answer 7 · Thu Nov 01 2018 18:52:08 GMT+0800 (China Standard Time)

For reasons unknown to me, copying the str in the encoder seems to solve it:

extern "C"
weld::vec<uint8_t> str_to_weld_char_arr(PyObject* in) {
  int64_t dimension = (int64_t) PyString_Size(in);
  weld::vec<uint8_t> t;
  t.size = dimension;
  const char *str = PyString_AsString(in);
  const char *copied = NULL;
  copied = strdup(str);
  t.ptr = (uint8_t*) copied;
  return t;
}

Sigh. Guess this can be closed.

Shoumik Palkar · Answer 8 · Thu Nov 01 2018 23:21:51 GMT+0800 (China Standard Time)

Hmmm interesting....it’s possible that the reference count of the string needs to be incremented on the Python side? This may not apply for NumPy arrays since the buffers are managed directly by NumPy itself. You could try adding a Py_INCREF call and see if that makes any difference to the version without the strdup to test this.

Radu · Answer 9 · Tue Nov 06 2018 16:59:08 GMT+0800 (China Standard Time)

Adding Py_IncRef(in) does work, thanks! Guess internally there's some Py_DecRef call on the string before Weld can use it. Interesting though that strings of length 1 worked fine.
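
One reading of the above, offered as an assumption rather than a confirmed diagnosis: obj.encode('utf-8') creates a temporary bytes object, PyBytes_AsString returns a pointer into that object's own buffer, and nothing keeps the temporary alive once the encoder call returns, so Weld may later read freed memory. Both strdup and Py_IncRef would avoid that. Single-character strings may have worked because CPython tends to reuse cached one-byte bytes objects, which is an implementation detail. The sketch below shows the same idea from the Python side with ctypes; CharVec and encode_str are hypothetical stand-ins, not the grizzly types.

```python
import ctypes


class CharVec(ctypes.Structure):
    # Hypothetical stand-in for the struct behind WeldVec(WeldChar()).ctype_class.
    _fields_ = [("ptr", ctypes.POINTER(ctypes.c_uint8)),
                ("size", ctypes.c_int64)]


def encode_str(text):
    """Return (vec, keepalive); vec.ptr is valid only while keepalive is referenced."""
    raw = text.encode("ascii")
    # create_string_buffer copies the bytes into memory owned by `buf`.
    buf = ctypes.create_string_buffer(raw)
    vec = CharVec(ctypes.cast(buf, ctypes.POINTER(ctypes.c_uint8)), len(raw))
    return vec, buf


vec, keepalive = encode_str("abc")
# Read the bytes back through the raw pointer, as native code would.
roundtrip = ctypes.string_at(vec.ptr, vec.size)
print(roundtrip, vec.size)
```

Returning the owning buffer alongside the struct makes the lifetime explicit: the caller decides how long the memory must outlive the native call.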