weld-project / weld

High-performance runtime for data analytics applications

Home Page: https://www.weld.rs

String ~ vec[i8] comparisons Python3

radujica opened this issue · comments

commented

I am attempting in baloo to encode strings to Weld so that e.g. sr[sr != 'abc'] works; however, there seems to be a bug somewhere. Are vec[i8] <comparison> vec[i8] operations expected to work correctly at the Weld level?

For example:

// _inp2 here is the index associated with the _inp0 strings data
|_inp0: vec[vec[i8]], _inp1: vec[i8], _inp2: vec[i64]| let obj100 = (_inp0);
let obj101 = (map(
    obj100,
    |a: vec[i8]| 
        a != _inp1
));
result(
    for(
        zip(_inp2, obj101),
        appender[i64],
        |b: appender[i64], i: i64, e: {i64, bool}| 
            if (e.$1, 
                merge(b, e.$0), 
                b)
    )
)

This only seems to work when _inp1 is of length 1. So for:

sr = Series(np.array(['abc', 'Burgermeister', 'b'], dtype=np.bytes_))
sr[sr != 'b']  # will correctly return the first 2 elements
sr[sr != 'abc']  # does not; (returns all elements)
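For reference, the filter above is expected to behave like the following pure-Python model. This is only a sketch of the intended semantics, not of how Weld executes the program; filter_index_ne is a made-up name, and the index values 0, 1, 2 are an assumption about what _inp2 holds.

```python
def filter_index_ne(data, needle, index):
    """Keep the index entries whose element differs from needle.

    Mirrors the Weld program: map(data, a != needle), then merge the
    index values whose flag is true.
    """
    # map(obj100, |a| a != _inp1)
    flags = [a != needle for a in data]
    # for(zip(_inp2, obj101), appender, if(e.$1, merge(b, e.$0), b))
    return [i for i, keep in zip(index, flags) if keep]


data = [b'abc', b'Burgermeister', b'b']
index = [0, 1, 2]  # assumed default index, not taken from the issue

print(filter_index_ne(data, b'b', index))    # [0, 1]
print(filter_index_ne(data, b'abc', index))  # [1, 2]
```

With sr != 'abc' the buggy run returns all three index entries instead of the last two.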

The most likely culprit is the encoding with Python3. The only changes I made are essentially moving from PyString_AsString and PyString_Size to the PyBytes_* equivalents (in the .cpp file) and encoding the str to utf-8, e.g. 'abc'.encode('utf-8') (in the encoders.py file):

extern "C"
weld::vec<uint8_t> str_to_weld_char_arr(PyObject* in) {
  int64_t dimension = (int64_t) PyBytes_Size(in);
  weld::vec<uint8_t> t;
  t.size = dimension;
  t.ptr = (uint8_t*) PyBytes_AsString(in);
  return t;
}

...
if isinstance(obj, str):
    numpy_to_weld = self.utils.str_to_weld_char_arr
    numpy_to_weld.restype = WeldVec(WeldChar()).ctype_class
    numpy_to_weld.argtypes = [py_object]

    return numpy_to_weld(obj.encode('utf-8'))

Note that

  1. En-/decoding numpy arrays of bytes works fine with the grizzly encoders (and using PyBytes_FromStringAndSize instead of PyString_FromStringAndSize).
  2. Also toyed around with modifying WeldChar.ctype_class to c_char_p as opposed to c_wchar_p which seemed more appropriate yet produces the same result.
  3. Encoding as ascii would probably be more appropriate, since Weld can't handle unicode from what I can tell. Nevertheless, the tested data is ascii.
  4. This is with the master branch Weld.

Any feedback/idea on what the issue might be?

Interesting -- so are you observing that this works when using PyBytes_FromStringAndSize (with comparisons)? Comparisons should work properly at the Weld level, since it's just doing a byte-by-byte comparison.

commented

What definitely works is encoding and decoding numpy arrays of bytes. This uses the original en/decoders only changed to use PyBytes_FromStringAndSize during decoding (through Python 3). Note the numpy_to_weld_char_arr_arr encoder uses only numpy methods, no PyString/Bytes*.

What I want is to encode a Python3 str such that comparisons like sr[sr != 'abc'] work. Restricting this to ascii string and passing it to the encoder as a str.encode('ascii') should then work using the encoder pasted above.

However, something goes wrong after encoding. It's as if Weld receives only the first byte correctly, hence only working for cases such as sr[sr != 'a']. Note that printing the contents of t right before returning from the encoder shows the correct ptr data and size.

Related question: what does Weld do if it ends up comparing 'a' with 'abc'?

For reference, have managed to pass a string (utf-8 even) to C++ and get it back in Python as seen in the answer here. Only seems to work in Python 3 though.

I see - when you compare 'a' and 'abc' in Weld, it will treat them both as vec[i8] and do a byte-by-byte comparison of the two strings, just like it would with any other vector. It's strange that it's only picking up the first byte... is the length field of the vector being populated correctly? For a single string, you also probably want to call numpy_to_weld_char_arr (single _arr) since the other one will create a vec[vec[i8]].

commented

That's the one I am using. str_to_weld_char_arr above is numpy_to_weld_char_arr renamed and using PyBytes_* equivalents. PyString_* versions don't compile with Python 3 libs.

Got it -- one thing to try could be to print out the raw u8 bytes after the encode and before the decode, to make sure that the correct lengths are being set in the vector. If those also look as expected (e.g., the raw bytes correspond to the correct ASCII values and the lengths are also correct), I'll spend some time digging into the generated code for vector comparisons.
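A check like this can also be done from the Python side with ctypes. The sketch below assumes a struct layout that matches the ptr and size fields used in the C++ encoder; VecU8, to_vec and dump are hypothetical names, not part of grizzly or Weld.

```python
import ctypes


class VecU8(ctypes.Structure):
    # Assumed layout: a byte pointer followed by a 64-bit length,
    # mirroring the t.ptr / t.size fields in the C++ snippet.
    _fields_ = [("ptr", ctypes.POINTER(ctypes.c_uint8)),
                ("size", ctypes.c_int64)]


def to_vec(buf, size):
    """Wrap the first `size` bytes of a ctypes buffer without copying."""
    vec = VecU8()
    vec.ptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_uint8))
    vec.size = size
    return vec


def dump(vec):
    """Return the raw byte values the vector currently points at."""
    return [vec.ptr[i] for i in range(vec.size)]


buf = ctypes.create_string_buffer(b'abc')  # the caller keeps buf alive
vec = to_vec(buf, 3)
print(vec.size, dump(vec))  # 3 [97, 98, 99]
```

Calling dump right after encoding and again right before decoding would show whether the bytes change in between while the size stays the same.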

commented

For the comparison example, there is no str to decode since all the code needs to do is encode the 'abc' in sr[sr != 'abc'] and use it in the comparison. Adding cout << t.ptr << t.size; as follows returns the right values (even going byte-by-byte through [0], ...):

extern "C"
weld::vec<uint8_t> str_to_weld_char_arr(PyObject* in) {
  int64_t dimension = (int64_t) PyBytes_Size(in);
  weld::vec<uint8_t> t;
  t.size = dimension;
  t.ptr = (uint8_t*) PyBytes_AsString(in);
  cout << t.ptr << t.size;
  return t;
}

To eliminate the comparison from the equation, I tried to just encode a str, pass it through Weld, and return it. That code is really just:

data = 'abc'
weld_obj = WeldObject(NumPyEncoder(), NumPyDecoder())
obj_id = weld_obj.update(data)
weld_obj.weld_code = '{}'.format(obj_id)
res = LazyResult(weld_obj, WeldChar(), 1)
print(res.evaluate())

Presumably for the reason above, grizzly does not contain a weld_to_numpy_char_arr. However, it should be just:

extern "C"
PyObject* weld_to_str(weld::vec<uint8_t> inp) {
  Py_Initialize();
  cout << inp.ptr;
  cout << inp.size;
  PyObject* out = PyBytes_FromStringAndSize((const char*) inp.ptr, inp.size);
  return out;
}

# and this in the NumPyDecoder:
elif restype == WeldVec(WeldChar()):
    weld_to_numpy = self.utils.weld_to_str
    weld_to_numpy.restype = py_object
    weld_to_numpy.argtypes = [restype.ctype_class]
    result = ctypes.cast(data, ctypes.POINTER(restype.ctype_class)).contents
    result = weld_to_numpy(result)
    return result.decode('ascii')

Running with data='a' works correctly, i.e. both the encoder and the decoder print the correct values and the data shows correctly during evaluation. With anything longer than 1 character, the decoder shows rubbish data. More precisely, the data in inp.ptr in the decoder (for a str of len > 1) is non-deterministic rubbish: doing int a = inp.ptr[0]; cout << a; shows different values with each execution. The size, however, is correct!

As tested in the answer here by myself, just encoding and decoding the str (without Weld) works fine.

So. Is there maybe some issue with the data ~ str being freed from memory in some way, something that wouldn't happen with NumPy? It's still baffling why when the length of the str is 1, it does work.
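One way to probe this lifetime hypothesis from the Python side is to keep the encoded bytes object referenced for as long as its address is in use. obj.encode(...) creates a temporary, and a bare address handed to native code does nothing to keep that temporary alive. The sketch below is only an illustration; KeepAliveEncoder is a hypothetical class, not something grizzly or baloo provides.

```python
import ctypes


class KeepAliveEncoder:
    """Hypothetical encoder that owns every bytes object it hands out."""

    def __init__(self):
        self._alive = []  # references that pin the encoded buffers

    def encode(self, text):
        raw = text.encode('ascii')
        # A plain integer address does not hold a reference to raw, so
        # the encoder stores raw itself to keep the buffer valid.
        self._alive.append(raw)
        addr = ctypes.cast(ctypes.c_char_p(raw), ctypes.c_void_p).value
        return addr, len(raw)


enc = KeepAliveEncoder()
addr, size = enc.encode('abc')
print(ctypes.string_at(addr, size))  # b'abc'
```

If the rubbish data disappears when the bytes object is pinned like this, the problem is the object's lifetime rather than the encoding.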

commented

For reasons unknown to me, copying the str in the encoder seems to solve it:

extern "C"
weld::vec<uint8_t> str_to_weld_char_arr(PyObject* in) {
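  // PyBytes_* rather than PyString_*: the latter do not compile against Python 3.
  int64_t dimension = (int64_t) PyBytes_Size(in);
  weld::vec<uint8_t> t;
  t.size = dimension;
  // Copy the bytes so the vector no longer depends on the lifetime of `in`.
  // Note: strdup stops at the first NUL byte, and the copy is never freed here.
  const char *str = PyBytes_AsString(in);
  char *copied = strdup(str);
  t.ptr = (uint8_t*) copied;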
  return t;
}

Sigh. Guess this can be closed.

commented

Adding Py_IncRef(in) does work, thanks! Guess internally there's some Py_DecRef call on the string before Weld can use it. Interesting though that strings of length 1 worked fine.
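A plausible explanation for the length-1 case, assuming CPython: it keeps a cache of single-byte bytes objects, so encoding a one-character ASCII string returns a shared, long-lived object whose buffer survives after the caller drops its reference. Longer strings get a fresh object on every encode. This is a CPython implementation detail, not something the language guarantees, and is_shared is a hypothetical helper for illustration.

```python
def is_shared(text):
    """True if two separate encodes of text yield the very same bytes object."""
    first = text.encode('ascii')
    second = text.encode('ascii')
    return first is second


print(is_shared('a'))    # True on CPython: single-byte objects are cached
print(is_shared('abc'))  # False: each encode allocates a new object
```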