String ~ vec[i8] comparisons Python3
radujica opened this issue · comments
I am attempting in baloo to encode strings for Weld so that e.g. sr[sr != 'abc']
works, however there seems to be a bug somewhere. Are vec[i8] <comparison> vec[i8] expressions
expected to work correctly at the Weld level?
For example:
// _inp2 here is the index associated with the _inp0 strings data
|_inp0: vec[vec[i8]], _inp1: vec[i8], _inp2: vec[i64]|
let obj100 = (_inp0);
let obj101 = (map(
    obj100,
    |a: vec[i8]|
        a != _inp1
));
result(
    for(
        zip(_inp2, obj101),
        appender[i64],
        |b: appender[i64], i: i64, e: {i64, bool}|
            if (e.$1,
                merge(b, e.$0),
                b)
    )
)
This only seems to work when _inp1 is of length 1. So for:
sr = Series(np.array(['abc', 'Burgermeister', 'b'], dtype=np.bytes_))
sr[sr != 'b'] # will correctly return the first 2 elements
sr[sr != 'abc'] # does not; (returns all elements)
The most likely culprit is the encoding with Python3. The only changes I made are essentially moving from PyString_AsString and PyString_Size to the PyBytes_* equivalents (in the .cpp file) and encoding the str to utf-8, e.g. 'abc'.encode('utf-8') (in the encoders.py file):
extern "C"
weld::vec<uint8_t> str_to_weld_char_arr(PyObject* in) {
    int64_t dimension = (int64_t) PyBytes_Size(in);
    weld::vec<uint8_t> t;
    t.size = dimension;
    t.ptr = (uint8_t*) PyBytes_AsString(in);
    return t;
}
...
if isinstance(obj, str):
    numpy_to_weld = self.utils.str_to_weld_char_arr
    numpy_to_weld.restype = WeldVec(WeldChar()).ctype_class
    numpy_to_weld.argtypes = [py_object]
    return numpy_to_weld(obj.encode('utf-8'))
Note that
- En-/decoding numpy arrays of bytes works fine with the grizzly encoders (and using PyBytes_FromStringAndSize instead of PyString_FromStringAndSize).
- Also toyed around with modifying WeldChar.ctype_class to c_char_p as opposed to c_wchar_p, which seemed more appropriate yet produces the same result.
- Encoding as ascii would probably be more appropriate, since Weld can't handle unicode from what I can tell. Nevertheless, the tested data is ascii.
- This is with the master branch Weld.
Any feedback/idea on what the issue might be?
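To make the ascii point in the notes concrete, here is a minimal Python sketch. The helper name encode_ascii_strict is hypothetical and not part of grizzly or baloo; it only illustrates that, for pure-ASCII input, 'ascii' and 'utf-8' produce the same bytes, while non-ASCII input can be rejected up front instead of being passed to Weld.

```python
def encode_ascii_strict(text):
    """Encode a Python 3 str to bytes, rejecting anything outside ASCII.

    Hypothetical helper for illustration; not part of grizzly or baloo.
    """
    try:
        return text.encode('ascii')
    except UnicodeEncodeError as exc:
        raise ValueError(
            'only ASCII strings are supported: {!r}'.format(text)) from exc


if __name__ == '__main__':
    # For pure-ASCII input the two encodings give identical bytes.
    print(encode_ascii_strict('abc') == 'abc'.encode('utf-8'))
    # The individual byte values that would end up in a vec[i8].
    print(list(encode_ascii_strict('abc')))
```

So switching the encoder from utf-8 to ascii should not change the bytes for the tested data; it would only change what happens for input that Weld could not handle anyway.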
Interesting -- so are you observing that this works when using PyBytes_FromStringAndSize (with comparisons)? Comparisons should work properly at the Weld level, since it's just doing a byte-by-byte comparison.
What definitely works is encoding and decoding numpy arrays of bytes. This uses the original en/decoders, only changed to use PyBytes_FromStringAndSize during decoding (through Python 3). Note the numpy_to_weld_char_arr_arr encoder uses only numpy methods, no PyString/Bytes*.
What I want is to encode a Python3 str such that comparisons like sr[sr != 'abc'] work. Restricting this to ascii strings and passing it to the encoder as a str.encode('ascii') should then work using the encoder pasted above.
However, something goes wrong after encoding. It's as if Weld receives only the first byte correctly, hence only working for cases such as sr[sr != 'a']. Note that printing the contents of t right before returning from the encoder shows the correct ptr data and size.
Related question: what does Weld do if it ends up comparing 'a' with 'abc'?
For reference, have managed to pass a string (utf-8 even) to C++ and get it back in Python as seen in the answer here. Only seems to work in Python 3 though.
I see - when you compare 'a' and 'abc' in Weld, it will treat them both as vec[i8] and do a byte-by-byte comparison of the two strings, just like it would with any other vector. It's strange that it's only picking up the first byte... is the length field of the vector being populated correctly? For a single string, you also probably want to call numpy_to_weld_char_arr (single _arr) since the other one will create a vec[vec[i8]].
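As a reference for what the filter in the original post is expected to return, here is a small pure-Python model of the comparison. It assumes that != on two vec[i8] values means "different length, or some byte differs"; that is an assumption based on the "byte-by-byte" description above, not taken from the Weld source. The names vec_not_equal and filter_index are made up for this sketch.

```python
def vec_not_equal(left, right):
    """Model of `left != right` for two byte vectors (assumed semantics)."""
    if len(left) != len(right):
        return True
    return any(x != y for x, y in zip(left, right))


def filter_index(index, strings, needle):
    """Mimic the Weld snippet: keep index entries whose string differs from needle."""
    mask = [vec_not_equal(s, needle) for s in strings]
    return [i for i, keep in zip(index, mask) if keep]


if __name__ == '__main__':
    strings = [b'abc', b'Burgermeister', b'b']
    index = [0, 1, 2]
    print(filter_index(index, strings, b'b'))
    print(filter_index(index, strings, b'abc'))
```

Under that assumption, comparing against b'abc' should drop index 0, so getting all elements back points at the bytes Weld actually sees rather than at the comparison itself.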
That's the one I am using. str_to_weld_char_arr above is numpy_to_weld_char_arr renamed and using PyBytes_* equivalents. PyString_* versions don't compile with Python 3 libs.
Got it -- one thing to try could be to print out the raw u8 bytes after the encode and before the decode, to make sure that the correct lengths are being set in the vector. If those also look as expected (e.g., the raw bytes correspond to the correct ASCII values and the lengths are also correct), I'll spend some time digging into the generated code for vector comparisons.
For the comparison example, there is no str to decode since all the code needs to do is encode the 'abc' in sr[sr != 'abc'] and use it in the comparison. Adding cout << t.ptr << t.size; as follows returns the right values (even going byte-by-byte through [0], ...):
extern "C"
weld::vec<uint8_t> str_to_weld_char_arr(PyObject* in) {
    int64_t dimension = (int64_t) PyBytes_Size(in);
    weld::vec<uint8_t> t;
    t.size = dimension;
    t.ptr = (uint8_t*) PyBytes_AsString(in);
    cout << t.ptr << t.size;
    return t;
}
Eliminating the comparison from the equation, tried to just encode a str, pass it through Weld, and return it. That code is really just:
data = 'abc'
weld_obj = WeldObject(NumPyEncoder(), NumPyDecoder())
obj_id = weld_obj.update(data)
weld_obj.weld_code = '{}'.format(obj_id)
res = LazyResult(weld_obj, WeldChar(), 1)
print(res.evaluate())
Presumably for the reason above, grizzly does not contain a weld_to_numpy_char_arr. However, it should be just:
extern "C"
PyObject* weld_to_str(weld::vec<uint8_t> inp) {
    Py_Initialize();
    cout << inp.ptr;
    cout << inp.size;
    PyObject* out = PyBytes_FromStringAndSize((const char*) inp.ptr, inp.size);
    return out;
}
# and this in the NumPyDecoder:
elif restype == WeldVec(WeldChar()):
    weld_to_numpy = self.utils.weld_to_str
    weld_to_numpy.restype = py_object
    weld_to_numpy.argtypes = [restype.ctype_class]
    result = ctypes.cast(data, ctypes.POINTER(restype.ctype_class)).contents
    result = weld_to_numpy(result)
    return result.decode('ascii')
Running with data='a' works correctly, i.e. both the encoder and the decoder print the correct values and the data shows correctly during evaluation. With anything longer than 1 character, the decoder shows rubbish data. More precisely, the data in t.ptr in the decoder (for char str of len > 1) is non-deterministic ~ rubbish: doing int a = t.ptr[0]; cout << a; shows different values with each execution. The size, however, is correct!
As tested in the answer here by myself, just encoding and decoding the str (without Weld) works fine.
So. Is there maybe some issue with the data ~ str being freed from memory in some way, something that wouldn't happen with NumPy? It's still baffling why when the length of the str is 1, it does work.
For reasons unknown to me, copying the str in the encoder seems to solve it:
extern "C"
weld::vec<uint8_t> str_to_weld_char_arr(PyObject* in) {
    int64_t dimension = (int64_t) PyString_Size(in);
    weld::vec<uint8_t> t;
    t.size = dimension;
    const char *str = PyString_AsString(in);
    const char *copied = NULL;
    copied = strdup(str);
    t.ptr = (uint8_t*) copied;
    return t;
}
Sigh. Guess this can be closed.
Adding Py_IncRef(in) does work, thanks! Guess internally there's some Py_DecRef call on the string before Weld can use it. Interesting though that strings of length 1 worked fine.
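A plausible reading of the above, sketched in Python with ctypes: the bytes object produced by obj.encode('utf-8') is a temporary, so once the encoder call returns nothing references it any more and the raw pointer stored in the Weld vector can dangle. Copying the data or taking an extra reference both avoid that. The helper encode_keepalive below is hypothetical and only shows the "keep the owner alive" idea from the Python side. The length-1 case may be explained by CPython caching one-byte bytes objects, which is an implementation detail and an assumption here, not something confirmed in this thread; single_byte_is_shared just checks it on the running interpreter.

```python
import ctypes


def encode_keepalive(text):
    """Return (address, size, owner) for the ASCII bytes of `text`.

    Hypothetical helper: the caller must keep `owner` referenced for as long
    as `address` is in use, otherwise the buffer may be freed.
    """
    owner = text.encode('ascii')
    # c_char_p points at the bytes object's internal buffer (no copy is made).
    pointer = ctypes.c_char_p(owner)
    address = ctypes.cast(pointer, ctypes.c_void_p).value
    return address, len(owner), owner


def single_byte_is_shared():
    """Check whether two separately encoded 1-char strings are the same object."""
    return 'a'.encode('utf-8') is 'a'.encode('utf-8')


if __name__ == '__main__':
    address, size, owner = encode_keepalive('abc')
    # Safe to read because `owner` is still referenced here.
    print(ctypes.string_at(address, size))
    print(single_byte_is_shared())
```

If single_byte_is_shared() reports True, a one-character string never really goes away, which would fit the observation that only longer strings came back as rubbish.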