Creating high-dimensional PyJArrays

Question

gselzer opened this issue 2 years ago

Describe the bug
While you can create nested PyJArrays, they seem to be read-only. Furthermore, assigning a sub-array to a variable seems to overwrite the original subarray within the multi-dimensional array. Is this intended behavior?

To Reproduce
Here's a small reproduction that can be run in the interactive console:

import jep

# Create a 4x4 Boolean[][]
f_inner = jep.jarray(4, 'b')
f = jep.jarray(4, f_inner)
for i in range(4):
    f[i] = jep.jarray(4, 'b')

print(f"The size of f is {len(f)}-by-{len(f[0])}")

# Try to assign a value to f[0][0]
f[0][0] = 4
print(f"f[0][0]={f[0][0]}")

# memory locations
print(f"{id(f[0])} is the address of f[0]")
g = f[0] #NB we CAN write to this just fine
print(f"{id(f[0])} is the address of f[0] after assignment")
g = f[0] #NB we CAN write to this just fine
print(f"{id(f[0])} is the address of f[0] after two assignments")

Expected behavior
I'd expect to be able to write to these multi-dimensional arrays such that the write is preserved.

Environment (please complete the following information):

OS Platform, Distribution, and Version: Ubuntu 20.04.5 LTS
Python Distribution and Version: Python 3.10.6
Java Distribution and Version: openjdk 11.0.15-internal 2022-04-19
Jep Version: 4.1.0
Python packages used (e.g. numpy, pandas, tensorflow): None

Ben Steffensmeier · Answer 1 · Sat Nov 05 2022 07:59:18 GMT+0800 (China Standard Time)

Thanks for reporting this. I agree it is a problem. Unfortunately, the cause of this behavior is complicated, and its history goes back further than I have been working on jep, but I will try to explain as best I can.

If you are looking for an easy fix, I recommend using a Boolean[][] instead of a primitive type. This introduces the overhead of object creation, but I would not expect that to be a problem in most cases, and it gives you more predictable behavior. Here is your code modified to use a Boolean[][]:

import jep

from java.lang import Boolean

# Create a 4x4 Boolean[][]
f_inner = jep.jarray(4, Boolean)
f = jep.jarray(4, f_inner)
for i in range(4):
    f[i] = jep.jarray(4, Boolean)

print(f"The size of f is {len(f)}-by-{len(f[0])}")

# Try to assign a value to f[0][0]
f[0][0] = bool(4)
print(f"f[0][0]={f[0][0]}")

Jep primitive jarrays take advantage of JNI array pinning. We ask the JVM to give us a direct pointer to the memory for the array so we can access and modify elements directly, without needing to call into the JVM for each access. This is the fastest way to use Java arrays in native code, which also makes it the fastest way we can expose them in Python.

Unfortunately, the JNI specification says that a JVM may give us a direct pointer to the data or may decide to copy the data instead. Because that choice belongs to the JVM, different garbage collection settings or alternative JVM implementations may behave differently. When the JVM copies the data, jep allows access and modifications to the copy, and at certain points in the code jep commits the array back into the JVM. For example, when a primitive array is passed as an argument to a Java method, the array is committed back to Java.
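
To make the copy-then-commit behavior concrete, here is a small pure-Python model. It does not use jep at all; `FakeJavaArray` and `CopyingView` are hypothetical stand-ins, and the sketch assumes that each lookup of a sub-array produces a new wrapper holding its own copy of the data, which is an assumption about the mechanism rather than a description of jep's actual internals.

```python
class FakeJavaArray:
    """Hypothetical stand-in for an array that lives inside the JVM."""

    def __init__(self, size):
        self.data = [0] * size


class CopyingView:
    """Hypothetical model of a pinned array for which the JVM returned a copy."""

    def __init__(self, backing):
        self._backing = backing
        self._copy = list(backing.data)  # snapshot taken when the view is created

    def __getitem__(self, index):
        return self._copy[index]

    def __setitem__(self, index, value):
        self._copy[index] = value  # only the local copy changes

    def commit(self):
        self._backing.data[:] = self._copy  # write the copy back to the "JVM"


row = FakeJavaArray(4)

# Writing through a temporary view: the copy is modified, then discarded.
CopyingView(row)[0] = 4
lost = CopyingView(row)[0]

# Holding on to one view and committing it makes the write visible.
view = CopyingView(row)
view[0] = 4
view.commit()
kept = CopyingView(row)[0]

print(lost, kept)  # prints: 0 4
```

In this model, the first write corresponds to `f[0][0] = 4`, where nothing ever commits the temporary sub-array, so the value read back is unchanged.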

In your example, you aren't doing any operations with the primitive array except assignment, so there is no point where Jep specifically commits the changes back to the JVM, which leads to your problem. Another technique to work around this problem is to use the commit() method on jarray to ensure the data is committed back to Java, although for this use case that is particularly ugly, as shown in the following example:

import jep

# Create a 4x4 Boolean[][]
f_inner = jep.jarray(4, 'b')
f = jep.jarray(4, f_inner)
for i in range(4):
    f[i] = jep.jarray(4, 'b')

print(f"The size of f is {len(f)}-by-{len(f[0])}")

# Try to assign a value to f[0][0]
t = f[0]
t[0] = 4
t.commit()
print(f"f[0][0]={f[0][0]}")
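
If the assign-then-commit pattern is needed in several places, it could be wrapped in a small helper. The sketch below is only an illustration: `set_nested` is a hypothetical name, not part of jep, and it relies solely on indexing plus the `commit()` method used in the example above. It has not been verified against every jarray type.

```python
def set_nested(array, indices, value):
    """Set array[i0][i1]...[ik] = value, then commit the innermost sub-array.

    Hypothetical helper, not part of jep. `indices` is a non-empty sequence.
    """
    *outer, last = indices
    inner = array
    for index in outer:
        inner = inner[index]  # keep a single reference to each sub-array
    inner[last] = value
    # Call commit() when the object offers it; plain Python lists do not.
    commit = getattr(inner, "commit", None)
    if commit is not None:
        commit()
```

With the array from the example above, the three lines `t = f[0]`, `t[0] = 4`, `t.commit()` would become `set_nested(f, (0, 0), 4)`.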

In the future, I would like to rewrite jarray so that this pinning behavior is not the default, or at least provide some mechanism to turn it off. Unfortunately, that is not a minor change, and I have concerns that some users with very large arrays may see slower performance without pinning, so I have been hesitant to modify the existing code.

Gabriel Selzer · Answer 2 · Tue Nov 08 2022 04:11:09 GMT+0800 (China Standard Time)

This is super insightful, thanks so much @bsteffensmeier!

My use case involves creating arrays of arbitrary dimensionality, and being able to modify them is just as important. For now, I'll probably case it:

If the array to be created is one-dimensional, use primitives.
Otherwise, use the boxed class.
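
A rough sketch of that case split follows. `make_array` and `_nested` are hypothetical helpers, not part of jep. The `jarray` argument is expected to be `jep.jarray`, and the nesting follows the same pattern as the examples above (create a template sub-array, create the outer array from it, then fill each slot). Whether that pattern behaves the same beyond two dimensions is an assumption.

```python
def make_array(shape, jarray, primitive_type, boxed_type):
    """Build an array of the given shape (hypothetical helper).

    jarray         -- array factory, expected to be jep.jarray
    primitive_type -- element type used for one-dimensional arrays
    boxed_type     -- element type used otherwise, e.g. java.lang.Boolean
    """
    if len(shape) == 0:
        raise ValueError("shape must have at least one dimension")
    if len(shape) == 1:
        return jarray(shape[0], primitive_type)
    return _nested(shape, jarray, boxed_type)


def _nested(shape, jarray, leaf_type):
    """Recursively build nested arrays whose innermost elements are leaf_type."""
    size, rest = shape[0], shape[1:]
    if not rest:
        return jarray(size, leaf_type)
    # A throwaway sub-array tells the factory the component type.
    template = _nested(rest, jarray, leaf_type)
    outer = jarray(size, template)
    for i in range(size):
        outer[i] = _nested(rest, jarray, leaf_type)  # fresh sub-array per slot
    return outer
```

Inside a jep interpreter, a call might look like `make_array((4, 4), jep.jarray, 'b', Boolean)`, reusing the type arguments from the examples above.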