0.3.7 incorrectly pickles the class definition for module/class with the same name
TD22057 opened this issue · comments
This is a different bug than #628 and #604. When a class is defined in a file with the same name but imported into the package under that name (hiding the file name), then it's pickled with the full code block instead of just referencing the class name. Note that regular pickle has no problem with this case and performs correctly.
foo/__init__.py
from .Bar import Bar
foo/Bar.py:
class Bar:
pass
then run
import foo
import dill
b1 = foo.Bar()
print( " in ID:", id( b1.__class__ ) )
s = dill.dumps( b1 )
b2 = dill.loads( s )
print( "out ID:", id( b2.__class__ ) )
and you will get (ID's will be different):
in ID: 27881968
out ID: 29513904
If I add byref=True
to the dump call, then I get the correct class ID but also get a warning:
in ID: 25867200
.../lib/python3.11/site-packages/dill/_dill.py:412: PicklingWarning: Cannot locate reference to <class 'foo.Bar.Bar'>.
StockPickler.save(self, obj, save_persistent_id)
out ID: 25867200
I think the issue is in _dill.py in the _locate_function() which isn't finding the definition properly. I'm guess this is because it tries to do an import instead of looking in sys.modules. If that function checked sys.modules for the module name, it would find the module and then could call getattr() on that to see the class definition.
FYI - possible fix for you to evaluate is the following. It appears to fix the issue on my end and I believe it passes tests. But I don't understand the rational for why it was done this way in the first place so it would be good for you to look at. If you'd rather have this as a PR, let me know and I'll do that.
def _import_module(import_name, safe=False):
try:
# attempt to fix git bug #634.
m = sys.modules.get(import_name)
if m:
return m
#if import_name.startswith('__runtime__.'):
# return sys.modules[import_name]
elif '.' in import_name:
items = import_name.split('.')
module = '.'.join(items[:-1])
obj = items[-1]
else:
return __import__(import_name)
return getattr(__import__(module, None, None, [obj]), obj)
except (ImportError, AttributeError, KeyError):
if safe:
return None
raise
Thanks for reporting, and the follow-up on this. I'll have a look... and yes, I'll make a PR if you don't.
Slight modification:
foo/__init__.py
from .bar import bar, zap
foo/bar.py:
class bar:
pass
class zap:
pass
With the above, I don't think there's a difference for foo.bar.zap
versus foo.bar.bar
. So, it would seem the module/class
having the same name doesn't matter. Also, I think this is the expected behavior -- dill
makes a copy of the class instead of referencing it like pickle
would. The warning is a bit annoying, but innocuous.
Can you tell me what you think the behavior should be?
I'm not at a computer where I can check this. But I assume "with the above" means my hack/fix. That could be true - I didn't test every combination. The fix I proposed may not be correct and/or complete.
However that doesn't mean this isn't a bug and IMO it's a serious bug. The problem is that I have code that unpickles something and then says:
obj = dill.loads( data )
if isinstance( obj, foo.bar ):
... do something
And the isinstance check will fail because dill made a clone of the class. So the unpickled object is no longer the same type as the object that was pickled. BTW regular pickle handles this case just fine - I don't know what it's doing but it correctly finds the package and unpickles to the original type.
Currently, the expected behavior is that dill
makes a copy of the class, and doesn't pickle a reference unless it's requested that the class is pickled by reference. I initially thought that you were reporting that the bug was the warning being thrown, but if it's that dill
is not pickling by reference... then this is the expected behavior.
This enables dill
to deal with all sorts of different cases, such as the class being dynamically modified.
Sorry - I didn't realize that was the desired behavior. I doubt it matters at this point but FYI in every use case we have (saving data, sending functions and data to parallel engines) it's really bad to make a copy of the class unless there is no other way to pickle it. I guess we'll try to figure out how to suppress that error and wrap the library to set byRef=True for all of our calls.
No worries. Also, those are the most common dill
use cases, I believe. If you want classes to be pickled by reference in all cases, you can adjust the global settings. dill.settings['byref'] = True
. -- then you won't need to use the keyword in the dump.