uqfoundation / dill

serialize all of Python

Home Page: http://dill.rtfd.io

0.3.7 incorrectly pickles the class definition for module/class with the same name

TD22057 opened this issue

This is a different bug from #628 and #604. When a class is defined in a module of the same name and then imported into the package under that name (shadowing the submodule), dill pickles the full class definition instead of a reference to the class. Note that regular pickle has no problem with this case and behaves correctly.

foo/__init__.py:
    from .Bar import Bar

foo/Bar.py:
    class Bar:
        pass

then run

import foo
import dill

b1 = foo.Bar()
print( " in ID:", id( b1.__class__ ) )

s = dill.dumps( b1 )
b2 = dill.loads( s )

print( "out ID:", id( b2.__class__ ) )

and you will get (the exact IDs will vary from run to run):

 in ID: 27881968
out ID: 29513904

If I add byref=True to the dumps call, then I get the correct class ID but also get a warning:

 in ID: 25867200
.../lib/python3.11/site-packages/dill/_dill.py:412: PicklingWarning: Cannot locate reference to <class 'foo.Bar.Bar'>.
  StockPickler.save(self, obj, save_persistent_id)
out ID: 25867200

I think the issue is in _locate_function() in _dill.py, which isn't finding the definition properly. I'm guessing this is because it tries to do an import instead of looking in sys.modules. If that function checked sys.modules for the module name, it would find the module and could then call getattr() on it to retrieve the class definition.
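To illustrate the shadowing described above, here is a minimal sketch using only the standard library. It assumes the lookup resembles the `getattr(__import__(module, None, None, [obj]), obj)` pattern that appears in `_import_module` in this thread; whether this is the exact path dill takes is not confirmed here. The package name `shadowdemo` is made up for the example.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package with the same layout as in this issue.
# "shadowdemo" is a hypothetical name chosen to avoid clashing with real packages.
root = Path(tempfile.mkdtemp())
pkg = root / "shadowdemo"
pkg.mkdir()
(pkg / "__init__.py").write_text("from .Bar import Bar\n")
(pkg / "Bar.py").write_text("class Bar:\n    pass\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

import shadowdemo

# The class records the submodule it was defined in: 'shadowdemo.Bar'.
module_name = shadowdemo.Bar.__module__
parent, _, leaf = module_name.rpartition(".")

# Attribute lookup on the package returns the class, because the package's
# `Bar` attribute was rebound from the submodule to the class.
via_getattr = getattr(__import__(parent, None, None, [leaf]), leaf)

# Looking the dotted name up in sys.modules returns the real submodule.
via_sys_modules = sys.modules[module_name]

print(type(via_getattr).__name__)      # type
print(type(via_sys_modules).__name__)  # module

# From the real submodule, getattr() finds the original class object.
found = getattr(via_sys_modules, "Bar")
print(found is shadowdemo.Bar)         # True
```

The attribute-based lookup hands back the class where a module was expected, so a later `getattr(..., "Bar")` on that result has nothing to find, while the `sys.modules` lookup resolves cleanly.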

FYI - a possible fix for you to evaluate is the following. It appears to fix the issue on my end and I believe it passes the tests. But I don't understand the rationale for why it was done this way in the first place, so it would be good for you to look at it. If you'd rather have this as a PR, let me know and I'll do that.

def _import_module(import_name, safe=False):
    # relies on `sys` being imported at module level
    try:
        # attempt to fix git bug #634: prefer a module that is already loaded
        m = sys.modules.get(import_name)
        if m:
            return m
        #if import_name.startswith('__runtime__.'):
        #    return sys.modules[import_name]
        elif '.' in import_name:
            # split 'pkg.sub.name' into parent 'pkg.sub' and attribute 'name'
            items = import_name.split('.')
            module = '.'.join(items[:-1])
            obj = items[-1]
        else:
            return __import__(import_name)
        # import the parent and fetch the last component as an attribute
        return getattr(__import__(module, None, None, [obj]), obj)
    except (ImportError, AttributeError, KeyError):
        if safe:
            return None
        raise

Thanks for reporting, and the follow-up on this. I'll have a look... and yes, I'll make a PR if you don't.

Slight modification:

foo/__init__.py:
    from .bar import bar, zap

foo/bar.py:
    class bar:
        pass

    class zap:
        pass

With the above, I don't think there's a difference for foo.bar.zap versus foo.bar.bar. So, it would seem the module/class having the same name doesn't matter. Also, I think this is the expected behavior -- dill makes a copy of the class instead of referencing it like pickle would. The warning is a bit annoying, but innocuous.

Can you tell me what you think the behavior should be?

I'm not at a computer where I can check this. But I assume "with the above" means my hack/fix. That could be true - I didn't test every combination. The fix I proposed may not be correct and/or complete.

However, that doesn't mean this isn't a bug, and IMO it's a serious one. The problem is that I have code that unpickles something and then says:

obj = dill.loads( data )
if isinstance( obj, foo.bar ):
    ...  # do something

And the isinstance check will fail because dill made a clone of the class. So the unpickled object is no longer the same type as the object that was pickled. BTW regular pickle handles this case just fine - I don't know what it's doing but it correctly finds the package and unpickles to the original type.
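For context on what regular pickle does here, the sketch below shows that it stores an importable class by reference: the payload carries the module name and class name, and loading imports that module and fetches the attribute from the entry in `sys.modules`. The package name `refdemo` is invented for the example; only the standard library is used.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Build a throwaway package with the same layout as in this issue.
# "refdemo" is a hypothetical name chosen to avoid clashing with real packages.
root = Path(tempfile.mkdtemp())
pkg = root / "refdemo"
pkg.mkdir()
(pkg / "__init__.py").write_text("from .Bar import Bar\n")
(pkg / "Bar.py").write_text("class Bar:\n    pass\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

import refdemo

obj = refdemo.Bar()
data = pickle.dumps(obj)
clone = pickle.loads(data)

# The round-tripped object uses the very same class object, so isinstance holds.
same_class = clone.__class__ is refdemo.Bar
print(same_class)                       # True
print(isinstance(clone, refdemo.Bar))   # True

# The payload contains the dotted module name rather than the class body.
has_module_name = b"refdemo.Bar" in data
print(has_module_name)                  # True
```

Because only names are stored, the class must be importable under that module path at load time, which is the trade-off against copying the class definition into the pickle.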

Currently, the expected behavior is that dill makes a copy of the class, and pickles a reference only when by-reference pickling is requested. I initially thought you were reporting the warning as the bug, but if the report is that dill is not pickling by reference... then this is the expected behavior.

This enables dill to deal with all sorts of different cases, such as the class being dynamically modified.

Sorry - I didn't realize that was the desired behavior. I doubt it matters at this point, but FYI in every use case we have (saving data, sending functions and data to parallel engines) it's really bad to make a copy of the class unless there is no other way to pickle it. I guess we'll try to figure out how to suppress that warning and wrap the library to set byref=True for all of our calls.

No worries. Also, those are the most common dill use cases, I believe. If you want classes to be pickled by reference in all cases, you can adjust the global settings: dill.settings['byref'] = True. Then you won't need to pass the keyword in each dump call.
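A minimal sketch of that setup, combined with silencing the warning mentioned earlier in the thread. It assumes `dill` is installed and that the warning category is exposed as `dill.PicklingWarning` (the name shown in the traceback above); treating the warning as ignorable is a choice for the application, not a recommendation from the maintainers.

```python
import warnings

import dill

# Pickle classes by reference for every subsequent dump/dumps call,
# so byref=True no longer has to be passed each time.
dill.settings['byref'] = True

# Assumption: the "Cannot locate reference" warning is acceptable noise here.
warnings.filterwarnings('ignore', category=dill.PicklingWarning)

# Ordinary round trip; the global setting applies without extra keywords.
payload = dill.dumps({'answer': 42})
restored = dill.loads(payload)
print(restored)
```

Setting this once at program start-up avoids wrapping every call site, at the cost of changing behavior for all code in the process that uses dill.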