Unexpected TypeError when using natsorted on a custom class
evandrocoan opened this issue · comments
Minimum, Complete, Verifiable Example
Using my custom class with Python's built-in sorted() works 100%. But with natsorted, Python throws this insane type error. This happens when one of my classes in the list has an iterator and the other doesn't.
from natsort import natsorted

class Production(object):
    def __init__(self, arg):
        self.arg = arg
    def __str__(self):
        return self.arg
    def __lt__(self, other):
        return str( self ) < str( other )
    def __hash__(self):
        return hash( str( self ) )
    def __iter__(self):
        self.index = True
        return self
    def __next__(self):
        if self.index:
            self.index = False
            return self.arg
        raise StopIteration

class Terminal(object):
    def __init__(self, arg):
        self.arg = arg
    def __str__(self):
        return self.arg
    def __lt__(self, other):
        return str( self ) < str( other )
    def __hash__(self):
        return hash( str( self ) )

terminal = Terminal('B')
production = Production('A')
trouble_set = [terminal, production]
print( 'Python sorted:' )
print( sorted( trouble_set ) )
print( )
print( 'natsorted sorted:' )
print( natsorted( trouble_set ) )
Error message, Traceback
Python sorted:
[<__main__.Production object at 0x00...>, <__main__.Terminal object at 0x00...>]
natsorted sorted:
Traceback (most recent call last):
File "D:\User\Downloads\test.py", line 54, in <module>
print( natsorted( trouble_set ) )
File "F:\Python\lib\site-packages\natsort\natsort.py", line 299, in natsorted
return sorted(seq, reverse=reverse, key=natsort_key)
TypeError: '<' not supported between instances of 'tuple' and 'str'
This error also comes up if you replace the iterator with __getitem__:
from natsort import natsorted

class Production(object):
    def __init__(self, arg):
        self.arg = arg
    def __str__(self):
        return self.arg
    def __lt__(self, other):
        return str( self ) < str( other )
    def __hash__(self):
        return hash( str( self ) )
    def __getitem__(self, key):
        return self.arg[key]

class Terminal(object):
    def __init__(self, arg):
        self.arg = arg
    def __str__(self):
        return self.arg
    def __lt__(self, other):
        return str( self ) < str( other )
    def __hash__(self):
        return hash( str( self ) )

terminal = Terminal('B')
production = Production('A')
trouble_set = [terminal, production]
print( 'Python sorted:' )
print( sorted( trouble_set ) )
print( )
print( 'natsorted sorted:' )
print( natsorted( trouble_set ) )
natsort works by first transforming the input into a tuple of alternating strings and numbers (check out http://natsort.readthedocs.io/en/master/howitworks.html for details). It does not make many assumptions about types when doing this, but does try to make sure the output is comparable no matter the input type in order to avoid TypeErrors. For built-in types or classes that inherit from something defined in collections.abc this should work well. However, if trying to compare classes that do not transform uniformly or do not follow a standard interface, you might run into trouble. For this reason I strongly disagree with the characterization that the error is insane - “unexpected to you” != “insane”.
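To make the idea of "a tuple of alternating strings and numbers" concrete, here is a deliberately simplified sketch using only the standard library. It is not natsort's actual implementation (the real library covers many more cases, as the linked page describes), and the name simple_natural_key is hypothetical.

```python
import re

def simple_natural_key(text):
    """Split text into alternating non-digit strings and integers."""
    # re.split with a capturing group keeps the digit runs in the result,
    # so strings sit at even positions and digit runs at odd positions.
    parts = re.split(r"(\d+)", text)
    return tuple(int(p) if i % 2 else p for i, p in enumerate(parts))

names = ["file10.txt", "file2.txt", "file1.txt"]
print(sorted(names))                          # plain lexicographic order
print(sorted(names, key=simple_natural_key))  # digit runs compared as ints
```

Because every key produced this way has the same shape (string, number, string, ...), Python can compare the tuples element by element without further help.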
You can check out the function that initiates the transformation to get an idea where the error is coming from. You can also inspect what natsort is doing with the following code (untested... I am on the road right now).
>>> import natsort
>>> ns_key = natsort.natsort_keygen()
>>> print(repr(ns_key(terminal)))
>>> print(repr(ns_key(production)))
I suspect that, because one class defines an iterator interface and the other does not, they are going down different branches of code, resulting in non-comparable types.
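This kind of failure can be reproduced without natsort. The sketch below uses a hypothetical toy key function, toy_key, that returns a tuple for iterable inputs and a plain string otherwise. It only mimics the idea of two different code branches and is not natsort's real logic; the class names are made up for illustration.

```python
class WithIter:
    def __init__(self, arg):
        self.arg = arg
    def __lt__(self, other):
        return str(self.arg) < str(other.arg)
    def __iter__(self):
        return iter([self.arg])

class WithoutIter:
    def __init__(self, arg):
        self.arg = arg
    def __lt__(self, other):
        return str(self.arg) < str(other.arg)

def toy_key(obj):
    """Hypothetical key: tuple for iterables, string for everything else."""
    try:
        return tuple(obj)
    except TypeError:
        return str(obj.arg)

items = [WithoutIter("B"), WithIter("A")]

# Without a key, sorted() calls the classes' own __lt__ and succeeds.
plain = sorted(items)

# With the toy key, sorted() compares the keys instead of the objects,
# so it ends up comparing a tuple with a str and fails.
try:
    sorted(items, key=toy_key)
    error = None
except TypeError as exc:
    error = exc
print(error)
```

The point of the sketch is that once a key function is involved, __lt__ on the original objects is never consulted; only the keys are compared.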
Having said all this, the approach being used to sort the classes in the above implementation is not the method preferred by the Python community - one should instead use a key function (all the modern examples in https://wiki.python.org/moin/HowTo/Sorting show using a key function rather than relying on __lt__). It seems these classes should be sorted by their string representation, so using
>>> sorted(trouble_set, key=str)
is what should be used. In the implementation above, str will be called twice for each pair of inputs compared, which at worst case will be N^2 conversions to a string; conversely, the key function will only call str once per input, for N conversions.
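The difference in the number of str conversions can be measured directly. The following is a small sketch with a hypothetical Counted class that counts how often __str__ runs. The count for the __lt__ case depends on the input order and on Python's sort implementation, so no specific figure is claimed for it.

```python
class Counted:
    str_calls = 0  # class-wide counter of __str__ invocations

    def __init__(self, arg):
        self.arg = arg

    def __str__(self):
        Counted.str_calls += 1
        return self.arg

    def __lt__(self, other):
        # Two str() conversions for every single comparison.
        return str(self) < str(other)

items = [Counted(c) for c in "dhbfaegc"]

Counted.str_calls = 0
sorted(items)               # comparison-based: relies on __lt__
lt_calls = Counted.str_calls

Counted.str_calls = 0
sorted(items, key=str)      # key-based: str() is applied once per element
key_calls = Counted.str_calls

print("via __lt__:", lt_calls, "via key:", key_calls)
```

With the key function, the count always equals the number of elements; with __lt__, it is twice the number of comparisons the sort happens to perform.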
If the code is changed to use a key function, then natsort will start working. Because natsort is transforming the input before passing it to sorted, the calls to __lt__ never occur. However, if one first converts to a string with a key function, it all works as expected.
TL;DR
This will fix it
>>> natsort.natsorted(trouble_set, key=str)
> For this reason I strongly disagree with the characterization that the error is insane - “unexpected to you” != “insane”.
Sorry, I did not mean to be rude. In my understanding, I only had to override the __lt__() and __eq__() operators for any sorting method to work. In my math, I had added: unexpected to me + I have no idea where it is coming from = insane.
For the built-in Python sorted() function, that was enough. But when I migrated my system to use natsorted(), I spent several hours trying to figure out what was going on, until I narrowed it down to the fact that one of my classes defined its own iterator (__iter__() and __next__()) or just __getitem__().
Now I understand why natsort ignores my overriding of the __lt__() method and uses its own less-than operator. But it would be nice if natsort had a different behavior than this unexpected error when a custom iterator is implemented for a class and natsorted() is used directly without a key, because the library user (or at least me) just had no clue about what was going on.
There are a few things to address.
I want to be very clear that natsort does not implement its own less-than operator. Rather, it takes the input and converts it into a tuple of strings and numbers. Python natively knows how to compare these.
To sort naturally without using a key function, the correct approach would be to insert the natsort key directly into your __lt__ implementation.
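import natsort

# Build the natural-sort key function once and reuse it in every comparison.
natsort_key = natsort.natsort_keygen()

class Myclass(object):
    def __init__(self, arg):
        self.arg = arg
    def __str__(self):
        # Needed so that str(self) yields the wrapped value, not the default repr.
        return self.arg
    def __eq__(self, other):
        return str(self) == str(other)
    def __lt__(self, other):
        return natsort_key(str(self)) < natsort_key(str(other))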
If this approach is taken for both classes, then one can use sorted and it will sort naturally because the comparison operator is implemented in terms of a natural comparison - the transformation of the data happens inside __lt__ and calling natsorted is not needed. Note that I don't recommend only implementing __lt__ for a class - take a look at functools.total_ordering.
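As a sketch of that last recommendation, the example below combines functools.total_ordering with a key-based __lt__. To stay self-contained it uses a simplified, hypothetical simple_key helper as a stand-in for natsort.natsort_keygen(); the Item class is likewise made up for illustration.

```python
import re
from functools import total_ordering

def simple_key(text):
    """Hypothetical stand-in for a natural-sort key (digit runs become ints)."""
    parts = re.split(r"(\d+)", text)
    return tuple(int(p) if i % 2 else p for i, p in enumerate(parts))

@total_ordering
class Item:
    def __init__(self, arg):
        self.arg = arg

    def __str__(self):
        return self.arg

    def __repr__(self):
        return f"Item({self.arg!r})"

    def __eq__(self, other):
        return str(self) == str(other)

    def __hash__(self):
        # Defining __eq__ disables the inherited hash, so provide one.
        return hash(str(self))

    def __lt__(self, other):
        return simple_key(str(self)) < simple_key(str(other))

items = [Item("x10"), Item("x2"), Item("x1")]
print(sorted(items))
# total_ordering derives <=, > and >= from __eq__ and __lt__.
print(Item("x2") <= Item("x10"), Item("x10") > Item("x2"))
```

With only __eq__ and __lt__ written by hand, the decorator fills in the remaining rich comparisons, so the class behaves consistently under every ordering operator rather than only under <.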
> But it would be nice if natsort had a different behavior than this unexpected error when a custom iterator is implemented for a class and natsorted() is used directly without a key.
Please describe what behavior you would want instead. Keep in mind the following when coming up with a behavior:
- It is not the natsort library that is throwing the error but rather the sorted function itself.
- Comparing an iterable and a non-iterable is typically not possible in Python (3) -- it is only because these are custom classes that this is allowed.
- I have taken great care to make sure that natsort introduces as little overhead as possible, so any solutions should not introduce significant extra overhead.
While I empathize with having to spend an hour trying to debug the TypeError, I had spent over a week (nights and weekends) writing the How it Works page in the docs in the hopes that users of the library would read it and have an understanding of how natsort works so that they could debug issues that arise or learn how to hack the algorithm. If this page is not enough to have explained that the TypeError arose from the transformation of data, please help by suggesting how that page can be improved or how the README could be improved so that what was going on and how to solve the issue would be more clear.
First, thanks for your dedication in writing natsort and all its documentation. It is really hard to find someone so dedicated to maintaining (in their free time) good open source software.
> I had spent over a week (nights and weekends) writing the How it Works page in the docs in the hopes that users of the library would read it and have an understanding of how natsort works so that they could debug issues that arise or learn how to hack the algorithm.
When I first got the TypeError, I was interested in figuring out why my code was breaking natsort, because:
- Either I wrote something wrong and I needed to fix it so my code could be robust.
- Or there was a bug in the natsort library, and I would like to report the bug.
The first thing that bothered me was the fact that the built-in Python sorted() function was working fine. Next, I tried to reproduce the bug outside my system, but natsort worked fine with everything I threw at it.
> help by suggesting how that page can be improved or how the README could be improved so that what was going on and how to solve the issue would be more clear.
The README page could always show the use of key when calling natsort in its examples, even when the inputs are already native strings. With that, I would have figured out sooner that with natsort I should usually specify the comparison key.
After understanding how natsort works, I think there is not much that could be done:
- Either mention this special behavior in the README when comparing custom objects,
- Or perhaps, if no key is provided by the user, then set the key to str by default.
- Or always use tuples when natsort generates its keys. That way, this error could not happen again, as everything would always be tuples.
I am not sure about the performance implications of this or whether such a thing would be possible. Some benchmarking should be done to see whether this changes the library's performance or whether it is irrelevant.
If this impacts performance, then just documenting this behavior in the README should be enough.
I will update the documentation to make aspects more clear.