SethMMorton / natsort

Simple yet flexible natural sorting in Python.

Home Page:https://pypi.org/project/natsort/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Unexpected TypeError when using natsorted on a custom class

evandrocoan opened this issue · comments

Minimum, Complete, Verifiable Example

Using my custom class with python builtin sorted() works 100%. But with natsorted, Python throws this insane type error. This happens when one of my classes on the list has an iterator and the other don't.

from natsort import natsorted

class Production(object):
    def __init__(self, arg):
        self.arg = arg

    def __str__(self):
        return self.arg

    def __lt__(self, other):
        return str( self ) < str( other )

    def __hash__(self):
        return hash( str( self ) )

    def __iter__(self):
        self.index = True
        return self

    def __next__(self):

        if self.index:
            self.index = False
            return self.arg

        raise StopIteration


class Terminal(object):
    def __init__(self, arg):
        self.arg = arg

    def __str__(self):
        return self.arg

    def __lt__(self, other):
        return str( self ) < str( other )

    def __hash__(self):
        return hash( str( self ) )


terminal = Terminal('B')
production = Production('A')

trouble_set = [terminal, production]

print( 'Python sorted:' )
print( sorted( trouble_set ) )
print(  )
print( 'natsorted sorted:' )
print( natsorted( trouble_set ) )

Error message, Traceback

Python sorted:
[<__main__.Production object at 0x00...>, <__main__.Terminal object at 0x00...>]

natsorted sorted:
Traceback (most recent call last):
  File "D:\User\Downloads\test.py", line 54, in <module>
    print( natsorted( trouble_set ) )
  File "F:\Python\lib\site-packages\natsort\natsort.py", line 299, in natsorted
    return sorted(seq, reverse=reverse, key=natsort_key)
TypeError: '<' not supported between instances of 'tuple' and 'str'

This error also comes up if you replace the iterator by the __getitem__:

from natsort import natsorted

class Production(object):
    def __init__(self, arg):
        self.arg = arg

    def __str__(self):
        return self.arg

    def __lt__(self, other):
        return str( self ) < str( other )

    def __hash__(self):
        return hash( str( self ) )

    def __getitem__(self, key):
        return self.arg[key]


class Terminal(object):
    def __init__(self, arg):
        self.arg = arg

    def __str__(self):
        return self.arg

    def __lt__(self, other):
        return str( self ) < str( other )

    def __hash__(self):
        return hash( str( self ) )


terminal = Terminal('B')
production = Production('A')

trouble_set = [terminal, production]

print( 'Python sorted:' )
print( sorted( trouble_set ) )
print(  )
print( 'natsorted sorted:' )
print( natsorted( trouble_set ) )


natsort works by first transforming the input into a tuple with alternating str then numbers (check out http://natsort.readthedocs.io/en/master/howitworks.html for details). It does not make many assumptions about types when doing this, but does try to make sure the output is comparable no matter the input type in order to avoid TypeErrors. For built-in types or classes that inherit from something defined in collections.abc this should work well. However, if trying to compare classes that do not transform uniformly or do not follow a standard interface you might run into trouble. For this reason I strongly disagree with the characterization that the error is insane - “unexpected to you” != “insane”.

You can check out the function that initiates the transformation to get an idea where the error is coming from. You can also inspect what natsort is doing with the following code (untested... I am on the road right now).

>>> import natsort
>>> ns_key = natsort.natsort_ keygen()
>>> print(repr(ns_key(terminal)))
>>> print(repr(ns_key(production)))

I suspect that because one class defines an iterator interface and the other does not they are going down different branches of code, resulting in non-comparable types.

Having said all this, the approach being used to sort the classes in the above implementation is not the preferred method by the python community - one should instead use a key function (all the modern examples in https://wiki.python.org/moin/HowTo/Sorting show using a key function rather than relying on __lt__). It seems these classes should be sorted by the string representation, so using

>>> sorted(trouble_set, key=str)

is what should be used. In the implementation above, for each pair of inputs str will be called twice, which at worst case will be N^2 conversions to a string; conversely, the key function will only call str once per input for N conversions.

If the code is changed to use a key function, then natsort will start working. Because natsort is transforming the input before passing it to sorted, the calls to __lt__ never occur. However, if one first converts to string first with a key function it all works as expected.

TL;DR

This will fix it

>>> natsort.natsorted(trouble_set, key=str)

For this reason I strongly disagree with the characterization that the error is insane - “unexpected to you” != “insane”.

Sorry, I did not mean to be rude. In my understanding, I only had to override the __lt__() and __eq__() operator for any sorting method to work. In my math I had added unexpected to me + I have no idea where is it coming from = insane

For the built-in Python sorted() method, that was enough. But, when I migrated my system to use natsorted(), I spent several hours trying to figure out what was going on, until I narrowed it down to the fact of one of my classes defining their own iterator __iter__(), __next__() or just one __getitem__().

Now, I understand why natsort ignores my overriding of __lt__() method and uses his own less than operator. But would nice if natsort had a different behavior than this unexpected error when implemented a custom iterator for a class and directly used natsorted() without a key. Because the library user (or at least me) just had no clue about what is going on.

There are a few things to address.


I want to be very clear that natsort does not implement it's own less than operator. Rather, it takes the input and converts it into a tuple of strings and numbers. Python natively knows how to compare these.


To sort naturally without using a key function, the correct approach would be to insert the natsort key directly into your __lt__ implementation.

import natsort

natsort_key = natsort.natsort_keygen()

class Myclass(object):
    def __init__(self, arg):
        self.arg = arg

    def __eq__(self, other):
        return str(self) == str(other)

    def __lt__(self, other):
        return natsort_key(str(self)) < natsort_key(str(other))

If this approach is taken for both classes, then one can use sorted and it will sort naturally because the comparison operator is implemented in terms of a natural comparison - the transformation of the data happens inside __lt__ and calling natsorted is not needed. Note that I don't recommend only implementing __lt__ for a class - take a look at total_ordering.


But would nice if natsort had a different behavior than this unexpected error when implemented a custom iterator for a class and directly used natsorted() without a key.

Please describe what behavior you would want instead. Keep in mind the following when coming up with a behavior:

  • It is not the natsort library that is throwing the error but rather is the sorted function itself.
  • Comparing an iterable and a non-iterable is typically not possible in python (3) -- it is only because these are custom classes that this is allowed.
  • I have taken great care to make sure that natsort introduces as little overhead as possible, so any solutions should not introduce significant extra overhead.

While I empathize with having to spend an hour trying to debug the TypeError, I had spent over a week (nights and weekends) writing the How it Works page in the docs in the hopes that users of the library would read it and have an understanding of how natsort works so that they could debug issues that arise or learn how to hack the algorithm. If this page is not enough to have explained that the TypeError arose from the transformation of data, please help by suggesting how that page can be improved or how the README could be improved so that what was going on and how to solve the issue would be more clear.

First, thanks for your dedication into writing natsort and all its documentation. It is really hard to find someone so dedicated maintaining (on their free time) good open source software.

I had spent over a week (nights and weekends) writing the How it Works page in the docs in the hopes that users of the library would read it and have an understanding of how natsort works so that they could debug issues that arise or learn how to hack the algorithm.

When I first got the TypeError, I was interested into figuring out why my code was breaking natsort, because:

  1. Either I wrote something wrong and I needed to fix so my code can be robust.
  2. Or because there was a bug in the natsort library, and I would like to report the bug.

The first thing which bothered me was the fact that the builtin python sorted() function was working fine. Next to it, I tried to reproduce the bug outside my system, but natsort worked fine with everything I throwed at it.

help by suggesting how that page can be improved or how the README could be improved so that what was going on and how to solve the issue would be more clear.

The README page could always use show the use of key when calling natsort on this examples, even when they are already native strings. With that, I would figured out sooner that with natsort I should usually set what is the comparison attribute key.

After understanding how natsort works, I think there is nothing much which could be done:

  1. Either mention this special behavior on the README when comparing custom objects,
  2. Or perhaps, if no key is provided by the user, then set by default the key as str.
  3. Or always use tuples when natsort generates its keys. Therefore, there is no way this error can happen again as everything would always being tuples.
    I am not sure about the performance implication of this or if such thing would be possible. Some bench marking should be done, and see if this changes the library performance, or if it is irrelevant.
    If this impact the performance, then just documenting this behavior on README should be enough.

I will update the documentation to make aspects more clear.