mschubert / ebits

R bioinformatics toolkit incubator

`sort_by` duplicating available functionality

mschubert opened this issue · comments

base/vector: sort_by.data.frame should already be covered by dplyr::arrange, and sort_by.default by base::sort.

Am I missing something here or is this duplicating functionality?

You can’t pass a key to base::sort. order handles this universally, which is why it’s used in both implementations. At any rate, the implementations are identical except for the indexing operation (because data.frames need to be indexed differently from vectors).

So for a single key sort_by.default(x, key) should be equal to x[order(key)].
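To make the equivalence concrete, here is a minimal sketch of what such a generic could look like. It is a hypothetical re-implementation for illustration, not the actual ebits source; it assumes that any number of keys is forwarded to order and that the two methods differ only in how they index.

```r
# Hypothetical re-implementation for illustration; not the ebits source.
sort_by = function (x, ...) UseMethod('sort_by')

# Vectors: index with a single subscript.
sort_by.default = function (x, ...) x[order(...)]

# Data frames: index rows and keep all columns.
sort_by.data.frame = function (x, ...) x[order(...), , drop = FALSE]

x = c('b', 'c', 'a')
key = c(2, 3, 1)
sort_by(x, key)    # same elements as x[order(key)]

# Two keys: ties in the first key are broken by the second.
sort_by(c('p', 'q', 'r'), c(1, 1, 0), c(2, 1, 9))
```

Because the keys are passed straight through to order, a vector can be sorted by several keys at once, which x[order(key)] with a single key does not express.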

The advantage is that using sort_by on a vector you can have multiple keys. Do you have any use case for that?

In addition, why do you not use gtools::mixedorder to get natural sort for characters?

Well, the other difference is that the x in your example can be a complex expression. I definitely have use-cases for that, that’s why I wrote the function in the first place.

Regarding gtools, I tend to avoid these chaotic cram-every-unrelated-thing-into-one-utility packages (one of my main motivations for writing modules!) – which is why I didn’t know about mixedorder.

Talking API, I would implement this via

sort_by(x, natural(key))

with

natural = function (x) structure(x, class = 'natural')

And then have an S3 method order.natural to return the natural ordering of its arguments (mixedorder is known as “natural ordering” in other programming languages).

This could be implemented in terms of gtools::naturalorder. What speaks against this is that we would incur yet another (huge) dependency for a single function.
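A dependency-free variant of this API could be sketched as follows. Two assumptions to flag: base::order is not an S3 generic, so this sketch hooks into xtfrm (which order calls for classed keys) instead of defining order.natural; and the ordering is a deliberately simplified one that only understands keys of the form text followed by an integer, so it is not a substitute for gtools::mixedorder.

```r
# Tag a key vector so that ordering can treat it specially (as proposed above).
natural = function (x) structure(x, class = 'natural')

# Assumption: simplified natural ordering for keys like 'chr10'
# (text followed by an integer). This is NOT gtools::mixedorder.
xtfrm.natural = function (x) {
    x = unclass(x)
    u = unique(x)
    prefix = sub('[0-9]+$', '', u)      # text before the trailing digits
    digits = sub('^.*[^0-9]', '', u)    # trailing digits, '' if there are none
    number = rep(-Inf, length(u))       # keys without digits sort first
    has_digits = nzchar(digits)
    number[has_digits] = as.numeric(digits[has_digits])
    ranks = integer(length(u))
    ranks[order(prefix, number)] = seq_along(u)
    ranks[match(x, u)]                  # equal keys get equal ranks
}

# Register explicitly so dispatch also works outside the global environment.
registerS3method('xtfrm', 'natural', xtfrm.natural)

key = c('chr10', 'chr2', 'chr1')
order(key)             # plain string ordering
order(natural(key))    # trailing number compared numerically
```

With this in place, any function built on order, such as x[order(natural(key))], picks up the natural ordering without further changes, and the caller stays explicit about when it is used.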

Well, the other difference is that the x in your example can be a complex expression. I definitely have use-cases for that, that’s why I wrote the function in the first place.

x is the data variable, how can this be a complex expression? Did you mean key? And if so, please provide an example.

Talking API, I would implement this via [...] And then have an S3 method order.natural [...]

The only reason not to use natural sort as (only) default is execution speed, which I would argue should not be a primary concern in this case (one would use dplyr::arrange for huge data.frames anyway). Counter-examples welcome.

This could be implemented in terms of gtools::naturalorder. What speaks against this is that we would incur yet another (huge) dependency for a single function.

The cost of having some additional packages as dependencies is negligible if they are on CRAN and handled using R's package manager. The goal in my opinion should not be to implement every functionality properly ourselves but to provide functionality without incurring too much work (i.e., use packages that already implement what we are trying to do, and reimplement on an as-needed basis (hint: there will not be much need)).

x is the data variable, how can this be a complex expression? Did you mean key? And if so, please provide an example.

result = sort_by(some_function(args) %>% filter(condition) %>% select(some, columns),
                 key)

Of course you could also do

x = some_function(args) %>% filter(condition) %>% select(some, columns)
x = x[order(key), ]

But that defeats the whole point of R being a functional language.

That said, I just realise that I introduced a regression into sort_by which means that you cannot use a column of the data frame as the key, which renders this much less useful. And, as you’ve said, there’s arrange.
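For reference, one way a data frame sort could accept both a column and an external vector as key is to evaluate the key expressions inside the data frame first and fall back to the caller's environment. This is only a sketch under that assumption; sort_by_column is a hypothetical name and not how ebits implements it.

```r
# Hypothetical helper, not the ebits implementation.
sort_by_column = function (x, ...) {
    # Look keys up as columns of x first, then in the calling environment.
    keys = eval(substitute(list(...)), x, parent.frame())
    x[do.call(order, keys), , drop = FALSE]
}

df = data.frame(name = c('a', 'b', 'c'), score = c(2, 3, 1),
                stringsAsFactors = FALSE)
external = c(3, 2, 1)

sort_by_column(df, score)       # key is a column of df
sort_by_column(df, external)    # key is a vector outside df
```

The trade-off is the usual one with non-standard evaluation: a column silently shadows an external variable of the same name.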

Regarding natural ordering: I would never use this by default. In fact, I never do, and it’s not default anywhere that I know of. But this may indeed be a performance concern. I still think that it pays to be explicit here, especially since natural ordering uses heuristics to determine how to sort, and this can yield inconsistent results (in particular, the relative sort order of two elements can change when new elements are added to the data).

And about the cost of packages: it’s always a cost to have dependencies. Not because they have to be installed but because they have to be managed, kept up to date, version conflicts have to be avoided, etc. This is somewhat less of an issue with CRAN packages since CRAN packages rarely update.

Off topic, but I'd go for this then:

result = some_function(args) %>%
    filter(condition) %>%
    select(some, columns) %>%
    sort_by(key)

Still, can you show me some actual code where this is useful? (emphasis on the external key)

Regarding natural ordering: [...]

Fair point.

And about the cost of packages: it’s always a cost to have dependencies. Not because they have to be installed but because they have to be managed, kept up to date [...]

With a dependency that is maintained, this is work done by someone else. True, the interface can change, but testing and bugfixes will happen regardless - and the best thing: you don't even have to do anything for it. Given the choice between a somewhat widely used and somewhat stable library and writing from scratch, the former should be preferred for practical reasons (e.g. data.table's fread, even if it is horrible code; gtools' mixedsort, even if we could implement it from scratch).

This should have been closed a long time ago. Good discussion, but I was wrong about the premise of the issue: you do want to sort by external keys.