cbouy / mols2grid

Interactive molecule viewer for 2D structures

Home Page:https://mols2grid.readthedocs.io

Using transform on floats breaks sorting

slochower opened this issue · comments

If I use a transform dictionary to change the formatting of certain properties (say, according to other local variables) like so:

raw_html = mols2grid.display(
    _df,
    mol_col="Mol",
    subset=[
        "Name",
        "img",
    ],
    transform={
        "MMP std. dev. difference": lambda x: (
            f"{x:.2f} {' (Log units)' if log else ''}"
        ),
    },
    tooltip=[
        "MMP std. dev. difference",
    ],
)._repr_html_()  # type: ignore

...then it appears all sorting is based on the str representation of "MMP std. dev. difference" instead of the float representation, even though in the actual data frame the column "MMP std. dev. difference" has a dtype of float64. In this example, I could just change the column title before I pass the data to mols2grid, but in other cases I want to use transform to otherwise mutate the string shown to the user yet still retain sorting by the original dtype. Is this possible?

Edit: I think the issue arises from

for col, func in transform.items():
    df[col] = df[col].apply(func)
where the transform is applied directly to the data, changing the column to a str in the above code snippet.
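
To make the dtype change concrete, here is a minimal sketch that runs the same loop on a toy data frame. The column name, values, and transform are made up for illustration; only the loop itself is taken from the snippet above.

```python
import pandas as pd

# Hypothetical data standing in for a float64 property column.
df = pd.DataFrame({"x": [3.14159, 10.5, 20.0]})

# Hypothetical transform that formats each float as a display string.
transform = {"x": lambda v: f"x = {v:.2f}"}

dtype_before = df["x"].dtype

# Same loop as quoted above: the column itself is overwritten.
for col, func in transform.items():
    df[col] = df[col].apply(func)

dtype_after = df["x"].dtype

# The column no longer holds floats, so any later sort compares strings.
print(dtype_before, "->", dtype_after)
print(df["x"].tolist())  # ['x = 3.14', 'x = 10.50', 'x = 20.00']
```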

A simpler example that would show the same behavior is this:

raw_html = mols2grid.display(
    _df,
    mol_col="Mol",
    subset=[
        "Name",
        "img",
        "x",
    ],
    transform={
        "x": lambda x:  f"x = {np.round(x,2)}"
    },
)._repr_html_()  # type: ignore

where just changing the display from 3.14 to x = 3.14 will break sorting.
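
A quick way to see why, sketched in plain Python with made-up values (and the built-in round in place of np.round): once the numbers are wrapped in a string, comparison is character by character, so "10.5" sorts before "3.14".

```python
# Hypothetical values standing in for the "x" column.
values = [3.14, 10.5, 20.0]

# Same formatting idea as the transform above.
labels = [f"x = {round(v, 2)}" for v in values]

numeric_order = sorted(values)
string_order = sorted(labels)

print(numeric_order)  # [3.14, 10.5, 20.0]
print(string_order)   # ['x = 10.5', 'x = 20.0', 'x = 3.14']
```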

Can't really think of an easy solution for this one apart from keeping 2 distinct columns, one for displaying and one for sorting (which you could hide with style={"x": lambda x: "display: none"}) but that's not great...

A solution could be to do a regex search for numeric values and strip out everything else before sorting, but this operation would have to be done on each pair of values being compared, which (in addition to being tricky to do correctly) would slow things down quite a bit (or it would need a significant rewrite, which will not happen if I'm honest).
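
For illustration only, the regex idea could look roughly like this in Python. This is not how the grid sorts (that happens on the JavaScript side); the numeric_key helper, the pattern, and the sample labels are all assumptions made up for this sketch.

```python
import re

# Matches an optional sign, digits, and an optional decimal part.
# Scientific notation and thousands separators are deliberately not handled.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def numeric_key(text):
    """Return the first number found in text as a float, or +inf if none."""
    match = _NUMBER.search(text)
    return float(match.group()) if match else float("inf")


labels = ["x = 10.5", "x = 3.14", "x = 20.0", "n/a"]

# Strings without any number get +inf and therefore sort last.
sorted_labels = sorted(labels, key=numeric_key)
print(sorted_labels)  # ['x = 3.14', 'x = 10.5', 'x = 20.0', 'n/a']
```

Note that Python's sorted computes the key once per value rather than once per comparison; whether the JavaScript grid library allows the same shortcut is a separate question. The "tricky to do correctly" part also shows up quickly: a label such as "a-3" would be read as -3 by this pattern.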

Yes, I see what you mean. I had two thoughts:

  1. Apply the transform only for display, so that the transform string really only gets written to the div and not stored in the data frame, like this:
    s = f'<div class="data data-{slugify(col)}" style="display: none;"></div>'

    ...and then still use the data frame itself for sorting.
  2. Allow a custom key for sorting that would get passed directly to pd.DataFrame.sort_values(key=...) https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html . I think this one is the simplest. Then, in my case, I could use a lambda to just strip the first few characters, which are fixed, and cast the rest of the string to a float.
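
A rough sketch of what option 2 would amount to on the pandas side, assuming the column already holds the transformed strings. The data frame, the fixed prefix, and the column names are hypothetical; nothing here is an existing mols2grid parameter.

```python
import pandas as pd

# Hypothetical column that already contains transformed display strings.
df = pd.DataFrame(
    {
        "Name": ["a", "b", "c"],
        "x": ["x = 10.5", "x = 3.14", "x = 20.0"],
    }
)

# Fixed prefix added by the (assumed) transform.
prefix = "x = "

# key receives the whole column as a Series and must return a Series of
# the same length; here the prefix is sliced off and the rest cast to float.
sorted_df = df.sort_values("x", key=lambda s: s.str[len(prefix):].astype(float))

print(sorted_df["Name"].tolist())  # ['b', 'a', 'c']
```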

Ah wait are you talking about the sort_by parameter in mols2grid.display or the sort button on the grid?

My "solution" above referred to the latter and would operate on the JavaScript side, through the library that handles the grid display; yours seems to suggest the former (i.e. the Python side).
If you're only interested in sorting once at the beginning, I guess I could just do the sorting before applying the data transforms (but you could also do that directly on your input dataframe tbh).

Regarding your thoughts:

  1. the div seen here is basically just a template string used on the JavaScript side later on. The values from the dataframe are directly injected between the <div ...></div> in that template by the JavaScript library that I use, so at that point I can only pass the already-transformed values.
  2. this would work, but as said in the message above, I could just do the sorting before the data transforms and avoid users having to provide yet another lambda function to handle the data.

Sorry for the confusion -- I am actually referring to the button on the rendered grid. I didn't look at when/how the data is shipped to JS, so I forgot to think about sorting in JS rather than pandas. I'm not interested in sorting once -- as you say, that's not too tricky -- just hoping that if someone sorts on a field that's x = 123, that we can do the "right thing" numerically. I'm using a transform to do x = 123 simply because the molecules have lots of data associated with them and without the string, I don't think the users will know which number is displayed.

Ok, in that case I guess a reasonable feature would be to add a new regex_sort parameter that toggles a slower but more powerful sorting function on the JS side. Not sure when I can add that in, but I will definitely consider it.