mortonne / psifr

Psifr: Analysis and visualization of free recall data

Home Page: https://psifr.readthedocs.io/en/stable/

Multi-Level Indexing and Compatibility with R's reticulate Library

githubpsyche opened this issue

Prima facie, psifr seems well suited for use in R via R's reticulate package. Although it is written in Python, its numerical outputs are data frames rather than the vectors or arrays that MATLAB-based functions tend to return, and reticulate automatically converts between R data frames and pandas DataFrames. Moreover, the data format that psifr requires, whether for its own analyses or for more hand-spun ones, suits the kinds of operations you'd be doing with ggplot to make visualizations.

These features should make using the library in R relatively seamless. But because psifr relies heavily on multi-index rows for many of its analyses, it doesn't really work out that way. Functions like fr.spc, when called through reticulate, return output that has lost the subject and input levels, which are critical for downstream analyses such as actually plotting serial position curves.

I feel like that's kind of a shame, a missed opportunity? Adding an optional argument that triggers reset_index as the last step of functions like these would probably sidestep this limitation while maintaining compatibility with downstream functions like fr.plot_spc (I've already tested as much for this function). The various grouping steps in seaborn and pandas seem to work just as well whether identifying information like subject or input is in the row index or in columns.
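
To make the index issue concrete, here is a minimal pandas sketch. It does not call psifr; the toy data frame and its values are hypothetical, and only the subject and input names come from the discussion above. It shows how a grouped result carries its grouping keys in a row MultiIndex, and how reset_index turns them back into ordinary columns.

```python
import pandas as pd

# Hypothetical toy data standing in for scored free-recall data.
# Not psifr output; only the "subject" and "input" names come from the thread.
toy = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2],
    "input": [1, 2, 3, 1, 2, 3],
    "recall": [1, 0, 1, 1, 1, 0],
})

# Grouping by two keys yields rows labeled by a two-level MultiIndex,
# so "subject" and "input" are no longer columns.
indexed = toy.groupby(["subject", "input"])[["recall"]].mean()

# reset_index moves the index levels back into ordinary columns.
flat = indexed.reset_index()

print(list(indexed.columns))
print(list(flat.columns))
```

In the flat version, the grouping keys are plain columns, which is the layout that would survive a conversion that ignores the row index.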

Of course R compatibility is outside the project's scope, and the overhead of performing reset_index over and over again might be significant, but adding an option to perform that within psifr's functions would make it possible to, for example, add a page to the library's documentation called "Using Psifr in R" that shows it's as simple as the code I include below (i.e. almost exactly like it's used in Python), and potentially extend its reach across the research community.

Just an idea.

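library(reticulate)
# Import the Python module psifr.fr through reticulate
fr <- import("psifr.fr")

# Read the raw data, merge study and recall events, then compute the serial position curve
data <- read.csv("data.csv")
merged <- fr$merge_free_recall(data)
spc <- fr$spc(merged)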

Seems like a good idea. It's a simple change to add a reset_index call to each of the analysis functions in the fr module, and no code that I know of actually uses the MultiIndex in the output (resetting is the most common first step). I don't think there's any need for an option to toggle between resetting or not, especially given that re-creating the MultiIndex is trivial if anyone ever needs it, so resetting could just be the new behavior across the board. The current API isn't that specific about the output format anyway.
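
As a rough illustration of why re-creating the MultiIndex is trivial, the following sketch uses set_index on a flat frame. The frame is hypothetical and is not produced by psifr; it only mimics the shape of a reset output with subject and input columns.

```python
import pandas as pd

# Hypothetical flat output, shaped like a result after reset_index.
flat = pd.DataFrame({
    "subject": [1, 1, 2, 2],
    "input": [1, 2, 1, 2],
    "recall": [1.0, 0.5, 0.0, 1.0],
})

# One call restores the two-level row index for anyone who needs it.
restored = flat.set_index(["subject", "input"])

# Values can then be looked up by (subject, input) label.
print(restored.loc[(1, 2), "recall"])
```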

You're welcome to implement this change if you want to take a pass at it. I'm less familiar with the R side of things, but it sounds like you could test that out. I've updated the contribution guidelines to give more detail about how to prepare a pull request.

As for the documentation, if you could confirm that your example works with your changes to Psifr, then I can add a page to the docs with that example.

Cool, I'll get to it when I have the chance.

Hullo,

After some more experimenting, it seems that there are other issues with the reticulate package's conversion between Python and R data frames that might make committing psifr to work smoothly in R more trouble than it's maybe worth. For example, it seems integer values in Python dataframes (e.g. subject indices and input positions) are automatically converted into doubles.

Also, psifr's plotting functionality (and matplotlib-based libraries in general) doesn't seem to play very well with RStudio's console output, even with the help of reticulate. I also find that it doesn't work within the R Jupyter kernel. There are other contexts where it might work fine (e.g. RMarkdown), but this limitation is yet another source of friction that suggests explicitly supporting R in psifr's documentation is perhaps a step too far.

Also, adding reset_index() as a final step to functions like spc would affect the outputs of code already depending on psifr, as an additional reset_index() operation on relevant dataframes adds a column assigning a unique index to each row.
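
The extra column can be reproduced with plain pandas. This is a small sketch on a hypothetical frame, not psifr code: calling reset_index on a frame that already has a default integer index inserts a column named index, while passing drop=True avoids it.

```python
import pandas as pd

# Hypothetical frame that already has a default integer row index,
# as it would after a first reset_index.
flat = pd.DataFrame({"subject": [1, 2], "recall": [0.5, 0.75]})

# A second reset_index keeps the old row labels as a new "index" column.
again = flat.reset_index()

# drop=True discards the old row labels instead of adding a column.
dropped = flat.reset_index(drop=True)

print(list(again.columns))
print(list(dropped.columns))
```

Downstream code that calls reset_index without drop=True on already-flat output would therefore see one more column than before.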

For these reasons, I think I got ahead of myself a bit here with this feature suggestion. Like you, I mostly work in Python; I should have gotten more experience with the reticulate package before suggesting that this library be extended to cooperate with it.

@githubpsyche
I've gotten more interest for R support, and decided that it's worth having at least some support through reticulate. I think changing the output to not use multi-indexing is probably an improvement anyway, as multi-indexing adds some complications and users are unlikely to use it (the most common use case I've seen for multi-indexes is merging two DataFrames, which is unlikely to be useful in this context).

I don't think the conversion step will cause any problems with existing analyses. Input and output positions are generally doubles anyway, as they may contain NaNs and Pandas therefore generally uses double arrays for those columns. Item indices are already cast to int as needed. And subject identifiers can be basically anything, including floats, so a conversion to doubles shouldn't cause problems.
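
The point about NaNs can be checked directly in pandas. The sketch below uses hypothetical values and default pandas dtypes (not the nullable Int64 extension type): an integer column stays integer only while it has no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical position values with no missing entries: stored as integers.
complete = pd.Series([1, 2, 3])

# One missing value forces the default dtype to float64,
# because NaN is a floating-point value.
with_missing = pd.Series([1, 2, np.nan])

print(complete.dtype)
print(with_missing.dtype)
```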

R Jupyter support would be nice, but RStudio support can still be added without much work. I'm guessing that plotting support will not be possible, but access to the statistics calculated by Psifr (e.g., lag-CRP, temporal distance rank) could still be very useful for R users.

I've created a modified version of Psifr in the reticulate branch. There are draft instructions in case anyone wants to try it out now. Any feedback would be very welcome!

Unless major problems are found in testing, I'll add reticulate support in a release in the near future.

Hi, yes, cool. My memory is a little hazy about the details of this, but if I remember right, the trouble was less with the conversion step than steps afterward. I think I also noticed that some of psifr's plotting functions don't behave how they should when you apply them to analysis outputs post-reset_index -- but I imagine your tests have already ruled out or caught any issues.

I agree that plotting support is probably outside of scope here. At the same time, I think including downstream processing/plotting examples in the documentation's usage notes and tests might help clarify the limitations, if any, of using psifr with reticulate to do analyses in R. I think these will either show that the output representations are handled as cleanly as native data representations in R, or provide a clearer roadmap for further extending psifr's R support through reticulate (e.g. a wrapper function to handle any final conversion steps).

The test suite passes, so I know at least that none of the plotting functions crash when reset_index has been run. I'll take a closer look at test plots before releasing the change, but I don't know of any reason why resetting the index would cause any issues with any of the plotting functions. If anything, it's the lack of index resetting that I've observed causing issues.

I'm working on adding unit tests in R to test for accuracy and identify any edge cases in the conversion when calculating statistics. That'll make sure the main psifr.fr API is working through reticulate, minus the plotting functions.

I can make plots in R using reticulate, if I first call the plotting function and then run plt.show() (after importing matplotlib.pyplot). But the resulting plots are low-resolution raster images that are not suitable for publications. Native support for plotting in R is beyond my current knowledge. Full native plotting support would require calculating bootstrap intervals and supporting something like the current Seaborn-based interface, or showing how to do similar things with ggplot2. I would likely need help to implement plotting support in R, or to show good plotting examples in the documentation.

I've made a new R package, psifrr, to help with interfacing with Psifr through R using reticulate. With a few fixes in the reticulate branch, all the non-plotting functions work and the output matches the Psifr documentation examples.

I've merged in the necessary changes, so the {psifrr} package no longer requires the reticulate branch of Psifr. Psifr 0.9.0 is compatible with {psifrr}. Now, installing {psifrr} will install Psifr from PyPI instead of from GitHub.