Pierre-Sassoulas / pySankey

This is the maintained version of PySankey (pySankeyBeta on PyPI)

Usage for larger datasets

Trybnetic opened this issue · comments

Hi,

excellent work you put into this package here!

I started using your package, but ran into problems with a larger data set that is too big to load into memory all at once. For that reason, I am wondering whether it is possible to aggregate the data beforehand. I have not yet figured out how this is handled internally in pySankey, but I assume some aggregation like the one below is done internally anyway:

> df = pd.read_csv('pysankey/fruits.txt', sep=' ', names=['true', 'predicted'])
> df.groupby(["true", "predicted"]).size().reset_index().rename(columns={0: "weight"})
         true  predicted  weight
0       apple      apple      50
1       apple     banana      28
2       apple  blueberry     129
3       apple       kiwi      12
4       apple       lime      21
5       apple     orange      50
6      banana      apple       7
7      banana     banana      34
8      banana  blueberry      34
9      banana       lime       6
10     banana     orange      88
11  blueberry      apple      19
12  blueberry     banana      46
13  blueberry  blueberry      84
14  blueberry       lime      14
15  blueberry     orange      50
16       lime      apple      55
17       lime     banana      23
18       lime  blueberry      75
19       lime       kiwi      23
20       lime       lime      70
21       lime     orange      37
22     orange      apple       3
23     orange     banana      15
24     orange  blueberry       8
25     orange       kiwi       6
26     orange       lime       1
27     orange     orange      53

I would be absolutely delighted if you could give me some pointers to where the data aggregation happens. I would also be happy to hear your view on whether it is feasible (or rather, how difficult it would be) to expose the underlying plotting-only code to allow for aggregation before the call. Finally, I would like to know whether you would be interested in incorporating a solution to this problem into the package. In that case, I would try to solve my problem in a fork and open a PR if I am successful.

I have now discovered that leftWeight and rightWeight can be used to do exactly that. Somehow I skipped that part of the README, and it did not become clear from the documentation in the code. I am still a bit curious why there are separate leftWeight and rightWeight arguments, even though you pass the same value to both in the example?
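For anyone who ends up here with the same problem, the pre-aggregation can be done without ever holding the full file in memory. The following is only a minimal sketch using the standard library: the helper name `count_pairs` and the tiny in-memory sample are made up for illustration, and the input is assumed to have the same two-column, space-separated layout as fruits.txt.

```python
import csv
import io
from collections import Counter


def count_pairs(lines, sep=" "):
    """Stream over lines and count each (true, predicted) pair.

    Hypothetical helper, not part of pySankey. Only the counts are kept
    in memory, never the full data set.
    """
    counts = Counter()
    for row in csv.reader(lines, delimiter=sep):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        counts[(row[0], row[1])] += 1
    return counts


# Small in-memory stand-in for a large file; with a real file,
# pass the open file handle to count_pairs instead.
sample = io.StringIO("apple apple\napple banana\napple apple\nlime kiwi\n")
pair_counts = count_pairs(sample)

# One entry per distinct pair, plus the matching weight.
left = [true for true, _ in pair_counts]
right = [predicted for _, predicted in pair_counts]
weight = [float(n) for n in pair_counts.values()]
```

The resulting `left`, `right`, and `weight` lists should then be usable in the way the README example describes, with `weight` given as both leftWeight and rightWeight. That last step is not exercised in this sketch.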

Thank you for the kind words. I didn't do much here except for releases; most of the actual changes are made by users of this lib who contribute. (I have not actually used this lib since 2018.)

> why it is leftWeight and rightWeight, even though you give the same argument to both in the example?

There's probably a mistake in the example. A PR to make the documentation clearer, with different weights for left and right, would be welcome.

Thanks for the reply! To me it actually looks as though the documentation is correct so far, but the implementation of leftWeight and rightWeight seems redundant. At least, I obtain the same plot for the fruits dataset when I use it the way described in the documentation and when I aggregate and weight the data first. However, there might be some undocumented edge cases that motivated this design. Some more research on the code side would be required to (hopefully) figure this out, and I will see whether I can find some time to do that.

In any case, I believe that the examples in the README could be clarified and harmonized a bit more (e.g., by describing the observed behavior mentioned above), and I will open a PR for that soon.

Thank you for digging into it!