Usage for larger datasets
Trybnetic opened this issue
Hi,
Excellent work on this package!
I started using your package but ran into problems with a larger dataset that is too big to load into memory all at once. For that reason, I am wondering whether it is possible to aggregate the data beforehand. I have not yet figured out how this is handled internally in pySankey, but I assume an aggregation like the one below is done internally anyway:
> import pandas as pd
> df = pd.read_csv('pysankey/fruits.txt', sep=' ', names=['true', 'predicted'])
> df.groupby(["true", "predicted"]).size().reset_index().rename(columns={0: "weight"})
         true  predicted  weight
0       apple      apple      50
1       apple     banana      28
2       apple  blueberry     129
3       apple       kiwi      12
4       apple       lime      21
5       apple     orange      50
6      banana      apple       7
7      banana     banana      34
8      banana  blueberry      34
9      banana       lime       6
10     banana     orange      88
11  blueberry      apple      19
12  blueberry     banana      46
13  blueberry  blueberry      84
14  blueberry       lime      14
15  blueberry     orange      50
16       lime      apple      55
17       lime     banana      23
18       lime  blueberry      75
19       lime       kiwi      23
20       lime       lime      70
21       lime     orange      37
22     orange      apple       3
23     orange     banana      15
24     orange  blueberry       8
25     orange       kiwi       6
26     orange       lime       1
27     orange     orange      53
I would be delighted if you could give me some pointers to where the data aggregation happens. I would also appreciate your view on whether it is feasible, and how difficult it would be, to expose the underlying plotting-only code so that the data can be aggregated before the call. Finally, would you be interested in incorporating a solution to this problem into the package? If so, I would try to solve my problem in a fork and open a PR if I am successful.
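For a file that does not fit into memory, the counting can be done one row at a time before any plotting. The following is a minimal sketch using only the standard library, not pySankey's own code. The helper name `count_pairs` is hypothetical, and the input format (two space-separated labels per line, none containing spaces) is an assumption based on the `read_csv` call above.

```python
import csv
import io
from collections import Counter


def count_pairs(lines, delimiter=" "):
    """Count (true, predicted) pairs while reading one row at a time."""
    counts = Counter()
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        counts[(row[0], row[1])] += 1
    return counts


# Tiny in-memory stand-in for a large file. For real data, pass an open
# file handle instead: `with open(path, newline="") as fh: count_pairs(fh)`.
sample = io.StringIO("apple apple\napple banana\napple apple\nlime orange\n")
counts = count_pairs(sample)

# Flatten into parallel lists, one entry per distinct pair.
pairs = sorted(counts)
true_labels = [t for t, _ in pairs]
predicted_labels = [p for _, p in pairs]
weights = [counts[pair] for pair in pairs]
```

Memory use then grows with the number of distinct label pairs rather than with the number of rows, which is what makes the pre-call aggregation worthwhile.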
I have now discovered that leftWeight and rightWeight do exactly that. Somehow I skipped that part of the README, and it did not become clear from the documentation in the code. I am still a bit curious why there are separate leftWeight and rightWeight arguments, even though you pass the same value to both in the example?
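A rough sketch of how pre-aggregated counts might be handed over: the helper `sankey_kwargs` is hypothetical, and the keyword names `left` and `right`, as well as the import path in the commented-out call, are assumptions to be checked against the README. Only `leftWeight` and `rightWeight` are named in this thread.

```python
def sankey_kwargs(true_labels, predicted_labels, weights):
    """Bundle pre-aggregated pairs into keyword arguments for a sankey call."""
    if not (len(true_labels) == len(predicted_labels) == len(weights)):
        raise ValueError("labels and weights must have the same length")
    w = [float(x) for x in weights]
    return {
        "left": list(true_labels),        # assumed keyword name
        "right": list(predicted_labels),  # assumed keyword name
        "leftWeight": w,
        "rightWeight": list(w),           # same weights on both sides
    }


# First three rows of the aggregated fruits table above.
kwargs = sankey_kwargs(
    ["apple", "apple", "apple"],
    ["apple", "banana", "blueberry"],
    [50, 28, 129],
)

# Plotting step, left commented out because it needs pySankey and matplotlib
# installed; the import path and call shape are assumptions:
# from pysankey import sankey
# sankey(**kwargs)
```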
Thank you for the kind words. I didn't do much here except for releases; most of the actual changes are made by users of this lib who contribute. (I haven't actually used this lib since 2018.)
"why it is leftWeight and rightWeight, even though you give the same argument to both in the example?"
There's probably a mistake in the example. A PR that makes the documentation clearer, with different weights for left and right, would be welcome.
Thanks for the reply! To me it actually looks like the documentation there is correct so far, but the implementation of leftWeight and rightWeight seems redundant. At least, I obtain the same plot for the fruits dataset when I use it the way described in the documentation and when I aggregate and weight it first. However, there might be some undocumented edge cases that motivate this design. Some more research into the code would be required to (hopefully) figure this out, and I will see whether I can find some time to do that.
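The observation can be sanity-checked without plotting. This sketch assumes that the plot depends only on the weight summed per (left, right) pair, which is a guess about pySankey's internals and not confirmed in this thread. The helper `totals` and the toy data are hypothetical.

```python
from collections import defaultdict


def totals(left, right, weights):
    """Sum the weights for each (left, right) pair."""
    out = defaultdict(float)
    for l, r, w in zip(left, right, weights):
        out[(l, r)] += w
    return dict(out)


# Raw form: one row per observation, each with unit weight.
raw_left = ["apple", "apple", "apple", "banana"]
raw_right = ["apple", "apple", "banana", "apple"]
raw = totals(raw_left, raw_right, [1.0] * len(raw_left))

# Aggregated form: one row per distinct pair, with the count as weight.
agg = totals(
    ["apple", "apple", "banana"],
    ["apple", "banana", "apple"],
    [2.0, 1.0, 1.0],
)

# Under the stated assumption, identical totals imply identical plots.
same = raw == agg
```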
In any case, I believe the examples in the README could be clarified and harmonized a bit more (e.g., by describing the observed behavior mentioned above), and I will open a PR for that soon.
Thank you for digging into it!