IQSS / dataverse-sample-data

Scripts and sample data for demo purposes

tabular file for Zoomable Circle Packing visualization

pdurbin opened this issue

Given sample data files like these four...

Screen Shot 2019-09-26 at 3 21 15 PM

We want a tabular file that looks like this:

Screen Shot 2019-09-26 at 3 34 27 PM

Here's a downloadable version: data.tsv.txt

The task is to write a script to create this tabular file based on the latest sample data in this repo. (A future task will be to transform it into a nested JSON document to be compatible with the d3 code below.)
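A minimal sketch of what such a script might look like, assuming the sample data sits in a directory tree where each nested directory is one level of the hierarchy. The column names (`level1` to `level3`, `file`, `publication_date`) and the function names are hypothetical, not taken from the actual data.tsv.txt, and the publication date defaults to today as a placeholder.

```python
import csv
from datetime import date
from pathlib import Path


def build_rows(root, max_levels=3, pub_date=None):
    """Return one row per file found under root.

    Each row holds up to max_levels parent directory names (padded with
    empty strings), the file name, and a publication date. Directories
    nested deeper than max_levels are truncated to the first max_levels.
    """
    root = Path(root)
    # Assumption: no real publication date is available, so use today.
    pub_date = pub_date or date.today().isoformat()
    rows = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        levels = list(path.relative_to(root).parent.parts)[:max_levels]
        levels += [""] * (max_levels - len(levels))
        rows.append(levels + [path.name, pub_date])
    return rows


def write_tsv(rows, out, max_levels=3):
    """Write the rows as tab-separated text with a (hypothetical) header."""
    header = [f"level{i}" for i in range(1, max_levels + 1)]
    header += ["file", "publication_date"]
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
```

Calling `write_tsv(build_rows("data"), open("files.tsv", "w", newline=""))` would produce one line per file, with empty cells for any level that a given file does not reach.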

Here's the first file to show the hierarchy and publication date:

Screen Shot 2019-09-26 at 3 37 59 PM

The eventual goal is to come up with something like the Zoomable Circle Packing visualization at http://bl.ocks.org/nbremer/667e4df76848e72f250b and in the screenshot below.

Screen Shot 2019-09-26 at 3 38 47 PM

@TaniaSchlatter @erikbuunk and Jess, please let me know if I have any of this wrong! 😄

0d9d467 is a complete mess but I'm starting to construct a TSV file with the 50 files currently in this repo. 😄

Hi Phil, thanks for opening this issue. I think the outcome that we want for the end of next week is to have the data from this dataset (https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/U5DCAW) in the format described above.

Related to issue IQSS/dataverse#5603. The plan after getting the formatted data to Jess is that she will work with it in Tableau and R. We will meet next in the design meeting, 10am, Oct. 30.

@TaniaSchlatter maybe I misunderstood but I thought the target for the end of next week was a tabular file for the sample files in this repo. I just pushed b6cdf06 and it's full of hacks but I can at least provide you, @erikbuunk and Jess the following tabular file to look at:

files.tsv.txt (I have to add ".txt" to attach it to this GitHub issue).

Here's how it looks in a spreadsheet program (LibreOffice because I'm on Linux today):

Screenshot from 2019-09-27 15-48-37

I don't know if Jess is on GitHub or not so I'll assign this to you and Erik for review. I'll email Jess separately. Please note that the current sample data does not go to the level of depth we talked about in the meeting. Also, I hard coded the dates to today but I figure Jess can change a few of them manually if she wants to since we're only talking about 50 files.

The next steps after this as I see them:

  • Jess can immediately start playing with the "files.tsv" data above, which (again) is derived from this "sample data" repo. (Again, I'll email her to make sure she knows it's now available.)
  • Clean up the hideous script I just mentioned above that makes "files.tsv" out of whatever is in this sample data repo. That should be the definition of done for this issue. Please kick this issue back to me and I'll make a proper pull request and delete the "scratch" branch I've been using.
  • Update the SQL script from @scolapasta to use the new format. This work should go in the main "dataverse" repo and should probably eventually be made into an API if we like it. The new issue would be in the main repo too as a small chunk, or I guess we could just use #5603.
  • Write a script to convert the tsv file into the nested JSON. The code can go in this issue because we might want to use the circle packing viz for demos some day. That would be a new small chunk issue here.
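The last step above, converting the TSV file into nested JSON, could be sketched roughly as follows. This assumes hypothetical column names (`level1` to `level3` and `file`) and targets the `name`/`children` shape that d3 hierarchy layouts read by default. The `size` value of 1 per file is a placeholder, not something from the real data.

```python
import csv
import io
import json


def tsv_to_tree(tsv_text, level_columns=("level1", "level2", "level3"),
                leaf_column="file", root_name="root"):
    """Fold flat TSV rows into a nested name/children dictionary."""
    root = {"name": root_name, "children": []}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        node = root
        for col in level_columns:
            name = row.get(col)
            if not name:
                break  # an empty level means the hierarchy stops here
            # Reuse an existing branch with this name, or create one.
            child = next((c for c in node["children"]
                          if c["name"] == name and "children" in c), None)
            if child is None:
                child = {"name": name, "children": []}
                node["children"].append(child)
            node = child
        # Attach the file as a leaf; size=1 is a placeholder weight.
        node["children"].append({"name": row[leaf_column], "size": 1})
    return root


def tsv_to_json(tsv_text, **kwargs):
    """Serialize the nested tree as indented JSON text."""
    return json.dumps(tsv_to_tree(tsv_text, **kwargs), indent=2)
```

Rows whose deeper level columns are empty simply attach their file higher up in the tree, so a sparse file still converts without special handling.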

Thanks @pdurbin for this awesome step. The structure of the data looks to me like what we discussed. The sample data was fine for working out the structure. What we want for Jess by the end of next week is the same formatting applied to real-world Dataverse data from the dataset referenced above, either for 6 months or the full year, depending on the number of lines. We talked about roughly 75,000 lines being reasonable for Jess to work with. Maybe @erikbuunk can give the formatting a review as well to help confirm before moving forward to working with the larger dataset.

The structure looks good.

The data will probably not lead to anything super interesting yet, but it is probably enough to make a start. Every document set has 1 layer, and the set with a 2nd layer is the only one in that specific tree (which means one extra circle).

Something like this:
IMG_6410_r

@TaniaSchlatter I'm pretty sure Jess will get value out of the tabular file above. Yes, we'll get her more data. I may need to tap @scolapasta or @jggautier for their SQL-fu. 😄

@djbrooke especially if Jess gets immediate value out of the file above or even just for fun, we might want to create some more datasets here in the sample data repo that are deeper down in a tree of dataverses. Or we could reorganize the datasets we have. That would resolve the problem of "level 3" columns being empty in the file above.

@erikbuunk thanks! Super helpful!

Oh, I did email the tabular file to Jess by the way. Have a good weekend, all!

@TaniaSchlatter @djbrooke @scolapasta @mheppler @jggautier and I talked about this during design standup this morning.

@djbrooke said he'll do the housekeeping in terms of creating new issues, etc. Thanks!

I just heard from Jess. Sounds like she got the file above and is having fun:

"Super impressive! Itโ€™s working well. Canโ€™t wait to get a few more minutes to more fully explore ๐Ÿ˜Š

Feel free to give me a larger dataset at any point, maybe more representative? This one is very well-behaved."

We are all in 100% agreement that the next step is to give her more data, from production, no matter how ill-behaved it is. 😄

The starting point will probably be the SQL scripts mentioned above. Here they are for safe keeping:

We might as well make them into an API endpoint, I'm thinking, so the issue should probably be created in https://github.com/IQSS/dataverse/issues