atorus-research / xportr

Tools to build CDISC compliant data sets and check for CDISC compliance.

Home Page: https://atorus-research.github.io/xportr/


Feature Request: FDA expectations around minimising xpt filesizes

rossfarrugia opened this issue

Feature Idea

@elimillera @bms63 our eSubmission team spotted a very useful new feature idea for {xportr}. The request was for options in xportr_write() and xportr_length() respectively: to split the dataset if it is >5 GB (with flexibility to change the cutoff, as the China HA expects no bigger than 4 GB), and to minimise variable lengths according to the data values so as to minimise the file size.

This is needed as per the FDA Study Data Technical Conformance Guide: https://www.fda.gov/media/88173/download

[screenshot of the relevant section of the FDA guidance]

Relevant Input

No response

Relevant Output

No response

Reproducible Example/Pseudo Code

No response

I found some information on split datasets in the ADRG completion guidelines. It is mentioned there that the split method should be described in the ADRG. In the provided example, the split is based on parameter category.

I suggest adding a message to users when the file size exceeds 5 GB. This would allow users to choose the split method (e.g., parameter category). Additionally, we could add a function argument to enable dataset splitting irrespective of the data size.
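
For illustration, a minimal sketch of such a check. The helper name and `max_gb` argument are hypothetical, not part of the current xportr API:

```r
# Hypothetical helper: report the size of a written xpt file and warn
# when it exceeds a threshold. `max_gb` defaults to the FDA's 5 GB
# limit but could be lowered, e.g. to 4 GB for China submissions.
check_xpt_size <- function(path, max_gb = 5) {
  size_gb <- file.size(path) / 1024^3
  message(sprintf("%s is %.2f GB.", basename(path), size_gb))
  if (size_gb > max_gb) {
    warning(sprintf(
      "%s exceeds %s GB; consider splitting the dataset.",
      basename(path), max_gb
    ))
  }
  invisible(size_gb)
}
```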

What do you think?

[screenshots of the split dataset example from the ADRG completion guidelines]

Thanks @cpiraux.

Do we want xportr_write() to print out a warning message saying the data is greater than 5 GB, plus a general message stating the size of the data (e.g., as a soft check for China submissions)? I think we should have another, simpler function called xportr_split() that a user can then use to help do the split.

I am a little wary of making xportr_write() overly complicated with more arguments.

Also, I just realized that these are two separate issues:

  • xportr_length() finding the smallest length in the data. I think R does this behind the scenes for us already, but if we are using the spec to overwrite that R behaviour and apply the lengths from the spec, then we need a way to update the spec with the new lengths found in the data (no clue how to do this; see the sketch after this list).
  • xportr_write() - the 5 GB split issue
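
As a starting point for the length question, a rough sketch (a hypothetical helper, not existing xportr code) that computes the maximum observed width of each character variable, which could then feed back into the spec:

```r
# Hypothetical sketch: find the smallest safe length for each
# character variable, i.e. the maximum observed number of characters
# (with a floor of 1 for all-missing columns).
derive_lengths <- function(data) {
  char_vars <- names(data)[vapply(data, is.character, logical(1))]
  lengths <- vapply(
    data[char_vars],
    function(x) max(c(nchar(x), 1L), na.rm = TRUE),
    integer(1)
  )
  data.frame(variable = char_vars, length = lengths)
}
```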

Agree with Ben! For how to do the split, my recommendation would be to give users just one option, to keep it simple: split by subject. I imagine this holds for most domains, and certainly for those large enough to warrant splitting, whereas splitting by parameter is only relevant for certain domains. In reality, the first thing the review team would do is append the split datasets back together anyway, so it shouldn't really matter exactly how the split is applied.

Some initial ideas.

A dataset's size is indeterminate until it is written out, so the flow could be:
xportr_write() calls xportr_split() if the data is >5 GB; xportr_split() splits the data, then calls xportr_write() again on each piece.
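
Sketching the by-subject split (a rough sketch of the proposed xportr_split(), not an implemented design):

```r
# Hypothetical sketch: split a dataset into n roughly equal parts by
# subject, keeping all records for a given USUBJID in the same part.
xportr_split <- function(data, n_parts = 2) {
  ids <- unique(data$USUBJID)
  grp <- cut(seq_along(ids), breaks = n_parts, labels = FALSE)
  lapply(split(ids, grp), function(g) data[data$USUBJID %in% g, ])
}
```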

Hi @elimillera - so I don't think my example from Advanced R is that helpful, but I like the use of walk2() so will share anyway: https://adv-r.hadley.nz/functionals.html?q=iterating#no-outputs-walk-and-friends
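
For completeness, a small sketch of the walk2() idea, assuming the hypothetical xportr_split() above and `adlb` as a stand-in for a large ADaM dataset:

```r
library(purrr)
library(haven)

# Split the dataset, then pair each piece with an output path.
pieces <- xportr_split(adlb, n_parts = 2)
paths  <- sprintf("adlb%d.xpt", seq_along(pieces))

# walk2() iterates over the pieces and paths purely for the side
# effect of writing each xpt file.
walk2(pieces, paths, ~ write_xpt(.x, path = .y))
```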