atorus-research / xportr

Tools to build CDISC compliant data sets and check for CDISC compliance.

Home Page: https://atorus-research.github.io/xportr/


Feature Request: FDA expectations around minimising xpt filesizes

rossfarrugia opened this issue

Feature Idea

@elimillera @bms63 our eSubmission team spotted a very useful new feature idea for {xportr}. The request was for options in xportr_write() and xportr_length() respectively: to split the dataset if it is >5 GB (with flexibility to change the cutoff, as the China HA expects no bigger than 4 GB), and to minimise variable lengths according to the data values so as to minimise the file size.

This is needed as per the FDA Study Data Technical Conformance Guide: https://www.fda.gov/media/88173/download

[screenshot of the relevant section of the FDA guidance]

Relevant Input

No response

Relevant Output

No response

Reproducible Example/Pseudo Code

No response

I found some information on split datasets in the ADRG completion guidelines. It is mentioned there that the split method should be described in the ADRG. In the provided example, the split is based on parameter category.

I suggest adding a message to users when the file size exceeds 5 GB. This would allow users to choose the split method (e.g., parameter category). Additionally, we could add a function argument to enable dataset splitting irrespective of the data size.
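
For illustration, a minimal sketch of such a check. The helper name and `max_gb` argument are hypothetical, not part of the current xportr API:

```r
# Hypothetical helper: report the size of a written xpt file and warn
# when it exceeds a threshold. `max_gb` defaults to the FDA's 5 GB
# limit but could be lowered, e.g. to 4 GB for China submissions.
check_xpt_size <- function(path, max_gb = 5) {
  size_gb <- file.size(path) / 1024^3
  message(sprintf("%s is %.2f GB.", basename(path), size_gb))
  if (size_gb > max_gb) {
    warning(sprintf(
      "%s exceeds %s GB; consider splitting the dataset.",
      basename(path), max_gb
    ))
  }
  invisible(size_gb)
}
```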

What do you think?

[screenshots of the split dataset example from the ADRG completion guidelines]

Thanks @cpiraux.

Do we want xportr_write() to print out a warning message saying the data is greater than 5 GB, plus a general message stating the size of the data (e.g., as a soft check for China submissions)? I think we should have another, simpler function called xportr_split() that a user can then use to help do the split.

I am a little wary of making xportr_write() overly complicated with more arguments.

Also, I just realized that these are two separate issues:

  • xportr_length() finding the smallest length in the data. I think R does this behind the scenes for us already, but if we are using the spec to overwrite that R behaviour and apply the lengths from the spec, then we need a way to update the spec with the new lengths found in the data (no clue how to do this; see the sketch after this list).
  • xportr_write() - the 5 GB split issue
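
As a starting point for the length question, a rough sketch (a hypothetical helper, not existing xportr code) that computes the maximum observed width of each character variable, which could then feed back into the spec:

```r
# Hypothetical sketch: find the smallest safe length for each
# character variable, i.e. the maximum observed number of characters
# (with a floor of 1 for all-missing columns).
derive_lengths <- function(data) {
  char_vars <- names(data)[vapply(data, is.character, logical(1))]
  lengths <- vapply(
    data[char_vars],
    function(x) max(c(nchar(x), 1L), na.rm = TRUE),
    integer(1)
  )
  data.frame(variable = char_vars, length = lengths)
}
```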

Agree with Ben! For how to do the split, my recommendation would be to give users just one option, to keep it simple: split by subject. I imagine this holds for most domains, and certainly for those large enough to warrant splitting, whereas splitting by parameter is only relevant for certain domains. In reality, the first thing the review team would do is append the split datasets back together anyway, so it shouldn't really matter exactly how the split is applied.

Some initial ideas.

A dataset's size is indeterminate until it is written out, so the flow could be:
xportr_write() calls xportr_split() if the data is >5 GB; xportr_split() splits the data, then calls xportr_write() again on each piece.
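
Sketching the by-subject split (a rough sketch of the proposed xportr_split(), not an implemented design):

```r
# Hypothetical sketch: split a dataset into n roughly equal parts by
# subject, keeping all records for a given USUBJID in the same part.
xportr_split <- function(data, n_parts = 2) {
  ids <- unique(data$USUBJID)
  grp <- cut(seq_along(ids), breaks = n_parts, labels = FALSE)
  lapply(split(ids, grp), function(g) data[data$USUBJID %in% g, ])
}
```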

Hi @elimillera - so I don't think my example from Advanced R is that helpful, but I like the use of walk2() so will share anyway: https://adv-r.hadley.nz/functionals.html?q=iterating#no-outputs-walk-and-friends
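
For completeness, a small sketch of the walk2() idea, assuming the hypothetical xportr_split() above and `adlb` as a stand-in for a large ADaM dataset:

```r
library(purrr)
library(haven)

# Split the dataset, then pair each piece with an output path.
pieces <- xportr_split(adlb, n_parts = 2)
paths  <- sprintf("adlb%d.xpt", seq_along(pieces))

# walk2() iterates over the pieces and paths purely for the side
# effect of writing each xpt file.
walk2(pieces, paths, ~ write_xpt(.x, path = .y))
```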