Option to add colnames to new columns

adomingues opened this issue 6 years ago

First of all, thank you so much for this package! It has been part of my routine analysis for some time now. I would just like to suggest a convenience option to skip column renaming after splitting. Example:

to_split <- structure(list(Sample = c("N2_wt_rep1_untreated", "N2_wt_rep1_untreated", 
"N2_wt_rep1_untreated", "N2_wt_rep2_untreated", "N2_wt_rep2_untreated", 
"N2_wt_rep2_untreated"), Reads = c(470987L, 270891L, 56114L, 
513902L, 310722L, 67263L)), .Names = c("Sample", "Reads"), class = "data.frame", row.names = c(NA, 
-6L))
split <- cSplit(to_split, "Sample", sep="_")
split
#     Reads Sample_1 Sample_2 Sample_3  Sample_4
# 1: 470987       N2       wt     rep1 untreated
# 2: 270891       N2       wt     rep1 untreated
# 3:  56114       N2       wt     rep1 untreated
# 4: 513902       N2       wt     rep2 untreated
# 5: 310722       N2       wt     rep2 untreated
# 6:  67263       N2       wt     rep2 untreated

The new col names are not very informative, so I usually rename them in an extra step:

setnames(split,
   c("Sample_1", "Sample_2", "Sample_3", "Sample_4"),
   c("Background", "Allele", "Replicate", "Treatment")
)

This is fine, but I wonder if it would be possible to skip that extra step with cSplit(to_split, "Sample", sep="_", new_names=c("Background", "Allele", "Replicate", "Treatment"))

Cheers.

Ananda Mahto · Answer 1 · Wed Apr 04 2018 10:47:07 GMT+0800 (China Standard Time)

Thanks @adomingues for the comment. I've thought about this in the past. It shouldn't be too difficult to implement, so I'll look into it again.

Here are a couple of reasons I didn't implement it the first time around:

The cSplit function is generalized in the sense that I should be able to split a column without knowing how many columns there will be in the result.
The cSplit function is vectorized over columns, so a simple new_names = c(...) wouldn't work; it would have to be something like list(Sample = c("Background", "Allele", "Replicate", "Treatment"))
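Both points can be handled by a renaming step that takes a named list and checks the number of pieces per split column. The following is only a minimal base-R sketch under assumptions: rename_split_cols is a hypothetical helper, not part of splitstackshape, and it assumes the generated columns follow the "<column>_<number>" pattern seen in the output above.

```r
# Hypothetical helper (not part of splitstackshape): rename the pieces
# created by a split, one parent column at a time.
rename_split_cols <- function(data, new_names) {
  stopifnot(is.list(new_names), !is.null(names(new_names)))
  for (parent in names(new_names)) {
    # Assumption: generated columns are named "<parent>_<number>"
    suffix <- substring(names(data), nchar(parent) + 2)
    is_piece <- startsWith(names(data), paste0(parent, "_")) &
      grepl("^[0-9]+$", suffix)
    wanted <- new_names[[parent]]
    # The number of pieces is not known in advance, so fail loudly
    # instead of renaming the wrong columns
    if (sum(is_piece) != length(wanted)) {
      stop(sprintf("'%s' produced %d columns but %d names were supplied",
                   parent, sum(is_piece), length(wanted)))
    }
    names(data)[is_piece] <- wanted
  }
  data
}

# Stand-in for a split result, built by hand so the sketch needs no packages
already_split <- data.frame(Reads = c(470987L, 270891L),
                            Sample_1 = c("N2", "N2"),
                            Sample_2 = c("wt", "wt"))
renamed <- rename_split_cols(already_split,
                             list(Sample = c("Background", "Allele")))
names(renamed)
```

For a data.table, data.table::setnames would be the in-place way to do the final renaming; plain names<- is used here only to keep the sketch dependency-free.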

Any thoughts on those?

A. Domingues · Answer 2 · Wed Apr 04 2018 14:57:59 GMT+0800 (China Standard Time)

Thanks for considering this @mrdwab. I was thinking about the implementation after posting, and my very naive thought was to operate on the colnames after splitting. For instance, grepping the colnames and replacing only those:

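cSplit2 <- function(indt, splitCols, newNames, ...){
   # Split using the arguments passed in, not the hard-coded example data
   split <- cSplit(indt, splitCols, ...)
   # Match only the generated "<col>_<n>" columns
   newcols <- grep(paste0("^(", paste(splitCols, collapse="|"), ")_[0-9]+$"), colnames(split))
   colnames(split)[newcols] <- newNames
   return(split)
}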

cSplit2(to_split, splitCols = "Sample", sep="_", newNames = c("Background", "Allele", "Replicate", "Treatment"))

This is of course the opposite of what you suggested :) but I wonder if it would be a good starting point.

Ananda Mahto · Answer 3 · Wed Apr 04 2018 15:27:28 GMT+0800 (China Standard Time)

@adomingues, here's a proof-of-concept renamer function that I can probably drop in at the last stages of the existing cSplit function. Here, I'm just demonstrating it as an external function:

library(splitstackshape)
library(data.table)
df <- data.frame(x = 1:3, y = c("a", "d,e", "g,h"), z = c("1", "2,3,4", "6"))

renamer <- function(data, replacement) {
  if (!is.list(replacement)) stop("replacement should be a named list")
  for (i in seq_along(replacement)) {
    old <- names(data)[startsWith(names(data), names(replacement)[i])]
    setnames(data, old = old, new = replacement[[i]])
  }
  data[]
}

cSplit(df, c("y", "z"))
#    x y_1  y_2 z_1 z_2 z_3
# 1: 1   a <NA>   1  NA  NA
# 2: 2   d    e   2   3   4
# 3: 3   g    h   6  NA  NA

renamer(cSplit(df, c("y", "z")), 
        list(y = c("A", "B"), z = c("AA", "BB", "CC")))
#    x A    B AA BB CC
# 1: 1 a <NA>  1 NA NA
# 2: 2 d    e  2  3  4
# 3: 3 g    h  6 NA NA

So, a possible final implementation might look like:

cSplit(df, c("y", "z"), sep = ",", new_names = list(y = c("A", "B"), z = c("AA", "BB", "CC")))

Alternatively, the entire API can be revisited such that, depending on the input, the function behaves differently:

If a simple character string of column names is provided, use the current approach.
If a list is provided in the splitCols argument, new names can be specified (eg: cSplit(df, splitCols = list(y = c("A", "B"), z = c("AA", "BB", "CC")), sep = ","))
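The two cases above could be separated by a small normalization step before any splitting happens. This is a hedged base-R sketch, not the package's code: normalize_split_cols and the cols / new_names fields are hypothetical names chosen for illustration.

```r
# Hypothetical helper (not part of splitstackshape): accept either a
# character vector of column names or a named list of replacement names.
normalize_split_cols <- function(splitCols) {
  if (is.character(splitCols)) {
    # Current behaviour: just column names, no renaming requested
    list(cols = splitCols, new_names = NULL)
  } else if (is.list(splitCols) && !is.null(names(splitCols)) &&
             all(nzchar(names(splitCols)))) {
    # Proposed behaviour: list names are the columns to split,
    # list values are the names for the resulting pieces
    list(cols = names(splitCols), new_names = splitCols)
  } else {
    stop("splitCols must be a character vector or a fully named list")
  }
}

plain <- normalize_split_cols(c("y", "z"))
with_names <- normalize_split_cols(list(y = c("A", "B"),
                                        z = c("AA", "BB", "CC")))
with_names$cols
```

One reason to prefer this shape is that the rest of the function only ever sees a vector of column names plus an optional list of replacements, so existing calls with a character vector would keep working unchanged.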

Let me think about it some more, but I'm open to other ideas as well as I'm currently planning a V2 release of the package later this year.

A. Domingues · Answer 4 · Wed Apr 04 2018 15:45:19 GMT+0800 (China Standard Time)

If a list is provided in the splitCols argument, new names can be specified (eg: cSplit(df, splitCols = list(y = c("A", "B"), z = c("AA", "BB", "CC")), sep = ","))

This pretty much solves it, at least for me. Looking forward to V2.