Option to add colnames to new columns
adomingues opened this issue · comments
First of all, thank you so much for this package! It has been part of my routine analysis for some time now. I would just like to suggest a convenience option to skip column renaming after splitting. Example:
to_split <- structure(list(Sample = c("N2_wt_rep1_untreated", "N2_wt_rep1_untreated",
"N2_wt_rep1_untreated", "N2_wt_rep2_untreated", "N2_wt_rep2_untreated",
"N2_wt_rep2_untreated"), Reads = c(470987L, 270891L, 56114L,
513902L, 310722L, 67263L)), .Names = c("Sample", "Reads"), class = "data.frame", row.names = c(NA,
-6L))
split <- cSplit(to_split, "Sample", sep="_")
split
# Reads Sample_1 Sample_2 Sample_3 Sample_4
# 1: 470987 N2 wt rep1 untreated
# 2: 270891 N2 wt rep1 untreated
# 3: 56114 N2 wt rep1 untreated
# 4: 513902 N2 wt rep2 untreated
# 5: 310722 N2 wt rep2 untreated
# 6: 67263 N2 wt rep2 untreated
The new col names are not very informative, so I usually rename them in an extra step:
setnames(split,
c("Sample_1", "Sample_2", "Sample_3", "Sample_4"),
c("Background", "Allele", "Replicate", "Treatment")
)
This is fine, but I wonder if it would be possible to skip that extra step with cSplit(to_split, "Sample", sep="_", new_names=c("Background", "Allele", "Replicate", "Treatment"))
Cheers.
Thanks @adomingues for the comment. I've thought about this in the past. It shouldn't be too difficult to implement, so I'll look into it again.
Here are a couple of reasons I didn't implement it the first time around:
- The cSplit function is generalized in the sense that I should be able to split a column not knowing how many columns would be in the result.
- The cSplit function is vectorized, so a simple new_names = c(...) wouldn't work; it would have to be something like list(Sample = c("Background", "Allele", "Replicate", "Treatment"))
Any thoughts on those?
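To make the two concerns above concrete, here is a minimal base-R sketch of how a named list could be matched against whatever columns a split happens to produce. The helper name map_new_names is hypothetical and not part of splitstackshape; it assumes generated columns follow the col_1, col_2, ... pattern shown earlier, and it assumes (as one possible design choice) that columns without a supplied name simply keep their defaults.

```r
# Hypothetical helper, not part of splitstackshape.
# `generated`: column names after splitting, e.g. c("x", "y_1", "y_2")
# `new_names`: named list keyed by the original column name
map_new_names <- function(generated, new_names) {
  if (!is.list(new_names) || is.null(names(new_names))) {
    stop("new_names should be a named list")
  }
  mapping <- character(0)
  for (col in names(new_names)) {
    # Assumption: columns produced from `col` look like col_1, col_2, ...
    old <- generated[startsWith(generated, paste0(col, "_"))]
    new <- new_names[[col]]
    if (length(new) > length(old)) {
      stop("more names supplied for '", col, "' than columns produced")
    }
    # Assumption: if fewer names are supplied, the remaining columns keep their defaults
    mapping[old[seq_along(new)]] <- new
  }
  mapping  # named character vector: names are old columns, values are new names
}

generated <- c("x", "y_1", "y_2", "z_1", "z_2", "z_3")
map_new_names(generated, list(y = c("A", "B"), z = c("AA", "BB")))
```

The result maps y_1, y_2, z_1 and z_2 to their new names and leaves z_3 untouched, which is one way to cope with not knowing the number of result columns in advance.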
Thanks for considering this @mrdwab. I was thinking about the implementation after posting, and my very naive thought was to operate on the colnames after splitting. For instance, grepping the colnames and replacing only those:
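cSplit2 <- function(indt, splitCols, newNames, ...){
  # Use the function's own arguments rather than hard-coded values;
  # extra arguments such as sep are passed on to cSplit
  split <- cSplit(indt, splitCols, ...)
  # Positions of the columns generated by the split (e.g. Sample_1, Sample_2)
  newcols <- grep(paste0("^(", paste(splitCols, collapse="|"), ")_"), colnames(split))
  colnames(split)[newcols] <- newNames
  return(split)
}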
cSplit2(to_split, splitCols = "Sample", sep="_", newNames = c("Background", "Allele", "Replicate", "Treatment"))
This is of course the opposite of what you suggested :) but I wonder if it would be a good starting point.
@adomingues, here's a POC renamer function that I can probably drop in at the last stages of the existing cSplit function. Here, I'm just demonstrating it as an external function:
library(splitstackshape)
library(data.table)
df <- data.frame(x = 1:3, y = c("a", "d,e", "g,h"), z = c("1", "2,3,4", "6"))
renamer <- function(data, replacement) {
  if (!is.list(replacement)) stop("replacement should be a named list")
  for (i in seq_along(replacement)) {
    old <- names(data)[startsWith(names(data), names(replacement)[i])]
    setnames(data, old = old, new = replacement[[i]])
  }
  data[]
}
cSplit(df, c("y", "z"))
# x y_1 y_2 z_1 z_2 z_3
# 1: 1 a <NA> 1 NA NA
# 2: 2 d e 2 3 4
# 3: 3 g h 6 NA NA
renamer(cSplit(df, c("y", "z")),
list(y = c("A", "B"), z = c("AA", "BB", "CC")))
# x A B AA BB CC
# 1: 1 a <NA> 1 NA NA
# 2: 2 d e 2 3 4
# 3: 3 g h 6 NA NA
So, a possible final implementation might look like:
cSplit(df, c("y", "z"), sep = ",", new_names = list(y = c("A", "B"), z = c("AA", "BB", "CC")))
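Until something like that exists in the package, the same call shape can be approximated with a thin wrapper. The sketch below is only an illustration: cSplit_named and its new_names argument are hypothetical names, and it assumes the generated columns are named as the original column, an underscore, then digits, as in the output above.

```r
library(splitstackshape)
library(data.table)

# Hypothetical wrapper; cSplit itself has no new_names argument
cSplit_named <- function(indt, splitCols, sep = ",", new_names = NULL, ...) {
  out <- cSplit(indt, splitCols, sep = sep, ...)
  if (is.null(new_names)) return(out)
  if (!is.list(new_names) || is.null(names(new_names))) {
    stop("new_names should be a named list")
  }
  for (col in names(new_names)) {
    nm <- names(out)
    prefix <- paste0(col, "_")
    # Keep only columns that are exactly `col`, an underscore, then digits
    is_split <- startsWith(nm, prefix) &
      grepl("^[0-9]+$", substring(nm, nchar(prefix) + 1))
    # setnames() errors if the number of old and new names differs
    setnames(out, old = nm[is_split], new = new_names[[col]])
  }
  out[]
}

df <- data.frame(x = 1:3, y = c("a", "d,e", "g,h"), z = c("1", "2,3,4", "6"))
cSplit_named(df, c("y", "z"), sep = ",",
             new_names = list(y = c("A", "B"), z = c("AA", "BB", "CC")))
```

Matching on the full prefix plus digits, rather than startsWith on the column name alone, avoids accidentally renaming an unrelated column whose name merely begins with the same letters.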
Alternatively, the entire API can be revisited such that, depending on the input, the function behaves differently:
- If a simple character string of column names is provided, use the current approach.
- If a list is provided in the splitCols argument, new names can be specified (eg: cSplit(df, splitCols = list(y = c("A", "B"), z = c("AA", "BB", "CC")), sep = ","))
Let me think about it some more, but I'm open to other ideas as well as I'm currently planning a V2 release of the package later this year.
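The input-dependent behaviour described above could be handled by normalizing the splitCols argument before any splitting happens. This is a rough base-R sketch under that assumption; normalize_split_cols is a hypothetical helper, not an existing function in the package.

```r
# Hypothetical helper: accept either column names/positions or a named list
normalize_split_cols <- function(splitCols) {
  if (is.character(splitCols) || is.numeric(splitCols)) {
    # Current approach: columns only, default names are kept
    return(list(cols = splitCols, new_names = NULL))
  }
  if (is.list(splitCols) && !is.null(names(splitCols)) &&
      all(nzchar(names(splitCols)))) {
    # List input: names are the columns to split, values are the new names
    return(list(cols = names(splitCols), new_names = splitCols))
  }
  stop("splitCols should be a character vector or a fully named list")
}

normalize_split_cols(c("y", "z"))
normalize_split_cols(list(y = c("A", "B"), z = c("AA", "BB", "CC")))
```

The rest of the function would then split on cols and rename only when new_names is not NULL, so existing calls with a plain character vector keep working unchanged.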
> If a list is provided in the splitCols argument, new names can be specified (eg: cSplit(df, splitCols = list(y = c("A", "B"), z = c("AA", "BB", "CC")), sep = ","))
This pretty much solves it, at least for me. Looking forward to V2.