Rita94105 / k-fold

*k*-fold cross-validation in protein subcellular localization

Description

Perform k-fold cross-validation for tuning the following parameters of a random forest model.

  • ntree: 10
  • mtry: 75
  • maxnodes: 20

cmd

k_fold(k, './data/Archaeal_tfpssm.csv', 'performance.csv')

k-fold cross-validation

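  • Divide the data into k parts. In each fold, the parts are allocated to the three data sets as follows:
    • (training, validation, testing) = (k-2, 1, 1)
  • The figure below shows an example of 5-fold cross-validation, where each fold uses 3 parts for training, 1 for validation, and 1 for testing.

[Figure: 5-fold cross-validation]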
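
The (k-2, 1, 1) allocation can be sketched in base R as follows. This is a minimal illustration, not the repository's implementation: the helper names `make_folds` and `fold_roles` are hypothetical, and the choice of which part serves as validation (the part after the test part, wrapping around) is an assumption, since the README does not specify it.

```r
# Hypothetical helper: randomly assign each of n rows to one of k parts.
make_folds <- function(n, k, seed = 1) {
  set.seed(seed)
  sample(rep(seq_len(k), length.out = n))
}

# Hypothetical helper: for fold i, decide which parts play each role.
# Assumption: part i is the test set, the next part (wrapping around)
# is the validation set, and the remaining k - 2 parts are for training.
fold_roles <- function(i, k) {
  stopifnot(k >= 3, i >= 1, i <= k)
  test <- i
  validation <- i %% k + 1
  training <- setdiff(seq_len(k), c(test, validation))
  list(training = training, validation = validation, test = test)
}

# Example: 20 rows, 5-fold, roles for the first fold.
parts <- make_folds(n = 20, k = 5)
roles <- fold_roles(1, 5)
train_rows <- which(parts %in% roles$training)
valid_rows <- which(parts == roles$validation)
test_rows  <- which(parts == roles$test)
```

Rotating `i` from 1 to k gives every part one turn as the test set and one turn as the validation set.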

Input: Archaeal_tfpssm.csv

📁 Archaeal_tfpssm.csv download

This CSV has no header row. The columns are as follows:

  • V2: labels of proteins

    • CP: Cytoplasmic
    • CW: Cell Wall
    • EC: Extracellular
    • IM: Inner membrane
  • V3 ~ V5602: the gapped-dipeptide features of each protein
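
A minimal sketch of loading a file with this layout, assuming R's default `V1`, `V2`, ... column names from `read.csv(header = FALSE)`. The helper name `read_tfpssm` is hypothetical, and the toy file (including its first column, whose meaning the README does not describe) is made up purely for illustration.

```r
# Hypothetical helper: load the header-less CSV and separate labels from features.
read_tfpssm <- function(path) {
  d <- read.csv(path, header = FALSE, stringsAsFactors = FALSE)
  list(
    labels   = factor(d$V2),                   # V2: class label (CP, CW, EC, IM)
    features = d[, 3:ncol(d), drop = FALSE]    # V3 onwards: feature columns
  )
}

# Tiny made-up file with the same column layout, for illustration only.
tmp_path <- tempfile(fileext = ".csv")
writeLines(c("p1,CP,0.1,0.2", "p2,IM,0.3,0.4", "p3,CP,0.5,0.6"), tmp_path)
toy <- read_tfpssm(tmp_path)
table(toy$labels)
```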

Output format: performance.csv

  • accuracy = P/N, reported for each fold; the ave. row is the average over the k folds

set    training  validation  test
fold1  0.93      0.91        0.88
fold2  0.92      0.91        0.89
fold3  0.94      0.92        0.90
fold4  0.91      0.89        0.87
fold5  0.90      0.92        0.87
ave.   0.92      0.91        0.88
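
One way to assemble a table of this shape is sketched below. It assumes that P is the number of correct predictions and N the total number of predictions, which the README does not state explicitly. The helper names `accuracy` and `build_performance` are hypothetical, and the numbers are made up for illustration.

```r
# Accuracy as the fraction of predictions that match the true labels
# (assumption: P = correct predictions, N = all predictions).
accuracy <- function(truth, pred) {
  mean(as.character(truth) == as.character(pred))
}

# Hypothetical helper: turn a k x 3 matrix of per-fold accuracies
# (columns: training, validation, test) into the performance table
# and append the "ave." row.
build_performance <- function(acc) {
  k <- nrow(acc)
  perf <- data.frame(
    set = c(paste0("fold", seq_len(k)), "ave."),
    round(rbind(acc, colMeans(acc)), 2)
  )
  names(perf) <- c("set", "training", "validation", "test")
  perf
}

# Made-up accuracies for two folds, for illustration only.
acc <- matrix(c(0.9, 0.8, 0.7,
                0.7, 0.6, 0.5), nrow = 2, byrow = TRUE)
perf <- build_performance(acc)
out_path <- tempfile(fileext = ".csv")
write.csv(perf, out_path, row.names = FALSE, quote = FALSE)
```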

Code for reference

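library(randomForest)

k_fold <- function(fold, input_file, output_file){

  # read the header-less CSV: V2 holds the labels, V3 onwards the features
  d <- read.csv(input_file, header = FALSE)
  d$V2 <- as.factor(d$V2)
  # TODO: split d into `fold` parts and build the training, validation and
  # test sets for each fold; tmp stands in for one fold's training set here
  tmp <- d
  # model using random forest & tune best parameters
  model <- randomForest(x = tmp[, 3:ncol(tmp)], y = tmp$V2,
                        ntree = 10, mtry = 75, maxnodes = 20)
  # make confusion matrix table (without newdata, predict() returns
  # out-of-bag predictions)
  resultframe <- data.frame(truth = tmp$V2,
                            pred = predict(model, type = "class"))
  # output the results (replace with the per-fold accuracy table)
  write.csv(resultframe, output_file, row.names = FALSE)

  return(model)
}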

References

Please list the code and its reference.

If needed, you should explain the details, i.e., comment like # ChatGPT, respond to “your prompt,” February 16, 2023.

Data Set:

Code:
