Welcome to our tutorial on model fitting with the R caret package. We go through each step of the workflow: loading the required libraries, preparing the data, splitting it into training and testing sets, tuning hyperparameters with cross-validation, and evaluating model performance on held-out data.
What You Will Learn:
Setting up your environment by loading required libraries.
Preprocessing data and converting variables.
Splitting data into training and testing sets so that performance is measured on data the model has not seen (this is how overfitting is detected).
Implementing cross-validation techniques to ensure model robustness.
Detailed explanation of hyperparameter tuning using the random forest model.
Evaluating model performance using confusion matrices and understanding variable importance.
Whether you're a student, researcher, or data science professional, this video equips you with the skills to perform sophisticated data analyses in R. Don't forget to engage with us by liking, subscribing, and commenting on your thoughts or questions below!
Link to download the data:
https://docs.google.com/spreadsheets/d/1OpOOcOB-k-nopu87mDElvRgC8nie-wJ7/edit?usp=drive_link&ouid=109661670790390446227&rtpof=true&sd=true
# code
## Comprehensive Steps for Model Fitting Using the Caret Package
# Step 1: Load Required Libraries
# Step 2: Read Data
# Step 3: Preprocess Data
# Step 4: Split Data into Training and Testing Sets
# Step 5: Set Up Cross-Validation
# Step 6: Define Hyperparameter Tuning Grid
# Step 7: Train the Model
# Step 8: Evaluate Model Performance
# Step 9: Variable Importance
# Actual code implementation
# Load Required Libraries
library(caret) # for modeling
library(readxl) # for reading Excel files
library(randomForest) # for using the randomForest method
# Set working directory and Read Data
setwd("E:\\Rworks\\model fitting using caret package")
data = read_xlsx("Raisin_Dataset.xlsx")
str(data)
unique(data$Class)
# Ensure class variable is a factor
data$Class = as.factor(data$Class)
str(data)
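# Optional sanity checks before modeling (a minimal sketch, not shown in the video).
# Assumes `data` was read as above and has the `Class` column.
# The variable names class_counts and missing_per_column are only illustrative.
class_counts = table(data$Class) # number of rows per class
print(class_counts)
print(round(prop.table(class_counts), 3)) # class shares, to spot imbalance
missing_per_column = colSums(is.na(data)) # missing values in each column
print(missing_per_column)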
# Splitting data into training and testing sets
set.seed(546) # Ensure reproducibility
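training_indices = createDataPartition(data$Class, p = 0.8, list = FALSE)
is.matrix(training_indices) # with list = FALSE the result is a one-column matrix
dim(training_indices)
# read_xlsx() returns a tibble, so index rows with a plain vector rather than a matrix
training_set = data[training_indices[, 1], ]
testing_set = data[-training_indices[, 1], ]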
dim(testing_set)
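# createDataPartition() samples within each class, so the class proportions
# in both subsets should stay close to those of the full data.
# A quick check (illustrative sketch; train_share and test_share are made-up names):
train_share = round(prop.table(table(training_set$Class)), 3)
test_share = round(prop.table(table(testing_set$Class)), 3)
print(train_share)
print(test_share)
# The two subsets together should account for every row of the original data
nrow(training_set) + nrow(testing_set) == nrow(data)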
# Setting up 10-fold cross-validation
train_control = trainControl(
  method = "cv",
  number = 10,
  savePredictions = "final", # keep hold-out predictions for the best tuning value
  classProbs = TRUE, # compute class probabilities (needed by twoClassSummary)
  # Note: importance = TRUE is a randomForest argument; pass it to train(), not to trainControl()
  summaryFunction = twoClassSummary # report ROC, sensitivity and specificity for binary classification
)
# Defining hyperparameter tuning grid for the randomForest method
tune_grid = expand.grid(
mtry = c(2, 4, 6) # Number of variables considered at each split
)
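# mtry is the number of predictors sampled at each split, so it should not
# exceed the number of predictors available. A small guard, assuming every
# column other than Class is used as a numeric predictor (as in Class ~ .):
n_predictors = ncol(training_set) - 1 # n_predictors is an illustrative name
print(n_predictors)
stopifnot(all(tune_grid$mtry >= 1), all(tune_grid$mtry <= n_predictors))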
# Training the model using randomForest
model = train(
  Class ~ .,
  data = training_set,
  method = "rf", # random forest via the randomForest package
  metric = "ROC", # twoClassSummary reports ROC/Sens/Spec, so choose mtry by ROC
  trControl = train_control,
  tuneGrid = tune_grid
)
model
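# Inspecting the tuning results (a short sketch): one row per mtry value, with the
# cross-validated ROC, sensitivity and specificity that twoClassSummary produces.
cv_results = model$results[, c("mtry", "ROC", "Sens", "Spec")] # cv_results is an illustrative name
print(cv_results)
plot(model) # resampled performance against mtry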
# Extracting and plotting variable importance using caret's function
importance = varImp(model, scale = TRUE)
#barplot(importance$importance$Overall, names.arg = rownames(importance$importance))
plot(importance, main = "Variable Importance Plot")
attributes(importance)
# Evaluate model performance
print(model)
confusionMatrix(predict(model, testing_set), testing_set$Class)
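# The confusion matrix uses hard class labels. Since class probabilities are
# available, the test-set AUC can also be computed. This is a sketch that assumes
# the pROC package is installed (caret's twoClassSummary relies on it); the names
# test_probs, positive_class and roc_obj are illustrative.
test_probs = predict(model, testing_set, type = "prob") # one probability column per class
positive_class = levels(testing_set$Class)[1] # caret treats the first level as the event
roc_obj = pROC::roc(
  response = testing_set$Class,
  predictor = test_probs[[positive_class]],
  levels = rev(levels(testing_set$Class)), # c(control, case): the first level is the case
  direction = "<" # cases are expected to have the higher probability
)
pROC::auc(roc_obj)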
model$bestTune # mtry value selected by cross-validation
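# Reduced model: random forest with Perimeter as the only predictor
# With a single predictor mtry can only be 1, so the grid c(2, 4, 6) does not apply here
tune_grid_red = expand.grid(mtry = 1)
model_red = train(
  Class ~ Perimeter,
  data = training_set,
  method = "rf", # random forest via the randomForest package
  metric = "ROC",
  trControl = train_control,
  tuneGrid = tune_grid_red
)
confusionMatrix(predict(model_red, testing_set), testing_set$Class)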
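# Side-by-side test accuracy of the full and the reduced model (a small sketch;
# acc_full and acc_red are illustrative names). confusionMatrix() stores the
# overall statistics in its $overall element.
acc_full = confusionMatrix(predict(model, testing_set), testing_set$Class)$overall["Accuracy"]
acc_red = confusionMatrix(predict(model_red, testing_set), testing_set$Class)$overall["Accuracy"]
round(c(full = unname(acc_full), reduced = unname(acc_red)), 3)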
#DataScience, #RProgramming, #ModelFitting, #MachineLearning, #CaretPackage, #statisticalmodeling
00:00 - Introduction
01:30 - Loading Libraries
03:00 - Data Preprocessing
05:00 - Splitting Data
07:30 - Setting up Cross Validation
10:00 - Hyperparameter Tuning
12:30 - Training the Model
15:00 - Evaluating Model Performance
18:00 - Studying Variable Importance
20:00 - Conclusion & Additional Tips
Facebook page:
https://www.facebook.com/RajendraChoureISC
Mail Id:
[email protected]
YouTube playlist:
https://www.youtube.com/playlist?list=PLfAzV0jqypOjX2h3YkeETd5RRO6f3VXpE
Tags:
#R_programming #Data_visulisation #statistics