R is a programming language for statistical computing and data analysis, widely used in data science, machine learning, and statistical modeling. It is particularly popular in academia, research, and finance because of its extensive built-in statistical capabilities.
Key Features & Use Cases
Key Features
- Designed for Statistical Computing – Built-in functions for data manipulation, visualization, and modeling.
- Interpreted Language – Runs code interactively, making it easy to test and analyze results.
- Rich Visualization – Libraries like ggplot2 and lattice create advanced plots and charts.
- Extensive Package Ecosystem – Thousands of community-contributed packages available via CRAN.
- Open Source & Cross-Platform – Runs on Windows, macOS, and Linux.
- Functional & Object-Oriented – Functions are first-class values, and built-in object systems (S3, S4, Reference Classes) support object-oriented code.
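To make the last feature concrete, here is a minimal base R sketch of both styles. The names make_scaler, twice, new_circle, and area are hypothetical and chosen only for illustration; S3 is used because it is the simplest of R's object systems.

```r
# Functional style: functions are values that can be passed around and returned.
squares <- sapply(1:5, function(x) x^2)  # apply an anonymous function to each element

make_scaler <- function(factor) {
  function(x) x * factor                 # closure that remembers `factor`
}
twice <- make_scaler(2)
doubled <- twice(c(1, 2, 3))

# Object-oriented style (S3): the method is chosen from the object's class attribute.
new_circle <- function(radius) {
  structure(list(radius = radius), class = "circle")
}
area <- function(shape, ...) UseMethod("area")           # generic function
area.circle <- function(shape, ...) pi * shape$radius^2  # method for class "circle"

circle_area <- area(new_circle(1))

print(squares)
print(doubled)
print(circle_area)
```

Calling area() on an object of another class would fail until a matching method (for example a hypothetical area.square) is defined, which is how S3 dispatch keeps behaviour tied to the class.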
Use Cases
- Statistical testing (t-tests, ANOVA)
- Data visualization (scatter plots, histograms)
- Data manipulation (filtering, summarizing data)
- Predictive modeling (linear regression, decision trees)
Statistical Testing Libraries
stats (Built-in)
- Part of the core R distribution, providing essential statistical functions (t-tests, linear models, ANOVA)
- Useful for quick, standard analyses without installing additional packages
# Example usage of stats package (built-in)
# No need to install or explicitly load 'stats' as it's included by default
# Perform a t-test
set.seed(42) # Make the simulated sample reproducible
sample_data <- rnorm(50, mean = 5, sd = 2)
t.test(sample_data, mu = 5)
# Fit a linear model
model <- lm(mpg ~ cyl + wt, data = mtcars)
summary(model)
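The stats package also covers ANOVA, which is listed above but not shown. A minimal sketch, assuming the question of interest is whether mean mpg differs across cylinder counts in mtcars (the choice of variables is illustrative only):

```r
# One-way ANOVA: does mean mpg differ between cars with 4, 6, and 8 cylinders?
# cyl is numeric in mtcars, so convert it to a factor to treat it as a grouping variable.
anova_fit <- aov(mpg ~ factor(cyl), data = mtcars)
anova_table <- summary(anova_fit)
print(anova_table)

# Extract the p-value of the F-test for the grouping factor
anova_p <- anova_table[[1]][["Pr(>F)"]][1]

# Pairwise follow-up comparisons between cylinder groups
print(TukeyHSD(anova_fit))
```

Like t.test() and lm(), both aov() and TukeyHSD() are available without loading any additional package.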
car
- Provides advanced regression diagnostics, ANOVA enhancements, and tools for linear hypothesis testing
- Useful for validating model assumptions (e.g., checking multicollinearity or normality of residuals)
# Installation
install.packages("car")
# Load the library
library(car)
# Example usage
data(mtcars)
lm_model <- lm(mpg ~ disp + hp, data = mtcars)
outlierTest(lm_model) # Identify potential outliers
vif(lm_model) # Check for multicollinearity
qqPlot(lm_model, main="QQ Plot") # Visualize residual distribution
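To show what a multicollinearity check measures, the variance inflation factor can be computed by hand in base R: regress each predictor on the others and take 1 / (1 - R²). This is a sketch of the idea, not the implementation used by car; the helper vif_by_hand is hypothetical.

```r
# Variance inflation factor for one predictor:
# regress it on the remaining predictors and compute 1 / (1 - R^2).
vif_by_hand <- function(predictor, others, data) {
  aux_formula <- reformulate(others, response = predictor)  # e.g. disp ~ hp
  aux_fit <- lm(aux_formula, data = data)
  1 / (1 - summary(aux_fit)$r.squared)
}

# Same predictors as the lm_model example: mpg ~ disp + hp
vif_disp <- vif_by_hand("disp", "hp", mtcars)
vif_hp <- vif_by_hand("hp", "disp", mtcars)
print(c(disp = vif_disp, hp = vif_hp))
```

With only two predictors the two values coincide, because each auxiliary regression has the same R². Values well above 1 indicate that a predictor is largely explained by the others.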
broom
- Converts statistical analysis objects (lm, glm, etc.) into tidy data frames
- Allows for easier manipulation and visualization of model summaries
# Installation
install.packages("broom")
# Load the library
library(broom)
# Example usage
model <- lm(mpg ~ cyl + wt, data = mtcars)
tidy_model <- tidy(model)
glance_model <- glance(model)
print(tidy_model) # Coefficients in a tidy data frame
print(glance_model) # Model-level statistics
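For comparison, the coefficient table can be assembled by hand in base R. This sketch only approximates what tidy() returns for an lm object (the column names are chosen to match broom's); it is meant to show the bookkeeping that broom removes.

```r
model <- lm(mpg ~ cyl + wt, data = mtcars)

# summary() stores coefficients as a matrix with row names, not a data frame
coef_table <- coef(summary(model))

tidy_by_hand <- data.frame(
  term      = rownames(coef_table),
  estimate  = coef_table[, "Estimate"],
  std.error = coef_table[, "Std. Error"],
  statistic = coef_table[, "t value"],
  p.value   = coef_table[, "Pr(>|t|)"],
  row.names = NULL
)
print(tidy_by_hand)
```

Each model class stores its results differently, so this manual step would have to be rewritten per model type, whereas tidy() gives the same shape across many of them.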
Data Visualization Libraries
ggplot2
- Provides a powerful system for creating complex and aesthetically pleasing visualizations
- Implements the “Grammar of Graphics,” allowing you to layer graphical elements (geoms) to build custom plots
- Part of the tidyverse, integrating seamlessly with dplyr and other data manipulation packages
# Installation
install.packages("ggplot2")
# Example usage
library(ggplot2)
df <- data.frame(
  category = c("A", "B", "C", "D"),
  values = c(10, 20, 15, 30)
)
ggplot(df, aes(x = category, y = values)) +
  geom_col() + # Equivalent to geom_bar(stat = "identity")
  ggtitle("Example Bar Plot")
plotly
- Allows interactive, web-based visualizations in R
- Integrates with ggplot2 or can be used standalone for dynamic charts and dashboards
# Installation
install.packages("plotly")
# Example usage
library(plotly)
df <- data.frame(
  x = 1:10,
  y = cumsum(rnorm(10)) # Random walk
)
plot_ly(df, x = ~x, y = ~y, type = "scatter", mode = "lines")
lattice
- Implements a trellis-style plotting system
- Offers conditioning and grouping for multi-panel plots, making it easier to compare subsets of data
# Installation (usually unnecessary: lattice is a recommended package bundled with standard R installations)
install.packages("lattice")
# Load the library
library(lattice)
# Example usage
xyplot(mpg ~ wt | factor(cyl), data = mtcars,
       main = "MPG vs Weight by Number of Cylinders",
       xlab = "Weight (1000 lbs)", ylab = "Miles Per Gallon")
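The call above demonstrates conditioning (one panel per cylinder count). Grouping, the other feature mentioned, overlays the subsets in a single panel instead. A minimal sketch using the groups argument; the legend layout is an arbitrary choice:

```r
library(lattice)

# Grouping: one panel, with points distinguished by cylinder count
grouped_plot <- xyplot(
  mpg ~ wt,
  data = mtcars,
  groups = factor(cyl),
  auto.key = list(columns = 3),  # draw a legend with one entry per group
  main = "MPG vs Weight, Grouped by Cylinders",
  xlab = "Weight (1000 lbs)",
  ylab = "Miles Per Gallon"
)
print(grouped_plot)  # lattice plots are objects; printing draws them
```

Conditioning tends to work better when the subsets overlap heavily, while grouping makes direct comparison on a shared scale easier.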
Data Manipulation Libraries
dplyr
- A grammar of data manipulation providing verbs for filtering, selecting, grouping, and summarizing data
- Helps write clean, readable code by “chaining” operations with the pipe (%>%)
- Works efficiently with in-memory data frames; for larger datasets, backends such as dtplyr (data.table) and dbplyr (databases) translate the same verbs
# Installation
install.packages("dplyr")
# Example usage
library(dplyr)
data <- data.frame(
  x = 1:5,
  y = c("A", "B", "B", "A", "C"),
  z = c(10, 20, 15, 25, 30)
)
summary_data <- data %>%
  filter(z > 15) %>%
  group_by(y) %>%
  summarize(MeanValue = mean(z))
print(summary_data)
tidyr
- Focuses on helping you create tidy data: each variable is a column, each observation is a row
- Includes functions like pivot_longer, pivot_wider, separate, and unite to reshape data
- Often used in combination with dplyr for complete data wrangling workflows
# Installation
install.packages("tidyr")
# Example usage
library(tidyr)
wide_data <- data.frame(
  id = 1:3,
  varA_2020 = c(10, 12, 14),
  varA_2021 = c(15, 16, 19)
)
long_data <- wide_data %>%
  pivot_longer(
    cols = starts_with("varA"),
    names_to = "Year",
    names_prefix = "varA_", # Strip the prefix so Year holds "2020" and "2021"
    values_to = "Value"
  )
print(long_data)
data.table
- An enhanced version of data frames with syntax for faster queries, aggregation, and joins
- Extremely performant for large-scale data operations
- Uses a compact DT[i, j, by] syntax that differs from the tidyverse approach and loosely mirrors SQL (WHERE, SELECT, GROUP BY)
# Installation
install.packages("data.table")
# Example usage
library(data.table)
dt <- data.table(
  x = 1:5,
  y = c("A", "B", "B", "A", "C"),
  z = c(10, 20, 15, 25, 30)
)
# Fast grouping and aggregation
summary_dt <- dt[z > 15, .(MeanValue = mean(z)), by = y]
print(summary_dt)
Predictive Modeling Libraries
caret
- A unifying interface for building machine learning models in R
- Provides consistent syntax for training, tuning, and evaluating models (classification, regression, etc.)
- Offers functions for feature selection, cross-validation, and performance metrics
# Installation
install.packages("caret")
# Example usage
library(caret)
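data(iris)
set.seed(123) # Make the train/test split reproducible
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
# Named train_data/test_data to avoid confusion with caret's train() function
train_data <- iris[trainIndex, ]
test_data <- iris[-trainIndex, ]
# method = "rpart" fits a decision tree via the rpart package
model <- train(Species ~ ., data = train_data, method = "rpart")
predictions <- predict(model, test_data)
confusionMatrix(predictions, test_data$Species)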
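Cross-validation is mentioned above but not shown. In caret it is configured through trainControl(method = "cv", number = 5) and passed to train() via trControl. To show what that automates, here is a hand-rolled k-fold sketch in base R. It is an illustration of the technique, not caret's implementation, and it uses a regression on mtcars (with arbitrarily chosen predictors) rather than the iris classification above.

```r
set.seed(123)  # reproducible fold assignment
k <- 5

# Assign each row to one of k folds at random, keeping fold sizes balanced
fold_id <- sample(rep(1:k, length.out = nrow(mtcars)))

# For each fold: fit on the other k - 1 folds, evaluate on the held-out fold
fold_rmse <- sapply(1:k, function(fold) {
  train_rows <- mtcars[fold_id != fold, ]
  test_rows <- mtcars[fold_id == fold, ]
  fit <- lm(mpg ~ wt + hp, data = train_rows)
  pred <- predict(fit, newdata = test_rows)
  sqrt(mean((test_rows$mpg - pred)^2))  # root mean squared error on the held-out fold
})

cv_rmse <- mean(fold_rmse)  # average held-out error across folds
print(fold_rmse)
print(cv_rmse)
```

Every observation is used for evaluation exactly once, so the averaged error is a less optimistic estimate than the error measured on the training data.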
forecast
- Provides methods and tools for forecasting time series data
- Implements popular models like ARIMA, ETS, and others, along with utilities to evaluate forecast accuracy
# Installation
install.packages("forecast")
# Example usage
library(forecast)
ts_data <- ts(c(123, 139, 140, 150, 169, 180), frequency = 1, start = 2015)
fit <- auto.arima(ts_data)
forecasted <- forecast(fit, h = 3)
plot(forecasted)
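Evaluating forecast accuracy usually means holding out the most recent observations and comparing them with forecasts made from the earlier ones. The forecast package provides accuracy() for this; the base R sketch below only illustrates the underlying arithmetic, using a naive forecast (repeat the last observed value) as a stand-in model.

```r
ts_data <- ts(c(123, 139, 140, 150, 169, 180), frequency = 1, start = 2015)

# Hold out the last two years for evaluation
train_ts <- window(ts_data, end = 2018)
test_ts <- window(ts_data, start = 2019)

# Naive forecast: every future value equals the last training observation
last_value <- as.numeric(tail(train_ts, 1))
naive_forecast <- rep(last_value, length(test_ts))

# Forecast errors and two common accuracy measures
errors <- as.numeric(test_ts) - naive_forecast
mae <- mean(abs(errors))      # mean absolute error
rmse <- sqrt(mean(errors^2))  # root mean squared error
print(c(MAE = mae, RMSE = rmse))
```

A fitted model such as the ARIMA above is only worth its complexity if it beats a simple baseline like this one on held-out data.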
tidymodels
- A collection of packages (parsnip, recipes, rsample, etc.) for modeling and machine learning within a “tidy” framework
- Simplifies the process of creating, tuning, and evaluating models using consistent syntax
# Installation (the "ranger" engine used below is a separate package and must also be installed)
install.packages(c("tidymodels", "ranger"))
# Load the libraries
library(tidymodels)
# Example usage
data(iris)
set.seed(123)
# Split data
iris_split <- initial_split(iris, prop = 0.8)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)
# Define a model (random forest)
rf_model <- rand_forest(mtry = 2, trees = 100, min_n = 5) %>%
  set_engine("ranger") %>%
  set_mode("classification")
# Create a workflow
iris_workflow <- workflow() %>%
  add_model(rf_model) %>%
  add_formula(Species ~ .)
# Fit the model
iris_fit <- iris_workflow %>% fit(data = iris_train)
# Evaluate
iris_preds <- predict(iris_fit, iris_test) %>%
  bind_cols(iris_test)
metrics(iris_preds, truth = Species, estimate = .pred_class)
rpart
- Implements recursive partitioning, allowing you to build decision trees for classification or regression
- Useful for interpretable models and handling non-linear relationships
# Installation (usually unnecessary: rpart is a recommended package bundled with standard R installations)
install.packages("rpart")
# Load the library
library(rpart)
# Example usage
model <- rpart(Species ~ ., data = iris, method = "class")
printcp(model) # Display the complexity parameter (CP) table
plot(model)
text(model, use.n = TRUE)
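Beyond plotting, a fitted tree is typically used for prediction. A minimal sketch follows; for brevity it predicts on the training data, so the resulting accuracy is optimistic and would need a held-out set to be meaningful.

```r
library(rpart)

tree_model <- rpart(Species ~ ., data = iris, method = "class")

# type = "class" returns predicted labels rather than class probabilities
tree_preds <- predict(tree_model, newdata = iris, type = "class")

# Cross-tabulate predictions against the true labels
confusion <- table(Predicted = tree_preds, Actual = iris$Species)
print(confusion)

# Share of correctly classified rows (training accuracy)
train_accuracy <- mean(tree_preds == iris$Species)
print(train_accuracy)
```

Omitting type = "class" makes predict() return a matrix of class probabilities, which is useful when a decision threshold other than the most likely class is needed.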
randomForest
- Implements the random forest algorithm, which combines multiple decision trees to improve predictive accuracy
- Handles both classification and regression tasks while reducing the risk of overfitting
# Installation
install.packages("randomForest")
# Load the library
library(randomForest)
# Example usage
data(iris)
set.seed(123)
rf_model <- randomForest(Species ~ ., data = iris, ntree = 100)
print(rf_model)
varImpPlot(rf_model) # Importance of each feature
Miscellaneous
shiny
- A framework for building interactive web applications directly from R
- Ideal for turning analyses and visualizations into shareable, browser-based tools
# Installation
install.packages("shiny")
# Example usage
library(shiny)
ui <- fluidPage(
  sliderInput("obs", "Number of observations:", min = 1, max = 100, value = 50),
  plotOutput("distPlot")
)
server <- function(input, output) {
  # Redraw the histogram whenever the slider value changes
  output$distPlot <- renderPlot({
    hist(rnorm(input$obs))
  })
}
shinyApp(ui = ui, server = server)