This file contains descriptive statistics plots and includes the data cleaning process involved with it.
If you would like to see some of our data cleaning methods, click “show” to show the code corresponding to each step of the process below:
airplane_df = read_csv("datasets/airplane_crashes_data.csv", show_col_types = FALSE) |>
janitor::clean_names() |>
filter(ground != "NULL", aboard != "NULL") |>
# removes unnecessary columns for our analyses
select(-flight_number, -time, - registration)
airplane_df = airplane_df |>
mutate(
# remove leading and trailing spaces
date = str_trim(date),
# extract month, day, and year from the date string
m = as.numeric(sub("/.*", "", date)),
d = as.numeric(sub(".*/(.*)/.*", "\\1", date)),
y = as.numeric(sub(".*/(.*)$", "\\1", date)),
# convert 2-digit years (<100) to 4-digit (1900s)
y = ifelse(y < 100, y + 1900, y),
# rebuild the string
date_clean = paste(m, d, y, sep = "/"),
# convert to Date type
date = mdy(date_clean),
# extract numeric year and month
year = year(date),
month = month(date),
month_name = factor(month(date,
label = TRUE,
abbr = TRUE),
levels = month.abb)
) |>
select(-m, -d, -y, -date_clean) # remove unnecessary columns
airplane_df = airplane_df |>
mutate(
aboard = as.numeric(aboard),
fatalities = as.numeric(fatalities),
ground = as.numeric(ground),
operator = as.factor(operator) # to group by operator
)
airplane_df = airplane_df |>
mutate(
decade = floor(year / 10) * 10,
decade = paste0(decade, "s")
) |>
select(date, year, decade, month, month_name, everything())
airplane_df |>
summarise(
across(where(is.numeric), list(
mean = ~mean(.x, na.rm = TRUE),
sd = ~sd(.x, na.rm = TRUE),
min = ~min(.x, na.rm = TRUE),
max = ~max(.x, na.rm = TRUE),
missing = ~sum(is.na(.x))
), .names = "{col}_{fn}")
) |>
pivot_longer(everything(), names_to = c("variable","stat"), names_sep = "_") |>
pivot_wider(names_from = stat, values_from = value) |>
gt() |>
tab_header(
title = "Airplane Dataset Summary"
)
| Airplane Dataset Summary | |||||
| variable | mean | sd | min | max | missing |
|---|---|---|---|---|---|
| year | 1971.408327 | 22.336769 | 1908 | 2009 | 0 |
| month | 6.641520 | 3.544293 | 1 | 12 | 0 |
| aboard | 27.589190 | 43.109636 | 0 | 644 | 0 |
| fatalities | 20.104851 | 33.238341 | 0 | 583 | 0 |
| ground | 1.611154 | 54.039316 | 0 | 2750 | 0 |
Our dataset, airplane_df, has 5,236 rows and 14
variables (columns). Some variables had missing values, such as
location, route, type,
cn_in, summary, and operator, but
our key numeric variables had no missing values. This summary table
outlines the mean, sd, and minimum and maximum value of the key numeric
variables in our dataset.
The cleaned dataset included the following variables:
date: A date variable representing the date of the
crash.
year: The year of the crash, stored as a numeric
value.
decade: The decade of the crash, stored as a
character.
month: The month of the crash, stored as a numeric
value.
month_name: The name of the month of the crash,
stored as an ordered factor.
location: The location of the crash, stored as a
character.
operator: The plane operator type, stored as a
factor.
route: The route the plane was taking, stored as a
character.
type: The aircraft type, stored as a
character.
cn_in: A unique identifier of the
construction/serial number of the plane.
aboard: The number of people aboard the plane,
stored as a numeric value.
fatalities: The number of people who died on the
plane, stored as a numeric value.
ground : The number of people who died on the ground
related to the crash, stored as a numeric value.
summary: A description explaining the cause of the
plane crash, stored as a character.
airplane_df |>
count(year) |>
plot_ly(
x = ~year,
y = ~n,
type = "scatter",
mode = "lines+markers",
hovertemplate = "Year: %{x}<br>Crashes: %{y}<extra></extra>"
) |>
layout(
title = list(
text = "Airplane Crashes per Year",
y = 0.98,
font = list(size = 20)
),
xaxis = list(
title = list(
text = "Year",
font = list(size = 16)
)
),
yaxis = list(
title = list(
text = "Number of Crashes",
font = list(size = 16)
)
)
)
In the early years of flight, there were not as many airplanes in the sky, so the relatively low number of crashes from the early 1900s to 1940 make sense. Beginning in the late 1940s to early 50’s, more people started flying, which coincides with an increase in crashes. The commercial jet service was introduced in 1952, which is where we see a steady increase in crashes after that.
plot_ly(
data = airplane_df,
x = ~year,
y = ~aboard,
type = 'scatter',
mode = 'markers',
marker = list(color = 'orange', size = 6),
text = ~paste("Crash Year:", year, "<br>People Aboard:", aboard),
hoverinfo = 'text'
) |>
layout(
title = "Number of People Aboard per Crash Over Time",
xaxis = list(title = "Year"),
yaxis = list(title = "People Aboard")
)
It is useful to see the general trends in the number of people on planes that crashed per year. Overall, there is a steady upward trend in the number of people on the crashed planes as aircrafts could hold more people.
I have excluded the year 2001, which was an outlier and had 5,641 fatalities, because of the 9/11 attacks.
# calculate the avg fatality rate per year
fatality_trends = airplane_df |>
filter (aboard > 0) |>
filter(year != 2001) |>
mutate(fatality_rate = fatalities/aboard) |>
group_by(year) |>
summarize(avg_fatality_rate = mean(fatality_rate, na.rm = TRUE),
n_crashes = n())
# plot the trends interactively
plot_ly(
data = fatality_trends,
x = ~year,
y = ~avg_fatality_rate,
type = 'scatter',
mode = 'lines+markers',
line = list(color = 'blue', width = 2),
marker = list(color = 'blue', size = 6),
text = ~paste("Year:", year,
"<br>Avg Fatality Rate:", round(avg_fatality_rate, 3),
"<br>Number of Crashes:", n_crashes),
hoverinfo = 'text'
) |>
layout(
title = "Average Fatality Rate per Airplane Crash Over Years",
xaxis = list(title = "Year"),
yaxis = list(title = "Average Fatality Rate")
)
The average airplane fatality rate per year was as high as 100% in the 1910s and 1920s and shows a downward trend as the years progress, showing that, when there is a crash, there are less fatalities on the airplane. Although, the percentage is still well above 50%.
airplane_df |>
group_by(operator) |>
summarise(total_crashes = n(), .groups = "drop") |>
slice_max(total_crashes, n = 15) |> # select the top 15 operators
ggplot(aes(x = reorder(operator, total_crashes), y = total_crashes )) +
geom_col(fill = "blue") +
coord_flip() +
labs(
title = "Airline Operators with Most Crashes",
x = "Airline Operators",
y = "Number of Crashes"
)

Aeroflot, the largest airline in Russia, is by far the commercial airline with the greatest crashes at 187 total crashes. The U.S. Air Force, a military airline, comes in close second at 182 total crashes.
airplane_df |>
filter(!is.na(type)) |>
group_by(type) |>
summarize(total_crashes = n(), .groups = "drop") |>
slice_max(total_crashes, n = 15) |>
ggplot(aes(x = reorder(type, total_crashes), y = total_crashes)) +
geom_col(fill = "darkgreen") +
coord_flip() +
labs(
title = "Aircrafts with the Most Crashes",
x = "Number of Crashes",
y = "Aircraft Type"
)

The Douglas DC-3 has had by far the most crashes out of all the aircraft types. Most of the planes in this graph are Douglas aircrafts. Interestingly, the Douglas DC-3 became the most successful airliner in the formative years of air transformation, and was the first to fly profitably without government subsidy.
# heatmap of crashes by year and month
airplane_df |>
count(decade, month_name) |> # counts crashes in the decade/month
ggplot(aes(x = decade, y = month_name, fill = n)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "white", high = "red") +
labs(title = "Heatmap of Airplane Crashes by Year and Month",
x = "Decade",
y = "Month",
fill = "Number of Crashes")

Based on this map, the 1970s, particularly the January, December, October, September, and August months have especially high numbers of crashes compared to other decades and months. July in the 1990s, also had a particularly high number of crashes, indicated by the darker red color. The monthly and seasonal trends will be looked at in more detail using ANOVA.