Data Analysis of Aviation Crashes
In recent years, airplane crashes seem to receive intense media attention, often framed as sensational “click-bait” headlines. This can create the impression that commercial air travel is increasingly dangerous, even though aviation is widely regarded as one of the safest modes of transportation.
Our project is motivated by two main goals:
Core Questions
As we started the project, our main questions were:
Temporal trends
Aircraft and operators
Seasonality and environment
As data exploration continued over the following weeks, additional questions were developed:
The dataset was sourced as a CSV file from Kaggle and stored in the datasets
folder within our project directory. We read this dataset into R using
read_csv() and maintained all code in an R Project linked
to a public GitHub repository for reproducibility. An additional dataset
was created by our group, which contains geocoded latitude and longitude
information for each crash location. Because the Kaggle file lists
location as text (e.g., “Paris, France” or “5 miles from Cairo”), our
group geocoded each unique location string and stored the results in
geocoded_cache.csv. This allowed us to build interactive
maps to see the spatial distribution of crashes worldwide.
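The caching idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the group's actual geocoding code: the toy rows, the object names, and the cache columns (location_clean, Latitude, Longitude, borrowed from the join used for the maps) are hypothetical, and the geocoding call itself is left out because the report does not say which service was used.

```r
library(dplyr)

# Toy stand-in for the Kaggle data: location is free text and repeats
crashes <- tibble(
  location = c("Paris, France", "5 miles from Cairo",
               "Paris, France", "Lima, Peru")
)

# Toy stand-in for geocoded_cache.csv (column names are assumed)
geocoded_cache <- tibble(
  location_clean = "Paris, France",
  Latitude = 48.8566,
  Longitude = 2.3522
)

# Geocode each unique string once, and only if it is not cached yet
to_geocode <- crashes |>
  distinct(location) |>
  anti_join(geocoded_cache, by = c("location" = "location_clean"))

to_geocode$location
```

Only the strings left in to_geocode would need to be sent to a geocoder, and their results could then be appended to the cache file.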
Before performing geographic or temporal analysis, our dataset required substantial cleaning due to inconsistent formatting and missing coordinate data. We wrote the preprocessing code used throughout the analysis pipeline. It included four major steps:
Rows with missing or placeholder values in aboard and ground were
removed to ensure cleaner data for analysis. The flight_number, time,
and registration columns were dropped because they did not contain
information relevant to our analysis.
# packages used throughout the cleaning pipeline
library(tidyverse)
library(lubridate)

airplane_df = read_csv("datasets/airplane_crashes_data.csv", show_col_types = FALSE) |>
  janitor::clean_names() |>
  # drop rows where ground or aboard holds the "NULL" placeholder
  filter(ground != "NULL", aboard != "NULL") |>
  # removes unnecessary columns for our analyses
  select(-flight_number, -time, -registration)
The date column was initially stored as a string, so we first parsed it
manually into numeric month, day, and year. Two-digit years were then
converted to four-digit years (in the 1900s) so that they would not be
misinterpreted during parsing. Columns for the numeric year, numeric
month, and month name were created for easier grouping, plotting, and
analysis.
airplane_df = airplane_df |>
  mutate(
    # remove leading and trailing spaces
    date = str_trim(date),
    # extract month, day, and year from the date string
    m = as.numeric(sub("/.*", "", date)),
    d = as.numeric(sub(".*/(.*)/.*", "\\1", date)),
    y = as.numeric(sub(".*/(.*)$", "\\1", date)),
    # convert 2-digit years (<100) to 4-digit (1900s)
    y = ifelse(y < 100, y + 1900, y),
    # rebuild the string
    date_clean = paste(m, d, y, sep = "/"),
    # convert to Date type
    date = mdy(date_clean),
    # extract numeric year and month
    year = year(date),
    month = month(date),
    month_name = factor(month(date,
                              label = TRUE,
                              abbr = TRUE),
                        levels = month.abb)
  ) |>
  select(-m, -d, -y, -date_clean) # remove unnecessary columns
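To see why the two-digit years are handled by hand, the same regular expressions can be run on a few toy strings. This is a base R sketch with made-up dates; it uses as.Date() instead of mdy() only to stay dependency-free. The last line is included for contrast, since R's default %y parsing typically maps small two-digit years to the 2000s rather than the 1900s.

```r
# Toy date strings (hypothetical), one with stray whitespace
date_raw <- c("09/17/1908", "07/12/12", " 08/06/13 ")
date_trim <- trimws(date_raw)

# Same extraction logic as the cleaning step above
m <- as.numeric(sub("/.*", "", date_trim))
d <- as.numeric(sub(".*/(.*)/.*", "\\1", date_trim))
y <- as.numeric(sub(".*/(.*)$", "\\1", date_trim))

# Two-digit years are assumed to belong to the 1900s
y <- ifelse(y < 100, y + 1900, y)

parsed <- as.Date(sprintf("%04d-%02d-%02d", y, m, d))
format(parsed)

# For contrast: letting %y infer the century for "12"
as.Date("07/12/12", format = "%m/%d/%y")
```

The assumption that every two-digit year belongs to the 1900s fits a dataset that starts in 1908, but it would need revisiting if two-digit years from the 2000s were present.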
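The aboard, fatalities, and ground columns were converted to numeric,
and operator was converted to a factor so that crashes could be grouped
by operator.
airplane_df = airplane_df |>
  mutate(
    aboard = as.numeric(aboard),
    fatalities = as.numeric(fatalities),
    ground = as.numeric(ground),
    operator = as.factor(operator) # to group by operator
  )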
A decade column was created so that plots could be grouped by decade.
airplane_df = airplane_df |>
  mutate(
    decade = floor(year / 10) * 10,
    decade = paste0(decade, "s")
  ) |>
  select(date, year, decade, month, month_name, everything())
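As a quick worked example of the decade arithmetic, the helper below (decade_label is a hypothetical name used only for this illustration) applies the same floor-and-paste logic to a few years.

```r
# Hypothetical helper wrapping the decade logic from the cleaning step
decade_label <- function(year) {
  paste0(floor(year / 10) * 10, "s")
}

decade_label(c(1908, 1945, 1999, 2009))
```

Dividing by 10, flooring, and multiplying back drops the final digit, so 1908 lands in the 1900s and 2009 in the 2000s.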
The data cleaning process was the same across all group members. Once it was done, I moved on to creating interactive maps. An overview of those steps follows.
df_map <- airplane_df |>
  left_join(geocoded_locations,
            by = c("location" = "location_clean")) |>
  filter(!is.na(Latitude), !is.na(Longitude))
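Because the join keeps every crash and the filter then drops rows without coordinates, it can be useful to count how many rows are lost at this step. The sketch below uses toy data with hypothetical values and object names; only the column names follow the join above.

```r
library(dplyr)

# Toy stand-ins (values are made up for illustration)
airplane_toy <- tibble(
  location = c("Paris, France", "Unknown field", "Lima, Peru"),
  fatalities = c(3, 1, 12)
)
geocoded_toy <- tibble(
  location_clean = c("Paris, France", "Lima, Peru"),
  Latitude = c(48.8566, -12.0464),
  Longitude = c(2.3522, -77.0428)
)

# left_join keeps unmatched crashes and fills their coordinates with NA
joined <- airplane_toy |>
  left_join(geocoded_toy, by = c("location" = "location_clean"))

df_map_toy <- joined |>
  filter(!is.na(Latitude), !is.na(Longitude))

# How many crashes cannot be mapped, and which location strings are they?
n_dropped <- nrow(joined) - nrow(df_map_toy)
unmatched <- joined |>
  filter(is.na(Latitude)) |>
  pull(location)
```

Reporting n_dropped alongside the map makes it clear how much of the dataset the map actually represents.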
All components update simultaneously without reloading.
shared = SharedData$new(df_display, group = "crashes")
This enabled an interactive interface where users can:
- choose an operator
- filter by year
- select a fatality threshold
- see both the map and table update instantly
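One way such a linked interface could be wired together is sketched below with crosstalk, leaflet, and DT. This is an assumption about the setup rather than the project's actual code: the toy rows, the widget ids and labels, and the layout widths are hypothetical, and only SharedData$new(df_display, group = "crashes") comes from the report.

```r
library(crosstalk)
library(leaflet)
library(DT)

# Toy stand-in for df_display (columns and values are hypothetical)
df_display <- data.frame(
  operator = c("Airline A", "Airline B", "Airline A"),
  year = c(1965, 1988, 2003),
  fatalities = c(12, 0, 47),
  Latitude = c(48.86, 30.04, -12.05),
  Longitude = c(2.35, 31.24, -77.04)
)

# One shared object so every widget filters the same rows
shared <- SharedData$new(df_display, group = "crashes")

# Filter widgets: operator dropdown, year range, fatalities range
controls <- list(
  filter_select("operator", "Operator", shared, ~operator),
  filter_slider("year", "Year", shared, ~year, sep = ""),
  filter_slider("fatalities", "Fatalities", shared, ~fatalities)
)

# Map and table both read from the shared object
crash_map <- leaflet(shared) |>
  addTiles() |>
  addCircleMarkers(lng = ~Longitude, lat = ~Latitude)

crash_table <- datatable(shared)

# Controls on the left, map and table on the right
dashboard <- bscols(widths = c(3, 9), controls, list(crash_map, crash_table))
```

Because the filtering happens in the browser, this approach works in a static HTML page without a Shiny server.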
We used methods including ANOVA to statistically investigate questions about our dataset. To see the code we used to conduct our statistical analyses along with our figures, check out our Statistical Analysis page.
Seasonal Trends: Does the average number of crash events differ across months?
To investigate potential seasonal trends in aviation safety, we focused on data from the Northern Hemisphere and aggregated crash counts by month. We performed a One-Way ANOVA to test the null hypothesis that accident frequency is evenly distributed throughout the year.
Significant Finding: The analysis yielded a p-value of 0.039, falling below the 0.05 significance threshold.
We reject the null hypothesis. There is statistically significant evidence that crash frequencies vary by month in the Northern Hemisphere.
This finding suggests that seasonal factors (such as winter weather conditions) may contribute to accident rates. However, a secondary analysis of fatality rates across months did not yield significant results, implying that while season affects how often crashes occur, it does not necessarily determine how survivable those crashes are.
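A one-way ANOVA of this kind can be sketched in base R as follows. The data are simulated, so the numbers it produces say nothing about the real result, and the unit of observation (one crash count per year and month) is an assumption, since the report does not state how the counts were aggregated.

```r
set.seed(1)

# Simulated crash counts: one row per year and month (assumed layout)
monthly_counts <- expand.grid(
  year = 1990:2009,
  month_name = factor(month.abb, levels = month.abb)
)
monthly_counts$n_crashes <- rpois(nrow(monthly_counts), lambda = 5)

# One-way ANOVA: does the mean count differ across the 12 months?
fit <- aov(n_crashes ~ month_name, data = monthly_counts)
anova_table <- summary(fit)[[1]]

# p-value for the month effect (first row of the ANOVA table)
p_value <- anova_table[["Pr(>F)"]][1]
p_value
```

With 12 months, the month effect has 11 degrees of freedom, and a p-value below 0.05 would lead to rejecting the null hypothesis of equal monthly means.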
Survivability Trends: Is the fatality rate lower over time?
We evaluated long-term improvements in aviation survivability by modeling the relationship between time (Year) and Fatality Rate (Fatalities / Aboard) using simple linear regression.
The low R-squared value (0.015) shows that, although aggregate risk has decreased historically, time itself explains very little of the variance across individual crashes. The outcome of any specific incident is determined primarily by immediate situational factors rather than by the era in which it occurred, and each year's average fatality rate remains largely driven by random variation.
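The regression and its R-squared can be reproduced in outline with simulated data. Everything below is a hypothetical stand-in: the toy rates are drawn with a slight downward trend and a lot of noise purely to illustrate how a real trend can coexist with a low R-squared, and the values bear no relation to the reported 0.015.

```r
set.seed(42)

# Simulated crashes (hypothetical); in the real data the response
# would be fatality_rate = fatalities / aboard
crash_toy <- data.frame(year = sample(1950:2009, 200, replace = TRUE))
noisy_rate <- 0.9 - 0.002 * (crash_toy$year - 1950) + rnorm(200, sd = 0.3)
crash_toy$fatality_rate <- pmin(1, pmax(0, noisy_rate))

# Simple linear regression of fatality rate on year
fit <- lm(fatality_rate ~ year, data = crash_toy)

slope <- coef(fit)[["year"]]          # change in rate per year
r_squared <- summary(fit)$r.squared   # share of variance explained by year
c(slope = slope, r_squared = r_squared)
```

The slope describes the average change per year, while R-squared describes how tightly individual crashes follow that average, which is why the two can point in seemingly different directions.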
Historical Frequency: Has the number of crash events increased or decreased over the years?
To describe the historical trajectory of aviation safety, we analyzed the total number of crashes per year (1908–2009). We compared two modeling approaches: a Simple Linear Regression and a LOESS (Locally Estimated Scatterplot Smoothing) model.
The linear model showed a significant upward trend but had a relatively high error, so a purely linear interpretation is misleading. The LOESS model captures the non-linear history of aviation more faithfully: a sharp rise in accidents during the mid-20th century (the expansion of commercial aviation) followed by a distinct decline in recent decades, which is consistent with technological and regulatory advancements.
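The comparison between the two models can be sketched with simulated yearly counts that rise and then fall. The data, the span value, and the use of in-sample RMSE as the error measure are all assumptions made for illustration; the report does not state which error metric was used.

```r
set.seed(7)

# Simulated yearly crash counts with a mid-century peak (hypothetical)
yearly <- data.frame(year = 1908:2009)
yearly$n_crashes <- round(
  20 + 70 * exp(-((yearly$year - 1975) / 25)^2) +
    rnorm(nrow(yearly), sd = 4)
)

# Straight line versus a local smoother
fit_lm <- lm(n_crashes ~ year, data = yearly)
fit_loess <- loess(n_crashes ~ year, data = yearly, span = 0.5)

# In-sample root mean squared error for each model
rmse <- function(fit) sqrt(mean(residuals(fit)^2))
errors <- c(linear = rmse(fit_lm), loess = rmse(fit_loess))
errors
```

A flexible smoother will nearly always have a lower in-sample error than a straight line, so the comparison is most informative when read alongside a plot of both fits.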
Cause Analysis: What are the primary recurring themes?
To identify the common causes of accidents, we moved beyond numerical testing and performed text mining on the unstructured crash summary descriptions. We used text tokenization and stop-word removal, followed by frequency analysis (Word Cloud).
The qualitative data suggest that mechanical failure (engine) and the critical phases of flight (takeoff/landing/approach) are the predominant contexts for historical accidents. Environmental factors, indicated by terms like “fog” and “weather,” also appear to play a substantial role.
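The tokenization, stop-word removal, and frequency count can be illustrated in base R on a few made-up summaries. The report does not say which text-mining package was used, so this sketch swaps in plain string functions and a tiny hand-written stop-word list; both are assumptions for illustration only.

```r
# Made-up crash summaries (hypothetical)
summaries <- c(
  "Engine failure shortly after takeoff.",
  "Crashed on approach in heavy fog.",
  "The engine caught fire during landing."
)

# Tiny illustrative stop-word list; a real analysis would use a fuller one
stop_words_small <- c("the", "on", "in", "after", "during", "shortly")

# Tokenize: lowercase, then split on anything that is not a letter
tokens <- unlist(strsplit(tolower(summaries), "[^a-z]+"))

# Drop empty strings and stop words
tokens <- tokens[tokens != "" & !(tokens %in% stop_words_small)]

# Word frequencies, most common first
word_freq <- sort(table(tokens), decreasing = TRUE)
word_freq
```

A word cloud is a visual rendering of a frequency table like word_freq, with more frequent words drawn larger.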
In this final stage of the project, we returned to the questions from our Motivation and Initial Questions sections to reflect on what we learned.
Combined with general knowledge of the industry, the data suggest that as the total number of flights and passengers has increased over the decades, the aviation industry has become significantly safer, owing to stringent standards, advanced technology, and well-trained professionals. The absolute number of crashes has not risen in proportion to the volume of air traffic, meaning that the accident rate has fallen to historically low levels.
Although a single aircraft crash can cause many sudden fatalities and therefore draws heavy media coverage, flying is considered one of the safest ways to travel.