Title

Data Analysis of Aviation Crashes

Group Members

Maya Arnott
Karolina Wiesiolek
Casandra Laney
Zichen Fan

Motivation

In recent years, airplane crashes seem to receive intense media attention, often framed as sensational “click-bait” headlines. This can create the impression that commercial air travel is increasingly dangerous, even though aviation is widely regarded as one of the safest modes of transportation.

Our project is motivated by two main goals:

Put aviation risk into context over time:

We want to examine how the frequency and severity of airplane crashes have changed across decades. Are crashes actually becoming more or less common? Have fatality rates per crash changed with improvements in technology and safety regulations?

Understand patterns across aircraft, operators, and seasons:

We are interested in which aircraft types, operators, and seasons are associated with more frequent or more severe crashes. We also plan to explore whether climate or seasonal patterns might be related to the timing or characteristics of crashes.

Initial questions

Core Questions

As we started the project, our main questions were:

Temporal trends
- How has the number of aviation crashes changed over time (e.g., by decade or year)?
- How has the severity of crashes (e.g., number of fatalities per crash) changed over time?
Aircraft and operators
- Which aircraft types/models are most frequently involved in crashes?
- Which operators/airlines experience the highest number of crashes?
- Are some operators associated with more severe outcomes (e.g., higher fatality counts per crash)?
Seasonality and environment
- Are crashes more common in certain seasons or months of the year?
- Do crash counts differ by a statistically significant amount between seasons?

As data exploration continued over the following weeks, additional questions were developed:

What is the best way to present the fatalities that occurred around the world?
Are there notable geographic clusters of crashes?
If certain aircraft or operators have very few crashes, should we group them into broader categories (e.g., commercial vs military, or regional vs major carriers)?
What would a linear regression model for aircraft crashes look like?

Data Summary

Data Source

The dataset was sourced as a CSV file from Kaggle and stored in the datasets folder within our project directory. We read this dataset into R using read_csv() and maintained all code in an R Project linked to a public GitHub repository for reproducibility. An additional dataset was created by our group, which contains geocoded latitude and longitude information for each crash location. Because the Kaggle file lists location as text (e.g., “Paris, France” or “5 miles from Cairo”), our group geocoded each unique location string and stored the results in geocoded_cache.csv. This allowed us to build interactive maps to see the spatial distribution of crashes worldwide.

Data and Exploratory Analysis

Data Cleaning and Preparation

Before performing geographic or temporal analysis, our dataset required substantial cleaning due to inconsistent formatting and missing coordinate data. We wrote the preprocessing code used throughout the analysis pipeline. It included four major steps:

The dataset was first loaded as a tibble so that it could be processed, cleaned, and analyzed. Rows with missing or placeholder values in aboard or ground were removed to give cleaner data for analysis. The flight_number, time, and registration columns were dropped because they did not contain information relevant to our analysis.

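library(tidyverse)  # read_csv(), dplyr verbs, stringr
library(lubridate)  # mdy(), year(), month()

airplane_df = read_csv("datasets/airplane_crashes_data.csv", show_col_types = FALSE) |> 
  janitor::clean_names() |> 
  # drop rows where ground or aboard holds the "NULL" placeholder
  filter(ground != "NULL", aboard != "NULL") |> 
  # removes unnecessary columns for our analyses
  select(-flight_number, -time, -registration)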

The date column was initially stored as a string, so we first parsed it manually into numeric month, day, and year. Two-digit years were then converted to four-digit years so that they would not be misinterpreted when parsed. Columns for the numeric year, numeric month, and month name were created for easier grouping, plotting, and analysis.

airplane_df = airplane_df |> 
  mutate(
    # remove leading and trailing spaces
    date = str_trim(date),
    
    # extract month, day, and year from the date string
    m = as.numeric(sub("/.*", "", date)),                    
    d = as.numeric(sub(".*/(.*)/.*", "\\1", date)),          
    y = as.numeric(sub(".*/(.*)$", "\\1", date)),            
    
    # convert 2-digit years (<100) to 4-digit (1900s)
    y = ifelse(y < 100, y + 1900, y),
    
    # rebuild the string
    date_clean = paste(m, d, y, sep = "/"),
    
    # convert to Date type
    date = mdy(date_clean),
    
    # extract numeric year and month
    year = year(date),
    month = month(date),
    month_name = factor(month(date, 
                              label = TRUE, 
                              abbr = TRUE), 
                              levels = month.abb)
  ) |>
  select(-m, -d, -y, -date_clean)  # remove unnecessary columns
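As a quick check of the parsing logic above, the same steps can be wrapped in a small helper and run on a few sample strings. This is a minimal sketch: the helper name parse_crash_date and the sample dates are hypothetical and are not taken from the dataset. Note that the rule places every two-digit year in the 1900s, so a string such as "9/17/08" becomes 1908 rather than 2008.

```r
library(lubridate)

# Hypothetical sample strings, invented for illustration
sample_dates = c(" 9/17/08", "07/12/1912 ", "12/1/99")

# Hypothetical helper that mirrors the cleaning steps shown above
parse_crash_date = function(date) {
  date = trimws(date)                             # remove leading/trailing spaces
  m = as.numeric(sub("/.*", "", date))            # text before the first slash
  d = as.numeric(sub(".*/(.*)/.*", "\\1", date))  # text between the two slashes
  y = as.numeric(sub(".*/(.*)$", "\\1", date))    # text after the last slash
  y = ifelse(y < 100, y + 1900, y)                # two-digit years go to the 1900s
  mdy(paste(m, d, y, sep = "/"))                  # rebuild the string and parse it
}

parsed = parse_crash_date(sample_dates)
print(parsed)
```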

Key columns were converted to numeric and operator was converted to a factor, so that we could group by operator in our analyses.

airplane_df = airplane_df |> 
  mutate(
    aboard     = as.numeric(aboard),
    fatalities =  as.numeric(fatalities),
    ground     = as.numeric(ground),
    operator   = as.factor(operator)  # to group by operator
    )

Lastly, a decade column was created so that plots could be grouped by decade.

airplane_df = airplane_df |> 
  mutate(
    decade = floor(year / 10) * 10, 
    decade = paste0(decade, "s")
  ) |> 
  select(date, year, decade, month, month_name, everything())
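With a decade column in place, the temporal questions reduce to a grouped summary. The sketch below uses a toy table with invented values in place of airplane_df; the summary column names (crashes, mean_fatalities, fatality_rate) are our own choices for illustration.

```r
library(dplyr)

# Toy stand-in for airplane_df; all values are invented
toy_df = tibble(
  year       = c(1908, 1945, 1947, 1972, 1979, 2009),
  aboard     = c(2, 20, 30, 100, 150, 200),
  fatalities = c(1, 20, 15, 100, 75, 50)
)

decade_summary = toy_df |>
  mutate(decade = paste0(floor(year / 10) * 10, "s")) |>
  group_by(decade) |>
  summarize(
    crashes         = n(),                           # number of crash records
    mean_fatalities = mean(fatalities),              # average deaths per crash
    fatality_rate   = sum(fatalities) / sum(aboard)  # share of people aboard who died
  )

print(decade_summary)
```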

Analysis

Interactive Map

The data cleaning process was the same for all group members. Once it was done, I moved on to creating the interactive maps. The steps are outlined below.

Joining Geocoded Coordinates for Mapping

Because the raw dataset contained only text descriptions of crash locations, our group generated a supplementary file (geocoded_cache.csv) containing:

- cleaned location names
- latitude and longitude for each crash

We merged this file with our cleaned dataset to produce the final dataset used in all mapping visuals:

df_map <- airplane_df |>
  left_join(geocoded_locations,
            by = c("location" = "location_clean")) |>
  filter(!is.na(Latitude), !is.na(Longitude))
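Because the final filter silently drops crashes whose location could not be geocoded, it can help to see which rows fall out of the join. The sketch below uses toy tables; the column layout of geocoded_locations is assumed from the join keys above, and the coordinates are rough, invented values.

```r
library(dplyr)

# Toy stand-in for the cleaned crash data
crashes = tibble(
  location   = c("Paris, France", "5 miles from Cairo", "Unknown field"),
  fatalities = c(3, 10, 1)
)

# Assumed shape of the geocoded cache; coordinates are approximate and invented
geocoded_locations = tibble(
  location_clean = c("Paris, France", "5 miles from Cairo"),
  Latitude       = c(48.86, 30.04),
  Longitude      = c(2.35, 31.24)
)

# Same join and filter as above: only rows with coordinates are kept
df_map = crashes |>
  left_join(geocoded_locations, by = c("location" = "location_clean")) |>
  filter(!is.na(Latitude), !is.na(Longitude))

# Rows that have no match in the cache and therefore never reach the map
unmatched = crashes |>
  anti_join(geocoded_locations, by = c("location" = "location_clean"))

print(df_map)
print(unmatched)
```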

Preparing the Shared Dataset for Interactive Filtering

To allow dynamic filtering without Shiny, I used the crosstalk package. This connects:

- the map
- the data table
- the filter widgets

All components update simultaneously without reloading.

shared = SharedData$new(df_display, group = "crashes")

This enabled an interactive interface where users can:

- choose an operator
- filter by year
- select a fatality threshold
- see both the map and table update instantly
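One way the shared object can be wired to widgets is sketched below. This is an illustration under assumptions, not the exact layout of our page: the toy df_display, the widget ids, and the column widths are invented, and the map and table are built with leaflet and DT.

```r
library(crosstalk)
library(leaflet)
library(DT)

# Toy stand-in for df_display; all values are invented
df_display = data.frame(
  operator   = c("Operator A", "Operator B", "Operator A"),
  year       = c(1950, 1975, 2001),
  fatalities = c(12, 80, 3),
  Latitude   = c(48.86, 30.04, 40.71),
  Longitude  = c(2.35, 31.24, -74.01)
)

# One shared object links every widget created from it
shared = SharedData$new(df_display, group = "crashes")

# Filter widgets: a dropdown for operator and range sliders for year and fatalities
filters = list(
  filter_select("operator", "Operator", shared, ~operator),
  filter_slider("year", "Year", shared, ~year, step = 1, sep = ""),
  filter_slider("fatalities", "Fatalities", shared, ~fatalities)
)

# Map and table both read from the shared object, so they update together
crash_map = leaflet(shared) |>
  addTiles() |>
  addCircleMarkers(lng = ~Longitude, lat = ~Latitude)
crash_table = datatable(shared)

# Arrange filters beside the map; the table can be printed below in the same document
page = bscols(widths = c(3, 9), filters, crash_map)
```

In a rendered R Markdown page, printing page and crash_table in the same document is enough for the filters to drive both outputs.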

Statistical Analysis

We used methods including ANOVA to statistically investigate questions about our dataset. To see the code we used to conduct our statistical analyses along with our figures, check out our Statistical Analysis page.

Seasonal Trends: Does the average number of crash events differ across months?

To investigate potential seasonal trends in aviation safety, we focused on data from the Northern Hemisphere and aggregated crash counts by month. We performed a one-way ANOVA to test the null hypothesis that the mean number of crashes is the same in every month of the year.

Significant Finding: The analysis yielded a p-value of 0.039, falling below the 0.05 significance threshold.
We reject the null hypothesis. There is statistically significant evidence that crash frequencies vary by month in the Northern Hemisphere.

This finding suggests that seasonal factors (such as winter weather conditions) may contribute to accident rates, although the ANOVA alone does not identify which months differ or why. A secondary analysis of fatality rates across months did not yield significant results, which implies that while season is associated with the occurrence of crashes, it does not necessarily affect their survivability.
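The shape of such a test can be sketched with simulated data. The example below assumes that each year-month combination contributes one crash count, which is one plausible way to obtain replicates per month; the simulated counts, the seasonal bump, and the column names are invented and do not reproduce our p-value.

```r
set.seed(1)

# Simulated stand-in: one crash count per (year, month) cell
monthly_counts = expand.grid(year = 1950:1979, month = 1:12)
monthly_counts$month_name = factor(month.abb[monthly_counts$month],
                                   levels = month.abb)

# Invented seasonal pattern: slightly more crashes in Dec, Jan, and Feb
winter = monthly_counts$month %in% c(12, 1, 2)
monthly_counts$n_crashes = rpois(nrow(monthly_counts),
                                 lambda = ifelse(winter, 6, 4))

# One-way ANOVA: does the mean count differ across the 12 months?
fit = aov(n_crashes ~ month_name, data = monthly_counts)
anova_table = summary(fit)[[1]]
p_value = anova_table[["Pr(>F)"]][1]

print(anova_table)
```

A significant F test only says that at least one month differs; a follow-up such as TukeyHSD(fit) would be needed to see which ones.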

Survivability Trends: Is the fatality rate lower over time?

We evaluated long-term improvements in aviation survivability by modeling the relationship between time (Year) and Fatality Rate (Fatalities / Aboard) using simple linear regression.

The model revealed a highly statistically significant negative correlation (\(p < 0.001\)) between year and fatality rate.
Airplanes have become statistically safer over time regarding survivability.

However, the low R-squared value (0.015) shows that while the aggregate risk has decreased historically, “time” itself explains very little of the variance between individual crashes. The outcome of any specific incident is determined primarily by immediate situational factors rather than by the broad historical era in which it occurred, and most of the variation in fatality rate within any given year is left unexplained by the model.
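The combination of a significant slope and a low R-squared can be reproduced with a small simulation. This is a sketch on invented data: the trend size and noise level are assumptions chosen only to show how the quantities are extracted from a fitted model, not estimates from our dataset.

```r
set.seed(2)
n = 500

# Simulated crashes: a weak downward trend buried in crash-level noise
sim = data.frame(
  year   = sample(1908:2009, n, replace = TRUE),
  aboard = sample(2:200, n, replace = TRUE)
)
latent_rate = 0.9 - 0.002 * (sim$year - 1908) + rnorm(n, sd = 0.3)
latent_rate = pmin(pmax(latent_rate, 0), 1)        # keep rates inside [0, 1]
sim$fatalities = round(sim$aboard * latent_rate)
sim$fatality_rate = sim$fatalities / sim$aboard    # Fatalities / Aboard

# Simple linear regression of fatality rate on year
fit = lm(fatality_rate ~ year, data = sim)
fit_summary = summary(fit)

slope     = coef(fit)[["year"]]                     # change in rate per year
p_value   = coef(fit_summary)["year", "Pr(>|t|)"]   # test of slope = 0
r_squared = fit_summary$r.squared                   # share of variance explained

print(c(slope = slope, p_value = p_value, r_squared = r_squared))
```

With many observations, even a small slope can be highly significant while R-squared stays low, which is the pattern described above.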

Historical Frequency: Has the number of crash events risen or fallen over the years?

To describe the historical trajectory of aviation safety, we analyzed the total number of crashes per year (1908–2009). We compared two modeling approaches: a Simple Linear Regression and a LOESS (Locally Estimated Scatterplot Smoothing) model.

The LOESS model provides a better fit.
The linear model fails to describe the historical fluctuations and the recent downward trend.

The linear model showed a significant upward trend but had a relatively high error, so a purely linear interpretation is misleading. The LOESS model reveals the non-linear history of aviation: a sharp rise in accidents during the mid-20th century (the expansion of commercial aviation) followed by a distinct decline in recent decades, which is consistent with technological and regulatory advancements.
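The comparison can be sketched on simulated yearly counts with a rise-then-fall shape. The data, the span value, and the rmse helper are assumptions for illustration; the error measure used on our Statistical Analysis page may differ.

```r
set.seed(3)

# Simulated yearly crash counts with an invented rise-then-fall shape
yearly = data.frame(year = 1908:2009)
expected = 10 + 60 * exp(-((yearly$year - 1975) / 30)^2)
yearly$n_crashes = rpois(nrow(yearly), lambda = expected)

# Two candidate descriptions of the same series
linear_fit = lm(n_crashes ~ year, data = yearly)
loess_fit  = loess(n_crashes ~ year, data = yearly, span = 0.5)

# Hypothetical helper: in-sample root mean squared error
rmse = function(model) sqrt(mean(residuals(model)^2))
fit_comparison = c(linear = rmse(linear_fit), loess = rmse(loess_fit))

print(fit_comparison)
```

In-sample error tends to favor the more flexible smoother, so this comparison is descriptive rather than a formal model test.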

Cause Analysis: What are the primary recurring themes?

To identify the common causes of accidents, we moved beyond numerical testing and performed text mining on the unstructured crash summary descriptions. We used text tokenization and stop-word removal, followed by frequency analysis (Word Cloud).

After removing generic terms (e.g., “plane”, “crashed”), the most frequent terms included “engine”, “runway”, “landing”, “approach”, and “weather”.

These word frequencies suggest that mechanical failure (engine) and the critical phases of flight (takeoff/landing/approach) are the predominant contexts for historical accidents. Environmental factors, indicated by terms like “fog” and “weather,” also appear to play a substantial role. Word counts alone cannot establish the cause of any individual crash.
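The tokenize, filter, and count steps can be sketched as follows. This assumes the tidytext package and its built-in stop_words table; the three summaries and the custom stop-word list are invented for illustration.

```r
library(dplyr)
library(tidytext)

# Invented crash summaries standing in for the summary column
crash_text = tibble(
  crash_id = 1:3,
  summary  = c(
    "The plane crashed on approach in heavy fog.",
    "Engine failure shortly after takeoff.",
    "Crashed short of the runway on approach after engine failure."
  )
)

# Domain-generic terms removed in addition to standard stop words (assumed list)
custom_stop_words = tibble(word = c("plane", "crashed", "aircraft"))

word_counts = crash_text |>
  unnest_tokens(word, summary) |>                # one lowercase word per row
  anti_join(stop_words, by = "word") |>          # drop common English words
  anti_join(custom_stop_words, by = "word") |>   # drop generic aviation terms
  count(word, sort = TRUE)                       # frequency table, most common first

print(word_counts)
```

The resulting frequency table is the kind of input a word-cloud plot is drawn from.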

Discussion

In this final stage of the project, we returned to the questions from our Motivation and Initial Questions sections to reflect on what we learned.

Our dataset records crashes but not the number of flights, so it cannot give an accident rate on its own. Read alongside the well-documented growth in flights and passengers over the decades, however, it suggests that the aviation industry has become considerably safer, which is commonly attributed to stringent standards, advanced technology, and well-trained professionals. The absolute number of crashes has not risen in proportion to the volume of air traffic, meaning that the rate of accidents has fallen.

Although a single aircraft crash can cause many deaths at once, and therefore draws heavy media coverage, air travel is considered one of the safest ways to travel.

Fly High Project - Final Report