
Blackout Breakdown

Predictive Analysis for Forecasting Prolonged Outages

Authors: David Chew, Derek Kuang

Table of Contents

  1. Project Overview
  2. Introduction
  3. Data Cleaning and Exploratory Data Analysis
  4. Assessment of Missingness
  5. Hypothesis Testing
  6. Framing a Prediction Problem
  7. Baseline Model
  8. Final Model
  9. Fairness Analysis

Project Overview

This is a data science project on predicting the causes and duration of major power outages. The dataset used to explore the topic can be found here. This project was completed for DSC80 at UCSD.

Introduction

Have you ever been caught off guard by a sudden power outage? In today’s society, outages disrupt businesses, endanger public safety, and lead to significant economic losses. This raises our central question:
“What are the primary causes of power outages in different regions of the United States, considering factors such as weather, geography, and infrastructure? How do these causes impact outcomes like economic loss, outage duration, and other related effects?”

Understanding these factors is crucial for guiding infrastructure investments and emergency response strategies, ultimately building more resilient power systems. Our analysis uses a dataset of 1,534 major outages across the United States from 2001 to 2016, which captures both the duration and extent of outages along with regional context.

Here are some of the columns that we thought are relevant to our analysis:

Column Name Description
YEAR Year in which the outage took place.
MONTH Month of the outage occurrence.
U.S._STATE U.S. state where the outage occurred.
NERC.REGION NERC region(s) impacted by the outage.
CLIMATE.REGION Classification of the area’s climate region.
ANOMALY.LEVEL ONI index indicating El Niño/La Niña conditions.
CLIMATE.CATEGORY Classification of climate episodes.
OUTAGE.START.DATE Date (day of the year) when the outage began.
OUTAGE.START.TIME Start time of the outage.
OUTAGE.RESTORATION.DATE Date (day of the year) when power was restored.
OUTAGE.RESTORATION.TIME Time when power was restored.
CAUSE.CATEGORY Primary cause category of the outage.
CAUSE.CATEGORY.DETAIL Detailed description of the cause categories.
OUTAGE.DURATION Length of the outage.
DEMAND.LOSS.MW Peak demand loss in megawatts.
CUSTOMERS.AFFECTED Number of customers impacted by the outage.
RES.PRICE Residential sector monthly electricity price.
COM.PRICE Commercial sector monthly electricity price.
IND.PRICE Industrial sector monthly electricity price.
TOTAL.PRICE Overall average monthly electricity price.
RES.SALES Residential electricity consumption.
COM.SALES Commercial electricity consumption.
IND.SALES Industrial electricity consumption.
TOTAL.SALES Combined total electricity consumption.
RES.PERCEN Share of electricity consumption from the residential sector.
COM.PERCEN Share of electricity consumption from the commercial sector.
IND.PERCEN Share of electricity consumption from the industrial sector.
RES.CUSTOMERS Number of residential customers served annually.
COM.CUSTOMERS Number of commercial customers served annually.
IND.CUSTOMERS Number of industrial customers served annually.
TOTAL.CUSTOMERS Total number of customers served annually.
RES.CUST.PCT Proportion of customers in the residential sector.
COM.CUST.PCT Proportion of customers in the commercial sector.
IND.CUST.PCT Proportion of customers in the industrial sector.
POPPCT_URBAN Percentage of the state’s population residing in urban areas.
POPPCT_UC Percentage of the state’s population living in urban clusters.
POPDEN_URBAN Population density in urban areas.
POPDEN_UC Population density in urban clusters.
POPDEN_RURAL Population density in rural areas.
AREAPCT_URBAN Percentage of state land area occupied by urban areas.
AREAPCT_UC Percentage of state land area covered by urban clusters.

Data Cleaning and Exploratory Data Analysis

Data Cleaning

  1. Data Copy and Indexing: Converted the file from xlsx to csv format and removed extra rows and columns. Created a copy of the raw dataset to preserve the original data and set the OBS column as the index, ensuring each observation is uniquely identified and facilitating consistent referencing during further analysis.

  2. Datetime Conversion: Combined OUTAGE.START.DATE and OUTAGE.START.TIME into a single OUTAGE.START datetime column, and similarly combined OUTAGE.RESTORATION.DATE and OUTAGE.RESTORATION.TIME into OUTAGE.RESTORATION. This conversion standardizes time-based data, making it easier to perform temporal analyses, while any unparseable entries are coerced to missing values.

  3. Handling Missing Month and Year: Filled missing values in the MONTH and YEAR columns with 0 and converted them to integers. This step ensures these temporal fields are in a consistent numerical format, which is essential for accurate time-series analysis.

  4. Handling Missing Numeric Values: Replaced zero values in critical numerical columns such as OUTAGE.DURATION, CUSTOMERS.AFFECTED, and DEMAND.LOSS.MW with NaN. This prevents misinterpretation of missing or unrecorded data as valid values, leading to more reliable statistical analyses and modeling.

  5. Renaming and Standardizing Column Names: Renamed U.S._STATE to US.STATE and then standardized all column names by converting them to lowercase and replacing periods with underscores. This uniform naming convention reduces the likelihood of errors during data manipulation and simplifies code readability.

  6. Stripping Whitespace: Removed leading and trailing whitespace from all string columns. This cleaning step ensures that categorical values are consistent, preventing issues during grouping, merging, and analysis.
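The steps above can be sketched in pandas as follows. This is a minimal illustrative sketch, not the project's actual notebook code; it assumes the raw dataframe uses the original column names from the dataset.

```python
import numpy as np
import pandas as pd

def clean_outages(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of cleaning steps 1-6; assumes the raw outage dataframe."""
    df = df.copy().set_index('OBS')  # step 1: preserve the original, index by OBS

    # Step 2: combine date/time pairs; unparseable entries are coerced to NaT.
    df['OUTAGE.START'] = pd.to_datetime(
        df['OUTAGE.START.DATE'] + ' ' + df['OUTAGE.START.TIME'], errors='coerce')
    df['OUTAGE.RESTORATION'] = pd.to_datetime(
        df['OUTAGE.RESTORATION.DATE'] + ' ' + df['OUTAGE.RESTORATION.TIME'],
        errors='coerce')

    # Step 3: fill missing MONTH/YEAR with 0 and cast to integers.
    df[['MONTH', 'YEAR']] = df[['MONTH', 'YEAR']].fillna(0).astype(int)

    # Step 4: treat zeros in key numeric columns as missing.
    for col in ['OUTAGE.DURATION', 'CUSTOMERS.AFFECTED', 'DEMAND.LOSS.MW']:
        df[col] = df[col].replace(0, np.nan)

    # Step 5: rename U.S._STATE, then lowercase and replace periods.
    df = df.rename(columns={'U.S._STATE': 'US.STATE'})
    df.columns = [c.lower().replace('.', '_') for c in df.columns]

    # Step 6: strip leading/trailing whitespace from string columns.
    for col in df.select_dtypes('object'):
        df[col] = df[col].str.strip()
    return df
```

Because `errors='coerce'` is used in step 2, malformed date or time strings simply become missing values rather than raising errors.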

Below is a preview of the first five rows of the cleaned dataset (selected columns; the full table is too wide to display here):

obs year month us_state climate_region cause_category cause_category_detail outage_duration customers_affected
1 2011 7 Minnesota East North Central severe weather NaN 3060.0 70000.0
2 2014 5 Minnesota East North Central intentional attack vandalism 1.0 NaN
3 2010 10 Minnesota East North Central severe weather heavy wind 3000.0 70000.0
4 2012 6 Minnesota East North Central severe weather thunderstorm 2550.0 68200.0
5 2015 7 Minnesota East North Central severe weather NaN 1740.0 250000.0

Univariate Analysis

Frequency of Power Outage Causes

In this bar chart, each bar represents a distinct detailed cause category (e.g., severe weather, intentional attack, and other specific causes), with the height of the bar indicating the number of outages associated with that detailed cause. The color differentiation and rotated labels help improve readability, making it easy to identify which detailed causes are most common.

Notably, “vandalism” emerges as the leading cause, followed by weather-related factors like “thunderstorm” and “winter storm,” suggesting that both human-driven and severe weather events significantly affect power reliability.

Outage Frequency by State

We used a Folium‐based heat map to visualize the total number of power outages in each U.S. state, shading states with higher outage counts in darker colors. This approach makes it easy to spot regions with frequent disruptions at a glance on the U.S. map.

Notably, states such as California, Washington, Texas, Illinois, and New York stand out with especially high outage counts. This pattern suggests that these regions may be more vulnerable to power disruptions.

Bivariate Analysis

Comparison of Outage Duration by Cause Category

In this visualization, we explore the relationship between outage duration and its cause category to identify trends in power disruptions. Using a box plot, we can examine the spread, median, and outliers for each cause of outages. This allows us to assess which causes tend to result in longer or more variable power outages. By comparing different categories, we can gain insights into the most disruptive factors affecting power reliability.

Notably, fuel supply emergencies show a wide range of outage durations, from brief disruptions to prolonged events, while severe weather and fuel supply emergencies are the only categories with major outliers. In contrast, intentional attacks, system operability disruptions, and equipment failures tend to have shorter, more consistent durations with fewer extreme cases, and the median outage duration varies across categories.

Percentage of Outages By Climate Region and Cause Category

This visualization examines the percentage of outages by climate region and cause category, providing a regional perspective on power disruption causes. By using a grouped bar chart, we can see how different outage causes contribute to the total outages in each climate region. This allows us to identify patterns in outage causes across geographic areas, highlighting regional vulnerabilities and differences in power grid reliability.

Notably, severe weather is the dominant cause of outages in most regions—especially in the Central, East North Central, South, and Southeast—while intentional attacks are more prevalent in the Northwest and Southwest. The West along with the West North Central show a more balanced mix of causes. In addition, system operability disruptions, equipment failures, and public appeals occur across all regions but at much lower rates compared to severe weather and intentional attacks.

Interesting Aggregates

In this visualization, we aggregate our dataset by year and climate region to examine temporal trends in power outage impacts. The pivot table computes the average outage duration and mean anomaly level for each combination of year and climate region, which are then displayed as two line charts in a dual subplot layout. The left chart reveals how average outage duration has evolved over time across different climate regions, while the right chart shows the corresponding fluctuations in the mean anomaly level.

year climate_region avg_outage_duration mean_anomaly_level
2000 Central 1200 -0.6
2000 Northeast 681 -0.9
2000 South 903 -0.833333
2000 Southeast 5384 -0.95
2000 Southwest 66 -0.833333

In these two plots, each climate region exhibits distinct patterns over time for both average outage duration (left) and mean anomaly level (right). For instance, certain regions display sharp spikes in outage duration around 2010–2011, while others remain more stable. Simultaneously, changes in the mean anomaly level—often related to El Niño or La Niña events—tend to coincide with fluctuations in outage severity. By comparing these trends side by side, we gain insight into how climate variability may exacerbate or mitigate prolonged outages across different regions, helping us pinpoint periods and locations most at risk.
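The aggregation behind these plots can be reproduced with a pandas pivot table. The sketch below is illustrative rather than the project's exact code; it assumes the cleaned dataset with lowercase column names (`year`, `climate_region`, `outage_duration`, `anomaly_level`).

```python
import pandas as pd

def outage_trends(df: pd.DataFrame) -> pd.DataFrame:
    """Mean outage duration and anomaly level per (year, climate_region)."""
    pivot = df.pivot_table(
        index=['year', 'climate_region'],
        values=['outage_duration', 'anomaly_level'],
        aggfunc='mean',
    ).rename(columns={'outage_duration': 'avg_outage_duration',
                      'anomaly_level': 'mean_anomaly_level'})
    return pivot.reset_index()
```

Each row of the result corresponds to one year/region combination, ready to be split into the two line charts.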

Urbanization and Outage Impact Relationship

These three bar charts compare key outage metrics across four levels of urbanization (Low, Medium-Low, Medium-High, High), segmented by the percentage of each region’s population living in urban areas. The left chart shows the average outage duration, the middle chart highlights the total number of customers affected, and the right chart displays the average demand loss.

urban_quantile avg_outage_duration total_customers_affected average_demand_loss
Low 3374.76 3.72032e+07 535.273
Medium-Low 2009.87 2.61275e+07 575.86
Medium-High 3283.56 4.59132e+07 836.264
High 2124.32 4.72668e+07 843.425

These bar charts reveal that while areas with lower urbanization levels can experience relatively long average outages, more urbanized regions (particularly “Medium-High” and “High”) see the largest total number of customers affected, likely reflecting higher population density. Additionally, average demand loss appears greatest in the “High” group, underscoring the heavier infrastructure usage in densely populated areas. Taken together, these findings suggest that both ends of the urbanization spectrum face unique challenges—rural regions may endure prolonged outages, while highly urbanized areas suffer more extensive disruptions that impact a greater number of customers and place a larger strain on the power grid.
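One way to build the urbanization buckets above is with `pd.qcut`. This sketch assumes quartile-based bins and the cleaned column names (`poppct_urban`, `outage_duration`, `customers_affected`, `demand_loss_mw`); the exact binning used in the project may differ.

```python
import pandas as pd

def urbanization_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Bucket rows into urbanization quartiles and aggregate outage impact."""
    labels = ['Low', 'Medium-Low', 'Medium-High', 'High']
    df = df.assign(urban_quantile=pd.qcut(df['poppct_urban'], 4, labels=labels))
    return df.groupby('urban_quantile', observed=True).agg(
        avg_outage_duration=('outage_duration', 'mean'),
        total_customers_affected=('customers_affected', 'sum'),
        average_demand_loss=('demand_loss_mw', 'mean'),
    )
```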

Assessment of Missingness

NMAR Analysis

We believe that cause_category_detail is NMAR (Not Missing At Random) because its missing values are directly related to the nature of the outage itself. When an outage occurs under unusual, ambiguous, or difficult-to-diagnose circumstances, the specific cause details might not be recorded, either due to a lack of conclusive evidence or delays in investigation and reporting. This suggests that the missing values are not randomly distributed across all outages but instead correlated with the complexity or uncertainty of the outage cause.

There are several possible reasons why detailed cause information might not be reported, ranging from inconclusive evidence to delays in investigation and reporting.

To better understand whether cause_category_detail is truly NMAR, we would need additional data, such as incident reports, utility company reporting policies, or follow-up investigation records. If we can show that missing values are more frequent in certain types of outages (e.g., severe weather events or cyberattacks where precise attribution is difficult), this would confirm that the missingness is NMAR. However, if external factors like the region, company size, or outage duration explain the missingness, we might instead classify it as MAR (Missing At Random). Further testing will help validate our hypothesis.

Missingness Dependency

1. outage_duration and cause_category_detail (MAR - Missing at Random)

To determine whether the absence of cause details is related to outage duration, we conducted a permutation test. We compared the mean outage duration between records with missing cause details and those with non-missing cause details.

Hypotheses

Null Hypothesis: The missingness of cause_category_detail does not depend on outage duration; any difference in mean duration between the two groups is due to random chance.

Alternative Hypothesis: The missingness of cause_category_detail depends on outage duration.

Permutation Test Methodology

We first computed the observed difference in mean outage duration between the two groups (missing vs. non-missing cause details). Then, we randomly shuffled the missingness indicator across all records (keeping the outage duration values fixed) over 1,000 iterations to create a null distribution of differences. The p-value was calculated as the proportion of permuted differences that were as extreme as or more extreme than the observed difference.
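The procedure can be sketched as the helper below; this is illustrative code rather than the project's actual notebook, using the absolute difference in mean duration between the missing and non-missing groups as the statistic.

```python
import numpy as np

def missingness_perm_test(durations, is_missing, n_perm=1000, seed=42):
    """Permutation test: does mean duration differ by missingness of cause detail?"""
    rng = np.random.default_rng(seed)
    durations = np.asarray(durations, dtype=float)
    is_missing = np.asarray(is_missing, dtype=bool)

    def stat(mask):
        # Absolute difference in mean duration: missing vs. non-missing group.
        return abs(durations[mask].mean() - durations[~mask].mean())

    observed = stat(is_missing)
    # Shuffle the missingness indicator, keeping durations fixed.
    null = np.array([stat(rng.permutation(is_missing)) for _ in range(n_perm)])
    return (null >= observed).mean()  # p-value
```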

With a p-value of 0.02, we reject the null hypothesis at the conventional significance level (α = 0.05). This indicates that the missingness in cause_category_detail is significantly associated with outage duration—records with missing cause details tend to have a different (either higher or lower) average outage duration compared to those with complete cause information.

2. outage_duration and month (MCAR - Missing Completely at Random)

To examine whether the month of occurrence influences outage duration, we conducted a permutation test.

Hypotheses

Null Hypothesis: Average outage duration does not depend on the month of occurrence; any variation across months is due to random chance.

Alternative Hypothesis: Average outage duration depends on the month of occurrence.

Permutation Test Methodology

Specifically, we computed the observed test statistic as the difference between the maximum and minimum average outage durations across months. Then, we randomly shuffled the month labels (keeping outage_duration values unchanged) for a large number of permutations (e.g., 1,000 iterations) to generate a null distribution of the test statistic. The p-value was calculated as the proportion of permuted test statistics that were as large as or larger than the observed difference.

With a p-value of 0.217, we fail to reject the null hypothesis at the conventional significance level (α = 0.05). This indicates that the variation in average outage duration across different months is not statistically significant, and thus, the month does not appear to be a factor influencing outage duration in our dataset.

Hypothesis Testing

In our exploration of how different causes of power outages might affect economic outcomes, we compared the average total_price (a proxy for economic impact) across various detailed cause categories. The bar chart below displays each cause category on the x‐axis and the corresponding mean total_price on the y‐axis.

By highlighting which causes are associated with higher average economic impact, we can begin to see whether certain outage triggers—like vandalism, storms, or equipment failure—tend to have more pronounced cost implications. This initial overview sets the stage for a deeper hypothesis test to determine if any observed differences in total_price are statistically meaningful.

Hypothesis

Null Hypothesis: The mean total_price is the same across all cause_category_detail groups; any observed differences are due to random chance.

Alternative Hypothesis: At least one cause_category_detail group has a different mean total_price.

We use a permutation test with the range of group means as the test statistic. Specifically, we calculate the mean total_price for each cause_category_detail group and define the observed test statistic T_obs as:

T_obs = max(μ1, μ2, ..., μk) - min(μ1, μ2, ..., μk)

where each μi represents the mean total_price for a different detailed cause. This statistic captures the greatest disparity in economic impact across groups. In the permutation test, we randomly reassign the cause_category_detail labels across the dataset multiple times (e.g., 1,000 iterations) and compute the test statistic for each permutation. This process generates a null distribution of test statistic values (denoted as T values). The p-value is then calculated as the proportion of these permuted T values that are as extreme as (or more extreme than) the observed T_obs.
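A sketch of this range-of-means permutation test, written as an illustrative helper rather than the project's actual code:

```python
import numpy as np

def range_perm_test(values, labels, n_perm=1000, seed=0):
    """Permutation test using the range of group means (max - min) as T."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)

    def t_stat(labs):
        # Greatest disparity in group means across the label groups.
        means = [values[labs == g].mean() for g in np.unique(labs)]
        return max(means) - min(means)

    t_obs = t_stat(labels)
    # Reassign labels at random, keeping the values fixed.
    null = np.array([t_stat(rng.permutation(labels)) for _ in range(n_perm)])
    return t_obs, (null >= t_obs).mean()
```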

Explanation of Results

The significance level that we picked is 0.05.

Our permutation test yielded a p-value of 0.481. This p-value is well above our significance level of 0.05, so we fail to reject the null hypothesis. In practical terms, this result indicates that there is not enough evidence to conclude that the mean economic impact differs among the different cause_category_detail groups. The observed disparity in means is consistent with what we would expect to see by chance alone. Consequently, based on this test, the detailed cause does not appear to have a statistically significant effect on the economic impact, as measured by total_price, in our dataset.

Framing a Prediction Problem

Prediction Problem: Classifying Prolonged Outages

We frame a binary classification problem where the goal is to predict whether a newly reported outage will become “prolonged” (last 24 hours or more). Specifically, we define our response variable, prolonged, as an indicator that equals 1 when the outage lasts 24 hours or more and 0 otherwise.
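A minimal sketch of constructing this label with pandas, assuming outage_duration is recorded in minutes (so 24 hours corresponds to 1440):

```python
import pandas as pd

def add_prolonged_label(df: pd.DataFrame, threshold_minutes: int = 24 * 60) -> pd.DataFrame:
    """Flag outages lasting at least 24 hours (assumes duration in minutes)."""
    return df.assign(prolonged=(df['outage_duration'] >= threshold_minutes).astype(int))
```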

Justification and Features

Evaluation Metric: Recall

We prioritize recall (true positive rate) because missing a prolonged outage (a false negative) is far more costly than mistakenly labeling a short outage as prolonged (a false positive). By focusing on recall, we reduce the risk of underestimating the severity of an outage, ensuring that critical response measures can be activated promptly when needed. Although we may monitor other metrics (like precision or F1-score) for a more complete picture, recall is our primary measure of success given the operational and safety implications of failing to identify a genuinely prolonged outage.

Baseline Model

Our baseline model is a logistic regression that uses three features: one nominal and two quantitative. We use the categorical feature climate_region (nominal), which is encoded using a one-hot encoder (dropping the first category to avoid redundancy), and the numerical features poppct_urban and popden_urban (both quantitative), which are standardized. The response variable, prolonged, is binary, making this a binary classification problem solved using Logistic Regression.

We chose logistic regression for our baseline model because it is a straightforward, well-understood algorithm for binary classification. Logistic regression estimates the probability of an outage being prolonged, which aligns well with our goal of identifying high-risk situations. Additionally, logistic regression is computationally efficient and less prone to overfitting with a limited set of features.

Below is an overview of the pipeline:

  1. Feature Selection and Encoding:
    • Categorical Feature: climate_region (nominal), one-hot encoded (with drop='first' to avoid collinearity).
    • Numerical Features: poppct_urban and popden_urban (both quantitative), scaled via StandardScaler.
  2. Pipeline Implementation:
    We chain these preprocessing steps with a Logistic Regression classifier, ensuring all data transformations occur sequentially in one unified workflow. The pipeline is trained on 80% of the data (randomly selected), and the remaining 20% is used for testing.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Feature groups used by the baseline model
cat_features = ['climate_region']
num_features = ['poppct_urban', 'popden_urban']

# Define the preprocessing for each feature type with OneHotEncoder(drop='first')
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), cat_features),
        ('num', StandardScaler(), num_features)
    ]
)

# Build the baseline pipeline with a Logistic Regression classifier
baseline_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
  3. Model Performance:
    After training, we evaluated our baseline model on the test set. The metrics are as follows:
    • Accuracy: 0.6287
    • F1-score: 0.2785
    • Precision: 0.5238
    • Recall: 0.1897
  4. Confusion Matrix Insight:
    A closer look at the confusion matrix reveals that although the model achieves a moderate overall accuracy, it consistently misclassifies a significant portion of actual prolonged outages (low recall). This suggests that our baseline approach tends to underestimate the “prolonged” class, likely due to a relatively small number of prolonged outages in the dataset or insufficiently predictive features for distinguishing these rare events.

  5. Interpretation and Next Steps:

While this baseline model serves as a useful starting benchmark, its low recall for prolonged outages indicates a critical gap for practical applications—especially when failing to identify a truly prolonged outage could have severe operational consequences. To address this, future improvements may involve richer feature engineering, a more flexible model, and explicit handling of class imbalance.
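For reference, training and evaluating the baseline pipeline on an 80/20 split might look like the sketch below. The data here is synthetic stand-in data (not the real outage table), so the printed numbers will not match the metrics reported above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({  # synthetic stand-in for the real outage data
    'climate_region': rng.choice(['Central', 'South', 'Northeast'], size=n),
    'poppct_urban': rng.uniform(40, 95, size=n),
    'popden_urban': rng.uniform(500, 5000, size=n),
})
# Toy label loosely tied to urbanization so the model has signal to learn.
df['prolonged'] = (df['poppct_urban'] + rng.normal(0, 10, size=n) > 70).astype(int)

cat_features, num_features = ['climate_region'], ['poppct_urban', 'popden_urban']
baseline_pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer([
        ('cat', OneHotEncoder(drop='first'), cat_features),
        ('num', StandardScaler(), num_features),
    ])),
    ('classifier', LogisticRegression()),
])

# 80/20 split, fit, and evaluate on the held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    df[cat_features + num_features], df['prolonged'], test_size=0.2, random_state=42)
baseline_pipeline.fit(X_train, y_train)
y_pred = baseline_pipeline.predict(X_test)
print('recall:', recall_score(y_test, y_pred))
```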

Final Model

Final Model: RandomForestClassifier with Feature Engineering and Hyperparameter Tuning

Motivation for Model Change:
After observing that our baseline logistic regression model struggled to identify prolonged outages (low recall), we switched to a RandomForestClassifier. This ensemble method often captures more complex relationships in the data than linear models, allowing it to better distinguish between prolonged and non-prolonged outages. Moreover, by setting class_weight='balanced', we place extra emphasis on the minority class (prolonged outages), addressing the recall deficit more directly.

Feature Engineering and Added Features:
Compared to the baseline, we incorporated additional feature engineering, implemented as two custom pipeline transformers (PopulationFeatureEngineer and AdditionalFeatureEngineer), to enrich the model’s understanding of outage patterns. These engineered features capture more nuanced insights about population distribution, climate conditions, and energy consumption—factors that may affect how long outages persist.

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# PopulationFeatureEngineer and AdditionalFeatureEngineer are custom
# transformers defined elsewhere in the project; cat_features and
# num_features list the categorical and numerical columns after
# feature engineering.
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), cat_features),
        ('num', StandardScaler(), num_features)
    ]
)

# Note the addition of class_weight='balanced' to improve recall on the minority class.
pipeline = Pipeline([
    ('population_feature_engineer', PopulationFeatureEngineer()),
    ('additional_feature_engineer', AdditionalFeatureEngineer()),
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42, criterion='entropy', class_weight='balanced'))
])

Model Pipeline and Hyperparameter Tuning:
All transformations (feature engineering, one-hot encoding, scaling) and model training steps are unified in a single scikit-learn pipeline.

Hyperparameter tuning for our RandomForestClassifier is a nuanced process, aimed at refining the model to balance learning from the data with the ability to generalize to unseen data. We tuned several key parameters: max_depth, which controls the maximum number of levels in each decision tree (too low leads to underfitting and too high risks overfitting); n_estimators, the number of trees in the ensemble (more trees can capture more complexity but increase computation); and min_samples_split, which sets the minimum number of samples required to split an internal node (ensuring splits are made only when there is sufficient data).

We employed GridSearchCV with 5-fold cross-validation to systematically explore various combinations of these hyperparameters, allowing us to identify the configuration that best improves model performance while mitigating overfitting.
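The grid search might look like the sketch below. The grid values are illustrative (chosen to include the best settings reported next), the data is synthetic rather than the real outage table, and scoring='recall' reflects the metric we prioritize.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the engineered feature matrix.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.7, 0.3],
                           random_state=42)

# Illustrative grid covering the three tuned hyperparameters.
param_grid = {
    'max_depth': [None, 10],
    'n_estimators': [100, 300],
    'min_samples_split': [2, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42, criterion='entropy',
                           class_weight='balanced'),
    param_grid, cv=5, scoring='recall', n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```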

We found that the optimal settings for our RandomForestClassifier are: max_depth=None, min_samples_split=10, and n_estimators=300. This configuration achieves the best balance between capturing complex patterns in the data and maintaining generalizability.

Performance with Default Threshold:
After selecting the best hyperparameters and using the default 0.5 decision threshold, the final model achieves fairly high accuracy with a recall of roughly 0.70.

While the accuracy is fairly high, it is the recall that we particularly value, given the operational importance of catching as many prolonged outages as possible.

Adjusting the Decision Threshold:
We further improved recall by lowering the prediction threshold from 0.5 to 0.2.

Although accuracy and precision dip slightly (accuracy decreased by less than 0.05), recall increased by about 0.21 to more than 0.90, meaning fewer truly prolonged outages go undetected—critical in real-world scenarios where failing to anticipate a prolonged outage can have serious consequences.
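Threshold adjustment itself is a small post-processing step on predicted probabilities. The sketch below illustrates it on synthetic data; lowering the threshold can only add positive predictions, so recall can never decrease.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real features and labels.
X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=42, class_weight='balanced').fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]       # P(prolonged) for each outage
pred_default = (proba >= 0.5).astype(int)   # default decision threshold
pred_low = (proba >= 0.2).astype(int)       # lowered threshold favors recall

print('recall @0.5:', recall_score(y_te, pred_default))
print('recall @0.2:', recall_score(y_te, pred_low))
```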

Conclusion and Next Steps:
By incorporating additional engineered features, switching from logistic regression to a more flexible random forest approach, and fine-tuning hyperparameters through 5-fold cross-validation, our final model substantially improves recall while maintaining a reasonable overall accuracy.

Metric Baseline Final Difference
Accuracy 0.6287 0.7600 0.1313
F1-score 0.2785 0.7708 0.4923
Precision 0.5238 0.6687 0.1449
Recall 0.1897 0.9098 0.7201

The final model shows substantial improvements over the baseline. In particular, the recall increased by around 0.72, highlighting a much better ability to detect prolonged outages—a critical factor in our application. Overall, the enhancements in accuracy, F1-score, and precision confirm that our advanced feature engineering and model tuning have significantly improved predictive performance.

Fairness Analysis

Fairness Analysis by Season (Cold vs. Warm Months)

We investigated whether our final model’s recall for detecting prolonged outages differs between two seasonal groups because seasonal variations can have a profound impact on both the frequency and the characteristics of power outages. Cold months often bring harsh weather conditions such as snowstorms and ice, which can lead to more severe and prolonged outages, while warm months may be associated with different stressors like heat waves or thunderstorms.

These differences may affect the model’s ability to accurately detect prolonged outages across seasons. By comparing recall between cold and warm months, we can identify whether our model performs consistently throughout the year or if there are seasonal biases that need to be addressed, ensuring the model’s robustness and fairness in real-world applications.

Hypotheses

Null Hypothesis: The model’s recall is the same for cold and warm months; any observed difference is due to random chance.

Alternative Hypothesis: The model’s recall differs between cold and warm months.

We explored the model’s performance by splitting the test data into two seasonal groups—cold months (October through March) and warm months (April through September)—to see whether the model’s recall differs by season. This initial exploratory data analysis revealed a modest gap in recall between cold and warm months, prompting us to conduct a permutation test to assess whether this difference could be attributed to chance.

Test Statistic and Significance Level

Test Statistic: The difference in recall between the cold-month and warm-month groups. Significance Level: α = 0.05.

Observed Results

Since our p-value (0.1654) is greater than the 5% significance level (α = 0.05), we fail to reject the null hypothesis. While the model’s recall appears higher during cold months, this difference could plausibly arise by chance rather than indicating a systematic bias. Therefore, based on this test, we do not have sufficient evidence to conclude that the model’s performance is unfairly skewed toward one seasonal group.
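The seasonal recall comparison can be sketched as the illustrative permutation test below (not the project's actual code): the season labels are shuffled while true labels and predictions stay fixed, and the absolute recall gap is recomputed each time.

```python
import numpy as np

def recall_gap_perm_test(y_true, y_pred, is_cold, n_perm=1000, seed=7):
    """Permutation test for a recall difference between cold and warm months."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    is_cold = np.asarray(is_cold, dtype=bool)

    def recall(mask):
        # Fraction of actual positives in the group that were predicted positive.
        pos = y_true[mask] == 1
        return (y_pred[mask][pos] == 1).mean()

    def gap(season):
        return abs(recall(season) - recall(~season))

    observed = gap(is_cold)
    # Shuffle which records count as "cold" to build the null distribution.
    null = np.array([gap(rng.permutation(is_cold)) for _ in range(n_perm)])
    return (null >= observed).mean()  # p-value
```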