
Blackout Breakdown

Predictive Analysis for Forecasting Prolonged Outages

Authors: David Chew, Derek Kuang

Table of Contents

  1. Project Overview
  2. Introduction
  3. Data Cleaning and Exploratory Data Analysis
  4. Assessment of Missingness
  5. Hypothesis Testing
  6. Framing a Prediction Problem
  7. Baseline Model
  8. Final Model
  9. Fairness Analysis

Project Overview

This is a data science project on predicting the causes and duration of major power outages. The dataset used to explore the topic can be found here. This project was completed for DSC80 at UCSD.

Introduction

Have you ever been caught off guard by a sudden power outage? In today’s society, outages disrupt businesses, endanger public safety, and lead to significant economic losses. This raises our central question:
“What are the primary causes of power outages in different regions of the United States, considering factors such as weather, geography, and infrastructure? How do these causes impact outcomes like economic loss, outage duration, and other related effects?”

Understanding these factors is crucial for guiding infrastructure investments and emergency response strategies, ultimately building more resilient power systems. Our analysis uses a dataset of 1,534 major outages across the United States from 2001 to 2016, which captures both the duration and extent of outages along with regional context.

Here are some of the columns that we thought are relevant to our analysis:

Column Name Description
YEAR Year in which the outage took place.
MONTH Month of the outage occurrence.
U.S._STATE U.S. state where the outage occurred.
NERC.REGION NERC region(s) impacted by the outage.
CLIMATE.REGION Classification of the area’s climate region.
ANOMALY.LEVEL ONI index indicating El Niño/La Niña conditions.
CLIMATE.CATEGORY Classification of climate episodes.
OUTAGE.START.DATE Date (day of the year) when the outage began.
OUTAGE.START.TIME Start time of the outage.
OUTAGE.RESTORATION.DATE Date (day of the year) when power was restored.
OUTAGE.RESTORATION.TIME Time when power was restored.
CAUSE.CATEGORY Primary cause category of the outage.
CAUSE.CATEGORY.DETAIL Detailed description of the cause categories.
OUTAGE.DURATION Length of the outage.
DEMAND.LOSS.MW Peak demand loss in megawatts.
CUSTOMERS.AFFECTED Number of customers impacted by the outage.
RES.PRICE Residential sector monthly electricity price.
COM.PRICE Commercial sector monthly electricity price.
IND.PRICE Industrial sector monthly electricity price.
TOTAL.PRICE Overall average monthly electricity price.
RES.SALES Residential electricity consumption.
COM.SALES Commercial electricity consumption.
IND.SALES Industrial electricity consumption.
TOTAL.SALES Combined total electricity consumption.
RES.PERCEN Share of electricity consumption from the residential sector.
COM.PERCEN Share of electricity consumption from the commercial sector.
IND.PERCEN Share of electricity consumption from the industrial sector.
RES.CUSTOMERS Number of residential customers served annually.
COM.CUSTOMERS Number of commercial customers served annually.
IND.CUSTOMERS Number of industrial customers served annually.
TOTAL.CUSTOMERS Total number of customers served annually.
RES.CUST.PCT Proportion of customers in the residential sector.
COM.CUST.PCT Proportion of customers in the commercial sector.
IND.CUST.PCT Proportion of customers in the industrial sector.
POPPCT_URBAN Percentage of the state’s population residing in urban areas.
POPPCT_UC Percentage of the state’s population living in urban clusters.
POPDEN_URBAN Population density in urban areas.
POPDEN_UC Population density in urban clusters.
POPDEN_RURAL Population density in rural areas.
AREAPCT_URBAN Percentage of state land area occupied by urban areas.
AREAPCT_UC Percentage of state land area covered by urban clusters.

Data Cleaning and Exploratory Data Analysis

Data Cleaning

  1. Data Copy and Indexing: Converted the file from xlsx to csv format and removed extra rows and columns. Created a copy of the raw dataset to preserve the original data and set the OBS column as the index, ensuring each observation is uniquely identified and facilitating consistent referencing during further analysis.

  2. Datetime Conversion: Combined OUTAGE.START.DATE and OUTAGE.START.TIME into a single OUTAGE.START datetime column, and similarly combined OUTAGE.RESTORATION.DATE and OUTAGE.RESTORATION.TIME into OUTAGE.RESTORATION. This conversion standardizes time-based data, making it easier to perform temporal analyses, while any unparseable entries are coerced to missing values.

  3. Handling Missing Month and Year: Filled missing values in the MONTH and YEAR columns with 0 and converted them to integers. This step ensures these temporal fields are in a consistent numerical format, which is essential for accurate time-series analysis.

  4. Handling Missing Numeric Values: Replaced zero values in critical numerical columns such as OUTAGE.DURATION, CUSTOMERS.AFFECTED, and DEMAND.LOSS.MW with NaN. This prevents misinterpretation of missing or unrecorded data as valid values, leading to more reliable statistical analyses and modeling.

  5. Renaming and Standardizing Column Names: Renamed U.S._STATE to US.STATE and then standardized all column names by converting them to lowercase and replacing periods with underscores. This uniform naming convention reduces the likelihood of errors during data manipulation and simplifies code readability.

  6. Stripping Whitespace: Removed leading and trailing whitespace from all string columns. This cleaning step ensures that categorical values are consistent, preventing issues during grouping, merging, and analysis.
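The steps above can be sketched in pandas as follows. This is a minimal illustrative sketch, not the project's actual notebook code; it assumes the raw dataframe uses the original column names from the dataset.

```python
import numpy as np
import pandas as pd

def clean_outages(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of cleaning steps 1-6; assumes the raw outage dataframe."""
    df = df.copy().set_index('OBS')  # step 1: preserve the original, index by OBS

    # Step 2: combine date/time pairs; unparseable entries are coerced to NaT.
    df['OUTAGE.START'] = pd.to_datetime(
        df['OUTAGE.START.DATE'] + ' ' + df['OUTAGE.START.TIME'], errors='coerce')
    df['OUTAGE.RESTORATION'] = pd.to_datetime(
        df['OUTAGE.RESTORATION.DATE'] + ' ' + df['OUTAGE.RESTORATION.TIME'],
        errors='coerce')

    # Step 3: fill missing MONTH/YEAR with 0 and cast to integers.
    df[['MONTH', 'YEAR']] = df[['MONTH', 'YEAR']].fillna(0).astype(int)

    # Step 4: treat zeros in key numeric columns as missing.
    for col in ['OUTAGE.DURATION', 'CUSTOMERS.AFFECTED', 'DEMAND.LOSS.MW']:
        df[col] = df[col].replace(0, np.nan)

    # Step 5: rename U.S._STATE, then lowercase and replace periods.
    df = df.rename(columns={'U.S._STATE': 'US.STATE'})
    df.columns = [c.lower().replace('.', '_') for c in df.columns]

    # Step 6: strip leading/trailing whitespace from string columns.
    for col in df.select_dtypes('object'):
        df[col] = df[col].str.strip()
    return df
```

Because `errors='coerce'` is used in step 2, malformed date or time strings simply become missing values rather than raising errors.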

Below is a preview of the first five rows of the cleaned dataset (selected columns; the full table is too wide to display here):

obs year month us_state climate_region cause_category cause_category_detail outage_duration customers_affected
1 2011 7 Minnesota East North Central severe weather NaN 3060.0 70000.0
2 2014 5 Minnesota East North Central intentional attack vandalism 1.0 NaN
3 2010 10 Minnesota East North Central severe weather heavy wind 3000.0 70000.0
4 2012 6 Minnesota East North Central severe weather thunderstorm 2550.0 68200.0
5 2015 7 Minnesota East North Central severe weather NaN 1740.0 250000.0

Univariate Analysis

Frequency of Power Outage Causes

In this bar chart, each bar represents a distinct detailed cause category (e.g., severe weather, intentional attack, and other specific causes), with the height of the bar indicating the number of outages associated with that detailed cause. The color differentiation and rotated labels help improve readability, making it easy to identify which detailed causes are most common.

Notably, “vandalism” emerges as the leading cause, followed by weather-related factors like “thunderstorm” and “winter storm,” suggesting that both human-driven and severe weather events significantly affect power reliability.

Outage Frequency by State

We used a Folium‐based heat map to visualize the total number of power outages in each U.S. state, shading states with higher outage counts in darker colors. This approach makes it easy to spot regions with frequent disruptions at a glance on the U.S. map.

Notably, states such as California, Washington, Texas, Illinois, and New York stand out with especially high outage counts. This pattern suggests that these regions may be more vulnerable to power disruptions.

Bivariate Analysis

Comparison of Outage Duration by Cause Category

In this visualization, we explore the relationship between outage duration and its cause category to identify trends in power disruptions. Using a box plot, we can examine the spread, median, and outliers for each cause of outages. This allows us to assess which causes tend to result in longer or more variable power outages. By comparing different categories, we can gain insights into the most disruptive factors affecting power reliability.

Notably, fuel supply emergencies show a wide range of outage durations, from brief disruptions to prolonged events, while severe weather and fuel supply emergencies are the only categories with major outliers. In contrast, intentional attacks, system operability disruptions, and equipment failures tend to have shorter, more consistent durations with fewer extreme cases, and the median outage duration varies across categories.

Percentage of Outages By Climate Region and Cause Category

This visualization examines the percentage of outages by climate region and cause category, providing a regional perspective on power disruption causes. By using a grouped bar chart, we can see how different outage causes contribute to the total outages in each climate region. This allows us to identify patterns in outage causes across geographic areas, highlighting regional vulnerabilities and differences in power grid reliability.

Notably, severe weather is the dominant cause of outages in most regions—especially in the Central, East North Central, South, and Southeast—while intentional attacks are more prevalent in the Northwest and Southwest. The West along with the West North Central show a more balanced mix of causes. In addition, system operability disruptions, equipment failures, and public appeals occur across all regions but at much lower rates compared to severe weather and intentional attacks.

Interesting Aggregates

In this visualization, we aggregate our dataset by year and climate region to examine temporal trends in power outage impacts. The pivot table computes the average outage duration and mean anomaly level for each combination of year and climate region, which are then displayed as two line charts in a dual subplot layout. The left chart reveals how average outage duration has evolved over time across different climate regions, while the right chart shows the corresponding fluctuations in the mean anomaly level.

year climate_region avg_outage_duration mean_anomaly_level
2000 Central 1200 -0.6
2000 Northeast 681 -0.9
2000 South 903 -0.833333
2000 Southeast 5384 -0.95
2000 Southwest 66 -0.833333

In these two plots, each climate region exhibits distinct patterns over time for both average outage duration (left) and mean anomaly level (right). For instance, certain regions display sharp spikes in outage duration around 2010–2011, while others remain more stable. Simultaneously, changes in the mean anomaly level—often related to El Niño or La Niña events—tend to coincide with fluctuations in outage severity. By comparing these trends side by side, we gain insight into how climate variability may exacerbate or mitigate prolonged outages across different regions, helping us pinpoint periods and locations most at risk.
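The aggregation behind these plots can be reproduced with a pandas pivot table. The sketch below is illustrative rather than the project's exact code; it assumes the cleaned dataset with lowercase column names (`year`, `climate_region`, `outage_duration`, `anomaly_level`).

```python
import pandas as pd

def outage_trends(df: pd.DataFrame) -> pd.DataFrame:
    """Mean outage duration and anomaly level per (year, climate_region)."""
    pivot = df.pivot_table(
        index=['year', 'climate_region'],
        values=['outage_duration', 'anomaly_level'],
        aggfunc='mean',
    ).rename(columns={'outage_duration': 'avg_outage_duration',
                      'anomaly_level': 'mean_anomaly_level'})
    return pivot.reset_index()
```

Each row of the result corresponds to one year/region combination, ready to be split into the two line charts.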

Urbanization and Outage Impact Relationship

These three bar charts compare key outage metrics across four levels of urbanization (Low, Medium-Low, Medium-High, High), segmented by the percentage of each region’s population living in urban areas. The left chart shows the average outage duration, the middle chart highlights the total number of customers affected, and the right chart displays the average demand loss.

urban_quantile avg_outage_duration total_customers_affected average_demand_loss
Low 3374.76 3.72032e+07 535.273
Medium-Low 2009.87 2.61275e+07 575.86
Medium-High 3283.56 4.59132e+07 836.264
High 2124.32 4.72668e+07 843.425

These bar charts reveal that while areas with lower urbanization levels can experience relatively long average outages, more urbanized regions (particularly “Medium-High” and “High”) see the largest total number of customers affected, likely reflecting higher population density. Additionally, average demand loss appears greatest in the “High” group, underscoring the heavier infrastructure usage in densely populated areas. Taken together, these findings suggest that both ends of the urbanization spectrum face unique challenges—rural regions may endure prolonged outages, while highly urbanized areas suffer more extensive disruptions that impact a greater number of customers and place a larger strain on the power grid.
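One way to build the urbanization buckets above is with `pd.qcut`. This sketch assumes quartile-based bins and the cleaned column names (`poppct_urban`, `outage_duration`, `customers_affected`, `demand_loss_mw`); the exact binning used in the project may differ.

```python
import pandas as pd

def urbanization_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Bucket rows into urbanization quartiles and aggregate outage impact."""
    labels = ['Low', 'Medium-Low', 'Medium-High', 'High']
    df = df.assign(urban_quantile=pd.qcut(df['poppct_urban'], 4, labels=labels))
    return df.groupby('urban_quantile', observed=True).agg(
        avg_outage_duration=('outage_duration', 'mean'),
        total_customers_affected=('customers_affected', 'sum'),
        average_demand_loss=('demand_loss_mw', 'mean'),
    )
```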

Assessment of Missingness

NMAR Analysis

We believe that cause_category_detail is NMAR (Not Missing At Random) because its missing values are directly related to the nature of the outage itself. When an outage occurs under unusual, ambiguous, or difficult-to-diagnose circumstances, the specific cause details might not be recorded, either due to a lack of conclusive evidence or delays in investigation and reporting. This suggests that the missing values are not randomly distributed across all outages but instead correlated with the complexity or uncertainty of the outage cause.

There are several possible reasons why detailed cause information might not be reported, ranging from inconclusive evidence to delays in investigation and reporting.

To better understand whether cause_category_detail is truly NMAR, we would need additional data, such as incident reports, utility company reporting policies, or follow-up investigation records. If we can show that missing values are more frequent in certain types of outages (e.g., severe weather events or cyberattacks where precise attribution is difficult), this would confirm that the missingness is NMAR. However, if external factors like the region, company size, or outage duration explain the missingness, we might instead classify it as MAR (Missing At Random). Further testing will help validate our hypothesis.

Missingness Dependency

1. outage_duration and cause_category_detail (MAR - Missing at Random)

To determine whether the absence of cause details is related to outage duration, we conducted a permutation test. We compared the mean outage duration between records with missing cause details and those with non-missing cause details.

Hypotheses

Null Hypothesis: The missingness of cause_category_detail does not depend on outage duration; any difference in mean duration between the two groups is due to random chance.

Alternative Hypothesis: The missingness of cause_category_detail depends on outage duration.

Permutation Test Methodology

We first computed the observed difference in mean outage duration between the two groups (missing vs. non-missing cause details). Then, we randomly shuffled the missingness indicator across all records (keeping the outage duration values fixed) over 1,000 iterations to create a null distribution of differences. The p-value was calculated as the proportion of permuted differences that were as extreme as or more extreme than the observed difference.
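The procedure can be sketched as the helper below; this is illustrative code rather than the project's actual notebook, using the absolute difference in mean duration between the missing and non-missing groups as the statistic.

```python
import numpy as np

def missingness_perm_test(durations, is_missing, n_perm=1000, seed=42):
    """Permutation test: does mean duration differ by missingness of cause detail?"""
    rng = np.random.default_rng(seed)
    durations = np.asarray(durations, dtype=float)
    is_missing = np.asarray(is_missing, dtype=bool)

    def stat(mask):
        # Absolute difference in mean duration: missing vs. non-missing group.
        return abs(durations[mask].mean() - durations[~mask].mean())

    observed = stat(is_missing)
    # Shuffle the missingness indicator, keeping durations fixed.
    null = np.array([stat(rng.permutation(is_missing)) for _ in range(n_perm)])
    return (null >= observed).mean()  # p-value
```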

With a p-value of 0.02, we reject the null hypothesis at the conventional significance level (α = 0.05). This indicates that the missingness in cause_category_detail is significantly associated with outage duration—records with missing cause details tend to have a different (either higher or lower) average outage duration compared to those with complete cause information.

2. outage_duration and month (MCAR - Missing Completely at Random)

To examine whether the month of occurrence influences outage duration, we conducted a permutation test.

Hypotheses

Null Hypothesis: Average outage duration does not depend on the month of occurrence; any variation across months is due to random chance.

Alternative Hypothesis: Average outage duration depends on the month of occurrence.

Permutation Test Methodology

Specifically, we computed the observed test statistic as the difference between the maximum and minimum average outage durations across months. Then, we randomly shuffled the month labels (keeping outage_duration values unchanged) for a large number of permutations (e.g., 1,000 iterations) to generate a null distribution of the test statistic. The p-value was calculated as the proportion of permuted test statistics that were as large as or larger than the observed difference.

With a p-value of 0.217, we fail to reject the null hypothesis at the conventional significance level (α = 0.05). This indicates that the variation in average outage duration across different months is not statistically significant, and thus, the month does not appear to be a factor influencing outage duration in our dataset.

Hypothesis Testing

In our exploration of how different causes of power outages might affect economic outcomes, we compared the average total_price (a proxy for economic impact) across various detailed cause categories. The bar chart below displays each cause category on the x‐axis and the corresponding mean total_price on the y‐axis.

By highlighting which causes are associated with higher average economic impact, we can begin to see whether certain outage triggers—like vandalism, storms, or equipment failure—tend to have more pronounced cost implications. This initial overview sets the stage for a deeper hypothesis test to determine if any observed differences in total_price are statistically meaningful.

Hypothesis

Null Hypothesis: The mean total_price is the same across all cause_category_detail groups; any observed differences are due to random chance.

Alternative Hypothesis: At least one cause_category_detail group has a different mean total_price.

We use a permutation test with the range of group means as the test statistic. Specifically, we calculate the mean total_price for each cause_category_detail group and define the observed test statistic T_obs as:

T_obs = max(μ1, μ2, ..., μk) - min(μ1, μ2, ..., μk)

where each μi represents the mean total_price for a different detailed cause. This statistic captures the greatest disparity in economic impact across groups. In the permutation test, we randomly reassign the cause_category_detail labels across the dataset multiple times (e.g., 1,000 iterations) and compute the test statistic for each permutation. This process generates a null distribution of test statistic values (denoted as T values). The p-value is then calculated as the proportion of these permuted T values that are as extreme as (or more extreme than) the observed T_obs.
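A sketch of this range-of-means permutation test, written as an illustrative helper rather than the project's actual code:

```python
import numpy as np

def range_perm_test(values, labels, n_perm=1000, seed=0):
    """Permutation test using the range of group means (max - min) as T."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)

    def t_stat(labs):
        # Greatest disparity in group means across the label groups.
        means = [values[labs == g].mean() for g in np.unique(labs)]
        return max(means) - min(means)

    t_obs = t_stat(labels)
    # Reassign labels at random, keeping the values fixed.
    null = np.array([t_stat(rng.permutation(labels)) for _ in range(n_perm)])
    return t_obs, (null >= t_obs).mean()
```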

Explanation of Results

The significance level that we picked is 0.05.

Our permutation test yielded a p-value of 0.481. This p-value is well above our significance level of 0.05, so we fail to reject the null hypothesis. In practical terms, this result indicates that there is not enough evidence to conclude that the mean economic impact differs among the different cause_category_detail groups. The observed disparity in means is consistent with what we would expect to see by chance alone. Consequently, based on this test, the detailed cause does not appear to have a statistically significant effect on the economic impact, as measured by total_price, in our dataset.

Framing a Prediction Problem

Prediction Problem: Classifying Prolonged Outages

We frame a binary classification problem where the goal is to predict whether a newly reported outage will become “prolonged” (last 24 hours or more). Specifically, we define our response variable, prolonged, as an indicator that equals 1 when the outage lasts 24 hours or more and 0 otherwise.
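A minimal sketch of constructing this label with pandas, assuming outage_duration is recorded in minutes (so 24 hours corresponds to 1440):

```python
import pandas as pd

def add_prolonged_label(df: pd.DataFrame, threshold_minutes: int = 24 * 60) -> pd.DataFrame:
    """Flag outages lasting at least 24 hours (assumes duration in minutes)."""
    return df.assign(prolonged=(df['outage_duration'] >= threshold_minutes).astype(int))
```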

Justification and Features

Evaluation Metric: Recall

We prioritize recall (true positive rate) because missing a prolonged outage (a false negative) is far more costly than mistakenly labeling a short outage as prolonged (a false positive). By focusing on recall, we reduce the risk of underestimating the severity of an outage, ensuring that critical response measures can be activated promptly when needed. Although we may monitor other metrics (like precision or F1-score) for a more complete picture, recall is our primary measure of success given the operational and safety implications of failing to identify a genuinely prolonged outage.

Baseline Model

Our baseline model is a logistic regression that uses three features: one nominal and two quantitative. We use the categorical feature climate_region (nominal), which is encoded using a one-hot encoder (dropping the first category to avoid redundancy), and the numerical features poppct_urban and popden_urban (both quantitative), which are standardized. The response variable, prolonged, is binary, making this a binary classification problem solved using Logistic Regression.

We chose logistic regression for our baseline model because it is a straightforward, well-understood algorithm for binary classification. Logistic regression estimates the probability of an outage being prolonged, which aligns well with our goal of identifying high-risk situations. Additionally, logistic regression is computationally efficient and less prone to overfitting with a limited set of features.

Below is an overview of the pipeline:

  1. Feature Selection and Encoding:
    • Categorical Feature: climate_region (nominal), one-hot encoded (with drop='first' to avoid collinearity).
    • Numerical Features: poppct_urban and popden_urban (both quantitative), scaled via StandardScaler.
  2. Pipeline Implementation:
    We chain these preprocessing steps with a Logistic Regression classifier, ensuring all data transformations occur sequentially in one unified workflow. The pipeline is trained on 80% of the data (randomly selected), and the remaining 20% is used for testing.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Feature groups used by the baseline model
cat_features = ['climate_region']
num_features = ['poppct_urban', 'popden_urban']

# Define the preprocessing for each feature type with OneHotEncoder(drop='first')
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), cat_features),
        ('num', StandardScaler(), num_features)
    ]
)

# Build the baseline pipeline with a Logistic Regression classifier
baseline_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
  3. Model Performance:
    After training, we evaluated our baseline model on the test set. The metrics are as follows:
    • Accuracy: 0.6287
    • F1-score: 0.2785
    • Precision: 0.5238
    • Recall: 0.1897
  4. Confusion Matrix Insight:
    A closer look at the confusion matrix reveals that although the model achieves a moderate overall accuracy, it consistently misclassifies a significant portion of actual prolonged outages (low recall). This suggests that our baseline approach tends to underestimate the “prolonged” class, likely due to a relatively small number of prolonged outages in the dataset or insufficiently predictive features for distinguishing these rare events.

  5. Interpretation and Next Steps:

While this baseline model serves as a useful starting benchmark, its low recall for prolonged outages indicates a critical gap for practical applications—especially when failing to identify a truly prolonged outage could have severe operational consequences. To address this, future improvements may involve richer feature engineering, a more flexible model, and explicit handling of class imbalance.
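For reference, training and evaluating the baseline pipeline on an 80/20 split might look like the sketch below. The data here is synthetic stand-in data (not the real outage table), so the printed numbers will not match the metrics reported above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({  # synthetic stand-in for the real outage data
    'climate_region': rng.choice(['Central', 'South', 'Northeast'], size=n),
    'poppct_urban': rng.uniform(40, 95, size=n),
    'popden_urban': rng.uniform(500, 5000, size=n),
})
# Toy label loosely tied to urbanization so the model has signal to learn.
df['prolonged'] = (df['poppct_urban'] + rng.normal(0, 10, size=n) > 70).astype(int)

cat_features, num_features = ['climate_region'], ['poppct_urban', 'popden_urban']
baseline_pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer([
        ('cat', OneHotEncoder(drop='first'), cat_features),
        ('num', StandardScaler(), num_features),
    ])),
    ('classifier', LogisticRegression()),
])

# 80/20 split, fit, and evaluate on the held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    df[cat_features + num_features], df['prolonged'], test_size=0.2, random_state=42)
baseline_pipeline.fit(X_train, y_train)
y_pred = baseline_pipeline.predict(X_test)
print('recall:', recall_score(y_test, y_pred))
```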

Final Model

Final Model: RandomForestClassifier with Feature Engineering and Hyperparameter Tuning

Motivation for Model Change:
After observing that our baseline logistic regression model struggled to identify prolonged outages (low recall), we switched to a RandomForestClassifier. This ensemble method often captures more complex relationships in the data than linear models, allowing it to better distinguish between prolonged and non-prolonged outages. Moreover, by setting class_weight='balanced', we place extra emphasis on the minority class (prolonged outages), addressing the recall deficit more directly.

Feature Engineering and Added Features:
Compared to the baseline, we incorporated additional feature engineering, implemented as two custom pipeline transformers (PopulationFeatureEngineer and AdditionalFeatureEngineer), to enrich the model’s understanding of outage patterns. These engineered features capture more nuanced insights about population distribution, climate conditions, and energy consumption—factors that may affect how long outages persist.

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# PopulationFeatureEngineer and AdditionalFeatureEngineer are custom
# transformers defined elsewhere in the project; cat_features and
# num_features list the categorical and numerical columns after
# feature engineering.
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), cat_features),
        ('num', StandardScaler(), num_features)
    ]
)

# Note the addition of class_weight='balanced' to improve recall on the minority class.
pipeline = Pipeline([
    ('population_feature_engineer', PopulationFeatureEngineer()),
    ('additional_feature_engineer', AdditionalFeatureEngineer()),
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42, criterion='entropy', class_weight='balanced'))
])

Model Pipeline and Hyperparameter Tuning:
All transformations (feature engineering, one-hot encoding, scaling) and model training steps are unified in a single scikit-learn pipeline.

Hyperparameter tuning for our RandomForestClassifier is a nuanced process, aimed at refining the model to balance learning from the data with the ability to generalize to unseen data. We tuned several key parameters: max_depth, which controls the maximum number of levels in each decision tree (too low leads to underfitting and too high risks overfitting); n_estimators, the number of trees in the ensemble (more trees can capture more complexity but increase computation); and min_samples_split, which sets the minimum number of samples required to split an internal node (ensuring splits are made only when there is sufficient data).

We employed GridSearchCV with 5-fold cross-validation to systematically explore various combinations of these hyperparameters, allowing us to identify the configuration that best improves model performance while mitigating overfitting.
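The grid search might look like the sketch below. The grid values are illustrative (chosen to include the best settings reported next), the data is synthetic rather than the real outage table, and scoring='recall' reflects the metric we prioritize.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the engineered feature matrix.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.7, 0.3],
                           random_state=42)

# Illustrative grid covering the three tuned hyperparameters.
param_grid = {
    'max_depth': [None, 10],
    'n_estimators': [100, 300],
    'min_samples_split': [2, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42, criterion='entropy',
                           class_weight='balanced'),
    param_grid, cv=5, scoring='recall', n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```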

We found that the optimal settings for our RandomForestClassifier are: max_depth=None, min_samples_split=10, and n_estimators=300. This configuration achieves the best balance between capturing complex patterns in the data and maintaining generalizability.

Performance with Default Threshold:
After selecting the best hyperparameters and using the default 0.5 decision threshold, the final model achieves fairly high accuracy with a recall of roughly 0.70.

While the accuracy is fairly high, it is the recall that we particularly value, given the operational importance of catching as many prolonged outages as possible.

Adjusting the Decision Threshold:
We further improved recall by lowering the prediction threshold from 0.5 to 0.2.

Although accuracy and precision dip slightly (accuracy decreased by less than 0.05), recall increased by about 0.21 to more than 0.90, meaning fewer truly prolonged outages go undetected—critical in real-world scenarios where failing to anticipate a prolonged outage can have serious consequences.
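Threshold adjustment itself is a small post-processing step on predicted probabilities. The sketch below illustrates it on synthetic data; lowering the threshold can only add positive predictions, so recall can never decrease.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real features and labels.
X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=42, class_weight='balanced').fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]       # P(prolonged) for each outage
pred_default = (proba >= 0.5).astype(int)   # default decision threshold
pred_low = (proba >= 0.2).astype(int)       # lowered threshold favors recall

print('recall @0.5:', recall_score(y_te, pred_default))
print('recall @0.2:', recall_score(y_te, pred_low))
```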

Conclusion and Next Steps:
By incorporating additional engineered features, switching from logistic regression to a more flexible random forest approach, and fine-tuning hyperparameters through 5-fold cross-validation, our final model substantially improves recall while maintaining a reasonable overall accuracy.

Metric Baseline Final Difference
Accuracy 0.6287 0.7600 0.1313
F1-score 0.2785 0.7708 0.4923
Precision 0.5238 0.6687 0.1449
Recall 0.1897 0.9098 0.7201

The final model shows substantial improvements over the baseline. In particular, the recall increased by around 0.72, highlighting a much better ability to detect prolonged outages—a critical factor in our application. Overall, the enhancements in accuracy, F1-score, and precision confirm that our advanced feature engineering and model tuning have significantly improved predictive performance.

Fairness Analysis

Fairness Analysis by Season (Cold vs. Warm Months)

We investigated whether our final model’s recall for detecting prolonged outages differs between two seasonal groups because seasonal variations can have a profound impact on both the frequency and the characteristics of power outages. Cold months often bring harsh weather conditions such as snowstorms and ice, which can lead to more severe and prolonged outages, while warm months may be associated with different stressors like heat waves or thunderstorms.

These differences may affect the model’s ability to accurately detect prolonged outages across seasons. By comparing recall between cold and warm months, we can identify whether our model performs consistently throughout the year or if there are seasonal biases that need to be addressed, ensuring the model’s robustness and fairness in real-world applications.

Hypotheses

Null Hypothesis: The model’s recall is the same for cold and warm months; any observed difference is due to random chance.

Alternative Hypothesis: The model’s recall differs between cold and warm months.

We explored the model’s performance by splitting the test data into two seasonal groups—cold months (October through March) and warm months (April through September)—to see whether the model’s recall differs by season. This initial exploratory data analysis revealed a modest gap in recall between cold and warm months, prompting us to conduct a permutation test to assess whether this difference could be attributed to chance.

Test Statistic and Significance Level

Test Statistic: The difference in recall between the cold-month and warm-month groups. Significance Level: α = 0.05.

Observed Results

Since our p-value (0.1654) is greater than the 5% significance level (α = 0.05), we fail to reject the null hypothesis. While the model’s recall appears higher during cold months, this difference could plausibly arise by chance rather than indicating a systematic bias. Therefore, based on this test, we do not have sufficient evidence to conclude that the model’s performance is unfairly skewed toward one seasonal group.
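The seasonal recall comparison can be sketched as the illustrative permutation test below (not the project's actual code): the season labels are shuffled while true labels and predictions stay fixed, and the absolute recall gap is recomputed each time.

```python
import numpy as np

def recall_gap_perm_test(y_true, y_pred, is_cold, n_perm=1000, seed=7):
    """Permutation test for a recall difference between cold and warm months."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    is_cold = np.asarray(is_cold, dtype=bool)

    def recall(mask):
        # Fraction of actual positives in the group that were predicted positive.
        pos = y_true[mask] == 1
        return (y_pred[mask][pos] == 1).mean()

    def gap(season):
        return abs(recall(season) - recall(~season))

    observed = gap(is_cold)
    # Shuffle which records count as "cold" to build the null distribution.
    null = np.array([gap(rng.permutation(is_cold)) for _ in range(n_perm)])
    return (null >= observed).mean()  # p-value
```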