Blackout Breakdown
Predictive Analysis for Forecasting Prolonged Outages
Authors: David Chew, Derek Kuang
Table of Contents
- Project Overview
- Introduction
- Cleaning and Exploratory Data Analysis
- Hypothesis Testing
- Baseline Model
- Final Model
- Fairness Analysis
Project Overview
This is a data science project on predicting the causes and duration of major power outages. The dataset used to explore the topic can be found here. This project is for DSC80 at UCSD.
Introduction
Have you ever been caught off guard by a sudden power outage? In today’s society, outages disrupt businesses, endanger public safety, and lead to significant economic losses. This raises our central question:
“What are the primary causes of power outages in different regions of the United States, considering factors such as weather, geography, and infrastructure? How do these causes impact outcomes like economic loss, outage duration, and other related effects?”
Understanding these factors is crucial for guiding infrastructure investments and emergency response strategies, ultimately building more resilient power systems. Our analysis uses a dataset of 1,534 major outages across the United States from 2001 to 2016, which captures both the duration and extent of outages along with regional context.
Here are some of the columns we considered relevant to our analysis:
| Column Name | Description |
|---|---|
| YEAR | Year in which the outage took place. |
| MONTH | Month of the outage occurrence. |
| U.S._STATE | U.S. state where the outage occurred. |
| NERC.REGION | NERC region(s) impacted by the outage. |
| CLIMATE.REGION | Classification of the area's climate region. |
| ANOMALY.LEVEL | ONI index indicating El Niño/La Niña conditions. |
| CLIMATE.CATEGORY | Classification of climate episodes. |
| OUTAGE.START.DATE | Date (day of the year) when the outage began. |
| OUTAGE.START.TIME | Start time of the outage. |
| OUTAGE.RESTORATION.DATE | Date (day of the year) when power was restored. |
| OUTAGE.RESTORATION.TIME | Time when power was restored. |
| CAUSE.CATEGORY | Primary cause category of the outage. |
| CAUSE.CATEGORY.DETAIL | Detailed description of the cause categories. |
| OUTAGE.DURATION | Length of the outage. |
| DEMAND.LOSS.MW | Peak demand loss in megawatts. |
| CUSTOMERS.AFFECTED | Number of customers impacted by the outage. |
| RES.PRICE | Residential sector monthly electricity price. |
| COM.PRICE | Commercial sector monthly electricity price. |
| IND.PRICE | Industrial sector monthly electricity price. |
| TOTAL.PRICE | Overall average monthly electricity price. |
| RES.SALES | Residential electricity consumption. |
| COM.SALES | Commercial electricity consumption. |
| IND.SALES | Industrial electricity consumption. |
| TOTAL.SALES | Combined total electricity consumption. |
| RES.PERCEN | Share of electricity consumption from the residential sector. |
| COM.PERCEN | Share of electricity consumption from the commercial sector. |
| IND.PERCEN | Share of electricity consumption from the industrial sector. |
| RES.CUSTOMERS | Number of residential customers served annually. |
| COM.CUSTOMERS | Number of commercial customers served annually. |
| IND.CUSTOMERS | Number of industrial customers served annually. |
| TOTAL.CUSTOMERS | Total number of customers served annually. |
| RES.CUST.PCT | Proportion of customers in the residential sector. |
| COM.CUST.PCT | Proportion of customers in the commercial sector. |
| IND.CUST.PCT | Proportion of customers in the industrial sector. |
| POPPCT_URBAN | Percentage of the state's population residing in urban areas. |
| POPPCT_UC | Percentage of the state's population living in urban clusters. |
| POPDEN_URBAN | Population density in urban areas. |
| POPDEN_UC | Population density in urban clusters. |
| POPDEN_RURAL | Population density in rural areas. |
| AREAPCT_URBAN | Percentage of state land area occupied by urban areas. |
| AREAPCT_UC | Percentage of state land area covered by urban clusters. |
Data Cleaning and Exploratory Data Analysis
Data Cleaning
- Data Copy and Indexing: Converted the file from xlsx to csv and removed extra rows/columns. Created a copy of the raw dataset to preserve the original data and set the `OBS` column as the index, ensuring each observation is uniquely identified and facilitating consistent referencing during further analysis.
- Datetime Conversion: Combined `OUTAGE.START.DATE` and `OUTAGE.START.TIME` into a single `OUTAGE.START` datetime column, and similarly combined `OUTAGE.RESTORATION.DATE` and `OUTAGE.RESTORATION.TIME` into `OUTAGE.RESTORATION`. This standardizes time-based data for temporal analyses; any unparseable entries are coerced to missing values.
- Handling Missing Month and Year: Filled missing values in the `MONTH` and `YEAR` columns with 0 and converted them to integers, keeping these temporal fields in a consistent numerical format for time-series analysis.
- Handling Missing Numeric Values: Replaced zero values in critical numerical columns such as `OUTAGE.DURATION`, `CUSTOMERS.AFFECTED`, and `DEMAND.LOSS.MW` with NaN, preventing unrecorded data from being misinterpreted as valid values.
- Renaming and Standardizing Column Names: Renamed `U.S._STATE` to `US.STATE`, then standardized all column names by converting them to lowercase and replacing periods with underscores. This uniform convention reduces errors during data manipulation and simplifies code readability.
- Stripping Whitespace: Removed leading and trailing whitespace from all string columns so that categorical values are consistent during grouping, merging, and analysis.
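The steps above can be sketched in pandas. This is a minimal illustration of the cleaning logic, not the project's exact code; column names follow the raw dataset:

```python
import numpy as np
import pandas as pd

def clean_outages(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Combine date and time into a single datetime; unparseable entries become NaT.
    df["OUTAGE.START"] = pd.to_datetime(
        df["OUTAGE.START.DATE"].astype(str) + " " + df["OUTAGE.START.TIME"].astype(str),
        errors="coerce",
    )

    # Fill missing month/year with 0 and store them as integers.
    for col in ["MONTH", "YEAR"]:
        df[col] = df[col].fillna(0).astype(int)

    # Zeros in the impact columns are unrecorded values, not true zeros.
    for col in ["OUTAGE.DURATION", "CUSTOMERS.AFFECTED", "DEMAND.LOSS.MW"]:
        df[col] = df[col].replace(0, np.nan)

    # Rename, then standardize: lowercase with underscores instead of periods.
    df = df.rename(columns={"U.S._STATE": "US.STATE"})
    df.columns = [c.lower().replace(".", "_") for c in df.columns]

    # Strip stray whitespace from every string column.
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()

    return df
```

The same pattern extends to `OUTAGE.RESTORATION` by combining the restoration date and time columns.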
The first five rows of the cleaned DataFrame (a subset of columns is shown for readability):

| obs | year | month | us_state | climate_region | cause_category | outage_duration | demand_loss_mw | customers_affected | outage_start |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2011 | 7 | Minnesota | East North Central | severe weather | 3060.0 | NaN | 70000.0 | 2011-07-01 17:00:00 |
| 2 | 2014 | 5 | Minnesota | East North Central | intentional attack | 1.0 | NaN | NaN | 2014-05-11 18:38:00 |
| 3 | 2010 | 10 | Minnesota | East North Central | severe weather | 3000.0 | NaN | 70000.0 | 2010-10-26 20:00:00 |
| 4 | 2012 | 6 | Minnesota | East North Central | severe weather | 2550.0 | NaN | 68200.0 | 2012-06-19 04:30:00 |
| 5 | 2015 | 7 | Minnesota | East North Central | severe weather | 1740.0 | 250.0 | 250000.0 | 2015-07-18 02:00:00 |
Univariate Analysis
Frequency of Power Outage Causes
In this bar chart, each bar represents a distinct detailed cause category (e.g., severe weather, intentional attack, and other specific causes), with the height of the bar indicating the number of outages associated with that detailed cause. The color differentiation and rotated labels help improve readability, making it easy to identify which detailed causes are most common.
Notably, “vandalism” emerges as the leading cause, followed by weather-related factors like “thunderstorm” and “winter storm,” suggesting that both human-driven and severe weather events significantly affect power reliability.
Outage Frequency by State
We used a Folium-based heat map to visualize the total number of power outages in each U.S. state, shading states with higher outage counts in darker colors. This makes it easy to spot regions with frequent disruptions at a glance on the U.S. map.
Notably, states such as California, Washington, Texas, Illinois, and New York stand out with especially high outage counts. This pattern suggests that these regions may be more vulnerable to power disruptions.
Bivariate Analysis
Comparison of Outage Duration by Cause Category
In this visualization, we explore the relationship between outage duration and its cause category to identify trends in power disruptions. Using a box plot, we can examine the spread, median, and outliers for each cause of outages. This allows us to assess which causes tend to result in longer or more variable power outages. By comparing different categories, we can gain insights into the most disruptive factors affecting power reliability.
Notably, fuel supply emergencies show a wide range of outage durations, from brief disruptions to prolonged events, while severe weather and fuel supply emergencies are the only categories with major outliers. In contrast, intentional attacks, system operability disruptions, and equipment failures tend to have shorter, more consistent durations with fewer extreme cases, and the median outage duration varies across categories.
Percentage of Outages by Climate Region and Cause Category
This visualization examines the percentage of outages by climate region and cause category, providing a regional perspective on power disruption causes. By using a grouped bar chart, we can see how different outage causes contribute to the total outages in each climate region. This allows us to identify patterns in outage causes across geographic areas, highlighting regional vulnerabilities and differences in power grid reliability.
Notably, severe weather is the dominant cause of outages in most regions—especially in the Central, East North Central, South, and Southeast—while intentional attacks are more prevalent in the Northwest and Southwest. The West along with the West North Central show a more balanced mix of causes. In addition, system operability disruptions, equipment failures, and public appeals occur across all regions but at much lower rates compared to severe weather and intentional attacks.
Interesting Aggregates
Temporal Trends in Outage Impacts and Climate Anomalies by Region
In this visualization, we aggregate our dataset by year and climate region to examine temporal trends in power outage impacts. The pivot table computes the average outage duration and mean anomaly level for each combination of year and climate region, which are then displayed as two line charts in a dual subplot layout. The left chart reveals how average outage duration has evolved over time across different climate regions, while the right chart shows the corresponding fluctuations in the mean anomaly level.
| year | climate_region | avg_outage_duration | mean_anomaly_level |
|---|---|---|---|
| 2000 | Central | 1200 | -0.6 |
| 2000 | Northeast | 681 | -0.9 |
| 2000 | South | 903 | -0.833333 |
| 2000 | Southeast | 5384 | -0.95 |
| 2000 | Southwest | 66 | -0.833333 |
In these two plots, each climate region exhibits distinct patterns over time for both average outage duration (left) and mean anomaly level (right). For instance, certain regions display sharp spikes in outage duration around 2010–2011, while others remain more stable. Simultaneously, changes in the mean anomaly level—often related to El Niño or La Niña events—tend to coincide with fluctuations in outage severity. By comparing these trends side by side, we gain insight into how climate variability may exacerbate or mitigate prolonged outages across different regions, helping us pinpoint periods and locations most at risk.
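The aggregation behind the table and plots above can be reproduced with a pivot table. The snippet below uses a toy slice of data (the values are illustrative); real results come from the full cleaned dataset:

```python
import pandas as pd

# Toy slice standing in for the cleaned outage data.
df = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001],
    "climate_region": ["Central", "Central", "South", "South"],
    "outage_duration": [1000.0, 1400.0, 903.0, 500.0],
    "anomaly_level": [-0.5, -0.7, -0.8, 0.2],
})

# Average outage duration and mean anomaly level per (year, climate_region).
trends = df.pivot_table(
    index=["year", "climate_region"],
    values=["outage_duration", "anomaly_level"],
    aggfunc="mean",
).rename(columns={
    "outage_duration": "avg_outage_duration",
    "anomaly_level": "mean_anomaly_level",
})
print(trends)
```

Each column of `trends` then feeds one of the two line charts, with one line per climate region.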
Urbanization and Outage Impact Relationship
These three bar charts compare key outage metrics across four levels of urbanization (Low, Medium-Low, Medium-High, High), segmented by the percentage of each region’s population living in urban areas. The left chart shows the average outage duration, the middle chart highlights the total number of customers affected, and the right chart displays the average demand loss.
| urban_quantile | avg_outage_duration | total_customers_affected | average_demand_loss |
|---|---|---|---|
| Low | 3374.76 | 3.72032e+07 | 535.273 |
| Medium-Low | 2009.87 | 2.61275e+07 | 575.86 |
| Medium-High | 3283.56 | 4.59132e+07 | 836.264 |
| High | 2124.32 | 4.72668e+07 | 843.425 |
These bar charts reveal that while areas with lower urbanization levels can experience relatively long average outages, more urbanized regions (particularly “Medium-High” and “High”) see the largest total number of customers affected, likely reflecting higher population density. Additionally, average demand loss appears greatest in the “High” group, underscoring the heavier infrastructure usage in densely populated areas. Taken together, these findings suggest that both ends of the urbanization spectrum face unique challenges—rural regions may endure prolonged outages, while highly urbanized areas suffer more extensive disruptions that impact a greater number of customers and place a larger strain on the power grid.
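One way to build the urbanization table above is to bin `poppct_urban` into quartiles with `pd.qcut` and aggregate per bin. The helper name and exact binning are our own sketch of this approach:

```python
import pandas as pd

def urbanization_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Bin regions into four urbanization levels and aggregate outage metrics."""
    labels = ["Low", "Medium-Low", "Medium-High", "High"]
    binned = df.assign(
        urban_quantile=pd.qcut(df["poppct_urban"], q=4, labels=labels)
    )
    return binned.groupby("urban_quantile", observed=True).agg(
        avg_outage_duration=("outage_duration", "mean"),
        total_customers_affected=("customers_affected", "sum"),
        average_demand_loss=("demand_loss_mw", "mean"),
    )
```

Each row of the result corresponds to one bar group in the three charts.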
Assessment of Missingness
NMAR Analysis
We believe that cause_category_detail is NMAR (Not Missing At Random) because its missing values are directly related to the nature of the outage itself. When an outage occurs under unusual, ambiguous, or difficult-to-diagnose circumstances, the specific cause details might not be recorded, either due to a lack of conclusive evidence or delays in investigation and reporting. This suggests that the missing values are not randomly distributed across all outages but instead correlated with the complexity or uncertainty of the outage cause.
There are several possible reasons why detailed cause information might not be reported:
- Investigative Challenges – If an outage is particularly complex or involves multiple contributing factors, it may take longer to determine a precise cause, leading to missing entries.
- Reporting Limitations – Certain outages may not require detailed reporting, especially if they fall outside regulatory thresholds for documentation.
- Human and System Factors – If the event occurs during an emergency or crisis, data collection may be deprioritized, resulting in gaps in the dataset.
To better understand whether cause_category_detail is truly NMAR, we would need additional data, such as incident reports, utility company reporting policies, or follow-up investigation records. If we can show that missing values are more frequent in certain types of outages (e.g., severe weather events or cyberattacks where precise attribution is difficult), this would confirm that the missingness is NMAR. However, if external factors like the region, company size, or outage duration explain the missingness, we might instead classify it as MAR (Missing At Random). Further testing will help validate our hypothesis.
Missingness Dependency
1. `outage_duration` and `cause_category_detail` (MAR: Missing at Random)
To determine whether the absence of cause details is related to outage duration, we conducted a permutation test. We compared the mean outage duration between records with missing cause details and those with non-missing cause details.
Hypotheses
- Null Hypothesis (H₀): The missingness of `cause_category_detail` is independent of outage duration; that is, the average outage duration is the same for records with and without cause details.
- Alternative Hypothesis (H₁): The average outage duration differs between records with missing and non-missing `cause_category_detail`, indicating that the missingness is not random with respect to outage duration.
Permutation Test Methodology
We first computed the observed difference in mean outage duration between the two groups (missing vs. non-missing cause details). Then, we randomly shuffled the missingness indicator across all records (keeping the outage duration values fixed) over 1,000 iterations to create a null distribution of differences. The p-value was calculated as the proportion of permuted differences that were as extreme as or more extreme than the observed difference.
With a p-value of 0.02, we reject the null hypothesis at the conventional significance level (α = 0.05). This indicates that the missingness in cause_category_detail is significantly associated with outage duration—records with missing cause details tend to have a different (either higher or lower) average outage duration compared to those with complete cause information.
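A sketch of this missingness permutation test (an illustrative helper, not the report's exact code):

```python
import numpy as np
import pandas as pd

def missingness_permutation_test(df, value_col, missing_col, n_perm=1000, seed=0):
    """Test whether missingness in `missing_col` depends on `value_col`."""
    rng = np.random.default_rng(seed)
    data = df.dropna(subset=[value_col])
    is_missing = data[missing_col].isna().to_numpy()
    values = data[value_col].to_numpy()

    def mean_diff(mask):
        # Difference in mean value between missing and non-missing groups.
        return values[mask].mean() - values[~mask].mean()

    observed = mean_diff(is_missing)
    # Shuffle the missingness indicator to build the null distribution.
    null = np.array([mean_diff(rng.permutation(is_missing)) for _ in range(n_perm)])
    # Two-sided: proportion of shuffled diffs at least as extreme as observed.
    p_value = np.mean(np.abs(null) >= abs(observed))
    return observed, p_value
```

Running this on `outage_duration` versus the missingness of `cause_category_detail` gives the p-value reported above.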
2. `outage_duration` and `month` (no significant dependency detected)
To examine whether the month of occurrence influences outage duration, we conducted a permutation test.
Hypotheses
- Null Hypothesis (H₀): The average outage duration is independent of the month. In other words, there is no significant difference in the mean outage duration across different months.
- Alternative Hypothesis (H₁): At least one month has a different average outage duration compared to the others, indicating that the month influences outage duration.
Permutation Test Methodology
Specifically, we computed the observed test statistic as the difference between the maximum and minimum average outage durations across months. Then, we randomly shuffled the month labels (keeping outage_duration values unchanged) for a large number of permutations (e.g., 1,000 iterations) to generate a null distribution of the test statistic. The p-value was calculated as the proportion of permuted test statistics that were as large as or larger than the observed difference.
With a p-value of 0.217, we fail to reject the null hypothesis at the conventional significance level (α = 0.05). This indicates that the variation in average outage duration across different months is not statistically significant, and thus, the month does not appear to be a factor influencing outage duration in our dataset.
Hypothesis Testing
In our exploration of how different causes of power outages might affect economic outcomes, we compared the average total_price (a proxy for economic impact) across various detailed cause categories. The bar chart below displays each cause category on the x‐axis and the corresponding mean total_price on the y‐axis.
By highlighting which causes are associated with higher average economic impact, we can begin to see whether certain outage triggers—like vandalism, storms, or equipment failure—tend to have more pronounced cost implications. This initial overview sets the stage for a deeper hypothesis test to determine if any observed differences in total_price are statistically meaningful.
Hypothesis
- Null Hypothesis (H₀): The mean economic impact (`total_price`) is equal for all `cause_category_detail` groups; that is, μ1 = μ2 = ... = μk, where each μi represents the mean `total_price` for a different detailed cause.
- Alternative Hypothesis (H₁): At least one `cause_category_detail` group has a mean economic impact that differs from the others.
We use a permutation test with the range of group means (maximum minus minimum) as the test statistic. Specifically, we calculate the mean `total_price` for each `cause_category_detail` group and define the observed test statistic T_obs as:
T_obs = max(μ1, μ2, ..., μk) - min(μ1, μ2, ..., μk)
where each μi represents the mean total_price for a different detailed cause.
This statistic captures the greatest disparity in economic impact across groups. In the permutation test, we randomly reassign the cause_category_detail labels across the dataset multiple times (e.g., 1,000 iterations) and compute the test statistic for each permutation. This process generates a null distribution of test statistic values (denoted as T values). The p-value is then calculated as the proportion of these permuted T values that are as extreme as (or more extreme than) the observed T_obs.
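The max-minus-min permutation test can be sketched as follows (a generic helper in our own naming, applied here to `cause_category_detail` and `total_price`):

```python
import numpy as np
import pandas as pd

def max_min_permutation_test(df, group_col, value_col, n_perm=1000, seed=0):
    """Test statistic: spread between the largest and smallest group means."""
    rng = np.random.default_rng(seed)
    data = df.dropna(subset=[group_col, value_col])
    values = data[value_col].to_numpy()
    groups = data[group_col].to_numpy()

    def spread(labels):
        means = pd.Series(values).groupby(labels).mean()
        return means.max() - means.min()

    observed = spread(groups)
    # Shuffle group labels to simulate the null of no group differences.
    null = np.array([spread(rng.permutation(groups)) for _ in range(n_perm)])
    p_value = np.mean(null >= observed)
    return observed, p_value
```

Because the statistic is nonnegative by construction, the p-value is the one-sided proportion of permuted spreads at least as large as the observed one.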
Explanation of Results
We chose a significance level of 0.05.
Our permutation test yielded a p-value of 0.481, well above our significance level of 0.05, so we fail to reject the null hypothesis. In practical terms, there is not enough evidence to conclude that the mean economic impact differs among the `cause_category_detail` groups; the observed disparity in means is consistent with what we would expect by chance alone. Consequently, based on this test, the detailed cause does not appear to have a statistically significant effect on economic impact, as measured by `total_price`, in our dataset.
Framing a Prediction Problem
Prediction Problem: Classifying Prolonged Outages
We frame a binary classification problem where the goal is to predict whether a newly reported outage will become “prolonged” (last 24 hours or more). Specifically, we define our response variable as:
- prolonged = 1 if `outage_duration` is greater than or equal to 1440 minutes (24 hours).
- prolonged = 0 otherwise.
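Creating this label is a one-liner on the cleaned frame; the helper name below is our own sketch (rows with an unrecorded duration cannot be labeled and are dropped):

```python
import pandas as pd

PROLONGED_THRESHOLD_MIN = 1440  # 24 hours expressed in minutes

def add_prolonged_label(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows where the duration is missing, then binarize at the threshold.
    df = df.dropna(subset=["outage_duration"]).copy()
    df["prolonged"] = (df["outage_duration"] >= PROLONGED_THRESHOLD_MIN).astype(int)
    return df
```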
Justification and Features
- Why Prolonged Outages?
  Prolonged outages (lasting 24 hours or more) can severely disrupt communities, businesses, and critical infrastructure. Identifying these outages early allows utilities and emergency services to allocate resources more effectively.
- Features:
  In our predictive model for classifying prolonged outages, we select only features that are known or estimable at the time of prediction, ensuring the model's practical utility in real-world scenarios.
  - Demographic and Geographic Features: `poppct_urban`, `poppct_uc`, `popden_urban`, `popden_uc`, `popden_rural`, `ratio_urban_rural`, `overall_density`. These capture the urbanization and population density characteristics of a region, which are critical for understanding the infrastructure's resilience and vulnerability to prolonged outages.
  - Climate and Weather-Related Features: `climate_region`, `climate_category`, `anomaly_level`, `climate_category_ord`. These variables provide insight into prevailing weather conditions and seasonal patterns that can influence outage duration, particularly during severe weather events.
  - Economic and Energy Consumption Features: `res_sales`, `ind_sales`, `com_sales`, `combined_sales`. These metrics reflect local economic activity and energy usage levels, acting as proxies for grid strain and the maintenance standards that might affect outage longevity.
  - Cause-Related Features: `cause_category`. This categorical variable indicates the primary trigger of an outage, offering additional context that may help differentiate between outages that resolve quickly and those that become prolonged.

By leveraging this diverse set of features, our model aims to accurately predict whether an outage will last 24 hours or more, providing a valuable early-warning tool for utilities and emergency services.
Evaluation Metric: Recall
We prioritize recall (true positive rate) because missing a prolonged outage (a false negative) is far more costly than mistakenly labeling a short outage as prolonged (a false positive). By focusing on recall, we reduce the risk of underestimating the severity of an outage, ensuring that critical response measures can be activated promptly when needed. Although we may monitor other metrics (like precision or F1-score) for a more complete picture, recall is our primary measure of success given the operational and safety implications of failing to identify a genuinely prolonged outage.
Baseline Model
Our baseline model is a logistic regression that uses three features: one nominal and two quantitative. We use the categorical feature climate_region (nominal), which is encoded using a one-hot encoder (dropping the first category to avoid redundancy), and the numerical features poppct_urban and popden_urban (both quantitative), which are standardized. The response variable, prolonged, is binary, making this a binary classification problem solved using Logistic Regression.
We chose logistic regression for our baseline model because it is a straightforward, well-understood algorithm for binary classification. Logistic regression estimates the probability of an outage being prolonged, which aligns well with our goal of identifying high-risk situations. Additionally, logistic regression is computationally efficient and less prone to overfitting with a limited set of features.
Below is an overview of the pipeline:
- Feature Selection and Encoding:
  - Categorical Feature: `climate_region` (nominal), one-hot encoded (with `drop='first'` to avoid collinearity).
  - Numerical Features: `poppct_urban` and `popden_urban` (both quantitative), scaled via `StandardScaler`.
- Pipeline Implementation:
We chain these preprocessing steps with a Logistic Regression classifier, ensuring all data transformations occur sequentially in one unified workflow. The pipeline is trained on 80% of the data (randomly selected), and the remaining 20% is used for testing.
```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Features selected for the baseline model
cat_features = ['climate_region']
num_features = ['poppct_urban', 'popden_urban']

# Define the preprocessing for each feature type with OneHotEncoder(drop='first')
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), cat_features),
        ('num', StandardScaler(), num_features)
    ]
)

# Build the baseline pipeline with a Logistic Regression classifier
baseline_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
```
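End to end, the training and evaluation loop looks roughly like this. Synthetic data stands in for the real cleaned frame, so the printed recall is illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned outage data.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "climate_region": rng.choice(["Central", "South", "West"], size=n),
    "poppct_urban": rng.uniform(40, 95, size=n),
    "popden_urban": rng.uniform(1000, 4000, size=n),
})
# Toy label loosely tied to density, just to exercise the pipeline.
df["prolonged"] = (df["popden_urban"] > 2500).astype(int)

cat_features = ["climate_region"]
num_features = ["poppct_urban", "popden_urban"]
model = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("cat", OneHotEncoder(drop="first"), cat_features),
        ("num", StandardScaler(), num_features),
    ])),
    ("classifier", LogisticRegression()),
])

# 80/20 split as described above; stratify keeps the class balance comparable.
X_train, X_test, y_train, y_test = train_test_split(
    df[cat_features + num_features], df["prolonged"],
    test_size=0.2, random_state=42, stratify=df["prolonged"],
)
model.fit(X_train, y_train)
print("test recall:", recall_score(y_test, model.predict(X_test)))
```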
- Model Performance:
  After training, we evaluated our baseline model on the test set. The metrics are as follows:
  - Accuracy: 0.6287
- F1-score: 0.2785
- Precision: 0.5238
- Recall: 0.1897
- Confusion Matrix Insight:
  A closer look at the confusion matrix reveals that although the model achieves a moderate overall accuracy, it consistently misclassifies a significant portion of actual prolonged outages (low recall). This suggests that our baseline approach tends to underestimate the "prolonged" class, likely due to a relatively small number of prolonged outages in the dataset or insufficiently predictive features for distinguishing these rare events.
- Interpretation and Next Steps:
  While this baseline model serves as a useful starting benchmark, its low recall for prolonged outages indicates a critical gap for practical applications, especially when failing to identify a truly prolonged outage could have severe operational consequences. To address this, future improvements may involve:
- Adding More Features: Incorporating additional relevant predictors (e.g., climate anomalies, population density metrics, or time-of-year effects) to enhance the model’s ability to recognize prolonged outages.
- Experimenting with Alternative Classifiers: Trying tree-based models or ensemble methods that may better capture complex relationships.
- Threshold Tuning or Class Weights: Adjusting the decision boundary or applying class weights to mitigate the imbalance and increase recall, thereby reducing the risk of missing genuine prolonged outages.
Final Model
Final Model: RandomForestClassifier with Feature Engineering and Hyperparameter Tuning
Motivation for Model Change:
After observing that our baseline logistic regression model struggled to identify prolonged outages (low recall), we switched to a RandomForestClassifier. This ensemble method often captures more complex relationships in the data than linear models, allowing it to better distinguish between prolonged and non-prolonged outages. Moreover, by setting class_weight='balanced', we place extra emphasis on the minority class (prolonged outages), addressing the recall deficit more directly.
Feature Engineering and Added Features:
Compared to the baseline, we incorporated additional feature engineering to enrich the model’s understanding of outage patterns:
- Population-Based Features: `ratio_urban_rural` (urban vs. rural density ratio) and `overall_density` (sum of urban, urban-cluster, and rural densities). These give the model insight into how variations in urban versus rural infrastructure and population density influence outage patterns.
- Climate Category Ordinal Encoding: `climate_category_ord` (cold = 1, normal = 2, warm = 3). This ordinal encoding of `climate_category` converts qualitative weather conditions into a measurable factor that reflects the severity of climatic influences.
- Sales Aggregation: `combined_sales`, the sum of residential, industrial, and commercial electricity sales (`res_sales`, `ind_sales`, `com_sales`). This aggregation provides a comprehensive indicator of energy consumption and economic activity, which can signal grid stress. Together, these enhancements allow our model to more accurately reflect the real-world data-generating process, leading to improved predictive performance.
These engineered features capture more nuanced insights about population distribution, climate conditions, and energy consumption—factors that may affect how long outages persist.
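The two feature-engineering steps referenced in the pipeline are custom scikit-learn transformers. A minimal sketch of what they might look like (the report does not show their exact implementation, so this is our reading of the feature definitions above):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class PopulationFeatureEngineer(BaseEstimator, TransformerMixin):
    """Adds ratio_urban_rural and overall_density from the density columns."""
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        X = X.copy()
        # Assumes popden_rural is nonzero for every row.
        X["ratio_urban_rural"] = X["popden_urban"] / X["popden_rural"]
        X["overall_density"] = X["popden_urban"] + X["popden_uc"] + X["popden_rural"]
        return X

class AdditionalFeatureEngineer(BaseEstimator, TransformerMixin):
    """Adds climate_category_ord and combined_sales."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X["climate_category_ord"] = X["climate_category"].map(
            {"cold": 1, "normal": 2, "warm": 3}
        )
        X["combined_sales"] = X["res_sales"] + X["ind_sales"] + X["com_sales"]
        return X
```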
```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# cat_features / num_features hold the categorical and numerical columns listed above.
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), cat_features),
        ('num', StandardScaler(), num_features)
    ]
)

# Note the addition of class_weight='balanced' to improve recall on the minority class.
pipeline = Pipeline([
    ('population_feature_engineer', PopulationFeatureEngineer()),
    ('additional_feature_engineer', AdditionalFeatureEngineer()),
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42, criterion='entropy',
                                          class_weight='balanced'))
])
```
Model Pipeline and Hyperparameter Tuning:
All transformations (feature engineering, one-hot encoding, scaling) and model training steps are unified in a single scikit-learn pipeline.
Hyperparameter tuning for our RandomForestClassifier is a nuanced process, aimed at refining the model to balance learning from the data with the ability to generalize to unseen data. We tuned several key parameters: max_depth, which controls the maximum number of levels in each decision tree (too low leads to underfitting and too high risks overfitting); n_estimators, the number of trees in the ensemble (more trees can capture more complexity but increase computation); and min_samples_split, which sets the minimum number of samples required to split an internal node (ensuring splits are made only when there is sufficient data).
We employed GridSearchCV with 5-fold cross-validation to systematically explore various combinations of these hyperparameters, allowing us to identify the configuration that best improves model performance while mitigating overfitting.
We found that the optimal settings for our RandomForestClassifier are: max_depth=None, min_samples_split=10, and n_estimators=300. This configuration achieves the best balance between capturing complex patterns in the data and maintaining generalizability.
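A sketch of that grid search is below. The stand-in pipeline and synthetic data replace the project's full preprocessing pipeline and real training split, and the grid values other than the reported optimum (`max_depth=None`, `min_samples_split=10`, `n_estimators=300`) are illustrative, as is the choice of `scoring='f1'`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Stand-in data and pipeline; in the project this is the full
# feature-engineering + preprocessing pipeline and the real train split.
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=42)
pipeline = Pipeline([
    ('classifier', RandomForestClassifier(random_state=42, class_weight='balanced'))
])

param_grid = {
    'classifier__max_depth': [None, 10, 20],
    'classifier__n_estimators': [100, 300],
    'classifier__min_samples_split': [2, 5, 10],
}

# 5-fold cross-validation over every combination in the grid.
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1', n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```

The `classifier__` prefix routes each hyperparameter to the named pipeline step, so the search tunes the model while re-running the whole pipeline inside each fold.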
Performance with Default Threshold:
After selecting the best hyperparameters, the final model achieves:
- Accuracy: 0.8036
- F1-score: 0.7611
- Precision: 0.8269
- Recall: 0.7049
While the accuracy is fairly high, it’s the recall that we particularly value, given the operational importance of catching as many prolonged outages as possible.
Adjusting the Decision Threshold:
We further improved recall by lowering the prediction threshold (from 0.5 to 0.2). This results in:
- Accuracy: 0.7600
- F1-score: 0.7708
- Precision: 0.6687
- Recall: 0.9098
Although accuracy and precision dip (accuracy decreased by less than 0.05), recall increased by about 0.21 to more than 0.90. This means fewer truly prolonged outages go undetected—critical in real-world scenarios where failing to anticipate a prolonged outage can have serious consequences.
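Threshold adjustment does not retrain the model; it re-cuts the predicted probabilities. A minimal sketch, using a stand-in classifier and synthetic data in place of our fitted pipeline and test split:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Stand-in for the project's fitted pipeline and held-out test data.
X, y = make_classification(n_samples=400, weights=[0.6, 0.4], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = RandomForestClassifier(random_state=42, class_weight='balanced').fit(X_train, y_train)

# Probability of the positive (prolonged-outage) class.
proba = clf.predict_proba(X_test)[:, 1]

# Default 0.5 cutoff vs. the lowered 0.2 cutoff used in our final model.
y_pred_default = (proba >= 0.5).astype(int)
y_pred_low = (proba >= 0.2).astype(int)

# Lowering the cutoff only flips predictions from 0 to 1, so recall can only
# stay the same or rise (at the cost of precision).
assert recall_score(y_test, y_pred_low) >= recall_score(y_test, y_pred_default)
```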
Conclusion and Next Steps:
By incorporating additional engineered features, switching from logistic regression to a more flexible random forest approach, and fine-tuning hyperparameters through 5-fold cross-validation, our final model substantially improves recall while maintaining a reasonable overall accuracy.
| Metric | Baseline | Final | Difference |
|---|---|---|---|
| Accuracy | 0.6287 | 0.7600 | 0.1313 |
| F1-score | 0.2785 | 0.7708 | 0.4923 |
| Precision | 0.5238 | 0.6687 | 0.1449 |
| Recall | 0.1897 | 0.9098 | 0.7201 |
The final model shows substantial improvements over the baseline. In particular, the recall increased by around 0.72, highlighting a much better ability to detect prolonged outages—a critical factor in our application. Overall, the enhancements in accuracy, F1-score, and precision confirm that our advanced feature engineering and model tuning have significantly improved predictive performance.
Fairness Analysis
Fairness Analysis by Season (Cold vs. Warm Months)
We investigated whether our final model’s recall for detecting prolonged outages differs between two seasonal groups because seasonal variations can have a profound impact on both the frequency and the characteristics of power outages. Cold months often bring harsh weather conditions such as snowstorms and ice, which can lead to more severe and prolonged outages, while warm months may be associated with different stressors like heat waves or thunderstorms.
These differences may affect the model’s ability to accurately detect prolonged outages across seasons. By comparing recall between cold and warm months, we can identify whether our model performs consistently throughout the year or if there are seasonal biases that need to be addressed, ensuring the model’s robustness and fairness in real-world applications.
- Group 1 (Cold Months): October, November, December, January, February, and March
- Group 2 (Warm Months): April, May, June, July, August, and September
Hypotheses
- Null Hypothesis (H₀): The model is fair with respect to seasonality; there is no difference in recall between cold and warm months.
- Alternative Hypothesis (H₁): The model is not fair; recall in cold months differs from that in warm months.
We explored the model’s performance by splitting the test data into two seasonal groups—cold months (October through March) and warm months (April through September)—to see whether the model’s recall differs by season. This initial exploratory data analysis revealed a modest gap in recall between cold and warm months, prompting us to conduct a permutation test to assess whether this difference could be attributed to chance.
Test Statistic and Significance Level
- Test Statistic: The difference in recall between Group 1 (Cold Months) and Group 2 (Warm Months).
- Significance Level (α): 0.05
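The permutation test can be sketched as follows: the cold/warm labels are shuffled across test examples many times, and the recall gap is recomputed under each shuffle to build the null distribution. Function and variable names here are illustrative, not our exact implementation.

```python
import numpy as np
from sklearn.metrics import recall_score

def recall_gap(y_true, y_pred, is_cold):
    """Recall on cold-month rows minus recall on warm-month rows."""
    return (recall_score(y_true[is_cold], y_pred[is_cold])
            - recall_score(y_true[~is_cold], y_pred[~is_cold]))

def permutation_test(y_true, y_pred, is_cold, n_perm=10_000, seed=42):
    rng = np.random.default_rng(seed)
    observed = recall_gap(y_true, y_pred, is_cold)
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffling season labels breaks any true season/recall association.
        shuffled = rng.permutation(is_cold)
        diffs[i] = recall_gap(y_true, y_pred, shuffled)
    # Two-sided p-value: fraction of shuffles at least as extreme as observed.
    p_value = np.mean(np.abs(diffs) >= abs(observed))
    return observed, p_value
```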
Observed Results
- Recall (Cold Months): 0.7679
- Recall (Warm Months): 0.6515
- Observed Difference (Cold − Warm): 0.1163
- Permutation Test p-value: 0.1654
Since our p-value (0.1654) is greater than the 5% significance level (α = 0.05), we fail to reject the null hypothesis. While the model’s recall appears higher during cold months, this difference could plausibly arise by chance rather than indicating a systematic bias. Therefore, based on this test, we do not have sufficient evidence to conclude that the model’s performance is unfairly skewed toward one seasonal group.