Model Implementation

Classification 

Classification Purpose: Supervised classification models are applied to evaluate their ability to predict whether a water sample exceeds the maximum contaminant level (MCL) and is therefore harmful to humans.

Data limits: The models must be trained on labeled, balanced data. Furthermore, many of the models require numeric, continuous features. Model performance must then be assessed on a disjoint test set.

Recommended Use: Predicting whether a water sample exceeds the MCL is highly applicable to real-world situations, potentially eliminating the need for “boots on the ground” work.

For the following classification models, the model will attempt to predict whether a water sample exceeds the maximum contaminant level (MCL). The MCL is the established threshold determining whether water can be delivered to users of the public water system. In short, exceeding this level would likely result in adverse health impacts for users. Models will be fed basic information on when and where the sample was collected and then asked to predict whether the sample exceeds the MCL.

For the following analysis, adjustments to the original data frame were made to ensure model compatibility, including one-hot encoding of categorical variables.

The data was then split into training and test sets, with the training set containing 80% of the data. The training set was then under-sampled to provide adequate examples of observations that both exceed and do not exceed the MCL.
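A minimal sketch of this preprocessing step is shown below. It assumes a dataframe df with a binary Exceeds_MCL label and uses imbalanced-learn's RandomUnderSampler; the column name and the undersampling utility are illustrative assumptions, not the exact project code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler  # assumed undersampling utility

# Hypothetical label column; features are one-hot encoded for model compatibility
X = pd.get_dummies(df.drop(columns=["Exceeds_MCL"]))
y = df["Exceeds_MCL"]

# 80/20 train/test split, stratified so both labels appear in the held-out test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Under-sample the majority class so the training set is balanced
X_train_bal, y_train_bal = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
```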

To determine the best model for this binary classification problem, three models were created and their performance evaluated against one another: decision tree, naive Bayes, and support vector machine.

The naive Bayes model was included due to its speed and high interpretability, which lend themselves well to classification tasks. For this model, a multinomial naive Bayes was implemented to best handle count and frequency data.

Decision Tree

Multinomial Naive Bayes

As evidenced by the produced confusion matrix (right) and the above performance statistics, the multinomial naive Bayes model also performed adequately in the context of MCL prediction. However, the NB model over-predicts that an observation will exceed the threshold, and it mislabels about one quarter of the observations that do exceed the MCL. These inaccuracies are likely due to the model’s assumption of independence between features.
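As a rough illustration of how the multinomial naive Bayes model could be fit and evaluated, the sketch below reuses the balanced training set and held-out test set from the preprocessing sketch above (variable names are assumptions).

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

# MultinomialNB expects non-negative, count/frequency-like features
nb = MultinomialNB()
nb.fit(X_train_bal, y_train_bal)

nb_pred = nb.predict(X_test)
print(confusion_matrix(y_test, nb_pred))
print(classification_report(y_test, nb_pred))
```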

Snippet of pre-processed data

Snippet of processed training data

The decision tree model was included due to its high interpretability and strong predictive capabilities for binary classification. For this model, the Gini criterion was used, tree depth was limited to 5, and the minimum samples split was set to 1% to maximize model performance while preventing overfitting.
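A minimal sketch of a decision tree with these hyperparameters is shown below, again assuming the balanced training data from the earlier preprocessing sketch.

```python
from sklearn.tree import DecisionTreeClassifier

# Gini criterion, depth capped at 5, and a node must hold at least 1% of samples to split
tree = DecisionTreeClassifier(
    criterion="gini",
    max_depth=5,
    min_samples_split=0.01,
    random_state=42,
)
tree.fit(X_train_bal, y_train_bal)
tree_pred = tree.predict(X_test)
```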

As evidenced by the produced confusion matrix (left) and the above performance statistics, the tree performed quite well in predicting whether a sample will exceed the MCL, exhibiting a recall of 77% on the exceed-MCL label. However, the tree tends to over-predict that an observation will exceed the threshold. Given the context of the problem, this kind of false positive is preferable to the alternative false negative: further tests can be run on water flagged as contaminated, whereas clearing unsafe water as safe could have catastrophic impacts on the public.

Support Vector Machine

The support vector machine was included due to its powerful and flexible approach to separating data for classification. For this model, an RBF kernel was implemented, as the data should not be assumed to be linearly separable but is assumed to be separable in a higher-dimensional space. Furthermore, a gamma parameter of 1 was used to maximize model performance while reducing the risk of overfitting.
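A minimal sketch of an RBF SVM with gamma set to 1 is shown below; probability estimates are enabled so that ROC and PR curves can be drawn later (an assumption about how the curves were produced).

```python
from sklearn.svm import SVC

# RBF kernel with gamma=1, as described above
svm = SVC(kernel="rbf", gamma=1, probability=True, random_state=42)
svm.fit(X_train_bal, y_train_bal)
svm_pred = svm.predict(X_test)
```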

As evidenced by the above performance statistics, the RBF SVM performed very well on these metrics, boasting an accuracy of 90%. However, the model’s recall on the exceed-MCL label is only 72%. As with the other models, this RBF SVM also exhibits a high false-positive rate.

Classification Model Comparison

In addition to assessing the confusion matrices and performance statistics for each of the three models, the following ROC-AUC and precision-recall (PR) comparison curves were also created.
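The sketch below illustrates one way such comparison curves could be drawn with scikit-learn and matplotlib, assuming the three fitted models from the sketches above; it is not the exact plotting code used for the figures.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score

fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(12, 5))

for name, model in {"Decision Tree": tree, "Multinomial NB": nb, "RBF SVM": svm}.items():
    # Probability of the positive ("exceeds MCL") class
    scores = model.predict_proba(X_test)[:, 1]

    fpr, tpr, _ = roc_curve(y_test, scores)
    ax_roc.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")

    prec, rec, _ = precision_recall_curve(y_test, scores)
    ax_pr.plot(rec, prec, label=f"{name} (AP = {average_precision_score(y_test, scores):.2f})")

ax_roc.set(xlabel="False positive rate", ylabel="True positive rate", title="ROC curves")
ax_pr.set(xlabel="Recall", ylabel="Precision", title="Precision-recall curves")
ax_roc.legend(); ax_pr.legend()
plt.show()
```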

It is clear from nearly all of the metrics that the RBF support vector machine does the best job of predicting whether an observation will exceed the MCL. This model boasts the highest AUC (0.89), average precision (0.19), accuracy (0.90), and F1 scores (0.95, 0.20). Given this performance, this will likely be the model implemented. However, given the context of the problem and the high cost of false negatives, the decision tree model should also be considered, as it has the highest recall (0.77) for the exceed-MCL category.

Fairness Audit

After the creation of the above classification models, we performed a fairness audit to examine whether the best performing model, the SVM, was underperforming for specific subgroups. Specifically, we explored how the model performed for groups with different poverty rates. SVM performance was assessed when predicting a response for observations from the lowest quartile of poverty rate (<9.3%) compared with predictions for observations from the highest quartile (>15.6%). We then ran a fairness metric test (output below) for the highest poverty quartile vs. all other groups. We are specifically concerned that the model will underperform for higher-poverty groups.
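A minimal sketch of the group-wise comparison is shown below. It assumes a Poverty_Rate column aligned with the test rows and the SVM predictions from the earlier sketch; the variable names are illustrative and the quartile cut-offs come from the report.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

poverty = np.asarray(test_df["Poverty_Rate"])  # hypothetical column aligned with the test set
y_true = np.asarray(y_test)
y_hat = np.asarray(svm_pred)

for name, mask in {"low poverty (<9.3%)": poverty < 9.3,
                   "high poverty (>15.6%)": poverty > 15.6}.items():
    acc = accuracy_score(y_true[mask], y_hat[mask])
    rec = recall_score(y_true[mask], y_hat[mask], pos_label=1)  # recall on the exceed-MCL label
    print(f"{name}: accuracy = {acc:.2f}, exceed-MCL recall = {rec:.2f}")
```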

Fairness metric test output

Upon initial examination of the two sets of model performance metrics and the fairness test output, there do not appear to be any major red flags. In fact, the high-poverty group exhibits a higher accuracy rate than the low-poverty group. However, recall for high-poverty observations that exceed the MCL is far worse (69%) than the same recall for the low-poverty group (78%). In other words, while the model exhibits a high accuracy rate for the high-poverty group, it does a significantly worse job of identifying harmful water samples when compared to the low-poverty group.

Frequent Pattern Mining - Apriori

Apriori Purpose: Apriori is used to find frequent itemsets and identify association rules from datasets. 

Data limits: The Apyori package’s apriori() method requires the input dataset to be transformed into a list of “transactions”, where each row is a transaction filled with the items you’d like to find frequent itemsets for. Other inputs are the minimum support, confidence, and lift values you’d like the itemsets to meet.

Recommended Use: For this project, apriori() was ideal for finding common PFAS contaminants that appear together frequently at the sample, facility, county, and state level. Some constraints are that large datasets and longer transactions, such as those at the state level, take longer in terms of runtime.

Apriori was used to determine which contaminants tend to occur together. PFAS does not refer to just one compound - it encompasses a wide range of chemicals. We want to know which PFAS, specifically, tend to appear together. Do they appear because they're a byproduct of a certain process? Apriori from the package Apyori was used. We tried to use Mlxtend's apriori() as well, but the runtime was significantly longer than that of Apyori. 

Not only do we want to know which contaminants appear together more often; more specifically, we want to know which contaminants that exceed their MCL appear together more often. Quite frankly, a lot of these PFAS chemicals are so widespread at this point that looking through the entire dataset would not only take a long time, but wouldn't provide much useful information. A new dataframe was therefore made containing only contaminants that exceeded their MCL, as those are of greater concern and higher priority for remediation.

The dataset first had to be transformed into a format that the apriori() method could accept (below). Apriori was then run on those ‘transactions’ with specified minimum support, confidence, and lift values, tweaked to produce a manageable number of results (below). The results were converted to an array from which the support, confidence, and lift values for each itemset were extracted, and everything was then reformatted back into a dataframe. Duplicates were removed, and we kept only itemsets of three or more items, as a higher number of items gives more insight into their relationships. For the county- and state-level apriori models, we also found the counties and states containing those itemsets and plotted them on a map for better visualization (below). The counts of states with these occurrences were found and then plotted.

Code snippet of data transformation for apriori()

Code snippet of apriori()

Code snippet for mapping itemsets back to states that have them
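For reference, a simplified sketch of this pipeline is given below. It assumes a dataframe exceed_df of samples that exceed their MCL with SampleID and Contaminant columns; the column names and threshold values are illustrative, since the actual thresholds were tuned to keep the output manageable.

```python
import pandas as pd
from apyori import apriori  # Apyori package used in the report

# One "transaction" per sample: the list of contaminants exceeding their MCL in that sample
transactions = exceed_df.groupby("SampleID")["Contaminant"].apply(list).tolist()

# Illustrative thresholds; the report tuned these to get a manageable number of results
results = list(apriori(transactions, min_support=0.05, min_confidence=0.2,
                       min_lift=1.0, min_length=3))

# Flatten the results into a dataframe of itemsets with their support, confidence, and lift
rows = []
for record in results:
    for stat in record.ordered_statistics:
        rows.append({
            "itemset": tuple(sorted(record.items)),
            "support": record.support,
            "confidence": stat.confidence,
            "lift": stat.lift,
        })
rules_df = pd.DataFrame(rows).drop_duplicates(subset="itemset")
rules_df = rules_df[rules_df["itemset"].map(len) >= 3]  # keep 3+ itemsets only
```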

Sample-Level

Apriori for co-occurring contaminants at the sample level yielded the results seen in Figure 18. Since there are a lot of samples, the relatively low support and confidence levels of around 20% are not terrible. However, a lift of 1 is a bit concerning, as it means these co-occurrences are no more likely than if the contaminants appeared independently by chance. The sample level may be too localized to reveal major commonalities, as water quality differs across the country and individual samples would not capture all those differences.

Sample-level Apriori results

Facility-Level

We then looked at the common contaminant sets by facility (below). Based on the results here, the lift for these itemsets is quite high (over 2), which means they are strongly associated: these contaminants are more than twice as likely to occur together as by sheer chance. They also have a fairly high support value for such a large dataset and number of facilities. The confidence values sit solidly between 25 and 50%, which, for 4-itemsets and the large number of PFAS there are, is quite good. Unfortunately, because there are so many facilities across the US, finding which facilities match these itemsets isn't very helpful. The results below (Figure 20) show the sheer number of facilities and how undescriptive they often are. That's why we then looked at the county level.

Facility-level apriori results

Facilities that have the common contaminant occurrences

County-Level

The county-level apriori results were quite good (below): the lift here is above 2, so these contaminants are strongly positively associated. The confidence here is even higher than at the facility level, and the support value is still respectable for how large the dataset is. These associations make sense, as the itemsets comprise long-chain PFAS along with their short-chain derivatives and substitutes. PFOA and PFOS may be degrading over time into these other variants.

The counties with these common occurrence sets were found and then mapped for better visualization. As with the earlier data visualizations, only the continental US was considered here; Hawaii, Alaska, tribal nations, and other US territories were excluded from this study. Future studies should absolutely include these communities.

County level apriori results

Counties that have the common contaminant occurrences

Map of counties in the continental US with contaminant itemset matches

This map shows the counties that have the most common sets of contaminants exceeding their MCL, and the contamination is widespread. The main areas without the common contaminant occurrences are in the sparsely populated, dry desert parts of the country (western Montana, Nevada, etc.). Some counties in the Southeast and Mid-Atlantic have fewer of the itemsets, which is interesting, as there are definitely industrial activities and population centers there. However, that doesn't mean the water there isn't contaminated: the waterways could contain other PFAS that weren't picked up by the Apriori model. There is also always the possibility that sampling varies across the country and thus leads to slight variations in data collection and reporting.

State-Level

We looked at 4-itemsets here to avoid seeing the same few contaminants in different permutations. Here, the lift, or association, is still positive, but not as high as that of the county-level or facility-level models. The support and confidence levels are higher, though. This might be because statewide data has more noise, while the county- and facility-level analyses have more specific patterns that the model picked up on. At the state level, we're lumping together a lot of data that might behave differently when looked at under the microscope. This is a typical Modifiable Areal Unit Problem (MAUP) in mapping, where how you set your boundaries/control areas affects your results.

Based on this apriori model, it seems that (PFHxS, PFHxA, PFBS, PFPeA) and many combinations of those are more likely to appear together than if they were independent, as their lift values are all greater than 1. If we were to look at 5-itemsets to reduce looking at various permutations of the same itemsets, it seems that PFHxS, PFPeA, PFHxA, and PFOS all tend to occur together - the only two that aren't appearing together as much are PFBS and PFOA in a 5-itemset. PFBS and PFOA do occur together in various permutations in 4-itemsets, however. These 6 contaminants seem to be the main PFAS that contaminate our water. We then mapped these itemsets back to the states that have them and plotted the results. All of these contaminants are either byproducts of one another, are currently produced, or were very popular when PFAS were first introduced, so these combinations make sense.

State level apriori results (4- and 5-itemsets)

States that have the common contaminant occurrences

Map of states in the continental US with contaminant itemset matches

This state-level map shows that the most common PFAS contaminant combinations that tend to be found together (and are exceeding their MCLs) occur basically throughout the continental United States, which more or less matches the county map. However, while the county map shows that some counties in Wyoming, Montana, Louisiana, etc. still have these itemsets, the state-level map shows that Wyoming and Louisiana barely have any, and that Montana, Nebraska, Arkansas, Mississippi, North Dakota, and South Dakota do not have these combinations at all. Again, that doesn't mean they don't have PFAS, just that they don't have these 4-contaminant combinations occurring together. Wyoming, Nebraska, Montana, and the Dakotas are states with relatively low populations, so perhaps that has impacted the PFAS in their water. There may be less manufacturing, or slightly different water sampling methodology.

Louisiana's results were unexpected, since the state has a lot of oil and chemical plants; we would have expected it to also have those sets of contaminants. Similarly, we did not expect Arkansas and Mississippi to lack those sets. These three states have industry, share many waterways, and sit along the Mississippi River downstream of all the other states the river passes through. Even if they don't produce those PFAS chemicals specifically, we expected runoff to bring the contaminants there anyway. We checked whether the dataset was imbalanced and found that Colorado, a state with all the combinations of the contaminants, did have more samples that exceeded the MCL than Arkansas or Louisiana, for example. However, when checking the total number of samples for those states, Louisiana had even more samples than Colorado; they just had fewer that were above the MCL.

Some potential reasons for this might be that the PFAS that do occur in states like Louisiana and Montana are not the popular co-occurring ones we found; there are many other, more obscure PFAS compounds in the dataset. Another reason is that there can be differences in sampling methodology between states over the years. The EPA appears to sample only every decade or so, and a lot can and has changed over time in environmental policy and protection.

Model Comparison

Diving deeper into the identified compounds: out of the 29 PFAS in the dataset, these six appear the most in water samples taken across the country while also exceeding their Maximum Contaminant Level.

PFHxS, or perfluorohexanesulphonic acid, is used in surfactants, protective coatings, and metal plating agents. While its production is on the decline, it can also result from PFOS production or from the degradation of other PFAS chemicals. PFBA (perfluorobutanoic acid) and PFBS (perfluorobutane sulfonate) also have typical PFAS uses, such as in nonstick and stain-resistant products, firefighting foam, and photographic film, and they are also breakdown products of other PFAS. Like PFHxS, PFBA's production is on the decline, but it certainly lives up to its name of a forever chemical. PFBS is still being produced by 3M as a safer alternative to PFOS [3][4]. PFPeA is perfluoropentanoic acid. PFHpA, PFHxA, PFBS, and PFPeA are labeled as "safer" alternatives to other PFAS like PFOA and PFOS; however, these short-chain PFAS still induced adverse effects in mice [5]. PFOA and PFOS are long-chain (C8) PFAS and are the "traditional" PFAS that have been linked to numerous health and environmental issues.

It seems that overall, the "safer" short-chain PFAS appear together more often. That said, the C8s like PFOA and PFOS still appear together along with chemicals that are byproducts or degradation products of the C8s, which makes sense. It is concerning that the C8s exceed their MCL often enough to appear in these itemsets, especially as they are reportedly more toxic.

The main takeaway from this analysis is how widespread the long chain PFOA/PFOS and their derivatives are. Even when they degrade, they still exist in the water system in different forms. The county vs state level maps show that the state level map perhaps generalized the contamination patterns too much, because there is still contamination in all those counties, but when clumped together, that contamination pattern gets diluted.

The best model here is the county-level model, as it more accurately portrays the association of the contaminants and is detailed enough to show differences within a state while not getting too far into the weeds (as with the facility-level model). It has high lift values showing the positive associations, and high confidence and support values, while still being able to be visualized geospatially to show the hotspots across the country. This model and map would be more effective for policy making than the state-level model, which is too generalized, and the facility-level model, which is too localized.

Clustering

Purpose: The purpose of KMeans clustering is to group data points into clusters based on similarity, so that points in the same cluster are more similar to each other than to points in other clusters.

Data limits: KMeans clustering assumes spherical clusters, which are often uncommon in environmental and socioeconomic data. It also relies on a predetermined value for the “k” number of clusters, so the final clustering can depend heavily on the number of clusters used. 

Recommended Use: KMeans clustering was beneficial for clustering contaminant data based on factors such as type of water (ground water, surface water, etc.) as well as region and income level. The clusters can help reveal and predict where PFAS contaminants are highest, which in turn can be analyzed further to reveal what type of water these contaminants are in, what regions they are located in, and how those regions score on poverty/income metrics.

For the clustering models, we looked to answer the following question: can we group facilities based on water system characteristics, contaminant presence, and local socioeconomic factors, and do the clusters reveal patterns of contamination risk or environmental injustice? Two clustering algorithms were implemented (KMeans and DBSCAN) and compared against one another using clustering performance metrics.

To preprocess the data for the KMeans clustering algorithm, we first read in the dataset and selected the variables of interest for this specific research question. In this case, we are interested in the facilities, contamination levels, regions, and socioeconomic factors (poverty data). Rows with missing information for any of these features were dropped, and a smaller sample of the dataset was used in the final model for faster processing. Next, the categorical variables FacilityWaterType, Contaminant, and County were label encoded so the model could use them, and the features were split into training/test sets and standardized.
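A minimal sketch of this preprocessing is shown below; the feature names, sample size, and split proportions are assumptions based on the description above rather than the exact project code.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Assumed feature set: facility, contaminant, region, and socioeconomic variables
features = ["FacilityWaterType", "Contaminant", "County",
            "Relative_MCL", "Median_Income", "Poverty_Rate"]
cluster_df = df[features].dropna().sample(n=5000, random_state=42)  # smaller sample for speed

# Label encode the categorical variables
for col in ["FacilityWaterType", "Contaminant", "County"]:
    cluster_df[col] = LabelEncoder().fit_transform(cluster_df[col])

# Split and standardize the features
train_feats, test_feats = train_test_split(cluster_df, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(train_feats)
train_scaled = scaler.transform(train_feats)
test_scaled = scaler.transform(test_feats)
```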

KMeans

We use the elbow method to determine the number of clusters that would be most beneficial and capture the most information from the data. This is determined by looking at the elbow plot and finding the point where the line starts to flatten, indicating that the number of clusters at that point is sufficient for explaining the structure in the dataset.

Looking at the plot above, we can see that the line starts to flatten around 20 clusters. However, for the sake of interpretability, fewer clusters are used (k=15), as this value still captures much of the structure in the data and allows for more interpretable results. KMeans clustering with k=20 was tested but did not change the outcome much or provide much more information.
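A sketch of how such an elbow plot can be produced is shown below, using the scaled training features from the preprocessing sketch above.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Inertia (within-cluster sum of squares) for a range of candidate k values
k_values = range(2, 31)
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(train_scaled)
    inertias.append(km.inertia_)

plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow Method for Optimal K Value")
plt.show()
```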

Next, the KMeans model is fit, and the output is interpreted through the size of each cluster as well as the means of the variables of interest within each cluster. This gave us some significant insights, specifically about the relationship between relative MCL and poverty/income rates.

Elbow Method for Optimal K Value

KMeans model outputs
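A sketch of fitting the final model and profiling the clusters is shown below; the profiled column names are assumptions based on the variables discussed in this section.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Fit KMeans with the chosen k=15 and attach cluster labels to the (unscaled) features
kmeans = KMeans(n_clusters=15, random_state=42, n_init=10)
labels = kmeans.fit_predict(train_scaled)

profile = train_feats.copy()
profile["cluster"] = labels
print(profile["cluster"].value_counts().sort_index())  # size of each cluster
print(profile.groupby("cluster")[["Relative_MCL", "Median_Income", "Poverty_Rate"]].mean())
```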

After looking at these clusters and the means of each feature, there are some interesting observations related to income, poverty levels, and MCL levels. Cluster 9 only had two samples, so these are likely outliers. However, this cluster had an extremely high relative MCL of around 46.75 and also a higher median income. This *could* mean that contamination isn't strictly tied to poverty, but that would be a broad generalization and is likely a rare case. Cluster 7 also had a high relative MCL of around 10.01, but again with few samples (only 14), so it would be another generalization. This cluster, similarly to cluster 9, also had a relatively high median income, indicating that contamination is not strictly tied to poverty, though this pattern is rare in these clusters. Cluster 3, however, likely holds more significance. There are 115 samples here, so it carries more weight, and it has a relative MCL value of 1.83 (not as high as clusters 9 and 7, but large enough to exceed the MCL threshold). It also has a more moderate median income and poverty level, indicating relatively unclean water for middle-class incomes. Cluster 11 is also interesting to point out, since it has a very high median income and very low poverty rate while also maintaining one of the lowest MCL levels; it is likely the "safest" water cluster. Cluster 5, on the other hand, had a high poverty rate and low median income. Its MCL levels did not exceed the threshold but were still relatively high.

Next, we evaluate the performance of this model using the Silhouette Score and the Davies-Bouldin Index. Ideally, the Silhouette score (which measures how similar a point is to its own cluster relative to other clusters) should be above 0.5 to be considered well-clustered, and the Davies-Bouldin index (which measures the average similarity of each cluster to its most similar neighboring cluster) should be low.
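Both metrics are available in scikit-learn; a minimal sketch, assuming the scaled features and cluster labels from the sketches above:

```python
from sklearn.metrics import silhouette_score, davies_bouldin_score

print("Silhouette score:", silhouette_score(train_scaled, labels))
print("Davies-Bouldin index:", davies_bouldin_score(train_scaled, labels))
```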

The Silhouette Score and the Davies-Bouldin score indicate that the model isn't ideal. The Silhouette score is low but indicates at least a moderate degree of cluster separation. The Davies-Bouldin score is also relatively high, indicating that there may be more similarity between clusters than would be expected. Because of this, it may be worth implementing a different model. It should be noted that KMeans clustering assumes spherical, equally sized clusters, which are relatively uncommon in environmental or socioeconomic datasets.

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups data points based on how closely they are packed together. It identifies clusters as dense regions separated by areas of lower density. It uses an epsilon value that defines the neighborhood radius used to determine density; the k-nearest-neighbor distances can be sorted and plotted to identify a good value for epsilon.

The plot indicates that all of the data points, even when sorted by distance, should be included for clustering. This likely indicates that DBSCAN is not going to provide helpful insights for our data. Because of this, a few different values for epsilon were tested to decide whether the model would produce multiple clusters, or whether the original KMeans model provides the best information for clustering these points.

K-Distance Graph for DBSCAN

The code below indicates that, no matter how many data points are in the model, DBSCAN only produces one cluster, regardless of the value of epsilon (i.e., the neighborhood radius used to define dense regions). Thus, of the two clustering algorithms tried here, KMeans gives us the most information about this dataset. The KMeans clustering results are analyzed and visualized further below.

Code Snippet of DBSCAN Tests
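The sketch below illustrates the kind of test described above: the sorted k-nearest-neighbor distances used to pick candidate epsilon values, followed by DBSCAN runs at several epsilons; the specific epsilon values and min_samples setting are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Sorted 5th-nearest-neighbor distances (the k-distance graph) to guide the choice of epsilon
neigh = NearestNeighbors(n_neighbors=5).fit(train_scaled)
distances, _ = neigh.kneighbors(train_scaled)
k_dist = np.sort(distances[:, -1])

# Try several epsilon values and count how many clusters DBSCAN finds
for eps in [0.5, 1.0, 2.0, 5.0]:
    db_labels = DBSCAN(eps=eps, min_samples=5).fit_predict(train_scaled)
    n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)  # exclude the noise label
    print(f"eps={eps}: {n_clusters} cluster(s), {np.sum(db_labels == -1)} noise points")
```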

KMeans Visualizations

To answer the two other parts of the clustering question of interest (the impact of water type and the impact of region), the clusters were grouped and plotted first by water type to assess any interesting patterns. To streamline interpretability, the clusters of interest mentioned above relating to income metrics (clusters 9, 7, 3, 11, and 5) are analyzed more closely in this section.

From the plot above, it should be noted that clusters 9, 7, and 3 have ground water as the dominant water type. These are also the clusters with the highest MCL contamination levels, as well as moderately low income and moderately high poverty values. The other clusters of interest, 11 and 5, which had some of the highest income rates as well as low MCL contamination, both have surface water as the most significant water type. Thus, there may be a small relationship between the type of water and the contamination level, which in turn relates to income or poverty.

Visualizing the clusters by region provides somewhat less information than by water type, but still yields interesting results. Clusters 9, 7, and 3 all had their highest county counts in California or Texas. The top counties for these clusters were Fort Bend (Texas); Riverside and Fresno (California); and Orange, Los Angeles, and San Bernardino (California). Clusters 11 and 5 vary more, with cluster 11 being primarily in Washington County (Minnesota) and cluster 5 in Harris County (Texas). Cluster 11 is particularly interesting here because it is the only one not centered in Texas or California, but rather in Minnesota. This was also the cluster with surface water as the primary water type and one of the moderately high incomes.

Water Type Distribution Across Clusters

County Distribution Across Clusters

Regression

Purpose: The regression model was developed to correlate socioeconomic and regional factors (e.g., median income, poverty rate, neighborhood, and property type) to the level of PFAS contamination.

Data limits: The available data was limited in geographic scope and completeness. Some features had missing values or inconsistently reported metrics across counties and facilities. Additionally, PFAS concentration data was only available for some counties and not all were able to be matched to socioeconomic data. 

Recommended Use: This model is best used as an exploratory tool for identifying potential drivers of contamination and assessing regional disparities in water quality. It can support policy discussions or guide further targeted data collection, but should not be used for precise prediction at the individual facility level due to data sparsity and resolution limitations.

Data Pre-processing

Below are the initial dataset (columns, count, type), the pre-processing steps undertaken, and the resulting dataset used for modelling linear regression. The data size was vastly reduced.

Data pre-processing steps taken prior to data modelling

Model Assumptions

Linear regression assumes that the relationship between the independent and dependent variables is approximately linear, that residuals are normally distributed, and that errors have constant variance (homoscedasticity). Additionally, we check for influential outliers, as these can significantly affect the model.

Normal distribution of residuals (Box-Cox transformed)

Below are histograms and corresponding Q-Q plots for the aggregated county data. To better analyze the distributions, we first transformed them to a log scale; however, this led to heteroscedasticity issues later on. A transformation that yielded much better results was the Box-Cox transformation.

The top row shows the distribution of the mean relative MCL contamination (i.e., by how many MCL thresholds a typical sample is contaminated) and the bottom row the ratio of contaminated samples (defined as simply exceeding the MCL threshold). Comparing the distributions before and after the transformation, we conclude that a Box-Cox transformation describes our data much better (although with spread-out tails, as would be expected).

The data shown here is prior to the removal of any outliers.

Box-Cox transformed histograms and corresponding Q-Q plots (normal distribution) for MCL county data
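A minimal sketch of the transformation and diagnostic plots is shown below, assuming an aggregated county dataframe county_df with a Mean_Relative_MCL column (the dataframe and column names are assumptions).

```python
import matplotlib.pyplot as plt
from scipy import stats

# Box-Cox requires strictly positive values; a small shift guards against zeros
vals = county_df["Mean_Relative_MCL"].dropna()
transformed, lam = stats.boxcox(vals + 1e-6)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(transformed, bins=30)
axes[0].set_title(f"Box-Cox transformed (lambda = {lam:.2f})")
stats.probplot(transformed, dist="norm", plot=axes[1])  # Q-Q plot against the normal distribution
plt.show()
```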

Outliers (Cook's distance)

The Cook’s distance plots below, for each of the four Box-Cox transformed regression models, reveal a number of data points with high influence on the model estimates. The red dashed line represents the threshold for potential influence. Most observations fall well below this threshold, indicating limited undue influence on the regression models. This suggests that while the overall models are robust, a few individual counties may disproportionately affect the slope and intercept. These influential observations were identified and removed to improve the models’ generalizability and reduce the risk of overfitting to extreme cases.

Outlier detection using Cook’s distance for all four modelling plots
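A sketch of this outlier check for one of the four models is given below, using statsmodels; the 4/n rule-of-thumb threshold is an assumption about how the cut-off line was defined, and county_df is the assumed county-level dataframe.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Fit one of the four simple regressions on the Box-Cox transformed target
X = sm.add_constant(county_df["Poverty_Rate"])
ols_fit = sm.OLS(county_df["boxcox_Mean_Relative_MCL"], X).fit()

cooks_d, _ = ols_fit.get_influence().cooks_distance
threshold = 4 / len(cooks_d)  # common rule-of-thumb influence threshold

plt.stem(cooks_d)
plt.axhline(threshold, color="red", linestyle="--", label="influence threshold")
plt.xlabel("Observation"); plt.ylabel("Cook's distance"); plt.legend()
plt.show()

# Drop the influential counties before refitting the models
county_df_clean = county_df[cooks_d < threshold]
```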

Linearity and homoscedasticity assumptions

The residual vs. fitted value plots for all four linear regression models (using Box-Cox transformed target variables) show a significant improvement in meeting model assumptions. In each plot, residuals appear randomly scattered around the zero line without clear patterns, indicating that the assumption of linearity is reasonably satisfied. Additionally, the spread of residuals remains fairly constant across fitted values, which supports homoscedasticity. 

Minor deviations in the boxcox_Mean_Relative_MCL ~ Poverty_Rate plot suggest slight heteroscedasticity, but not to a degree that would seriously compromise model validity. Overall, the Box-Cox transformation has tremendously improved the model fit and residual structure, supporting the appropriateness of linear regression for this analysis.

Residuals vs Fitted plots for all four models

Model Results

Below, the mean relative contamination (in units of MCL) is presented at the top and the percentage of samples exceeding the MCL threshold at the bottom. While the data exhibit high variance and relatively low R², two key trends emerge: counties with higher median incomes tend to have fewer samples exceeding the MCL threshold and lower mean contamination levels. Conversely, counties with higher poverty rates show a greater incidence of contaminated samples and higher mean water contamination.

Below, the MSE, RMSE, and R² evaluation metrics are shown for all models.
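A minimal sketch of how one of these models and its metrics could be computed with scikit-learn is shown below; the column names mirror the variables discussed here, and county_df_clean refers to the outlier-filtered county data assumed in the Cook's distance sketch above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# One of the four simple regressions: Box-Cox transformed exceedance ratio vs. median income
X = county_df_clean[["Median_Income"]]
y = county_df_clean["boxcox_Ratio_MCL"]

reg = LinearRegression().fit(X, y)
pred = reg.predict(X)

mse = mean_squared_error(y, pred)
print(f"MSE = {mse:.4f}, RMSE = {np.sqrt(mse):.4f}, R^2 = {r2_score(y, pred):.4f}")
print("slope:", reg.coef_[0])  # the sign gives the direction of the relationship
```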

The highest R² value (0.1360) is observed for the model predicting boxcox_Ratio_MCL from Median_Income, suggesting that 13.6% of the variability in contamination exceedance (Ratio_MCL) can be explained by median income (lowest MSE and RMSE). The negative slope in this model indicates that higher median income is moderately associated with fewer samples exceeding MCL thresholds, hinting at possible environmental or infrastructural inequities.

In contrast, the relationship between Median_Income and boxcox_Mean_Relative_MCL yields an R² of just 0.0473, indicating a weaker relationship where only about 4.7% of variation is explained. However, the trend still suggests a slight decrease in average contamination with increasing income.

For Poverty Rate, both models show weaker associations. The model for boxcox_Ratio_MCL has an R² of 0.0561, and the model for boxcox_Mean_Relative_MCL shows the weakest of all at 0.0439. Both still show positive slopes, suggesting that higher poverty is mildly linked with more contamination and more frequent MCL exceedances, although the explanatory power is limited.

While the linear regression models show statistically weak but directionally consistent relationships, the negative correlation with income and positive correlation with poverty in contamination metrics reinforce environmental justice concerns — that economically disadvantaged areas are more vulnerable to water contamination. Despite the rather low R² values, these patterns may warrant further investigation using nonlinear models or incorporating additional contextual variables like infrastructure age, proximity to industrial activity, or regulation strength. The use of Box-Cox transformations also significantly improved model normality and residual behavior, enabling more reliable inference from these regression models. This is underlined by the low MSE and RMSE values for all linear regression models shown. 

Linear regression between Contamination and Socioeconomic Factors

Evaluation metrics corresponding to the plots of Figure 40