Conclusions

Results and Discussion

For KMeans clustering, visualizing the clusters by region is somewhat less informative than visualizing them by water type, but it still yields interesting results. Clusters 9, 7, and 3 all had their highest county rates in California or Texas. The top counties for these clusters were Fort Bend (Texas); Riverside and Fresno (California); and Orange, Los Angeles, and San Bernardino (California). Clusters 11 and 5 vary more, with cluster 11 found primarily in Washington County (Minnesota) and cluster 5 in Harris County (Texas). Cluster 11 is particularly interesting because it is the only cluster located outside Texas and California. It was also the cluster in which surface water was the primary water type, and it had one of the moderately high incomes.

In our experiments, DBSCAN produced only one cluster regardless of how many data points were in the model and regardless of the value of epsilon (i.e., the neighborhood radius within which two points are treated as neighbors). Thus, of the two clustering algorithms applied here, KMeans gives us the most information about this dataset.
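To illustrate how epsilon acts as a neighborhood radius, the following is a minimal pure-Python sketch of DBSCAN run on a small made-up 2-D dataset. The points, the `min_samples` value, and the helper names `dbscan` and `n_clusters` are assumptions for illustration only and are not the data or code used in this analysis. On this toy data, a radius that is large relative to the spacing of the points merges everything into a single cluster, which is one possible (unconfirmed) explanation for a single-cluster result.

```python
import math

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN sketch; returns one label per point (-1 means noise)."""
    n = len(points)
    # A point's neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = not visited yet
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels

def n_clusters(labels):
    """Number of distinct clusters, ignoring noise."""
    return len(set(labels) - {-1})

# Hypothetical toy data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]

for eps in (0.5, 1.5, 20.0):
    labels = dbscan(points, eps=eps, min_samples=3)
    print(eps, n_clusters(labels), labels.count(-1))
```

Because the result depends on epsilon relative to the scale of the features, a sweep like this is usually more informative after the features have been standardized.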

For Apriori, the main takeaway is how widespread the long-chain compounds PFOA and PFOS and their derivatives are. Even as they degrade, they persist in the water system in different forms. Comparing the county-level and state-level maps suggests that the state-level map overgeneralizes the contamination patterns: contamination is still present in all of those counties, but aggregating them to the state level dilutes the pattern. The county-level model therefore provides the best insight into contamination patterns.
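As a reminder of what an association rule measures, the sketch below computes support and confidence, the two quantities that Apriori-style rule mining is built on. The four samples and the helper names `support` and `confidence` are hypothetical and chosen only for illustration; they are not drawn from the dataset analyzed here.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical samples: each set lists the compounds detected in one sample.
samples = [
    frozenset({"PFOA", "PFOS"}),
    frozenset({"PFOA", "PFOS", "PFHxS"}),
    frozenset({"PFOA"}),
    frozenset({"PFBS"}),
]

print(support(samples, {"PFOA", "PFOS"}))       # share of samples with both
print(confidence(samples, {"PFOA"}, {"PFOS"}))  # how often PFOS accompanies PFOA
```

Both quantities depend on how the transactions are defined, so grouping samples at the county level or at the state level changes the rules that pass a given threshold.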

The linear regression models show statistically weak but directionally consistent relationships: contamination metrics correlate negatively with income and positively with poverty, which reinforces the environmental justice concern that economically disadvantaged areas are more vulnerable to water contamination. Despite the rather low R² values, these patterns may warrant further investigation using nonlinear models or additional contextual variables such as infrastructure age, proximity to industrial activity, or regulatory strength. The Box-Cox transformations also significantly improved the normality and behavior of the model residuals, enabling more reliable inference from these regression models. This is underlined by the low MSE and RMSE values for all linear regression models shown.
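For reference, the standard definitions of the Box-Cox transformation and of the two error metrics are given below. The symbols (response y, parameter λ, predictions ŷ, sample count n) are generic notation assumed here, not taken from the report.

```latex
% Box-Cox transform of a strictly positive response y with parameter \lambda
y^{(\lambda)} =
\begin{cases}
  \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
  \ln y, & \lambda = 0,
\end{cases}
\qquad y > 0.

% Error metrics over n samples with predictions \hat{y}_i
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}.
```

If the models were fit to the transformed response, MSE and RMSE are expressed in transformed units, so their magnitude depends on λ and is not directly comparable with errors measured on the original scale.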

Conversely, when population statistics were examined together with other readily available information about the sample water source, supervised machine learning models were able to extract clear patterns. Specifically, a support vector machine trained with a radial basis function kernel, chosen to account for highly nonlinear relationships between features, predicted the presence of hazardous water quality with over 90% accuracy. However, the fairness audit showed that our model underperformed when identifying water samples that exceeded the maximum contaminant level (MCL) in high-poverty areas.
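One common way to run this kind of audit is to compare recall on the positive class across groups. The sketch below does that on made-up labels; the data, the group names, and the helper `recall_by_group` are hypothetical and do not reproduce the audit performed here. It shows how a reasonable overall accuracy can coexist with much lower recall in one group.

```python
from collections import defaultdict

def recall_by_group(y_true, y_pred, groups):
    """Recall on the positive class (1 = exceeds the MCL), split by group."""
    hits = defaultdict(int)
    positives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            hits[group] += int(pred == 1)
    return {g: hits[g] / positives[g] for g in positives}

# Hypothetical labels and predictions for ten samples.
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
groups = ["low_poverty"] * 5 + ["high_poverty"] * 5

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)
print(recall_by_group(y_true, y_pred, groups))
```

Recall is used here because a missed exceedance (a false negative) is the costly error in this setting; other fairness metrics could be substituted in the same loop.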

Conclusion and Future Work

Based on our analysis, PFAS contamination remains a major issue in many water sources across the continental US, whether from the legacy long-chain compounds or their short-chain derivatives. All variants are detrimental to human health, and more severe contamination patterns are associated with lower income, highlighting a major environmental injustice. However, because predictive machine learning models can estimate the likelihood of hazardous water samples, areas of greatest concern can be readily identified (and, in theory, managed). The caveat is that such areas are not as easily identified in high-poverty regions; models should therefore be further enhanced, for example by incorporating ensemble learning methods to reduce individual model biases. It will also be of paramount importance to track and understand shifts in water quality that coincide with changing legislation. Furthermore, continued research should extend the modeling and analysis to Hawaii, Alaska, tribal nations, and US territories.