130 Extreme Value Analysis of K-Means Single (k=1) Cluster Statistical Distances, a Heuristic Approach with Los Angeles and San Francisco Hourly Temperatures and Monthly Precipitation

Monday, 29 January 2024
Hall E (The Baltimore Convention Center)
Charles J. Fisk, Individual, Newbury Park, CA

K-Means clustering is a statistical method that partitions n observations into k clusters, assigning each observation to the cluster whose centroid lies at the minimum statistical distance. Among its outputs, a K-Means analysis reports each observation's statistical distance from its respective "kth" cluster centroid, but such distances are only "local," pertaining to the particular cluster concerned. What, then, of the nature and use of overall statistical distances, those generated when the observations are not clustered at all (i.e., a "forced" k=1 treatment)? Departing from the usual objective of resolving multiple clusters, the process of generating k=1 statistical distances can be extended heuristically into a "global," multivariate extreme-value analysis, taking advantage of the scale-reducing normalizations/standardizations that serve as the preprocessing steps in a cluster analysis. Compared with a typical extreme-value analysis that uses one or two physical variables (e.g., "copulas" for the latter), extreme-value results based on the scale-invariant statistical distances can identify particularly atypical patterns spanning many variables.
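The k=1 idea above can be sketched in a few lines of Python. This is an illustrative example only, not the author's actual code: the data are synthetic, and the shapes (100 observations by 25 hourly-temperature variables) merely mimic the application described below.

```python
import numpy as np

# Hypothetical data: rows are observations (e.g., days), columns are
# variables (e.g., 25 hourly temperatures). Synthetic, for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=60.0, scale=8.0, size=(100, 25))

# Preprocessing as in a cluster analysis: standardize each variable to
# zero mean and unit variance so no variable dominates the distance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# "Forced" k=1 treatment: the single centroid of the standardized data is
# the origin, so each observation's squared Euclidean statistical distance
# is simply the sum of its squared standardized values.
d2 = (Z ** 2).sum(axis=1)

# The largest distances flag the most atypical multivariate patterns.
most_atypical = np.argsort(d2)[::-1][:5]
```

The largest of these "global" distances are the candidates for the extreme-value treatment that follows.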

To this end, an exploratory analysis is performed on two different climatological parameters for Los Angeles (KLAX) and San Francisco (KSFO): 1) daily 0000 LST-2400 LST hour-to-hour temperatures, by calendar month, and 2) July-June monthly precipitation totals (those for Downtown San Francisco). The former is a 25-dimensional application covering January 1948 through June 2023; the latter is a 12-dimensional one covering the July-June seasons 1876-77 to 2022-23 for Los Angeles and 1849-50 through 2022-23 for San Francisco Downtown. Results include time-series graphs, probability density distribution fittings, and estimated return periods. Squared Euclidean distance is chosen as the statistical distance metric after the rescalings.

The software package employed fits the block-maximum-type statistical distances to more than 60 continuous probability density distributions, with parameter counts ranging from one to six; the candidate models are then ranked using two goodness-of-fit techniques, the Kolmogorov-Smirnov and the Anderson-Darling.
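The fit-and-rank procedure can be illustrated with scipy. This sketch is not the package referred to above: it tries only four candidate distributions rather than 60+, uses synthetic block-maximum-like data, and ranks by the K-S statistic alone.

```python
import numpy as np
from scipy import stats

# Synthetic block-maximum-like sample (Gumbel-distributed, for illustration)
rng = np.random.default_rng(2)
maxima = rng.gumbel(loc=50.0, scale=5.0, size=150)

# A few scipy candidates; the actual package fits 60+ distributions.
candidates = ['gumbel_r', 'norm', 'lognorm', 'gamma']
ranking = []
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(maxima)            # maximum-likelihood parameter fit
    ks_stat, _ = stats.kstest(maxima, name, args=params)
    ranking.append((name, ks_stat))

# Smaller K-S statistic = closer agreement with the empirical CDF,
# so sorting ascending yields the goodness-of-fit ranking.
ranking.sort(key=lambda t: t[1])
```

In the actual analysis, a second ranking by the Anderson-Darling statistic would be produced the same way and given priority, as discussed below.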

The Kolmogorov-Smirnov ("K-S") test compares a sample's empirical cumulative distribution function with that of a chosen probability density function and evaluates the departures between the two. If the departures exceed certain critical values, the hypothesis that the selected candidate shares the sample's distributional form is rejected. Literature sources indicate that the K-S test is most sensitive near the center of the distribution rather than in the tails, so in this regard it is somewhat less ideal for an extreme-value-type analysis. The Anderson-Darling test, a modification of the K-S test, does give more weight to the tails, and in general it is considered the more sensitive test overall. Consequently, the Anderson-Darling test's rank is chosen as the primary means of designating the best-fitting model, provided the model has four or fewer parameters. In most of these cases the K-S ranking is similarly high, and it is considered as well, but secondarily.
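The two tests contrasted above are both available in scipy; a minimal sketch, using a synthetic normal sample rather than the distance data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(size=300)

# K-S: the statistic is the maximum departure between the empirical CDF
# and the fitted candidate's CDF.
ks = stats.kstest(sample, 'norm', args=(sample.mean(), sample.std()))

# Anderson-Darling: integrates squared CDF departures with extra weight
# in the tails, hence its greater sensitivity to tail misfit.
ad = stats.anderson(sample, dist='norm')

# Reject the candidate if the A-D statistic exceeds a critical value,
# e.g. at the 5% significance level:
idx_5pct = list(ad.significance_level).index(5.0)
reject_5pct = ad.statistic > ad.critical_values[idx_5pct]
```

Note that `scipy.stats.anderson` supplies critical values for only a handful of distributions, which is one reason a specialized distribution-fitting package is used in the actual analysis.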
