# Approximation algorithms for NP-hard clustering problems

## Access full-text files

## Date

## Authors

## Journal Title

## Journal ISSN

## Volume Title

## Publisher

## Abstract

Given a set of n points and their pairwise distances, the goal of clustering is to partition the points into a “small” number of “related” sets. Clustering algorithms are used widely to manage, classify, and summarize many kinds of data. In this dissertation, we study the classic facility location and k-median problems in the context of clustering, and formulate and study a new optimization problem that we call the online median problem. For each of these problems, it is known to be N P-hard to compute a solution with cost less than a certain constant factor times the optimal cost. We give simple constant-factor approximation algorithms for the facility location, k-median, and online median problems with optimal or near-optimal time bounds. We also study distance functions that are “approximately” metric, and show that such distance functions allow us to obtain a faster online median algorithm and to generalize our analysis to other objective functions, such as that of the well-known k-means heuristic. Given n points, the associated interpoint distances and nonnegative point weights, and a nonnegative penalty for each point, the facility location problem asks us to identify a set of cluster centers so that the weighted average cluster radii and the sum of the cluster center penalties are both minimized. The k-median problem asks us to identify exactly k cluster centers while minimizing just the weighted average cluster radii. We give a simple greedy algorithm for the facility location problem that runs in O(n2) time and produces a solution with cost at most 3 times optimal. For the k-median problem, we develop and make use of a sampling technique that we call successive sampling, and give a randomized constant-factor approximation algorithm that runs in O(n(k + log n + log2 n)) time. We also give an Ω(nk) lower bound on the running time of any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible constant probability. In many settings, it is desirable to browse a given data set at differing levels of granularity (i.e., number of clusters). To address this concern, we formulate a generalization of the k-median problem that we call the online median problem. The online median problem asks us to compute an ordering of the points so that, over all i, when a prefix of length i is taken as a set of cluster centers, the weighted average radii of the induced clusters is minimized. We show that a natural generalization of the greedy strategy that we call hierarchically greedy yields an algorithm that produces an ordering such that every prefix of the ordering is within a constant factor of the associated optimal cost. Furthermore, our algorithm has a running time of Θ(n2). Finally, we study the performance of our algorithms in practice. We present implementations of our k-median and online median algorithms; our experimental results indicate that our approximation algorithms may be useful in practice.