The K-Means Clustering Algorithm in Java

1. Overview

Clustering is an umbrella term for a class of unsupervised algorithms for discovering groups of things, people, or ideas that are closely related to each other.

This seemingly simple one-line definition already contains a few key terms. What exactly is clustering? What is an unsupervised algorithm?

In this tutorial, we're going to, first, shed some light on these concepts. Then, we'll see how they can manifest themselves in Java.

2. Unsupervised Algorithms

Before we use most learning algorithms, we should somehow feed some sample data to them and let the algorithm learn from that data. In Machine Learning terminology, we call that sample dataset the training data. Moreover, the whole process is known as the training process.

Anyway, we can classify learning algorithms based on the amount of supervision they need during the training process. The two main types of learning algorithms in this category are:

  • Supervised Learning: In supervised algorithms, the training data should include the actual solution for each point. For example, if we're about to train a spam filtering algorithm, we feed both the sample emails and their labels, i.e. spam or not-spam, to the algorithm. Mathematically speaking, we're going to infer f(x) from a training set that includes both the xs and the ys.
  • Unsupervised Learning: When there are no labels in the training data, then the algorithm is an unsupervised one. For example, suppose we have plenty of data about musicians and we're going to discover groups of similar musicians in that data.

3. Clustering

Clustering is an unsupervised algorithm for discovering groups of similar things, ideas, or people. Unlike supervised algorithms, we don't train clustering algorithms with examples of known labels. Instead, clustering tries to find structures within a training set where no point of the data acts as the label.

3.1. K-Means Clustering

K-Means is a clustering algorithm with one fundamental property: the number of clusters is defined in advance. In addition to K-Means, there are other types of clustering algorithms like Hierarchical Clustering, Affinity Propagation, or Spectral Clustering.

3.2. How K-Means Works

Suppose our goal is to find a few similar groups in a dataset like:

K-Means begins with k randomly placed centroids. Centroids, as their name suggests, are the center points of the clusters. For example, here we're adding four random centroids:

Then we assign each existing data point to its nearest centroid:

After the assignment, we move each centroid to the average location of the points assigned to it. Remember, centroids are supposed to be the center points of the clusters:

The current iteration concludes each time we're done relocating the centroids. We repeat these iterations until the assignments stop changing between several consecutive iterations:

When the algorithm terminates, those four clusters are found as expected. Now that we know how K-Means works, let's implement it in Java.

3.3. Feature Representation

When modeling different training datasets, we need a data structure to represent model attributes and their corresponding values. For example, a musician can have a genre attribute with a value like Rock. We usually use the term feature to refer to the combination of an attribute and its value.

To prepare a dataset for a particular learning algorithm, we usually use a common set of numerical attributes that can be used to compare different items. For example, if we let our users tag each artist with a genre, then at the end of the day, we can count how many times each artist is tagged with a specific genre:

The feature vector for an artist like Linkin Park is [rock -> 7890, nu-metal -> 700, alternative -> 520, pop -> 3]. So if we can find a way to represent attributes as numerical values, then we can simply compare two different items, e.g. artists, by comparing their corresponding vector entries.

Since numeric vectors are such versatile data structures, we're going to represent features using them. Here's how we implement feature vectors in Java:

public class Record {

    private final String description;
    private final Map<String, Double> features;

    // constructor, getter, toString, equals and hashcode
}
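For example, the Linkin Park vector from above could be encoded as a Record like this (a brief usage sketch, assuming the constructor takes the description and the feature map, as the dataset-building code later in this article does):

Map<String, Double> features = new HashMap<>();
features.put("rock", 7890.0);
features.put("nu-metal", 700.0);
features.put("alternative", 520.0);
features.put("pop", 3.0);

Record linkinPark = new Record("Linkin Park", features);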

3.4. Finding Similar Items

In each iteration of K-Means, we need a way to find the nearest centroid to each item in the dataset. One of the simplest ways to calculate the distance between two feature vectors is to use the Euclidean Distance. The Euclidean distance between two vectors like [p1, q1] and [p2, q2] is equal to:

d = √((p1 − p2)² + (q1 − q2)²)

Let's implement this function in Java. First, the abstraction:

public interface Distance {
    double calculate(Map<String, Double> f1, Map<String, Double> f2);
}

In addition to the Euclidean distance, there are other approaches to compute the distance or similarity between different items, such as the Pearson Correlation Coefficient. This abstraction makes it easy to switch between different distance metrics.

Let's see the implementation for the Euclidean distance:

public class EuclideanDistance implements Distance {

    @Override
    public double calculate(Map<String, Double> f1, Map<String, Double> f2) {
        double sum = 0;
        for (String key : f1.keySet()) {
            Double v1 = f1.get(key);
            Double v2 = f2.get(key);

            if (v1 != null && v2 != null) {
                sum += Math.pow(v1 - v2, 2);
            }
        }

        return Math.sqrt(sum);
    }
}

First, we calculate the sum of squared differences between corresponding entries. Then, by applying the sqrt function, we compute the actual Euclidean distance.
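As a quick sanity check (not part of the original implementation), the distance between the points (0, 0) and (3, 4) should come out as 5:

Map<String, Double> f1 = Map.of("x", 0.0, "y", 0.0);
Map<String, Double> f2 = Map.of("x", 3.0, "y", 4.0);

Distance distance = new EuclideanDistance();
double d = distance.calculate(f1, f2); // √(3² + 4²) = 5.0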

3.5. Centroid Representation

Centroids live in the same space as normal features, so we can represent them similarly to features:

public class Centroid {

    private final Map<String, Double> coordinates;

    // constructors, getter, toString, equals and hashcode
}

Now that we have a few necessary abstractions in place, it's time to write our K-Means implementation. Here's a quick look at our method signature:

public class KMeans {

    private static final Random random = new Random();

    public static Map<Centroid, List<Record>> fit(List<Record> records,
      int k,
      Distance distance,
      int maxIterations) {
        // omitted
    }
}

Let's break down this method signature:

  • The dataset is a set of feature vectors. Since each feature vector is a Record, the dataset type is List<Record>
  • The k parameter determines the number of clusters, which we should provide in advance
  • distance encapsulates the way we're going to calculate the difference between two features
  • K-Means terminates when the assignment stops changing for a few consecutive iterations. In addition to this termination condition, we can place an upper bound for the number of iterations, too. The maxIterations argument determines that upper bound
  • When K-Means terminates, each centroid should have a few assigned features, hence we're using a Map<Centroid, List<Record>> as the return type. Basically, each map entry corresponds to a cluster

3.6. Centroid Generation

The first step is to generate k randomly placed centroids.

Although each centroid can contain totally random coordinates, it's a good practice to generate random coordinates between the minimum and maximum possible values for each attribute. Generating random centroids without considering the range of possible values would cause the algorithm to converge more slowly.

First, we should compute the minimum and maximum value for each attribute, and then, generate the random values between each pair of them:

private static List<Centroid> randomCentroids(List<Record> records, int k) {
    List<Centroid> centroids = new ArrayList<>();
    Map<String, Double> maxs = new HashMap<>();
    Map<String, Double> mins = new HashMap<>();

    for (Record record : records) {
        record.getFeatures().forEach((key, value) -> {
            // compares the value with the current max and picks the bigger one
            maxs.compute(key, (k1, max) -> max == null || value > max ? value : max);

            // compares the value with the current min and picks the smaller one
            mins.compute(key, (k1, min) -> min == null || value < min ? value : min);
        });
    }

    Set<String> attributes = records.stream()
      .flatMap(e -> e.getFeatures().keySet().stream())
      .collect(toSet());
    for (int i = 0; i < k; i++) {
        Map<String, Double> coordinates = new HashMap<>();
        for (String attribute : attributes) {
            double max = maxs.get(attribute);
            double min = mins.get(attribute);
            coordinates.put(attribute, random.nextDouble() * (max - min) + min);
        }

        centroids.add(new Centroid(coordinates));
    }

    return centroids;
}

Now, we can assign each record to one of these random centroids.

3.7. Assignment

First off, given a Record, we should find the centroid nearest to it:

private static Centroid nearestCentroid(Record record, List<Centroid> centroids, Distance distance) {
    double minimumDistance = Double.MAX_VALUE;
    Centroid nearest = null;

    for (Centroid centroid : centroids) {
        double currentDistance = distance.calculate(record.getFeatures(), centroid.getCoordinates());

        if (currentDistance < minimumDistance) {
            minimumDistance = currentDistance;
            nearest = centroid;
        }
    }

    return nearest;
}

Each record belongs to its nearest centroid cluster:

private static void assignToCluster(Map<Centroid, List<Record>> clusters,
  Record record,
  Centroid centroid) {
    clusters.compute(centroid, (key, list) -> {
        if (list == null) {
            list = new ArrayList<>();
        }

        list.add(record);
        return list;
    });
}

3.8. Centroid Relocation

If, after one iteration, a centroid does not contain any assignments, then we won't relocate it. Otherwise, we should relocate the centroid coordinate for each attribute to the average location of all assigned records:

private static Centroid average(Centroid centroid, List<Record> records) {
    if (records == null || records.isEmpty()) {
        return centroid;
    }

    Map<String, Double> average = centroid.getCoordinates();
    records.stream().flatMap(e -> e.getFeatures().keySet().stream())
      .forEach(k -> average.put(k, 0.0));

    for (Record record : records) {
        record.getFeatures().forEach(
          (k, v) -> average.compute(k, (k1, currentValue) -> v + currentValue)
        );
    }

    average.forEach((k, v) -> average.put(k, v / records.size()));

    return new Centroid(average);
}

Now that we can relocate a single centroid, it's possible to implement the relocateCentroids method:

private static List<Centroid> relocateCentroids(Map<Centroid, List<Record>> clusters) {
    return clusters.entrySet().stream().map(e -> average(e.getKey(), e.getValue())).collect(toList());
}

This simple one-liner iterates through all centroids, relocates them, and returns the new centroids.

3.9. Putting It All Together

In each iteration, after assigning all records to their nearest centroid, first, we should compare the current assignments with the last iteration.

If the assignments were identical, then the algorithm terminates. Otherwise, before jumping to the next iteration, we should relocate the centroids:

public static Map<Centroid, List<Record>> fit(List<Record> records,
  int k,
  Distance distance,
  int maxIterations) {

    List<Centroid> centroids = randomCentroids(records, k);
    Map<Centroid, List<Record>> clusters = new HashMap<>();
    Map<Centroid, List<Record>> lastState = new HashMap<>();

    // iterate for a pre-defined number of times
    for (int i = 0; i < maxIterations; i++) {
        boolean isLastIteration = i == maxIterations - 1;

        // in each iteration we should find the nearest centroid for each record
        for (Record record : records) {
            Centroid centroid = nearestCentroid(record, centroids, distance);
            assignToCluster(clusters, record, centroid);
        }

        // if the assignments do not change, then the algorithm terminates
        boolean shouldTerminate = isLastIteration || clusters.equals(lastState);
        lastState = clusters;
        if (shouldTerminate) {
            break;
        }

        // at the end of each iteration we should relocate the centroids
        centroids = relocateCentroids(clusters);
        clusters = new HashMap<>();
    }

    return lastState;
}

4. Example: Discovering Similar Artists on Last.fm

Last.fm builds a detailed profile of each user's musical taste by recording details of what the user listens to. In this section, we're going to find clusters of similar artists. To build a dataset appropriate for this task, we'll use three APIs from Last.fm:

  1. API to get a collection of top artists on Last.fm.
  2. Another API to find popular tags. Each user can tag an artist with something, e.g. rock. So, Last.fm maintains a database of those tags and their frequencies.
  3. And an API to get the top tags for an artist, ordered by popularity. Since there are many such tags, we'll only keep those tags that are among the top global tags.

4.1. Last.fm's API

To use these APIs, we should get an API Key from Last.fm and send it in every HTTP request. We're going to use the following Retrofit service for calling those APIs:

public interface LastFmService {

    @GET("/2.0/?method=chart.gettopartists&format=json&limit=50")
    Call<Artists> topArtists(@Query("page") int page);

    @GET("/2.0/?method=artist.gettoptags&format=json&limit=20&autocorrect=1")
    Call<Tags> topTagsFor(@Query("artist") String artist);

    @GET("/2.0/?method=chart.gettoptags&format=json&limit=100")
    Call<TopTags> topTags();

    // A few DTOs (their names here are assumed) and one interceptor
}
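The article omits the client setup, but it might look roughly like the following sketch: we build the Retrofit instance against Last.fm's API base URL and attach an OkHttp interceptor that appends the api_key query parameter to every request. The interceptor shape and the converter choice here are assumptions, not the article's exact code:

OkHttpClient okHttp = new OkHttpClient.Builder()
    .addInterceptor(chain -> {
        // append the API key to every outgoing request
        HttpUrl url = chain.request().url().newBuilder()
          .addQueryParameter("api_key", "<your-api-key>")
          .build();
        return chain.proceed(chain.request().newBuilder().url(url).build());
    })
    .build();

LastFmService lastFm = new Retrofit.Builder()
    .baseUrl("https://ws.audioscrobbler.com/")
    .client(okHttp)
    .addConverterFactory(JacksonConverterFactory.create())
    .build()
    .create(LastFmService.class);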

So, let's find the most popular artists on Last.fm:

// setting up the Retrofit service

private static List<String> getTop100Artists() throws IOException {
    List<String> artists = new ArrayList<>();

    // Fetching the first two pages, each containing 50 records.
    for (int i = 1; i <= 2; i++) {
        artists.addAll(lastFm.topArtists(i).execute().body().all());
    }

    return artists;
}

Similarly, we can fetch the top tags:

private static Set<String> getTop100Tags() throws IOException {
    return lastFm.topTags().execute().body().all();
}

Finally, we can build a dataset of artists along with their tag frequencies:

private static List<Record> datasetWithTaggedArtists(List<String> artists,
  Set<String> topTags) throws IOException {
    List<Record> records = new ArrayList<>();
    for (String artist : artists) {
        Map<String, Double> tags = lastFm.topTagsFor(artist).execute().body().all();

        // Only keep popular tags.
        tags.entrySet().removeIf(e -> !topTags.contains(e.getKey()));

        records.add(new Record(artist, tags));
    }

    return records;
}

4.2. Forming Artist Clusters

Now, we can feed the prepared dataset to our K-Means implementation:

List<String> artists = getTop100Artists();
Set<String> topTags = getTop100Tags();
List<Record> records = datasetWithTaggedArtists(artists, topTags);

Map<Centroid, List<Record>> clusters = KMeans.fit(records, 7, new EuclideanDistance(), 1000);
// Printing the cluster configuration
clusters.forEach((key, value) -> {
    System.out.println("-------------------------- CLUSTER ----------------------------");

    // Sorting the coordinates to see the most significant tags first.
    System.out.println(sortedCentroid(key));
    String members = String.join(", ", value.stream().map(Record::getDescription).collect(toSet()));
    System.out.print(members);

    System.out.println();
    System.out.println();
});

If we run this code, then it would visualize the clusters as text output:

------------------------------ CLUSTER -----------------------------------
Centroid {classic rock=65.58333333333333, rock=64.41666666666667, british=20.333333333333332, ... }
David Bowie, Led Zeppelin, Pink Floyd, System of a Down, Queen, blink-182, The Rolling Stones, Metallica, Fleetwood Mac, The Beatles, Elton John, The Clash

------------------------------ CLUSTER -----------------------------------
Centroid {Hip-Hop=97.21428571428571, rap=64.85714285714286, hip hop=29.285714285714285, ... }
Kanye West, Post Malone, Childish Gambino, Lil Nas X, A$AP Rocky, Lizzo, xxxtentacion, Travi$ Scott, Tyler, the Creator, Eminem, Frank Ocean, Kendrick Lamar, Nicki Minaj, Drake

------------------------------ CLUSTER -----------------------------------
Centroid {indie rock=54.0, rock=52.0, Psychedelic Rock=51.0, psychedelic=47.0, ... }
Tame Impala, The Black Keys

------------------------------ CLUSTER -----------------------------------
Centroid {pop=81.96428571428571, female vocalists=41.285714285714285, indie=22.785714285714285, ... }
Ed Sheeran, Taylor Swift, Rihanna, Miley Cyrus, Billie Eilish, Lorde, Ellie Goulding, Bruno Mars, Katy Perry, Khalid, Ariana Grande, Bon Iver, Dua Lipa, Beyoncé, Sia, P!nk, Sam Smith, Shawn Mendes, Mark Ronson, Michael Jackson, Halsey, Lana Del Rey, Carly Rae Jepsen, Britney Spears, Madonna, Adele, Lady Gaga, Jonas Brothers

------------------------------ CLUSTER -----------------------------------
Centroid {indie=95.23076923076923, alternative=70.61538461538461, indie rock=64.46153846153847, ... }
Twenty One Pilots, The Smiths, Florence + the Machine, Two Door Cinema Club, The 1975, Imagine Dragons, The Killers, Vampire Weekend, Foster the People, The Strokes, Cage the Elephant, Arcade Fire, Arctic Monkeys

------------------------------ CLUSTER -----------------------------------
Centroid {electronic=91.6923076923077, House=39.46153846153846, dance=38.0, ... }
Charli XCX, The Weeknd, Daft Punk, Calvin Harris, MGMT, Martin Garrix, Depeche Mode, The Chainsmokers, Avicii, Kygo, Marshmello, David Guetta, Major Lazer

------------------------------ CLUSTER -----------------------------------
Centroid {rock=87.38888888888889, alternative=72.11111111111111, alternative rock=49.16666666, ... }
Weezer, The White Stripes, Nirvana, Foo Fighters, Maroon 5, Oasis, Panic! at the Disco, Gorillaz, Green Day, The Cure, Fall Out Boy, OneRepublic, Paramore, Coldplay, Radiohead, Linkin Park, Red Hot Chili Peppers, Muse

Since the centroid coordinates are sorted by the average tag frequency, we can easily spot the dominant genre in each cluster. For example, the last cluster is a cluster of good old rock bands, and the second one is filled with rap stars.

Although this clustering makes sense for the most part, it's not perfect, since the data is merely collected from user behavior.

5. Visualization

A few moments ago, our algorithm visualized the clusters of artists in a terminal-friendly way. If we convert our cluster configuration to JSON and feed it to D3.js, then with a few lines of JavaScript, we'll have a nice human-friendly Radial Tidy-Tree:

We have to convert our Map<Centroid, List<Record>> to a JSON with a schema similar to this d3.js example.
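Here's one way the conversion could look. This is a rough sketch rather than the article's actual code, and the use of Jackson's ObjectMapper is an assumption (any JSON library would do):

private static String toD3Json(Map<Centroid, List<Record>> clusters) throws JsonProcessingException {
    List<Map<String, Object>> children = new ArrayList<>();
    clusters.forEach((centroid, records) -> {
        // each artist becomes a leaf node under its cluster
        List<Map<String, Object>> artists = new ArrayList<>();
        for (Record record : records) {
            artists.add(Map.of("name", record.getDescription()));
        }

        // each cluster becomes an inner node labeled with its centroid
        children.add(Map.of("name", centroid.toString(), "children", artists));
    });

    Map<String, Object> root = Map.of("name", "Clusters", "children", children);
    return new ObjectMapper().writeValueAsString(root);
}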

6. Number of Clusters

One of the fundamental properties of K-Means is the fact that we should define the number of clusters in advance. So far, we used a static value for k, but determining this value can be a challenging problem. There are two common ways to calculate the number of clusters:

  1. Domain Knowledge
  2. Mathematical Heuristics

If we're lucky enough to know a lot about the domain, then we might be able to simply guess the right number. Otherwise, we can apply a few heuristics like the Elbow Method or the Silhouette Method to get a sense of the number of clusters.

Before going any further, we should know that these heuristics, although useful, are just heuristics and may not provide clear-cut answers.

6.1. Elbow Method

To use the elbow method, we should first calculate the difference between each cluster centroid and all its members. As we group more unrelated members in a cluster, the distance between the centroid and its members goes up, hence the cluster quality decreases.

One way to perform this distance calculation is to use the Sum of Squared Errors. Sum of squared errors or SSE is equal to the sum of squared differences between a centroid and all its members:

public static double sse(Map<Centroid, List<Record>> clustered, Distance distance) {
    double sum = 0;
    for (Map.Entry<Centroid, List<Record>> entry : clustered.entrySet()) {
        Centroid centroid = entry.getKey();
        for (Record record : entry.getValue()) {
            double d = distance.calculate(centroid.getCoordinates(), record.getFeatures());
            sum += Math.pow(d, 2);
        }
    }

    return sum;
}

Then, we can run the K-Means algorithm for different values of k and calculate the SSE for each of them:

List<Record> records = // the dataset;
Distance distance = new EuclideanDistance();
List<Double> sumOfSquaredErrors = new ArrayList<>();
for (int k = 2; k <= 16; k++) {
    Map<Centroid, List<Record>> clusters = KMeans.fit(records, k, distance, 1000);
    double sse = Errors.sse(clusters, distance);
    sumOfSquaredErrors.add(sse);
}

At the end of the day, it's possible to find an appropriate k by plotting the number of clusters against the SSE:

Usually, as the number of clusters increases, the distance between cluster members decreases. However, we can't choose arbitrarily large values for k, since having multiple clusters with just one member defeats the whole purpose of clustering.

The idea behind the elbow method is to find an appropriate value for k in a way that the SSE decreases dramatically around that value. For example, k = 9 can be a good candidate here.
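If we'd rather pick the elbow programmatically, one simple heuristic (not part of the original article) is to choose the k where the curve bends most sharply, i.e. where the second difference of the SSE values peaks. The helper below is a hypothetical sketch, assuming sumOfSquaredErrors.get(0) corresponds to k = 2 as in the loop above:

private static int elbowK(List<Double> sumOfSquaredErrors) {
    int bestK = 2;
    double sharpestBend = Double.NEGATIVE_INFINITY;
    for (int i = 1; i < sumOfSquaredErrors.size() - 1; i++) {
        // second difference: the SSE drop before this k minus the drop after it
        double bend = (sumOfSquaredErrors.get(i - 1) - sumOfSquaredErrors.get(i))
          - (sumOfSquaredErrors.get(i) - sumOfSquaredErrors.get(i + 1));
        if (bend > sharpestBend) {
            sharpestBend = bend;
            bestK = i + 2; // list index 0 corresponds to k = 2
        }
    }

    return bestK;
}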

7. Conclusion

In this tutorial, first, we covered a few important concepts in Machine Learning. Then, we got acquainted with the mechanics of the K-Means clustering algorithm. Finally, we wrote a simple implementation of K-Means, tested our algorithm with a real-world dataset from Last.fm, and visualized the clustering result in a nice graphical way.

As usual, the sample code is available in our GitHub project, so make sure to check it out!