📒 Machine & Deep Learning Compendium


Clustering Algorithms


  1. Vidhya on clustering and methods

  2. KNN, intuition 2, thorough explanation 3 - classify a new sample by looking at the majority vote of its K nearest neighbours. k=1 is a special case. With an even number of classes you need an odd K that is not a multiple of the number of classes in order to break ties.

  3. Determining the number of clusters - a comparison of several methods: elbow, silhouette, etc.

  4. A good visual example of kmeans / gmm

  5. Kmeans with DTW, probably fixed-length vectors, using tslearn

  6. Kmeans for variable length, notebook

TOOLS

  • pyClustering

  • Biclustering and spectral co clustering

Block Modeling - Distance Matrices

Clustering correlation, or distance matrices: any of the "precomputed" algorithms in sklearn will work, just remember to do 1-distanceMatrix. I.e., using dbscan/hdbscan/optics, you need a dissimilarity matrix.
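A minimal sketch of that idea, assuming the similarity you start from is a correlation matrix: convert it to a dissimilarity with 1 - corr and pass it to a "precomputed" algorithm such as sklearn's DBSCAN.

```python
# Sketch: turn a correlation (similarity) matrix into a dissimilarity matrix
# and feed it to a "precomputed" clustering algorithm in sklearn.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(100, 8)   # toy data: 100 series, 8 observations each
corr = np.corrcoef(X)                        # similarity in [-1, 1], shape (100, 100)
dissimilarity = 1 - corr                     # the "1 - distanceMatrix" trick (similarity -> dissimilarity)
np.fill_diagonal(dissimilarity, 0.0)         # self-distance must be exactly 0

labels = DBSCAN(eps=0.5, min_samples=5, metric="precomputed").fit_predict(dissimilarity)
print(labels[:10])                           # -1 marks noise points
```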

Kmeans

  1. Sensitive to outliers, which can skew results (because we rely on the mean)

K-mediods

From youtube (okay video) - basically k-means with the most central object as the cluster center, rather than a virtual center point based on the mean distance from all points; we keep choosing medoid samples that minimise the SSE.

  • It is more robust to noise and outliers than k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances.

  • A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e., it is the most centrally located point in the cluster.

  • k-medoid is a classical partitioning technique of clustering that clusters the data set of n objects into k clusters known a priori.

  • Does not scale to many samples: it is O(K*(n-K)^2).

  • Randomized resampling can ensure efficiency and quality.
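A short sketch of k-medoids in practice; it assumes the scikit-learn-extra package (sklearn_extra.cluster.KMedoids) is installed, which is one of several available implementations (pyclustering ships another).

```python
# Sketch: k-medoids with an arbitrary (here Manhattan) dissimilarity,
# returning actual data points as cluster centers (the medoids).
import numpy as np
from sklearn_extra.cluster import KMedoids   # assumes scikit-learn-extra is installed

X = np.random.RandomState(42).rand(200, 4)
km = KMedoids(n_clusters=3, metric="manhattan", random_state=42).fit(X)

print(km.labels_[:10])        # cluster assignment per sample
print(km.medoid_indices_)     # indices of the samples chosen as medoids
```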

K-modes

"Python implementations of the k-modes and k-prototypes clustering algorithms, for clustering categorical data" - git

a guide to clustering mixed types, i.e., numerics, categoricals
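A small sketch of the categorical and mixed-type cases, assuming the kmodes package referenced above (KModes for purely categorical data, KPrototypes for a numeric + categorical mix).

```python
# Sketch: clustering categorical and mixed-type data with the kmodes package.
import numpy as np
from kmodes.kmodes import KModes
from kmodes.kprototypes import KPrototypes

# Purely categorical data -> k-modes.
cats = np.array([["red", "small"], ["red", "large"], ["blue", "small"], ["blue", "large"]] * 25)
km = KModes(n_clusters=2, init="Huang", n_init=5, random_state=0)
print(km.fit_predict(cats)[:10])

# Mixed data: columns 0-1 numeric, column 2 categorical -> k-prototypes.
num = np.random.RandomState(0).rand(100, 2)
cat = np.random.RandomState(1).choice(["a", "b", "c"], size=(100, 1))
mixed = np.concatenate([num, cat], axis=1).astype(object)
kp = KPrototypes(n_clusters=3, random_state=0)
print(kp.fit_predict(mixed, categorical=[2])[:10])
```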

X-means

X-means (paper):

Theory behind bic calculation with a formula.

Code: Calculate bic in k-means

G-means

Improves on X-means. G-means in the paper: The G-means algorithm starts with a small number of k-means centers and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution, using the Anderson-Darling test. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k. G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center.
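A rough sketch of just the split decision described above, assuming scipy's Anderson-Darling test: split a center into two candidate children, project its points onto the direction between the child centers, and test that 1-D projection for normality.

```python
# Sketch: the "should this center split?" test used by G-means (Anderson-Darling on a 1-D projection).
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def should_split(points, significance_index=2):   # index 2 ~ the 5% critical value for dist="norm"
    """Return True if the points assigned to one center do NOT look Gaussian."""
    children = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    v = children.cluster_centers_[0] - children.cluster_centers_[1]    # direction between child centers
    projection = points @ v / np.linalg.norm(v)                        # 1-D projection of the points
    result = anderson((projection - projection.mean()) / projection.std(), dist="norm")
    return result.statistic > result.critical_values[significance_index]

rng = np.random.RandomState(0)
one_blob = rng.normal(size=(300, 2))
two_blobs = np.vstack([rng.normal(-3, 1, (150, 2)), rng.normal(3, 1, (150, 2))])
print(should_split(one_blob), should_split(two_blobs))   # expected: False, True
```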

GMM - Gaussian Mixture Models

What is GMM? - in short, it is k-means with mean/variance centroids; a sample can belong to several centroids, each with a certain probability.

Let us briefly talk about a probabilistic generalisation of k-means: the Gaussian Mixture Model (GMM).

In k-means, you carry out the following procedure:

- specify k centroids, initialising their coordinates randomly

- calculate the distance of each data point to each centroid

- assign each data point to its nearest centroid

- update the coordinates of the centroid to the mean of all points assigned to it

- iterate until convergence.

In a GMM, you carry out the following procedure:

- specify k multivariate Gaussians (termed components), initialising their mean and variance randomly

- calculate the probability of each data point being produced by each component (sometimes termed the responsibility each component takes for the data point)

- assign each data point to the component it belongs to with the highest probability

- update the mean and variance of the component to the mean and variance of all data points assigned to it

- iterate until convergence

You may notice the similarity between these two procedures. In fact, k-means is a GMM with fixed-variance components. Under a GMM, the probabilities (I think) you're looking for are the responsibilities each component takes for each data point.
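A compact sklearn sketch of the GMM side of this comparison: soft responsibilities via predict_proba, plus choosing the number of components by BIC (the "select the K using bic" idea referenced below).

```python
# Sketch: GMM soft assignments (responsibilities) and BIC-based selection of the number of components.
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X) for k in range(1, 9)}
best_k = min(bics, key=bics.get)                      # lowest BIC wins
gmm = GaussianMixture(n_components=best_k, random_state=0).fit(X)

responsibilities = gmm.predict_proba(X)               # each row sums to 1: soft membership per component
hard_labels = responsibilities.argmax(axis=1)         # the k-means-like hard assignment
print(best_k, responsibilities[0].round(3))
```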

Gmm code on sklearn using ellipsoids

How to select the K using bic

Density estimation for gmm - nice graph

KMEANS++ / Kernel Kmeans

A comparison of kmeans++ vs kernel kmeans

Kernel Kmeans is part of TSLearn
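A tiny sketch of what the kmeans++ comparison is about: scikit-learn's KMeans uses k-means++ seeding by default, which you can contrast with purely random initialisation (kernel k-means itself lives in tslearn as KernelKMeans and is not shown here).

```python
# Sketch: k-means++ seeding vs. random initialisation in scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=6, random_state=0)

for init in ("k-means++", "random"):
    km = KMeans(n_clusters=6, init=init, n_init=1, random_state=0).fit(X)
    print(init, round(km.inertia_, 1))   # better seeding usually gives lower inertia from a single run
```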

KNN

Elbow method, elbow and mean silhouette, elbow on medium using mean distance per cluster from the center

Kneed a library to find the knee in a curve, how to?

Nearpy, knn in scale! On github

finding the optimal K

Benchmark of nearest neighbours libraries

billion scale approx nearest neighbour search
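A sketch of the elbow idea for picking K, assuming the kneed package mentioned above (KneeLocator) to find the knee of the inertia curve automatically.

```python
# Sketch: elbow method over k-means inertia, with kneed locating the knee/elbow point.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from kneed import KneeLocator        # assumes the kneed package is installed

X, _ = make_blobs(n_samples=600, centers=5, random_state=0)
ks = list(range(1, 11))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

knee = KneeLocator(ks, inertias, curve="convex", direction="decreasing")
print("suggested K:", knee.elbow)
```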

DBSCAN

How to use DBSCAN effectively

a DBSCAN visualization - very good!

DBSCAN for GPS.

A practical guide to dbscan - pretty good

Custom DBSCAN "predict"

Haversine distances for dbscan

  1. Optimized dbscans:

    • muDBSCAN, paper - A fast, exact, and scalable algorithm for DBSCAN clustering. This repository contains the implementation for the distributed spatial clustering algorithm proposed in the paper μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality.

    • Dbscan multiplex - A fast and memory-efficient implementation of DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

    • Fast dbscan - A lightweight, fast dbscan implementation for use on peptide strings. It uses pure C for the distance calculations and clustering. This code is then wrapped in python.

    • Faster dbscan paper
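A minimal sketch tying the GPS and haversine items above together: sklearn's DBSCAN with metric="haversine" expects coordinates in radians, so eps becomes a distance on the unit sphere (kilometres divided by the Earth's radius).

```python
# Sketch: DBSCAN over GPS points using haversine distance (eps expressed in kilometres).
import numpy as np
from sklearn.cluster import DBSCAN

earth_radius_km = 6371.0
coords_deg = np.array([[32.07, 34.78], [32.08, 34.79], [40.71, -74.00], [40.72, -74.01]])  # lat, lon
coords_rad = np.radians(coords_deg)                      # the haversine metric expects radians

eps_km = 5.0
db = DBSCAN(eps=eps_km / earth_radius_km, min_samples=2, metric="haversine",
            algorithm="ball_tree").fit(coords_rad)
print(db.labels_)                                        # two city clusters; -1 would mean noise
```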

ST-DBSCAN

Paper - ST-DBSCAN: an algorithm for clustering spatio-temporal data

Popular git

git

HDBSCAN*

(what is?) HDBSCAN is a clustering algorithm developed by Campello, Moulavi, and Sander. It extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using a technique to extract a flat clustering based on the stability of clusters.

Github code

Documentation (great) with examples, for clustering, outlier detection, comparison, benchmarking and analysis!

(jupytr example) - take a look and see how to use it; usage examples are also in the docs and github

What are the algorithm's steps:

  1. Transform the space according to the density/sparsity.

  2. Build the minimum spanning tree of the distance weighted graph.

  3. Construct a cluster hierarchy of connected components.

  4. Condense the cluster hierarchy based on minimum cluster size.

  5. Extract the stable clusters from the condensed tree.
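A short usage sketch of the steps above, assuming the hdbscan package (a scikit-learn-style estimator; recent scikit-learn versions also ship sklearn.cluster.HDBSCAN).

```python
# Sketch: HDBSCAN usage - min_cluster_size drives the "condense the hierarchy" step,
# and a label of -1 marks points left out of every stable cluster (noise).
import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, cluster_std=[0.5, 0.5, 2.0, 2.0], random_state=0)

clusterer = hdbscan.HDBSCAN(min_cluster_size=25, min_samples=10)
labels = clusterer.fit_predict(X)

print(set(labels))                      # cluster ids plus -1 for noise
print(clusterer.probabilities_[:5])     # per-point strength of cluster membership
```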

OPTICS

(What is?) Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based clusters in spatial data.

  • It addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density.

  • Its basic idea is similar to DBSCAN's.

  • (How?) The points of the database are (linearly) ordered such that points which are spatially closest become neighbors in the ordering.

  • A special distance is stored for each point that represents the density that needs to be accepted for a cluster in order to have both points belong to the same cluster. (This is represented as a dendrogram.)
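A minimal sklearn sketch of that ordering: OPTICS exposes the linear ordering and a reachability distance per point, which is what the dendrogram-like reachability plot is built from.

```python
# Sketch: OPTICS ordering and reachability distances with scikit-learn.
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=[0.3, 1.0, 2.0], random_state=0)

opt = OPTICS(min_samples=10).fit(X)
reachability_in_order = opt.reachability_[opt.ordering_]   # valleys in this curve = dense clusters
print(opt.labels_[:10])
print(np.round(reachability_in_order[:10], 3))
```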

SVM CLUSTERING

Paper

An SVM-based clustering algorithm is introduced that clusters data with no a priori knowledge of input classes.

  1. The algorithm initializes by first running a binary SVM classifier against a data set with each vector in the set randomly labelled; this is repeated until an initial convergence occurs.

  2. Once this initialization step is complete, the SVM confidence parameters for classification on each of the training instances can be accessed.

  3. The lowest-confidence data (e.g., the worst of the mislabelled data) then have their labels switched to the other class label.

  4. The SVM is then re-run on the data set (with partly re-labelled data) and is guaranteed to converge in this situation since it converged previously, and now it has fewer data points to carry with mislabelling penalties.

  5. This approach appears to limit exposure to the local minima traps that can occur with other approaches. Thus, the algorithm then improves on its weakly convergent result by SVM re-training after each re-labeling on the worst of the misclassified vectors – i.e., those feature vectors with confidence factor values beyond some threshold.

  6. The repetition of the above process improves the accuracy, here a measure of separability, until there are no misclassifications. Variations on this type of clustering approach are shown.
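A rough, illustrative sketch of the loop described above for the binary case, using sklearn's SVC; the flip fraction and stopping rule here are assumptions, not the paper's exact settings.

```python
# Sketch: iterative SVM "clustering" by re-labelling the lowest-confidence points (binary case).
import numpy as np
from sklearn.svm import SVC

def svm_cluster(X, n_iter=50, flip_fraction=0.05, random_state=0):
    rng = np.random.default_rng(random_state)
    y = rng.integers(0, 2, size=len(X))              # step 1: random initial labels
    for _ in range(n_iter):
        if len(np.unique(y)) < 2:                    # degenerate labelling, nothing left to separate
            break
        clf = SVC(kernel="rbf").fit(X, y)            # train on the current labelling
        margin = clf.decision_function(X)            # signed confidence (positive favours class 1)
        signed = np.where(y == 1, margin, -margin)   # margin in favour of each point's current label
        if signed.min() >= 0:                        # step 6: stop when nothing is misclassified
            break
        worst = np.argsort(signed)[: max(1, int(flip_fraction * len(X)))]
        y[worst] = 1 - y[worst]                      # step 3: flip the lowest-confidence labels
    return y

# labels = svm_cluster(X)  # X is any numeric feature matrix
```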

COP-CLUSTERING

Constrained K-means algorithm, paper, git, is a semi-supervised algorithm.

Clustering is traditionally viewed as an unsupervised method for data analysis. However, in some cases information about the problem domain is available in addition to the data instances themselves. In this paper, we demonstrate how the popular k-means clustering algorithm can be profitably modified to make use of this information. In experiments with artificial constraints on six data sets, we observe improvements in clustering accuracy. We also apply this method to the real-world problem of automatically detecting road lanes from GPS data and observe dramatic increases in performance.

In the context of partitioning algorithms, instance-level constraints are a useful way to express a priori knowledge about which instances should or should not be grouped together. Consequently, we consider two types of pairwise constraints:

  • Must-link constraints specify that two instances have to be in the same cluster.

  • Cannot-link constraints specify that two instances must not be placed in the same cluster.
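A small sketch of the constrained assignment step that must-link / cannot-link constraints imply; the greedy violation check and function names below are illustrative and simplified relative to the paper's COP-KMEANS procedure.

```python
# Sketch: one constrained assignment pass in the spirit of COP-kmeans.
import numpy as np

def violates(i, cluster, labels, must_link, cannot_link):
    """Would assigning point i to `cluster` break any constraint, given current labels (-1 = unassigned)?"""
    for a, b in must_link:
        if i in (a, b):
            other = b if i == a else a
            if labels[other] != -1 and labels[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if i == a else a
            if labels[other] == cluster:
                return True
    return False

def constrained_assignment(X, centers, must_link, cannot_link):
    labels = np.full(len(X), -1)
    for i, x in enumerate(X):
        order = np.argsort(np.linalg.norm(centers - x, axis=1))   # nearest centers first
        for c in order:                                           # take the nearest non-violating cluster
            if not violates(i, int(c), labels, must_link, cannot_link):
                labels[i] = int(c)
                break
        # if every cluster violates a constraint, the paper's algorithm fails / backtracks
    return labels

# labels = constrained_assignment(X, centers, must_link=[(0, 1)], cannot_link=[(0, 2)])
```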
