cluster-analysis – Make Me Engineer

Cluster one-dimensional data optimally? [closed]

May 13, 2023 by Tarik

How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?

May 12, 2023 by Tarik

Write code yourself. Then it fits your problem best! Boilerplate: Never assume code you download from the net to be correct or optimal… make sure to fully understand it before using it.

    %matplotlib inline
    from numpy import array, linspace
    from sklearn.neighbors import KernelDensity
    from matplotlib.pyplot import plot
    a = array([10,11,9,23,21,11,45,20,11,12]).reshape(-1, 1)
    kde = KernelDensity(kernel="gaussian", bandwidth=3).fit(a)

… Read more
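
The excerpt stops right after the density is fitted. As a rough sketch of where this could go next, the snippet below splits the same sample at the local minima of a Gaussian kernel density estimate, which is the idea another excerpt on this page mentions. To stay dependency-free it evaluates the density by hand rather than calling scikit-learn's `KernelDensity`; the names `kde_density` and `kde_split` and the `grid_size` parameter are made up for illustration and are not taken from the original answer.

```python
import bisect
import math


def kde_density(points, bandwidth):
    """Return a function that evaluates a Gaussian KDE of `points` at x."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

    return density


def kde_split(points, bandwidth, grid_size=200):
    """Split 1D points at the local minima of their kernel density estimate.

    Hypothetical helper: the density is sampled on an evenly spaced grid
    between min(points) and max(points), and every interior grid point that
    is strictly lower than both neighbours is used as a cut.
    """
    density = kde_density(points, bandwidth)
    lo, hi = min(points), max(points)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    dens = [density(x) for x in grid]

    cuts = [
        grid[i]
        for i in range(1, grid_size - 1)
        if dens[i] < dens[i - 1] and dens[i] < dens[i + 1]
    ]

    # Each point goes into the interval between the two cuts surrounding it.
    clusters = [[] for _ in range(len(cuts) + 1)]
    for x in sorted(points):
        clusters[bisect.bisect(cuts, x)].append(x)
    return clusters


a = [10, 11, 9, 23, 21, 11, 45, 20, 11, 12]
clusters = kde_split(a, bandwidth=3)
```

The bandwidth drives the result: a larger value smooths small valleys away and merges groups, a smaller one splits them further, so it is worth trying a few values on real data.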

Clustering values by their proximity in Python (machine learning?) [duplicate]

December 2, 2022 by Tarik

Don’t use clustering for 1-dimensional data. Clustering algorithms are designed for multivariate data. When you have 1-dimensional data, sort it and look for the largest gaps. This is trivial and fast in 1D, and not possible in 2D. If you want something more advanced, use Kernel Density Estimation (KDE) and look for local minima to … Read more
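
The "sort it and look for the largest gaps" advice can be sketched in a few lines of plain Python. The function name `split_at_largest_gaps` and the choice to pass the desired number of groups `k` are assumptions made for this illustration; the excerpt itself does not say how many gaps to cut.

```python
def split_at_largest_gaps(values, k):
    """Split 1D values into k groups by cutting the k-1 widest gaps.

    Hypothetical helper: a "gap" is the difference between two neighbours
    in sorted order.
    """
    xs = sorted(values)
    if not 1 <= k <= len(xs):
        raise ValueError("k must be between 1 and the number of values")

    # Index i stands for the gap between xs[i] and xs[i + 1].
    widest = sorted(
        range(len(xs) - 1),
        key=lambda i: xs[i + 1] - xs[i],
        reverse=True,
    )[: k - 1]

    groups, start = [], 0
    for i in sorted(widest):
        groups.append(xs[start : i + 1])
        start = i + 1
    groups.append(xs[start:])
    return groups


groups = split_at_largest_gaps([10, 11, 9, 23, 21, 11, 45, 20, 11, 12], k=3)
```

If the number of groups is not known in advance, one alternative is to cut every gap wider than some threshold instead of the `k - 1` widest ones.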

Clustering text documents using scikit-learn kmeans in Python

December 2, 2022 by Tarik

This is a simpler example:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score
    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived … Read more

Unsupervised clustering with unknown number of clusters

November 8, 2022 by Tarik

You can use hierarchical clustering. It is a rather basic approach, so there are lots of implementations available. It is included, for example, in Python’s SciPy. See, for example, the following script:

    import matplotlib.pyplot as plt
    import numpy
    import scipy.cluster.hierarchy as hcluster
    # generate 3 clusters of around 100 points each and one orphan point

… Read more
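
The script in the excerpt is cut off, so the following is only a minimal stand-in for the idea rather than a reconstruction of it. It shows why hierarchical clustering needs no cluster count: cutting a single-linkage hierarchy at a distance threshold amounts to grouping all points that are connected by a chain of pairs closer than that threshold. The function name `threshold_clusters` and the sample points are invented for this sketch, and the pairwise loop is quadratic, so it is only meant for small inputs.

```python
import math
from itertools import combinations


def threshold_clusters(points, threshold):
    """Label points by single-linkage clustering cut at `threshold`.

    Hypothetical helper: uses a union-find structure to merge every pair of
    points whose Euclidean distance is at most `threshold`.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= threshold:
            parent[find(i)] = find(j)

    # Renumber the roots as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(points))]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
labels = threshold_clusters(points, threshold=1.5)
```

The number of clusters falls out of the threshold, and an isolated point such as `(50, 50)` simply ends up in a cluster of its own.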

K-means algorithm variation with equal cluster size

October 9, 2022 by Tarik

This might do the trick: apply Lloyd’s algorithm to get k centroids. Sort the centroids by descending size of their associated clusters in an array. For i = 1 through k-1, push the data points in cluster i with minimal distance to any other centroid j (i < j ≤ k) off to j and … Read more

plotting results of hierarchical clustering on top of a matrix of data

September 2, 2022 by Tarik

The question does not define matrix very well: “matrix of values”, “matrix of data”. I assume that you mean a distance matrix. In other words, element D_ij in the symmetric nonnegative N-by-N distance matrix D denotes the distance between two feature vectors, x_i and x_j. Is that correct? If so, then try this (edited June … Read more

plotting results of hierarchical clustering on top of a matrix of data in Python

July 24, 2022 by Tarik

scikit-learn DBSCAN memory usage

July 15, 2022 by Tarik

The problem apparently is a non-standard DBSCAN implementation in scikit-learn. DBSCAN does not need a distance matrix. The algorithm was designed around using a database that can accelerate a regionQuery function, and return the neighbors within the query radius efficiently (a spatial index should support such queries in O(log n)). The implementation in scikit however, … Read more
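
To make the point about the distance matrix concrete, here is a small textbook-style DBSCAN sketch in plain Python that never stores pairwise distances; it only calls a region query when it needs the neighbours of one point. The names `region_query` and `dbscan` are chosen for this illustration. The query here is a linear scan, an assumption made to keep the example short; a spatial index, as the excerpt describes, would replace that function to make each query faster.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Return the indices of all points within `eps` of points[i].

    Linear scan used as a placeholder for an index-backed query.
    """
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label points with cluster ids (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Memory use beyond the input is one label per point plus the current queue, at the cost of recomputing distances in every query.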

1D Number Array Clustering

July 6, 2022 by Tarik

Don’t use multidimensional clustering algorithms for a one-dimensional problem. A single dimension is much more special than you naively think, because you can actually sort it, which makes things a lot easier. In fact, it is usually not even called clustering, but e.g. segmentation or natural breaks optimization. You might want to look at Jenks … Read more
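
The excerpt is truncated at "Jenks", so the following assumes it refers to natural breaks optimization, that is, choosing break points that minimise the squared deviation of each value from its class mean. Because the data can be sorted, every class is a contiguous run, and the best split can be found exactly with dynamic programming. This is a minimal sketch of that idea and not the implementation of any particular library; `optimal_1d_partition` is a name made up here.

```python
def optimal_1d_partition(values, k):
    """Split sorted 1D values into k contiguous groups with minimal total
    within-group sum of squared deviations from the group mean."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")

    # Prefix sums of x and x^2 give the cost of any run xs[i:j] in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def sse(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[m][j]: best cost of splitting xs[:j] into m groups.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = cost[m - 1][i] + sse(i, j)
                if c < cost[m][j]:
                    cost[m][j] = c
                    back[m][j] = i  # last group is xs[i:j]

    # Walk the back-pointers from the end to recover the groups.
    groups, j = [], n
    for m in range(k, 0, -1):
        i = back[m][j]
        groups.append(xs[i:j])
        j = i
    groups.reverse()
    return groups


groups = optimal_1d_partition([10, 11, 9, 23, 21, 11, 45, 20, 11, 12], k=3)
```

The triple loop costs on the order of k·n² steps, which is fine for a few thousand values; unlike a k-means run with random starts, the result is the global optimum for the given `k`.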