Machine Learning Mastery

Clustering Unstructured Text with LLM Embeddings and HDBSCAN

June 23, 2026

Level: Intermediate

For: NLP Researchers

✦TL;DR

Researchers demonstrate that combining large language model (LLM) embeddings with the HDBSCAN clustering algorithm is effective for unsupervised text clustering, reaching a clustering purity of 0.83 on a dataset of 1,000 text documents. The approach relies on the semantic representations learned by LLMs to capture nuanced relationships between texts, while HDBSCAN provides a robust, scalable clustering framework. The authors also propose a novel method for adaptively selecting the number of clusters, which makes the clustering results more robust. The method adds computational overhead, but in exchange it yields more accurate and interpretable clusters on complex text datasets.

⚡ Key Takeaways

Clustering purity of 0.83 on a dataset of 1,000 text documents
Use of HDBSCAN clustering algorithm with LLM embeddings for unsupervised text clustering
Adaptive selection of the number of clusters using a novel method
Additional computational overhead due to the adaptive clustering method
Tags: LLM, NLP, Text Clustering, HDBSCAN
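
The purity figure quoted above can be made concrete with a short sketch. Purity assigns each cluster to its most frequent reference label and reports the fraction of documents that match their cluster's majority label. The function name `clustering_purity` and the toy labels are illustrative assumptions, not taken from the article, and the summary does not say how HDBSCAN noise points (label `-1`) were scored; this sketch simply treats them as one more cluster.

```python
from collections import Counter


def clustering_purity(true_labels, cluster_labels):
    """Fraction of items that carry the majority reference label of their cluster."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("at least one item is required")

    # Count reference labels inside each predicted cluster.
    per_cluster = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        per_cluster.setdefault(cluster, Counter())[truth] += 1

    # Sum the size of the majority class in every cluster.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(true_labels)


# Toy example (hypothetical labels): 4 of 6 items match their cluster's majority.
toy_truth = ["a", "a", "a", "b", "b", "c"]
toy_clusters = [0, 0, 1, 1, 1, 1]
toy_purity = clustering_purity(toy_truth, toy_clusters)
```

Purity rises trivially as the number of clusters grows (one cluster per document scores 1.0), so it is usually read alongside the cluster count rather than on its own.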

🔧 Tools & Libraries

HDBSCAN, BERT, RoBERTa

💡 Why It Matters

This work demonstrates the potential of LLM embeddings combined with HDBSCAN for unsupervised text analysis, enabling more accurate and interpretable clustering of complex text datasets. That is relevant to applications such as text classification, information retrieval, and topic modeling.

✅ Practical Steps

Preprocess and tokenize the text data for a suitable pretrained language model, such as BERT or RoBERTa
Compute LLM embeddings for the preprocessed text data
Apply the HDBSCAN clustering algorithm to the LLM embeddings
Use the adaptive clustering method to select the number of clusters
