ArXiv AI Research Trends

Unsupervised learning analysis of 181k+ AI papers from ArXiv (2024-2026). Tested 9 embedding × clustering combinations, validated clusters against ArXiv categories using ARI and NMI, and identified growth trends via Poisson regression. Cluster labels were generated automatically using the Claude API.

Python, Scikit-learn, UMAP, HDBSCAN, Sentence-Transformers, Anthropic API, Pandas, spaCy


Unsupervised clustering analysis of 181k+ AI research papers from ArXiv to discover and track emerging research trends. Built to help a technology-oriented company identify where to focus academic partnerships and R&D investment.

What This Project Does

Collects 181,294 AI research papers from ArXiv (Jan 2024 – Feb 2026), clusters them into 43 research topics using unsupervised learning, and identifies which areas are growing or declining.

The Pipeline

Collects papers from 8 ArXiv CS categories via the ArXiv API
Embeds abstracts using TF-IDF, MiniLM, and KaLM sentence transformers
Reduces dimensionality with SVD + UMAP (tuned via hyperparameter experiment)
Clusters using K-Means, GMM, and HDBSCAN (9 combinations compared)
Validates clusters against ArXiv categories and cross-embedding agreement
Measures growth using Poisson regression controlling for overall ArXiv growth
Delivers strategic recommendations backed by statistical significance
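
The validation step above compares cluster assignments with ArXiv categories. As a rough illustration of what the Adjusted Rand Index measures, here is a from-scratch sketch using only the standard library. The project itself lists scikit-learn, whose `adjusted_rand_score` would be the usual choice; the function name and the example category labels below are illustrative assumptions, not taken from the repository.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("labelings must have the same length")
    if n < 2:
        return 1.0  # trivial case: no pairs to disagree on

    # Pair counts within each contingency cell, each row, and each column.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2          # agreement of a perfect match
    if maximum == expected:
        return 1.0  # both labelings are trivial (e.g. a single group each)
    return (sum_cells - expected) / (maximum - expected)


# Illustrative labels only: ArXiv primary category vs. cluster id.
categories = ["cs.CL", "cs.CL", "cs.CV", "cs.CV"]
print(adjusted_rand_index(categories, [0, 0, 1, 1]))  # clusters mirror categories
print(adjusted_rand_index(categories, [0, 1, 0, 1]))  # clusters cut across categories
```

ARI is 1 when the clustering reproduces the categories up to renaming and falls to around 0 (or below) when agreement is no better than chance, which is why it suits a sanity check against labels that were not used for clustering.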

Key Findings

13 of 43 research areas are gaining share. 21 are losing share. AI research is concentrating.

Fastest Growing Areas

LLM Reasoning Optimization — 3,193 papers, +7.72%/month
LLM Agents and Decision-Making — 2,285 papers, +5.45%/month
AI Safety and Adversarial Robustness — 2,885 papers, +2.12%/month
AI for Molecular and Biological Discovery — 1,778 papers, +1.82%/month
Medical AI and Image Segmentation — 10,222 papers, +1.06%/month

Fastest Declining Areas

Domain Adaptation and Object Detection — 1,458 papers, -3.09%/month
Graph Neural Networks — 2,799 papers, -2.82%/month
3D Point Cloud Perception — 1,759 papers, -2.13%/month
Bias and Fairness in Language Models — 1,100 papers, -1.91%/month
Federated Learning — 3,620 papers, -1.85%/month
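
To put the monthly rates above in perspective, the sketch below compounds them over a year. It assumes each percentage is a constant multiplicative change in a topic's share of monthly output, which is how a log-link Poisson model with a volume offset would normally be read; the helper name is hypothetical.

```python
def share_multiplier(pct_per_month, months=12):
    """Factor by which a topic's share changes after `months` at a constant monthly rate."""
    return (1 + pct_per_month / 100) ** months


# Rates taken from the lists above.
rates = {
    "LLM Reasoning Optimization": 7.72,
    "Domain Adaptation and Object Detection": -3.09,
}
for topic, pct in rates.items():
    print(f"{topic}: x{share_multiplier(pct):.2f} over 12 months")
```

Under that assumption, +7.72%/month compounds to roughly 2.4x over a year, and -3.09%/month to roughly 0.69x.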

Method

3 embedding types (TF-IDF, MiniLM, KaLM) × 3 clustering algorithms (K-Means, GMM, HDBSCAN) = 9 combinations evaluated systematically.
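
The comparison grid can be enumerated in a few lines. This is only a sketch of the experiment layout; the dictionary keys and function name are assumptions, and the repository may organize its runs differently.

```python
from itertools import product

EMBEDDINGS = ("TF-IDF", "MiniLM", "KaLM")
CLUSTERERS = ("K-Means", "GMM", "HDBSCAN")


def experiment_grid():
    """Return every embedding/clusterer pairing as one run configuration."""
    return [
        {"embedding": emb, "clusterer": algo}
        for emb, algo in product(EMBEDDINGS, CLUSTERERS)
    ]


for run in experiment_grid():
    print(f"{run['embedding']} + {run['clusterer']}")
```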

UMAP dimensionality reduction with hyperparameter tuning (40 output dimensions outperformed 20)
Evaluated with silhouette, Davies-Bouldin, Calinski-Harabasz, and topic coherence (c_v)
Validated with ARI/NMI against ArXiv categories and cross-embedding agreement
Growth trends measured with Poisson regression (offset by total monthly volume)
All reported trends are statistically significant (p < 0.05)
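
The growth measurement described above can be sketched as a two-parameter Poisson regression with the total monthly volume as an offset. The version below is a minimal standard-library illustration fitted by Newton's method, not the project's implementation (a GLM library would be the usual choice). The function names are hypothetical, and the data is synthetic.

```python
import math


def fit_poisson_trend(counts, totals, max_iter=50, tol=1e-10):
    """Fit log E[count_t] = log(total_t) + b0 + b1 * t by Newton's method.

    Months are indexed t = 0, 1, 2, ...; returns (b0, b1, se_b1).
    """
    b0 = math.log(sum(counts) / sum(totals))  # start at the pooled share
    b1 = 0.0                                  # and assume no trend
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for t, (y, n) in enumerate(zip(counts, totals)):
            mu = n * math.exp(b0 + b1 * t)    # the offset enters as the factor n
            g0 += y - mu                      # score for b0
            g1 += (y - mu) * t                # score for b1
            h00 += mu                         # information matrix entries
            h01 += mu * t
            h11 += mu * t * t
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det   # solve the 2x2 Newton system
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    se_b1 = math.sqrt(h00 / det)  # from the information matrix of the last iteration
    return b0, b1, se_b1


def monthly_pct_and_p(b1, se_b1):
    """Convert the slope to %/month and a two-sided Wald p-value."""
    pct = (math.exp(b1) - 1) * 100
    p = math.erfc(abs(b1 / se_b1) / math.sqrt(2))
    return pct, p


# Synthetic data (not project results): 26 months, topic share growing 5% per month.
totals = [6000 + 100 * t for t in range(26)]
counts = [round(n * 0.01 * 1.05 ** t) for t, n in enumerate(totals)]
b0, b1, se = fit_poisson_trend(counts, totals)
pct, p = monthly_pct_and_p(b1, se)
print(f"{pct:+.2f}%/month, p = {p:.3g}")
```

Because the offset fixes the coefficient on log(total) at 1, the slope describes change in a topic's share of output, so a topic that merely grows in step with ArXiv as a whole gets a slope near zero.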

Best combination: KaLM embeddings + HDBSCAN (43 clusters, silhouette 0.499, 35.5% noise)

Noise Analysis

64.5% of papers are assigned to the 43 main clusters
17.8% are noise points from small niche topics below the clustering threshold
17.7% are noise points that are genuinely interdisciplinary papers
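
The 35.5% noise figure is the share of papers HDBSCAN leaves unassigned. Both the `hdbscan` package and scikit-learn's `HDBSCAN` mark such points with the label -1, so the split between clustered and noise papers can be read straight from the label array. The helper below is a hypothetical sketch with toy labels, not project output.

```python
from collections import Counter


def cluster_summary(labels, noise_label=-1):
    """Return (number of clusters, fraction of points labeled as noise)."""
    counts = Counter(labels)
    n_noise = counts.get(noise_label, 0)
    n_clusters = len(counts) - (1 if noise_label in counts else 0)
    return n_clusters, n_noise / len(labels)


# Toy label array: two clusters plus one noise point.
n_clusters, noise_frac = cluster_summary([0, 0, 1, -1])
print(f"{n_clusters} clusters, {noise_frac:.1%} noise")
```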

Getting Started

git clone https://github.com/dinosmuc/arxiv-ai-trends.git
cd arxiv-ai-trends

conda env create -f environment.yml
conda activate arxiv-trends
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# Optional: set API key for Claude-based cluster labeling
export ANTHROPIC_API_KEY=your-key-here

# Run the full pipeline — execute notebooks 01-07 in order
jupyter lab notebooks/

Hardware Requirements

GPU: CUDA-compatible GPU required for sentence-transformer embeddings
KaLM embedding: ~2 hours on RTX 2070; MiniLM: ~5 minutes
RAM: 32GB recommended
Disk: ~2GB for processed data