ArXiv AI Research Trends
Unsupervised learning analysis of 181k+ AI papers from ArXiv (2024-2026). Tested 9 embedding × clustering combinations, validated clusters against ArXiv categories using ARI and NMI, and identified growth trends via Poisson regression. Cluster labels were generated automatically using the Claude API.

Unsupervised clustering analysis of 181k+ AI research papers from ArXiv to discover and track emerging research trends. Built to help a technology-oriented company identify where to focus academic partnerships and R&D investment.
What This Project Does
Collects 181,294 AI research papers from ArXiv (Jan 2024 – Feb 2026), clusters them into 43 research topics using unsupervised learning, and identifies which areas are growing or declining.
The Pipeline
- Collects papers from 8 ArXiv CS categories via the ArXiv API
- Embeds abstracts using TF-IDF and the MiniLM and KaLM sentence transformers
- Reduces dimensionality with SVD + UMAP (tuned via hyperparameter experiment)
- Clusters using K-Means, GMM, and HDBSCAN (9 combinations compared)
- Validates clusters against ArXiv categories and cross-embedding agreement
- Measures growth using Poisson regression controlling for overall ArXiv growth
- Delivers strategic recommendations backed by statistical significance
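The growth-measurement step can be sketched as a Poisson regression with an offset. The following is a minimal, dependency-free illustration, not the project's actual code: the function name `fit_poisson_trend`, the hand-written Newton solver, and the synthetic data are all assumptions made for this example. It models a cluster's monthly count as `total_t * exp(a + b*t)`, so `b` captures the change in the cluster's share after controlling for overall ArXiv volume.

```python
import math

def fit_poisson_trend(counts, totals, max_iter=100, tol=1e-12):
    """Fit log E[count_t] = log(total_t) + a + b*t with Newton's method.

    counts: papers per month in one cluster (hypothetical input)
    totals: all papers per month, used as the offset
    Returns (a, b, standard error of b).
    """
    months = list(range(len(counts)))

    def moments(a, b):
        # Expected counts and the entries of the 2x2 Fisher information.
        mu = [n * math.exp(a + b * t) for t, n in zip(months, totals)]
        h_aa = sum(mu)
        h_ab = sum(t * m for t, m in zip(months, mu))
        h_bb = sum(t * t * m for t, m in zip(months, mu))
        return mu, h_aa, h_ab, h_bb

    # Start from the overall share with a flat trend.
    a, b = math.log(sum(counts) / sum(totals)), 0.0
    for _ in range(max_iter):
        mu, h_aa, h_ab, h_bb = moments(a, b)
        # Score vector (gradient of the log-likelihood).
        g_a = sum(y - m for y, m in zip(counts, mu))
        g_b = sum(t * (y - m) for t, y, m in zip(months, counts, mu))
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 linear system by hand.
        da = (h_bb * g_a - h_ab * g_b) / det
        db = (h_aa * g_b - h_ab * g_a) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break

    # Standard error of b from the inverse Fisher information at the optimum.
    _, h_aa, h_ab, h_bb = moments(a, b)
    se_b = math.sqrt(h_aa / (h_aa * h_bb - h_ab * h_ab))
    return a, b, se_b

# Synthetic example: a cluster whose share grows 5% per month.
totals = [7000 + 50 * t for t in range(26)]
counts = [n * 0.01 * 1.05 ** t for t, n in enumerate(totals)]
a, b, se_b = fit_poisson_trend(counts, totals)
growth_pct = (math.exp(b) - 1) * 100  # monthly change in share, in percent
z_score = b / se_b                    # |z| > 1.96 corresponds to p < 0.05 (two-sided)
print(f"{growth_pct:+.2f}%/month, z = {z_score:.1f}")
```

Converting the slope with `exp(b) - 1` gives a percent-per-month figure. Whether the project reports its rates in exactly this way is an assumption. In practice a GLM routine from a statistics library would replace the hand-rolled solver.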
Key Findings
13 of 43 research areas are gaining share and 21 are losing share. With fewer areas growing than shrinking, AI research is concentrating into a smaller set of topics.
Fastest Growing Areas
- LLM Reasoning Optimization — 3,193 papers, +7.72%/month
- LLM Agents and Decision-Making — 2,285 papers, +5.45%/month
- AI Safety and Adversarial Robustness — 2,885 papers, +2.12%/month
- AI for Molecular and Biological Discovery — 1,778 papers, +1.82%/month
- Medical AI and Image Segmentation — 10,222 papers, +1.06%/month
Fastest Declining Areas
- Domain Adaptation and Object Detection — 1,458 papers, -3.09%/month
- Graph Neural Networks — 2,799 papers, -2.82%/month
- 3D Point Cloud Perception — 1,759 papers, -2.13%/month
- Bias and Fairness in Language Models — 1,100 papers, -1.91%/month
- Federated Learning — 3,620 papers, -1.85%/month
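To put the monthly rates in perspective, they can be compounded over a longer horizon. The sketch below assumes the reported percentages are multiplicative monthly changes in a topic's share of papers and that the trend continues unchanged. Both are assumptions made for illustration, and the helper names are hypothetical.

```python
import math

def compound_factor(pct_per_month, months=12):
    """Multiplicative change in share after `months` at a constant monthly rate."""
    return (1 + pct_per_month / 100) ** months

def months_to_multiple(pct_per_month, multiple):
    """Months needed for the share to reach `multiple` times its current value."""
    return math.log(multiple) / math.log(1 + pct_per_month / 100)

# Rates taken from the lists above.
reasoning_year = compound_factor(7.72)        # LLM Reasoning Optimization
gnn_year = compound_factor(-2.82)             # Graph Neural Networks
da_halving = months_to_multiple(-3.09, 0.5)   # Domain Adaptation and Object Detection

print(f"LLM reasoning share after 12 months: x{reasoning_year:.2f}")
print(f"GNN share after 12 months: x{gnn_year:.2f}")
print(f"Domain adaptation share halves in about {da_halving:.0f} months")
```

Read this as a rough extrapolation only: research trends rarely hold a constant rate for a full year.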
Method
3 embedding types (TF-IDF, MiniLM, KaLM) × 3 clustering algorithms (K-Means, GMM, HDBSCAN) = 9 combinations evaluated systematically.
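The evaluation grid is the Cartesian product of the two option lists. A minimal sketch of enumerating it follows. The `results` dictionary and its `None` placeholders are hypothetical stand-ins for whatever metrics each run would record.

```python
from itertools import product

embeddings = ["TF-IDF", "MiniLM", "KaLM"]
clusterers = ["K-Means", "GMM", "HDBSCAN"]

# One entry per (embedding, clusterer) pair; metrics would be filled in per run.
results = {combo: None for combo in product(embeddings, clusterers)}

for embedding, clusterer in results:
    print(f"{embedding} + {clusterer}")
```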
- UMAP dimensionality reduction with hyperparameter tuning (40 dimensions outperformed 20)
- Evaluated with silhouette, Davies-Bouldin, Calinski-Harabasz, and topic coherence (c_v)
- Validated with ARI/NMI against ArXiv categories and cross-embedding agreement
- Growth trends measured with Poisson regression (offset by total monthly volume)
- All reported trends are statistically significant (p < 0.05)
Best combination: KaLM embeddings + HDBSCAN (43 clusters, silhouette 0.499, 35.5% noise)
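For the validation step, the Adjusted Rand Index (ARI) compares cluster assignments with reference labels such as ArXiv categories, correcting for chance agreement. The following is a small self-contained sketch of the standard ARI formula, written from scratch for illustration. The project itself may well rely on a library implementation, and the toy labels are made up.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI computed from the contingency table of two labelings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    cells = Counter(zip(labels_true, labels_pred))
    # Count pairs of items that share a cell, a reference class, or a cluster.
    same_cell = sum(comb(c, 2) for c in cells.values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / comb(n, 2)
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (same_cell - expected) / (maximum - expected)

categories = ["cs.CL", "cs.CL", "cs.CV", "cs.CV"]   # toy reference labels
clusters = [0, 0, 1, 2]                             # toy cluster assignments
ari = adjusted_rand_index(categories, clusters)
print(round(ari, 3))
```

A score of 1 means the two partitions match exactly, and values near 0 indicate agreement no better than chance.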
Noise Analysis
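- 64.5% of papers are assigned to the 43 main clusters
- 17.8% are noise points from small niche topics below the clustering threshold
- 17.7% are noise points from genuinely interdisciplinary papers (together with the niche topics, these make up the 35.5% noise)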
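HDBSCAN marks points it cannot assign to any cluster with the label `-1`, which is where the noise share comes from. A minimal sketch of computing that split from a label list follows. The `labels` list is a toy stand-in rather than project data, and `noise_summary` is a hypothetical helper.

```python
from collections import Counter

def noise_summary(labels):
    """Return (clustered share, noise share, number of clusters) for HDBSCAN-style labels."""
    counts = Counter(labels)
    noise = counts.pop(-1, 0)              # -1 is HDBSCAN's noise label
    total = noise + sum(counts.values())
    return (total - noise) / total, noise / total, len(counts)

labels = [0, 0, 1, -1, 2, 2, -1, 1, 0, 2]  # toy cluster assignments
clustered, noise, n_clusters = noise_summary(labels)
print(f"{clustered:.1%} clustered, {noise:.1%} noise, {n_clusters} clusters")
```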
Getting Started
git clone https://github.com/dinosmuc/arxiv-ai-trends.git
cd arxiv-ai-trends
conda env create -f environment.yml
conda activate arxiv-trends
pip install -r requirements.txt
python -m spacy download en_core_web_sm
# Optional: set API key for Claude-based cluster labeling
export ANTHROPIC_API_KEY=your-key-here
# Run the full pipeline — execute notebooks 01-07 in order
jupyter lab notebooks/
Hardware Requirements
- GPU: CUDA-compatible GPU required for sentence-transformer embeddings
- KaLM embedding: ~2 hours on RTX 2070; MiniLM: ~5 minutes
- RAM: 32GB recommended
- Disk: ~2GB for processed data