
Luke Alexander, Eric Leonen, Sophie Szeto, Artemii Remizov, Ignacio Tejeda, Giovanni Inchiostro, Vasily Ilin


Overview

We release TheoremSearch (paper, dataset), a semantic search engine over all of arXiv, the Stacks Project, and six other sources. Our search is 70% more accurate than LLM search, with only 5-second latency.

Try TheoremSearch

Retrieval Performance (Hit@10)

Model           Theorem-level   Paper-level
Google Search   0.378           —
ChatGPT 5.2     0.180           —
Gemini 3 Pro    0.252           —
Ours            0.432           0.505

Theorem-level = retrieval of exact theorem statements
Paper-level = retrieval of the correct paper containing the theorem
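Hit@10 counts a query as a success when the target theorem (or paper) appears among the top 10 results. A minimal sketch of the metric (function names are ours, not from the paper):

```python
def hit_at_k(ranked_ids, relevant_id, k=10):
    """Return 1 if the relevant item appears in the top-k ranked results, else 0."""
    return int(relevant_id in ranked_ids[:k])

def mean_hit_at_k(results, k=10):
    """Average Hit@k over (ranked_ids, relevant_id) pairs, one pair per query."""
    return sum(hit_at_k(ranked, rel, k) for ranked, rel in results) / len(results)
```

The table above reports this average over the evaluation queries, once against theorem IDs and once against paper IDs.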

Data Sources

Source                      Theorems
arXiv                       9,246,761
ProofWiki                   23,871
Stacks Project              12,693
Open Logic Project          745
CRing Project               546
Stacks and Moduli           506
HoTT Book                   382
An Infinitely Large Napkin  231

Motivation

Mathematical knowledge is scattered across millions of papers.

For mathematicians, important results hide as lemmas in obscure sources, and existing tools only search at the document level. The recent AI “breakthroughs” on Erdős problems illustrate this: most turned out to be rediscoveries of results already in the literature. As Tao observed, many “open” problems are open through obscurity, not difficulty. DeepMind’s Aletheia confirmed this: most of its correct solutions were identifications of existing results in the literature.

For AI agents, the bottleneck is the same. Without the relevant literature, LLMs confabulate incorrect arguments. In our experiments, Claude answered a research-level algebraic geometry question incorrectly on its own, but correctly when given access to TheoremSearch as a RAG tool.

How It Works

  1. Parse theorems. We extract over 9 million theorem statements from LaTeX source files across arXiv and seven other collections using a combination of plasTeX, TeX logging, and regex-based parsing.
  2. Generate slogans. Each theorem is summarized into a concise natural-language description (“slogan”) by DeepSeek V3, converting formal LaTeX notation into searchable English text.
  3. Embed and index. Slogans are embedded using Qwen3-Embedding-8B and stored in a PostgreSQL database with pgvector, using an HNSW index with binary quantization for fast approximate nearest-neighbor search.
  4. Retrieve. User queries are embedded with the same model. We retrieve the top-k theorems by Hamming distance, then re-rank by cosine similarity.
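The two-stage retrieval in step 4 can be sketched in pure Python. This is an illustrative toy, not the production code: the real system uses pgvector's HNSW index over precomputed binary codes, and the helper names and the corpus-as-list-of-dicts shape here are our own.

```python
import math

def quantize(vec):
    """Binary-quantize an embedding: one sign bit per dimension, packed into an int."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Hamming distance between two bit-packed binary codes."""
    return bin(a ^ b).count("1")

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v)))

def retrieve(query_vec, corpus, k=10, shortlist=100):
    """Stage 1: coarse shortlist by Hamming distance over binary codes.
    Stage 2: re-rank the shortlist by exact cosine similarity."""
    qbits = quantize(query_vec)
    coarse = sorted(corpus, key=lambda t: hamming(qbits, quantize(t["vec"])))[:shortlist]
    return sorted(coarse, key=lambda t: -cosine(query_vec, t["vec"]))[:k]
```

The binary stage makes the coarse pass cheap (XOR and popcount instead of floating-point dot products); the cosine re-rank then restores precision on the small shortlist.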

API

TheoremSearch provides a production REST API for semantic theorem search. Example:

curl https://api.theoremsearch.com/search \
  -H "Content-Type: application/json" \
  -d '{"query": "smooth DM stack codimension one", "n_results": 5}'

Returns a JSON object containing theorem-level results with metadata and similarity scores.
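The same request can be made from Python with only the standard library. The endpoint and the `query`/`n_results` fields come from the curl example above; the wrapper names are ours, and the response schema beyond "a JSON object" is not assumed.

```python
import json
import urllib.request

API_URL = "https://api.theoremsearch.com/search"

def build_payload(query: str, n_results: int = 5) -> bytes:
    """Serialize the request body used by the /search endpoint."""
    return json.dumps({"query": query, "n_results": n_results}).encode("utf-8")

def search(query: str, n_results: int = 5) -> dict:
    """POST a query to TheoremSearch and return the parsed JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=build_payload(query, n_results),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `search("smooth DM stack codimension one")` mirrors the curl call above.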

MCP

TheoremSearch is also available to AI agents as an MCP server exposing a single tool, theorem_search. Endpoint: https://api.theoremsearch.com/mcp.
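Under the hood, an MCP tool invocation is a JSON-RPC 2.0 `tools/call` request. A sketch of the envelope a client would send (the `query` argument name is our assumption; real MCP clients also perform the initialize handshake and session management before calling tools):

```python
import json

def build_tool_call(query: str, request_id: int = 1) -> str:
    """JSON-RPC 2.0 envelope for an MCP tools/call invocation of theorem_search.
    The argument name "query" is an assumption, not taken from the server docs."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "theorem_search", "arguments": {"query": query}},
    })
```

In practice, agents should just register the endpoint with their MCP client rather than hand-building requests.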

Citation

@article{alexander2026semantic,
  title  = {Semantic Search over 9 Million Mathematical Theorems},
  author = {Alexander, Luke and Leonen, Eric and Szeto, Sophie and Remizov, Artemii and Tejeda, Ignacio and Inchiostro, Giovanni and Ilin, Vasily},
  journal= {arXiv preprint arXiv:2602.05216},
  year   = {2026},
  doi    = {10.48550/arXiv.2602.05216},
  url    = {https://arxiv.org/abs/2602.05216}
}

Acknowledgements

We thank the UW eScience Institute for supporting this project. We thank Nebius for providing inference infrastructure — our demo uses Nebius Token Factory for fast, low-cost query embedding with Qwen3-Embedding-8B.

Contact

Feedback is welcome! For questions or collaboration inquiries, reach out to vilin@uw.edu.

TheoremSearch is a project of the UW Math AI Lab.