QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals
Published as an arXiv preprint, 2026
QuantSightBench evaluates how well large language models quantify uncertainty by producing 90% prediction intervals for 1,000 real-world numerical forecasting questions. A well-calibrated 90% interval should contain the true value about 90% of the time. We find systematic overconfidence across models: none reaches the 90% target coverage, and coverage degrades as the magnitude of the forecast quantity increases.
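Coverage, the metric behind the overconfidence finding, is the fraction of questions whose realized value falls inside the model's stated interval. The following is a minimal sketch of that computation. It assumes closed intervals [lower, upper] and one interval per question. The function name `empirical_coverage` is hypothetical, and the benchmark's actual scoring code may differ.

```python
from typing import Sequence, Tuple


def empirical_coverage(
    intervals: Sequence[Tuple[float, float]], truths: Sequence[float]
) -> float:
    """Fraction of realized values that fall inside their [lower, upper] interval.

    Hypothetical helper for illustration; assumes closed intervals.
    """
    if len(intervals) != len(truths):
        raise ValueError("intervals and truths must have the same length")
    if len(truths) == 0:
        raise ValueError("need at least one question")
    hits = 0
    for (lower, upper), y in zip(intervals, truths):
        if lower > upper:
            raise ValueError(f"invalid interval: [{lower}, {upper}]")
        # A hit means the true value lies within the stated interval (inclusive).
        if lower <= y <= upper:
            hits += 1
    return hits / len(truths)


# Toy example: three of four intervals contain the truth.
intervals = [(10, 20), (100, 150), (0.5, 0.9), (1e6, 2e6)]
truths = [15, 160, 0.7, 1.5e6]
print(empirical_coverage(intervals, truths))  # 0.75
```

In this toy example, a model that claims 90% intervals but achieves 0.75 coverage is overconfident: its intervals are too narrow for the stated confidence level.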
Recommended citation: Qin, J. & Andriushchenko, M. (2026). "QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals." arXiv preprint arXiv:2604.15859.
