I needed probes where the output was tiny, a few tokens at most, and where scoring was objective and deterministic. No judge model in the loop. That’s what led me to the final two probes:
Figure 1: Phi-4-reasoning-vision-15B presents a compelling option compared to existing models, pushing the Pareto frontier of the tradeoff between accuracy and compute cost. Its performance is competitive with much slower models that require more time and tokens, and its accuracy is higher than that of similarly fast models. These values were computed by averaging accuracy, time, and output token counts over a subset of 4 benchmarks for which these values had been logged: ChartQA_TEST, MathVista_MINI, MMMU_VAL, and ScreenSpot_v2.
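The averaging described in the caption can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `summarize`, the record keys (`accuracy`, `seconds`, `output_tokens`), and the unweighted mean are all hypothetical, and the numbers are made-up placeholders rather than logged results.

```python
from statistics import mean

# Benchmark names as listed in the figure caption.
BENCHMARKS = ("ChartQA_TEST", "MathVista_MINI", "MMMU_VAL", "ScreenSpot_v2")

# Hypothetical metric keys; the original logging format is not specified.
METRICS = ("accuracy", "seconds", "output_tokens")


def summarize(per_benchmark):
    """Return the unweighted mean of each metric across the four benchmarks."""
    missing = [name for name in BENCHMARKS if name not in per_benchmark]
    if missing:
        raise ValueError(f"no logged values for: {missing}")
    rows = [per_benchmark[name] for name in BENCHMARKS]
    return {metric: mean(row[metric] for row in rows) for metric in METRICS}


# Placeholder values for illustration only; these are NOT measured results.
example = {
    "ChartQA_TEST": {"accuracy": 0.5, "seconds": 2.0, "output_tokens": 100},
    "MathVista_MINI": {"accuracy": 0.25, "seconds": 4.0, "output_tokens": 200},
    "MMMU_VAL": {"accuracy": 0.75, "seconds": 6.0, "output_tokens": 300},
    "ScreenSpot_v2": {"accuracy": 0.5, "seconds": 8.0, "output_tokens": 400},
}

summary = summarize(example)
```

Each model would then contribute one point to the plot, with mean accuracy on one axis and mean time or mean output tokens on the other. Whether the original figure weights benchmarks equally is an assumption here.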