May 31, 2026
Gemini co-lead Oriol Vinyals argues file-system-style memory, not retrained weights, is the paradigm-shifting path to continual learning; Claude Managed Agents adds dreaming, outcomes, and multiagent orchestration; OpenAI's Sottiaux, Vercel's Rauch, Box's Levie, Cursor's Ryo Lu, and Peter Steinberger weigh in on model roadmaps and long-running agents.
X / TWITTER
Thibault Sottiaux (Codex & ChatGPT lead at OpenAI) laid out OpenAI's deliberately simple model roadmap: each version bump from GPT-5.0 to 5.1 on up to 5.5 pairs capability gains with token-efficiency improvements, which translate directly into speed. He calls GPT-5.5 their best model yet and says they intend to keep riding this same incremental strategy rather than chasing a dramatic leap.
Vercel CEO Guillermo Rauch cut through the AI hype with a blunt reminder to builders: "Ship the best product. Use lots of AI, some AI, maybe no AI. Just be the best." The point is that AI is a means, not the scorecard. Separately, Vercel shipped per-API-key spend caps on its AI Gateway, giving teams finer-grained control over runaway model costs.
Box CEO Aaron Levie pushed back on the "AI kills jobs" narrative with a counterintuitive observation from his conversations with large-enterprise CIOs and CEOs: most are either growing headcount because of AI (in new functions like forward-deployed engineers) or reinvesting their efficiency savings into underfunded areas like sales and marketing. He argues businesses have always been constrained by how much software, outreach, or risk management they can afford, so when AI lifts those constraints the investment flows back into the business. His sharp takeaway: companies that use AI to serve customers better win, while those that only chase cost savings end up worse off.
Cursor designer Ryo Lu highlighted what he loves about Cursor's auto-review feature: it explains each command and its associated risk before running it, which lowers the barrier for new coders to learn what's happening and "just do things." It's a small but telling example of how agentic tools can double as teaching aids rather than black boxes.
Peter Steinberger (independent builder, working with OpenAI) shared a concrete signal of how far long-running agents have come: with GPT-5.5 plus his /goal, autoreview, and crabbox tooling, his prompts have gone from 30-60 minute tasks to tasks that often run 4-10 hours, with much higher confidence that the result is actually ready. "Wielding agents is a skill," he notes. He also shared a useful trick: ask Codex to review code for bugs and it may say all is fine, but tell it there *is* a bug and it will loop relentlessly until it finds real issues.
OFFICIAL BLOGS
Claude Blog
New in Claude Managed Agents: dreaming, outcomes, and multiagent orchestration
Anthropic launched three upgrades to Claude Managed Agents aimed at making agents more capable with less human steering. The headline is "dreaming," a research preview: a scheduled process that reviews an agent's past sessions and memory stores between runs, extracts patterns, and curates memory so agents self-improve over time. Dreaming surfaces things a single agent can't see on its own, such as recurring mistakes and workflows that teams converge on, and you can let it update memory automatically or review changes first. The second feature, "outcomes," lets you write a rubric describing success; a separate grader then evaluates the output in its own context window and sends the agent back for another pass until it clears the bar. In internal testing, outcomes improved task success by up to 10 points (with the largest gains on the hardest problems), along with gains of 8.4% on docx and 10.1% on pptx file generation. Third, multiagent orchestration lets a lead agent break a job into pieces and delegate to specialists with their own models, prompts, and tools, working in parallel on a shared filesystem. Customer results cited: Harvey saw completion rates rise ~6x with dreaming, and Wisedocs' review agent now runs 50% faster using outcomes.
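For readers who want to picture the "outcomes" loop, here is a minimal sketch of a rubric-driven grade-and-retry cycle. It is not Anthropic's API: `run_until_outcome`, `Verdict`, and the toy agent and grader are all hypothetical names, and a real grader would be a separate model call with its own context rather than a string check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    feedback: str  # what the grader wants changed on the next pass


def run_until_outcome(
    task: str,
    rubric: str,
    agent: Callable[[str, str], str],        # hypothetical: (task, feedback) -> output
    grader: Callable[[str, str], Verdict],   # hypothetical: (rubric, output) -> verdict
    max_passes: int = 3,
) -> tuple[str, bool, int]:
    """Re-run the agent until the grader says the rubric is met or passes run out."""
    feedback = ""
    output = ""
    for attempt in range(1, max_passes + 1):
        output = agent(task, feedback)
        verdict = grader(rubric, output)     # the grader only sees rubric + output
        if verdict.passed:
            return output, True, attempt
        feedback = verdict.feedback          # send the agent back with the critique
    return output, False, max_passes


# Toy stand-ins so the sketch runs without any model behind it.
def toy_agent(task: str, feedback: str) -> str:
    return f"{task} [with citations]" if "citations" in feedback else task


def toy_grader(rubric: str, output: str) -> Verdict:
    if "citations" in output:
        return Verdict(True, "")
    return Verdict(False, "add citations")


result, passed, passes = run_until_outcome(
    "draft memo", "must include citations", toy_agent, toy_grader
)
```

With these toy functions the first pass fails the rubric and the second clears it. The cap on passes is an assumption added here so the loop always terminates.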
PODCASTS
Unsupervised Learning — Ep 87: Gemini Co-Lead on World Models, RL's Next Domains & Continual Learning
The Takeaway: The next leap in AI may come not from any single domain, but from "meta capabilities" like learning from experience, and the most practical path to that runs through file-system-style memory rather than retraining model weights.
Oriol Vinyals co-leads Google's Gemini alongside Jeff Dean and Noam Shazeer, and has pioneered deep-learning breakthroughs for over a decade. His read on where the frontier is heading is worth listening to because he sits at the exact intersection of research and what actually ships.
His most counterintuitive point is about memory and continual learning. Rather than baking each user's history into the model weights (which is a nightmare to serve at scale), he believes the winning mechanism is letting agents write to a file system, structuring thoughts into files and folders they read back from. "It's a bit more convenient than integrating those back into the weights because... we try to serve one model at scale," he explains, calling this nonparametric approach "paradigm shifting as well, in a way similar to how we saw reasoning a year and a half or so ago."
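As an illustration of the idea only, the sketch below shows what "writing to a file system" could look like in its simplest form: notes appended to files grouped in topic folders, then read back before the next session. The `FileMemory` class, the folder layout, and the `.md` extension are assumptions made for this example, not anything described in the episode, and topic and file names are treated as trusted input.

```python
from pathlib import Path
import tempfile


class FileMemory:
    """Hypothetical nonparametric memory: notes live in files, not in model weights."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, topic: str, name: str, note: str) -> Path:
        """Append a note to <root>/<topic>/<name>.md, creating the folder if needed."""
        folder = self.root / topic
        folder.mkdir(parents=True, exist_ok=True)
        path = folder / f"{name}.md"
        with path.open("a", encoding="utf-8") as f:
            f.write(note.rstrip() + "\n")
        return path

    def read(self, topic: str) -> dict[str, str]:
        """Return every note file under a topic, keyed by file name without extension."""
        folder = self.root / topic
        if not folder.is_dir():
            return {}
        return {
            p.stem: p.read_text(encoding="utf-8")
            for p in sorted(folder.glob("*.md"))
        }


with tempfile.TemporaryDirectory() as tmp:
    memory = FileMemory(Path(tmp) / "memory")
    memory.write("user_prefs", "style", "Prefers concise answers.")
    memory.write("user_prefs", "style", "Uses metric units.")
    notes = memory.read("user_prefs")      # what an agent would load next session
    missing = memory.read("no_such_topic")
```

The point of the design is the one Vinyals makes: the served model stays the same for everyone, and only the per-user files change.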
On world models, he's candid that the field hasn't yet hit the "GPT moment" for video and images, where a model could learn the rules of gravity purely from watching footage, because linking visual concepts to meaning without explicit language labels remains tricky. On reinforcement learning, he names the core bottleneck: games like Go generate infinite training data for free as each move creates a novel position, but with LLMs "the source of infinite complexity is not so clear." What surprised even him this past year is how much training narrowly on hard math and coding problems generalizes to unrelated reasoning. His honest verdict on AGI: by the standards he'd have used seven years ago, "in some way AGI is here," though not yet in the way he ultimately wants to see it.