June 5, 2026
OpenAI's Dan Roberts argues that RL has become "the cake, not the cherry" and is now powerful enough for AI to make genuine scientific discoveries, citing the week's Erdős-problem breakthroughs. Anthropic ships a candid Claude Code quality postmortem, plus Managed Agents, which decouple the brain from the hands and cut p50 time-to-first-token by roughly 60%. Alex Albert reveals that over 80% of the code merged at Anthropic is now written by Claude. Also in this issue: Cognition's Swyx on 100-hour evals, OpenAI's Thibault Sottiaux on the Codex Python SDK, Every's Dan Shipper on agent-native writing tools, and posts from Box's Aaron Levie and YC's Garry Tan.
X / TWITTER
Swyx (Shawn Wang, curator of Latent Space and AI Engineer, working with Cognition) announced what he calls the first eval ship out of Cognition: a real-world frontier coding evaluation that goes far beyond existing benchmarks. He notes that METR's evals cap out around 16 hours of human-equivalent task length, while Cognition now has private enterprise evals running up to 100 hours — and is confident enough to put a financial guarantee behind them. The Cognition dataset comes from 258 real Devin sessions across 126 users doing actual Java/TypeScript/Python/C# feature work, bug fixes, and migrations, with users estimating how long each task would have taken without Devin. He frames this as pioneering "last mile" real-world evals work. https://x.com/swyx/status/2062611218196771017
OpenAI's Thibault Sottiaux (who leads Codex and ChatGPT) shipped a Python SDK for Codex, letting developers drive Codex from inside their own programs with a simple `pip install openai-codex`. Separately, in a notably transparent move, he disclosed and fixed a Codex billing bug that had been undercounting tokens served to some Pro and Plus accounts — affecting under 15% of accounts. "Not the kind of bug you want us to fix, but didn't want to do this silently and thought you should know." https://x.com/thsottiaux/status/2062734215494664697
Roblox product leader Peter Yang spent a full day wiring up integrations and skills in Codex for his top creator workflows, and came away convinced you can save at least 50% of your time on any kind of knowledge work if you invest in setting up the system upfront — keeping human checkpoints along the way to apply your own taste. His practical recipe: reflect on the most repetitive work from your past week, list out every single step in detail, then paste that into Codex or Claude Code and ask what integrations and skills could streamline it. He also offered a pointed critique: as much as he's come to love Codex, its frontend design still lags, while Claude can one-shot great-looking HTML slides — and that first impression matters. https://x.com/petergyang/status/2062740262338929110
Anthropic's Thariq (on the Claude Code team) revived a 2020 essay, "An app can be a home-cooked meal," to argue that personal software was simply early back then — but in 2026 it really can be as personal as a home-cooked meal or a handwritten letter, now that anyone can build an app for an audience of one. https://x.com/trq212/status/2062605395101884916
Anthropic researcher Alex Albert shared striking internal data on how much of Claude's own development is now done by Claude: over 80% of all code merged into Anthropic's codebase is now written by Claude; many researchers haven't hand-written code in months; the typical Anthropic engineer ships 8x as much code as in 2024; and on the most open-ended engineering tasks, Claude's success rate jumped from ~26% to 76% in six months. When research sessions went off-track, Claude proposed a better next step than the human took 64% of the time. "We're not at recursive self-improvement yet, but it could come sooner than most expect." https://x.com/alexalbert__/status/2062580571214389510
Box CEO Aaron Levie responded to Anthropic's internal post with what he sees as the key to the optimistic AI scenario: AI dramatically lowers the barrier to doing more, so organizations now generate far more ideas than they can pursue. The binding constraint becomes the surrounding execution work, meaning the people needed to manage and ship those ideas. As he puts it, AI will let us build much more software, run more marketing campaigns, and research more drugs, but all of that still ultimately requires people to manage it. He quotes Anthropic's line that being able to spot and fix bottlenecks quickly "may become the most important skill for any organization." https://x.com/levie/status/2062728257359790292
Y Combinator CEO Garry Tan celebrated two YC decacorns landing in a single day, one of which is building commercial fusion. That company's Polaris machine hit 150 million degrees Celsius, making it the first privately funded machine to do so. "This is the abundance future, built by people who actually ship." https://x.com/garrytan/status/2062763109849411834
FPV Ventures partner Nikunj Kothari built an AI replica of himself, a Claude Code skill called "Nock," to answer the question of whether a VC can be replaced by AI. He used Claude Code to pull over 200 of his 1:1 founder pitch meeting notes captured by Granola over the past few years, distilled them down to ~53 meetings with rich debate, added a few of his own essays on the founders he loves, and turned it all into a skill grounded in real conversations. He refined it by running it against 5–10 real decks and comparing its output to what he actually said, iterating until it felt like an accurate representation. Founders can now run their deck past his AI proxy, and he's offering other VCs a way to build their own. https://x.com/nikunj/status/2062659649732825549
Every CEO Dan Shipper launched Spiral 4.0, a writing partner for both you and your agent. Its new Style Engine applies the principles of stylometry to extract your voice, or your brand's, from past work and produce on-brand writing every time. Spiral is now also usable via MCP and CLI by agents like Codex, Claude Code, and OpenClaw. His 30-person team uses it daily to write landing pages, tweets, podcasts, and marketing emails while keeping everything on-brand. https://x.com/danshipper/status/2062628079869005876
OpenAI CEO Sam Altman highlighted two product rollouts: a big upgrade to ChatGPT memory shipping today, and the ability to build and publish web apps directly with ChatGPT. On the latter he added a personal note: "I really wish I had this when I was a kid, but I do miss hypercard." https://x.com/sama/status/2062661071761211561
OFFICIAL BLOGS
Anthropic Engineering
An update on recent Claude Code quality reports — Anthropic published a detailed postmortem tracing a month of "Claude feels dumber" complaints to three separate, overlapping changes affecting Claude Code, the Agent SDK, and Cowork — while confirming the API itself was never impacted. First, a March 4 change dropped the default reasoning effort from high to medium to cut latency; this was "the wrong tradeoff" and was reverted April 7. Second, a March 26 caching optimization meant to clear old thinking once on idle sessions had a bug that cleared it on every turn, making Claude "increasingly without memory of why it had chosen to do what it was doing" and draining usage limits faster. Third, an April 16 system-prompt line capping responses ("keep text between tool calls to ≤25 words") hurt coding quality and was reverted April 20. Because each change hit a different slice of traffic on a different schedule, the aggregate looked like broad, inconsistent degradation. Notably, when back-tested, Opus 4.7's Code Review found the caching bug while Opus 4.6 didn't. Going forward, Anthropic is adding per-model eval suites for every system-prompt change, soak periods, gradual rollouts, and tighter prompt-change controls — and reset usage limits for all subscribers. https://www.anthropic.com/engineering/april-23-postmortem
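To make the second failure mode concrete, here is a toy Python sketch of a "clear once" cleanup that turns into "clear every turn". It is not Anthropic's code, and the summary above does not describe the actual implementation. The `Session` class, the `was_idle` flag, the `IDLE_LIMIT_S` threshold, and the idea that the bug came from a flag that never resets are all assumptions made purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field

IDLE_LIMIT_S = 3600  # hypothetical idle threshold, in seconds


@dataclass
class Session:
    thinking: list[str] = field(default_factory=list)  # retained reasoning
    last_active: float = 0.0
    was_idle: bool = False  # hypothetical "this session went idle" marker


def take_turn(session: Session, now: float, thought: str, buggy: bool) -> None:
    """Process one turn, clearing old thinking if the session went idle."""
    if now - session.last_active > IDLE_LIMIT_S:
        session.was_idle = True
    if session.was_idle:
        session.thinking.clear()
        if not buggy:
            # Intended behavior: clear once, then resume keeping thinking.
            session.was_idle = False
        # Assumed bug: the marker is never reset, so every later turn clears again.
    session.thinking.append(thought)
    session.last_active = now


def run(buggy: bool) -> list[str]:
    """Replay five turns; the third arrives after a long idle gap."""
    session = Session()
    turns = [(0, "a"), (10, "b"), (5000, "c"), (5010, "d"), (5020, "e")]
    for now, thought in turns:
        take_turn(session, now, thought, buggy)
    return session.thinking


print(run(buggy=False))  # ['c', 'd', 'e']
print(run(buggy=True))   # ['e']
```

In the intended path the session resumes accumulating thinking after the one-time clear. In the buggy path only the latest thought survives each turn, which is consistent with the "without memory" symptom the postmortem describes.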
Anthropic Engineering
Scaling Managed Agents: Decoupling the brain from the hands — Anthropic introduced Managed Agents, a hosted service that runs long-horizon agents by virtualizing an agent into three independent interfaces: a session (an append-only log of everything that happened), a harness (the loop that calls Claude and routes its tool calls), and a sandbox (where Claude runs code). The core insight is borrowed from operating systems — designing for "programs as yet unthought of" — and the key move is decoupling the "brain" (Claude and its harness) from the "hands" (sandboxes and tools) and the session log, so each can fail or be swapped independently. Their original coupled design had "adopted a pet": when a container failed, the session was lost. Pulling the harness out of the container turned both into interchangeable "cattle," let containers be provisioned only when needed, and dropped p50 time-to-first-token roughly 60% and p95 over 90%. It also fixed a security hole by ensuring credentials are never reachable from the sandbox where Claude's generated code runs. The session, crucially, "is not Claude's context window" — it's a durable, interrogable context object that lives outside it. https://www.anthropic.com/engineering/managed-agents
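The three-interface split can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Managed Agents API: `Session`, `Sandbox`, `OneShotSandbox`, `harness_step`, and the retry policy are hypothetical names invented here, and the call to Claude is left out (the tool call is passed in as a plain string). What it shows is that when the log lives outside the sandbox, a dead container costs one retry rather than the whole session.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional, Protocol


@dataclass
class Session:
    """Append-only event log that lives outside any sandbox (hypothetical)."""
    events: list[dict] = field(default_factory=list)

    def append(self, kind: str, payload: str) -> None:
        self.events.append({"seq": len(self.events), "kind": kind, "payload": payload})

    def replay(self, kind: Optional[str] = None) -> list[dict]:
        """Return logged events, optionally filtered by kind."""
        return [e for e in self.events if kind is None or e["kind"] == kind]


class Sandbox(Protocol):
    """The 'hands': anything that can execute code and return output."""
    def run(self, code: str) -> str: ...


class OneShotSandbox:
    """Hypothetical stand-in for a container that may already be dead."""
    def __init__(self, healthy: bool) -> None:
        self.healthy = healthy

    def run(self, code: str) -> str:
        if not self.healthy:
            raise RuntimeError("container died")
        return f"ran: {code}"


def harness_step(
    session: Session,
    tool_call: str,
    provision: Callable[[], Sandbox],
    retries: int = 1,
) -> str:
    """One harness turn: log the call, run it in a fresh sandbox, log the result."""
    session.append("tool_call", tool_call)
    for _ in range(retries + 1):
        sandbox = provision()  # sandboxes are provisioned only when needed
        try:
            result = sandbox.run(tool_call)
        except RuntimeError as exc:
            # The sandbox is disposable; the session log is not affected.
            session.append("sandbox_error", str(exc))
            continue
        session.append("tool_result", result)
        return result
    raise RuntimeError("all sandboxes failed")


session = Session()
pool = iter([OneShotSandbox(healthy=False), OneShotSandbox(healthy=True)])
print(harness_step(session, "print(1)", lambda: next(pool)))  # ran: print(1)
print([e["kind"] for e in session.events])
# ['tool_call', 'sandbox_error', 'tool_result']
```

Because `harness_step` only touches the session through `append`, the same log could be handed to a different harness process after a crash, which is the "interchangeable cattle" property described above.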
Claude Blog
New in Claude Managed Agents: self-hosted sandboxes and MCP tunnels — Managed Agents can now execute tools in a sandbox you control and connect to your private MCP servers, keeping sensitive files and services inside your enterprise perimeter while Anthropic's infrastructure still handles orchestration, context management, and error recovery. Self-hosted sandboxes (public beta) run on your own infrastructure or with managed providers Cloudflare, Daytona, Modal, or Vercel; early adopters include Amplitude (building Design Agent on Cloudflare), Clay (whose GTM agent Sculptor runs on Daytona), and Rogo (an institutional-finance analyst agent on Vercel Sandbox). MCP tunnels (research preview) let agents reach internal databases, private APIs, and ticketing systems through a lightweight gateway making a single outbound connection — no inbound firewall rules, no public endpoints, traffic encrypted end to end. https://claude.com/blog/claude-managed-agents-updates
Claude Blog
New connectors in Claude for everyday life — Anthropic expanded Claude's connectors beyond work tools to the apps people use throughout their week, including AllTrails, Audible, Booking.com, Instacart, Intuit TurboTax, Resy, Spotify, Uber, and more — the directory now spans over 200 connectors since launching in July 2025. Connectors now surface dynamically: Claude suggests the right app for what you're doing (a hike, a grocery cart, a reservation) from your preferences and conversation context, and when multiple connected apps could help, it shows them all and lets you choose. Anthropic stressed Claude stays ad-free with no paid placements, your data isn't used for training, and it checks with you before booking or purchasing on your behalf. https://claude.com/blog/connectors-for-everyday-life
PODCASTS
The MAD Podcast with Matt Turck — "OpenAI's Dan Roberts: Why AI Can Now Make Discoveries"
The Takeaway: Reinforcement learning has gone from the "cherry on top" to the cake itself — and it's now powerful enough that AI is starting to make genuine, original scientific discoveries.
Dan Roberts leads the Foundations of Reinforcement Learning team at OpenAI, arriving there from a deep background in theoretical physics — a PhD from MIT on quantum gravity and black holes, and a book, The Principles of Deep Learning Theory. His mandate isn't just making RL work, but understanding *how* it works and how it scales. The conversation lands during an extraordinary week in which OpenAI, DeepMind, and Anthropic each cracked famous unsolved Erdős problems in mathematics.
What makes the OpenAI result striking is that the model went contrarian: a conjecture everyone assumed was true, the model assumed false — and then persevered down a very long, expert calculation path through algebraic number theory to disprove it. "When you go against the grain and do something contrarian like that, you really have to have strong conviction in what you're doing in order to persevere down a really long calculation path." Roberts contrasts OpenAI's informal-reasoning approach with DeepMind's formal-proof approach in the Lean language.
His most contrarian take is on scaling itself: he rejects the idea that capabilities "emerge" discontinuously at scale. If something looks like it grokked or broke, "it means that you didn't understand something about what you were scaling up" — the job of a physicist-turned-AI-researcher is to go back to smaller, simpler toy models and restore smoothness to the scaling curve. He also pushes back on Rich Sutton's pure-RL view, arguing language is the right grounding layer for intelligence because "everything goes through language" — the sum of human knowledge is represented there. And he's already convinced AI is doing real science, citing the unit-distance proof as a case where no single human likely had the exact mix of skills to solve it. https://www.youtube.com/watch?v=oWOz2htozfI