ai builders

June 9, 2026

15 builders · 27 posts · 1 podcast · 1 blog

Onyx Security CEO Maxim Bar Kogan argues that the hardest agent-security problem is independently judging whether each action is legitimate, and that the answer is tiny, purpose-built 'not smart' models that flag when a smarter agent should look; Cognition and METR release FrontierCode, a maintainer-validated benchmark where FC Diamond holds Opus 4.8 to 13.8%; Anthropic ships a Swift package to call Claude through Apple's Foundation Models framework; and Box's Aaron Levie, Google's Josh Woodward, Claude Code's Boris Cherny, and Sam Altman weigh in on context, NotebookLM outputs, Claude Code workflows, and OpenAI's roadmap.

X / TWITTER

AI Engineer founder and Latent Space writer Shawn Wang (swyx) announced FrontierCode, a new coding benchmark from Cognition and METR, after METR found that more than half of SWEBench results are unmergeable slop. FrontierCode represents 1,000+ hours of maintainer-validated software engineering work, with 3,000+ rubrics that grade code quality and catch the reward hacking plaguing other benchmarks. Its hardest tier, FC Diamond, holds Opus 4.8 to just 13.8%. He frames three eras of coding benchmarks: 2021 autocomplete (HumanEval), 2023 passing tests (SWEBench, TerminalBench), and 2026 maintainable code (FrontierCode). His most striking data point: on a historical run, the easiest third of tasks was suddenly solved in late 2025, when Opus nearly doubled its pass rate from 41% to 74% in four months. He says this explains the "WTF happened in Dec 2025" vibe shift Karpathy and DHH flagged, and why agentic loops like ralph loops and /goals finally feel feasible without things going off the rails.

Google Labs VP Josh Woodward highlighted what he calls NotebookLM's new killer feature: the ability to easily expand your search beyond your own uploaded source files. With the same update, NotebookLM can also generate new output formats (PDF, DOCX, XLSX, PPTX, and charts) as part of a push to make it a stronger research tool.

Anthropic's Boris Cherny (Claude Code) marked a year since Claude Code's GA by sitting down with Cat Wu to talk about what has changed in how he works: why he now uses auto mode instead of plan mode, how routines fix bugs before he ever sees them, and why he does most of his coding from his phone these days.

Box CEO Aaron Levie argued that no amount of intelligence packed into AI models can replace the need for context. For any sufficiently general-purpose AI, you always have to steer it, because it has an infinite range of directions it could go. As long as the same model is used by a lawyer, an engineer, a financial analyst, and a healthcare professional, instructions, domain context, and proprietary data will always have to make it into the context window for the model to be useful. This, he says, is why AI automation does not come for free and why there is still such a wide spread between who is getting big gains and who is not — and it is a structural advantage for applied AI: any layer of abstraction above raw intelligence that gets you off to the races faster stays valuable.
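
Levie's point that the same model needs different instructions, domain context, and proprietary data for each user can be sketched as a small prompt-assembly step. This is a minimal illustration only: `Profile`, `build_prompt`, and the character budget are hypothetical names and choices, not anything Box or any model provider ships.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Hypothetical per-user steering: who is asking and how to answer them."""
    role: str
    instructions: str
    domain_notes: list[str] = field(default_factory=list)


def build_prompt(profile, documents, question, budget_chars=1200):
    """Assemble instructions, domain context, and documents into one prompt.

    Documents are added in order until a rough character budget is reached
    (separators are not counted), standing in for a finite context window.
    """
    parts = [f"Role: {profile.role}\nInstructions: {profile.instructions}"]
    parts += [f"Domain note: {note}" for note in profile.domain_notes]
    tail = f"Question: {question}"

    used = sum(len(p) for p in parts) + len(tail)
    for doc in documents:
        entry = f"Document: {doc}"
        if used + len(entry) > budget_chars:
            break  # stop at the first document that does not fit
        parts.append(entry)
        used += len(entry)

    parts.append(tail)
    return "\n\n".join(parts)
```

Given the same question and documents, a lawyer profile and an engineer profile produce different prompts, which is the steering work that raw model intelligence does not do on its own.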

FPV Ventures partner Nikunj Kothari noted how many "autonomous" companies have launched in the past few months, but cautioned that even with all the loops, the last mile is still quite hard. His read: that gap probably shrinks over the next few months.

OpenAI CEO Sam Altman shared what he framed as OpenAI's current plan, pointing followers to a single document laying it out.

OFFICIAL BLOGS

Claude Blog: Building intelligent apps for Apple platforms with Claude in the Foundation Models framework. Anthropic released a new Swift package that lets Apple developers call Claude through Apple's own Foundation Models framework. The framework already returns typed Swift values via guided generation in as few as three lines of code, powering on-device features like summarization and extraction; the new package lets developers hand off to Claude when a request needs multi-step reasoning, code generation, web search for current information, or code execution for data analysis. Because Apple's framework returns typed values from @Generable annotations, developers "arrive at the Claude API call with clean inputs instead of raw user text," and Claude's response streams back into the same SwiftUI view. The example given: a journaling app generates daily prompts on-device, then asks Claude to find threads across months of entries — "one experience for the user, backed by the right model for each step." Support arrives tomorrow and works on iOS 27, iPadOS 27, macOS 27, visionOS 27, and watchOS 27; add the package, sign in with an Anthropic API key, and the package handles streaming, tool calls, and structured responses.

PODCASTS

No Priors — "Building an AI Guardian for Enterprise with Onyx Security CEO Maxim Bar Kogan"

The Takeaway: The hardest part of securing AI agents isn't watching what they do — it's independently judging whether each action is legitimate, and the trick is to train tiny, deliberately dumb models that know only when to call in a smarter one.

Maxim Bar Kogan is co-founder and CEO of Onyx Security, a Tel Aviv startup of researchers and mathematicians, many from Israeli intelligence, building agents to watch other AI agents. He bet on agent actions back when enterprises had almost none, inspired by AutoGPT: "It did give everyone a glimpse into the future of what if the models were good enough." That future arrived quickly, almost just before his runway ran out. Today over 50% of the agent activity Onyx sees in a typical enterprise comes from autonomous coding agents, the fastest-growing and least-controlled category. Traditional security fails here because we deliberately hand these agents our own permissions: your identity controls, endpoint tools, and API gateways "don't have the context to understand what these very flexible, unpredictable systems are doing."

His most counterintuitive engineering choice: don't put a smart agent on every agent — it's too slow and would cost more than the AI it guards. Instead Onyx trains "very not smart models, but models that are just good at one thing... they almost can't do anything else other than be able to say, should I have a smarter agent look at this?" He likens it to blitz chess: grandmasters play most moves on intuition and only stop to calculate deeply at the few critical moments.
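
The cascade described above can be sketched in a few lines: a cheap gate looks at every action and only escalates some of them to an expensive reviewer. This is an illustration of the general pattern, not Onyx's implementation. The rule-based `tiny_gate`, the `smart_review` stand-in, and the cost figures are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """One action taken by a monitored agent (hypothetical schema)."""
    agent: str
    tool: str
    target: str


def tiny_gate(action):
    """Stand-in for a tiny single-purpose model.

    It answers exactly one question: should a smarter agent look at this?
    Simple rules are used here in place of a trained model.
    """
    risky_tools = {"shell", "deploy", "delete"}
    return action.tool in risky_tools or action.target.startswith("prod/")


def smart_review(action):
    """Stand-in for the slow, expensive reviewer that judges legitimacy."""
    if action.tool == "delete" and action.target.startswith("prod/"):
        return "block"
    return "allow"


def guard(actions, gate=tiny_gate, reviewer=smart_review):
    """Run every action through the gate; escalate only the flagged ones."""
    verdicts, escalated = [], 0
    for action in actions:
        if gate(action):
            escalated += 1
            verdicts.append(reviewer(action))
        else:
            verdicts.append("allow")
    return verdicts, escalated


def relative_cost(escalation_rate, gate_cost=1.0, review_cost=200.0):
    """Per-action cost of the cascade relative to reviewing everything.

    The default costs are arbitrary illustrative units, not measurements.
    """
    return (gate_cost + escalation_rate * review_cost) / review_cost
```

Under these made-up costs, escalating 5% of actions brings the per-action cost to roughly 5.5% of running the smart reviewer on everything, which is the economic argument for keeping the gate small.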

Why can't the labs just do this themselves? Independence — you don't let the company selling you the car certify it — and access: enterprises will share historical agent behavior with Onyx but not with "very data-hungry companies that will want to train on that data." Looming over all of it is "mythos," the collapsing cost of automated vulnerability finding, which he says the market is right to take seriously.
