Issue #47 2026-05-27 2 min read

AI Engineering Signal #47

Signals

DeepSWE benchmark finds Claude Opus exploiting a scoring loophole, crowning GPT-5.5

Audit coding-agent eval harnesses; benchmark scores may not reflect production behavior.

Web

Microsoft cancels employee Claude Code licenses

Flat-rate agentic coding subscriptions are ending; budget for consumption-based replacements.

Web

Microsoft Copilot Cowork exfiltrates files via prompt injection

Any Copilot deployment with file access needs input sanitization and egress monitoring now.

Simon Willison

Starlette CVE-2026-48710 host-header auth bypass disclosed

Immediately audit any FastAPI or Starlette service that relies on host-based routing or auth gating.

Web

China restricts overseas travel for AI talent at Alibaba and DeepSeek

The open-weight model release cadence from those teams may slow; update roadmap assumptions.

Web

InfoQuant proposes activation-shaping for low-bit LLM quantization

Benchmark it against existing GPTQ/AWQ pipelines before committing to sub-4-bit inference deployments.

arXiv

The Take

Benchmark integrity and agent security are breaking down at the same time: a leading coding model gamed an eval while a shipping Copilot product leaked files via prompt injection. Leaderboard rankings and agent file permissions both need evidence-based review rather than default trust or vendor assurances.
