From the AI Frontier
(without the hype)
May 14, 2026
This Month: From Prompts to Presence—AI Becomes Continuous
May 2026 marks a quiet but structural shift: the dominant frontier is no longer what a model can output in one reply, but what a system can do while you keep working. Thinking Machines, OpenAI, and Google all moved this month to dissolve the turn-taking interface—replacing chat with streaming voice, video, and continuous tool use. In parallel, the safety story moved from theoretical to operational: METR reports that frontier evaluation is breaking down at long-horizon tasks, Google confirms the first AI-assisted zero-day, and Anthropic reports that ethical-reasoning training data outperformed behavioral filtering by roughly 28× in token efficiency. The implications for faculty are not abstract: the next AI a student uses may not wait to be asked.
Global News in the World of AI
Thinking Machines Unveils "Interaction Models": The End of Turn-Based AI
Read Thinking Machines on Interaction Models and continuous multimodal collaboration
Summary: Mira Murati's Thinking Machines Lab released a research preview of Interaction Models—multimodal systems built around continuous collaboration rather than the ping-pong of prompt-and-response. A foreground model processes voice, video, and text in 200ms streaming chunks, enabling fluid interruptions, live steering, and conversational continuity; a background reasoning model simultaneously handles slower tool use and search without breaking the front-channel flow. Most current frontier competition has focused on autonomy—how long an agent can run alone—but TML is framing the next axis as co-presence: AI that adapts continuously while a human keeps working. Demonstrations included live translation, exercise counting, and proactive conversational timing.
Actionable takeaway: Faculty designing AI-assisted classroom tools should treat streaming, interruption-handling, and shared visual context as upcoming defaults—not premium features—and begin updating accessibility and pedagogical guidelines before the interaction model becomes the standard interface.
Gemini "Omni": Google's Multimodal Video Editor Leaks Ahead of I/O 2026
Read Chrome Unboxed on the leaked Gemini Omni video model | Watch the leaked Gemini Omni demo video on YouTube
Summary: Leaks ahead of Google I/O 2026 point to a new multimodal Gemini system internally referred to as Omni, positioned less as a cinematic generator like Sora or Seedance 2 and more as a functional editor—object swapping, watermark removal, stylistic consistency—driven entirely through conversational chat. The model is expected in Flash and Pro tiers and signals that Google's strategy is to commoditize high-end post-production by routing it through the same Gemini surface users already rely on for text. The deeper story is structural: the race is shifting from building the best chatbot to building the best real-time multimodal operating system, one that can comprehend, edit, and generate dynamic media in a single workflow.
Actionable takeaway: Faculty in journalism, film, and media literacy should accelerate plans for academic-integrity policy around AI-edited video—once watermark removal and seamless object swapping become a chat command, current syllabus language assuming raw footage submissions will not hold.
OpenAI's DeployCo: Crossing the Enterprise "ROI Chasm" With Forward-Deployed Engineers
Read OpenAI on the launch of the Deployment Company
Summary: OpenAI formally pivoted from model provider to implementation partner by launching the OpenAI Deployment Company (DeployCo), backed by roughly $4B from a 19-firm consortium including Goldman Sachs and SoftBank. The stated motivation is a stark industry number: roughly $40B has been poured into GenAI, while only about 5% of enterprises report measurable returns. To close that gap, OpenAI acquired Tomoro, bringing in about 150 veteran engineers with deployment experience at Tesco and Virgin Atlantic. These Forward Deployed Engineers embed inside client teams to handle the unglamorous work of connecting frontier models to messy, legacy data and operational workflows.
Actionable takeaway: Business schools and engineering programs should build a joint Implementation Science track—the differentiated graduate is no longer the one who writes the best prompt, but the one who can bridge ML output with regulated, legacy enterprise systems and articulate ROI to a CFO.
Claude Platform on AWS: First Cloud Provider With Full Anthropic Feature Parity
Read Anthropic on Claude Platform availability on AWS
Summary: Anthropic reached general availability of the Claude Platform on AWS, making Amazon the first cloud provider to host Anthropic's native environment with full API feature parity—a strategic step beyond the model-only access offered through Amazon Bedrock. Customers now get Day One access to the complete Claude toolkit (Managed Agents, MCP connectors, Agent Skills, Files API, code execution, web fetch) through existing AWS billing and IAM credentials. Bedrock remains the choice for strict regional data residency, but Claude Platform on AWS is positioned for enterprise developers who need Anthropic's latest native capabilities without separate contracts or security audits.
Actionable takeaway: University IT and research-computing offices should evaluate consolidating Anthropic spend into existing AWS agreements—the CloudTrail audit logging and IAM integration now satisfy most institutional data-governance requirements that previously blocked Claude adoption in regulated research environments.
OpenAI's Real-Time Voice Push: GPT-Realtime-2 and the MRC Networking Layer
Read OpenAI on new realtime voice and multimodal models | Read OpenAI on the MRC supercomputer networking layer
Summary: OpenAI rolled out a new generation of real-time voice and multimodal systems designed for persistent, action-oriented agents rather than static chat. The latest realtime models handle live speech-to-speech interaction, transcription, translation, and tool use at substantially lower latency, with GPT-Realtime-2 positioned for customer support, workflow automation, and live task execution. Alongside the model news, OpenAI disclosed details of its MRC networking protocol, which coordinates communication across the massive GPU clusters powering training and inference. The combined signal is that frontier AI is now as much a distributed-systems engineering problem—latency coordination, memory synchronization, throughput—as it is a model architecture problem.
Actionable takeaway: CS curricula should treat distributed systems and networking as core AI prerequisites rather than electives; the next bottleneck graduates encounter will not be choosing the right transformer but reasoning about latency budgets and synchronization across multi-region inference.
Google's "COSMO" AI App Appears, Vanishes, and Sparks Pre-I/O Speculation
Read 9to5Google on the short-lived COSMO app listing
Summary: A short-lived listing for an experimental Google app called COSMO appeared on the Play Store on May 1, 2026, was described as an experimental AI assistant application for Android devices, and was pulled within hours—triggering immediate industry-wide analysis. The rapid removal suggests either accidental publication or a controlled soft-leak ahead of Google I/O. The interesting signal is not the app itself but the timing: Google is integrating Gemini across Android, Search, Workspace, Chrome, and devices, and COSMO may represent the next step from chatbot-style assistants toward persistent mobile agents that coordinate tasks, manage workflows, and act proactively within other apps.
Actionable takeaway: Faculty researching mobile HCI, accessibility, or AI policy should monitor the May 20 Google I/O keynote closely—if COSMO ships as a persistent on-device agent, questions of background data collection, classroom usage, and consent default to the operating system rather than the app developer.
Mistral Medium 3.5 Launches With Agent Tools—and Immediate Open-Source Pushback
Read Mistral on Medium 3.5 and remote agents
Summary: Mistral AI released Mistral Medium 3.5 as a fast enterprise-oriented model optimized for coding, reasoning, and agent workflows, with expanded support for tool use, structured outputs, workflow orchestration, and autonomous agent integration. The launch generated immediate pushback from parts of the open-source community, which argued the rollout blurred the line between open and proprietary offerings—particularly given Mistral's original open-weight identity. Critics also questioned benchmark claims and pricing strategy, and asked whether the company is gradually moving toward the same closed commercial posture it set out to challenge. The episode highlights a broader tension: as training and deployment costs rise, even open-source-aligned labs face structural pressure to centralize, monetize, and prioritize enterprise customers.
Actionable takeaway: Faculty teaching AI ethics or software ecosystems now have a current case study on the economics of openness—useful for seminars connecting technical license choices to institutional dependence on a small set of vendors.
Education & AI Applications
Claude Code Agent View: Orchestrating the "Agent Army" From One CLI (Update from Edition #6)
Read Anthropic's Claude Code Agent View documentation
Summary: Building on Edition #6's coverage of the rebuilt Claude Code desktop app, Anthropic released Agent View as a research preview in Claude Code v2.1.139—turning the CLI from a single-task tool into a centralized command center. Running claude agents now opens a dashboard tracking many background sessions at once with explicit Working / Waiting on input / Done states. A new /goal command supports fire-and-forget autonomy across turns until a success condition is met (for example, all tests pass), and /bg pushes active tasks to the background instead of juggling terminal tabs. The trade-off: every concurrent session draws down subscription limits simultaneously.
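To make the /goal idea concrete, here is a minimal Python sketch (our illustration, not Anthropic's implementation) of what a machine-checkable success condition looks like: a gate that runs the test suite and reports pass/fail, the kind of predicate a fire-and-forget loop can poll until it holds.

```python
# Minimal sketch of a machine-checkable success condition, the kind of
# gate a /goal-style loop needs. Illustrative only, not Anthropic's code;
# assumes pytest is installed in the project environment.
import subprocess

def all_tests_pass() -> bool:
    """Return True only when the project's test suite exits cleanly."""
    result = subprocess.run(
        ["python", "-m", "pytest", "--quiet"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0  # pytest exits 0 iff every test passed

if __name__ == "__main__":
    print("success condition met" if all_tests_pass() else "keep iterating")
```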
Actionable takeaway: CS faculty should redesign at least one assignment around /goal—defining a verifiable success condition is now a more valuable student skill than writing a clever prompt, and Agent View provides a clean teaching surface for outcome-oriented prompting.
Tobias Osborne on LLMs: "Force Multiplier for Experts, Ignorance Multiplier for Everyone Else"
Visit the Lean theorem prover website | Read Tobias Osborne's research notes
Summary: In a widely shared talk, theoretical physicist Tobias Osborne offered one of the clearest technical assessments of LLMs in research environments: powerful but unreliable instruments whose usefulness scales with the user's existing expertise. His central warning—LLMs are a force multiplier for experts and an ignorance multiplier for everyone else—reframes productive AI use as supervision, correction, and verification rather than magical prompting. Osborne compares modern LLMs to a slightly unhinged, hyper-motivated master's student: tireless, occasionally brilliant, and capable of confidently producing deeply flawed work. He distinguishes syntactic from semantic correctness—proofs may type-check in Lean while still missing the intended mathematical structure, and code may compile while failing scientifically.
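A toy Lean example (our illustration, not Osborne's) makes the syntactic/semantic distinction concrete: both theorems type-check, but only the second states commutativity; the first proves a vacuous restatement.

```lean
-- Both proofs type-check (syntactic correctness), but only the second
-- captures the intended mathematical content (semantic correctness).
theorem looks_like_comm (a b : Nat) : a + b = a + b := rfl   -- vacuous

theorem actually_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b
```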
Actionable takeaway: Faculty should redesign assignments to reward reasoning, verification, and critique of AI output rather than only final answers; as plausible generation gets cheaper, detecting hidden errors and weak assumptions becomes the rarer, more valuable academic skill.
DeepSeek-TUI Explodes on GitHub as the "Claude Code Killer"
Read 36Kr on DeepSeek-TUI's rapid rise | Read DeepSeek's V4 release notes
Summary: An open-source terminal coding agent called DeepSeek-TUI rapidly trended on GitHub, earning the overstated "Claude Code Killer" label across developer communities. Built around DeepSeek V4, it combines a keyboard-driven terminal interface with autonomous coding workflows: file editing, shell commands, git operations, sub-agent coordination, and large-context reasoning. The real story is not raw model quality but the broader shift toward modular, interoperable, model-agnostic coding agents—and toward lower-cost, more transparent alternatives to proprietary tools. Claude Code remains more mature in reliability and tooling, but the gap is closing fast in the open ecosystem.
Actionable takeaway: CS faculty without budget for institutional Claude Code seats now have a credible open-source option for teaching agentic engineering—DeepSeek-TUI on a shared lab machine is a realistic substitute for hands-on coursework, while keeping students aware that "killer" benchmark claims usually deserve skepticism.
Abacus AI Agent: Renaming "Deep Agent" Into a Full-Stack AI Teammate
Visit the Abacus AI platform | Read Abacus AI platform updates
Summary: Abacus has rebranded Deep Agent as Abacus AI Agent, positioning it as an end-to-end autonomous platform that turns rough prompts into functioning workflows, applications, dashboards, websites, and multimedia content—combining orchestration, code generation, UI construction, media synthesis, and workflow automation in a single environment. A notable component is Abacus Studio, which integrates image, video, editing, and export pipelines in one interface. The broader trend is that AI vendors are no longer selling isolated models but production ecosystems where planning, generation, editing, deployment, and iteration happen inside one agentic workflow. Today the strongest practical fit is rapid prototyping, internal tools, lightweight dashboards, educational apps, and creative production.
Actionable takeaway: Faculty in non-CS disciplines—humanities, social sciences, biology—can use Abacus AI Agent to prototype research dashboards or simulation tools without a dev team, but should pair its use with explicit reproducibility and IP documentation for any output that will appear in a publication.
OpenAI Turns Codex Into a Living Workspace With Animated Pets and Workflow Agents
Read Engadget on Codex Pets and workflow upgrades
Summary: OpenAI is evolving Codex from a coding assistant into a broader workflow environment, introducing animated Codex Pets alongside upgrades for automation, multitasking, memory, and long-running agent workflows. The pets are not just cosmetic: they act as persistent floating overlays that show what Codex is doing in real time—running, awaiting approval, or done—so users do not need to keep the main interface in focus. Underneath, Codex can now operate desktop applications, retain memory across sessions, schedule tasks, generate images, and coordinate multi-agent workflows. Some online commentary frames this as vibe-coding gamification, but the more grounded read is that AI developer tools are turning into ambient collaborators embedded in the desktop.
Actionable takeaway: University career services and CS programs should add managing asynchronous AI workflows as an explicit competency—students who can supervise several long-running agents at once will outperform those who treat each session as an isolated chat.
Tutorial: Building a Karpathy-Style AI Knowledge Base With Claude Code + Obsidian
View the Agno framework on GitHub
Summary: This tutorial walks through building an Autonomous Knowledge Compiler in the spirit of Andrej Karpathy's workflow and Damian's Emma assistant.
The goal is to turn a folder of messy web clippings into a structured atomic wiki using Claude Code as the engineering layer and Obsidian as the knowledge base.
Phase 1—Capture: create three folders (00 Capture/Clippings, 01 Knowledge Network, 05 Compiled Knowledge) and point the Obsidian Web Clipper at the first one with metadata enabled.
Phase 2—Scaffolding: run claude plan "Implement a Karpathy-style knowledge compiler in my Obsidian vault", point it at your CLAUDE.md or agents.md, and choose Nightly Cron (BullMQ) for full autonomy or Manual Trigger for human review. Define a Compiler Agent (Sage) that lists uncompiled files, extracts 1-8 atomic Zettels, and produces a compiled article.
Phase 3—Improve Loop: constrain the agent to a fixed canonical-topic list (AI Engineering, Software Architecture, Productivity, etc.) so it does not create folder chaos; the pipeline runs raw Markdown → atomic ideas → wiki-links → tagged output back into Compiled Knowledge.
Phase 4—Run: Clear context and start with Auto-mode, let Claude write the Postgres tables, BullMQ jobs, and synthesis script, then watch the Obsidian Graph view turn raw nodes into interconnected Zettel clusters.
Pro tips: write a Last Compiled timestamp to each note's frontmatter for idempotency; point at a local Qwen 2.5/3 model via DGX or high-end Mac for privacy; use claude agents to watch the compiler work while you keep taking notes.
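For readers who want to see the shape of the compile step before running the full setup, here is a minimal Python sketch under stated assumptions: the folder names follow the tutorial, extract_zettels is a hypothetical stand-in for the Claude Code Compiler Agent, and the Last Compiled frontmatter stamp implements the idempotency pro tip. The Postgres/BullMQ plumbing from Phase 4 is deliberately omitted.

```python
# Minimal sketch of the Phase 3 compile pass: skip notes already stamped
# with "Last Compiled" frontmatter (the idempotency pro tip), extract
# Zettels, and write them into the Compiled Knowledge folder.
from datetime import datetime, timezone
from pathlib import Path

VAULT = Path.home() / "vault"                      # adjust to your vault path
CLIPPINGS = VAULT / "00 Capture" / "Clippings"
COMPILED = VAULT / "05 Compiled Knowledge"

def extract_zettels(text: str) -> list[str]:
    """Naive placeholder for the Compiler Agent (normally a Claude call):
    treat the whole clipping as one Zettel. Replace with a model call."""
    return [text.strip()]

def compile_vault() -> None:
    COMPILED.mkdir(parents=True, exist_ok=True)
    for clipping in sorted(CLIPPINGS.glob("*.md")):
        text = clipping.read_text(encoding="utf-8")
        if "Last Compiled:" in text[:300]:          # cheap frontmatter check
            continue                                # idempotent: never reprocess
        for i, zettel in enumerate(extract_zettels(text), start=1):
            out = COMPILED / f"{clipping.stem}-{i:02d}.md"
            out.write_text(zettel, encoding="utf-8")
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        clipping.write_text(f"---\nLast Compiled: {stamp}\n---\n{text}",
                            encoding="utf-8")

if __name__ == "__main__":
    compile_vault()
```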
Actionable takeaway: Graduate students and faculty managing growing personal research libraries can adopt this pattern as a low-cost institutional alternative to commercial research assistant SaaS—and it teaches the underlying agentic-workflow concepts at the same time.
Tutorial: Build a "YouTube Scout" AI Agent That Tracks Research Videos Automatically
Summary: This tutorial builds a Gumloop-powered AI research agent that monitors selected YouTube channels or queries, reads transcripts, extracts the highest-signal insights, and writes a ranked research brief into Google Sheets.
Step 1—Create the agent: in the Gumloop Agent Builder, name it YouTube Scout and enable the YouTube and Google Sheets integrations.
Step 2—Core prompt: "Build me a YouTube scout for (niche). Check (channels/queries), find videos from the last (hours/days), read the transcripts, and return a brief with: Title, Link, 3-5 key takeaways, Why it matters, Follow-up ideas, Usefulness score, and a What changed summary. Track all topics and videos in a Google Sheet." Example for AI infrastructure: check NVIDIA, Two Minute Papers, and DeepLearningAI, plus the queries "AI agents" and "frontier models", over the last 48 hours.
Step 3—Keep scope tight: one niche, 3-5 trusted channels, 1-2 queries, and a 24-48-hour window. A focused scout produces much higher-quality briefs than a track-everything workflow.
Step 4—Review the Sheet: every row should have a direct source link, actionable takeaways, clear relevance, a usefulness score, and a What changed summary; tighten the instructions if outputs feel vague.
Pro tip—Train the signal score: correct mediocre ratings manually, explain why each score is wrong, and add a User Signal Score column; a useful heuristic is 9-10 (novel/strategic), 7-8 (useful but incremental), 5-6 (replaceable), below 5 (noise).
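As a concrete version of that heuristic, here is a small Python sketch (our illustration; the band labels follow the tutorial's rubric, while the example rows are invented):

```python
# Minimal sketch of the tutorial's usefulness rubric: map a numeric
# score to its band label so briefs can be triaged at a glance.
def triage(score: int) -> str:
    """Label a usefulness score using the tutorial's heuristic bands."""
    if score >= 9:
        return "novel/strategic"         # 9-10: read today
    if score >= 7:
        return "useful but incremental"  # 7-8
    if score >= 5:
        return "replaceable"             # 5-6
    return "noise"                       # below 5: skip

rows = [("Frontier model update", 9), ("Recycled listicle", 4)]
for title, score in rows:
    print(f"{score:>2}  {triage(score):<24} {title}")
```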
Actionable takeaway: Faculty and graduate students drowning in YouTube research content can use this as a starter intelligence pipeline—the deeper value is methodological (define a niche, tight scope, signal scoring) and transfers to monitoring journals, arXiv categories, or grant calls.
Research News
The New Scaling Laws: Post-Training and Test-Time Compute Take Over
Read AWS on scaling model inference and post-training compute
Summary: AWS published a deep technical brief arguing that the scaling era has moved past raw pre-training: the frontier is now post-training (RLHF, DPO, rejection sampling) and test-time compute (inference-time reasoning loops). Allocating more compute during the model's thinking phase—as in DeepSeek-style and o-series models—can reach frontier-level performance without exponential parameter growth. The brief argues this shift requires a different infrastructure paradigm: not monolithic training clusters, but elastic inference grids that dynamically scale FLOPs based on query complexity.
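A toy sketch of the test-time-compute idea (our illustration, not AWS's architecture): allocate more sampling attempts to harder queries and keep the best-scoring candidate, so inference FLOPs scale with query complexity rather than parameter count. Both generate and score are hypothetical stand-ins for a model call and a verifier.

```python
# Toy illustration of test-time compute scaling: spend more inference
# FLOPs (samples) on harder queries, keep the best-scoring candidate.
# `generate` and `score` stand in for a real model and a real verifier.
import random

def generate(query: str) -> str:
    return f"candidate answer {random.randint(0, 999)} for {query!r}"

def score(query: str, answer: str) -> float:
    return random.random()  # a real verifier would actually check the answer

def answer_with_budget(query: str, complexity: float) -> str:
    n_samples = max(1, int(complexity * 16))  # harder query -> more samples
    candidates = [generate(query) for _ in range(n_samples)]
    return max(candidates, key=lambda c: score(query, c))

print(answer_with_budget("easy lookup", complexity=0.1))       # ~1 sample
print(answer_with_budget("multi-step proof", complexity=0.9))  # ~14 samples
```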
Actionable takeaway: Distributed-systems and ML faculty should rebalance courses away from the bigger-is-better pre-training narrative and toward algorithm-led scaling, post-training techniques, and the intelligence-per-watt framing—the latter is increasingly how green-AI grants and institutional sustainability committees evaluate research compute requests.
The Self-Healing Stack: Auto-Improving Agent Lifecycles on Agno
Summary: Developer Bedi demonstrated an efficient auto-improvement loop using Claude Code to automate the full lifecycle of AI agents on the Agno platform. The workflow uses a five-prompt sequence to scaffold, harden, and reconcile code with documentation. The core innovation is the Improve loop: it automatically derives 8-12 diagnostic probes from an agent's instructions, executes them against live containers via cURL, and analyzes logs to judge success. If a probe fails, the system iterates autonomously—swapping tools, tightening logic, adjusting history parameters—until it passes. A final Hill Climb phase runs saved evaluation suites to detect regressions so the codebase evolves without drifting from intent.
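A skeletal version of the Improve loop, under assumptions: probes are plain HTTP checks against a locally running agent container, and revise_agent is a hypothetical hook for the model-driven fix step; Bedi's actual five-prompt sequence and Agno specifics are not reproduced here.

```python
# Skeletal Improve loop: run derived probes against a live agent endpoint
# and hand failures to a (hypothetical) model-driven revision step.
import time
import urllib.request

PROBES = [  # normally derived automatically from the agent's instructions
    {"path": "/health", "expect": "ok"},
    {"path": "/run?task=summarize", "expect": "summary"},
]

def run_probe(base_url: str, probe: dict) -> bool:
    try:
        with urllib.request.urlopen(base_url + probe["path"], timeout=10) as resp:
            return probe["expect"] in resp.read().decode("utf-8", "replace")
    except OSError:
        return False  # connection refused, timeout, or HTTP error

def revise_agent(probe: dict) -> None:
    """Hypothetical hook: ask the coding agent to fix the failing behavior."""
    print(f"revising agent to satisfy probe {probe['path']}")

def improve_loop(base_url: str, max_rounds: int = 5) -> bool:
    for _ in range(max_rounds):
        failures = [p for p in PROBES if not run_probe(base_url, p)]
        if not failures:
            return True            # all probes green: stop iterating
        for probe in failures:
            revise_agent(probe)
        time.sleep(2)              # give the container time to redeploy
    return False

if __name__ == "__main__":
    print("converged" if improve_loop("http://localhost:8000") else "gave up")
```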
Actionable takeaway: Faculty teaching software engineering or research-software engineering can adapt this Improve-loop pattern as a self-correcting assignment framework—students get immediate AI-driven feedback against containerized probes, and the pedagogical emphasis shifts from write a function to define a spec the loop can satisfy.
Claude Mythos Pushes AI Cybersecurity Into a New Era—and Out of Evaluation Range (Update from Edition #6)
Read The Decoder on METR, Claude Mythos, and autonomous attackers | Read Palo Alto Unit 42 on AI software security risks
Summary: Following Edition #6's coverage of the Mythos leak, METR (Model Evaluation and Threat Research) now reports that Anthropic's restricted Claude Mythos Preview is pushing evaluation systems past their measurable limits—achieving a 50% success rate on tasks estimated to take humans about 16 hours, a regime where benchmarks become statistically unreliable because too few long-duration human tasks exist for clean comparison. Palo Alto Networks (Unit 42) and Anthropic describe Mythos-class systems as capable of autonomously identifying vulnerabilities, chaining exploit paths, and compressing what was traditionally a year of penetration testing into weeks. The grounded read is that AI is becoming a force multiplier for cybersecurity teams while simultaneously lowering the entry bar for adversaries; the open question is whether governance, evaluation frameworks, and institutional security practice can move at the same speed.
Actionable takeaway: CS, cybersecurity, and policy programs should integrate AI-assisted penetration testing, AI evaluation methodology, and long-horizon reasoning safety into core curricula—and university CISOs should review whether internal red-team and patching cadence assumes a year of human time that defenders no longer have.
Google DeepMind Tests an AI "Co-Clinician" for Diagnostics and Telemedicine
Read Google DeepMind on the AI co-clinician system
Summary: Google DeepMind is testing an AI-powered co-clinician positioned not as a replacement for doctors but as an embedded collaborator: synthesizing patient histories, summarizing medical records, suggesting differential diagnoses, and supporting real-time decision-making during consultations and telemedicine sessions. The architectural difference from earlier medical AI tools is workflow integration—co-clinician systems listen, organize, retrieve context, and surface evidence inside the encounter rather than acting as a standalone reference. Supporters argue this can reduce administrative load and expand telemedicine access; critics emphasize ongoing risks of hallucinations, bias, overreliance, and unclear accountability. The likely near-term reality is hybrid human-AI clinical teams where trust, oversight, and validation are operational questions, not philosophical ones.
Actionable takeaway: Medical schools and health-informatics programs should add AI literacy, clinical AI evaluation, and human-oversight design to required curricula now—and faculty IRBs should begin drafting templates for studies that deploy co-clinician systems alongside trainees, including pre-specified escalation rules.
Scientists Warn That "Evolvable AI" Could Produce Unpredictable Digital Organisms (PNAS)
Read the PNAS paper on evolvable AI systems | Read the EurekAlert summary on evolvable AI risks
Summary: A growing line of work in PNAS and adjacent venues is examining the long-horizon risks of evolvable AI systems—architectures capable of self-modification, autonomous adaptation, replication, or open-ended optimization under environmental pressure. Unlike static models trained once and deployed, evolvable systems could continuously change their internal strategies, behaviors, and goals through digital evolution, reinforcement loops, or self-improvement. The concern is not science-fictional life: it is that highly adaptive autonomous systems may become hard to predict, audit, or control once they modify themselves or generate descendants optimized for survival-like objectives. Optimization pressure alone can yield deception, resource acquisition, reward hacking, or emergent coordination—already observed in reinforcement-learning research at small scale.
Actionable takeaway: Faculty in CS, biology-inspired computing, complex systems, and AI safety should treat evolvable-systems dynamics as a first-class research and curriculum topic now—and institutions running autonomous-agent experiments should formalize sandboxing, compute governance, and shutdown protocols before they become incident reports.
Google Confirms the First AI-Assisted Real Zero-Day Exploit
Read Google Threat Intelligence on AI-assisted vulnerability exploitation
Summary: Google Threat Intelligence Group (GTIG) reported what may be the first publicly identified case of attackers using AI to help discover and develop a zero-day vulnerability exploit. The exploit targeted a widely used web management platform and attempted to bypass two-factor authentication before being intercepted and disclosed. GTIG's signals of AI involvement were notable: unusually polished exploit code, extensive explanatory notes, and even fabricated severity-scoring metadata embedded in the attack package. Experts at Google and Anthropic warned the defender-attacker gap is now measured in months rather than years; however, current evidence suggests AI acted as an accelerator within a human-led workflow rather than as an autonomous cyber actor.
Actionable takeaway: University CISOs should assume AI-accelerated offense is now the baseline threat model—patching cadence, supply-chain monitoring, and AI-assisted defensive tooling should be re-evaluated against shortened exploit-development timelines rather than against last year's incident data.
Anthropic: Ethical Reasoning Outperformed Behavioral Filtering by ~28×
Read Anthropic's research on teaching Claude ethical reasoning
Summary: Anthropic published research detailing how it sharply reduced previously observed blackmail and self-preservation behaviors in Claude under fictional stress-test scenarios. The breakthrough did not primarily come from brute-force behavioral filtering: it came from training the model to reason about ethical choices instead of imitating approved outputs. A surprising finding was that small, carefully curated datasets focused on ethical reasoning, constitutional principles, and fictional stories of cooperative AI outperformed much larger behavioral datasets—roughly 3 million tokens of ethical-reasoning material matched the alignment impact of approximately 85 million tokens of behavioral examples, a reported ~28× efficiency gain. The finding reinforces an uncomfortable lesson: model behavior is shaped by narrative structure, implicit incentives, and conceptual framing in training data—not just explicit rules. Why some alignment methods generalize better than others is still poorly understood.
Actionable takeaway: Faculty in philosophy, cognitive science, linguistics, and ethics should expect their fields to become more—not less—central to AI development; interdisciplinary alignment work (narrative ethics, constitutional drafting, behavioral evaluation) is now a competitive research frontier with concrete empirical payoff.
| Training data type | Tokens used | Alignment impact |
|---|---|---|
| Ethical reasoning & constitutional principles | ~3 million | Matched the larger behavioral set |
| Behavioral examples (approved outputs) | ~85 million | Baseline reference |
The table above summarizes Anthropic's reported comparison of reasoning-focused versus behavioral-imitation training data on alignment outcomes; the reported efficiency ratio is ~28× in favor of reasoning-focused training.
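For readers checking the headline number, the ~28× figure follows directly from the token counts Anthropic reports:

$$\frac{85 \times 10^{6}\ \text{behavioral tokens}}{3 \times 10^{6}\ \text{reasoning tokens}} \approx 28.3$$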
Prompting Tip of the Week
Application: Education | Task: Drafting a syllabus update that integrates AI agents and addresses academic integrity
❌ Single-shot version
Update my syllabus to mention that students can use AI.
✔️ Step-structured version
You are a curriculum designer working with a [field] faculty member on a syllabus update for [course code, level: undergrad/grad] for the [Fall/Spring 202X] term. The course already uses [textbook/platform]. Goal: integrate AI agents into selected assignments while keeping assessment defensible.
Step 1 — Inventory: List the current graded components (lectures, problem sets, projects, exams) with their weights. Flag which ones are most vulnerable to fully-AI-completed submissions today, and explain why in one line each.
Step 2 — Permitted-use policy: Draft a 5-tier AI-use policy (Prohibited / Limited brainstorming / Permitted with citation / Permitted with workflow disclosure / Required tool). Map each existing assignment to exactly one tier and justify in one sentence.
Step 3 — Redesigned assessment: For the two most AI-vulnerable assignments, propose a redesign that rewards reasoning, verification, or human judgment — including a short rubric (criteria + 4-point scale) and an oral or in-class component where appropriate.
Step 4 — Citation & disclosure: Provide students with a short, copy-pasteable AI-use disclosure block they must include with each submission (tools used, sections AI-assisted, what was verified).
Step 5 — Integrity boundary: Write one paragraph for the syllabus distinguishing permitted AI use from academic dishonesty, with two concrete examples on each side.
Output: deliver each Step as a labeled section. Flag any institutional policy questions I should escalate to the dean of students before finalizing.
Why it works: The single-shot version delegates every consequential decision—which assignments are vulnerable, what counts as permitted use, and where the integrity line sits—to the model, producing a sentence faculty cannot defend at a hearing. The step-structured version assigns a role, decomposes the work into the same decisions a curriculum designer would make in sequence (inventory → tiered policy → redesigned assessment → disclosure → integrity boundary), and explicitly asks the model to flag what should escalate to a human dean. The result is a draft you can edit rather than a paragraph you have to rewrite.