Research | AgenticOcean

Recent Papers

Reflexive Tool-Use: Sub-quadratic planning for long-horizon agents

Feb 2026 · arXiv preprint

Sentinel: a policy-as-code firewall for LLM tool calls

Jan 2026 · USENIX Security (under review)

Cognition-Bench: a 412-task evaluation for enterprise agents

Dec 2025 · NeurIPS 2025 Workshop

Ocean-RAG: retrieval that beats fine-tuning on domain QA

Oct 2025 · EMNLP 2025

Benchmarks & Evals

Every AgenticOcean release is scored against a public eval suite. We publish the numbers, regressions as well as wins.

62.4% · SWE-Bench Verified · +8.1 vs v1.4

78.9% · Cognition-Bench Enterprise · SOTA

71.2% · τ-Bench (Tool Use) · +3.4 vs v1.4

Safety & Alignment

We red-team every model and every agent template before it ships. Findings, mitigations and residual risks are published in our quarterly Safety Report.

• Prompt-injection corpus (4.1M adversarial samples, open-source)
• Tool-call sandboxing with capability tokens
• Human-in-the-loop primitives in every SDK
• Independent third-party SOC 2 Type II audits
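
To make the capability-token idea in the list above concrete, here is a minimal sketch of how a tool call could be gated on a signed token. It is a generic illustration, not AgenticOcean's implementation: the names `CapabilityToken`, `issue_token`, `authorize`, `call_tool` and the hard-coded `SECRET_KEY` are all assumptions made for this example, and only the Python standard library is used.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass

# Hypothetical demo key; a real deployment would load this from managed secret storage.
SECRET_KEY = b"demo-only-secret"


@dataclass(frozen=True)
class CapabilityToken:
    tool: str        # the single tool this token applies to
    scopes: tuple    # actions the holder may perform, e.g. ("read",)
    signature: str   # HMAC-SHA256 over the tool name and scopes


def _sign(tool, scopes):
    # Serialize the claims deterministically before signing.
    payload = json.dumps({"tool": tool, "scopes": sorted(scopes)}).encode()
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def issue_token(tool, scopes):
    # Mint a token that grants the given scopes on one tool.
    scopes = tuple(sorted(scopes))
    return CapabilityToken(tool, scopes, _sign(tool, scopes))


def authorize(token, tool, action):
    # Reject tokens whose claims were altered after signing.
    expected = _sign(token.tool, token.scopes)
    if not hmac.compare_digest(expected, token.signature):
        return False
    # The token must name this tool and include the requested action.
    return token.tool == tool and action in token.scopes


def call_tool(token, tool, action, handler, *args):
    # Run the handler only if the token grants the requested action.
    if not authorize(token, tool, action):
        raise PermissionError(f"token does not grant {action!r} on {tool!r}")
    return handler(*args)
```

In this sketch an agent holding `issue_token("crm", ["read"])` could call a read handler on the hypothetical `crm` tool, while a write attempt or a call to any other tool would raise `PermissionError`. A production design would likely also add expiry and revocation, which are omitted here for brevity.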

Quarterly Safety Report — Q1 2026

119 evaluated jailbreaks · 4 critical mitigations shipped · 0 unresolved high-severity issues.


Open Datasets

Cognition-Bench-Enterprise

412 tasks · 4.2 GB

Multi-step enterprise workflows across finance, HR and IT.

Ocean-RAG-1M

1.0M Q/A pairs · 12 GB

Domain-grounded retrieval QA across 14 industries.

InjectBench

4.1M samples · 9 GB

Adversarial prompt-injection corpus for safety research.

Collaborate with us

We sponsor PhD residencies, fund external research grants and partner with university labs on long-horizon agentic systems.

research@agenticocean.app

Pushing the frontier of autonomous reasoning.