AgenticOcean Research

Pushing the frontier of autonomous reasoning.

Our research team publishes open papers, releases datasets, and ships the techniques back into the AgenticOcean platform within weeks — not years.


Recent Papers

Reflexive Tool-Use: Sub-quadratic planning for long-horizon agents

Feb 2026 · arXiv preprint

Sentinel: a policy-as-code firewall for LLM tool calls

Jan 2026 · USENIX Security (under review)

Cognition-Bench: a 412-task evaluation for enterprise agents

Dec 2025 · NeurIPS 2025 Workshop

Ocean-RAG: retrieval that beats fine-tuning on domain QA

Oct 2025 · EMNLP 2025

Benchmarks & Evals

Every release of AgenticOcean is scored against a public-facing eval suite. We publish the numbers — wins and regressions.

SWE-Bench Verified · 62.4% · +8.1 vs v1.4

Cognition-Bench Enterprise · 78.9% · SOTA

τ-Bench (Tool Use) · 71.2% · +3.4 vs v1.4

Safety & Alignment

We red-team every model and every agent template before it ships. Findings, mitigations and residual risks are published in our quarterly Safety Report.

  • Prompt-injection corpus (4.1M adversarial samples, open-source)
  • Tool-call sandboxing with capability tokens
  • Human-in-the-loop primitives in every SDK
  • Independent third-party SOC2 Type II audits

Quarterly Safety Report — Q1 2026

119 evaluated jailbreaks · 4 critical mitigations shipped · 0 unresolved high-severity issues.

Download PDF →

Open Datasets

Cognition-Bench-Enterprise

412 tasks · 4.2 GB

Multi-step enterprise workflows across finance, HR and IT.

Ocean-RAG-1M

1.0M Q/A pairs · 12 GB

Domain-grounded retrieval QA across 14 industries.

InjectBench

4.1M samples · 9 GB

Adversarial prompt-injection corpus for safety research.

Collaborate with us

We sponsor PhD residencies, fund external research grants and partner with university labs on long-horizon agentic systems.

research@agenticocean.app