Building an Agentic AI Pipeline: Autonomous Content Creation with Human-in-the-Loop
The Agentic AI Paradigm Shift
Traditional automation is brittle: write scripts, handle edge cases, pray nothing breaks. Agentic AI flips this model. Instead of programming every decision tree, you give an AI agent:
- Goals — "Create engaging YouTube Shorts from this content"
- Tools — FFmpeg, WhisperX, YouTube API, file system access
- Autonomy — The agent decides how to achieve the goal
- Guardrails — Human review for quality-critical decisions
The agent isn't following a script. It's reasoning about what to do next, using tools to accomplish subtasks, and adapting when things go wrong.
My Setup: Jarvis, the Content Agent
I run Clawdbot, an agentic AI framework that gives Claude persistent memory, tool access, and the ability to operate autonomously. My instance — Jarvis — handles everything from code reviews to calendar management.
For this project, Jarvis became my content creation agent:
┌─────────────────────────────────────────────────────────────────┐
│                        JARVIS (AI Agent)                        │
│                                                                 │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐             │
│  │ Analyze │  │ Extract │  │ Caption │  │ Upload  │             │
│  │  Tool   │  │  Tool   │  │  Tool   │  │  Tool   │             │
│  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘             │
│       │            │            │            │                  │
│       └────────────┴─────┬──────┴────────────┘                  │
│                          │                                      │
│                    ┌─────▼─────┐                                │
│                    │  Planner  │ ← Decides what to do next      │
│                    │ (Claude)  │                                │
│                    └─────┬─────┘                                │
│                          │                                      │
└──────────────────────────┼──────────────────────────────────────┘
                           │
                    ┌──────▼──────┐
                    │    Human    │ ← Quality gate (Telegram)
                    │   Review    │
                    └─────────────┘
The key difference from traditional automation: the agent plans its own execution path.
Agentic Capabilities in Action
1. Autonomous Planning
When I say "process this movie into Shorts," Jarvis doesn't execute a fixed script. It reasons:
User: Process tears_of_steel.mp4 into YouTube Shorts
Jarvis (thinking):
- First, I need to analyze the video for high-engagement moments
- The video is 12 minutes, so I'll look for 30-60 second segments
- I should check if WhisperX transcription exists... it doesn't
- I'll transcribe first, then use the transcript to find interesting dialogue
- After extraction, each clip needs vertical conversion and captions
- Finally, I'll queue them for human review before upload
This planning happens in natural language, visible in the agent's reasoning traces. If something fails (quota exceeded, file missing), it re-plans.
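The plan, act, observe, re-plan cycle described above can be sketched in a few lines. This is an illustrative sketch only, not how Clawdbot is implemented: `runAgent`, `stubPlanner`, and `makeStubTools` are hypothetical names, the planner is a plain function standing in for the LLM call, and the tools are synchronous in-memory stubs (real tools would be asynchronous).

```javascript
// Minimal plan-act-observe loop. `planner` stands in for the LLM call;
// here it is any function mapping (goal, history) to the next step.
function runAgent(goal, tools, planner, maxSteps = 10) {
  const history = [];
  for (let step = 0; step < maxSteps; step++) {
    const next = planner(goal, history);
    if (next.done) break;

    try {
      const tool = tools[next.tool];
      if (typeof tool !== "function") throw new Error(`unknown tool: ${next.tool}`);
      const result = tool(next.args);
      history.push({ tool: next.tool, args: next.args, ok: true, result });
    } catch (err) {
      // Failures are recorded rather than rethrown, so the planner sees
      // them on the next iteration and can choose a different step.
      history.push({ tool: next.tool, args: next.args, ok: false, error: err.message });
    }
  }
  return history;
}

// Hypothetical stub planner: try to analyze, fall back to transcribing
// when the transcript is missing, then analyze again and stop.
function stubPlanner(goal, history) {
  const last = history[history.length - 1];
  if (!last) return { tool: "analyze", args: { file: goal.file } };
  if (!last.ok && last.error === "transcript missing") {
    return { tool: "transcribe", args: { file: goal.file } };
  }
  if (last.ok && last.tool === "transcribe") {
    return { tool: "analyze", args: { file: goal.file } };
  }
  return { done: true };
}

// Hypothetical in-memory tools used only for this demonstration.
function makeStubTools() {
  let hasTranscript = false;
  return {
    transcribe: () => {
      hasTranscript = true;
      return "transcript.json";
    },
    analyze: () => {
      if (!hasTranscript) throw new Error("transcript missing");
      return [{ start: 222, duration: 45 }];
    },
  };
}

const trace = runAgent({ file: "tears_of_steel.mp4" }, makeStubTools(), stubPlanner);
console.log(trace.map((h) => `${h.tool}:${h.ok ? "ok" : "failed"}`).join(" -> "));
// prints "analyze:failed -> transcribe:ok -> analyze:ok"
```

The `maxSteps` bound is a design choice worth keeping in any real loop: a planner that keeps choosing a failing step would otherwise run forever.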
2. Tool Use & Composition
The agent has access to shell commands, file operations, and APIs. It composes these tools dynamically:
// Agent's tool calls (simplified)
exec("whisperx tears_of_steel.mp4 --output_format json")
read("tears_of_steel.json") // Parse transcript
// Agent reasons: "Timestamp 3:42 has high-energy dialogue"
exec("ffmpeg -ss 222 -t 45 -i tears_of_steel.mp4 clip_001.mp4")
exec("node add-captions-ass.js clip_001.mp4")
// Agent decides: "Clip looks good, sending for review"
message({ action: "send", buttons: [["✅ Approve", "❌ Reject"]] })
The agent doesn't need a pre-built pipeline script. It understands the goal and selects appropriate tools.
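As a worked example of the extraction step, the "3:42" timestamp in the trace maps to `-ss 222` because 3 × 60 + 42 = 222. A small sketch of that conversion, with the FFmpeg arguments built in the same order as the call above; `toSeconds` and `buildExtractArgs` are hypothetical helper names, not part of the pipeline described here.

```javascript
// Convert "ss", "m:ss", or "h:mm:ss" into a number of seconds.
function toSeconds(timestamp) {
  if (!/^\d+(:\d{1,2}){0,2}$/.test(timestamp)) {
    throw new Error(`invalid timestamp: ${timestamp}`);
  }
  return timestamp
    .split(":")
    .map(Number)
    .reduce((total, part) => total * 60 + part, 0);
}

// Build the argument list for an FFmpeg extraction:
// -ss (start offset), -t (duration), -i (input), then the output path.
function buildExtractArgs(input, startTimestamp, durationSeconds, output) {
  return [
    "-ss", String(toSeconds(startTimestamp)),
    "-t", String(durationSeconds),
    "-i", input,
    output,
  ];
}

const args = buildExtractArgs("tears_of_steel.mp4", "3:42", 45, "clip_001.mp4");
console.log(["ffmpeg", ...args].join(" "));
// prints "ffmpeg -ss 222 -t 45 -i tears_of_steel.mp4 clip_001.mp4"
```

Returning an argument array rather than a command string means it can be passed to something like Node's `child_process.execFile`, which sidesteps shell quoting issues with unusual filenames.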
3. Human-in-the-Loop as a Design Pattern
Full autonomy is dangerous for public-facing content. Bad titles, copyrighted clips, or low-quality extracts could hurt the channel. The solution: human-in-the-loop (HITL).
The agent operates autonomously until a quality-critical decision point:
               AUTONOMOUS                   HUMAN REVIEW    AUTONOMOUS
                    │                             │              │
                    ▼                             ▼              ▼
┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│ Analyze │───▶│ Extract │───▶│ Caption │───▶│ Review  │───▶│ Upload  │
│         │    │         │    │         │    │ (Human) │    │         │
└─────────┘    └─────────┘    └─────────┘    └─────────┘    └─────────┘
I receive clips via Telegram with inline buttons:
🎬 Review: Tears of Steel Clip #3
Title: "The Robot's Memory Hack" 🤖
[✅ Approve] [❌ Reject] [✏️ Edit]
One tap. The agent handles everything else.
This pattern — automate the tedious, gate the critical — is central to production agentic systems. The agent does 95% of the work; I provide the 5% that requires judgment.
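One way to make the gate explicit is a small state machine in which only a human action can move a clip out of review, and only approved clips are eligible for upload. The sketch below assumes hypothetical status names (`pending_review`, `approved`, `rejected`) and action names matching the three buttons; none of these come from the actual pipeline.

```javascript
// Allowed review actions per status. Anything not listed is rejected.
const TRANSITIONS = {
  pending_review: { approve: "approved", reject: "rejected", edit: "pending_review" },
};

const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Apply a reviewer's action to a clip and return the updated clip.
// An "edit" merges the reviewer's changes and keeps the clip in review.
function applyReview(clip, action, changes = {}) {
  if (!has(TRANSITIONS, clip.status) || !has(TRANSITIONS[clip.status], action)) {
    throw new Error(`cannot ${action} a clip in state ${clip.status}`);
  }
  const status = TRANSITIONS[clip.status][action];
  return { ...clip, ...(action === "edit" ? changes : {}), status };
}

// The upload step only ever sees clips a human has approved.
function uploadable(clips) {
  return clips.filter((clip) => clip.status === "approved");
}

const clip = { id: "clip_003", title: "The Robot's Memory Hack", status: "pending_review" };
const edited = applyReview(clip, "edit", { title: "Hacking a Robot's Memory" });
const approved = applyReview(edited, "approve");
console.log(uploadable([clip, approved]).map((c) => c.id));
// prints [ 'clip_003' ]
```

Because rejected and approved clips have no outgoing transitions, a stray second tap on an old message cannot silently flip a decision.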
4. Memory & State Management
Agents need memory to operate over time. Jarvis maintains:
- Session memory: current task context, what's been tried
- Persistent memory: MEMORY.md file with long-term learnings
- State files: upload-queue.json, pipeline-status.json
When I return hours later, Jarvis knows:
- Which clips are pending review
- What's been uploaded
- Rate limit status (6 uploads/hour)
- Any errors that need attention
# From Jarvis's MEMORY.md
## YouTube Pipeline Learnings
- Clips under 30s perform better
- Avoid extracting segments with music (copyright risk)
- Upload queue rate: 6/hour to avoid shadowbans
- Telegram review flow working well — 10-15 clips reviewed in ~3 min
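To illustrate how a state file can answer "what is pending?" and "how many uploads are left this hour?", here is a minimal sketch. It assumes the queue is an array of clip records with hypothetical `status` and `uploadedAt` (epoch milliseconds) fields; the real schema of upload-queue.json is not shown in this post. The 6-per-hour limit is the figure quoted above.

```javascript
const MAX_UPLOADS_PER_HOUR = 6; // rate limit quoted in the article
const HOUR_MS = 60 * 60 * 1000;

// Return the approved clips that may be uploaded right now without
// exceeding the hourly limit, using a sliding one-hour window.
// `queue` is the parsed content of a state file (schema assumed).
function nextUploads(queue, now = Date.now()) {
  const recent = queue.filter(
    (clip) => clip.status === "uploaded" && now - clip.uploadedAt < HOUR_MS
  ).length;
  const budget = Math.max(0, MAX_UPLOADS_PER_HOUR - recent);
  return queue.filter((clip) => clip.status === "approved").slice(0, budget);
}

// Count clips per status, e.g. for a "where are we?" summary.
function summarize(queue) {
  const counts = {};
  for (const clip of queue) {
    counts[clip.status] = (counts[clip.status] || 0) + 1;
  }
  return counts;
}

const now = Date.now();
const queue = [
  { id: "clip_001", status: "uploaded", uploadedAt: now - 10 * 60 * 1000 },
  { id: "clip_002", status: "approved" },
  { id: "clip_003", status: "pending_review" },
];
console.log(summarize(queue));
console.log(nextUploads(queue, now).map((clip) => clip.id));
```

Since the state lives in a plain JSON file, loading it is a `JSON.parse` of the file contents, and the same functions work whether the agent resumes after five minutes or five hours.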
5. Error Handling & Recovery
Traditional scripts crash on unexpected errors. Agentic systems reason about failures:
Error: YouTube API quota exceeded
Jarvis (reasoning):
- Upload failed due to quota
- I should mark this clip as "pending_retry"
- Check when quota resets... midnight Pacific Time
- Update the queue status
- Notify John that uploads are paused
- Set a reminder to retry tomorrow
The agent doesn't just log an error — it adapts its plan.
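The recovery steps in that trace (mark the clip, record when to retry, notify the human) can be sketched as a pure function over the queue. Everything here is an assumption for illustration: the `pending_retry` and `failed` statuses, the field names, and especially the quota check, which matches on the error message; a real client would inspect the API's structured error instead. The retry time is passed in by the caller rather than computed, since the reset boundary depends on the API.

```javascript
// Update the queue after a failed upload and produce a notification.
// Quota errors pause uploads until `retryAt` (a Date); other errors
// mark the clip as failed so a human can look at it.
function handleUploadError(queue, clipId, error, retryAt) {
  const isQuota = /quota/i.test(error.message); // assumed detection rule
  const updated = queue.map((clip) =>
    clip.id !== clipId
      ? clip
      : {
          ...clip,
          status: isQuota ? "pending_retry" : "failed",
          retryAt: isQuota ? retryAt.toISOString() : null,
          lastError: error.message,
        }
  );
  const notice = isQuota
    ? `Uploads paused until ${retryAt.toISOString()} (quota exceeded)`
    : `Upload of ${clipId} failed: ${error.message}`;
  return { queue: updated, notice, paused: isQuota };
}

// Clips whose retry time has passed and can be attempted again.
function dueForRetry(queue, now = new Date()) {
  return queue.filter(
    (clip) => clip.status === "pending_retry" && new Date(clip.retryAt) <= now
  );
}

const retryAt = new Date("2030-01-02T00:00:00Z"); // placeholder time
const result = handleUploadError(
  [{ id: "clip_001", status: "approved" }],
  "clip_001",
  new Error("YouTube API quota exceeded"),
  retryAt
);
console.log(result.notice);
console.log(dueForRetry(result.queue, new Date("2030-01-02T00:00:01Z")).length);
```

Keeping the original error text in `lastError` means the next session can see why the clip is waiting, not just that it is.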
Why This Matters
The Agent Advantage
| Traditional Automation | Agentic AI |
|------------------------|------------|
| Fixed scripts | Dynamic planning |
| Fails on edge cases | Adapts to failures |
| Manual error handling | Self-correcting |
| One-shot execution | Persistent operation |
| Requires developer intervention | Human-in-the-loop for quality |
Production Readiness
This isn't a demo. The pipeline has processed 100+ clips across multiple source videos with:
- Zero manual script intervention
- ~5 min total review time per batch
- Automatic retry on failures
- Rate limiting preventing platform issues
Lessons Learned
- Agents need clear tool boundaries. Don't give an agent raw exec without sandboxing. Scope tools to specific capabilities.
- Human-in-the-loop isn't a crutch; it's a feature. For content creation, human judgment at key points prevents costly mistakes.
- Memory is essential. Without persistent state, agents lose context and repeat work. File-based memory works surprisingly well.
- Natural language planning > rigid workflows. The agent's ability to reason in English about what to do next makes debugging trivial.
- Start autonomous, add gates. Build the fully automated version first, then identify where human review adds value.
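The first lesson, scoping tools instead of exposing raw exec, could look something like the following. This is a deliberately simple sketch under stated assumptions: `scopedExec`, the allowlist contents, and the workspace path are hypothetical, and the path check treats every non-flag, non-numeric argument as a potential path. It validates an invocation and returns it; it does not run anything, and it is not a substitute for OS-level sandboxing.

```javascript
const path = require("node:path");

// Commands the agent may run. Anything else is refused outright.
const ALLOWED_COMMANDS = new Set(["ffmpeg", "whisperx"]);

// Validate a command against the allowlist and make sure no argument
// resolves to a location outside the workspace directory.
function scopedExec(command, args, workspace) {
  if (!ALLOWED_COMMANDS.has(command)) {
    throw new Error(`command not allowed: ${command}`);
  }
  const root = path.resolve(workspace);
  for (const arg of args) {
    if (typeof arg !== "string") throw new Error("arguments must be strings");
    if (arg.startsWith("-")) continue; // option flag
    if (/^\d+(\.\d+)?$/.test(arg)) continue; // numeric value
    const resolved = path.resolve(root, arg);
    if (resolved !== root && !resolved.startsWith(root + path.sep)) {
      throw new Error(`path escapes workspace: ${arg}`);
    }
  }
  // Shaped for child_process.execFile(command, args, options):
  // no shell, and the working directory pinned to the workspace.
  return { command, args, options: { cwd: root, shell: false } };
}

const call = scopedExec(
  "ffmpeg",
  ["-ss", "222", "-t", "45", "-i", "tears_of_steel.mp4", "clip_001.mp4"],
  "/tmp/shorts-workspace" // hypothetical workspace directory
);
console.log(call.command, call.args.join(" "));
```

An allowlist plus a workspace check is a narrow first layer; running the tools under a separate user or inside a container would still be needed to contain a command that misbehaves on valid input.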
What's Next
- Feedback loops: Use YouTube analytics to teach the agent what clips perform well
- Multi-agent collaboration: Separate agents for analysis, editing, and distribution
- A/B testing: Agent generates title variants, learns from click-through rates
The future of content creation isn't "AI generates everything" — it's AI agents that handle the 95% that's tedious, with humans providing the 5% that requires taste.
Built with Clawdbot and Claude.

