Beyond Static Documents:
The Multimodal Shift

This project moves from simple prompt-response interactions toward multi-agent "Think → Plan → Act" workflows. It automates the conversion of research articles into narrated, pedagogical presentations without manual fine-tuning.

Logic Layer: Mistral 7B

Parsing: LayoutLMv3

Audio Synthesis: Suno Bark

Strategic Advantage

Modularity

Eight specialized agents provide "best-in-class" processing at every stage.

Grounded Reasoning

A FAISS-based vector DB reduces hallucinations by anchoring generated content to the PDF text.

Scalable Efficiency

The inference-only workflow requires no custom fine-tuning or specialized training.
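
The grounding idea can be illustrated with a minimal retrieval sketch. It is not the project's implementation: a toy bag-of-words vector and a brute-force cosine search stand in for a real embedding model and a FAISS index, and the names `embed`, `cosine`, and `retrieve` are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: a real pipeline would use a neural embedder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[tuple[int, float]]:
    """Return (chunk_id, score) for the k chunks most similar to the query.

    Brute-force search stands in for a FAISS index lookup here.
    """
    q = embed(query)
    scored = [(i, cosine(q, embed(c))) for i, c in enumerate(chunks)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical chunks extracted from a PDF.
chunks = [
    "transformers use self attention to model long range dependencies",
    "the dataset contains ten thousand labelled radiology reports",
    "training used the adam optimizer with a cosine learning rate schedule",
]
hits = retrieve("which optimizer was used for training", chunks, k=1)
```

The returned chunk ID is what a downstream agent would cite, so every generated sentence can be traced back to a passage in the source document.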

See It In Action

Watch a live demonstration of how the pipeline transforms research documents into narrated PowerPoint presentations.

Watch on YouTube

The "Think → Plan → Act" Architecture

Unlike simple OCR, this pipeline employs a cognitive loop. It deconstructs the document, plans a pedagogical narrative, and then synthesizes media.

🧠 THINK

Agent 1: Layout Analyzer (extracts bounding boxes)
Agent 2: Chunker (semantic segmentation)
Agent 3: Vector Indexer (FAISS database)

📝 PLAN

Agent 4: Narrative Planner (designs the flow)
Agent 5: Script Gen (writes the lecture)

⚡ ACT

Agent 6: Visual Builder (slide composition)
Agent 7: TTS Generator (audio synthesis)
Agent 8: PPTX Assembler (final compilation)
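
The stage ordering above can be sketched as a sequential pipeline over a shared state dictionary. This is an illustrative skeleton only: `make_agent`, `PIPELINE`, `run_pipeline`, and the state keys are hypothetical, and each agent is a placeholder rather than a real model call.

```python
from typing import Callable

Agent = Callable[[dict], dict]

def make_agent(name: str, output_key: str) -> Agent:
    """Build a placeholder agent that records its name and a dummy output."""
    def run(state: dict) -> dict:
        state = dict(state)  # copy so each step returns a fresh state
        state[output_key] = f"<{name} output>"
        state["trace"] = state.get("trace", []) + [name]
        return state
    return run

# Stage order follows the Think -> Plan -> Act layout; the keys are assumptions.
PIPELINE: dict[str, list[Agent]] = {
    "think": [
        make_agent("Layout Analyzer", "layout"),
        make_agent("Chunker", "chunks"),
        make_agent("Vector Indexer", "index"),
    ],
    "plan": [
        make_agent("Narrative Planner", "outline"),
        make_agent("Script Gen", "script"),
    ],
    "act": [
        make_agent("Visual Builder", "slides"),
        make_agent("TTS Generator", "audio"),
        make_agent("PPTX Assembler", "pptx"),
    ],
}

def run_pipeline(pdf_path: str) -> dict:
    """Run every agent in stage order, threading the state through."""
    state = {"pdf_path": pdf_path}
    for agents in PIPELINE.values():
        for agent in agents:
            state = agent(state)
    return state

result = run_pipeline("paper.pdf")
```

Passing one state object between agents keeps each node replaceable: swapping the TTS engine, for example, only touches the agent that writes the `audio` key.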

Inside the Pipeline: 8 Specialized Nodes

Each agent contributes one step in the transformation from raw pixels to pedagogical narration.

Technical Performance Insights

Data visualization highlighting the relative processing loads and semantic breakdown of the pipeline.

[Chart: Process Load Distribution]

Layout Awareness (Agent 1)

Preserving bounding boxes lets the system respect the document's original hierarchy, distinguishing primary headers from footnotes to maintain context.
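
As a rough illustration of why bounding boxes matter, the sketch below assigns a role to each text block from its position and font size. It is a simple heuristic standing in for a model-based classifier such as LayoutLMv3; the `Block` type, the thresholds, and the sample page are all assumptions.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class Block:
    text: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left
    font_size: float

def classify(blocks: list[Block], page_height: float) -> list[str]:
    """Label each block as header, footnote, or body using layout cues only."""
    body_size = median(b.font_size for b in blocks)
    roles = []
    for b in blocks:
        if b.font_size >= 1.3 * body_size:
            roles.append("header")      # noticeably larger than body text
        elif b.bbox[1] >= 0.9 * page_height and b.font_size < body_size:
            roles.append("footnote")    # small text in the bottom 10% of the page
        else:
            roles.append("body")
    return roles

# Hypothetical blocks from one US Letter page (792 pt tall).
blocks = [
    Block("1 Introduction", (72, 80, 300, 100), 16.0),
    Block("Large language models ...", (72, 110, 540, 400), 10.0),
    Block("We evaluate on ...", (72, 410, 540, 700), 10.0),
    Block("1 Corresponding author.", (72, 730, 400, 742), 8.0),
]
roles = classify(blocks, page_height=792)
```

Without the coordinates, the footnote and the section title would both look like short lines beginning with "1", which is the kind of ambiguity layout-aware parsing avoids.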

Inference-Only Scaling

No retraining or fine-tuning is required. The system can switch from academic texts to medical journals without modification, relying on Mistral 7B's base reasoning capabilities.

Pedagogical Constraints

Visuals follow the 5/12 rule (max 5 bullets, max 12 words), while the Script Agent adopts an "Academic Lecturer" persona to provide the depth.
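
The 5/12 rule can be enforced with a small post-processing step. The sketch below assumes the word limit applies per bullet and that over-long bullets are simply truncated; the actual pipeline may instead ask the model to rewrite them. `enforce_5_12` and the sample bullets are hypothetical.

```python
MAX_BULLETS = 5
MAX_WORDS = 12

def enforce_5_12(bullets: list[str]) -> list[str]:
    """Keep at most 5 bullets and cut each one to at most 12 words."""
    trimmed = []
    for bullet in bullets[:MAX_BULLETS]:
        words = bullet.split()
        if len(words) > MAX_WORDS:
            # Truncation is an assumption; a rewrite step would preserve more meaning.
            bullet = " ".join(words[:MAX_WORDS]) + "..."
        trimmed.append(bullet)
    return trimmed

draft = [
    "The proposed model combines layout aware parsing with retrieval grounded "
    "generation to produce narrated lecture slides automatically",
    "Eight agents run in sequence",
    "No fine-tuning is required",
    "Audio is synthesized per slide",
    "Slides link to narration files",
    "Logs record grounding IDs",
    "Output is a single PPTX",
]
slide = enforce_5_12(draft)
```

Keeping the slide text this short is what pushes the detail into the narration script rather than onto the slide itself.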

Pipeline Execution Results

The eight agents produce a standardized structure within the output/ directory.

output/slides/

The finalized .pptx file with text, layout, and linked narration audio ready for delivery.

MASTER PPTX GENERATED

output/audio/

Individual .wav and .mp4 files, optimized via dynamic range compression to prevent clipping.

24kHz PCM 16-BIT AUDIO
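
To make the audio step concrete, here is a minimal sketch of dynamic range compression followed by 24 kHz, 16-bit PCM encoding using only the Python standard library. The threshold, ratio, final hard clamp, and function names are assumptions, not the project's actual settings.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000  # matches the 24 kHz output format

def compress(samples: list[float], threshold: float = 0.5, ratio: float = 4.0) -> list[float]:
    """Reduce the level above the threshold by the given ratio, then clamp to [-1, 1]."""
    out = []
    for x in samples:
        mag = abs(x)
        if mag > threshold:
            mag = threshold + (mag - threshold) / ratio
        # The clamp is a safety net so the int16 conversion can never overflow.
        out.append(math.copysign(min(mag, 1.0), x))
    return out

def to_wav_bytes(samples: list[float]) -> bytes:
    """Encode float samples in [-1, 1] as mono 16-bit PCM WAV data."""
    pcm = struct.pack(f"<{len(samples)}h", *(int(s * 32767) for s in samples))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buf.getvalue()

# A 0.1 s, 440 Hz tone with amplitude 1.5, which would clip without compression.
raw = [1.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(2400)]
compressed = compress(raw)
data = to_wav_bytes(compressed)
```

With these example settings, a peak of 1.5 is brought down to 0.75, so the signal fits inside the 16-bit range instead of being flattened at the limit.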

output/metadata/

Execution JSON logs tracking agent dependencies and grounding IDs for full auditability.

LOG DATA PERSISTED
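
One possible shape for such a log is sketched below. The schema is hypothetical (the field names `depends_on` and `grounding_ids`, the sample IDs, and the simplified dependency chain are not taken from the project); it only shows how dependencies and grounding IDs together make an output traceable.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class AgentRecord:
    agent: str
    status: str
    depends_on: list[str] = field(default_factory=list)
    grounding_ids: list[int] = field(default_factory=list)  # IDs of retrieved chunks

# Hypothetical records for part of one run.
records = [
    AgentRecord("Layout Analyzer", "DONE"),
    AgentRecord("Chunker", "DONE", depends_on=["Layout Analyzer"]),
    AgentRecord("Vector Indexer", "DONE", depends_on=["Chunker"]),
    AgentRecord("Script Gen", "DONE", depends_on=["Vector Indexer"], grounding_ids=[12, 31]),
]

def upstream(agent: str, records: list[AgentRecord]) -> list[str]:
    """List an agent and everything it transitively depends on."""
    by_name = {r.agent: r for r in records}
    seen: list[str] = []
    stack = [agent]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.append(name)
        stack.extend(by_name[name].depends_on)
    return seen

log_json = json.dumps([asdict(r) for r in records], indent=2)
lineage = upstream("Script Gen", records)
```

An auditor could read `grounding_ids` to find the source passages behind a script and follow `depends_on` to see which earlier agents shaped it.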

Master Orchestrator Terminal

$ python "Agentic Systems/Master Orchestrator Agent/master_agent.py"

[*] Initializing Agentic Pipeline...

[*] Agent 1: Parsing PDF Layout [DONE]

[*] Agent 3: Vector Indexing Chunks [DONE]

[*] Agent 7: Synthesizing Audio with Suno Bark (Temp 0.7) [IN PROGRESS]

|############################----------| 75%

GitHub Repository

Access the complete source code and contribute to the project.

View on GitHub

The Team

Mahmoud Alyosify

Mirna Embaby