Scaling Security: Building an AI Analyst for ICICI Bank with Gemma 3 and vLLM

As a seasonal PoC sorcerer who likes to work on AI use cases from different domains, I recently completed a Proof of Concept (PoC) for ICICI Bank (well, that's what the other guy told me it was for) that demonstrated how AI can transform security monitoring from a manual, time-intensive process into an automated, intelligent system. This post details the engineering journey from a slow prototype to a high-speed production agent.

The Challenge: The 24-Hour Blindspot

Banks and financial institutions generate terabytes of screen recordings daily from Privileged Access Management (PAM) systems. These recordings capture every action performed by users with elevated privileges: database admins, system engineers, and support staff with access to sensitive data.

The problem? No human can watch 24 hours of video to identify suspicious behavior. Security teams face an impossible task: reviewing thousands of hours of footage to find the needle in the haystack—a malicious insider, a compromised account, or an accidental data leak.

The ICICI Bank PoC: AI That Watches So Humans Don’t Have To

We built a Human-in-the-Loop Security Analyst Agent that:

  • Automatically processes 24-hour screen recordings
  • Identifies suspicious activities (unauthorized PDF exports, database dumps, permission changes)
  • Filters thousands of frames down to critical minutes of high-risk activity
  • Generates structured reports for human analysts to verify

The key innovation: The AI doesn’t replace the security analyst—it amplifies their capabilities by doing the tedious watching, allowing them to focus on decision-making.

The Agentic Architecture (LangGraph & Workflow)

Our architecture was inspired by the AWS Machine Learning Blog on video security analysis for PAM, which outlined a robust framework for analyzing privileged access sessions.

The Pipeline

Input: 24-hour video file (screen recording from PAM system)

Process:

  1. Frame Extraction → Extract frames at 1 FPS (3600 frames per hour)
  2. Vision Agent Analysis → Each frame is analyzed by the vision model to detect UI elements:
    • SQL Server Management Studio (Administrator)
    • Export E-env (PDF) dialogs
    • File transfer notifications
    • Database query windows
    • Permission management interfaces
  3. Batch Summarization → Aggregate frame-level detections into event sequences
  4. Risk Assessment → Generate final security report with evidence and recommendations

Output: A structured JSON report with timestamped evidence and risk level
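Step 1 of the pipeline is straightforward with FFmpeg's `fps` filter. The sketch below builds the extraction command without hard-coding paths; the function names and JPEG quality setting are illustrative, and it assumes `ffmpeg` is available on the host.

```python
import subprocess
from pathlib import Path

def build_extraction_cmd(video_path: str, out_dir: str, fps: int = 1) -> list:
    """Build the FFmpeg command that samples frames at the given rate."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",   # sample one frame per second by default
        "-q:v", "2",           # high-quality JPEG output
        str(Path(out_dir) / "frame_%06d.jpg"),
    ]

def extract_frames(video_path: str, out_dir: str, fps: int = 1) -> None:
    """Run the extraction, creating the output directory if needed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_extraction_cmd(video_path, out_dir, fps), check=True)
```

At 1 FPS a 24-hour recording yields 86,400 numbered frames, which downstream nodes can process in timestamp order.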

We orchestrated this workflow using LangGraph, which allowed us to build a stateful, multi-step agent that could handle retries, conditional branching, and parallel processing.

# Simplified LangGraph workflow structure
from typing import TypedDict
from langgraph.graph import END, START, StateGraph

# Shared state passed between nodes
class AnalysisState(TypedDict):
    video_path: str
    frames: list
    detections: list
    events: list
    report: dict

workflow = StateGraph(AnalysisState)

# Define nodes (each node is a function that takes and returns the state)
workflow.add_node("extract_frames", extract_frames_from_video)
workflow.add_node("analyze_frames", analyze_frames_with_vision_model)
workflow.add_node("detect_events", aggregate_detections_into_events)
workflow.add_node("assess_risk", generate_security_assessment)

# Define edges
workflow.add_edge(START, "extract_frames")
workflow.add_edge("extract_frames", "analyze_frames")
workflow.add_edge("analyze_frames", "detect_events")
workflow.add_edge("detect_events", "assess_risk")
workflow.add_edge("assess_risk", END)

app = workflow.compile()

Engineering Deep Dive: The Need for Speed

This section chronicles the optimization journey that transformed our agent from a slow prototype to a production-ready system.

Phase 1 (The Bottleneck): Accuracy Over Speed

Initial Setup:

  • Model: Gemma 3 4B (accurate but slow)
  • Inference: llama.cpp on a single Nvidia T4 GPU
  • Performance: ~10-15 seconds per frame

The Problem: Processing a 24-hour video (86,400 frames at 1 FPS) would take 240+ hours. This was unacceptable for a production system.

What Worked: The model was highly accurate at detecting UI elements and reading text from screenshots.

What Didn’t: The throughput was too low. We needed to process videos faster than real-time to be useful.

Phase 2 (The Shift to vLLM): Unlocking Concurrency

We migrated from llama.cpp to vLLM, a high-throughput inference engine designed for LLMs. The following results are based on testing with a single GPU instance.

Key Technologies:

  • PagedAttention → Efficient memory management for concurrent requests
  • Continuous Batching → Dynamically batches incoming requests without waiting for all to complete
  • 50 Concurrent Requests per GPU → Process 50 frames simultaneously
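For reference, a vLLM deployment along these lines can be launched with the OpenAI-compatible server. The exact model ID and flags below are illustrative and may differ across vLLM versions; check `vllm serve --help` for your install.

```shell
# Serve Gemma 3 4B with an OpenAI-compatible API, allowing up to
# 50 concurrent sequences per scheduling step
vllm serve google/gemma-3-4b-it \
    --max-num-seqs 50 \
    --gpu-memory-utilization 0.90 \
    --port 8000
```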

Results (Single GPU Instance):

  • Throughput increased by 10x
  • Latency (TTFT) dropped to under 500ms
  • GPU utilization improved from 30% to 85%

This was the breakthrough moment—vLLM transformed our agent from a research prototype to a scalable system. With a cluster of GPUs, the system could scale linearly to meet even higher throughput demands.
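On the client side, keeping 50 requests in flight is what lets continuous batching do its work. A minimal sketch of the bounded-concurrency pattern we used, with a stand-in coroutine in place of the real inference call (function names are illustrative; swap `fake_analyze` for your vLLM client):

```python
import asyncio

async def analyze_frames_concurrently(frames, analyze_fn, max_concurrent=50):
    """Run analyze_fn over all frames, at most max_concurrent at a time."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(frame):
        async with sem:
            return await analyze_fn(frame)

    # gather preserves input order, so results line up with timestamps
    return await asyncio.gather(*(bounded(f) for f in frames))

# Stand-in for a real vLLM request (e.g. an OpenAI-compatible chat call)
async def fake_analyze(frame):
    await asyncio.sleep(0)  # yield control, as a network call would
    return {"frame": frame, "detections": []}

results = asyncio.run(analyze_frames_concurrently(range(100), fake_analyze))
print(len(results))  # 100
```

The semaphore caps in-flight requests so the server's batch stays full without being flooded.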

Phase 3 (Under the Hood - Gemma.cpp): Understanding the Model

To truly optimize our deployment, I studied the google/gemma.cpp repository to understand how the model’s layers were initialized and how the C++ kernels handled inference.

Key Learnings:

  • Weight Initialization Patterns → Understanding how Gemma’s attention layers are structured helped us predict memory requirements
  • Kernel Optimizations → The C++ kernels use SIMD instructions and loop unrolling for fast matrix multiplication
  • Context Length Handling → Gemma 3’s architecture allows for efficient long-context processing, critical for our frame analysis task

This deep dive informed our quantization strategy and batch size tuning to maximize throughput without sacrificing accuracy.

Phase 4 (The Gemma 3 4B Breakthrough): Small Model, Big Impact

After benchmarking multiple models, we chose Gemma 3 4B for production.

Why Gemma 3 4B?

  • Small enough to fit in memory with massive context (8K+ tokens per request)
  • Smart enough to read UI text, identify application names, and detect suspicious patterns
  • Fast enough to process 50 concurrent frames with sub-300ms latency

The decision was data-driven: while larger models (14B-32B) might be slightly more accurate, they couldn’t match the throughput required to process thousands of video frames in minutes.

Benchmark Results (The A100 Test)

We conducted extensive benchmarks on the Nvidia A100 40GB GPU using vLLM. These tests were performed on a single GPU instance, and the detailed results were cross-checked against this A100 40GB vLLM benchmark analysis, which aligned with our findings.

Performance Metrics

| Metric | Gemma 3-4B-IT | Qwen2.5-14B-Instruct | Gemma 3-12B-IT |
|---|---|---|---|
| Quantization | 16-bit | 16-bit | 16-bit |
| Concurrent Requests | 50 | 50 | 30 |
| Total Throughput | 3,976 tokens/s | 713 tokens/s | 477 tokens/s |
| Median TTFT | 234 ms | 648 ms | 458 ms |
| Output Rate | 3,385 tokens/s | 574 tokens/s | 401 tokens/s |

Key Takeaways:

  • Gemma 3-4B destroys larger models in throughput (5.5x faster than Qwen 2.5-14B)
  • Sub-300ms latency ensures near-instant frame analysis
  • A100 40GB is the sweet spot for models under 16B parameters

This performance made real-time video analysis feasible—we could process a 24-hour video in under 30 minutes.
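The back-of-envelope math checks out against the post's own numbers. The ~1 s per-frame end-to-end latency assumed below is illustrative (consistent with sub-300 ms TTFT plus decode time), not a measured figure:

```python
FRAMES = 24 * 60 * 60          # 86,400 frames at 1 FPS

# Phase 1: llama.cpp, one frame at a time at ~10 s per frame
serial_hours = FRAMES * 10 / 3600
print(serial_hours)            # 240.0 hours

# Phase 2+: vLLM with 50 concurrent requests, assuming ~1 s per frame
batched_minutes = FRAMES * 1 / 50 / 60
print(batched_minutes)         # 28.8 minutes
```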

Proof of Value: Catching the Hacker

In a test session, our agent successfully detected an unauthorized PDF export containing passwords. Here’s the actual output:

{
  "session_metadata": {
    "session_id": "67550054",
    "duration": "02:01:33",
    "user": "op_tsg6"
  },
  "events_detected": [
    {
      "event_type": "privileged_access_granted",
      "rationale": "The user is operating SQL Server Management Studio with '(Administrator)' privileges.",
      "evidence": "Microsoft SQL Server Management Studio (Administrator) - Window Title"
    },
    {
      "event_type": "exported_pdf",
      "rationale": "User explicitly selected 'Export E-env (PDF)' resulting in sensitive download.",
      "evidence": "PasswordsS2202534750PM.pdf - Browser download notification"
    },
    {
      "event_type": "file_transfer",
      "rationale": "Sensitive PDF downloaded to local Downloads folder.",
      "evidence": "C:/Users/op_tsg6/Downloads/PasswordsS2202534750PM.pdf"
    }
  ],
  "overall_risk_level": "HIGH"
}
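Because the report is structured JSON, downstream tooling can consume it directly. A hypothetical triage helper (the field names come from the sample output above; the `triage` function itself is an assumption, not part of the PoC):

```python
def triage(report: dict) -> list:
    """Return one evidence line per detected event for high-risk sessions."""
    if report.get("overall_risk_level") != "HIGH":
        return []
    return [
        f"{e['event_type']}: {e['evidence']}"
        for e in report.get("events_detected", [])
    ]

# Minimal report in the same shape as the agent's output
report = {
    "overall_risk_level": "HIGH",
    "events_detected": [
        {"event_type": "exported_pdf",
         "evidence": "PasswordsS2202534750PM.pdf - Browser download notification"},
    ],
}
for line in triage(report):
    print(line)
```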

The Impact: Instead of a security analyst watching 2+ hours of footage, they received a 30-second summary with exact timestamps and evidence. The agent reduced investigation time by 99.6%.

Conclusion

By combining Gemma 3 4B, vLLM, and LangGraph, we turned a 24-hour manual security review into an automated analysis that completes in under 30 minutes. This PoC for ICICI Bank demonstrates a critical insight:

Lightweight, optimized models can outperform larger models in specific, high-volume enterprise tasks.

The key lessons:

  1. Model size ≠ model performance → Task-specific optimization matters more than parameter count
  2. Inference infrastructure is critical → vLLM’s batching and PagedAttention unlocked 10x speedups
  3. Human-in-the-loop is the right paradigm → AI filters signal from noise; humans make final decisions

This project showcases how modern AI tools—when properly engineered—can solve real-world security challenges at scale.

Behind the Scenes: Learnings and Reflections

This whole blog was written by GitHub Copilot using Gemini 2.5 Pro, so basically it's AI slop. The AI was given context from a WhatsApp chat I had with someone about this project. So if you've read all this AI slop, thanks for wasting your time. The good news for me is that inference engineering is a real discipline in the AI world, one you need to learn if you want to scale systems; that's why every big tech company is building its own inference engine and inference compiler.

A second learning was about orchestrating the context engineering, which is a real challenge, as Andrej Karpathy has noted. The agent structure was mainly derived from the Amazon blog: I wrote sequential LangChain prompts wired into LangGraph nodes to process the pre-processed data (first a vision analyzer, then a frame summarization over its text output), which then flows through LangGraph to produce a single JSON output similar to the one in the Amazon blog.

Thirdly, I see plenty of room for improvement, such as using a single vision encoder for the first stage (instead of running the full VLM like Gemma and generating text twice in the loop). On the plus side, the whole thing was just a weekend project.


Tech Stack: Gemma 3 4B, vLLM, LangGraph, Nvidia A100 40GB, Python, FFmpeg

Interested in building AI-powered security systems or high-throughput LLM applications? Let’s connect and discuss!

This post is licensed under CC BY 4.0 by the author.