New Page

Scenario

Target: NovaTech

Assessment Type: Grey-box assessment focused on their AI infrastructure

Access: Network access to a segment hosting several AI-facing services

System

Attack Surface

The AI attack surface is not monolithic; it consists of multiple layers.

image.png

The Orchestration Layer coordinates how requests flow through the system. Frameworks like LangChain, LangGraph, CrewAI, and AutoGen manage prompt construction, context windows, and multi-step reasoning. Each framework has characteristic behaviors and error messages that can be fingerprinted through interaction. In practice, these frameworks often embed the components shown in the diagram: RAG logic, tool definitions, and inference client calls are typically integrated within the orchestration code rather than existing as separate services.

RAG: Fetches relevant context from vector databases before prompts are sent to the model. While often implemented within orchestration frameworks, RAG functionality has distinct, enumerable parameters.

Agent Tools: Expose capabilities through protocols like the Model Context Protocol (MCP) and include permission boundaries that can be tested.

External Integration: Connects the AI to databases, file systems, and other agents via protocols like Google's Agent-to-Agent (A2A).

Inference Server: Hosts the actual model and handles tokenization, generation, and response formatting. Examples include Ollama, vLLM, and TGI.

Reconnaissance Methods

Passive: 

Active:

Model Layer:

RAG:

Agent:

Infrastructure:

Passive Reconnaissance

HTTP Header
offsec@kali:~$ curl -s -I http://192.168.50.21/
HTTP/1.1 200 OK
Server: nginx/1.24.0 (Ubuntu)
...
X-Powered-By: NovaTech/2.1.0
X-AI-Backend: OpenAI-GPT5.2
X-RAG-Provider: ChromaDB

X-AI-Backend: Indicates that the backend is OpenAI's GPT-5.2 model.

X-RAG-Provider: Names ChromaDB as the vector DB for RAG functionality.
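
The header review above can be roughed out in code. This is a minimal sketch, not a definitive tool: the function name `ai_indicators` and the hint list are assumptions made for illustration, and real deployments may use entirely different header names.

```python
# Hypothetical hint tokens; extend the set as new naming patterns turn up.
AI_HEADER_TOKENS = {"ai", "rag", "llm", "model", "powered"}

def ai_indicators(raw_headers: str) -> dict[str, str]:
    """Return headers whose names contain an AI-related token."""
    found = {}
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # status line, blank line, or elided output
        # Compare whole dash-separated tokens so "Remaining" does not match "ai".
        tokens = set(name.strip().lower().split("-"))
        if tokens & AI_HEADER_TOKENS:
            found[name.strip()] = value.strip()
    return found

sample = """HTTP/1.1 200 OK
Server: nginx/1.24.0 (Ubuntu)
X-Powered-By: NovaTech/2.1.0
X-AI-Backend: OpenAI-GPT5.2
X-RAG-Provider: ChromaDB
"""
print(ai_indicators(sample))
```

Matching on whole tokens keeps ordinary headers such as rate-limit counters out of the result.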


API endpoint

Probe common paths such as /api/health and /api/status.

offsec@kali:~$ curl -s http://192.168.50.21/api/health | jq
{
  "mcp_enabled": true,
  "model": "gpt-5.2-turbo",
  "rag_enabled": false,
  "service": "novatech-customer-assistant",
  "status": "healthy",
...
  "version": "2.1.0"
}
offsec@kali:~$ curl -s -X POST http://192.168.50.21/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}' | jq
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Thank you for your inquiry. I'm the NovaTech Customer Assistant. This is a demonstration environment - the AI service is currently in maintenance mode. Please try again later or contact support.",
        "role": "assistant"
      }
    }
  ],
  "created": 1771625450,
...
  "model": "gpt-5.2-turbo",
  "object": "chat.completion",
  "rag_sources": null,
  "usage": {
    "completion_tokens": 50,
    "prompt_tokens": 297,
    "total_tokens": 347
  }
}
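
A response like the one above can be reduced to the fields that matter for recon. The sketch below assumes an OpenAI-style response body; `summarize_completion` is a hypothetical helper name. One thing worth noting from the sample: 297 prompt tokens for a one-word message suggests extra content (likely a system prompt) is prepended server-side, though that is an inference rather than a confirmed fact.

```python
import json

def summarize_completion(body: str) -> dict:
    """Pull model name, token usage, and RAG hints from an OpenAI-style response."""
    data = json.loads(body)
    usage = data.get("usage") or {}
    choices = data.get("choices") or []
    reply = choices[0]["message"]["content"] if choices else ""
    return {
        "model": data.get("model"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "rag_sources": data.get("rag_sources"),
        "reply_chars": len(reply),
    }

# Trimmed example body with the same shape as the response above.
example = json.dumps({
    "choices": [{"message": {"role": "assistant", "content": "Hi"}}],
    "model": "gpt-5.2-turbo",
    "rag_sources": None,
    "usage": {"completion_tokens": 50, "prompt_tokens": 297, "total_tokens": 347},
})
print(summarize_completion(example))
```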

Gathered information:

Git Repo Enumeration

Dependency Files like requirements.txt:

offsec@kali:~$ cat support-assistant/requirements.txt
# Project Aurora - AI Customer Support Assistant
# Cloud-based architecture using Google Gemini API

# Core LLM SDK
google-generativeai>=0.8.0

# Agent Framework
crewai>=0.41.0

# Vector Database Client
pinecone-client>=3.0.0

# Google Cloud AI Platform (embeddings)
google-cloud-aiplatform>=1.38.0

# Data Validation
pydantic>=2.0

# Web Framework
fastapi>=0.109.0
uvicorn>=0.27.0
...

offsec@kali:~$ cat code-reviewer/requirements.txt
# Project Phoenix - AI Code Review Bot
# Self-hosted architecture with GPU inference

# Inference Server
vllm>=0.6.0

# Agent Framework
pyautogen>=0.2.0

# Vector Database
pymilvus>=2.4.0

# Embeddings
sentence-transformers>=2.3.0

# Code-specific Embeddings
tree-sitter>=0.21.0
tree-sitter-python>=0.21.0
tree-sitter-javascript>=0.21.0
tree-sitter-go>=0.21.0

# Model Loading
huggingface-hub>=0.20.0
transformers>=4.38.0

# Quantization
autoawq>=0.2.0
...

Aurora (support-assistant) depends on google-generativeai (Gemini API), crewai (agent framework), and pinecone-client (managed vector database). These libraries act as client SDKs for external cloud services, suggesting that inference and vector storage are handled by third-party providers.

Phoenix (code-reviewer) depends on vllm (local inference server), pyautogen (agent framework), pymilvus (self-hosted vector database), and tree-sitter (AST parsing for code analysis). These dependencies operate entirely on local infrastructure, suggesting that both model inference and data storage are self-hosted.
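
The dependency reading above can be sketched as a small classifier. The `KNOWN` mapping is an assumption built only from the packages seen in these two files, and the cloud/self-hosted labels follow these notes rather than any authoritative source (pymilvus, for instance, can also talk to a managed service).

```python
import re
from collections import Counter

# Hypothetical lookup: package name -> (role, hosting hint or None).
KNOWN = {
    "google-generativeai": ("llm_sdk", "cloud"),
    "pinecone-client": ("vector_db", "cloud"),
    "vllm": ("inference_server", "self_hosted"),
    "pymilvus": ("vector_db", "self_hosted"),
    "crewai": ("agent_framework", None),
    "pyautogen": ("agent_framework", None),
}

def classify_requirements(text: str):
    """Map known packages to roles and guess the hosting model by majority."""
    roles, hosting = {}, Counter()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        match = re.match(r"[A-Za-z0-9_.\-]+", line)
        if not match:
            continue
        name = match.group(0).lower()
        if name in KNOWN:
            role, host = KNOWN[name]
            roles.setdefault(role, []).append(name)
            if host:
                hosting[host] += 1
    guess = hosting.most_common(1)[0][0] if hosting else "unknown"
    return roles, guess

aurora = "google-generativeai>=0.8.0\ncrewai>=0.41.0\npinecone-client>=3.0.0\n"
print(classify_requirements(aurora))
```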

Attack Surface:

RAG Configuration
offsec@kali:~$ cat support-assistant/config/rag.yaml
# RAG Pipeline Configuration for Aurora Support Assistant
# Optimized for customer documentation search

chunking:
  strategy: "text"
  chunk_size: 512
  chunk_overlap: 100
  separator: "\n\n"

embeddings:
  provider: "google"
  model: "text-embedding-004"
  dimensions: 768
  batch_size: 100

retrieval:
  top_k: 5
  score_threshold: 0.75
  distance_metric: "cosine"
  rerank: false

vector_store:
  provider: "pinecone"
  environment: "${PINECONE_ENVIRONMENT}"
  index_name: "${PINECONE_INDEX_NAME}"
  namespace: "customer-docs"
  metric: "cosine"
  pod_type: "p1.x1"
...

Aurora uses standard text chunking (512 characters) with Google's text-embedding-004 model.
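
To make the chunking parameters concrete, here is a minimal sketch of fixed-size chunking with overlap, using Aurora's values (512 and 100) and treating sizes as characters, as these notes do. It ignores the `separator` setting, and `chunk_text` is a hypothetical name; Aurora's real splitter may behave differently.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

document = "".join(chr(97 + i % 26) for i in range(1000))
pieces = chunk_text(document)
print([len(p) for p in pieces])
```

Because consecutive windows overlap, text near a chunk boundary ends up in two chunks, which matters when reasoning about which stored content a query can retrieve.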

offsec@kali:~$ cat code-reviewer/config/rag.yaml
# RAG Pipeline Configuration for Phoenix Code Reviewer
# Optimized for code semantic search

chunking:
  strategy: "ast_aware"
  chunk_size: 1500
  chunk_overlap: 300
  languages:
    - python
    - javascript
    - typescript
    - go
    - java
    - rust

code_parsing:
  extract_functions: true
  extract_classes: true
  extract_imports: true
  include_docstrings: true
  include_comments: true
  preserve_structure: true

embeddings:
  provider: "huggingface"
  model: "Salesforce/codet5p-110m-embedding"
  dimensions: 256
  batch_size: 32
  device: "cuda:0"
  normalize: true

reranker:
  enabled: true
  model: "BAAI/bge-reranker-base"
  top_k_rerank: 20

retrieval:
  top_k: 10
  score_threshold: 0.65
  distance_metric: "IP"

vector_store:
  provider: "milvus"
  host: "${MILVUS_HOST}"
  port: "${MILVUS_PORT}"
  collection_name: "${MILVUS_COLLECTION}"
  index_type: "IVF_FLAT"
  index_params:
    nlist: 1024
  search_params:
    nprobe: 16
  consistency_level: "Eventually"
...

Phoenix uses AST-aware chunking (1500 characters) with a code-specific embedding model (codet5p-110m). The larger chunk size and code-aware strategy indicate that Phoenix processes source code, not documents.
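
Phoenix's config combines `normalize: true` with `distance_metric: "IP"`. For unit-length vectors the inner product equals cosine similarity, so the two RAG configs score results comparably. The sketch below illustrates that, plus how `top_k` and `score_threshold` might filter results; the `retrieve` function is an illustrative assumption, not Phoenix's actual retrieval code.

```python
import math

def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, docs, top_k=10, score_threshold=0.65):
    """Score docs by inner product on normalized vectors, then filter and rank."""
    q = normalize(query)
    scored = [(inner_product(q, normalize(vec)), doc_id) for doc_id, vec in docs.items()]
    kept = sorted((pair for pair in scored if pair[0] >= score_threshold), reverse=True)
    return [(doc_id, score) for score, doc_id in kept[:top_k]]

docs = {"a": [2.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 3.0]}
print(retrieve([1.0, 0.0], docs))
```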

Agent Tools

Tool definitions reveal the agent's capabilities and the framework in use.

offsec@kali:~$ cat support-assistant/src/agents/tools.py
...
from crewai import tool

@tool
def knowledge_search(query: str, department: str) -> str:
    """
    Search customer documentation and knowledge base.
...

@tool
def ticket_lookup(ticket_id: str) -> dict:
    """
    Look up support ticket status and history.
...

@tool
def escalate_ticket(ticket_id: str, reason: str) -> str:
    """Escalate ticket to human agent (create only)"""
...

Aurora's tools handle customer support workflows.

offsec@kali:~$ cat code-reviewer/prompts/function_schemas.json
{
  "functions": [
    {
      "name": "search_codebase",
      "description": "Semantic search across indexed repositories",
      "parameters": {"query": "string", "language": "string", "max_results": "int"}
    },
    {
      "name": "post_review_comment",
      "description": "Post inline comment on merge request",
      "parameters": {"repo": "string", "mr_id": "int", "file_path": "string", ...}
    },
    {
      "name": "run_security_scan",
      "description": "Run SAST security scanner on code (read-only)",
      ...
    }
  ]
}

Phoenix's functions enable code review workflows.
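
A quick triage of a function schema can separate tools that likely change state from those that only read. This is a rough heuristic sketch: the verb list and the `triage_functions` name are assumptions, and a "read-only" note in a description is a claim to verify, not a guarantee. The sample schema is a trimmed, hypothetical copy of the file above with the elided fields left out.

```python
# Hypothetical verb list; the first word of the function name is treated as its verb.
WRITE_VERBS = {"post", "create", "update", "delete", "write", "escalate", "send", "run"}

def triage_functions(schema: dict) -> list[tuple[str, str]]:
    """Label each function by whether it probably modifies state."""
    results = []
    for fn in schema.get("functions", []):
        name = fn.get("name", "")
        description = fn.get("description", "").lower()
        verb = name.split("_", 1)[0].lower()
        if "read-only" in description:
            label = "claims read-only"
        elif verb in WRITE_VERBS:
            label = "likely state-changing"
        else:
            label = "likely read"
        results.append((name, label))
    return results

schema = {"functions": [
    {"name": "search_codebase", "description": "Semantic search across indexed repositories"},
    {"name": "post_review_comment", "description": "Post inline comment on merge request"},
    {"name": "run_security_scan", "description": "Run SAST security scanner on code (read-only)"},
]}
for name, label in triage_functions(schema):
    print(f"{name}: {label}")
```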

System Prompts and Guardrails

System prompts and guardrail configs define the AI's persona, capabilities, and restrictions.

offsec@kali:~$ cat support-assistant/prompts/system.txt
You are Aurora, NovaTech's official customer support AI assistant.

## Your Role
You help customers with:
- Product questions and feature explanations
- Troubleshooting common issues
- Support ticket creation and status updates
...

## Restrictions - DO NOT:
- Promise features that are not yet released
- Discuss pricing, discounts, or negotiate contracts
- Share internal documentation or employee information
- Compare NovaTech products to competitors
- Discuss security vulnerabilities or ongoing incidents
...


offsec@kali:~$ cat support-assistant/config/safety.yaml
# Safety and Guardrails Configuration for Aurora

# Gemini API Safety Settings
gemini_safety_settings:
  HARM_CATEGORY_HARASSMENT: "BLOCK_MEDIUM_AND_ABOVE"
  HARM_CATEGORY_DANGEROUS_CONTENT: "BLOCK_MEDIUM_AND_ABOVE"
...
blocked_topics:
  - "competitor comparisons"
  - "internal roadmap"
  - "security vulnerabilities"
  - "acquisition plans"
...

Supply Chain
offsec@kali:~$ cat support-assistant/.env.example
# Google Gemini API
GOOGLE_API_KEY=your_google_api_key_here
GOOGLE_PROJECT_ID=novatech-prod
GOOGLE_REGION=us-central1

# Pinecone Vector Database
PINECONE_API_KEY=your_pinecone_api_key_here
PINECONE_ENVIRONMENT=gcp-starter
PINECONE_INDEX_NAME=aurora-prod

# Application Settings
LOG_LEVEL=INFO
MAX_CONVERSATION_TURNS=50
ENABLE_PII_DETECTION=true

# Slack Integration (for escalations)
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx/yyy/zzz
SLACK_CHANNEL=#support-escalations


......
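
Example env files often mix placeholders with real values such as project IDs and index names. The sketch below separates the two; the placeholder markers and the `concrete_values` name are assumptions, and the output still needs manual review because harmless defaults (log levels, limits) are kept as well.

```python
# Hypothetical substrings that usually mark a placeholder value.
PLACEHOLDER_MARKERS = ("your_", "xxx", "changeme", "<")

def concrete_values(env_text: str) -> dict[str, str]:
    """Return KEY=value pairs whose value does not look like a placeholder."""
    found = {}
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        value = value.strip()
        if value and not any(marker in value.lower() for marker in PLACEHOLDER_MARKERS):
            found[key.strip()] = value
    return found

env_example = """# Google Gemini API
GOOGLE_API_KEY=your_google_api_key_here
GOOGLE_PROJECT_ID=novatech-prod
PINECONE_INDEX_NAME=aurora-prod
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx/yyy/zzz
SLACK_CHANNEL=#support-escalations
"""
print(concrete_values(env_example))
```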

offsec@kali:~$ cat code-reviewer/config/models.yaml
# Model Configuration for Phoenix Code Reviewer

inference_server:
  type: "vllm"
  host: "${VLLM_HOST}"
  port: "${VLLM_PORT}"
...
  gpu_memory_utilization: 0.92
  max_model_len: 32768
  tensor_parallel_size: 2
...

primary_model:
  source: "huggingface" 
  model_id: "Qwen/Qwen2.5-Coder-32B-Instruct"
...
  quantization:
    enabled: true
    method: "AWQ"
    bits: 4
    group_size: 128
...
  recommended_gpu: "2x NVIDIA A100 80GB"
...

Active Reconnaissance

AI capabilities in modern applications are typically embedded within web applications rather than exposed on dedicated ports. As a result, traditional port-based scanning is often insufficient. Instead, we probe for AI indicators in JavaScript configurations, API endpoints, and HTTP headers.

Client-side JavaScript exposes the API endpoint.

offsec@kali:~$ curl -s http://192.168.50.31/js/chat-widget.js
// NovaTech Chat Widget Configuration
// Internal Use Only - Do Not Distribute
(function() {
    window.__NOVATECH_CONFIG__ = {
        apiBase: "/api/v2",
        assistantEndpoint: "/api/v2/assistant",
        featureFlags: {
            enableAI: true,
            debugMode: false,
            legacySupport: true
        },
        timeout: 30000
    };
    console.log("NovaTech Helpdesk Widget Initialized");
})();
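
Pulling endpoint candidates out of a JavaScript file can be scripted. This is a minimal sketch: the regex only catches quoted paths starting with /api or a version prefix like /v1, which is an assumption that fits this widget but will miss other naming schemes.

```python
import re

# Quoted strings that start with /api or /v<digit>; pattern is illustrative only.
PATH_RE = re.compile(r"""["'](/(?:api|v\d+)[^"'\s]*)["']""")

def extract_endpoints(js_source: str) -> list[str]:
    """Return unique API-looking paths in order of first appearance."""
    seen = []
    for path in PATH_RE.findall(js_source):
        if path not in seen:
            seen.append(path)
    return seen

widget_js = '''
window.__NOVATECH_CONFIG__ = {
    apiBase: "/api/v2",
    assistantEndpoint: "/api/v2/assistant",
    timeout: 30000
};
'''
print(extract_endpoints(widget_js))
```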

The response reveals the provider, model, token counts, and timing metadata.

offsec@kali:~$ curl -s -X POST http://192.168.50.31/api/v2/assistant \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello"}' | jq
{
  "content": "How can I assist you today?",
  "metadata": {
    "provider": "ollama",
    "model": "llama3.2:1b",
    "latency_ms": 418,
    "created_at": "2026-02-20T22:44:55.365061897Z",
    "done": true,
    "done_reason": "stop",
    "load_duration": 176895153,
    "prompt_eval_count": 26,
    "prompt_eval_duration": 40254381,
    "eval_count": 8,
    "eval_duration": 195446762
  }
}
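
The timing fields in the metadata above follow Ollama's convention of reporting durations in nanoseconds, so throughput can be derived directly. The sketch below does that arithmetic; `ollama_throughput` is a hypothetical helper, and it assumes the application passes Ollama's fields through unmodified.

```python
NS_PER_SECOND = 1_000_000_000

def ollama_throughput(metadata: dict) -> dict:
    """Convert Ollama-style nanosecond timings into tokens per second."""
    generation_s = metadata["eval_duration"] / NS_PER_SECOND
    prompt_s = metadata["prompt_eval_duration"] / NS_PER_SECOND
    return {
        "generation_tokens_per_s": metadata["eval_count"] / generation_s,
        "prompt_tokens_per_s": metadata["prompt_eval_count"] / prompt_s,
        "load_s": metadata["load_duration"] / NS_PER_SECOND,
    }

metadata = {
    "load_duration": 176895153,
    "prompt_eval_count": 26,
    "prompt_eval_duration": 40254381,
    "eval_count": 8,
    "eval_duration": 195446762,
}
stats = ollama_throughput(metadata)
print({key: round(value, 2) for key, value in stats.items()})
```

Throughput figures like these hint at model size and hardware, which can help cross-check the reported model name.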

Response headers identify the API gateway.

offsec@kali:~$ curl -sI http://192.168.50.31:8000/v1/billing
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 39
Connection: keep-alive
RateLimit-Reset: 23
X-RateLimit-Remaining-Minute: 59
X-RateLimit-Limit-Minute: 60
RateLimit-Remaining: 59
RateLimit-Limit: 60
...
Server: kong/3.9.1
X-Kong-Upstream-Latency: 10
X-Kong-Proxy-Latency: 37
Via: 1.1 kong/3.9.1
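
The gateway details in these headers can be collected into a small profile, which also helps pace later probing under the rate limit. This is a sketch under the assumption that the header names match the output above; `gateway_profile` is a hypothetical helper.

```python
def parse_headers(raw: str) -> dict[str, str]:
    """Parse raw response headers into a dict with lowercase names."""
    headers = {}
    for line in raw.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers

def gateway_profile(raw: str) -> dict:
    """Extract gateway product, version, and rate-limit state."""
    headers = parse_headers(raw)
    product, _, version = headers.get("server", "").partition("/")

    def as_int(key: str):
        value = headers.get(key)
        return int(value) if value and value.isdigit() else None

    return {
        "gateway": product or None,
        "version": version or None,
        "limit_per_minute": as_int("x-ratelimit-limit-minute"),
        "remaining": as_int("ratelimit-remaining"),
        "reset_seconds": as_int("ratelimit-reset"),
    }

raw = """HTTP/1.1 200 OK
RateLimit-Reset: 23
X-RateLimit-Limit-Minute: 60
RateLimit-Remaining: 59
Server: kong/3.9.1
Via: 1.1 kong/3.9.1
"""
print(gateway_profile(raw))
```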

 

Model Identity

Identifying the underlying model reveals potential attack vectors because different model families exhibit distinct behaviors and weaknesses. However, models can be configured with system prompts, guardrails, and temperature settings that mask their identity.

Methods:

  1. Direct Identity Probing - Ask the model about its identity
  2. Contradiction Testing - Probe with false assertions to reveal training biases
  3. Context Window Testing - Measure memory limits through marker injection

Knowledge cutoff can be a reliable fingerprinting technique.

 

Model-Specific Behavior:

Capability Boundary Mapping: Evaluate a model's practical limits. Parameter count is a major factor.

Context Window Test: Inject a unique marker followed by increasing amounts of filler text, then ask the model to recall the marker; the point at which recall fails approximates the effective context window. Deployment tools like Ollama can configure context window size, so a measured limit may reflect a deployment setting rather than a model constraint. In practice, however, most operators leave the default in place, and a mismatch between two endpoints still narrows the list of candidate models. As with all fingerprinting signals, context window results are most reliable when combined with other techniques.
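
The marker-injection idea can be sketched as a binary search over filler size. Everything here is an assumption made for illustration: `ask` stands in for whatever function sends a prompt to the target and returns its reply, the marker and filler word are arbitrary, and the search assumes recall fails monotonically as filler grows. The limit is measured in filler words, which only approximates tokens.

```python
from typing import Callable, Optional

def build_probe(marker: str, filler_words: int) -> str:
    """Place the marker first, pad with filler, then ask for the marker back."""
    filler = " ".join(["lorem"] * filler_words)
    return f"Remember this code: {marker}.\n{filler}\nWhat was the code?"

def find_recall_limit(ask: Callable[[str], str], low: int = 0,
                      high: int = 200_000, marker: str = "ZX-4471") -> Optional[int]:
    """Return the largest filler size at which the marker is still recalled."""
    if marker not in ask(build_probe(marker, low)):
        return None  # the model cannot recall the marker even at the smallest size
    while low < high:
        mid = (low + high + 1) // 2
        if marker in ask(build_probe(marker, mid)):
            low = mid       # still recalled; try more filler
        else:
            high = mid - 1  # forgotten; try less filler
    return low

def fake_ask(prompt: str) -> str:
    """Stand-in model that forgets the marker once the prompt exceeds 1000 words."""
    return "The code was ZX-4471" if len(prompt.split()) <= 1000 else "I don't know."

print(find_recall_limit(fake_ask, high=5000))
```

Against a live endpoint, each probe costs one request, so the search needs roughly log2(high) calls and should respect any rate limit observed earlier.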

image.png

The knowledge cutoff is identical (early 2024), but context window size and behavioral patterns clearly differentiate the models. Identity probing provided direct confirmation in this case, though models can be configured to give misleading responses.


Revision #3
Created 28 May 2026 14:56:18 by winslow
Updated 28 May 2026 16:56:52 by winslow