For Platform & Infrastructure Teams

Stop Being the Bottleneck for Every Codebase Question

Your platform team shouldn't be a human search engine. Probe gives every engineer self-serve access to infrastructure knowledge, dependency maps, and operational context.

On-prem · Any LLM · OpenTelemetry · Open source

Free and open source. Business and Enterprise plans available.

What changes for your platform team
Before
App teams Slack you for every infra question
Docs are stale before the PR merges
Incident investigations start from scratch every time
After
Engineers self-serve from the actual source code
Platform knowledge lives in code, not wikis
Full service dependency context in seconds
Enterprise-ready by design
On-prem · Any LLM · OpenTelemetry

Why platform teams choose Probe

Cross-Service Visibility

Map dependencies across all repos and services. Know what breaks when you change something before you push.

Automated Quality Gates

Enforce platform standards on every PR. Catch Terraform misconfigurations, missing health checks, and security issues automatically.

Reduce Support Load

Stop answering the same questions. App teams query the codebase directly instead of filing tickets against your team.

Incident Intelligence

Correlate alerts with recent deploys, config changes, and dependency updates. Cut MTTR by giving on-call real context.

The Real Problem

You built the platform. Now you're stuck supporting it.

Platform teams build infrastructure so app teams can move fast. But "move fast" turns into "Slack the platform team for everything" the moment something isn't documented.

01

The Support Tax

Your team spends 40% of its time answering questions that are already answered in code. "How does the deploy pipeline work?" "Which env vars does service X need?" "Why does this Terraform module exist?" The answers are there. Nobody can find them.

02

The Documentation Treadmill

You write docs. Infrastructure changes. Docs go stale. App teams follow outdated runbooks and break things. You spend time fixing what they broke and updating docs they won't read. The cycle repeats every sprint.

03

The Scaling Wall

Org adds 3 new teams. Each needs onboarding to your platform. Each has different tech stacks and requirements. Your 4-person platform team is now supporting 12 app teams, and the Jira board is nothing but "help me deploy" tickets.

What platform teams actually ask Probe

Real prompts from platform engineers. Infrastructure questions, dependency mapping, incident investigation, and automation across the entire stack.

Question
How does the deployment pipeline work for services in the payments namespace? Walk me through from git push to production.
Deployment flow: 1. Push triggers GitHub Actions (.github/workflows/deploy.yml) 2. Runs tests, builds Docker image, pushes to ECR 3. ArgoCD syncs the Helm chart in infra/charts/payments/ 4. Canary rollout via Istio VirtualService (10% -> 50% -> 100% over 15 min) 5. Datadog monitors error rate during rollout, auto-rollback if >1%...
Investigation
The order-service is hitting connection pool exhaustion on the primary database. What other services share this connection pool and what changed recently?
3 services share the primary pool: order-service, inventory-service, and billing-service. Connection limit is 100 (set in terraform/modules/rds/main.tf). PR #1847 merged 2 days ago added a new batch query to inventory-service that runs every 30s and holds connections for ~5s. This likely pushed the shared pool over capacity. Recommend: increase pool to 150 or move inventory batch to read replica.
Question
Which services depend on the auth-gateway? If we upgrade it to v3, what consumers need to be updated?
12 services depend on auth-gateway. Direct consumers (import the client SDK): user-service, admin-panel, api-gateway, billing-service. Indirect consumers (via service mesh): 8 services that route through the API gateway. v3 breaking changes: JWT claim format changed (affects 4 direct consumers), rate limit headers renamed (affects api-gateway only). Migration guide in auth-gateway/docs/v3-migration.md.
Question
What Kubernetes resource limits are set for services in the prod namespace? Which ones are over-provisioned based on actual usage?
Scanned 18 services in prod. Over-provisioned: analytics-worker (requests 4Gi, P95 usage 800Mi), report-generator (requests 2 CPU, P95 usage 0.3 CPU), notification-service (requests 1Gi, P95 usage 200Mi). Total waste: ~8Gi memory, 3.4 CPU. Under-provisioned: search-indexer hitting OOMKills (limit 512Mi, needs 1Gi). Config locations: k8s/overlays/prod/*/resources.yaml.
Investigation
Latency on the checkout flow spiked after yesterday's deploy. Trace the request path and tell me what changed.
Checkout request path: nginx-ingress -> api-gateway -> cart-service -> inventory-service -> payment-processor -> order-service. Yesterday's deploy: PR #2103 added a synchronous call from cart-service to a new fraud-check endpoint. This adds 200-400ms per request. The fraud-check service has no connection pooling configured. Recommend: make fraud check async or add connection pooling in fraud-check/config/database.yml.
Question
Show me all Terraform modules we maintain, their consumers, and which ones haven't been updated in 6+ months.
14 Terraform modules in terraform/modules/. Stale (6+ months): rds-legacy (last updated 8 months ago, used by 2 services), vpc-peering (10 months, used by 1 service), lambda-cron (7 months, used by 4 services). Most used: ecs-service (12 consumers), rds-cluster (8 consumers), s3-bucket (11 consumers). rds-legacy still uses provider 3.x -- needs upgrade to 5.x for security patches.
Automation
@Probe every morning: check all services for missing health checks, liveness probes, or readiness probes. Post findings to #platform-alerts.
Morning audit complete. 3 services missing readiness probes: pdf-generator, legacy-importer, analytics-ingest. 1 service with health check returning 200 but not checking dependencies: notification-worker (doesn't verify Redis connection). Created PLAT-441 through PLAT-444 with per-service remediation. Posted summary to #platform-alerts.
Automation
On every PR to terraform/: validate the plan, check for security group changes that open ports to 0.0.0.0/0, and flag any IAM policy changes
Reviewing PR #892 to terraform/modules/ecs-service. Found: 1 security group rule opening port 8080 to 0.0.0.0/0 (should be VPC CIDR only). 1 IAM policy adding s3:* permissions (too broad -- should be scoped to specific bucket). No state drift detected. Posted inline comments with suggested fixes and linked to our security policy doc.
Automation
When a new service repo is created, validate it has: Dockerfile, health check endpoint, structured logging, graceful shutdown, and Helm chart. Open issues for anything missing.
New repo detected: recommendation-engine. Audit results: Dockerfile present (passes lint). Health check endpoint: MISSING (no /healthz route). Structured logging: using fmt.Println instead of structured logger (3 files). Graceful shutdown: MISSING (no signal handler). Helm chart: present but missing resource limits. Created 4 issues with templates and examples from our platform standards repo.
Automation
Nightly: scan all Dockerfiles for base images with known CVEs. If critical, open a PR to bump the image and notify the owning team.
Scanned 24 Dockerfiles. 3 critical CVEs found: user-service using node:18.12 (CVE-2023-44487 HTTP/2 rapid reset), billing-service using golang:1.20 (CVE-2023-39325), analytics-ingest using python:3.9.7 (CVE-2023-40217). Opened PRs #903, #904, #905 with bumped base images. Tests passing on #903 and #904. #905 needs manual verification -- pinged @data-team in Slack.
Investigation
We're seeing intermittent 503s on the API gateway. Correlate with recent config changes, deploys, and any infrastructure events in the last 24h.
Timeline: 14:00 -- Istio sidecar injection config updated (PR #871). 14:15 -- first 503s appear. 14:30 -- notification-service deploy rolled out new version. 503s correlate with Istio config change, not the deploy. Root cause: mTLS strict mode enabled but notification-service hasn't rotated its certificates. Fix: either rollback Istio config or force cert rotation on notification-service. Rollback command attached.
Automation
Generate a platform adoption report: which teams are using our standard Helm chart vs custom deployments, who's on the latest platform SDK, and who's behind.
Platform adoption across 12 teams. Standard Helm chart: 9/12 teams (75%). Custom deployments: data-pipeline (legacy Nomad), ml-platform (custom K8s operators), analytics (bare EC2). Platform SDK v3.x (current): 7/12. SDK v2.x (1 major behind): 4/12. SDK v1.x (unsupported): analytics team only. Migration blockers: data-pipeline team has no K8s experience, ml-platform has custom GPU scheduling needs. Report posted to #platform-metrics.

Three things that change everything

01

Infrastructure-Aware Code Intelligence

Ask any question about any service, any Terraform module, any Helm chart, and get answers grounded in actual code.

  • "How does traffic route from the load balancer to service X?"
  • "What environment variables does the payment-service need to boot?"
  • "Which services would break if we upgrade the Redis cluster?"
  • "Show me the full request path for the /api/v2/orders endpoint."

Probe reads infrastructure-as-code the same way it reads application code. It understands Terraform resources, Kubernetes manifests, Helm values, Docker Compose files, and CI/CD pipelines. It connects the dots between application code and the infrastructure that runs it.

  • Multi-repo architecture -- Query across app repos, infra repos, and shared libraries in one request
  • IaC-aware -- Understands Terraform, Kubernetes YAML, Helm charts, and Dockerfiles natively
  • Dependency graphing -- Maps service-to-service, service-to-infrastructure, and shared resource dependencies
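To make the "env vars" question above concrete: the answer usually lives in a Helm template rather than a wiki. A minimal sketch of the kind of source Probe reads to answer it -- the chart, service name, and values below are hypothetical, not taken from any real repo:

```yaml
# Hypothetical excerpt from charts/payment-service/templates/deployment.yaml --
# the kind of file that actually answers "what env vars does payment-service need to boot?"
spec:
  template:
    spec:
      containers:
        - name: payment-service
          env:
            - name: DB_HOST
              value: "{{ .Values.database.host }}"   # resolved from the chart's values.yaml
            - name: STRIPE_API_KEY
              valueFrom:
                secretKeyRef:
                  name: payment-secrets               # provisioned elsewhere in the infra repo
                  key: stripe-api-key
```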
02

Automated Platform Compliance

Every PR reviewed against your platform standards. Every new service validated against your golden path.

Generic code review tools don't understand your platform. They don't know that every service needs a /healthz endpoint, that Terraform modules must tag resources with a cost center, or that Dockerfiles should use your internal base images. They can't tell you that a Helm chart is missing resource limits.

Probe learns your platform standards and enforces them on every change. It checks Terraform plans for security misconfigurations, validates Kubernetes manifests against your policies, and catches breaking changes to shared infrastructure before they hit production.

  • Custom platform rules -- Define and enforce standards for infra, Dockerfiles, Helm charts, and CI configs
  • Security scanning -- Catch overly permissive IAM policies, open security groups, and unencrypted resources
  • Breaking change detection -- Know when a shared module change will affect downstream consumers
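To make "golden path" concrete, here is the sort of baseline a compliance rule could check on every Deployment -- a minimal sketch with hypothetical names, assuming your standard is a /healthz endpoint, probes, an internal base image, and explicit resource limits:

```yaml
# Hypothetical golden-path baseline a platform rule could enforce on every Deployment
spec:
  template:
    spec:
      containers:
        - name: example-service
          image: registry.internal/base/go-service:1.22   # internal base image, not a public one
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }        # required health endpoint
            initialDelaySeconds: 5
          livenessProbe:
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 10
          resources:
            requests: { cpu: 250m, memory: 256Mi }          # explicit requests and limits
            limits: { cpu: "1", memory: 512Mi }
```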
03

Operational Intelligence

Correlate incidents with code changes, config drifts, and dependency updates across your entire stack.

When something breaks at 2 AM, the on-call engineer shouldn't need to know the entire system to debug it. They need context: what changed recently, what depends on the broken component, and what's the safest way to fix it. That context is scattered across 30 repos, 5 monitoring tools, and someone's head.

Probe assembles that context automatically. It correlates alerts with recent PRs, Terraform applies, and config changes. It maps blast radius by tracing dependencies. It surfaces relevant runbooks and past incidents. The on-call engineer gets a complete picture, not a PagerDuty alert with no context.

  • Change correlation -- Link incidents to recent deploys, config changes, and dependency updates
  • Blast radius analysis -- Map which services are affected when a shared dependency fails
  • Runbook automation -- Surface and execute relevant runbooks based on the incident context

Workflow packs for platform operations

Pre-built automation workflows for common platform team responsibilities. Deploy them, customize per team, iterate over time.

Every PR

Infrastructure Review

Automated review of Terraform changes, Kubernetes manifests, and Helm charts. Validates against platform security policies, checks for resource misconfigurations, and detects breaking changes to shared modules before they merge.

  • Terraform plan validation
  • Security policy compliance check
  • Downstream consumer impact analysis
  • Cost estimation for resource changes
On Incident

Incident Context Assembly

When an alert fires, automatically gather context: recent deploys, config changes, dependency status, related past incidents. Give on-call everything they need to start debugging without asking anyone.

  • Recent change timeline
  • Service dependency map
  • Related past incidents
  • Relevant runbook links
New Service

Service Onboarding

When a new service repo is created, audit it against your golden path. Check for required health checks, structured logging, graceful shutdown, Dockerfile standards, and Helm chart completeness. Open issues for gaps.

  • Golden path compliance audit
  • Auto-generated issues for gaps
  • Template links and examples
  • Platform SDK setup guide
Nightly

Infrastructure Hygiene

Nightly scans for CVEs in base images, stale Terraform modules, unused resources, over-provisioned services, and drift between declared and actual state. Summary posted to your platform channel.

  • CVE scan with auto-bump PRs
  • Resource waste report
  • State drift detection
  • Module staleness audit

Built for production infrastructure

On-Premises Deployment

Runs entirely inside your infrastructure. Code and infrastructure configs never leave your environment. Meets SOC 2, HIPAA, and FedRAMP requirements.

Any LLM Provider

Use your org's preferred model -- Claude, GPT, open-source, or self-hosted behind your firewall. Switch providers without changing workflows or losing context.

Full Audit Trail

OpenTelemetry instrumentation on every query and workflow execution. Export to your existing Datadog, Grafana, or Splunk stack. Debug AI workflows like any other system.
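If you already run an OpenTelemetry Collector, routing those traces into your existing backend is a standard pipeline. A minimal sketch, assuming the collector-contrib distribution and a Datadog API key in DD_API_KEY (swap the exporter for your Grafana or Splunk setup):

```yaml
# Sketch: receive OTLP traces and forward them to Datadog via the OpenTelemetry Collector
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [datadog]
```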

Open Source Core

The core engine is open source. Your security team can audit exactly how code is processed. No vendor black boxes in your infrastructure stack.

Open Source vs Enterprise

Start with open source on a single repo. Scale to enterprise when you need multi-repo architecture and workflow automation across teams.

Probe Open Source

Free forever

The core code intelligence engine. Evaluate the technology on a single infrastructure repo or service codebase.

  • Single-repo code intelligence -- Ask questions about one repository at a time
  • Semantic code search -- Understands Terraform, YAML, Dockerfiles, and application code as structured data
  • No indexing required -- Works instantly on any codebase, runs locally
  • MCP integration -- Use with Claude Code, Cursor, or any MCP-compatible tool
  • Any LLM provider -- Claude, GPT, open-source models -- your choice
  • Privacy-first -- Everything runs locally, no data leaves your machine

Probe Enterprise

Contact for pricing

Everything in Open Source, plus multi-repo architecture, cross-service dependency mapping, workflow automation, and integrations.

  • Multi-repository architecture -- Query across all service repos, infra repos, and shared libraries
  • Cross-service dependency mapping -- Understand how services connect and what breaks when you change something
  • AI-powered infrastructure review -- Automated PR review for Terraform, Helm, and Kubernetes changes
  • Jira integration -- Pull ticket context, link infra changes to product requirements
  • Datadog / Grafana integration -- Correlate code changes with metrics and alerts
  • Incident correlation -- Automatically link alerts to recent deploys and config changes
  • Workflow automation -- Pre-built workflows for compliance, onboarding, and hygiene
  • Intelligent routing -- System determines which repos, configs, and docs are relevant per query
  • Slack/Teams integration -- App teams query infrastructure context from where they work
  • On-premises deployment -- Runs entirely in your infrastructure for maximum security

How to evaluate Probe

Two-phase approach: validate core technology with open source, then pilot enterprise features on your actual infrastructure stack.

Phase 1

Technical Validation

~10 minutes

Pick any of these and have something running before your next standup. No account required.

~2 min

Add Probe to Your Editor

Get infrastructure-aware code search in Claude Code, Cursor, or any MCP-compatible tool. Point it at your Terraform repo or service codebase. One command to install.

You get: An AI agent that understands your infrastructure code -- finds the right Terraform module, traces the Helm chart, identifies the correct K8s manifest without you navigating file trees.
AI code editor setup →
~5 min

Automate Infrastructure Reviews

Add a GitHub Action for automated review of Terraform changes, Kubernetes manifests, and Helm charts. Every PR gets a first-pass review for security, cost impact, and compliance.

You get: Every infra PR reviewed against your platform standards automatically. Catches security misconfigurations, overly permissive IAM, and missing resource tags before human review.
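A minimal trigger for this kind of gate might look like the sketch below; the `uses:` reference and its inputs are placeholders for illustration, not the documented integration:

```yaml
# Sketch: run an automated infra review on PRs that touch IaC paths.
# The review action below is a hypothetical placeholder -- substitute
# whatever integration your setup actually provides.
name: infra-review
on:
  pull_request:
    paths:
      - "terraform/**"
      - "k8s/**"
      - "charts/**"

jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write   # needed to post review comments
    steps:
      - uses: actions/checkout@v4
      - name: Automated infrastructure review (placeholder)
        uses: your-org/probe-infra-review@v1   # hypothetical action name
        with:
          fail-on: critical
```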
~10 min

Deploy a Platform Q&A Bot

Create a Slack bot that answers questions about your infrastructure. App teams ask "how do I deploy to staging?" and get answers grounded in your actual CI/CD config and Helm charts.

You get: A Slack bot that reduces your team's support load by letting app teams self-serve infrastructure questions. Run locally to test, then deploy anywhere.
Full setup guide →
Phase 2

Platform Pilot

2-4 weeks

Once you've validated the core technology, run a pilot across your infrastructure stack to test multi-repo understanding, automated reviews, and operational intelligence.

1
Map your architecture

We work with your team to connect all service repos, infra repos, and shared libraries. Map cross-service dependencies and configure access controls.

2
Define platform standards

Codify your golden path into review rules. Health checks, resource limits, logging standards, security policies -- all enforceable on every PR automatically.

3
Deploy operational workflows

Set up incident context assembly, nightly hygiene scans, and service onboarding automation. Connect to your monitoring stack via OpenTelemetry.

4
Measure the difference

Track support ticket volume from app teams, mean time to resolve incidents, and infra PR review cycle time. Compare to pre-pilot baselines.

Success criteria: Measurable reduction in platform team support load. Faster incident resolution. Consistent infrastructure quality across all teams. App teams self-serving infrastructure context.

Want to discuss how a pilot would work for your infrastructure stack?

Schedule a Technical Discussion

What platform teams ask us

How is this different from Backstage or other developer portals?

Developer portals give you a catalog of services with manually maintained metadata. Probe gives you live, queryable access to the actual code, configs, and infrastructure. When someone asks "how does service X connect to the database?", Probe reads the connection config from the code. Backstage shows whatever someone wrote in a YAML file six months ago.

They're complementary -- Probe can actually keep your Backstage catalog accurate by generating metadata from real code.
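For example, the Backstage metadata in question is just a catalog-info.yaml per service. A sketch of the kind of entry that could be generated or verified from real code -- the service names, owner, and dependencies here are illustrative:

```yaml
# Hypothetical catalog-info.yaml kept in sync with actual code and configs
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: order-service
  annotations:
    github.com/project-slug: your-org/order-service
spec:
  type: service
  lifecycle: production
  owner: team-payments
  dependsOn:
    - component:inventory-service   # derived from actual client imports
    - resource:primary-postgres     # derived from connection config
```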

Does it understand Terraform, Kubernetes, and Helm natively?

Yes. Probe reads infrastructure-as-code semantically, not as flat text. It understands Terraform resource relationships, Kubernetes manifest structures, Helm value overrides, and Docker multi-stage builds. It can trace a reference from a Helm values file to the Terraform module that provisions the underlying resource.

How does it handle multi-repo architectures?

The enterprise tier connects all your repositories -- service code, infrastructure code, shared libraries, CI/CD configs. When you ask a question, it determines which repos are relevant and pulls context from all of them. A question about "why is service X slow" might pull code from the app repo, networking config from the infra repo, and resource limits from the Helm chart.

Can we run this fully on-prem behind our firewall?

Yes. Probe runs entirely locally -- retrieval is local and you control what context is sent to the model. You choose your LLM provider, including self-hosted models like Llama or Mistral. All workflow execution is local. OpenTelemetry traces export to your existing monitoring stack. No data leaves your network unless you explicitly configure it to.

Ready to stop being a human search engine?

Let's talk about how Probe can reduce your platform team's support load and give app teams self-serve access to infrastructure context. We'll show you how it works on real infrastructure code.