service-mesh-observability

Skill from rmyndharis/antigravity-skills

What it does

Implements comprehensive service mesh observability by configuring distributed tracing, metrics collection, and visualization across microservices.

Part of

rmyndharis/antigravity-skills (289 items)

Installation

npm run build:catalog
npx @rmyndharis/antigravity-skills search <query>
npx @rmyndharis/antigravity-skills search kubernetes
npx @rmyndharis/antigravity-skills list
npx @rmyndharis/antigravity-skills install <skill-name>

+ 15 more commands

📖 Extracted from docs: rmyndharis/antigravity-skills
10 installs · Added Feb 4, 2026

Skill Details

SKILL.md

Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.

Overview

# Service Mesh Observability

Complete guide to observability patterns for Istio, Linkerd, and other service mesh deployments.

Do not use this skill when

  • The task is unrelated to service mesh observability
  • You need a different domain or tool outside this scope

Instructions

  • Clarify goals, constraints, and required inputs.
  • Apply relevant best practices and validate outcomes.
  • Provide actionable steps and verification.
  • If detailed examples are required, open resources/implementation-playbook.md.

Use this skill when

  • Setting up distributed tracing across services
  • Implementing service mesh metrics and dashboards
  • Debugging latency and error issues
  • Defining SLOs for service communication
  • Visualizing service dependencies
  • Troubleshooting mesh connectivity

Core Concepts

1. Three Pillars of Observability

```
┌─────────────────────────────────────────────────────────┐
│                      Observability                      │
├──────────────────┬──────────────────┬───────────────────┤
│     Metrics      │      Traces      │       Logs        │
│                  │                  │                   │
│ • Request rate   │ • Span context   │ • Access logs     │
│ • Error rate     │ • Latency        │ • Error details   │
│ • Latency P50    │ • Dependencies   │ • Debug info      │
│ • Saturation     │ • Bottlenecks    │ • Audit trail     │
└──────────────────┴──────────────────┴───────────────────┘
```

2. Golden Signals for Mesh

| Signal | Description | Alert Threshold |
|--------|-------------|-----------------|
| Latency | Request duration P50, P99 | P99 > 500ms |
| Traffic | Requests per second | Anomaly detection |
| Errors | 5xx error rate | > 1% |
| Saturation | Resource utilization | > 80% |
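
The alerting rules later in this guide cover latency and errors but not saturation. The sketch below shows one hedged way to alert on sidecar saturation; it assumes cAdvisor and kube-state-metrics series are already in Prometheus and that the sidecar container is named istio-proxy (the Istio default).

```yaml
# Hypothetical saturation alert: sidecar CPU usage as a fraction of its CPU limit.
# Assumes cAdvisor (container_cpu_usage_seconds_total) and kube-state-metrics
# (kube_pod_container_resource_limits) metrics are available in Prometheus.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-saturation
  namespace: istio-system
spec:
  groups:
    - name: mesh.saturation
      rules:
        - alert: SidecarCPUSaturation
          expr: |
            sum(rate(container_cpu_usage_seconds_total{container="istio-proxy"}[5m])) by (namespace, pod)
              / sum(kube_pod_container_resource_limits{container="istio-proxy", resource="cpu"}) by (namespace, pod)
              > 0.8
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Sidecar CPU above 80% of its limit for {{ $labels.pod }}"
```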

Templates

Template 1: Istio with Prometheus & Grafana

```yaml
# Install Prometheus
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'istio-mesh'
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - istio-system
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_name]
            action: keep
            regex: istio-telemetry
---
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 15s
```
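
The ServiceMonitor above only scrapes the Istio control plane (istiod). To also collect data-plane metrics from the Envoy sidecars, a PodMonitor along the lines of the following sketch is commonly used; compare it with the integration manifests shipped with your Istio version. The /stats/prometheus path is Istio's default for merged sidecar metrics; the label selector here is an assumption.

```yaml
# Sketch: PodMonitor for Envoy sidecar (data-plane) metrics.
# Assumes the Prometheus Operator is installed and sidecars expose merged
# metrics at /stats/prometheus (Istio default).
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: envoy-stats-monitor
  namespace: istio-system
spec:
  selector:
    matchExpressions:
      - key: istio-prometheus-ignore
        operator: DoesNotExist
  namespaceSelector:
    any: true
  podMetricsEndpoints:
    - path: /stats/prometheus
      interval: 15s
      relabelings:
        # Keep only endpoints belonging to the istio-proxy container.
        - action: keep
          sourceLabels: [__meta_kubernetes_pod_container_name]
          regex: "istio-proxy"
```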

Template 2: Key Istio Metrics Queries

```promql
# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

# Error rate (5xx)
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

# P99 latency
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))

# TCP connections
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)

# Request size
histogram_quantile(0.99,
  sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))
```
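
Defining SLOs is one of this skill's stated use cases, but none of the templates show one directly. The sketch below derives a simple availability SLI from the same istio_requests_total metric using Prometheus recording rules; the rule and group names are illustrative, not a standard.

```yaml
# Sketch: availability SLI recording rule derived from istio_requests_total.
# Rule names are hypothetical placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-sli-rules
  namespace: istio-system
spec:
  groups:
    - name: mesh.sli
      rules:
        # Fraction of successful (non-5xx) requests per destination service.
        - record: service:availability:ratio_rate5m
          expr: |
            1 - (
              sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m])) by (destination_service_name)
                / sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
            )
```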

Template 3: Jaeger Distributed Tracing

```yaml
# Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0  # 100% in dev, lower in prod
        zipkin:
          address: jaeger-collector.istio-system:9411
---
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.50
          ports:
            - containerPort: 5775   # UDP
            - containerPort: 6831   # Thrift
            - containerPort: 6832   # Thrift
            - containerPort: 5778   # Config
            - containerPort: 16686  # UI
            - containerPort: 14268  # HTTP
            - containerPort: 14250  # gRPC
            - containerPort: 9411   # Zipkin
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
```
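
The IstioOperator config above sends spans to jaeger-collector.istio-system:9411, but the Deployment alone does not create that DNS name. A minimal Service sketch that exposes the relevant collector ports (port names are chosen here for illustration):

```yaml
# Minimal Service so jaeger-collector.istio-system:9411 (referenced above) resolves.
apiVersion: v1
kind: Service
metadata:
  name: jaeger-collector
  namespace: istio-system
spec:
  selector:
    app: jaeger
  ports:
    - name: zipkin
      port: 9411
      targetPort: 9411
    - name: grpc
      port: 14250
      targetPort: 14250
    - name: http-collector
      port: 14268
      targetPort: 14268
```

Note that the all-in-one image keeps traces in memory and is intended for development; production setups typically run the Jaeger collector and query components separately with a durable storage backend.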

Template 4: Linkerd Viz Dashboard

```bash
# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -

# Access dashboard
linkerd viz dashboard

# CLI commands for observability

# Top requests
linkerd viz top deploy/my-app

# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend

# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend

# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
```

Template 5: Grafana Dashboard JSON

```json
{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 1, "color": "yellow"},
                {"value": 5, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Service Topology",
        "type": "nodeGraph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
          }
        ]
      }
    ]
  }
}
```
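
One common way to provision this dashboard is to wrap the JSON in a ConfigMap that a Grafana dashboard sidecar (for example the one used by kube-prometheus-stack) discovers via the grafana_dashboard label. A sketch under that assumption; the namespace and ConfigMap name are placeholders:

```yaml
# Sketch: auto-provision the dashboard via a labeled ConfigMap.
# Assumes a Grafana dashboard sidecar watching for the `grafana_dashboard` label.
apiVersion: v1
kind: ConfigMap
metadata:
  name: service-mesh-overview-dashboard
  namespace: monitoring          # adjust to wherever Grafana runs
  labels:
    grafana_dashboard: "1"
data:
  # Paste the full dashboard JSON from Template 5 as the value below.
  service-mesh-overview.json: |
    {"dashboard": {"title": "Service Mesh Overview", "panels": []}}
```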

Template 6: Kiali Service Mesh Visualization

```yaml
# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
spec:
  auth:
    strategy: anonymous  # or openid, token
  deployment:
    accessible_namespaces:
      - "**"
  external_services:
    prometheus:
      url: http://prometheus.istio-system:9090
    tracing:
      url: http://jaeger-query.istio-system:16686
    grafana:
      url: http://grafana.istio-system:3000
```

Template 7: OpenTelemetry Integration

```yaml
# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      zipkin:
        endpoint: 0.0.0.0:9411
    processors:
      batch:
        timeout: 10s
    exporters:
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      pipelines:
        traces:
          receivers: [otlp, zipkin]
          processors: [batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
---
# Istio Telemetry v2 with OTel
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel
      randomSamplingPercentage: 10
```
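
The Telemetry resource above references a tracing provider named otel, which also has to be registered under meshConfig.extensionProviders. A sketch of that registration; the service name assumes an otel-collector Service in istio-system fronting the OTLP gRPC receiver configured above:

```yaml
# Sketch: register the `otel` provider referenced by the Telemetry resource.
# The service and port are assumptions matching the collector's OTLP gRPC
# receiver (4317) from the ConfigMap above.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    extensionProviders:
      - name: otel
        opentelemetry:
          service: otel-collector.istio-system.svc.cluster.local
          port: 4317
```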

Alerting Rules

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-alerts
  namespace: istio-system
spec:
  groups:
    - name: mesh.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
              / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate for {{ $labels.destination_service_name }}"
        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
              by (le, destination_service_name)) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency for {{ $labels.destination_service_name }}"
        - alert: MeshCertExpiring
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          labels:
            severity: warning
          annotations:
            summary: "Mesh certificate expiring in less than 7 days"
```

Best Practices

Do's

  • Sample appropriately - 100% in dev, 1-10% in prod
  • Use trace context - Propagate headers consistently
  • Set up alerts - For golden signals
  • Correlate metrics/traces - Use exemplars
  • Retain strategically - Hot/cold storage tiers

Don'ts

  • Don't over-sample - Storage costs add up
  • Don't ignore cardinality - Limit label values
  • Don't skip dashboards - Visualize dependencies
  • Don't forget costs - Track the cost of the observability stack itself

Resources

  • [Istio Observability](https://istio.io/latest/docs/tasks/observability/)
  • [Linkerd Observability](https://linkerd.io/2.14/features/dashboard/)
  • [OpenTelemetry](https://opentelemetry.io/)
  • [Kiali](https://kiali.io/)