Monitoring & Alerting
Monitor the health of the production cluster and detect problems proactively.
Level: Expert. This section assumes production experience.
Decision Guide
| Tool | Suitable for | Example or rationale |
|---|---|---|
| Stack Monitoring (Kibana) | Elastic Cloud, fast setup | Managed cluster |
| Prometheus + Grafana | Existing infrastructure, custom dashboards | Multi-system monitoring |
| elasticsearch-exporter | Prometheus scrape target | Self-managed + Prometheus |
| Metricbeat | ES-native, Kibana dashboards | ELK-native monitoring |
| Alertmanager | Routing, silencing, and grouping alerts | PagerDuty escalation |
| Watcher (ES) | ES-native alerting | Legacy setup |
Critical Metrics
| Metric | Danger threshold | Action |
|---|---|---|
| Cluster status | YELLOW > 5 min | Investigate unassigned shards |
| Cluster status | RED | Respond immediately |
| JVM heap | > 85% | Check GC pressure, scale up |
| Disk usage | > 85% | Apply ILM, delete data, or add disk |
| Search latency p99 | > 500 ms | Profile and optimize queries |
| Indexing rate drop | > 50% | Check for bulk rejections |
| Thread pool rejected | > 0 | Review queue size or scale out |
| Circuit breaker | Trips | Reduce query complexity |
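The thresholds in the table can be read as a simple rule set. The following is a minimal sketch, assuming the values have already been collected by some poller; `ClusterSnapshot`, its field names, and `evaluate` are hypothetical and not part of any Elasticsearch client. Indexing rate drop and circuit breaker trips are left out because they need a baseline or event stream rather than a single reading.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    """Hypothetical container; fill it from whatever collector you use."""
    status: str               # "green", "yellow", or "red"
    minutes_in_status: float  # how long the current status has lasted
    heap_percent: float
    disk_percent: float
    search_p99_ms: float
    new_rejections: int       # thread pool rejections since the previous poll


def evaluate(snap: ClusterSnapshot) -> list[tuple[str, str]]:
    """Return (metric, action) pairs for every threshold the snapshot breaches."""
    findings = []
    if snap.status == "red":
        findings.append(("cluster_status", "respond immediately"))
    elif snap.status == "yellow" and snap.minutes_in_status > 5:
        findings.append(("cluster_status", "investigate unassigned shards"))
    if snap.heap_percent > 85:
        findings.append(("jvm_heap", "check GC pressure, scale up"))
    if snap.disk_percent > 85:
        findings.append(("disk_usage", "apply ILM, delete data, or add disk"))
    if snap.search_p99_ms > 500:
        findings.append(("search_latency_p99", "profile and optimize queries"))
    if snap.new_rejections > 0:
        findings.append(("thread_pool_rejected", "review queue size or scale out"))
    return findings
```

A healthy snapshot yields an empty list, so the caller only has to act when something is returned.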
REST API
# Key monitoring endpoints
curl -s "http://localhost:9200/_cluster/health?pretty"
curl -s "http://localhost:9200/_cat/nodes?v&h=name,role,heap.percent,disk.used_percent,cpu,load_1m"
curl -s "http://localhost:9200/_cat/indices?v&h=index,health,pri,rep,docs.count,store.size&s=store.size:desc"
curl -s "http://localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=rejected:desc"
curl -s "http://localhost:9200/_nodes/stats/jvm,os,process?pretty"
# Pending tasks (cluster stability)
curl -s "http://localhost:9200/_cluster/pending_tasks?pretty"
# Hot threads (CPU debug)
curl -s "http://localhost:9200/_nodes/hot_threads"
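The `_cat` endpoints also accept `format=json`, which is easier to process in a script than the column output. The sketch below assumes the response of `_cat/thread_pool?format=json&h=node_name,name,active,queue,rejected` has already been fetched as text; the `SAMPLE` payload and the function name are made up for illustration. Note that `rejected` is cumulative since node start, so a real check would compare successive polls.

```python
import json

# Illustrative payload only; the cat API returns every value as a string.
SAMPLE = """[
  {"node_name": "es-1", "name": "write",  "active": "2", "queue": "0", "rejected": "14"},
  {"node_name": "es-1", "name": "search", "active": "1", "queue": "0", "rejected": "0"},
  {"node_name": "es-2", "name": "write",  "active": "0", "queue": "0", "rejected": "3"}
]"""


def pools_with_rejections(cat_json: str) -> list[tuple[str, str, int]]:
    """Return (node, pool, rejected) for pools with rejections, worst first."""
    rows = json.loads(cat_json)
    hits = [
        (row["node_name"], row["name"], int(row["rejected"]))
        for row in rows
        if int(row["rejected"]) > 0
    ]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)
```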
.NET Client
// Health check for ASP.NET Core
using Elastic.Clients.Elasticsearch;
using Microsoft.Extensions.Diagnostics.HealthChecks;
// Both namespaces define HealthStatus; the alias avoids an ambiguous reference.
using EsHealthStatus = Elastic.Clients.Elasticsearch.HealthStatus;

public class ElasticsearchHealthCheck : IHealthCheck
{
    private readonly ElasticsearchClient _client;

    public ElasticsearchHealthCheck(ElasticsearchClient client) => _client = client;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken ct = default)
    {
        try
        {
            var response = await _client.Cluster.HealthAsync(h => h
                .Timeout(TimeSpan.FromSeconds(5)), ct);

            if (!response.IsValidResponse)
                return HealthCheckResult.Unhealthy("ES unreachable");

            // Map the cluster color onto the ASP.NET health states
            return response.Status switch
            {
                EsHealthStatus.Green => HealthCheckResult.Healthy("Cluster green"),
                EsHealthStatus.Yellow => HealthCheckResult.Degraded(
                    "Cluster yellow: " + response.UnassignedShards + " unassigned"),
                _ => HealthCheckResult.Unhealthy("Cluster RED!")
            };
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("ES connection failed", ex);
        }
    }
}

// Register in DI (Program.cs)
builder.Services.AddHealthChecks()
    .AddCheck<ElasticsearchHealthCheck>("elasticsearch");
Example: In production, monitor ES with Prometheus + Grafana: JVM heap, GC pauses, indexing rate, search latency, and thread pool rejections. PagerDuty alerts: cluster RED = P1 incident, heap > 90% = P2.
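The priority mapping in this example fits in a few lines. This is only a sketch of the stated rule (RED is P1, heap above 90% is P2); the function name is hypothetical, and a real setup would normally express the same mapping as alert labels routed by Alertmanager.

```python
from typing import Optional


def incident_priority(cluster_status: str, heap_percent: float) -> Optional[str]:
    """Map the two conditions from the example onto a PagerDuty-style priority."""
    if cluster_status.lower() == "red":
        return "P1"  # primaries missing: page immediately
    if heap_percent > 90:
        return "P2"  # heap close to exhaustion
    return None      # nothing to page
```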
Grafana Dashboard (Ready-to-Import)
Elasticsearch Cluster Dashboard JSON (4 Panels)
{
  "dashboard": {
    "title": "Elasticsearch Cluster Overview",
    "tags": ["elasticsearch", "production"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Cluster Health Status",
        "type": "stat",
        "gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 },
        "targets": [{
          "expr": "elasticsearch_cluster_health_status{color=\"green\"}",
          "legendFormat": "Green"
        }],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "color": "red", "value": 0 },
                { "color": "green", "value": 1 }
              ]
            },
            "mappings": [
              { "type": "value", "options": { "0": { "text": "RED/YELLOW" }, "1": { "text": "GREEN" } } }
            ]
          }
        },
        "description": "Cluster health. GREEN=all shards assigned. YELLOW=replicas missing. RED=primaries missing."
      },
      {
        "title": "JVM Heap Usage (%)",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
        "targets": [{
          "expr": "elasticsearch_jvm_memory_used_bytes{area=\"heap\"} / elasticsearch_jvm_memory_max_bytes{area=\"heap\"} * 100",
          "legendFormat": "{{name}}"
        }],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                { "color": "green", "value": 0 },
                { "color": "yellow", "value": 75 },
                { "color": "red", "value": 85 }
              ]
            }
          }
        },
        "description": "JVM heap per node. Alert threshold: >85% sustained = GC pressure. >90% = OOM risk. Max 30GB heap."
      },
      {
        "title": "Indexing & Search Rate",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 },
        "targets": [
          {
            "expr": "rate(elasticsearch_indices_indexing_index_total[5m])",
            "legendFormat": "Indexing/s {{name}}"
          },
          {
            "expr": "rate(elasticsearch_indices_search_query_total[5m])",
            "legendFormat": "Search/s {{name}}"
          }
        ],
        "fieldConfig": { "defaults": { "unit": "ops" } },
        "description": "Indexing and search operations per second. Sudden drops indicate bulk rejections or circuit breaker trips."
      },
      {
        "title": "Thread Pool Rejections",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 12 },
        "targets": [
          {
            "expr": "rate(elasticsearch_thread_pool_rejected_count{type=~\"search|write|bulk\"}[5m])",
            "legendFormat": "{{type}} rejected {{name}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "ops",
            "thresholds": {
              "steps": [
                { "color": "green", "value": 0 },
                { "color": "red", "value": 1 }
              ]
            }
          }
        },
        "description": "Thread pool rejections indicate overload. Any rejection >0 = capacity issue. Scale up or reduce load."
      }
    ],
    "time": { "from": "now-1h", "to": "now" },
    "refresh": "30s"
  }
}
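Before importing a dashboard file, it can help to run a quick sanity check, since a single unescaped quote inside an `expr` string makes the whole document invalid JSON. The sketch below is an assumption about what such a check might look like, not a Grafana tool: `check_dashboard` is a hypothetical helper that verifies the file parses, that each panel stays inside Grafana's 24-column grid, and that each panel has at least one query expression.

```python
import json

GRID_COLUMNS = 24  # width of the Grafana dashboard grid


def check_dashboard(text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means no findings."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]

    problems = []
    # Accept both the API wrapper ({"dashboard": {...}}) and the bare model.
    panels = doc.get("dashboard", doc).get("panels", [])
    for panel in panels:
        title = panel.get("title", "<untitled>")
        pos = panel.get("gridPos", {})
        if pos.get("x", 0) + pos.get("w", 0) > GRID_COLUMNS:
            problems.append(f"{title}: panel exceeds the {GRID_COLUMNS}-column grid")
        if not any(target.get("expr") for target in panel.get("targets", [])):
            problems.append(f"{title}: no query expression")
    return problems
```

Running it against the file contents, for example `check_dashboard(open("dashboard.json").read())` with a file name of your choosing, returns an empty list when nothing is flagged.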
Alert thresholds (Prometheus rules):
| Metric | Warning | Critical | Action |
|---|---|---|---|
| `elasticsearch_cluster_health_status{color="red"}` | n/a | == 1 for 1m | P1: immediate response |
| jvm_heap_percent | > 80% for 5m | > 90% for 2m | Scale up / reduce load |
| thread_pool_rejected | > 0 for 1m | > 10/s for 1m | Queue size / scale nodes |
| disk_used_percent | > 80% | > 85% | Delete / add disk / ILM |
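The "for 5m" and "for 2m" qualifiers mean the condition has to hold continuously before the alert fires, which filters out short spikes. The following is a simplified approximation of that behavior, assuming evenly collected `(timestamp_seconds, value)` samples; the function name and sample data are illustrative, and Prometheus itself evaluates the condition at each rule evaluation interval rather than over a stored list.

```python
def firing(samples: list[tuple[float, float]], threshold: float, hold_seconds: float) -> bool:
    """True if the value has stayed above threshold for at least hold_seconds,
    measured up to the most recent sample. Samples must be sorted by time."""
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # start of the current breach
        else:
            breach_start = None           # any dip resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds


# Heap percent sampled every 60 s: above 80% from t=60 through t=360.
sustained = [(0, 70), (60, 82), (120, 85), (180, 83), (240, 84), (300, 86), (360, 88)]
# A single dip below 80% at t=60 restarts the hold period.
spiky = [(0, 82), (60, 79), (120, 85), (180, 86)]
```

With these samples, `firing(sustained, 80, 300)` is true while `firing(spiky, 80, 300)` is false, which is the difference between a sustained warning and a transient spike.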