Imagine a startup called "TechCorp" that provides an e-commerce platform. One night, the platform suddenly went down, and users were unable to place orders. The on-call engineer, Alex, was unaware of the outage until customers started complaining the next morning. By the time Alex logged in, the system had been down for over six hours, resulting in significant revenue loss and customer dissatisfaction.
Since TechCorp had not implemented a proper monitoring solution, Alex had no immediate visibility into what had gone wrong. He had to manually check server logs, application errors, and cloud infrastructure metrics to diagnose the issue. After hours of investigation, he found that the database server had reached its maximum connection limit due to an unexpected traffic spike. If Prometheus had been in place, it could have:
Triggered Alerts: Alertmanager would have notified Alex instantly about high database connection usage.
Provided Metrics: Prometheus could have shown a spike in database connections, helping Alex pinpoint the issue faster.
Reduced Downtime: With proactive monitoring, Alex could have mitigated the issue before it led to a full-blown outage.
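As an illustration, the kind of alerting rule that would have caught this could be sketched as follows. This is a hypothetical example: the metric names assume mysqld_exporter is running against the database, and the 90% threshold is arbitrary.

```yaml
# Hypothetical alerting rule: fire when MySQL connection usage stays
# above 90% of the configured limit for 5 minutes
groups:
  - name: database
    rules:
      - alert: HighDatabaseConnections
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool above 90% of its limit"
```

With a rule like this, Alertmanager would have paged Alex long before the connection limit was actually reached.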
This incident was a wake-up call for TechCorp, and they immediately integrated Prometheus into their infrastructure to prevent future disruptions. The lesson here is clear: proactive monitoring is not optional—it is essential.
With its rich feature set and strong community support, Prometheus has become the go-to monitoring tool for modern infrastructure. In the upcoming sections, we’ll explore essential concepts about Prometheus. Let’s get started!
What is Prometheus?
Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability in modern cloud-native environments. It was originally developed at SoundCloud and is now a Cloud Native Computing Foundation (CNCF) project.
Features of Prometheus:
Time-Series Data Storage: Stores metrics with timestamps and labels, allowing detailed analysis over time.
PromQL (Prometheus Query Language): A powerful query language to retrieve and manipulate metrics.
Pull-Based Data Collection: Prometheus scrapes (pulls) metrics from configured endpoints instead of relying on external agents.
Service Discovery: Automatically detects services in dynamic environments like Kubernetes.
Alerting with Alertmanager: Sends alerts based on metric conditions via channels such as Slack and email.
Multi-dimensional Data Model: Uses labels (key-value pairs) to categorize and filter metrics efficiently.
Integration with Grafana: Visualize metrics in real-time dashboards.
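To make the data model and PromQL concrete, here is a small query sketch. Note that http_requests_total is an illustrative metric name exposed by many instrumented applications, not something Prometheus ships by default:

```promql
# Per-second rate of 500-responses over the last 5 minutes,
# aggregated by the "path" label
sum(rate(http_requests_total{job="api-server", status="500"}[5m])) by (path)
```

The labels inside the braces (`job`, `status`) show the multi-dimensional data model in action: the same metric is sliced and filtered purely through key-value pairs.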
How Prometheus Works:
Scraping: Prometheus collects metrics from target endpoints (e.g., servers, applications, databases).
Storage: Stores the scraped metrics in a highly efficient time-series database.
Querying: Users can query metrics using PromQL.
Alerting: Prometheus evaluates rules and sends alerts when conditions are met.
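The scraping step above is driven entirely by configuration. A minimal prometheus.yml that scrapes Prometheus's own metrics endpoint looks roughly like this:

```yaml
# Minimal sketch: scrape Prometheus itself every 15 seconds
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
```

Adding more targets is just a matter of listing more endpoints (or, in dynamic environments, replacing static_configs with a service-discovery block).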
Why Use Prometheus?
Scalability: Handles large-scale distributed systems.
Reliability: Designed for fault-tolerant monitoring.
Flexibility: Works with Kubernetes, Docker, EC2, and more.
Key Terminologies:
Observability:
Observability is the ability to understand what’s happening inside a system just by looking at the data it produces. It helps in quickly identifying issues, understanding system behavior, and improving performance.
Think of observability like a car’s dashboard:
The speedometer shows how fast you're going (Performance Metrics 📊).
The fuel gauge tells you when to refill (Resource Usage ⛽).
The check engine light warns you about problems (Alerts 🚨).
Now, if your car didn’t have these indicators, you'd have no clue what’s wrong until the car breaks down! This is why observability is crucial for modern applications.
Modern system architecture is becoming more complex, shifting from monolithic to microservices. In microservices-based systems, multiple services interact, making it hard to identify failures without proper observability. Imagine users experiencing slow response times. Without observability, engineers might struggle to determine if the issue is with the database, authentication service, or an overloaded API gateway.
Observability consists of three main pillars: Metrics, Logs, and Traces.
Metrics (What is Happening?)
Metrics provide quantitative data about the system’s overall health and performance. They help in monitoring trends over time.
📌 Example:
Metrics in a car include speed (km/h), fuel level (%), engine temperature (°C), and mileage (km/l).
They give you a real-time snapshot of how the car is performing.
Example: If the engine temperature starts rising, it signals a potential issue.
🔍 Prometheus Example:
Prometheus collects CPU usage, memory utilization, request rates, and error rates from your system.
If CPU usage spikes to 90%, it could indicate high load or a resource leak.
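As a sketch, the CPU-usage figure above can be computed with a PromQL query like the following (it assumes node_exporter is exposing node_cpu_seconds_total on the target machines):

```promql
# Percentage of CPU time spent non-idle, averaged per instance
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

This works by measuring the idle fraction of CPU time over the last five minutes and subtracting it from 100%.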
Logs (Why Did It Happen?)
Logs are detailed records of events and actions taken within a system. They help in troubleshooting by showing exactly what happened.
📌 Example:
Logs are like a car’s black box (event recorder) or error messages on the dashboard.
Example: If your car doesn’t start, the logs might show:
“Battery voltage low”
“Ignition system error”
These logs help mechanics diagnose what went wrong.
🔍 Prometheus & Logging:
Prometheus does not store logs directly but can work with Loki (Grafana’s logging tool) to provide logs alongside metrics.
Example: A server crashes, and logs show:
“Database connection timed out”
“Out of memory error”
Traces (Where Did It Happen?)
Traces help track the journey of a request across multiple services in a system. They show where delays or failures occur.
📌 Example:
Traces are like GPS tracking of your car’s journey.
If your trip from Point A to Point B takes too long, traces show:
Where you stopped (e.g., traffic jam 🚦)
Which routes were slow (e.g., roadblock 🚧)
If your engine misfires, the trace might reveal which cylinder failed.
🔍 Prometheus & Tracing:
Prometheus works with Jaeger or OpenTelemetry to track request flows.
Example: A user request takes 5 seconds instead of 500ms, and traces show:
API Gateway (200ms) ✅
Authentication Service (4.5s) ❌ (Delayed)
Database (100ms) ✅
SLI (Service Level Indicator) – The Actual Measurement 📏
An SLI is a quantifiable metric that measures how well a system is performing. It tracks key performance indicators such as latency, error rate, or uptime.
📌 Example:
A website’s SLI for uptime could be 99.5% in the last 30 days.
If your app’s response time should be under 500ms, the SLI could be 480ms (good) or 800ms (bad).
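In PromQL, an availability SLI like the one above might be computed as follows. This is a sketch: http_requests_total is an illustrative metric name, and a 30-day range assumes your Prometheus retention covers that window.

```promql
# Fraction of requests over 30 days that did not return a 5xx status
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
```

A result of 0.995 would correspond to the 99.5% availability SLI in the example.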
SLO (Service Level Objective) – The Goal 🎯
An SLO is the target value for an SLI. It defines what is considered acceptable performance.
📌 Example:
If your SLI (uptime) is 99.5%, but your company aims for 99.9% uptime, then SLO = 99.9%.
If the response time should be ≤500ms for 95% of requests, that’s the SLO.
💡 SLOs are internal targets used by engineering teams to maintain reliability.
SLA (Service Level Agreement) – The Contract 📜
An SLA is a legal agreement between a service provider and a customer. It sets binding guarantees based on SLOs and outlines penalties if they are not met.
📌 Example:
An SLA might state:
- “We guarantee 99.9% uptime. If downtime exceeds this, you get a refund or service credits.”
If the SLO (goal) is 99.9% uptime, but the system only achieves 99.5% uptime, the provider may have to compensate customers.
Imagine a pizza delivery service:
SLI (Indicator): The actual delivery time of pizzas. 📏
- Last month’s average delivery time: 28 minutes.
SLO (Objective): The target delivery time. 🎯
- The company’s internal goal: Deliver 90% of pizzas within 30 minutes.
SLA (Agreement): The contractual promise to customers. 📜
- If your pizza is late (over 30 minutes), you get a free pizza.
Common Terminologies:
Latency ⏳– Time taken for a request to travel from the user to the system and back.
Example: A webpage loads in 300ms, or an API request takes 120ms.
Lower latency means faster response and better performance.
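Averages hide outliers, so latency is usually tracked as a percentile. A hedged PromQL sketch using a histogram metric (http_request_duration_seconds_bucket is an illustrative name following the common instrumentation convention) is:

```promql
# 95th-percentile request latency over the last 5 minutes
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

This says: 95% of requests completed faster than the returned value.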
Throughput 🚀– Number of requests a system can process per second.
Example: A web server handles 1,000 requests per second, or a database processes 500 queries per second.
Higher throughput means the system can serve more users simultaneously.
Response Time ⏱️– Total time from when a request is sent to when the system responds.
Example: An API request taking 2 seconds to respond.
Formula: Response Time = Latency + Processing Time.
Uptime 🟢– The percentage of time a system is available.
Example: 99.9% uptime means 8.76 hours of downtime per year.
Higher uptime indicates better reliability.
Error Rate ❌– Percentage of failed requests.
Example: If 10 out of 1,000 requests fail, the error rate is 1%.
Lower error rate means a more stable system.
Requests Per Second (RPS) 📡– The number of requests a server receives per second.
Example: A website handling 5,000 requests per second during peak hours.
More RPS indicates higher system load.
CPU Usage 🔋– The percentage of CPU resources used.
Example: If CPU usage exceeds 90%, the system may slow down.
High CPU usage for long periods can lead to crashes.
Memory Usage 🧠– The percentage of RAM used by an application.
Example: A server using 80% of RAM may slow down or crash.
Memory leaks occur when unused memory is not released.
Disk I/O ✍🏻– Speed at which data is read or written to disk.
Example: High Disk I/O can slow down database performance.
Slow disk speed affects application performance.
Network Bandwidth 🌐– Data transfer rate over a network (measured in Mbps/Gbps).
Example: A video streaming service using 1Gbps bandwidth.
Low bandwidth results in slow website loading.
Load Average ⚖️– System load measured over 1, 5, and 15 minutes.
Example: A load average of 10 means that, on average, 10 processes were running or waiting for CPU.
Higher load can lead to slower response times.
Cache Hit Ratio 🎯– The percentage of requests served from cache instead of the database.
Example: A 90% cache hit ratio means 9 out of 10 requests were served from cache.
Higher cache hit ratio improves response time.
Installation:
You can download Prometheus from: https://prometheus.io/download/
To run Prometheus, use the commands for your operating system:
Mac (M1/M2 & Intel):
brew install prometheus
prometheus --config.file=/opt/homebrew/etc/prometheus.yml
(On Intel Macs, Homebrew installs under /usr/local, so use --config.file=/usr/local/etc/prometheus.yml.)
Linux (replace 2.53.0 with the latest version from the downloads page, since release archives are versioned):
wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz
tar -xvf prometheus-2.53.0.linux-amd64.tar.gz
cd prometheus-2.53.0.linux-amd64
./prometheus --config.file=prometheus.yml
Windows (PowerShell, again substituting the latest version number):
Invoke-WebRequest -Uri "https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.windows-amd64.zip" -OutFile "prometheus.zip"
Expand-Archive -Path "prometheus.zip" -DestinationPath ".\prometheus"
cd .\prometheus\prometheus-2.53.0.windows-amd64
.\prometheus.exe --config.file=prometheus.yml
Prometheus will now be running at http://localhost:9090 🚀
That’s it for now. 💡 What’s Next?
I’ll be diving deeper into Prometheus fundamentals, exploring advanced metrics, and monitoring Kubernetes in upcoming blogs. Stay tuned!