This is part two of a series. In part one, I wrote about my first day with k6, what I built, what broke, and how I found that bcrypt at cost factor 12 was killing performance under load. That post ended with a cliffhanger: k6 tells you something is slow. But it doesn't tell you where exactly, or why. That's what today is about.

What Jaeger Taught Me That Postman Never Could

After the k6 session, I had numbers. Good ones, bad ones, and one that nagged at me: a 90ms gap I couldn't explain. Postman had never shown me anything like it. Neither had k6, not directly. I needed something that could look inside a request while it was happening and tell me exactly where the time was going.

Today I sat down to properly learn observability. Not just know what it is: actually set it up, wire it into a real app, and see what it tells me. I built a NestJS microservices blogging platform (auth, user, blog, comment services communicating over gRPC), threw k6 load tests at it, and watched what happened. What I found changed how I think about performance debugging.

Setting up

Before I could observe anything, I needed something worth observing. I wrote four microservices, wired them together with gRPC, added JWT auth with refresh tokens, bcrypt for password hashing, and an API gateway in front, with Docker Compose to run it all locally. (All of that was part one.)

Then I set up the observability stack:

  • Jaeger for distributed tracing: what happened to that specific request, across every service it touched
  • Prometheus for metrics: system health, throughput, error rates over time
  • Grafana for dashboards on top of Prometheus

Three containers, a bit of configuration for OpenTelemetry instrumentation in each service, and I had eyes on the system.
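
For reference, the per-service bootstrap boils down to something like this. A minimal sketch: the service name and the Jaeger OTLP endpoint are assumptions, not my exact config.

```ts
// tracing.ts -- loaded before the NestJS app starts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'auth-service', // one name per microservice
  traceExporter: new OTLPTraceExporter({
    // modern Jaeger ingests OTLP directly; 'jaeger' is the Compose hostname
    url: 'http://jaeger:4318/v1/traces',
  }),
  // auto-instruments HTTP, gRPC, pg, and friends out of the box
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```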

The smoke test looked fine

First test: one virtual user, 30 seconds, hitting every endpoint in sequence. Register, login, get profile, create blog, list blogs, get blog, create comment, list comments.
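
In k6 terms, roughly this. A sketch: paths, payloads, and field names like `accessToken` are assumptions, not the exact script.

```ts
// smoke.ts -- one VU walking the whole API in sequence
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 1,
  duration: '30s',
  thresholds: {
    http_req_duration: ['p(95)<500'], // the p95-under-500ms bar
    http_req_failed: ['rate<0.01'],
  },
};

const BASE = 'http://localhost:3000'; // API gateway
const JSON_HEADERS = { 'Content-Type': 'application/json' };

export default function () {
  const creds = JSON.stringify({
    email: `smoke-${__ITER}@test.dev`,
    password: 'pass1234',
  });

  http.post(`${BASE}/auth/register`, creds, { headers: JSON_HEADERS });
  const login = http.post(`${BASE}/auth/login`, creds, { headers: JSON_HEADERS });
  check(login, { 'login ok': (r) => r.status === 201 });

  const auth = { headers: { Authorization: `Bearer ${login.json('accessToken')}` } };
  http.get(`${BASE}/users/me`, auth);
  // ...create blog, list blogs, get blog, create comment, list comments,
  // same pattern with the bearer token
  sleep(1);
}
```

The result: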

```txt
http_req_duration p(95)=111.45ms
http_req_failed   rate=0.00%
```

Everything green. 209 requests, zero failures, p95 well under the 500ms threshold. Postman would have told me the same thing: endpoints work, responses look right, move on.

I didn't move on.

Then I opened Jaeger

This is where things got interesting.

I pulled up the trace for POST /auth/login. Total duration: 104ms. I clicked into it expecting to see that time distributed across spans: gateway, auth service, database query, response back.

Instead I saw this: the top-level span was 104ms. The child spans underneath it, the ones being traced, added up to about 14ms. Somewhere between the parent and its children, 90ms had vanished.
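
Roughly what the waterfall looked like. The span names are illustrative; the totals are real:

```txt
POST /auth/login ....................... 104ms
  ├─ traced children (gateway hop,
  │   auth service, database query) .... ~14ms combined
  └─ ??? ............................... ~90ms missing
```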

Postman showed me 104ms and called it a day. Jaeger showed me 104ms with a 90ms hole in it and asked me what I was going to do about it.

Hunting the ghost

The missing time had to be something the tracer didn't know about. Something happening inside the service that hadn't been instrumented. I looked at the auth flow: what does login actually do?

It calls bcrypt.compare. Password verification. I'd been running bcrypt at cost factor 10, which means 2^10 = 1,024 rounds of hashing. In a single request from Postman, that's imperceptible. Under concurrent load, with every virtual user doing it simultaneously, it chews through CPU and the latency stacks up.
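
The exponential scaling is easy to verify in isolation. A hypothetical standalone benchmark, not from the app:

```ts
// bench-bcrypt.ts -- each +1 in cost factor doubles the work,
// so cost 12 takes roughly 4x as long as cost 10
import * as bcrypt from 'bcrypt';

async function main() {
  for (const cost of [10, 11, 12]) {
    const start = performance.now();
    await bcrypt.hash('hunter2', cost);
    console.log(`cost ${cost}: ${(performance.now() - start).toFixed(1)}ms`);
  }
}

main();
```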

But I couldn't see it in Jaeger because bcrypt wasn't being traced. It was a black box inside my spans.

I added custom instrumentation for both bcrypt calls: the compare on login, and the hash on refresh token generation:

```ts
import * as bcrypt from 'bcrypt';
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('auth-service');

// wrap the login-path compare in its own span so it shows up in Jaeger
const valid = await tracer.startActiveSpan('bcrypt.compare', async (span) => {
  try {
    return await bcrypt.compare(password, user.hashedPassword);
  } finally {
    span.end();
  }
});

// same for hashing the refresh token before it's stored
const hashed = await tracer.startActiveSpan('bcrypt.hash (refresh-token)', async (span) => {
  try {
    return await bcrypt.hash(refreshToken, 10);
  } finally {
    span.end();
  }
});
```

Ran the smoke test again and opened Jaeger.

There they were. bcrypt.compare: 51.71ms. bcrypt.hash for the refresh token: 49.83ms. Together they explained the missing time. The ghosts had names now.

That's what Jaeger taught me that Postman never could: not just that something was slow, but exactly where inside the request the time was going. Down to the specific function call, with a millisecond number attached.

A decision about load testing

Once I knew bcrypt was the bottleneck, I had to think about what I was actually trying to measure. If I include auth in my stress tests, the bcrypt cost dominates the numbers and obscures the real service latency.

But in production, the auth service would run in its own pod with its own CPU. It wouldn't compete with the blog service for resources. So testing them together on one machine gives me numbers that don't reflect production reality.

I restructured the stress tests to pre-authenticate: generate tokens before the test run, then use those tokens during it. That way the stress test measures actual hot-path latency, completely decoupled from bcrypt. When I deploy to separate pods, the numbers will be representative.
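
In k6 that means doing the logins in setup(), which runs once before the load starts and isn't counted in the test metrics. A sketch under the same assumptions as before:

```ts
// stress.ts -- pre-authenticate so bcrypt never touches the hot path
import http from 'k6/http';

const BASE = 'http://localhost:3000';
const JSON_HEADERS = { 'Content-Type': 'application/json' };

export function setup() {
  // log in a pool of test users up front; this runs outside the
  // measured window, so the bcrypt cost doesn't pollute the numbers
  const tokens: string[] = [];
  for (let i = 0; i < 50; i++) {
    const res = http.post(
      `${BASE}/auth/login`,
      JSON.stringify({ email: `user-${i}@test.dev`, password: 'pass1234' }),
      { headers: JSON_HEADERS },
    );
    tokens.push(res.json('accessToken') as string);
  }
  return { tokens };
}

export default function (data: { tokens: string[] }) {
  // each VU reuses a pre-made token instead of logging in
  const token = data.tokens[__VU % data.tokens.length];
  http.get(`${BASE}/blogs`, { headers: { Authorization: `Bearer ${token}` } });
}
```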

The stress test results after that change:

```txt
http_req_duration p(95)=145.22ms  p(99)=196.95ms
http_req_failed   rate=0.00%
checks_succeeded  100.00%, 64192 out of 64192
```

151 requests per second. Zero failures. All thresholds green.

Then I deployed it

I figured if I was going to learn observability properly, I should see what it looks like on real infrastructure. I wrote Kubernetes manifests for all six services plus the observability stack, and deployed to GKE.

```txt
api-gateway        2 pods running
auth-service       1 pod running
blog-service       2 pods running
comment-service    2 pods running
email-service      1 pod running
postgres           1 pod running
redis              1 pod running
jaeger             1 pod running
prometheus         1 pod running
grafana            1 pod running
```

Then ran the stress test against the live deployment, from Kathmandu, hitting a server in Delhi.

```txt
http_req_duration p(95)=258.68ms
http_req_failed   rate=0.00%  (2 out of 53394)
http_reqs         126/second
```

258ms p95 including the Nepal-to-Delhi round trip, which alone is around 80ms, so real application latency is closer to 175ms. 126 requests per second sustained works out to about 10.9 million requests a day; assuming a typical user makes a few dozen requests per day, that's roughly 200k–500k users per day of capacity, and that's with GKE's horizontal pod autoscaler already doing the heavy lifting. My original goal was 20k users per day. The autoscaler quietly scaled to handle 15x that without me touching anything.

The 2 failed requests out of 53,394 were expected: during the stress test ramp-up, the HPA was spinning up new pods, and a couple of requests hit a pod before it was ready to serve. A readiness probe with a proper initialDelaySeconds fixes that. Not a bug, just something to configure properly before real production traffic.
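
Something like this in each Deployment spec. A sketch: the health path and timings are assumptions, not my exact manifest.

```yaml
# keep the pod out of the Service's endpoints until the app answers
readinessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10   # give NestJS time to boot and connect to deps
  periodSeconds: 5
  failureThreshold: 3
```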

What I actually learned

Postman tells you your code works. Jaeger tells you how it works: where time is going, which span is slow, what's hiding inside your black boxes.

The distinction I didn't fully understand before today: Jaeger is a diagnostic tool. You use it during development and UAT to trace specific requests, find bottlenecks, understand the shape of your system under load. Prometheus is a production monitoring tool. It's what stores your metrics over time, fires alerts when things go wrong, and wakes you up at 3am when your error rate spikes.

When I have real users and something goes wrong at midnight, Prometheus is what tells me something is wrong. Jaeger is what I open to figure out what.
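
The Prometheus half of that is alerting rules like this. A sketch: the metric name depends on what the services actually export.

```yaml
# alert when more than 1% of requests fail for five straight minutes
groups:
  - name: availability
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 1% for 5 minutes"
```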

In production you can't store every trace; the volume is too high. You sample probabilistically, maybe 1-5%, and keep everything that's slow or errored. The rest you let go. But the infrastructure is there when you need it.
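
The probabilistic half is a one-line change in the SDK setup (keeping the slow and errored traces is tail sampling, which needs a collector in front, so this is only the simple part; the 5% is just an example):

```ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-node';

const sdk = new NodeSDK({
  // sample 5% of new traces; child spans follow their parent's decision,
  // so a trace is never half-recorded across services
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.05),
  }),
  // ...exporter and instrumentations as before
});
```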

I set out today wanting to understand observability. I ended up deploying a microservices system to Kubernetes, finding a hidden 90ms performance issue that would have been invisible in Postman forever, and building something that can handle half a million users a day.

Not a bad day, I'd say.