
I wanted Kimi k2.5 to work for my email monitor. It didn't.

A real-world benchmark on Kimi k2.5 vs other models for recurring assistant workflows, where reliability at cadence mattered more than benchmark hype.

Tags: AI, LLMs, Kimi, OpenClaw, Automation, Model Benchmarking, Reliability

I gave Kimi k2.5 a real shot.

Not a toy prompt or a one-off benchmark screenshot. I tested it in a recurring workflow that actually matters to my day: email monitoring.

I made Kimi my default for a while because it was free. I have Claude Max and use it heavily for daily work, and I didn't want OpenClaw burning through usage for background jobs if I could avoid it.

That was the theory.

In practice, I ran into something I've seen over and over with AI in production: a model can be smart, accurate, and still be the wrong choice.

Why this mattered to me in the first place

I run a lot through AI now. Writing, planning, triage, coding support, assistant workflows. That's not novelty for me anymore; it's operational.

So model selection isn't abstract. It has direct impact on:

  • whether I get useful notifications when I actually need them,
  • whether background jobs complete or silently fail,
  • whether the assistant is something I trust or something I babysit.

I wasn't trying to crown a favorite model. I was trying to make my system more sustainable.

The starting thought was simple: if Kimi is free and "good enough," use it for recurring background jobs and save paid capacity for higher-value interactive work.

It was a reasonable idea and a good instinct on cost discipline.

But production doesn't care about instinct. It cares about results under load.

The specific workflow I tested

The target workflow was my recurring email monitor.

Success criteria were straightforward:

  1. Run every 30 minutes.
  2. Pull recent unread emails across inboxes.
  3. Compare senders to a VIP sender database.
  4. Pull relevant context via tools.
  5. Notify me about what needs response/attention, with context that helps me act.
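The five steps above can be sketched as a small loop. This is a minimal illustration, not the actual monitor: the helper names (`fetch_unread`, `gather_context`, `notify`) and the VIP set are hypothetical placeholders for the real inbox, database, and notification integrations.

```python
import time

# Stand-in for the VIP sender database (step 3); purely illustrative addresses.
VIP_SENDERS = {"boss@example.com", "bigclient@example.com"}

def run_monitor_once(fetch_unread, gather_context, notify):
    """One pass of the monitor: steps 2-5 from the list above."""
    for email in fetch_unread():                       # step 2: recent unread across inboxes
        if email["sender"] in VIP_SENDERS:             # step 3: compare against VIP database
            context = gather_context(email["sender"])  # step 4: calendar/notes/memories lookup
            notify(f"VIP email from {email['sender']}: {email['subject']}", context)

def main_loop(**deps):
    """Step 1: run the pass every 30 minutes."""
    while True:
        run_monitor_once(**deps)
        time.sleep(30 * 60)
```

The interesting work (and the runtime cost discussed below) lives inside steps 2-5, which is why per-run latency dominates everything once the loop is on a cadence.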

That context layer mattered. I didn't just want "You got an email from X."

I wanted actionable context like:

  • what else is on my calendar this week,
  • whether there are recent notes/memories related to the sender,
  • enough signal to decide quickly: respond now, defer, or ignore.

This is exactly the kind of thing AI should be good at when integrated well.

My assumption that turned out half-right

At first, I told myself speed wasn't critical.

If I got notified within an hour, that felt acceptable for email monitoring. This wasn't emergency incident response. It was workflow triage.

That assumption was partly right and partly wrong.

It was right that I didn't need sub-second latency.

It was wrong that long runtimes wouldn't create reliability problems downstream.

Because once you move from single runs to recurring cadence, runtime compounds.

And compounded runtime exposes all the fragility you can hide in one-off demos.

The first red flags (before hard failures)

The first signal wasn't a metrics chart. It was behavior.

I started sending requests in my main session, locking my phone, setting it down, and waiting for notifications because responses were taking so long.

Not catastrophic, but definitely friction—and friction is usually the warning shot before failure.

After that came the hard evidence: repeated notifications that background jobs were failing.

That's where the conversation changed.

Slowness is annoying.

Missed or failed jobs in a workflow designed to prevent dropped balls are a non-starter.

The benchmark that made the decision obvious

Kimi eventually completed a full run in 3m37s (217 seconds) with perfect classification.

If I had stopped there, I probably could've convinced myself it was fine.

But comparison is what matters.

Here are the final test results:

| Model        | Time | VIP Detection | Filtering  | Cost  | Verdict   |
|--------------|------|---------------|------------|-------|-----------|
| Gemini Flash | 6s   | ✅ Perfect    | ✅ Perfect | Quota | 🏆 BEST   |
| GPT-4.1      | 6s   | ✅ Perfect    | ✅ Perfect | Paid  | 🏆 BEST   |
| Gemini Pro   | 7s   | ✅ Perfect    | ✅ Perfect | Quota | Excellent |
| Sonnet 4.5   | 16s  | ✅ Perfect    | ✅ Perfect | Paid  | Good      |
| Kimi k2.5    | 217s | ✅ Perfect    | ✅ Perfect | Free  | TOO SLOW  |

All models produced the same classification output:

  • 1 VIP
  • 3 important
  • 6 skip

So this wasn't a quality gap.

It was a runtime profile mismatch.

Why 217 seconds is worse than it looks

In isolation, 3m37s doesn't sound outrageous.

At cadence, it's brutal.

This monitor runs 48 times per day.

  • Gemini Flash at 6s/run = 288 seconds/day (~4.8 minutes)
  • Kimi at 217s/run = 10,416 seconds/day (~2.9 hours)
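The cadence math above is just multiplication, but it is worth making explicit, because this is the calculation that turned "3m37s is fine" into "2.9 hours a day is not":

```python
RUNS_PER_DAY = 48  # one run every 30 minutes

def daily_cost_seconds(seconds_per_run: float) -> float:
    """Total model runtime per day at this cadence."""
    return seconds_per_run * RUNS_PER_DAY

flash_daily = daily_cost_seconds(6)    # Gemini Flash
kimi_daily = daily_cost_seconds(217)   # Kimi k2.5

print(flash_daily, flash_daily / 60)   # 288 seconds, 4.8 minutes
print(kimi_daily, kimi_daily / 3600)   # 10416 seconds, ~2.9 hours
```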

Same answer quality.

One costs minutes.

One costs hours.

And that gap isn't just inconvenience. It introduces secondary failures:

  • timeout risk rises,
  • retries multiply overhead,
  • stale context accumulates,
  • confidence in alerts drops.

At that point, "free" becomes expensive in a different currency: reliability and attention.

What changed when we fixed the prompt

One important note: this wasn't just a bad-prompt experiment.

During this process, I simplified and clarified the prompt significantly because part of the initial pain was instruction bloat and too much context processing.

Even after those fixes, Kimi still wasn't viable for this recurring monitor in my environment.

That's important because it's easy to hand-wave failures as prompt mistakes.

Sometimes that is true.

Sometimes you fix the prompt and still have a model-workload mismatch.

This was the second case.

The architecture lesson hidden inside the model test

The model benchmark surfaced an architecture issue too.

Some of the early monitor design was doing too much per run:

  • broad context pulls,
  • heavy synthesis,
  • duplicated responsibilities.

That inflated token/runtime pressure and made any slower model even less viable.

So the lesson wasn't "just use model X."

It was:

  1. tighten the workflow,
  2. reduce unnecessary context churn,
  3. then choose the model that best fits the cleaned-up task.

If you skip steps 1 and 2, you'll blame the model for architecture debt.

What the wider conversation on X got right (and missed)

While I was testing, I also looked at what people were saying about Kimi k2.5 publicly.

The release narrative was strong:

  • Moonshot positioned Kimi k2.5 as an open visual/agentic model with top benchmark performance.
  • Third-party posts highlighted strong scores on agentic and multimodal benchmarks.
  • There was obvious excitement around open-weight competitiveness and pricing.

A few examples worth noting:

  • Kimi/Moonshot launch posts and model claims on X
  • Artificial Analysis commentary on k2.5’s open-model position
  • Community benchmark reaction threads calling it frontier-adjacent

None of that is fake signal. Benchmarks matter.

But benchmarks answer a different question than recurring operations.

Benchmarks ask: "How capable is this model on this test set?"

Production asks: "Can this workflow complete reliably at the cadence I need with acceptable latency and overhead?"

Those are related questions, not identical questions.

Kimi looked strong in public benchmark discourse.

In my recurring monitor loop, it still failed the reliability/runtime bar.

Both can be true.

Counter-signal: Kimi may still be the best open-model option in some tests

I also ran a separate, more constrained benchmark and now have consolidated results from it.

The task was intentionally constrained: execute a real calendar tool call and return strict JSON only. That forces models to do tool use, data extraction, contract compliance, and concise factual summarization in one pass.
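The "strict JSON only" part of that contract is mechanically checkable, which is what makes pass/fail unambiguous. A minimal sketch of such a validator, assuming an illustrative schema (`event_count` and `upcoming` are placeholder field names, not the actual ones from my test):

```python
import json

# Hypothetical required fields; the real task's schema differed.
REQUIRED_KEYS = {"event_count", "upcoming"}

def passes_contract(raw_output: str) -> bool:
    """True only if the model returned parseable JSON containing the required keys.

    Prose preambles, markdown fences, and truncated output all fail
    at the parse step, which is exactly the discipline the task tests.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS <= set(data)
```

A check like this is why "⚠️ partial" and "❌ fail" below are meaningful categories rather than vibes: the output either satisfies the contract or it doesn't.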

Group A: NVIDIA-hosted open-model slice

  • Kimi k2.5: 157.4s, 10.4k tokens, ✅ clean pass
  • Llama 3.1 8B / 70B / 405B + Llama 3.2 90B Vision: ❌ failed (tool/provider handling and/or schema non-compliance)
  • Pass rate: 1/5 (20%)
  • Key point: Kimi was the only model in this slice that produced a correct, schema-compliant output.

Group B: frontier-model slice (for comparison)

  • GPT-4.1: 15s, 10.7k tokens, ✅ pass
  • Claude Opus 4.6: 18s, ✅ pass
  • o4-mini: 21s, 22.9k tokens, ✅ pass
  • Claude Sonnet 4.6: 29s, ⚠️ partial (included an in-progress event in the upcoming list)
  • Gemini 2.5 Pro: 28s, 30.9k tokens, ❌ fail (incorrect counts/details)
  • Strict pass rate: 3/5 (60%)

So yes, Kimi still failed my recurring email-monitor fit test. But in this separate reliability benchmark, it outperformed the other open models in the tested NVIDIA slice on correctness + schema discipline.

That’s exactly why this post isn't anti-Kimi.

It's pro task-model fit.

Caveats (important)

  • This is a single-task benchmark, not a universal ranking.
  • Results are sensitive to orchestration environment, tool-wrapper behavior, and prompt strictness.
  • For publication-grade confidence, this should include at least:
    • 3 trials per model
    • a second task class (for example, Todoist write + verification)

Both of these statements are true:

  • Kimi may be the most reliable open-model option for certain constrained agent tasks.
  • Kimi can still be the wrong model for high-frequency operational loops where latency and completion reliability at cadence are dominant constraints.

Where this landed for me

For this workflow, Gemini Flash was the clear winner.

Not because it's fashionable.

Because it delivered:

  • identical accuracy,
  • dramatically lower latency,
  • better completion reliability at cadence,
  • acceptable quota impact.

So yes, cost matters. But only after the workflow is reliable.

If your automation can't be trusted, "saving" money is fake savings.

Am I anti-Kimi now?

No.

I'm not anti-Kimi. I just haven't found a reliable production fit for it in my current setup.

I've tried Kimi in sub-agent content/orchestration use cases too and still saw timeout/reliability issues. I also had an earlier X-thread workflow where it missed a stop signal, hallucinated messages, and responded to those hallucinations.

That doesn't mean nobody should use it.

It means I won't force a model into a role where it repeatedly fails my reliability bar.

The model-selection framework I use now

I don't use a rigid "three checks" template anymore.

I use a comparative workflow framework:

Step 1: Define one concrete workflow

Not "general intelligence." One actual recurring job with clear input/output.

Step 2: Define success before testing

Set explicit targets for:

  • quality threshold,
  • acceptable latency window,
  • failure tolerance,
  • run cadence.

Step 3: Run the same task across multiple models

OpenClaw sub-agents are great for this because you can execute the same test setup across models quickly.

Step 4: Compare the four metrics that matter

  • accuracy,
  • runtime,
  • token usage,
  • token cost.

Step 5: Pick for workload fit, not model brand

The fastest model that clears your quality threshold usually wins for recurring operations.

For deep one-off reasoning, the answer may differ.

But for background loops, this framing has saved me a lot of pain.
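Steps 2 through 5 reduce to a small selection rule: define the thresholds first, then take the fastest model that clears them. A sketch under illustrative assumptions; the result fields and numbers below are drawn from the email-monitor table earlier in the post, not from a general harness.

```python
def pick_model(results, max_latency_s, min_accuracy=1.0):
    """Return the fastest model meeting the quality and latency thresholds, or None."""
    viable = [r for r in results
              if r["accuracy"] >= min_accuracy and r["latency_s"] <= max_latency_s]
    return min(viable, key=lambda r: r["latency_s"])["model"] if viable else None

# Step 3's output: same task across models (numbers from the table above).
benchmark = [
    {"model": "Gemini Flash", "latency_s": 6,   "accuracy": 1.0},
    {"model": "GPT-4.1",      "latency_s": 6,   "accuracy": 1.0},
    {"model": "Sonnet 4.5",   "latency_s": 16,  "accuracy": 1.0},
    {"model": "Kimi k2.5",    "latency_s": 217, "accuracy": 1.0},
]

# With a 30-second latency budget, Kimi is filtered out before speed is even compared.
print(pick_model(benchmark, max_latency_s=30))
```

The point of encoding it this way is that the decision stops being about model brand: change the thresholds and the same rule can return a different winner for a different workload.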

The practical rule I walked away with

If two models give me the same correct output, I choose the one that protects reliability at scale.

In this case, that was Gemini Flash.

Kimi wasn't disqualified on intelligence.

It was disqualified on operational fit.

And honestly, that's a healthier way to evaluate models anyway.

Not "Which model do I like?"

But "Which model keeps my system trustworthy when it runs all day, every day?"

That's the standard that matters for assistants that are supposed to reduce dropped balls, not create new ones.

A practical takeaway for builders running recurring jobs

If your automation runs all day, set your bar with production constraints, not benchmark excitement.

A simple rule set:

  • If quality is equal, prefer lower latency.
  • If latency is acceptable but failures are frequent, reject the model for that workflow.
  • If a model is "free" but causes retries and missed alerts, treat that as real cost.
  • Re-test after prompt/architecture simplification before finalizing your decision.

That sounds obvious written out. It's much less obvious when you're in the middle of tuning and hoping one more tweak will fix everything.

I definitely went through that loop.

The benchmark and cadence math made the decision clear.

For my email monitor, Gemini Flash is the right operational tool right now.