Research · Monday, April 20, 2026 · 8 min read

Moonshot AI and Tsinghua Researchers Propose PrfaaS: A Cross-Datacenter KVCache Architecture that Rethinks How LLMs are Served at Scale

AI Agents Daily
Curated by AI Agents Daily team · Source: MarkTechPost

According to MarkTechPost's coverage, a joint team from Moonshot AI and Tsinghua University published a paper titled "Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter" on arXiv on April 16, 2026. The paper, assigned arXiv identifier 2604.15039v1, introduces PrfaaS, a system designed to break one of the most stubborn physical constraints in large-scale LLM serving. The research team includes Ruoyu Qin, Weiran He, Yaoyu Wang, Zheming Li, and Xinran Xu from Moonshot AI, with Yongwei Wu, Weimin Zheng, and Mingxing Zhang from Tsinghua University. Zhang serves as the corresponding author.

Why This Matters

The inference cost problem is the defining infrastructure challenge of 2026, and this research attacks it at the architectural level rather than through incremental optimization. Mooncake, Moonshot AI's production system, already demonstrated that smarter KVCache management can increase request capacity by 59 to 498 percent compared to baseline methods. PrfaaS extends that logic across datacenters, which means organizations could stop buying redundant GPU clusters in the same building and start distributing workloads geographically. For any company running multi-agent pipelines at scale, where dozens of agents hammer the same underlying model simultaneously, the ability to share KVCache computations across sites is not a nice-to-have feature; it is a serious cost-reduction opportunity.


The Full Story

For years, the way LLMs handle inference has been confined by a simple physical reality. The prefill stage, where a model processes your entire input prompt, and the decode stage, where it generates output tokens one at a time, both depend on KVCache, which are key-value tensors that store intermediate computations. Moving that KVCache data fast enough requires RDMA (Remote Direct Memory Access) networking, a technology that works brilliantly inside a single datacenter but becomes impractical across long distances. The result: prefill and decode have been locked together in the same building, sometimes the same rack.
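The scale of that constraint is easy to see with a back-of-the-envelope calculation. The sketch below is illustrative only: the ~24 GB cache size and the link speeds are assumptions chosen to show why intra-datacenter RDMA made cross-site transfer look hopeless, not figures from the paper.

```python
def transfer_seconds(payload_bytes, link_gbps):
    # Bits over the wire divided by link rate; ignores latency and protocol overhead.
    return payload_bytes * 8 / (link_gbps * 1e9)

# Illustrative KVCache for a long prompt on a dense-attention model (assumed size).
cache = 24e9  # ~24 GB

print(f"intra-DC RDMA @ 400 Gbps  : {transfer_seconds(cache, 400):.2f} s")
print(f"cross-DC Ethernet @ 10 Gbps: {transfer_seconds(cache, 10):.1f} s")
```

Under those assumptions the same payload moves in about half a second inside a datacenter but takes nearly twenty seconds between sites, which is why prefill and decode stayed co-located.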

The Moonshot AI and Tsinghua team saw an opening. The rise of hybrid-attention architectures, which mix standard attention mechanisms with newer, more memory-efficient approaches, has meaningfully shrunk the size of KVCache compared to traditional dense-attention models. Smaller KVCache means less data needs to travel between systems. The researchers reasoned that if the payload is small enough, you might be able to move it over ordinary Ethernet rather than RDMA, opening the door to genuine cross-datacenter serving.
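A rough sketch shows why hybrid attention changes the math. The model dimensions below (60 layers, 8 KV heads, head dimension 128, fp16, a 4,096-token sliding window on three quarters of the layers) are hypothetical placeholders, not the paper's configuration; the point is only how sharply windowed layers cap cache growth on long prompts.

```python
def kvcache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # 2x covers the key tensor and the value tensor, per layer, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def prompt_cache_bytes(prompt_tokens, n_layers, full_attn_layers,
                       window, n_kv_heads=8, head_dim=128):
    """Hybrid model: full-attention layers cache every prompt token,
    sliding-window layers cache at most `window` tokens each."""
    per_layer = kvcache_bytes_per_token(1, n_kv_heads, head_dim)
    full = full_attn_layers * per_layer * prompt_tokens
    windowed = (n_layers - full_attn_layers) * per_layer * min(prompt_tokens, window)
    return full + windowed

prompt = 100_000  # tokens
dense  = prompt_cache_bytes(prompt, 60, 60, window=0)      # every layer full attention
hybrid = prompt_cache_bytes(prompt, 60, 15, window=4_096)  # 1 in 4 layers full attention

print(f"dense : {dense / 1e9:.1f} GB")   # ~24.6 GB
print(f"hybrid: {hybrid / 1e9:.1f} GB")  # ~6.9 GB
```

In this toy setup the hybrid cache is well under a third of the dense one, shrinking the payload toward something ordinary Ethernet can plausibly carry.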

But smaller data alone does not solve the problem in practice. Real production workloads are messy. Request patterns arrive in unpredictable bursts. Some prompts contain thousands of tokens and others contain just a handful. Prefix caches, the stored computations that let the system skip repeated work, are unevenly distributed across different datacenters. Network bandwidth between clusters fluctuates constantly. A naive approach that simply ships all prefill work to a remote cluster would collapse under these conditions, creating congestion and unstable queuing.

PrfaaS addresses this by being selective rather than wholesale. The system offloads only long-context prefill tasks, the computationally expensive ones that justify the network overhead, to specialized compute-dense remote clusters. It then uses bandwidth-aware scheduling to manage how and when KVCache data moves across the network. The decode stage stays closer to the end user, and the two stages coordinate through a scheduling layer designed around the realities of fluctuating inter-cluster bandwidth.
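The selective logic can be caricatured as a cost comparison. This is a toy heuristic to convey the idea, not the paper's scheduler: the token-throughput figures, the 8,192-token cutoff, and the link speed are all invented for illustration.

```python
def should_offload(prompt_tokens, cache_bytes, link_gbps,
                   local_prefill_tok_per_s=2_000,
                   remote_prefill_tok_per_s=10_000,
                   min_tokens=8_192):
    """Toy heuristic: offload a prefill only when the prompt is long enough
    and the compute time saved on a faster remote cluster exceeds the time
    to ship the resulting KVCache back over the inter-cluster link."""
    if prompt_tokens < min_tokens:
        return False  # short prompts: network overhead dominates any savings
    local_s  = prompt_tokens / local_prefill_tok_per_s
    remote_s = prompt_tokens / remote_prefill_tok_per_s
    transfer_s = cache_bytes * 8 / (link_gbps * 1e9)
    return remote_s + transfer_s < local_s

# 100k-token prompt with a ~7 GB hybrid-attention cache over a 25 Gbps link:
# local 50 s vs. remote 10 s + ~2.2 s transfer, so offloading wins.
print(should_offload(100_000, 7e9, 25))   # True
print(should_offload(4_000, 0.3e9, 25))   # False: below the long-context cutoff
```

A real scheduler would also weigh queue depth, prefix-cache hits at each site, and measured bandwidth, which the paper describes as fluctuating constantly; the sketch only captures the length-and-bandwidth trade-off.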

This architecture builds directly on Moonshot AI's Mooncake platform, which won Best Paper at the USENIX FAST 2025 conference. Mooncake already runs in production for Kimi, Moonshot AI's chatbot service, processing over 100 billion tokens daily across thousands of nodes. In deployments using NVIDIA A800 and H800 GPU clusters, Mooncake allows Kimi to handle 115 percent and 107 percent more requests respectively compared to the systems it replaced. PrfaaS is the next logical step, taking the disaggregation principle that Mooncake proved at the cluster level and extending it to span entire datacenters.

Key Details

  • The paper was published on arXiv on April 16, 2026, under identifier 2604.15039v1.
  • The research team includes 8 researchers split between Moonshot AI and Tsinghua University, with Mingxing Zhang as corresponding author.
  • Mooncake, the predecessor system, processes over 100 billion tokens per day across thousands of nodes in production.
  • Mooncake improved Kimi's request capacity by 115 percent on NVIDIA A800 clusters and 107 percent on H800 clusters compared to prior systems.
  • Testing with real production traces showed Mooncake increased effective request capacity by 59 to 498 percent over baseline methods while meeting Service Level Objectives.
  • Mooncake received the Best Paper award at USENIX FAST 2025, one of the most competitive systems conferences in computing.
  • PrfaaS uses standard Ethernet for cross-datacenter KVCache transport, removing the dependency on expensive RDMA infrastructure.

What's Next

The immediate question is whether PrfaaS moves from arXiv paper to production deployment inside Kimi's infrastructure, which would give the research community hard performance numbers on a system running at genuine scale. Watch for follow-up benchmarks comparing cross-datacenter latency against the RDMA-bound single-datacenter baseline, because that data will determine whether other cloud providers treat this as a blueprint worth copying. Given that Moonshot AI has a track record of shipping production systems that back up their research claims, with Mooncake being the clearest example, enterprises building AI agent platforms should be tracking this work closely.

How This Compares

The broader context here is a wave of disaggregated inference research that has accelerated through 2025 and into 2026. Prefill-decode disaggregation itself is not new. Systems like DistServe and Splitwise explored separating these two phases within a single cluster, and major cloud providers have implemented variations of the idea in their managed inference services. What PrfaaS does differently is cross the datacenter boundary, which none of those earlier systems seriously attempted.

Google's infrastructure for Gemini models operates across multiple datacenter regions, but Google has the luxury of proprietary high-bandwidth interconnects between its datacenters, a resource most organizations cannot access. PrfaaS makes a practical argument for commodity Ethernet as the transport layer, which is a genuinely different bet on what the industry will actually deploy. If that bet is right, it opens cross-datacenter disaggregation to a much wider set of operators beyond hyperscalers with custom networking hardware.

Compared to recent AI infrastructure news around speculative decoding and continuous batching optimizations, the PrfaaS work operates at a higher architectural level. Those techniques optimize the compute happening inside a single cluster. PrfaaS changes the topology of where compute happens at all. For multi-agent workloads specifically, where shared KVCache between agents running simultaneously could eliminate enormous amounts of redundant computation, this architectural shift has implications that go well beyond single-model serving.

FAQ

Q: What is KVCache and why does it matter for AI performance? A: KVCache stands for key-value cache, which stores intermediate computational results during the process of generating text with a large language model. Reusing these stored results instead of recomputing them from scratch saves significant time and GPU resources. Faster, more efficient KVCache management is one of the primary ways engineers speed up LLM inference without buying more hardware.

Q: What does prefill disaggregation mean in plain language? A: When you send a long prompt to an LLM, the model first reads and processes your entire input before it starts generating a response. Prefill disaggregation means running that reading step on a separate, specialized set of computers rather than the same machines that generate the output. PrfaaS extends this so those specialized computers can be in a completely different datacenter.

Q: How does PrfaaS improve on what Mooncake already does? A: Mooncake disaggregates prefill and decode within a single datacenter cluster using high-bandwidth RDMA networking. PrfaaS removes the requirement for RDMA and enables the same disaggregation across geographically separate datacenters over standard Ethernet. This makes the architecture more flexible and potentially cheaper to operate, since organizations are not forced to co-locate all their compute in one physical facility.

The LLM infrastructure space is moving faster than most people realize, and papers like this one from Moonshot AI and Tsinghua signal that the architectural assumptions underlying today's production systems are not permanent. Engineers building agent systems that depend on inference at scale should read the full paper on arXiv before the production benchmarks arrive.

Our Take

This story matters because it attacks inference cost at the architectural level rather than through incremental tuning. If PrfaaS's cross-datacenter disaggregation holds up in production the way Mooncake did, the assumption that prefill and decode must live in the same facility stops being a constraint, and that could reshape how developers provision infrastructure for agentic systems in the coming months.

