The Great UPI Outage Debacle: Why NPCI's Root Cause Analysis Falls Short

Exposing the architectural and design flaws in NPCI's UPI system that led to the 2025 transaction outages, questioning the official root cause analysis, and proposing an enhanced system design.

Apr 16, 2025

The Hidden Truth Behind India's Payment Crisis

Just yesterday I published an article on what I thought could have been the reasons behind the UPI outages we saw in rapid succession. I did that simply because I was a little miffed at the silent treatment NPCI had meted out to us. It also stemmed from the fact that while the industry has been talking about reviving the NUE (a concept I too spoke of in my article), I realize that setting one up will need a mammoth amount of money to overcome the inertia that UPI has built up over its decade-long run.

And then late last evening, I read the news on the "Official root cause analysis of the intermittent UPI outages" from NPCI. It just didn't make sense.

So, here I am, with an emergency edition of the Fintech Chronicler, deep diving into the technicalities of the "flooding of 'Check transaction' API" requests from Payment Service Provider (PSP) banks, and trying to work out whether the problem stems from fundamental flaws in the UPI system's architecture, design patterns, and operational protocols. And then putting on my builder cap, to see what kind of system design changes could prevent something like this from happening again.

So let's dig in.

The NPCI Explanation: A Surface-Level Analysis

According to NPCI's root cause analysis, the April 12 UPI outage occurred because:

PSP banks were flooding the system with "Check transaction" API requests
These requests included checks for older transactions, sent repeatedly
PSP banks did not wait for responses before sending additional requests
The system became congested, leading to prolonged downtime

While these observations might be technically accurate, they represent symptoms rather than the root cause of the problem. The explanation fails to address fundamental architectural questions about why the system was vulnerable to such a simple failure mode in the first place.

The Missing Throttling Mechanisms

The first critical oversight in NPCI's system design involves the absence of appropriate throttling mechanisms at the transaction ID level. In modern high-volume payment systems, transaction-specific rate limiting is standard practice to prevent exactly this type of cascade failure.

Why Transaction-Level Throttling Is Essential

Proper API design includes rate limiting at multiple levels:

Global rate limits across the entire API
Service-specific rate limits
Endpoint-specific rate limits
Transaction ID-specific rate limits

The last item is particularly crucial for "Check transaction" API calls. By implementing a simple rule that limits repeat checks on the same transaction ID to once per 90 seconds (as per NPCI's own guidelines), the system could have automatically rejected excessive requests without overwhelming the core processing infrastructure.
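To make the idea concrete, here is a minimal Python sketch of a per-transaction-ID throttle sitting in front of the status-check endpoint. It is an illustration under stated assumptions, not NPCI's implementation: the class name `PerTransactionThrottle`, the in-memory dictionary, and the injectable `clock` are all hypothetical, and the 90-second default simply mirrors the interval mentioned above.

```python
import time
from typing import Callable, Dict


class PerTransactionThrottle:
    """Allow at most one status check per transaction ID per interval.

    Hypothetical sketch: a real gateway would keep this state in a shared
    store and evict old entries; here it lives in a plain dict.
    """

    def __init__(self, min_interval_s: float = 90.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval_s = min_interval_s
        self.clock = clock
        self._last_allowed: Dict[str, float] = {}  # txn_id -> time of last accepted check

    def allow(self, txn_id: str) -> bool:
        now = self.clock()
        last = self._last_allowed.get(txn_id)
        if last is not None and now - last < self.min_interval_s:
            # Too soon: reject at the edge, never touching the core system.
            return False
        self._last_allowed[txn_id] = now
        return True


if __name__ == "__main__":
    fake_now = [0.0]  # controllable clock so the demo does not need to sleep
    throttle = PerTransactionThrottle(90.0, clock=lambda: fake_now[0])
    print(throttle.allow("TXN-1"))  # first check is accepted
    print(throttle.allow("TXN-1"))  # immediate repeat is rejected
    fake_now[0] = 90.0
    print(throttle.allow("TXN-1"))  # accepted again once the interval has passed
```

The point of the design is that a rejected request costs one dictionary lookup, so a PSP that retries aggressively burns gateway cycles rather than core processing capacity.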

As noted in payment industry best practices: "Fixed-rate limiting involves setting a fixed number of payments that can be processed per unit of time." This basic throttling mechanism should have been implemented from day one, especially for non-critical status check operations.

The OLTP vs. OLAP Architectural Separation

Check transactions are fundamentally different from payment processing transactions:

Payment processing is a mission-critical OLTP operation
Status checks are read-only analytical operations that don't modify state

These two types of operations should be handled by separate infrastructure:

OLTP Servers: Dedicated to processing new payment transactions
OLAP Servers: Dedicated to handling status checks, reports, and other read-only operations

Industry-standard practice supports this separation: "OLTP is a type of database processing focused on efficiently managing and executing a large number of short, online transactions in real-time," while "OLAP specializes in complex analytical queries" (Airbyte).

The fact that check transaction queries could bring down the entire UPI payment processing infrastructure indicates a serious architectural flaw in NPCI's system design, where these separate concerns were inappropriately combined.
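One way to picture the separation is as a bulkhead: payments and status checks each get their own bounded pool, so a flood on the read side cannot eat into payment capacity. The Python sketch below is only an illustration; the `Router` and `Pool` classes, the pool names, and the capacities are assumptions made for the example and say nothing about how NPCI's infrastructure is actually laid out.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    PAYMENT = "payment"            # state-changing, must never be starved
    STATUS_CHECK = "status_check"  # read-only, safe to shed under load


@dataclass
class Pool:
    """A bounded pool of request slots (hypothetical stand-in for a server tier)."""
    name: str
    capacity: int
    in_flight: int = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.capacity:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight = max(0, self.in_flight - 1)


class Router:
    """Send each request kind to its own pool so the two cannot interfere."""

    def __init__(self, payment_capacity: int, status_capacity: int) -> None:
        self.pools = {
            Kind.PAYMENT: Pool("payments-primary", payment_capacity),
            Kind.STATUS_CHECK: Pool("status-replica", status_capacity),
        }

    def admit(self, kind: Kind) -> str:
        pool = self.pools[kind]
        return pool.name if pool.try_acquire() else "rejected"


if __name__ == "__main__":
    router = Router(payment_capacity=2, status_capacity=3)
    # Simulate a flood of status checks: the excess is rejected...
    print([router.admit(Kind.STATUS_CHECK) for _ in range(5)])
    # ...while a payment arriving during the flood is still admitted.
    print(router.admit(Kind.PAYMENT))
```

Whether the read side is called OLAP or simply a read path, the property that matters is the one the demo shows: exhausting the status pool leaves the payment pool untouched.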

The Absent Caching Layer

The third and perhaps most glaring omission from NPCI's system architecture is the lack of a proper caching mechanism for transaction status information.

Modern Payment Systems Require Robust Caching

NPCI's explanation states that PSP banks were repeatedly checking "older transactions" - precisely the type of data that should be cached. A robust payment system architecture should include:

A transaction status cache with appropriate TTL (Time-To-Live) settings
A last-known-status response mechanism for completed transactions
A caching layer that serves repeated status checks without hitting the main database

"Payments always use both cached data for speed and persistent data for recoverability. Whenever there is caching, then it is important to have mechanisms to keep the cached data and the persistent data in sync." (Coinbase, via Crowdfund Insider)
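A status cache along these lines could be sketched as follows. This is a hedged illustration, not a description of any real UPI component: `StatusCache`, the `lookup` callback standing in for the core database, the status names in `TERMINAL`, and both TTL values are assumptions chosen for the example. The idea is that terminal statuses never change, so they can be cached for a long time, while pending ones get a short TTL so callers still see progress.

```python
import time
from typing import Callable, Dict, Tuple

# Assumed status names; terminal statuses are final and safe to cache for long.
TERMINAL = {"SUCCESS", "FAILURE"}


class StatusCache:
    """Serve repeated status checks from memory instead of the core database."""

    def __init__(self, lookup: Callable[[str], str],
                 terminal_ttl_s: float = 3600.0,
                 pending_ttl_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup              # hypothetical call into the core system
        self._terminal_ttl_s = terminal_ttl_s
        self._pending_ttl_s = pending_ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # txn_id -> (status, expires_at)

    def get(self, txn_id: str) -> str:
        now = self._clock()
        entry = self._entries.get(txn_id)
        if entry is not None and now < entry[1]:
            return entry[0]                # cache hit: the core system is not touched
        status = self._lookup(txn_id)      # cache miss or expired entry
        ttl = self._terminal_ttl_s if status in TERMINAL else self._pending_ttl_s
        self._entries[txn_id] = (status, now + ttl)
        return status


if __name__ == "__main__":
    backend_calls = []

    def core_lookup(txn_id: str) -> str:
        backend_calls.append(txn_id)
        return "SUCCESS"

    cache = StatusCache(core_lookup)
    for _ in range(1000):                  # a PSP hammering the same old transaction
        cache.get("TXN-1")
    print(len(backend_calls))              # how many requests reached the core system
```

In the demo, a thousand checks on the same completed transaction result in a single lookup against the core system, which is exactly the load profile NPCI described.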

[Figure: A quick recap of how UPI works in a five-party system]

Proper Architecture for Resilient Payment Systems

So if I had the chance to rewire NPCI's UPI architecture with just the information I have in hand, here is what I would do:

1. Multi-Tier Architecture with Separation of Concerns

Presentation Layer: API gateways with robust rate limiting and throttling
Service Layer: Separated OLTP (payment processing) and OLAP (status checking) services
Data Layer: Primary database for transactions with separate read replicas for status checks

2. Comprehensive Throttling Strategy

Global and local rate limits to protect overall system stability
Service-specific limits to prevent one service from overwhelming others
Transaction ID-based throttling to prevent repeated checks of the same transaction, or to redirect them to the cache service first, before they hit the core service APIs
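The layering of global and local limits can be sketched with token buckets. Token bucket is my choice of algorithm for the illustration, not something the NPCI analysis mentions, and the names `TokenBucket` and `LayeredLimiter` as well as every rate and burst value are hypothetical. Each PSP gets its own bucket, and all PSPs share one global bucket; a request must pass both.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Classic token bucket: refills at `rate_per_s`, holds at most `burst` tokens."""

    def __init__(self, rate_per_s: float, burst: int,
                 clock: Callable[[], float]) -> None:
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.updated = clock()

    def try_take(self) -> bool:
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def refund(self) -> None:
        self.tokens = min(self.capacity, self.tokens + 1.0)


class LayeredLimiter:
    """A request must pass its PSP's local bucket and then the shared global bucket."""

    def __init__(self, global_rate: float, global_burst: int,
                 psp_rate: float, psp_burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._global = TokenBucket(global_rate, global_burst, clock)
        self._psp_rate, self._psp_burst = psp_rate, psp_burst
        self._per_psp: Dict[str, TokenBucket] = {}

    def allow(self, psp_id: str) -> bool:
        bucket = self._per_psp.get(psp_id)
        if bucket is None:
            bucket = TokenBucket(self._psp_rate, self._psp_burst, self._clock)
            self._per_psp[psp_id] = bucket
        if not bucket.try_take():
            return False        # noisy PSP is stopped without spending global budget
        if not self._global.try_take():
            bucket.refund()     # system-wide limit hit: give the local token back
            return False
        return True
```

Checking the local bucket first means one misbehaving PSP exhausts only its own allowance, and the refund keeps a PSP from being penalised locally for congestion it did not cause.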

3. Intelligent Caching System

Status cache for completed transactions with appropriate TTL
Write-through cache for maintaining consistency
Cache invalidation strategies for updated transaction statuses
Priority queuing system for different types of requests
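The priority queuing item above can be illustrated with a small heap-based queue in Python. This is a sketch under assumptions: the two request kinds, their priority numbers, and the `PriorityRequestQueue` class are hypothetical, picked only to show payments being served ahead of status checks while arrival order is preserved within each kind.

```python
import heapq
import itertools
from typing import Any, List, Tuple

# Assumed priorities: a lower number is served first.
PRIORITY = {"payment": 0, "status_check": 1}


class PriorityRequestQueue:
    """Serve payments before status checks, FIFO within each kind."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str, Any]] = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, kind: str, payload: Any) -> None:
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), kind, payload))

    def pop(self) -> Tuple[str, Any]:
        _, _, kind, payload = heapq.heappop(self._heap)
        return kind, payload

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    queue = PriorityRequestQueue()
    queue.push("status_check", "check TXN-1")
    queue.push("status_check", "check TXN-2")
    queue.push("payment", "pay 500 to merchant")
    while len(queue):
        print(queue.pop())  # the payment comes out first despite arriving last
```

A strict priority scheme like this can starve status checks under sustained payment load, so a production design would likely add ageing or a reserved share for the lower class.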

That’s it for now. Hopefully in the next edition, I will have something other than UPI to talk about.

The Fintech Chronicler
