During the AI era, the support systems of Hewlett Packard Enterprise have regularly processed billions of operational signals coming in from customers every day, and as those operational environments became more autonomous, executives for the IT gear supplier watched as the number of tokens those systems consumed scaled with the number of signals.
Along with this, the per-token cost to HPE also steadily mounted, the result of what has become known as “tokenomics.”
In response to the rapidly rising costs, HPE engineers built an AI-first support platform on premises with the company’s GreenLake Intelligence – a framework of AI agents – and Private Cloud AI, an on-prem AI infrastructure engineered with Nvidia. Running the AI workloads on its own infrastructure gave HPE control over the economics of the work, according to Fidelma Russo, executive vice president, president and general manager of HPE’s hybrid cloud business unit and the company’s chief technology officer.
“It allowed us to govern that really important customer data and it gave us better performance,” Russo said from the keynote stage at the recent HPE Discover 2026 event in Las Vegas. “It also helped us significantly minimize the token spend associated with operating AI at scale. We stopped being consumers of AI and we became producers of intelligence.”
As a result, HPE cut the cost by a factor of more than 30, saving nearly $100,000 a month, which Russo said gave the company the capacity needed to scale more quickly.
This is an example of the shift in the industry away from enterprises running their AI workloads primarily on the big clouds to building AI datacenters that run on-premises – and even reach out to the edge – and create more hybrid inferencing environments. There are several reasons for this, including data sovereignty and security, as well as latency. However, key among them are the rising costs associated with expansion of AI inferencing and the emergence of agentic AI.
“Once agents have continuous access to data, every interaction consumes a token,” Russo said. “That includes every decision, includes some validating the decision, and includes them taking the action. Unlike traditional AI, agents don’t stop after one response. They continuously reason, they continuously coordinate, and they continuously interact with other systems. What that means is that inference is a continuous operational workload, not a one-time request. This brings us back to economics, because every time an AI system reasons, validates, acts, or takes action, it’s consuming a token.”

The consumption piles up the costs very quickly, she said. What seems like a simple prompt can become thousands or even millions of model interactions. For example, public data shows that OpenClaw – the widely popular virtual personal AI agent – has processed more than 600 billion tokens in a single month to support roughly 100 continuously operating coding agents, Russo said, which comes to about $13,000 per agent per month.
“Suddenly, AI economics looks a lot like infrastructure economics,” she said. “It comes down to utilization, efficiency, and scale, and how well we operate the full system and not just the model.”
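The OpenClaw figures Russo cited can be turned into a rough per-token price. The sketch below is a back-of-envelope check, not anything HPE presented: it takes the three quoted numbers as exact, even though they were given as "more than," "roughly," and "about," and the function name is purely illustrative.

```python
# Figures as quoted by Russo; treated as exact for the sake of the arithmetic.
TOKENS_PER_MONTH = 600e9       # "more than 600 billion tokens in a single month"
AGENTS = 100                   # "roughly 100 continuously operating coding agents"
COST_PER_AGENT_USD = 13_000    # "about $13,000 per agent per month"


def implied_economics(tokens, agents, cost_per_agent):
    """Derive per-agent volume, total spend, and blended price per million tokens."""
    tokens_per_agent = tokens / agents
    total_cost = agents * cost_per_agent
    usd_per_million_tokens = total_cost / (tokens / 1e6)
    return tokens_per_agent, total_cost, usd_per_million_tokens


per_agent, total, blended = implied_economics(
    TOKENS_PER_MONTH, AGENTS, COST_PER_AGENT_USD
)
print(f"{per_agent:,.0f} tokens per agent per month")
print(f"${total:,.0f} total per month")
print(f"${blended:.2f} per million tokens (blended)")
```

On those assumptions, each agent works through about 6 billion tokens a month, the fleet costs about $1.3 million a month, and the blended price lands a little above $2 per million tokens. The per-token price is small; the volume is what makes the bill large.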

According to validation firm Signal65, agents can use 4X to 15X as many tokens as standard AI chat interactions, and as agentic workloads evolve, autonomous agents could push 1,000 times the inference demand of reasoning AI. All of this will drive the cost of inferencing and agents even higher.
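To see what the Signal65 multipliers mean for a budget, here is a minimal sketch that scales a chat workload into an agentic one. Only the 4X and 15X multipliers come from the article; the baseline token volume and the price per million tokens are hypothetical placeholders, and the 1,000X projection is left out because its baseline is not specified.

```python
AGENT_MULTIPLIER_LOW = 4     # Signal65: agents use 4X ...
AGENT_MULTIPLIER_HIGH = 15   # ... to 15X the tokens of standard chat


def agentic_token_range(chat_tokens_per_month):
    """Return the (low, high) monthly token volume once a workload goes agentic."""
    return (
        chat_tokens_per_month * AGENT_MULTIPLIER_LOW,
        chat_tokens_per_month * AGENT_MULTIPLIER_HIGH,
    )


def monthly_cost_usd(tokens, usd_per_million_tokens):
    """Convert a token volume into a monthly cost at a flat per-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
baseline_chat_tokens = 50_000_000   # assumed chat workload per month
price_per_million = 2.00            # assumed flat price in USD

low_tokens, high_tokens = agentic_token_range(baseline_chat_tokens)
chat_cost = monthly_cost_usd(baseline_chat_tokens, price_per_million)
low_cost = monthly_cost_usd(low_tokens, price_per_million)
high_cost = monthly_cost_usd(high_tokens, price_per_million)

print(f"chat: ${chat_cost:,.2f}  agentic: ${low_cost:,.2f} to ${high_cost:,.2f}")
```

Because the price per token is held flat, cost scales linearly with the multiplier; any real deployment would also have to account for tiered pricing and differing input and output token rates.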
“Training might happen in the cloud, and that part of the story remains largely true,” wrote Steve McDowell, founder and chief analyst with NAND Research. “But something unexpected is happening with inference. It’s moving back on-prem and out to the edge, a quiet reversal that’s forcing a fundamental rethink of enterprise AI architecture. The ‘cloud for everything’ approach that seemed inevitable just two years ago is proving impractical for production AI workloads. IT organizations are discovering that while cloud infrastructure excels at certain AI tasks, inference often works better closer to home.”
The trend has driven traditional OEMs to create infrastructure that can run these AI workloads on-premises and software tools that can manage the resulting hybrid environments that are arising, including in the form of AI factories created by the likes of not only HPE but also Dell Technologies and Cisco Systems in partnership with Nvidia. Executives speaking at Dell Technologies World 2026 in May highlighted their efforts to expand the vendor’s on-premises AI infrastructure capabilities, with founder and chief executive officer Michael Dell pointing to a study by the company that said 67 percent of AI workloads run outside the cloud and 88 percent of those surveyed said they are running at least one AI workload in their own datacenter.
And at its Cisco Live 2026 show earlier this month, Cisco Systems gave a similar assessment, putting a networking-heavy emphasis on the initiative while outlining the advantages of its large infrastructure, services, and software.
As we noted, HPE also put a focus on its networking as it continues to cross-pollinate its Aruba branch and campus networking lineup with the technology inherited when it bought Juniper Networks for $14 billion last year. However, the company reached deeper into a range of other areas, from the cloud and the edge to software and security.
In software, a focus is on GreenLake Intelligence, with a central agent registry to ensure that organizations know not only what agents they have, but also where they are and what they’re permitted to do, and OpsRamp Copilot for AI agents and large language models to monitor utilization and token-based consumption and to keep tabs on costs related to not only agents but also AI factories and workloads.

There is also Morpheus 9, the latest iteration of the software for infrastructure automation, which now includes Morpheus Central for federated management of multiple sites, the Morpheus Orchestration Copilot that uses natural language for provisioning, and integrated software-defined networking.

“Morpheus Central provides a single operational layer across every distributed Morpheus instance, regardless of where it runs or what it manages,” Russo said.
HPE is integrating its Alletra Storage MP X10000 with the two-year-old Private Cloud AI to automatically apply policies for governance and metadata.
Russo said that every agent, inference, and AI workflow depends on context, the background information that makes up the system’s working memory and situational awareness and lets it generate relevant and accurate responses. When a system has to rebuild context every time it’s used, tokens are burned and processes slow down, with Russo noting that “in AI, memory is no longer a technical detail or a supply chain challenge. It is a strategic resource.”
KV cache removes the need to rebuild the context, meaning that storage becomes active memory for AI, which for HPE means its HPE Alletra Storage X10000 system becomes the repository for that information.
“Not simply to store data, but to keep data, context, and intelligence available wherever AI needs it,” she said. “It reduces the time GPUs spend waiting for data. It increases the amount of useful work your infrastructure can perform. And the X10K turns storage into an active part of AI efficiency and is a key contributor to the economic value of AI.”
In tests, Russo said the X10000 provides 20 times faster time-to-first-token and 17 times higher throughput.
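The idea that cached context saves work can be illustrated with a toy model. The sketch below is not HPE's implementation and does not store real key/value tensors; it only counts how many tokens a model would have to process from scratch when a shared context is rebuilt on every request versus looked up from a cache. The class name, the 8,000-token context, and the 20 follow-up requests are all hypothetical.

```python
import hashlib


class ContextCache:
    """Toy stand-in for a KV cache store: remembers which contexts were processed."""

    def __init__(self):
        self._store = {}
        self.prefill_tokens = 0  # tokens the model had to process from scratch

    @staticmethod
    def _key(tokens):
        # Hash the context so identical contexts map to the same cache entry.
        return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()

    def run(self, context_tokens, new_tokens):
        key = self._key(context_tokens)
        if key not in self._store:
            # Cache miss: the shared context must be processed in full.
            self.prefill_tokens += len(context_tokens)
            self._store[key] = len(context_tokens)
        # The new tokens in each request always need processing.
        self.prefill_tokens += len(new_tokens)


def tokens_without_cache(context_tokens, requests):
    """Every request rebuilds the full context before handling its new tokens."""
    return sum(len(context_tokens) + len(r) for r in requests)


# Hypothetical workload: one shared context, many short follow-up requests.
context = ["tok"] * 8000
requests = [["q"] * 50 for _ in range(20)]

cache = ContextCache()
for request in requests:
    cache.run(context, request)

baseline = tokens_without_cache(context, requests)
print(f"without cache: {baseline:,} tokens, with cache: {cache.prefill_tokens:,} tokens")
```

In this toy case the context is processed once instead of 20 times, so the longer the shared context and the more often it is reused, the larger the gap grows. A production system would also have to handle cache eviction, partial prefix matches, and the cost of moving cached state between storage and GPU memory.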
Private Cloud AI servers will come with Nvidia’s Agent Toolkit, which includes the GPU maker’s OpenShell secure runtime, NemoClaw blueprints, and Nemotron models to give developers the tools they need to design, build, run, and orchestrate large-scale multi-agent environments.
HPE’s upcoming ProLiant DL394 Gen12 server, due early next year, will include Nvidia’s new Vera CPUs that are used to support agents with rapid tool calls, orchestration, and real-time data processing.
These points and others address some of the ways HPE is trying to ease the path for enterprises to bring AI infrastructure to internal datacenters, because that’s where the workloads are heading, according to Cheri Williams, senior vice president and general manager of HPE’s private cloud and flex solutions.
“There’s still a place for experimentation and model training in the public cloud, and you still see customers doing that,” Williams said during a panel discussion. “But when it comes to production, on-prem is where most of the enterprise customers are going. The economics don’t make sense to be in production in the public cloud.”