
NVIDIA has introduced Dynamo, open-source inference software designed to accelerate and scale reasoning models within AI factories.

Effectively managing and coordinating AI inference requests across a fleet of GPUs is crucial for ensuring that AI factories operate with optimal cost-effectiveness and maximize token revenue generation.

As AI reasoning becomes increasingly prevalent, each AI model is expected to generate tens of thousands of tokens with every prompt, representing its “thinking” process. Therefore, enhancing inference performance while reducing its cost is essential for accelerating growth and boosting revenue opportunities for service providers.

A New Generation of AI Inference Software

NVIDIA Dynamo, which succeeds the NVIDIA Triton Inference Server, represents a new generation of AI inference software specifically engineered to maximize token revenue generation for AI factories deploying reasoning AI models.

Dynamo orchestrates and accelerates inference communication across potentially thousands of GPUs, employing disaggregated serving, a technique that separates the processing and generation phases of large language models (LLMs) onto distinct GPUs. This approach allows each phase to be optimized independently, catering to its specific computational needs and ensuring maximum utilization of GPU resources.

“Industries worldwide are training AI models to think and learn in different ways, making them more sophisticated over time,” stated Jensen Huang, founder and CEO of NVIDIA. “To enable a future of custom reasoning AI, NVIDIA Dynamo helps serve these models at scale, driving cost savings and efficiencies across AI factories.”

Using the same number of GPUs, Dynamo has demonstrated the ability to double the performance and revenue of AI factories serving Llama models on NVIDIA’s current Hopper platform. Furthermore, when running the DeepSeek-R1 model on a large cluster of GB200 NVL72 racks, NVIDIA Dynamo’s intelligent inference optimizations have been shown to boost the number of tokens generated per GPU by more than 30 times.

To achieve these improvements in inference performance, NVIDIA Dynamo incorporates several key features designed to increase throughput and reduce operational costs.

Dynamo can dynamically add, remove, and reallocate GPUs in real-time to adapt to fluctuating request volumes and types. The software can also pinpoint specific GPUs within large clusters that are best suited to minimize response computations and efficiently route queries. Additionally, Dynamo can offload inference data to more cost-effective memory and storage devices while retrieving it rapidly when required, thereby minimizing overall inference costs.
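NVIDIA has not published the planner’s exact policy here, but the core idea of demand-driven GPU allocation can be sketched with a simple, hypothetical scaling rule: provision just enough GPUs to drain the current request queue in one scheduling interval, clamped to the fleet’s limits.

```python
import math

def plan_gpu_count(queued_requests: int, per_gpu_throughput: int,
                   min_gpus: int = 1, max_gpus: int = 64) -> int:
    """Hypothetical planner rule: size the GPU pool to clear the queue in
    one interval, never dropping below min_gpus or exceeding max_gpus."""
    needed = math.ceil(queued_requests / per_gpu_throughput)
    return max(min_gpus, min(max_gpus, needed))

# Heavy traffic: 900 queued requests, 100 requests/interval per GPU -> 9 GPUs.
print(plan_gpu_count(900, 100))  # 9
# Light traffic: the floor prevents scaling to zero.
print(plan_gpu_count(10, 100))   # 1
```

A production planner would also weigh request types and phase-specific load, but the clamp-to-demand shape of the decision is the same.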

NVIDIA Dynamo is being released as a fully open-source project, offering broad compatibility with popular frameworks such as PyTorch, SGLang, NVIDIA TensorRT-LLM, and vLLM. This open approach supports enterprises, startups, and researchers in developing and optimizing novel methods for serving AI models across disaggregated inference infrastructures.

NVIDIA expects Dynamo to accelerate the adoption of AI inference across a wide range of organizations, including major cloud providers and AI innovators like AWS, Cohere, CoreWeave, Dell, Fireworks, Google Cloud, Lambda, Meta, Microsoft Azure, Nebius, NetApp, OCI, Perplexity, Together AI, and VAST.

NVIDIA Dynamo: Supercharging Inference and Agentic AI

A key innovation of NVIDIA Dynamo lies in its ability to map the knowledge that inference systems hold in memory from serving previous requests, known as the KV cache, across potentially thousands of GPUs.

The software then intelligently routes new inference requests to the GPUs that possess the best knowledge match, effectively avoiding costly recomputations and freeing up other GPUs to handle new incoming requests. This smart routing mechanism significantly enhances efficiency and reduces latency.
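Dynamo’s actual routing policy is not detailed in the announcement, but the principle of KV-cache-aware routing can be illustrated with a minimal sketch: score each worker by how long a token prefix it already holds in cache for the incoming request, and route to the best match so the fewest prompt tokens must be recomputed. The `Worker` class and `route` function below are hypothetical stand-ins, not Dynamo’s API.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    # Token prefixes this worker already holds in its KV cache.
    cached_prefixes: list = field(default_factory=list)

def shared_prefix_len(a, b) -> int:
    """Length of the common leading tokens of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens, workers) -> Worker:
    """Send the request to the worker whose cached prefixes overlap it most,
    minimizing recomputation (illustrative policy only)."""
    def best_overlap(w):
        return max((shared_prefix_len(request_tokens, p)
                    for p in w.cached_prefixes), default=0)
    return max(workers, key=best_overlap)

workers = [
    Worker("gpu-0", cached_prefixes=[[1, 2, 3, 4]]),
    Worker("gpu-1", cached_prefixes=[[1, 2, 9]]),
]
print(route([1, 2, 3, 5], workers).name)  # gpu-0: shares a 3-token prefix
```

A real router would also balance load and cache pressure, but prefix overlap is the "knowledge match" the article describes.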

“To handle hundreds of millions of requests monthly, we rely on NVIDIA GPUs and inference software to deliver the performance, reliability, and scale our business and users demand,” said Denis Yarats, CTO of Perplexity AI.

“We look forward to leveraging Dynamo, with its enhanced distributed serving capabilities, to drive even more inference-serving efficiencies and meet the compute demands of new AI reasoning models.”

AI platform Cohere is already planning to leverage NVIDIA Dynamo to enhance the agentic AI capabilities within its Command series of models.

“Scaling advanced AI models requires sophisticated multi-GPU scheduling, seamless coordination, and low-latency communication libraries that transfer reasoning contexts fluidly across memory and storage,” explained Saurabh Baji, SVP of engineering at Cohere.

“We expect NVIDIA Dynamo will help us deliver a premier user experience to our enterprise customers.”

Support for Disaggregated Serving

The NVIDIA Dynamo inference platform also features robust support for disaggregated serving, a technique that assigns the different computational phases of LLMs – including the crucial steps of understanding the user query and then generating the most appropriate response – to different GPUs within the infrastructure.

Disaggregated serving is particularly well-suited for reasoning models, such as the new NVIDIA Llama Nemotron model family, which employs advanced inference techniques for improved contextual understanding and response generation. By allowing each phase to be fine-tuned and resourced independently, disaggregated serving improves overall throughput and delivers faster response times to users.
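The split described above can be sketched in miniature. In the toy sketch below, `prefill` stands in for the compute-bound prompt-processing phase and `decode` for the memory-bandwidth-bound generation phase; in a disaggregated deployment each would run on a separate GPU pool with the KV cache transferred between them. The "model" here is a trivial stand-in (next token = previous token + 1), not anything Dynamo ships.

```python
def prefill(prompt_tokens):
    """Prefill phase: process the entire prompt in one compute-bound pass,
    producing a KV cache (stand-in: the prompt tokens themselves)."""
    return {"kv_cache": list(prompt_tokens)}

def decode(cache, max_new_tokens):
    """Decode phase: generate one token at a time from the KV cache.
    Toy model: each new token is the previous token plus one."""
    out = []
    last = cache["kv_cache"][-1]
    for _ in range(max_new_tokens):
        last += 1
        out.append(last)
    return out

# In disaggregated serving these two calls land on different GPU pools,
# each sized and tuned for its own bottleneck; here they run in sequence.
cache = prefill([10, 11, 12])
print(decode(cache, 3))  # [13, 14, 15]
```

The payoff of the split is that prefill capacity and decode capacity can be scaled independently as the mix of long prompts versus long generations shifts.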

Together AI, a prominent player in the AI Acceleration Cloud space, is also looking to integrate its proprietary Together Inference Engine with NVIDIA Dynamo. This integration aims to enable seamless scaling of inference workloads across multiple GPU nodes. Furthermore, it will allow Together AI to dynamically address traffic bottlenecks that may arise at various stages of the model pipeline.

“Scaling reasoning models cost-effectively requires new advanced inference techniques, including disaggregated serving and context-aware routing,” stated Ce Zhang, CTO of Together AI.

“The openness and modularity of NVIDIA Dynamo will allow us to seamlessly plug its components into our engine to serve more requests while optimizing resource utilization—maximizing our accelerated computing investment. We’re excited to leverage the platform’s breakthrough capabilities to cost-effectively bring open-source reasoning models to our users.”

Four Key Innovations of NVIDIA Dynamo

NVIDIA has highlighted four key innovations within Dynamo that contribute to reducing inference serving costs and enhancing the overall user experience:

  • GPU Planner: A sophisticated planning engine that dynamically adds and removes GPUs based on fluctuating user demand, ensuring optimal resource allocation and preventing both over-provisioning and under-provisioning of GPU capacity.
  • Smart Router: An intelligent, LLM-aware router that directs inference requests across large fleets of GPUs, minimizing costly GPU recomputations of repeat or overlapping requests and freeing up valuable GPU resources to handle new incoming requests more efficiently.
  • Low-Latency Communication Library: An inference-optimized library designed to support state-of-the-art GPU-to-GPU communication, abstracting the complexities of data exchange across heterogeneous devices and significantly accelerating data transfer speeds.
  • Memory Manager: An intelligent engine that manages the offloading and reloading of inference data to and from lower-cost memory and storage devices, ensuring a seamless process with no negative impact on the user experience.

NVIDIA Dynamo will be made available within NIM microservices and will be supported in a future release of the company’s AI Enterprise software platform.

See also: LG EXAONE Deep is a maths, science, and coding buff

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events, including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.
