Business Wire

KAYTUS Enhances KSManage with Full-Stack O&M Visibility for AI Data Centers

26.2.2026 09:02:00 CET | Business Wire | Press release


As AI data centers scale to support increasingly complex AI workloads, traditional IT monitoring can no longer provide the visibility required for reliable operations. KAYTUS, a leading provider of end-to-end AI and liquid cooling solutions, has significantly upgraded KSManage, introducing full-stack, four-level visibility across components, servers and cabinets, clusters, and AI jobs. The upgrade addresses the challenges created by demanding AI data center operations: complex troubleshooting, higher component failure rates, intricate application dependencies, and delayed responses to operations and maintenance (O&M) incidents. The enhanced platform enables precise fault localization, faster incident response, and proactive operations. With KSManage, KAYTUS helps customers maximize availability, improve operational efficiency, and ensure the stability of mission-critical AI data centers powering next-generation computing.

Four Key Challenges Constrain the Operational Efficiency of AI Data Centers

The rapid evolution of large language models (LLMs) is accelerating the development of AI data centers, driving widespread adoption of heterogeneous CPU, GPU, and DPU architectures and increasing the need for cross-regional collaboration. These trends are significantly raising the complexity of operations and maintenance (O&M), where even a single outage can result in losses exceeding USD 1 million, underscoring the growing importance of availability and resilience in AI data center operations.

1. Infrastructure Complexity Hinders Troubleshooting.

Heterogeneous AI data centers integrate a wide range of computing, networking, storage, and supporting systems. Traditional monitoring approaches treat devices as isolated entities and lack end-to-end visibility across the full system, making fault tracking and correlation difficult. As a result, these methods fall short of the stringent operational requirements of AI data centers, which demand rapid detection, rapid analysis, and rapid recovery. The inability to quickly identify root causes directly impacts recovery time and undermines overall system availability.

2. Rising Core Component Failure Rates and Limited Predictive Warning.

Core components such as GPUs and storage devices form the foundation of AI data center performance and operational stability. The rapid adoption of high-power-density hardware has significantly accelerated component wear, driving higher failure rates. Industry data indicate that GPU power consumption has increased more than fivefold over the past decade, while cabinet power density has risen to 20–50 kW and is gradually approaching 200 kW. Under such sustained high-load conditions, the risk of component failure increases sharply. However, traditional monitoring systems lack real-time health tracking and predictive trend analysis, limiting the ability to detect early warning signs and proactively prevent failures.

3. Complex AI Application Scenarios Lack End-to-End Business Correlation for Monitoring.

AI data centers support a wide range of application scenarios, including AI-generated content (AIGC), autonomous driving, and scientific computing. These workloads impose highly diverse requirements on compute, network, and storage resources, making it difficult to correlate underlying hardware issues, such as GPU memory leaks or InfiniBand packet loss, with specific AI jobs. Industry statistics show that approximately 8% of unplanned LLM training interruptions are caused by optical module or fiber failures. Even millisecond-level packet loss can disrupt training, trigger job restarts, and force progress rollbacks, resulting in significant waste of computing resources. Traditional monitoring approaches lack full-link visibility across hardware, workloads, and business processes, limiting their ability to pinpoint and resolve such issues efficiently.

4. Complicated Maintenance Processes Lead to Delayed O&M Responses.

The growing need for cross-regional collaboration has significantly increased the complexity of AI data center operations and maintenance. Critical tasks such as resource scheduling and network link planning still rely heavily on manual processes, which are time-consuming and prone to error. At the same time, limited operational staffing further slows response times, forcing organizations into a largely reactive approach to fault management. The lack of automated response mechanisms results in extended mean time to repair (MTTR), negatively impacting overall service availability and operational efficiency.

KSManage Addresses the Four Key Challenges with Full-Stack, Four-Level Intelligent Visibility

To address the operations and maintenance (O&M) challenges of AI data centers, KSManage introduces a newly established four-layer intelligent monitoring framework, spanning from components to systems. Leveraging global, end-to-end visibility, the solution enables automated fault detection, early warning, and intelligent remediation, significantly enhancing O&M efficiency and ensuring the high availability of AI data centers.

1. Full Correlated Visibility with Real-Time Troubleshooting and 3D Visualization

To address the complexity of troubleshooting in large-scale AI data centers driven by heterogeneous infrastructure and densely interwoven dependencies, KAYTUS KSManage delivers full correlated visibility with unified visual intelligence. The platform continuously collects real-time core metrics, including GPU and CPU utilization, video memory usage, power consumption, network bandwidth, and storage health, while concurrently aggregating operational events and network logs. Leveraging automated topology discovery, KSManage tracks end-to-end cross-node workloads, building an integrated “measurement–log–trace” data foundation. By correlating telemetry from device health down to the port level throughout the entire job lifecycle, KSManage dynamically visualizes resource allocation through real-time 3D modeling. This end-to-end approach overcomes the limitations of traditional siloed monitoring, enabling precise full correlation analysis and transforming root-cause diagnosis from time-consuming investigation into rapid, accurate fault localization, improving troubleshooting efficiency by up to 90%.

2. Predictive Hardware Trend Analysis with Early Warning for Core Component Reliability.

To address the lack of proactive early warning, rising failure rates, and accelerated component wear driven by the widespread adoption of high-power-density devices, KAYTUS KSManage establishes an intelligent hardware health management and early warning system. Leveraging comprehensive hardware telemetry, KSManage applies advanced algorithms to deeply analyze performance trends of critical components, including GPUs and storage devices. Early indicators of abnormal wear are accurately identified, enabling hardware failure risks to be predicted up to seven days in advance. In parallel, KSManage continuously monitors key operational parameters such as load and temperature, proactively mitigating potential failures under sustained high-load conditions and reducing component failure rates at the source.
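The release does not describe the algorithms KSManage uses. As a purely illustrative sketch of what trend-based early warning can look like, the example below fits a straight line to daily readings of a wear indicator and estimates how many days remain before the line crosses a failure threshold. The function name, the sample data, and the threshold are all hypothetical; only the seven-day warning horizon comes from the text above.

```python
def days_until_threshold(daily_readings, threshold):
    """Fit a least-squares line to one-reading-per-day data and estimate
    the number of days until the line reaches `threshold`.

    Returns None when there are fewer than two readings or when the trend
    is flat or decreasing (the threshold is never reached).
    """
    n = len(daily_readings)
    if n < 2:
        return None
    days = range(n)
    mean_x = sum(days) / n
    mean_y = sum(daily_readings) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, daily_readings))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    # Days remaining, counted from the most recent reading (day n - 1).
    return max(crossing_day - (n - 1), 0.0)


# Hypothetical wear indicator in percent, one sample per day.
wear = [10.0, 12.0, 14.0, 16.0, 18.0]
eta = days_until_threshold(wear, threshold=30.0)

# Assumed policy: warn when the projected crossing is within seven days.
WARNING_HORIZON_DAYS = 7
should_warn = eta is not None and eta <= WARNING_HORIZON_DAYS
print(eta, should_warn)
```

A linear fit is the simplest possible trend model; real component wear is rarely linear, so a production system would likely need something more robust.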

3. End-to-End Application Dependencies Correlated with Network Monitoring and Workflows.

To address the challenges posed by diverse AI application scenarios, complex business workflows, and the difficulty of correlating hardware anomalies with AI training tasks, KAYTUS KSManage delivers full correlated visibility across hardware, platforms, and workloads. The solution precisely monitors critical network metrics, including bandwidth, latency, and packet loss, while reserving a 20% bandwidth margin to ensure stable data transmission, maintaining millisecond-level internal latency and packet loss below 0.01%. This enables accurate mapping of hardware anomalies to specific training jobs. By tracing the complete path from network anomalies through workloads to business impact, KSManage rapidly pinpoints root causes of LLM training interruptions, such as optical module or fiber faults, preventing training rollbacks, eliminating wasted compute resources, and delivering end-to-end visibility beyond the capabilities of traditional monitoring tools.
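The two numeric targets above (a 20% bandwidth margin and packet loss below 0.01%) can be expressed as a simple per-link check. The sketch below is an assumption about how such a rule might be written, not KSManage's implementation; the `LinkSample` type, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass

BANDWIDTH_MARGIN = 0.20    # fraction of link capacity kept in reserve
MAX_PACKET_LOSS = 0.0001   # 0.01% expressed as a fraction


@dataclass
class LinkSample:
    """One hypothetical measurement interval for a network link."""
    name: str
    capacity_gbps: float
    used_gbps: float
    packets_sent: int
    packets_lost: int


def check_link(sample):
    """Return a list of human-readable violations for one link sample."""
    issues = []

    # Usable bandwidth is capacity minus the reserved margin.
    budget = sample.capacity_gbps * (1 - BANDWIDTH_MARGIN)
    if sample.used_gbps > budget:
        issues.append(
            f"{sample.name}: {sample.used_gbps:.1f} Gbps exceeds "
            f"{budget:.1f} Gbps budget"
        )

    # Loss ratio over the interval; an idle link counts as zero loss.
    loss = sample.packets_lost / sample.packets_sent if sample.packets_sent else 0.0
    if loss >= MAX_PACKET_LOSS:
        issues.append(f"{sample.name}: packet loss {loss:.4%} is not below 0.01%")

    return issues


# Hypothetical sample: a 400 Gbps link carrying 350 Gbps with 0.005% loss.
print(check_link(LinkSample("ib0", 400.0, 350.0, 1_000_000, 50)))
```

Mapping a violation back to a specific training job would additionally require knowing which jobs use which links, which is outside the scope of this sketch.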

4. Four-Level Automated O&M with Precise Troubleshooting and Rapid Response

To address excessive reliance on manual operations, shortages of specialized O&M personnel, and delayed incident response, KAYTUS KSManage delivers a resilient, intelligent O&M system built on a four-layer visibility framework spanning components, servers and cabinets, clusters, and AI workloads. This unified architecture enables end-to-end automated operations and precise fault diagnosis across the entire AI data center. Automated backup success rates reach nearly 99.8%, while the combined application of knowledge graphs and time-series anomaly detection algorithms enables up to 90% of root causes to be automatically identified within five minutes. As a result, O&M efficiency is increased by up to four times, significantly reducing mean time to repair (MTTR) and minimizing dependence on manual intervention and human error. In parallel, KSManage establishes a resilient response mechanism featuring early warning, tiered protection, and automated isolation and remediation. Storage capacity risks can be predicted up to three days in advance, reducing overall O&M costs and delivering up to a 40% reduction in total cost of ownership (TCO).
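Time-series anomaly detection covers a broad family of techniques, and the release does not say which one KSManage uses. One of the simplest is a rolling z-score, sketched below as an illustration only: each new reading is compared with the mean and standard deviation of the preceding window. The function name, window size, limit, and temperature data are hypothetical.

```python
from statistics import mean, stdev


def zscore_anomalies(series, window=5, z_limit=3.0):
    """Return the indices of readings that deviate from the mean of the
    preceding `window` readings by more than `z_limit` standard deviations."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as anomalous.
            if series[i] != mu:
                anomalies.append(i)
            continue
        if abs(series[i] - mu) / sigma > z_limit:
            anomalies.append(i)
    return anomalies


# Hypothetical GPU temperature readings in degrees Celsius.
temps = [60, 61, 60, 62, 61, 60, 61, 95, 61, 60]
print(zscore_anomalies(temps))
```

In this sample only the spike to 95 is flagged. Note that the spike then inflates the statistics of the following windows, which is one reason practical systems often prefer more robust estimators such as the median.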

Experience KSManage

KSManage is now available as a trial that can be launched in just a few clicks, allowing users to quickly and fully explore the product’s capabilities. To start your trial, please visit: https://ksmanage.kaytus.com (username: admin/password: Manage1!)

For any questions or additional information, please contact us at ksmanage@kaytus.com

Our team will respond promptly!

About KAYTUS

KAYTUS is a leading provider of end-to-end AI and liquid cooling solutions, delivering a diverse range of innovative, open, and eco-friendly products for cloud, AI, edge computing, and other emerging applications. With a customer-centric approach, KAYTUS is agile and responsive to user needs through its adaptable business model. Discover more at KAYTUS.com and follow us on LinkedIn and X.

View source version on businesswire.com: https://www.businesswire.com/news/home/20260226499694/en/

Contacts

Media Contacts
media@kaytus.com

