Reducing Risk And Complexity Across AI Cloud Project Phases


Sven Oehme, Chief Technology Officer (CTO) at DDN, drives innovation across both current and future products.


Whether you’re deploying in the cloud or on-premises, how do you ensure that your solution consistently delivers the accelerated computing necessary to drive AI models? Is there a way to eliminate design guesswork, deployment complexity and unknown outcomes for AI initiatives? As mentioned in my previous article, “AI Is Accelerating The Demand For Cloud, But Which Type?”, high-performance compute, storage and network components are available, but they need to be integrated and optimized for maximum performance and efficiency.

This is where an AI reference architecture becomes extremely valuable. Storage and network resources must complement and maximize the power and scale of GPUs; otherwise, GPU resources will be underutilized and will underperform. The integrated solution must be rigorously lab-tested, certified and proven in unforgiving data center environments where there’s very little tolerance for any hiccups related to performance at scale, data protection or application uptime. The objective is to eliminate as much risk as possible by using a standardized, repeatable solution based on a well-defined bill of materials (BOM) that delivers consistent, predictable behavior and results.

An AI reference architecture is usually a turnkey solution that includes not just the hardware components but also intelligent AI workflow management software and an OS designed to optimize large-scale AI. Such reference architectures usually also include installation and support services, along with guaranteed performance. To date, AI reference architectures have been deployed primarily in private data centers, but a growing number of cloud providers now host them. One example that is increasingly deployed as a standard in the AI realm is NVIDIA’s DGX SuperPOD.

When it comes to security, to be clear, there are vulnerabilities in both private and public cloud data centers. For most general-purpose computing use cases, the public cloud has proven to be very safe, with stringent data encryption protocols and customizable data access rules that eliminate many security-related headaches for IT staff.

However, IT organizations rightfully hold cloud providers to a higher standard when it comes to protecting customer data. And IT leaders are still wary about the public cloud regarding their critical initiatives. Recurring security issues that have appeared with Amazon Web Services (AWS), Microsoft Azure and Google Cloud Platform (GCP) haven’t helped assuage those fears. According to IBM’s 2022 State of the Cloud study, 54% of business and IT professionals surveyed internationally believed that "the public cloud is not secure enough for much of [their] data." Data that organizations use for AI training often includes proprietary or confidential information, which makes AI data a bigger target for bad actors and security threats. In some ways, a well-run enterprise IT department can take on-premises security measures a step further than the cloud can—especially for multi-tenant environments. It can deploy multiple levels of security, including tenant-based isolation, data encryption, Kerberos-based authentication and tight role-based access, which may be harder to control and customize in the cloud.

With the above factors to consider, there’s no obvious choice of where to deploy your AI projects. If you’re just getting started with AI development, it might make sense to try out the cloud for initial sizing and model experimentation and to get a feel for “plug-and-play” tools and pre-trained models. On one hand, you don’t have to bother with infrastructure provisioning and management, but on the other hand, you do have to be wary of how quickly cloud costs can add up. GPUs, storage and cloud-based tools can simplify your AI entry point, but the journey can be very expensive—even when using reserved, dedicated GPUs for cost efficiency.

According to experts who have trained thousands of AI models, a public cloud approach can ultimately cost two to three times as much as building your own private cloud for AI.
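
To see how a gap of that size can open up, it helps to compare cumulative spend over time. The following is a minimal sketch, not a pricing model: the function names and every dollar figure are hypothetical placeholders chosen for illustration, and it ignores factors such as hardware refresh, staffing and utilization.

```python
def cumulative_costs(months, cloud_monthly, onprem_upfront, onprem_monthly):
    """Return (cloud_total, onprem_total) after the given number of months."""
    cloud_total = cloud_monthly * months
    onprem_total = onprem_upfront + onprem_monthly * months
    return cloud_total, onprem_total


def break_even_month(cloud_monthly, onprem_upfront, onprem_monthly, horizon=120):
    """First month in which cumulative cloud spend reaches on-premises spend.

    Returns None if that does not happen within the horizon.
    """
    for month in range(1, horizon + 1):
        cloud_total, onprem_total = cumulative_costs(
            month, cloud_monthly, onprem_upfront, onprem_monthly
        )
        if cloud_total >= onprem_total:
            return month
    return None


# Hypothetical figures for illustration only; substitute your own quotes.
CLOUD_MONTHLY = 100_000
ONPREM_UPFRONT = 1_200_000
ONPREM_MONTHLY = 20_000

if __name__ == "__main__":
    month = break_even_month(CLOUD_MONTHLY, ONPREM_UPFRONT, ONPREM_MONTHLY)
    cloud, onprem = cumulative_costs(
        36, CLOUD_MONTHLY, ONPREM_UPFRONT, ONPREM_MONTHLY
    )
    print(f"Break-even month: {month}")
    print(f"36-month cloud/on-premises cost ratio: {cloud / onprem:.2f}")
```

With real quotes in place of the placeholders, the same comparison shows how long a project must run before the upfront investment pays for itself.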

To summarize:

• If an organization doesn’t have the budget to support upfront AI hardware costs, a hyperscaler or GPU cloud option might be preferable. But be prepared for higher ongoing costs and consider in advance the on-premises or colocation infrastructure and AI tools that would be required for potential repatriation from the cloud.

• If you’re using pre-trained AI models and don’t require a highly customized deployment, the cloud offers a simpler option and reduces the need for sophisticated in-house AI expertise.

• If running AI in the cloud, be sure to look at both hyperscale and GPU cloud options. Although hyperscalers provide a much broader range of cloud services, GPU cloud providers offer specialized services and expertise for AI environments that can deliver better performance and economics.

• If running latency-sensitive AI apps, an on-premises approach can offer more predictable latency and performance service levels.

• For regulated industries or the management of highly sensitive data, an on-premises AI solution could help businesses meet regulatory compliance measures and implement more rigorous and customized security features.
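
The considerations in the summary above can be sketched as a small rule-based helper. This is only an illustration of how the points combine. The `Workload` fields and the `deployment_hints` function are hypothetical names invented for this sketch, and real decisions will weigh many more factors.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    # All fields are hypothetical inputs chosen for this illustration.
    has_upfront_budget: bool
    uses_pretrained_models: bool
    needs_custom_deployment: bool
    latency_sensitive: bool
    regulated_or_sensitive_data: bool


def deployment_hints(w: Workload) -> List[str]:
    """Map the summary's considerations to plain-language hints."""
    hints = []
    if not w.has_upfront_budget:
        hints.append(
            "cloud: avoids upfront hardware cost; plan for higher "
            "ongoing cost and possible repatriation"
        )
    if w.uses_pretrained_models and not w.needs_custom_deployment:
        hints.append("cloud: simpler option, less in-house AI expertise needed")
    if any(h.startswith("cloud") for h in hints):
        hints.append("cloud: compare hyperscalers with GPU cloud providers")
    if w.latency_sensitive:
        hints.append("on-premises: more predictable latency and performance")
    if w.regulated_or_sensitive_data:
        hints.append("on-premises: supports compliance and customized security")
    return hints


if __name__ == "__main__":
    example = Workload(
        has_upfront_budget=False,
        uses_pretrained_models=True,
        needs_custom_deployment=False,
        latency_sensitive=False,
        regulated_or_sensitive_data=True,
    )
    for hint in deployment_hints(example):
        print("-", hint)
```

A workload can return hints that point in both directions, which reflects the point that there is no single obvious choice.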

Regardless of where AI applications reside, businesses can simplify infrastructure deployment, optimize performance and reduce risk with a pre-validated AI reference architecture.


Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.

