With the advent of generative AI-powered chatbots, business leaders today stand on the cusp of a major transformation. McKinsey estimates that generative AI could add up to $4.4 trillion in value to the global economy annually, across dozens of use cases and many industries. Generative AI could also increase the productivity impact of the broader category of artificial intelligence and analytics by as much as 40% compared with previous generations of the technology.
Given this surge in generative AI, it’s no surprise that the mechanics of AI and ML are rapidly evolving. We’re now seeing significant differences between the workflows supporting large language model operations (LLMOps) and those supporting traditional machine learning operations (“classic” MLOps). Going forward, an AI leader’s ability to understand and navigate the path from MLOps to LLMOps will have an outsized impact on their AI team’s output, their organization’s innovation, and, ultimately, its ability to compete.
LLMs can be orders of magnitude larger than traditional ML models. Likewise, the number of people working on LLMs and the range of their ML expertise are typically far greater. As such, LLMOps presents several unique challenges. Even for data science and IT teams that have successfully developed, deployed, scaled, and optimized classic ML models, managing the full LLMOps lifecycle can be a very different experience.
We’ve identified the most common pitfalls across model development, deployment, and monitoring that can derail your generative AI initiatives if your team is new to LLMOps.
1) Don’t let silos undermine LLM development initiatives
Keep pace with AI innovation by providing data science teams with the architectural and organizational approaches they need to operationalize LLMs.
Anyone versed in classic MLOps (or any kind of operations) understands the problem of silos. Information silos can lead to recreating work already undertaken elsewhere within the enterprise, and resource silos can lead to underutilization or bottlenecks. Both problems apply to LLMOps and classic MLOps alike, but in LLMOps the resulting inefficiencies can be far more detrimental to overall business outcomes.
To avoid information and resource silos in LLMOps, all data science teams must have easy access to a complete inventory of LLMs and other ML models that exist within the organization’s IP. This includes access to both proprietary and open foundation models. Further, any updates to those models must also be immediately available across the enterprise. In terms of resource silos, given the extreme compute power needed to train and run LLMs, all data science teams must understand what GPU capacity exists within the enterprise, where it resides, and how much it will cost to use.
Breaking down LLMOps silos comprises three steps:
- Perform an accurate inventory of all on-premises and cloud GPUs across the enterprise
- Make all LLMs and ML models discoverable to all data science teams
- Connect on-premises and cloud environments through centralized access that shows the location, available capacity, and cost of each GPU cluster
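If your stack already includes a model registry, the second step can start as simply as polling each team’s registry. Below is a minimal sketch assuming MLflow registries; the tracking-server URIs are hypothetical placeholders, and a fuller version would also pull GPU capacity and cost data from your cloud providers’ APIs.

```python
# Minimal sketch, assuming MLflow model registries; the URIs below are
# hypothetical placeholders for per-team tracking servers.
from mlflow.tracking import MlflowClient

REGISTRY_URIS = [
    "http://mlflow.team-a.internal:5000",  # hypothetical team A tracking server
    "http://mlflow.team-b.internal:5000",  # hypothetical team B tracking server
]

def inventory_models(registry_uris):
    """Return (registry URI, model name, latest version) for every registered model."""
    inventory = []
    for uri in registry_uris:
        client = MlflowClient(tracking_uri=uri, registry_uri=uri)
        for model in client.search_registered_models():
            versions = [int(v.version) for v in model.latest_versions] or [0]
            inventory.append((uri, model.name, max(versions)))
    return inventory

if __name__ == "__main__":
    for registry, name, version in inventory_models(REGISTRY_URIS):
        print(f"{registry}  {name}  v{version}")
```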
Silos can undermine workflows around embeddings and prompts, too. This work requires a variety of expertise and must be highly collaborative to be efficient. Data scientists need workbenches for prompt management and vector stores to augment generative AI model performance at scale. Making embeddings and prompts easily discoverable and shareable across data science teams minimizes duplicated, time-consuming work.
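To make the sharing idea concrete, here is a minimal, illustrative sketch of a shared embedding store that any team could query for existing prompts or documents. In practice this role is played by a managed vector database; the class and field names below are assumptions, not a specific product’s API.

```python
# Illustrative sketch of a shared embedding store; in production this would be a
# managed vector database. Names and fields here are assumptions.
import numpy as np

class SharedEmbeddingStore:
    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.metadata = []  # e.g., {"team": "search", "prompt_id": "faq-v2"}

    def add(self, vector, meta):
        vector = np.asarray(vector, dtype=np.float32).reshape(1, self.dim)
        self.vectors = np.vstack([self.vectors, vector])
        self.metadata.append(meta)

    def search(self, query, k=3):
        """Return the k stored items most similar to the query embedding."""
        query = np.asarray(query, dtype=np.float32)
        norms = np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(query) + 1e-9
        scores = (self.vectors @ query) / norms  # cosine similarity
        top = np.argsort(scores)[::-1][:k]
        return [(self.metadata[i], float(scores[i])) for i in top]
```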
2) Avoid expensive inefficiencies in LLM deployments
Teams must be well-aligned and able to easily navigate different infrastructure and data needs and associated costs.
The greater number of variables and overall complexity of LLMs require that enterprises remain as flexible as possible with deployment configurations to control costs. Those familiar with classic MLOps likely understand that hybrid cloud infrastructure offers the optimal blend of flexibility and cost efficiency for AI initiatives. The approach is even more essential for LLMOps, where controlling costs while maintaining model performance can make or break the effort.
- Caching and reusing LLM inferences, rather than re-computing them, significantly reduces costs while helping maintain consistent, accurate model behavior throughout the LLM lifecycle (see the caching sketch after this list).
- Running LLMs on infrastructure close to the data sources (rather than incurring steep data transfer fees by moving data to the models) offers the dual benefit of contributing to lower OpEx while reducing the enterprise’s exposure to data sovereignty and privacy transgressions.
- Bringing the models to the data also helps reduce latency in inference delivery, which, depending on the use case, can either be little more than an annoyance for the end user or an outright catastrophe for critical applications that rely on near-real-time data processing.
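Here is the caching sketch referenced in the first bullet. It keys each response on a hash of the model name, parameters, and prompt, so identical requests are served from the cache instead of being re-computed; call_llm is a placeholder for whatever inference client your stack uses.

```python
# Minimal inference-caching sketch; `call_llm` is a placeholder inference client.
import hashlib
import json

_cache = {}

def _cache_key(model, prompt, params):
    payload = json.dumps({"model": model, "prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_generate(model, prompt, params, call_llm):
    key = _cache_key(model, prompt, params)
    if key not in _cache:
        _cache[key] = call_llm(model=model, prompt=prompt, **params)  # computed only once
    return _cache[key]
```

In production, the in-memory dictionary would typically be replaced by a shared cache service so that every serving replica benefits from reuse.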
Classic MLOps involves both ensemble and cascading techniques. Ensemble techniques combine multiple base models to produce a single optimal predictive model. Cascading optimizes runtime performance by arranging models in stages, ruling out obvious negative answers as quickly and cheaply as possible before the harder cases reach later stages.
The new way of deploying LLMs requires broader use of cascading, with each step performing a different task or solving a small problem on the way to an answer. Building and running multiple smaller models to do different things – including selecting which model is best suited to address the query in terms of accuracy, cost, performance, and latency – can be more efficient than relying on a single large LLM. These are crucial generative AI orchestration considerations that those working in classic MLOps are less likely to encounter.
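As a rough sketch of this routing pattern, the example below tries a small, inexpensive model first and escalates to a larger one only when the small model’s confidence falls below a threshold. small_model, large_model, and the threshold value are illustrative assumptions; real cascades often add more stages and route on cost and latency budgets as well.

```python
# Illustrative two-stage cascade: a cheap model answers most queries, a larger
# model handles only the cases the cheap model is unsure about.
CONFIDENCE_THRESHOLD = 0.8  # illustrative value; tune per use case

def cascade(query, small_model, large_model, threshold=CONFIDENCE_THRESHOLD):
    # Each model is assumed to return (answer, confidence) with confidence in [0, 1].
    answer, confidence = small_model(query)
    if confidence >= threshold:
        return answer, "small"      # cheap path: most queries should stop here
    answer, _ = large_model(query)  # expensive path: only for hard queries
    return answer, "large"
```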
3) Chart your LLM monitoring course carefully
Define and strictly follow responsible AI practices.
Observing responsible AI practices is paramount in LLMOps and classic MLOps alike. However, when end users can exploit LLMs via prompt injection – deliberate manipulations of LLMs to produce unauthorized inferences – the potential exposure the application owner faces makes responsible AI an even more critical aspect of LLMOps.
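As a deliberately naive illustration, the check below flags prompts that contain common injection phrases before they reach the model. The pattern list is an assumption; production systems layer dedicated guardrail and moderation tooling on top of simple checks like this.

```python
# Naive prompt-injection screen; the patterns are illustrative, not exhaustive.
import re

INJECTION_PATTERNS = [
    r"ignore (all|previous|prior) instructions",
    r"disregard the system prompt",
    r"reveal your (system prompt|instructions)",
]

def looks_like_injection(user_prompt: str) -> bool:
    lowered = user_prompt.lower()
    return any(re.search(pattern, lowered) for pattern in INJECTION_PATTERNS)
```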
Data science teams working with LLMs and generative AI applications must be purposeful in building in tracking, auditability, reproducibility, and monitoring across the generative AI lifecycle. This is critical to ensure end users cannot manipulate models to produce misinformation. Both training (distributed across many nodes) and inference must be reproducible. That means validating inferences by allowing other data scientists to test the LLM: if they are unable to produce the same results using the same data and parameters, the model may be faulty.
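One hedged way to operationalize that validation is to re-run the same prompt with deterministic decoding settings and compare the outputs. generate is a placeholder for your inference function, and the parameter names are assumptions; not every model or serving stack honors a seed.

```python
# Sketch of a reproducibility check; `generate` is a placeholder inference function
# and the decoding parameters are assumptions about what your stack accepts.
DETERMINISTIC_PARAMS = {"temperature": 0.0, "top_p": 1.0, "seed": 42}

def is_reproducible(generate, model, prompt, params=DETERMINISTIC_PARAMS):
    first = generate(model=model, prompt=prompt, **params)
    second = generate(model=model, prompt=prompt, **params)
    return first == second  # a mismatch suggests nondeterminism worth investigating
```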
Prompt brittleness is also a major challenge, as minor variations in prompt syntax and semantics can lead to serious flaws in the model’s output. Evaluation of different model versions, including fine-tuned models, demands continuous monitoring of user prompts and their generated output.
These dangers are exacerbated by the sheer number of end users interacting with LLMs as opposed to classic ML models. Because generative AI applications are designed to be “released into the wild,” sophisticated prompt engineering paired with robust observability measures must anchor the LLMOps practice.
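A minimal sketch of the observability side: append every prompt/response pair, tagged with the model version and a timestamp, to a log that evaluation jobs can replay when comparing model versions or fine-tunes. The JSONL format and field names are illustrative choices.

```python
# Append-only JSONL log of prompts and responses for later evaluation and auditing.
import json
import time

def log_interaction(path, model_version, prompt, response):
    record = {
        "ts": time.time(),
        "model_version": model_version,
        "prompt": prompt,
        "response": response,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```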
LLMOps stakes are high; so is the potential payoff of generative AI
Despite the obstacles, the rewards that AI can offer have brought LLMOps to the top of many executives’ to-do lists. However, without a firm understanding of the challenges unique to LLMOps, enterprises that increase their investments in generative AI may fail to see a return and may even expose themselves to liability. The first step in overcoming these challenges is to anticipate them. Then, with the right LLMOps development, deployment, and monitoring technologies and strategies, your organization will be among those that thrive in the generative AI future.