
FMOps and LLMOps: Operationalize generative AI at Scale

Dr. Jagreet Kaur Gill | Sep 27, 2023 7:53:16 AM

Introduction

Generative AI has swiftly transformed sectors such as entertainment, healthcare, finance, and marketing by autonomously creating realistic content, including images, videos, and text. In 2023, a Statista survey found that 29% of Gen Z, 28% of Gen X, and 27% of millennials in the US had adopted generative AI tools. Industries such as technology, education, business services, manufacturing, and finance show a strong preference for OpenAI's solutions. Gartner forecasts that by 2025, generative AI will produce 10% of all data, a remarkable leap from its current sub-1% share. The generative AI market, valued at USD 7.9 billion in 2021, is expected to reach USD 110.8 billion by 2030, with the Asia-Pacific region showing particularly strong growth. Unicorn companies such as OpenAI and Hugging Face are at the forefront of this investment wave, with generative AI companies securing $1.7 billion in VC funding across 255 deals in recent years. This dynamic landscape underscores generative AI's pivotal role in reshaping industries and charting the course for our future.

However, the full potential of generative AI lies in its operationalization, particularly at scale. It is not merely about having these powerful AI models; it is about integrating them seamlessly into business operations. This integration delivers practical benefits across industries, offering a roadmap to streamline processes, boost efficiency, and realize substantial time and resource savings.

MLOps

MLOps, which stands for Machine Learning Operations, is a pivotal component in the journey of taking AI models from development to production. While creating a machine learning model is an essential first step, it is far from the final destination. To make a model useful, it must be seamlessly integrated into applications and able to handle issues like sudden spikes in user requests or changes in real-world data. This is precisely where MLOps comes into play, offering a toolbox and processes to construct and maintain a robust and constantly updated AI system.

The standard MLOps workflow encompasses several phases, including data ingestion, validation, preprocessing, model training, validation, and deployment. Automating these steps can significantly expedite the model development process, resulting in quicker innovation, cost reductions, and enhanced model quality.
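
To make these phases concrete, the following minimal sketch automates a toy version of the workflow in Python with scikit-learn; the dataset, validation checks, quality gate, and joblib-based "deployment" step are illustrative assumptions, not a prescribed stack.

```python
# Minimal illustrative MLOps-style workflow: ingest -> validate -> preprocess -> train -> evaluate -> deploy.
# The dataset, validation thresholds, and joblib-based "deployment" are assumptions for illustration.
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# 1. Data ingestion
df = load_breast_cancer(as_frame=True).frame

# 2. Data validation (toy checks; real pipelines use richer schemas)
assert df.isna().sum().sum() == 0, "missing values detected"
assert len(df) > 100, "not enough rows to train"

# 3. Preprocessing and 4. model training, wrapped in one pipeline
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["target"]), df["target"], test_size=0.2, random_state=42
)
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# 5. Model validation against a quality gate
score = f1_score(y_test, pipeline.predict(X_test))
assert score > 0.9, f"model below quality gate: {score:.3f}"

# 6. "Deployment": persist the artifact for a serving layer to pick up
joblib.dump(pipeline, "model.joblib")
print(f"deployed model.joblib with F1={score:.3f}")
```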

FMOps

As we enter the era of Foundation Models (FMs), there's a notable transformation happening in the realm of MLOps. The conventional approach, which involved integrating various task-specific models and business logic downstream, is now evolving into a more forward-thinking strategy. This new approach prioritizes intelligent data preparation, fine-tuning, guiding the emergence of FM behavior, and elevating the post-processing and chaining of FM outputs to earlier stages of development. 

In 2021, researchers from Stanford University introduced the concept of Foundation Models (FMs), defining them as versatile machine learning models trained on vast and diverse datasets, capable of adapting to a wide array of tasks. Unlike traditional task-specific models, FMs are colossal, boasting billions of parameters and pretrained on extensive datasets. What sets them apart are their remarkable emergent capabilities, such as reading comprehension and artistic creativity, as they learn to reconstruct data. Various FMs have emerged, covering tasks like text-to-text, text-to-image, and speech-to-text, each offering unique levels of control and accessibility. 

To offer a concise definition of FMOps, we propose the following: 

FMOps encompasses the operational capabilities essential for efficient data management and for aligning, deploying, optimizing, and monitoring foundation models within the framework of an AI system.

LLMOps

LLMOps, or Large Language Model Ops, is a specialized subset of FMOps, with a focus on operationalizing solutions based on large language models, particularly those used in text-to-text applications. It encompasses a collection of practices, techniques, and tools specifically designed for managing large language models in production environments. As the demand for integrating these models effectively into operational workflows grows, LLMOps plays a crucial role in enabling streamlined deployment, continuous monitoring, and ongoing maintenance of these models. Similar to traditional Machine Learning Ops (MLOps), LLMOps involves collaborative efforts among data scientists, DevOps engineers, and IT professionals.

MLOps vs FMOps vs LLMOps

| Aspect | MLOps (Machine Learning Operations) | FMOps (Foundation Model Operations) | LLMOps (Large Language Model Operations) |
|---|---|---|---|
| Definition | Operationalizes traditional ML models and solutions. | Operationalizes generative AI solutions, including foundation models. | A subset of FMOps, focusing on operationalizing large language models (LLMs). |
| Primary Focus | Traditional ML models and tasks (e.g., classification, regression). | Generative AI solutions, including various use cases powered by FMs. | LLM-based solutions in text-to-text applications (e.g., chatbots, summarization). |
| Processes | Data preparation, model development, deployment, monitoring, retraining. | Selection, testing, fine-tuning, and deployment of FMs for generative AI. | Selection, evaluation, backend/frontend development, user interaction, feedback integration. |
| Use Cases | Broad range of ML use cases, both traditional and non-generative AI tasks. | Diverse generative AI use cases (text-to-text, text-to-image, text-to-audio, etc.). | LLM-based text-to-text applications in natural language understanding and generation. |
| Scope | Covers the entire ML lifecycle, including model training and evaluation. | Expands MLOps principles to address challenges specific to generative AI. | Focuses on the operationalization of LLMs for text-based applications. |
| Example Tasks | Classification, regression, clustering, predictive analytics. | Content generation, chatbots, summarization, text-to-image generation, etc. | Building and deploying chatbots, text summarizers, content creators, etc. |

Components of FMOps

Choosing a Foundation Model

When selecting foundation models (FMs) for various applications, several critical dimensions must be carefully considered. These factors, contingent upon specific use cases, data availability, regulatory requirements, and more, form the basis of a comprehensive decision-making checklist: 

i. Proprietary vs. Open-Source

The choice between proprietary FMs (offering premium quality with financial costs) and open-source FMs (providing accessibility and flexibility) hinges on the project's needs. 

ii. Commercial License

Licensing terms must be scrutinized to ensure alignment with commercial objectives, as some open-source models may have restrictions. 

iii. Parameters

The number of model parameters affects complexity, performance, and computational resources, necessitating a balance between power and cost. 

iv. Speed

Model size influences processing speed, with larger models typically having higher latency, making it crucial to match model size with real-time requirements. 

v. Context Window Size

The model's context window size impacts its ability to understand and generate longer text sequences, catering to tasks involving extensive conversations or documents. 

vi. Training Dataset

Awareness of the FM's training data sources, which may include diverse text or multimodal datasets, is vital for assessing suitability and addressing copyright concerns. 

vii. Quality

FM quality varies based on type, size, and training data, with context-specific considerations influencing suitability. 

viii. Fine-Tunability

The capability to fine-tune FMs to specific applications enhances performance but demands additional resources and expertise. 


ix. Existing Customer Skills

Selection may be influenced by the expertise of the customer or development team, as well as their familiarity with a particular FM.

To facilitate decision-making, it's advisable to create shortlists of proprietary and open-source models tailored to specific needs, for example by scoring candidates against weighted criteria as sketched below, keeping in mind that performance and parameters evolve rapidly and may require periodic reassessment. Additionally, factors like language support should be considered for specific customer requirements.
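
As a rough aid to building such a shortlist, the sketch below scores hypothetical candidate models against weighted selection criteria; the candidate names, per-criterion scores, and weights are invented placeholders, and real scores would come from the evaluation process described in the next section.

```python
# Illustrative shortlisting helper: score candidate FMs against weighted selection criteria.
# The candidate names, criterion scores, and weights below are invented placeholders, not benchmarks.
from typing import Dict

CRITERIA_WEIGHTS: Dict[str, float] = {
    "quality": 0.30,
    "speed": 0.20,
    "context_window": 0.15,
    "fine_tunability": 0.15,
    "cost": 0.20,  # higher score = lower cost
}

candidates: Dict[str, Dict[str, float]] = {
    "proprietary-model-a": {"quality": 0.9, "speed": 0.6, "context_window": 0.8, "fine_tunability": 0.5, "cost": 0.4},
    "open-source-model-b": {"quality": 0.7, "speed": 0.8, "context_window": 0.6, "fine_tunability": 0.9, "cost": 0.9},
}

def weighted_score(scores: Dict[str, float]) -> float:
    """Combine per-criterion scores (0-1) into a single weighted score."""
    return sum(CRITERIA_WEIGHTS[c] * scores[c] for c in CRITERIA_WEIGHTS)

# Rank candidates by weighted score to form the shortlist
shortlist = sorted(candidates.items(), key=lambda kv: weighted_score(kv[1]), reverse=True)
for name, scores in shortlist:
    print(f"{name}: {weighted_score(scores):.2f}")
```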

Testing and Evaluating Foundation Models (FMs)

In the process of selecting the most suitable foundation model (FM) for a given application, rigorous testing and evaluation are imperative. The approach to evaluation depends on the availability and nature of the evaluation data, and distinct methodologies apply in each case.

Labeled Data Evaluation:

If labeled test data is accessible, traditional model evaluation methods akin to those used in conventional machine learning can be applied. This entails inputting samples and comparing the generated outputs with the provided labels. For tasks with discrete labels, such as sentiment analysis, established accuracy metrics like precision, recall, and F1 score are employed. For unstructured output tasks like summarization, similarity metrics such as ROUGE and cosine similarity are recommended.
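
The snippet below illustrates how these metrics might be computed in Python; the choice of scikit-learn, rouge-score, and sentence-transformers (with an assumed embedding model) is one possible tooling option among many.

```python
# Illustrative evaluation of FM outputs against labeled references.
# Library and model choices here are assumptions for the sketch; equivalent tooling works.
from sklearn.metrics import precision_score, recall_score, f1_score
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

# Discrete-label task (e.g., sentiment): compare predicted vs. gold labels
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))

# Unstructured-output task (e.g., summarization): ROUGE overlap with the reference
reference = "The quarterly report shows revenue grew 12 percent year over year."
candidate = "Revenue increased 12% compared with last year, the quarterly report says."
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(rouge.score(reference, candidate))

# Semantic similarity via embeddings (cosine similarity)
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model choice
emb = encoder.encode([reference, candidate], convert_to_tensor=True)
print("cosine similarity:", util.cos_sim(emb[0], emb[1]).item())
```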

Unlabeled Data Evaluation (No True Answer):

In scenarios where definitive correct answers are elusive, evaluating models becomes more complex.

Two approaches are proposed:

a. Human-in-the-Loop (HIL):

Expert human testers review model responses. The extent of review, ranging from 100% to a sample, depends on the application's criticality. 

b. LLM-powered Evaluation:

A more cost-effective method employs a more powerful Language Model (LLM) to assess all model-generated responses. Though potentially of lower quality, this approach offers rapid evaluation without human involvement. 

In a typical prompt for LLM-powered evaluation, the judging LLM assesses the response generated by a model and assigns a score based on criteria such as helpfulness, relevance, accuracy, and level of detail.
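
A minimal sketch of such an LLM-as-judge setup is shown below; the rubric wording, the 1 to 10 scale, and the `call_llm` helper are assumptions standing in for whichever judge model and client are actually used.

```python
# Illustrative LLM-as-judge evaluation. `call_llm` is a hypothetical placeholder for
# whatever client the chosen judge model exposes; the rubric and scale are assumptions.
JUDGE_PROMPT = """You are evaluating an AI assistant's answer.

Question: {question}
Candidate answer: {answer}

Rate the answer on helpfulness, relevance, accuracy, and level of detail.
Reply with a single integer score from 1 (poor) to 10 (excellent), then a one-sentence justification."""

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a stronger judge LLM and return its text response."""
    raise NotImplementedError("wire this to your LLM provider's client")

def judge(question: str, answer: str) -> str:
    """Score a single model response with the judge LLM."""
    return call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
```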

The evaluation process entails creating an evaluation prompt catalog comprising example prompts tailored to the specific application. This catalog, combined with labeled or unlabeled evaluation datasets, facilitates model assessment. The evaluation results dataset includes prompts, FM outputs, and labeled outputs with scores (if available). Unlabeled datasets necessitate HIL or LLM assessment to provide scores and feedback. 
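
One possible shape for a record in that evaluation results dataset is sketched below; the field names are assumptions chosen for illustration.

```python
# Illustrative record layout for the evaluation results dataset described above.
# Field names are assumptions; unlabeled datasets leave `labeled_output` as None and rely on
# HIL or LLM-assigned scores and feedback.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvaluationRecord:
    prompt: str                    # entry from the evaluation prompt catalog
    fm_output: str                 # response generated by the candidate FM
    labeled_output: Optional[str]  # ground-truth answer, if the dataset is labeled
    score: Optional[float] = None  # metric, HIL, or LLM-assigned score
    feedback: Optional[str] = None # free-text reviewer or judge feedback

record = EvaluationRecord(
    prompt="Summarize the attached policy document in two sentences.",
    fm_output="The policy covers remote-work eligibility and expense reimbursement.",
    labeled_output=None,  # unlabeled case: score and feedback come from HIL or an LLM judge
    score=8.0,
    feedback="Accurate but omits the effective date.",
)
```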

Upon collecting evaluation results, model selection is guided by multiple dimensions, often balancing precision, speed, and cost. Prioritizing these dimensions depends on the use case. Informed decisions are then made based on the performance and trade-offs of each FM along these criteria, ensuring the chosen FM aligns with the application's specific requirements and priorities.

Development of the Generative AI Application Backend and Frontend  


After the selection of the appropriate foundation model (FM) for the specific application, the generative AI development process proceeds with the creation of the application, which is divided into two integral layers: the backend and frontend.

Backend Development

In this phase, generative AI developers seamlessly integrate the chosen FM into the solution. Collaboration with prompt engineers is essential to automate the conversion of end-user input into suitable FM prompts. Prompt testers contribute by crafting entries in the prompt catalog, facilitating automatic or manual (Human-in-the-Loop or LLM-powered) testing. Furthermore, generative AI developers construct prompt chaining mechanisms, breaking down complex tasks into smaller, manageable sub-tasks. This promotes dynamic and contextually-aware Language Model (LLM) applications. To ensure input and output quality, monitoring and filtering mechanisms are established. For instance, toxicity detectors can be applied to eliminate toxic requests and responses. Additionally, a rating mechanism is implemented to augment the evaluation prompt catalog with positive and negative examples, although detailed specifics of these mechanisms will be addressed in subsequent posts.
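
The following sketch illustrates the ideas of prompt chaining plus input/output filtering; `call_fm`, the keyword-based filter, and the two sub-tasks are hypothetical stand-ins rather than the actual mechanisms described above.

```python
# Illustrative backend sketch: a two-step prompt chain with a simple input/output filter.
# `call_fm` and the keyword-based toxicity check are hypothetical stand-ins; production systems
# would use the selected FM's client and a dedicated toxicity classifier.
BLOCKLIST = {"example-slur", "example-threat"}  # placeholder terms

def is_toxic(text: str) -> bool:
    """Naive filter standing in for a real toxicity detector."""
    return any(term in text.lower() for term in BLOCKLIST)

def call_fm(prompt: str) -> str:
    """Placeholder for a call to the chosen foundation model."""
    raise NotImplementedError

def answer_with_chain(user_question: str, document: str) -> str:
    """Reject toxic input, then chain two FM calls: extract facts, compose the answer."""
    if is_toxic(user_question):
        return "Request rejected by content filter."
    # Sub-task 1: extract the facts relevant to the question
    facts = call_fm(f"List the facts in the text below that answer: {user_question}\n\n{document}")
    # Sub-task 2: compose the final answer from those facts
    answer = call_fm(f"Using only these facts, answer the question '{user_question}':\n{facts}")
    return "Response withheld by content filter." if is_toxic(answer) else answer
```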

Frontend Development

To offer functionality to end-users, a frontend website is developed to interface with the backend. DevOps and Application Developers (AppDevs) adhere to best development practices to implement input/output functionality and rating features. In addition to core functionality, both the frontend and backend must incorporate user account creation, data uploading, initiation of fine-tuning as a black box, and the utilization of personalized models instead of the base FM.

Productionization follows a conventional application development approach. Generative AI developers, prompt engineers, and DevOps or AppDevs manually create and test the application, deploying it via Continuous Integration/Continuous Deployment (CI/CD) to a development environment. Testing extends to the preproduction environment, where extensive prompt combinations are evaluated by prompt testers. The results and associated data are integrated into the evaluation prompt catalog to automate future testing. Finally, the application is promoted to production via CI/CD by merging with the main branch. All data, including prompt catalogs, evaluation data, end-user data, and fine-tuned model metadata, are stored in the data lake or data mesh layer, while CI/CD pipelines and repositories reside in a separate tooling account, akin to MLOps practices.
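
As a rough illustration of the input/output and rating functionality described above, the sketch below exposes a generation endpoint and a rating endpoint; FastAPI is an assumed framework choice, and `call_fm` plus the in-memory rating store are placeholders for the real model client and data lake integration.

```python
# Illustrative API layer for the generative application: one endpoint for generation,
# one for collecting user ratings that later feed the evaluation prompt catalog.
# FastAPI is an assumed framework choice; `call_fm` and the in-memory store are placeholders.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
ratings: list[dict] = []  # stand-in for the data lake / prompt catalog store

class GenerateRequest(BaseModel):
    user_id: str
    prompt: str

class RatingRequest(BaseModel):
    user_id: str
    prompt: str
    response: str
    score: int  # e.g., 1 = thumbs down, 5 = thumbs up

def call_fm(prompt: str) -> str:
    """Placeholder for the selected (possibly fine-tuned) foundation model."""
    raise NotImplementedError

@app.post("/generate")
def generate(req: GenerateRequest) -> dict:
    # Forward the end-user prompt to the FM and return its response
    return {"response": call_fm(req.prompt)}

@app.post("/rate")
def rate(req: RatingRequest) -> dict:
    # Store the rating so it can augment the evaluation prompt catalog
    ratings.append(req.model_dump())
    return {"stored": True}
```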

Leading Generative AI Companies

  • OpenAI (Best Overall)

  • OpenAI is a prominent generative AI company, valued at approximately $29 billion and backed by major tech firms like Microsoft. It offers various solutions, including ChatGPT and DALL-E, alongside an API and customizable models for businesses. However, occasional content inaccuracies and offensive outputs pose challenges, and some models can be costly. 

  • Hugging Face (Best for Community-Driven AI Development)

  • Hugging Face is a community-driven platform for AI and ML model development. It provides a wide range of prediction models and datasets, facilitating custom generative AI solutions. While developer-focused, solutions like AutoTrain require minimal coding. Hugging Face is also partnered with AWS, enhancing accessibility. 

  • Alphabet (Google) (Best for Scalability)

  • Google is investing in generative AI, particularly in cloud ecosystems and ethical AI development. DeepMind, its subsidiary, plays a pivotal role. Google emphasizes transparency in AI ethics and cost-efficient, high-performance AI solutions. However, initial hesitancy and rapid tool deployment could pose challenges, with many programs currently available through Trusted Tester Programs.

Conclusion

Foundation Models have emerged as transformative forces in AI, poised to reshape the landscape of AI systems significantly. With their capacity to master intricate knowledge tasks, these models stand to revolutionize human-machine interactions. As industries specializing in knowledge-based tasks recognize their potential, companies must carefully assess their strategic choices to derive maximum value. Presently, a prudent strategy involves investing in robust data infrastructure to fully harness forthcoming advancements. Concurrently, proactive experimentation with specialized models and system integration is crucial, along with adapting internal workflows and processes to accommodate these innovations. This proactive approach will also generate the data needed to fine-tune tomorrow's models, unlocking unprecedented levels of performance and utility.