A Generative AI Engineer is designing a RAG application for answering user questions on technical regulations as they learn a new sport.
What are the steps needed to build this RAG application and deploy it?
Answer : B
The Generative AI Engineer needs to follow a methodical pipeline to build and deploy a Retrieval-Augmented Generation (RAG) application. The steps outlined in option B accurately reflect this process:
Ingest documents from a source: This is the first step, where the engineer collects documents (e.g., technical regulations) that will be used for retrieval when the application answers user questions.
Index the documents and save to Vector Search: Once the documents are ingested, they need to be converted into embeddings (e.g., with a pre-trained embedding model such as BERT) and stored in a vector store (such as Databricks Vector Search, Pinecone, or FAISS). This enables fast retrieval based on user queries.
User submits queries against an LLM: Users interact with the application by submitting their queries. These queries will be passed to the LLM.
LLM retrieves relevant documents: The LLM works with the vector store to retrieve the most relevant documents based on their vector representations.
LLM generates a response: Using the retrieved documents, the LLM generates a response that is tailored to the user's question.
Evaluate model: After generating responses, the system must be evaluated to ensure the retrieved documents are relevant and the generated response is accurate. Metrics such as accuracy, relevance, and user satisfaction can be used for evaluation.
Deploy it using Model Serving: Once the RAG pipeline is ready and evaluated, it is deployed using a model-serving platform such as Databricks Model Serving. This enables real-time inference and response generation for users.
By following these steps, the Generative AI Engineer ensures that the RAG application is both efficient and effective for the task of answering technical regulation questions.
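To make these steps concrete, below is a minimal sketch of the pipeline in Python. It uses sentence-transformers and FAISS for the embedding and retrieval steps, with a placeholder generate_answer function standing in for the LLM call and the Model Serving deployment; the model name and helper functions are illustrative assumptions, not part of the question.

```python
# Minimal RAG pipeline sketch (illustrative; the embedding model and LLM call are assumptions).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# 1. Ingest documents from a source (here, a hard-coded list of regulation snippets).
documents = [
    "Rule 4.2: Riders must wear an approved helmet at all times.",
    "Rule 7.1: The course must be inspected before each heat.",
]

# 2. Index the documents: embed them and store the vectors.
embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
doc_vectors = embedder.encode(documents, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vectors.shape[1])      # inner product == cosine on normalized vectors
index.add(np.asarray(doc_vectors, dtype="float32"))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Steps 3-4: embed the user query and retrieve the most relevant documents."""
    q = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [documents[i] for i in ids[0]]

def generate_answer(query: str, context: list[str]) -> str:
    """Step 5: placeholder for the LLM call (a Model Serving endpoint in production)."""
    prompt = "Answer using only this context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
    return prompt  # swap in a real LLM client here

# Steps 6-7 (evaluation and deployment via Model Serving) would wrap this function.
print(generate_answer("Do I need a helmet?", retrieve("Do I need a helmet?")))
```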
A Generative AI Engineer just deployed an LLM application at a digital marketing company that assists with answering customer service inquiries.
Which metric should they monitor for their customer service LLM application in production?
Answer : A
When deploying an LLM application for customer service inquiries, the primary focus is on measuring the operational efficiency and quality of the responses. Here's why A is the correct metric:
Number of customer inquiries processed per unit of time: This metric tracks the throughput of the customer service system, reflecting how many customer inquiries the LLM application can handle in a given time period (e.g., per minute or hour). High throughput is crucial in customer service applications where quick response times are essential to user satisfaction and business efficiency.
Real-time performance monitoring: Monitoring the number of queries processed is an important part of ensuring that the model is performing well under load, especially during peak traffic times. It also helps ensure the system scales properly to meet demand.
Why other options are not ideal:
B . Energy usage per query: While energy efficiency is a consideration, it is not the primary concern for a customer-facing application where user experience (i.e., fast and accurate responses) is critical.
C . Final perplexity scores for the training of the model: Perplexity is a metric for model training, but it doesn't reflect the real-time operational performance of an LLM in production.
D . HuggingFace Leaderboard values for the base LLM: The HuggingFace Leaderboard is more relevant during model selection and benchmarking. However, it is not a direct measure of the model's performance in a specific customer service application in production.
Focusing on throughput (inquiries processed per unit time) ensures that the LLM application is meeting business needs for fast and efficient customer service responses.
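As a rough illustration of this throughput metric, the sketch below counts inquiries processed per minute from a list of completed-request timestamps; the timestamps and the one-minute bucket size are made-up values for the example.

```python
# Sketch: compute inquiries processed per unit of time from request timestamps (illustrative data).
from collections import Counter
from datetime import datetime

# Assumed log of completed-inquiry timestamps (in production these would come from serving logs).
completed = [
    datetime(2024, 5, 1, 9, 0, 12),
    datetime(2024, 5, 1, 9, 0, 47),
    datetime(2024, 5, 1, 9, 1, 5),
    datetime(2024, 5, 1, 9, 1, 58),
]

# Bucket by minute and count inquiries per bucket.
per_minute = Counter(ts.replace(second=0, microsecond=0) for ts in completed)
for minute, count in sorted(per_minute.items()):
    print(f"{minute:%Y-%m-%d %H:%M} -> {count} inquiries")
```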
A Generative AI Engineer is developing a RAG system for their company to perform internal document Q&A for structured HR policies, but the answers returned are frequently incomplete and unstructured. It seems that the retriever is not returning all relevant context. The Generative AI Engineer has experimented with different embedding and response-generating LLMs, but that did not improve results.
Which TWO options could be used to improve the response quality?
Choose 2 answers
Answer : A, B
The problem describes a Retrieval-Augmented Generation (RAG) system for HR policy Q&A where responses are incomplete and unstructured due to the retriever failing to return sufficient context. The engineer has already tried different embedding and response-generating LLMs without success, suggesting the issue lies in the retrieval process---specifically, how documents are chunked and indexed. Let's evaluate the options.
Option A: Add the section header as a prefix to chunks
Adding section headers provides additional context to each chunk, helping the retriever understand the chunk's relevance within the document structure (e.g., "Leave Policy: Annual Leave" vs. just "Annual Leave"). This can improve retrieval precision for structured HR policies.
Databricks Reference: 'Metadata, such as section headers, can be appended to chunks to enhance retrieval accuracy in RAG systems' ('Databricks Generative AI Cookbook,' 2023).
Option B: Increase the document chunk size
Larger chunks include more context per retrieval, reducing the chance of missing relevant information split across smaller chunks. For structured HR policies, this can ensure entire sections or rules are retrieved together.
Databricks Reference: 'Increasing chunk size can improve context completeness, though it may trade off with retrieval specificity' ('Building LLM Applications with Databricks').
Option C: Split the document by sentence
Splitting by sentence creates very small chunks, which could exacerbate the problem by fragmenting context further. This is likely why the current system fails---it retrieves incomplete snippets rather than cohesive policy sections.
Databricks Reference: No specific extract opposes this, but the emphasis on context completeness in RAG suggests smaller chunks worsen incomplete responses.
Option D: Use a larger embedding model
A larger embedding model might improve vector quality, but the question states that experimenting with different embedding models didn't help. This suggests the issue isn't embedding quality but rather chunking/retrieval strategy.
Databricks Reference: Embedding models are critical, but not the focus when retrieval context is the bottleneck.
Option E: Fine tune the response generation model
Fine-tuning the LLM could improve response coherence, but if the retriever doesn't provide complete context, the LLM can't generate full answers. The root issue is retrieval, not generation.
Databricks Reference: Fine-tuning is recommended for domain-specific generation, not retrieval fixes ('Generative AI Engineer Guide').
Conclusion: Options A and B address the retrieval issue directly by enhancing chunk context---either through metadata (A) or size (B)---aligning with Databricks' RAG optimization strategies. C would worsen the problem, while D and E don't target the root cause given prior experimentation.
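A small sketch of options A and B combined is shown below, assuming the HR policies are plain text with "## "-style section headers; the header convention, chunk size, and overlap are illustrative choices rather than prescribed values.

```python
# Sketch: chunk an HR policy document with section-header prefixes (option A)
# and a configurable, larger chunk size (option B). The "## " header format is an assumption.
def chunk_policy(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    chunks = []
    current_header = ""
    for block in text.split("\n\n"):                      # split on paragraphs
        if block.startswith("## "):                       # assumed section-header convention
            current_header = block.removeprefix("## ").strip()
            continue
        # Slide a window over long paragraphs so no chunk exceeds chunk_size.
        start = 0
        while start < len(block):
            body = block[start:start + chunk_size]
            chunks.append(f"{current_header}: {body}" if current_header else body)
            start += chunk_size - overlap
    return chunks

chunks = chunk_policy("## Leave Policy\n\nEmployees accrue annual leave monthly...")
print(chunks[0])  # "Leave Policy: Employees accrue annual leave monthly..."
```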
Which TWO chain components are required for building a basic LLM-enabled chat application that includes conversational capabilities, knowledge retrieval, and contextual memory?
Answer : B, C
Building a basic LLM-enabled chat application with conversational capabilities, knowledge retrieval, and contextual memory requires specific components that work together to process queries, maintain context, and retrieve relevant information. Databricks' Generative AI Engineer documentation outlines key components for such systems, particularly in the context of frameworks like LangChain or Databricks' MosaicML integrations. Let's evaluate the required components:
Understanding the Requirements:
Conversational capabilities: The app must generate natural, coherent responses.
Knowledge retrieval: It must access external or domain-specific knowledge.
Contextual memory: It must remember prior interactions in the conversation.
Databricks Reference: 'A typical LLM chat application includes a memory component to track conversation history and a retrieval mechanism to incorporate external knowledge' ('Databricks Generative AI Cookbook,' 2023).
Evaluating the Options:
A . (Q): This appears incomplete or unclear (possibly a typo). Without further context, it's not a valid component.
B . Vector Stores: These store embeddings of documents or knowledge bases, enabling semantic search and retrieval of relevant information for the LLM. This is critical for knowledge retrieval in a chat application.
Databricks Reference: 'Vector stores, such as those integrated with Databricks' Lakehouse, enable efficient retrieval of contextual data for LLMs' ('Building LLM Applications with Databricks').
C . Conversation Buffer Memory: This component stores the conversation history, allowing the LLM to maintain context across multiple turns. It's essential for contextual memory.
Databricks Reference: 'Conversation Buffer Memory tracks prior user inputs and LLM outputs, ensuring context-aware responses' ('Generative AI Engineer Guide').
D . External tools: These (e.g., APIs or calculators) enhance functionality but aren't required for a basic chat app with the specified capabilities.
E . Chat loaders: These might refer to data loaders for chat logs, but they're not a core chain component for conversational functionality or memory.
F . React Components: These relate to front-end UI development, not the LLM chain's backend functionality.
Selecting the Two Required Components:
For knowledge retrieval, Vector Stores (B) are necessary to fetch relevant external data, a cornerstone of Databricks' RAG-based chat systems.
For contextual memory, Conversation Buffer Memory (C) is required to maintain conversation history, ensuring coherent and context-aware responses.
While an LLM itself is implied as the core generator, the question asks for chain components beyond the model, making B and C the minimal yet sufficient pair for a basic application.
Conclusion: The two required chain components are B. Vector Stores and C. Conversation Buffer Memory, as they directly address knowledge retrieval and contextual memory, respectively, aligning with Databricks' documented best practices for LLM-enabled chat applications.
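As an illustration of how the two components fit together, here is a plain-Python sketch: a FAISS-backed vector store for knowledge retrieval and a simple list acting as conversation buffer memory, wired around a placeholder call_llm function. All names are illustrative assumptions; frameworks such as LangChain provide equivalent building blocks.

```python
# Sketch: vector store (knowledge retrieval) + conversation buffer memory, with a placeholder LLM.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

knowledge = ["Our refund window is 30 days.", "Support hours are 9am-5pm CET."]
embedder = SentenceTransformer("all-MiniLM-L6-v2")           # assumed embedding model
vectors = embedder.encode(knowledge, normalize_embeddings=True)
store = faiss.IndexFlatIP(vectors.shape[1])                  # the "vector store" component
store.add(np.asarray(vectors, dtype="float32"))

memory: list[tuple[str, str]] = []                           # the "conversation buffer memory" component

def call_llm(prompt: str) -> str:
    return f"[LLM response to prompt of {len(prompt)} chars]"  # placeholder for a real chat model

def chat(user_message: str) -> str:
    q = embedder.encode([user_message], normalize_embeddings=True)
    _, ids = store.search(np.asarray(q, dtype="float32"), 1)
    context = knowledge[ids[0][0]]
    history = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in memory)
    reply = call_llm(f"{history}\nContext: {context}\nUser: {user_message}")
    memory.append((user_message, reply))                     # keep context for the next turn
    return reply

print(chat("What is your refund policy?"))
print(chat("And what about your support hours?"))            # history from turn 1 is included
```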
A Generative AI Engineer is ready to deploy an LLM application written using Foundation Model APIs. They want to follow security best practices for production scenarios.
Which authentication method should they choose?
Answer : A
The task is to deploy an LLM application using Foundation Model APIs in a production environment while adhering to security best practices. Authentication is critical for securing access to Databricks resources, such as the Foundation Model API. Let's evaluate the options based on Databricks' security guidelines for production scenarios.
Option A: Use an access token belonging to service principals
Service principals are non-human identities designed for automated workflows and applications in Databricks. Using an access token tied to a service principal ensures that the authentication is scoped to the application, follows least-privilege principles (via role-based access control), and avoids reliance on individual user credentials. This is a security best practice for production deployments.
Databricks Reference: 'For production applications, use service principals with access tokens to authenticate securely, avoiding user-specific credentials' ('Databricks Security Best Practices,' 2023). Additionally, the 'Foundation Model API Documentation' states: 'Service principal tokens are recommended for programmatic access to Foundation Model APIs.'
Option B: Use a frequently rotated access token belonging to either a workspace user or a service principal
Frequent rotation enhances security by limiting token exposure, but tying the token to a workspace user introduces risks (e.g., user account changes, broader permissions). Including both user and service principal options dilutes the focus on application-specific security, making this less ideal than a service-principal-only approach. It also adds operational overhead without clear benefits over Option A.
Databricks Reference: 'While token rotation is a good practice, service principals are preferred over user accounts for application authentication' ('Managing Tokens in Databricks,' 2023).
Option C: Use OAuth machine-to-machine authentication
OAuth M2M (e.g., client credentials flow) is a secure method for application-to-service communication, often using service principals under the hood. However, Databricks' Foundation Model API primarily supports personal access tokens (PATs) or service principal tokens over full OAuth flows for simplicity in production setups. OAuth M2M adds complexity (e.g., managing refresh tokens) without a clear advantage in this context.
Databricks Reference: 'OAuth is supported in Databricks, but service principal tokens are simpler and sufficient for most API-based workloads' ('Databricks Authentication Guide,' 2023).
Option D: Use an access token belonging to any workspace user
Using a user's access token ties the application to an individual's identity, violating security best practices. It risks exposure if the user leaves, changes roles, or has overly broad permissions, and it's not scalable or auditable for production.
Databricks Reference: 'Avoid using personal user tokens for production applications due to security and governance concerns' ('Databricks Security Best Practices,' 2023).
Conclusion: Option A is the best choice, as it uses a service principal's access token, aligning with Databricks' security best practices for production LLM applications. It ensures secure, application-specific authentication with minimal complexity, as explicitly recommended for Foundation Model API deployments.
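A minimal sketch of option A is shown below, assuming the service principal's token is injected via an environment variable; the workspace URL, endpoint name, and environment-variable name are placeholders, and the request shape follows the OpenAI-compatible chat format used by Foundation Model API endpoints, which should be verified against the current documentation.

```python
# Sketch: authenticate to a Foundation Model API serving endpoint with a service principal token.
# Workspace URL, endpoint name, and env var name are placeholders.
import os
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"   # placeholder
ENDPOINT_NAME = "databricks-meta-llama-3-1-70b-instruct"          # example endpoint name
TOKEN = os.environ["DATABRICKS_SP_TOKEN"]                         # service principal token, never a user's PAT

response = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "messages": [{"role": "user", "content": "Summarize our refund policy."}],
        "max_tokens": 256,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())
```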
A Generative AI Engineer is developing a RAG application and would like to experiment with different embedding models to improve the application performance.
Which strategy for picking an embedding model should they choose?
Answer : A
The task involves improving a Retrieval-Augmented Generation (RAG) application's performance by experimenting with embedding models. The choice of embedding model impacts retrieval accuracy, which is critical for RAG systems. Let's evaluate the options based on Databricks Generative AI Engineer best practices.
Option A: Pick an embedding model trained on related domain knowledge
Embedding models trained on domain-specific data (e.g., industry-specific corpora) produce vectors that better capture the semantics of the application's context, improving retrieval relevance. For RAG, this is a key strategy to enhance performance.
Databricks Reference: 'For optimal retrieval in RAG systems, select embedding models aligned with the domain of your data' ('Building LLM Applications with Databricks,' 2023).
Option B: Pick the most recent and most performant open LLM released at the time
LLMs are not embedding models; they generate text, not embeddings for retrieval. While recent LLMs may be performant for generation, this doesn't address the embedding step in RAG. This option misunderstands the component being selected.
Databricks Reference: Embedding models and LLMs are distinct in RAG workflows: 'Embedding models convert text to vectors, while LLMs generate responses' ('Generative AI Cookbook').
Option C: Pick the embedding model ranked highest on the Massive Text Embedding Benchmark (MTEB) leaderboard hosted by HuggingFace
The MTEB leaderboard ranks models across general tasks, but high overall performance doesn't guarantee suitability for a specific domain. A top-ranked model might excel in generic contexts but underperform on the engineer's unique data.
Databricks Reference: General performance is less critical than domain fit: 'Benchmark rankings provide a starting point, but domain-specific evaluation is recommended' ('Databricks Generative AI Engineer Guide').
Option D: Pick an embedding model with multilingual support to support potential multilingual user questions
Multilingual support is useful only if the application explicitly requires it. Without evidence of multilingual needs, this adds complexity without guaranteed performance gains for the current use case.
Databricks Reference: 'Choose features like multilingual support based on application requirements' ('Building LLM-Powered Applications').
Conclusion: Option A is the best strategy because it prioritizes domain relevance, directly improving retrieval accuracy in a RAG system---aligning with Databricks' emphasis on tailoring models to specific use cases.
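One way to act on this is a small, domain-specific retrieval evaluation: embed a handful of in-domain queries and documents with each candidate model and compare how often the known-relevant document is retrieved. The sketch below assumes sentence-transformers models and a tiny hand-labeled evaluation set; both are illustrative.

```python
# Sketch: compare candidate embedding models on a tiny domain-specific retrieval check.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = ["Offside is called when...", "A scrum restarts play after..."]   # in-domain documents
evals = [("what is offside?", 0), ("how does a scrum work?", 1)]         # (query, relevant doc index)

for model_name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:             # example candidate models
    model = SentenceTransformer(model_name)
    doc_vecs = model.encode(docs, normalize_embeddings=True)
    hits = 0
    for query, gold in evals:
        q = model.encode([query], normalize_embeddings=True)[0]
        best = int(np.argmax(doc_vecs @ q))                              # cosine similarity on normalized vectors
        hits += int(best == gold)
    print(f"{model_name}: top-1 retrieval accuracy = {hits / len(evals):.2f}")
```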
A Generative AI Engineer is setting up Databricks Vector Search to look up news articles by topic within 10 days of a specified date. An example query might be "Tell me about monster truck news around January 5th, 1992". They want to do this with the least amount of effort.
How can they set up their Vector Search index to support this use case?
Answer : B
The task is to set up a Databricks Vector Search index for news articles, supporting queries like "monster truck news around January 5th, 1992," with minimal effort. The index must filter by topic and a 10-day date range. Let's evaluate the options.
Option A: Split articles by 10-day blocks and return the block closest to the query
Pre-splitting articles into 10-day blocks requires significant preprocessing and index management (e.g., one index per block). It's effort-intensive and inflexible for dynamic date ranges.
Databricks Reference: 'Static partitioning increases setup complexity; metadata filtering is preferred' ('Databricks Vector Search Documentation').
Option B: Include metadata columns for article date and topic to support metadata filtering
Adding date and topic as metadata in the Vector Search index allows dynamic filtering (e.g., date within the 10-day window around the specified date, topic = "monster truck") at query time. This leverages Databricks' built-in metadata filtering, minimizing setup effort.
Databricks Reference: 'Vector Search supports metadata filtering on columns like date or category for precise retrieval with minimal preprocessing' ('Vector Search Guide,' 2023).
Option C: Pass the query directly to the vector search index and return the best articles
Passing the full query (e.g., "Tell me about monster truck news around January 5th, 1992") to Vector Search relies solely on embeddings, ignoring structured filtering for date and topic. This risks inaccurate results without explicit range logic.
Databricks Reference: 'Pure vector similarity may not handle temporal or categorical constraints effectively' ('Building LLM Applications with Databricks').
Option D: Create separate indexes by topic and add a classifier model to appropriately pick the best index
Separate indexes per topic plus a classifier model add significant complexity (index creation, model training, maintenance), far exceeding "least effort." It's overkill for this use case.
Databricks Reference: 'Multiple indexes increase overhead; single-index with metadata is simpler' ('Databricks Vector Search Documentation').
Conclusion: Option B is the simplest and most effective solution, using metadata filtering in a single Vector Search index to handle date ranges and topics, aligning with Databricks' emphasis on efficient, low-effort setups.
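A sketch of option B using the databricks-vectorsearch Python client follows; the endpoint and index names are placeholders, and the exact filter syntax (range keys such as "article_date >=") is an assumption that should be checked against the current Vector Search documentation.

```python
# Sketch: query a Vector Search index with metadata filters for a date window (and optionally topic).
# Endpoint/index names are placeholders; verify filter syntax against the Vector Search docs.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()
index = client.get_index(
    endpoint_name="news_endpoint",           # placeholder
    index_name="catalog.schema.news_index",  # placeholder
)

results = index.similarity_search(
    query_text="monster truck news",
    columns=["article_id", "title", "article_date", "topic"],
    # 10-day window around January 5th, 1992; a "topic" filter key could be added similarly.
    filters={"article_date >=": "1991-12-31", "article_date <=": "1992-01-10"},
    num_results=10,
)
print(results)
```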