When preprocessing text data for an LLM fine-tuning task, why is it critical to apply subword tokenization (e.g., Byte-Pair Encoding) instead of word-based tokenization for handling rare or out-of-vocabulary words?
Answer : C
Subword tokenization, such as Byte-Pair Encoding (BPE) or WordPiece, is critical for preprocessing text data in LLM fine-tuning because it breaks words into smaller units (subwords), enabling the model to handle rare or out-of-vocabulary (OOV) words effectively. NVIDIA's NeMo documentation on tokenization explains that subword tokenization creates a vocabulary of frequent subword units, allowing the model to represent unseen words by combining known subwords (e.g., "unseen" as "un" + "seen", written "un" + "##seen" in WordPiece's notation, where "##" marks a word-internal piece). This improves generalization compared to word-based tokenization, which must map every OOV word to a generic unknown token. Option A is incorrect, as tokenization does not eliminate embeddings. Option B is false, as vocabulary size is not fixed but optimized. Option D is wrong, as punctuation handling is a separate preprocessing step.
NVIDIA NeMo Documentation: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/intro.html
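The idea above can be sketched in a few lines of pure Python. This is a minimal toy BPE, not NeMo's or Hugging Face's implementation: it learns merge rules from a tiny hand-made corpus (the words and frequencies are invented for illustration) and then segments the unseen word "unseen" using only subwords it has learned.

```python
from collections import Counter

def merge_word(symbols, pair):
    """Merge every adjacent occurrence of `pair` in a symbol list."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from {tuple_of_chars: frequency}."""
    vocab, merges = dict(word_freqs), []
    for _ in range(num_merges):
        pairs = Counter()
        for syms, freq in vocab.items():
            for pair in zip(syms, syms[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        vocab = {tuple(merge_word(list(s), best)): f for s, f in vocab.items()}
    return merges

def apply_bpe(word, merges):
    """Segment a (possibly unseen) word by replaying the learned merges."""
    syms = list(word)
    for pair in merges:
        syms = merge_word(syms, pair)
    return syms

# Toy corpus: "un" and "see" recur, so BPE learns them as units.
corpus = {tuple("undo"): 5, tuple("untie"): 4,
          tuple("seen"): 6, tuple("seed"): 3}
merges = learn_bpe(corpus, 3)
print(apply_bpe("unseen", merges))   # OOV word built from known subwords
```

The OOV word "unseen" never appears in the corpus, yet it is segmented into subwords the model has seen, which is exactly why subword tokenization generalizes where word-based tokenization fails.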
In the development of Trustworthy AI, what is the significance of 'Certification' as a principle?
Answer : C
In the development of Trustworthy AI, 'Certification' as a principle involves verifying that AI models are fit for their intended purpose according to regional or industry-specific standards, as discussed in NVIDIA's Generative AI and LLMs course. Certification ensures that models meet performance, safety, and ethical benchmarks, providing assurance to stakeholders about their reliability and appropriateness. Option A is incorrect, as transparency is a separate principle, not certification. Option B is wrong, as ethical considerations are broader and not specific to certification. Option D is inaccurate, as compliance with laws is related but distinct from certification's focus on fitness for purpose. The course states: "Certification in Trustworthy AI verifies that models meet regional or industry-specific standards, ensuring they are fit for their intended purpose and reliable."
Which tool would you use to select training data with specific keywords?
Answer : D
Regular expression (regex) filters are widely used in data preprocessing to select text data containing specific keywords or patterns. NVIDIA's documentation on data preprocessing for NLP tasks, such as in NeMo, highlights regex as a standard tool for filtering datasets based on textual criteria, enabling efficient data curation. For example, a regex pattern like `.*keyword.*` can select all texts containing "keyword". Option A (ActionScript) is a programming language for multimedia, not data filtering. Option B (Tableau) is for visualization, not text filtering. Option C (JSON parser) is for structured data, not keyword-based text selection.
NVIDIA NeMo Documentation: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/intro.html
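A minimal sketch of such a filter with Python's standard `re` module (the document texts and keywords below are invented for illustration):

```python
import re

def filter_by_keywords(texts, keywords):
    """Keep only texts containing at least one keyword.

    Word boundaries (\\b) avoid partial matches (e.g. "GPUs" does not
    match the keyword "gpu"); re.escape handles keywords that contain
    regex metacharacters such as "fine-tuning".
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b",
        re.IGNORECASE,
    )
    return [t for t in texts if pattern.search(t)]

docs = [
    "CUDA kernels run on the GPU.",
    "Bread recipes for beginners.",
    "Mixed-precision training speeds up fine-tuning.",
]
print(filter_by_keywords(docs, ["gpu", "fine-tuning"]))
```

In practice a curation pipeline would stream documents from disk rather than hold them in a list, but the regex-based selection step is the same.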
What is the fundamental role of LangChain in an LLM workflow?
Answer : C
LangChain is a framework designed to simplify the development of applications powered by large language models (LLMs) by orchestrating various components, such as LLMs, external data sources, memory, and tools, into cohesive workflows. According to NVIDIA's documentation on generative AI workflows, particularly in the context of integrating LLMs with external systems, LangChain enables developers to build complex applications by chaining together prompts, retrieval systems (e.g., for RAG), and memory modules to maintain context across interactions. For example, LangChain can integrate an LLM with a vector database for retrieval-augmented generation or manage conversational history for chatbots. Option A is incorrect, as LangChain complements, not replaces, programming languages. Option B is wrong, as LangChain does not modify model size. Option D is inaccurate, as hardware management is handled by platforms like NVIDIA Triton, not LangChain.
NVIDIA NeMo Documentation: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/intro.html
LangChain Official Documentation: https://python.langchain.com/docs/get_started/introduction
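The "chaining" idea can be illustrated without the framework itself. The sketch below is framework-free pseudostructure, not LangChain's actual API: a prompt template, a model call, and an output parser composed into one pipeline, with `fake_llm` standing in for a real model call.

```python
class Chain:
    """Compose steps so each one's output feeds the next (the chain idea)."""
    def __init__(self, *steps):
        self.steps = steps

    def invoke(self, value):
        for step in self.steps:
            value = step(value)
        return value

def prompt_template(question):
    return f"Answer concisely: {question}"

def fake_llm(prompt):
    # Stand-in for a real LLM call; returns a canned, prefixed reply.
    return f"ANSWER: stub response to [{prompt}]"

def output_parser(text):
    return text.removeprefix("ANSWER: ").strip()

qa_chain = Chain(prompt_template, fake_llm, output_parser)
print(qa_chain.invoke("What is RAG?"))
```

In real LangChain code the same composition would also thread in retrievers, memory, and tools, but the core abstraction is this pipeline of steps.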
Which calculation is most commonly used to measure the semantic closeness of two text passages?
Answer : C
Cosine similarity is the most commonly used metric to measure the semantic closeness of two text passages in NLP. It calculates the cosine of the angle between two vectors (e.g., word embeddings or sentence embeddings) in a high-dimensional space, focusing on the direction rather than magnitude, which makes it robust for comparing semantic similarity. NVIDIA's documentation on NLP tasks, particularly in NeMo and embedding models, highlights cosine similarity as the standard metric for tasks like semantic search or text similarity, often using embeddings from models like BERT or Sentence-BERT. Option A (Hamming distance) is for binary data, not text embeddings. Option B (Jaccard similarity) is for set-based comparisons, not semantic content. Option D (Euclidean distance) is less common for text due to its sensitivity to vector magnitude.
NVIDIA NeMo Documentation: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/intro.html
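The metric itself is a one-liner over embedding vectors. A minimal sketch with toy 3-dimensional vectors (real sentence embeddings from models like Sentence-BERT typically have hundreds of dimensions, but the formula is identical):

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|): direction, not magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, twice the magnitude
c = [-3.0, 0.0, 1.0]   # orthogonal to a

print(cosine_similarity(a, b))  # 1.0: identical direction
print(cosine_similarity(a, c))  # 0.0: no shared direction
```

Note that `a` and `b` score a perfect 1.0 despite different magnitudes; this insensitivity to vector length is exactly why cosine similarity is preferred over Euclidean distance for comparing embeddings.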
In Natural Language Processing, there are a group of steps in problem formulation collectively known as word representations (also word embeddings). Which of the following are Deep Learning models that can be used to produce these representations for NLP tasks? (Choose two.)
Answer : A, E
Word representations, or word embeddings, are critical in NLP for capturing semantic relationships between words, as emphasized in NVIDIA's Generative AI and LLMs course. Word2vec and BERT are deep learning models designed to produce these embeddings. Word2vec uses shallow neural networks (CBOW or Skip-Gram) to generate dense vector representations based on word co-occurrence in a corpus, capturing semantic similarities. BERT, a Transformer-based model, produces contextual embeddings by considering bidirectional context, making it highly effective for complex NLP tasks. Option B, WordNet, is incorrect, as it is a lexical database, not a deep learning model. Option C, Kubernetes, is a container orchestration platform, unrelated to NLP or embeddings. Option D, TensorRT, is an inference optimization library, not a model for embeddings. The course notes: 'Deep learning models like Word2vec and BERT are used to generate word embeddings, enabling semantic understanding in NLP tasks, with BERT leveraging Transformer architectures for contextual representations.'
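Word2vec's Skip-Gram variant frames embedding training as predicting context words from a target word. A minimal sketch of how those (target, context) training pairs are generated from a sentence (the sentence and window size are illustrative; real training also applies subsampling and negative sampling, which are omitted here):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs as in Word2vec's Skip-Gram setup."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # skip the target itself
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the model learns word vectors".split()
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

A shallow network trained to predict the context token from the target token ends up encoding co-occurrence patterns in its hidden layer, and those hidden-layer weights become the word embeddings; BERT instead derives contextual embeddings from the full bidirectional Transformer, so the same word gets different vectors in different sentences.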
What do we usually refer to as generative AI?
Answer : A
Generative AI, as covered in NVIDIA's Generative AI and LLMs course, is a branch of artificial intelligence focused on creating models that can generate new and original data, such as text, images, or audio, that resembles the training data. In the context of LLMs, generative AI involves models like GPT that produce coherent text for tasks like text completion, dialogue, or creative writing by learning patterns from large datasets. These models use techniques like autoregressive generation to create novel outputs. Option B is incorrect, as generative AI is not limited to generating classification models but focuses on producing new data. Option C is wrong, as improving model efficiency is a concern of optimization techniques, not generative AI. Option D is inaccurate, as analyzing and interpreting data falls under discriminative AI, not generative AI. The course emphasizes: 'Generative AI involves building models that create new content, such as text or images, by learning the underlying distribution of the training data.'
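The autoregressive loop mentioned above can be shown with a deliberately tiny stand-in: a hand-written bigram table instead of a neural network (the table and tokens are invented for illustration). Real LLMs condition a Transformer on the entire context and sample from a probability distribution, but the generation loop is the same: produce one token, append it, repeat.

```python
# Hand-written toy "model": maps the previous token to the next token.
bigram_next = {
    "<s>": "the",
    "the": "model",
    "model": "writes",
    "writes": "text",
    "text": "</s>",   # end-of-sequence marker
}

def generate(max_tokens=10):
    """Autoregressive generation: each step conditions on the last token."""
    tokens, current = [], "<s>"
    for _ in range(max_tokens):
        current = bigram_next[current]
        if current == "</s>":
            break
        tokens.append(current)
    return " ".join(tokens)

print(generate())  # "the model writes text"
```

Swapping the lookup table for a learned model and the deterministic choice for sampling turns this skeleton into the text-completion behavior described above; the "generative" part is that the output sequence is new content drawn from learned patterns, not a label or score.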