Final Exam Review

2024-CS404FZ-January

Q2

Suppose you are developing an image classification model to identify images of cats and dogs. You decide to use a Convolutional Neural Network (CNN) as your model. The following are some relevant data and parameters:
Training set: 5000 images of cats and 5000 images of dogs
Validation set: 1000 images of cats and 1000 images of dogs
Test set: 2000 cat images and 2000 dog images
Answer the following questions:
(a) How will you process the image data for use by the CNN model?
  • Resizing: Adjust all images to a uniform size (e.g., 224x224 pixels) to match the input size expected by the network.
  • Normalization: Scale pixel values to a range between 0 and 1 by dividing by 255 to improve training stability.
  • Label Encoding: Convert the categorical labels (“cat” and “dog”) into numerical format, typically using one-hot encoding or integer encoding.
  • Augmentation: Apply techniques such as rotation, flipping, cropping, and color adjustments to artificially increase the size of the training dataset and make the model more robust to variations.
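A minimal sketch of the preprocessing steps above, assuming torchvision as the toolchain (the 224x224 size, the specific augmentations, and the data/train path are illustrative, not taken from the question):

```python
from torchvision import datasets, transforms

# Resizing, normalization (via ToTensor) and augmentation in one pipeline.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),            # uniform input size for the CNN
    transforms.RandomHorizontalFlip(),        # augmentation: random flips
    transforms.RandomRotation(15),            # augmentation: small rotations
    transforms.ColorJitter(0.2, 0.2, 0.2),    # augmentation: colour adjustments
    transforms.ToTensor(),                    # scales pixel values to [0, 1]
])

# ImageFolder assigns integer labels from the sub-directory names (e.g. cat=0,
# dog=1), which covers the label-encoding step.
train_set = datasets.ImageFolder("data/train", transform=train_transform)
```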

Lecture 2
(b) What techniques will you use during training to improve the performance and generalization of your model?
  • Batch Normalization: Normalize activations of the layers to stabilize and accelerate the training process.
  • Transfer Learning: Use a pre-trained CNN and fine-tune it on the dataset to leverage pre-learned features.
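A minimal sketch of where batch normalization sits inside a CNN block, assuming PyTorch (the layer sizes are illustrative); transfer learning with a pre-trained network is sketched under the 2023 paper, Q3(c) below:

```python
import torch.nn as nn

# Conv -> BatchNorm -> ReLU -> Pool: BatchNorm2d normalises the activations of
# the convolutional layer across the batch, stabilising and speeding up training.
block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2),
)
```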

Q3

The Transformer is an important model in natural language processing. Answer the following questions:
(a) Explain the structure and main components of a Transformer.
Lecture 9
  • Encoder: Processes the input sequence in parallel, applying multiple layers of self-attention and feedforward networks. It generates contextualized representations of the input
  • Decoder: Generates the output sequence step-by-step, using masked self-attention, encoder-decoder attention (to attend to encoder outputs), and feedforward layers to predict the next token.
  • Input/Output Embeddings: Words or tokens are converted into dense vectors.
  • Positional Encoding: Since transformers don’t have inherent sequential order, positional encodings are added to the input embeddings to maintain word order.
  • Multi-head Attention Mechanism: Allows the model to focus on different parts of the sentence simultaneously, enhancing learning.
  • Feed-forward Networks: A fully connected network processes the data.
  • Residual Connections and Layer Normalisation: Helps stabilize and speed up training.
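A minimal sketch of the scaled dot-product attention at the core of the multi-head attention mechanism listed above, assuming PyTorch (the function name and shapes are illustrative):

```python
import math
import torch

def attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # query-key similarities
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))   # e.g. the decoder's causal mask
    weights = torch.softmax(scores, dim=-1)                     # attention weights over positions
    return weights @ v                                          # weighted sum of the values

# Multi-head attention runs this in parallel over several lower-dimensional
# projections of q, k and v, then concatenates the results.
```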
(b) Please give some examples of its applications in NLP tasks.
Lecture 9
  • Translation
  • Summarisation
  • Q&A.
(c) What are the differences between GPT and BERT in pre-training tasks?
Lecture 10
GPT:
  • Pre-trained on large corpora: WebText (GPT-2), diverse internet text (GPT-3)
  • Pre-training Objective: Causal Language Modeling (CLM)
    • Predict the next token in a sequence
    • Uses a decoder-only architecture with unidirectional attention (left-to-right context)
BERT:
  • Pre-trained on large corpora: Wikipedia, BookCorpus
  • Pre-training Objective 1: Masked Language Model (MLM)
    • Randomly mask some tokens in the input
    • Model learns to predict masked tokens based on context from both directions (encoder-only architecture)
  • Pre-training Objective 2: Next Sentence Prediction (NSP)
    • Predict whether two sentences appear consecutively in the text
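A toy sketch contrasting the two pre-training set-ups, assuming PyTorch (the token ids, [MASK] id, and 15% masking rate are illustrative):

```python
import torch

tokens = torch.tensor([[5, 17, 42, 8, 23, 99]])     # one toy input sequence

# GPT-style causal LM: a lower-triangular mask so position i only attends to
# positions <= i; the target is the sequence shifted left by one token.
causal_mask = torch.tril(torch.ones(tokens.size(1), tokens.size(1)))
clm_targets = tokens[:, 1:]

# BERT-style MLM: randomly replace ~15% of tokens with a [MASK] id and train
# the model to recover the originals using context from both directions.
MASK_ID = 103
mlm_inputs = tokens.clone()
mask_positions = torch.rand(tokens.shape) < 0.15
mlm_inputs[mask_positions] = MASK_ID
mlm_targets = tokens          # loss is computed only at the masked positions
```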

Q4

Word2Vec is a family of model architectures and optimizations that can be used to learn word embeddings from large datasets. Answer the following questions:
Lecture 7
(a) Explain the Word2Vec algorithm.
Word2Vec learns word embeddings from local co-occurrence within small context windows: a shallow neural network is trained to predict words from their neighbours (or the neighbours from a word), and the learned weights become the word vectors. Embeddings based on local context windows capture better contextual information than those based on whole documents.
(b) Specify two typical models for the Word2Vec algorithm.
  • CBOW (Continuous Bag of Words): Predict the center word from the surrounding context words.
    • Faster to train and effective on larger datasets, as it averages the context words for prediction.
  • Skip-gram: Predict the surrounding context words from the center word.
    • Effective for capturing relationships in smaller datasets, as it learns representations for each context word individually.
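A minimal sketch of training both variants, assuming the gensim library (the toy corpus and hyperparameters are illustrative):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "chased", "the", "cat"]]

skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # sg=1 -> Skip-gram
cbow     = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)  # sg=0 -> CBOW

vector  = skipgram.wv["cat"]                # 100-dimensional embedding for "cat"
similar = skipgram.wv.most_similar("cat")   # nearest neighbours in embedding space
```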
(c) Describe its applications and advantages in natural language processing.
  • Text classification
  • Sentiment analysis
  • Machine translation
  • Information retrieval
  • Reduces dimensionality compared to one-hot encoding.
  • Enables better generalization and improves performance in downstream NLP tasks.

2023-CS404FZ-January

Q1

(a) Another name for the Backpropagation network is the multi-layer perceptron (MLP). What is a Perceptron and what are the differences in capabilities between an MLP and a Perceptron?
Lecture 2
  • A perceptron is the basic building block of neural networks. It is a linear binary classifier, typically used for supervised learning.
  • An MLP stacks multiple layers of such units with non-linear activation functions and a choice of loss functions, so it can solve complex, non-linear problems, handle multi-class classification, and learn hierarchical patterns (see the sketch below).
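A minimal perceptron sketch, assuming numpy (the update shown is the classic perceptron learning rule; weights and data are illustrative):

```python
import numpy as np

def perceptron(x, w, b):
    # linear binary classifier: weighted sum followed by a hard threshold,
    # so the decision boundary is a straight line/hyperplane (hence no XOR)
    return 1 if np.dot(w, x) + b > 0 else 0

def update(x, target, w, b, lr=0.1):
    # one step of the perceptron learning rule
    error = target - perceptron(x, w, b)
    return w + lr * error * x, b + lr * error
```

An MLP replaces the hard threshold with differentiable non-linear activations and stacks layers, which is what lets backpropagation train it on non-linear problems.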
(b) In what important ways are current machine learning (ML) models different from those of traditional AI? What are the disadvantages of the current generation of ML models?
  • Current machine learning algorithms are still based on the original backpropagation gradient-descent algorithm, but access to very large amounts of data and the availability of parallel computational resources such as GPUs have made building large-scale models feasible. The deeper the model and the larger its training data, the better it performs.
  • The disadvantages of current models relate to the computational cost of training them and the difficulty of curating the quality of the information they are trained on.
(c) Using a diagram, describe at a high level how gradient descent works in training an MLP.
[Diagram: relationship between output error and incremental weight adjustment]
The diagram indicates the relationship between output error and incremental weight adjustment: the weights are repeatedly adjusted in proportion to the gradient of the error with respect to each weight. The learning rate is used as a scaling factor to keep the weight changes small, and the negative sign ensures the weight changes move down the error gradient towards a minimum.
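A minimal sketch of that update rule on a toy one-dimensional error surface (the quadratic error function is purely illustrative):

```python
def grad(w):
    # derivative of the toy error surface E(w) = (w - 3)^2, minimum at w = 3
    return 2.0 * (w - 3.0)

w, lr = 0.0, 0.1               # lr is the small scaling factor (learning rate)
for _ in range(50):
    w = w - lr * grad(w)       # negative sign: step *down* the error gradient
# w ends up close to 3.0, the minimum of the error surface
```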

Q2

[Diagram: a simple recurrent network]
Lecture 8
(a) What is the main class of problem a simple recurrent network (e.g., above) was designed to solve? What design features allow it to do so?
  • It was designed to handle data that was extended over time, such as words of a language or phonemes in speech.
  • The main design features that allow it to process temporal information are the presentation of just one input per time step and the feeding back of the hidden-unit activations from the previous time step as additional input.
(b) Explain the roles of the hidden units and hidden units from t-1.
The hidden units allow the network to learn the dependencies between the current input and the previous input sequence. The hidden units at t-1 provide a memory of the preceding sequence (a minimal sketch of this recurrence follows below).
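Assuming numpy (the weight matrices and sizes are illustrative), the recurrence looks like:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    # the new hidden state mixes the current input with the hidden state from t-1
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

def run(inputs, W_xh, W_hh, b, hidden_size):
    h = np.zeros(hidden_size)
    for x_t in inputs:                        # one input per time step
        h = rnn_step(x_t, h, W_xh, W_hh, b)   # h carries memory of the preceding sequence
    return h
```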
(c) Typically, what output is produced by the network and how is this achieved?
  • The goal is to predict what the next element in a sequence will be given the current input.
  • In order to do this successfully, the network needs to learn the statistical regularities of the input sequence. In the case of language input, this will involve learning the grammar of the language.

Q3

[Diagram: ResNet50 architecture]
(a) ResNet50 (see above) is a type of convolutional neural network (CNN) that is effective for classifying images. Explain with an example what convolutions are and why they are helpful for image classification.
  • Convolutional filters are designed to act as trainable feature detectors that become optimally configured to detect features relevant to the classification task.
  • For example, in the case of the handwritten-digit recognition task MNIST, the learned features relate to the shapes of the digits (a sketch of a single convolution follows below).
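A minimal sketch of a single convolution, assuming numpy (the 3x3 vertical-edge filter and 28x28 image size are illustrative):

```python
import numpy as np

def convolve2d(image, kernel):
    # slide the small filter over the image; the same feature detector is
    # applied at every location, producing a feature map of filter responses
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge_filter = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])      # responds strongly to vertical edges
feature_map = convolve2d(np.random.rand(28, 28), edge_filter)
```

In a CNN the filter values are not hand-designed like this but learned during training.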
(b) ResNet50 is an example of a deep neural network. What does “deep” refer to in this context and how does depth help in the performance of the network?
  • A deep neural network is one comprising a large number of layers; ResNet50, for example, is so called because it contains 50 layers.
  • Depth helps the network learn ever more abstract features to aid its classification task: the lower layers detect simple geometric patterns, while the layers higher up learn to recognise more complex combinations of features.
(c) ResNet50 and similar deep CNN networks are frequently used for transfer learning. What is transfer learning? Describe an example of transfer learning using the ResNet50 network.
Lecture 4
  • Transfer learning refers to using a network that’s been trained on a given task, such as identifying objects in an image, and fine-tuning that network to perform a related but more specific task. It has the advantage of not requiring the expensive re-training of large networks for minor variations in a classification task.
  • One example of this is repurposing ResNet50, which was trained to recognise 1000 categories of image from over a million examples, to tell the difference between a cat and a dog (see the sketch below).
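A minimal sketch of that repurposing, assuming torchvision (the exact `weights` argument depends on the library version):

```python
import torch.nn as nn
from torchvision import models

# ResNet50 pre-trained on ImageNet (1000 classes)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

for param in model.parameters():
    param.requires_grad = False                  # freeze the pre-trained feature extractor

model.fc = nn.Linear(model.fc.in_features, 2)    # new head: 2 classes (cat, dog)
# Only the new head is trained on the cat/dog images; the ImageNet features
# are reused ("transferred") rather than re-learned from scratch.
```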

Q4

(a) Transformer networks have proven to have significant advantages over their competitors in the area of language processing, such as machine translation. What features of the network architecture give it an advantage?
Lecture 9
  • The key innovation of Transformer models is the use of an attentional mechanism that allows the network to track important elements in a sequence, such as a specific word in a sentence.
  • Transformers can do this more effectively than simple recurrent networks because the input is presented to the network simultaneously rather than sequentially.
(b) GPT-3 is an example of a transformer-based language generation model. Give an example of an application of GPT-3? What are the strengths and weaknesses of GPT-3 for language generation?
Lecture 10
  • One application of GPT-3 is AI writing assistants, where the user can ask the application to write text on a requested topic.
  • The strength is the quality of the text produced.
  • The disadvantage is that factual information can be unreliable and needs to be checked.