Natural Language Processing
Lecture 1 Introduction
Overview of NLP
What we will learn in this course ⬇️
- NLP and Machine Learning (lecture 1 - lecture 4)
- NLP Techniques (lecture 5 - lecture 8)
- Advanced topics (lecture 9 - lecture 12)

Let’s have a look at the chart first ⬇️
Based on the chart above, the differences between Classical NLP and Deep Learning-based NLP are:
| Classical | Deep Learning |
|---|---|
| Complex procedure | Simpler procedure |
| Lots of pre-processing | Light pre-processing |
| Each natural language needs an individual processing pipeline | One model can process multiple tasks |
| Manual feature extraction | Automatic feature extraction by hidden layers |
| Relies on linguistic knowledge | Word embeddings (word2vec, GloVe…) |
| Low Scalability | High Scalability |
Why Process Language?
- language stores knowledge
- language communicates new knowledge
- language is a key to culture and human experience
- language is a natural interface for humans
What is Natural Language Processing?
NLP is the branch of AI focused on developing systems that allow computers to communicate with people using everyday language.
Ambiguity is Explosive
Ambiguities compound to generate enormous numbers of possible interpretations.
In English, a sentence ending in n prepositional phrases has at least 2ⁿ syntactic interpretations (the parse counts below follow the Catalan numbers).
- “I saw the man with the telescope”: 2 parses
- “I saw the man on the hill with the telescope.”: 5 parses
- “I saw the man on the hill in Texas with the telescope”: 14 parses
- “I saw the man on the hill in Texas with the telescope at noon.”: 42 parses
- “I saw the man on the hill in Texas with the telescope at noon on Monday”: 132 parses
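The parse counts listed above (2, 5, 14, 42, 132) coincide with consecutive Catalan numbers. The sketch below reproduces them under the assumption that a sentence ending in n prepositional phrases has C(n+1) parses; the function name `catalan` is ours, not from the lecture.

```python
from math import comb

def catalan(k: int) -> int:
    """Return the k-th Catalan number, C_k = (2k choose k) / (k + 1)."""
    return comb(2 * k, k) // (k + 1)

# Assumption: a sentence ending in n prepositional phrases has C_(n+1) parses.
for n in range(1, 6):
    print(n, "PPs:", catalan(n + 1), "parses")
```

The growth is faster than 2ⁿ, which is why exhaustive enumeration of parses quickly becomes impractical.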
Word Meaning and Representation
One common solution is WordNet, but it has a lot of problems:
- Great as a resource but missing nuance
- e.g., “proficient” is listed as a synonym for “good”. This is only correct in some contexts.
- e.g., “good” can be synonym for “serious”???
- Missing new meanings of words
- e.g., wicked, badass, nifty, wizard, genius, ninja, bombast
- Impossible to keep up-to-date!
- Subjective
- Requires human labor to create and adapt
- Can’t compute exact and accurate word similarity
Count-based Word Representation
One-hot Encoding
Each integer value is represented as a binary vector that is all zeros except at the index of the integer, which is marked with a 1.
Problems with One-hot Encoding:
- No word similarity representation
- Inefficiency
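A minimal sketch of one-hot encoding and of the similarity problem, assuming a toy four-word vocabulary; the helper name `one_hot` is illustrative.

```python
def one_hot(index: int, size: int) -> list[int]:
    """Return a vector of zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

vocab = {"I": 0, "love": 1, "deep": 2, "learning": 3}  # toy vocabulary (assumed)
deep = one_hot(vocab["deep"], len(vocab))
learning = one_hot(vocab["learning"], len(vocab))

# Any two different one-hot vectors have dot product 0, so the
# encoding carries no notion of similarity between words.
dot = sum(a * b for a, b in zip(deep, learning))
print(deep, learning, dot)
```

The vector length equals the vocabulary size, which is also where the inefficiency comes from: a realistic vocabulary gives very long vectors that are almost entirely zeros.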
Bag of Words (BoW)
A bag-of-words model (BoW) is a representation of text that describes the occurrence of words within a document. It involves two things:
- A vocabulary of known words.
- A measure of the presence of known words.
So it can be used for:
- Classification (Spam Detection, Sentiment Analysis)
- Similarity between different documents

Problems with BoW:
- Loses word order (e.g. “dog bites man” and “man bites dog” get the same representation)
- The document-term matrix becomes sparse as the vocabulary grows
- Cannot represent new (out-of-vocabulary) words.
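The first and last problems can be seen in a few lines. This is a sketch with a hypothetical three-word vocabulary and whitespace tokenisation, not a full BoW pipeline.

```python
from collections import Counter

def bag_of_words(text: str, vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary word occurs in `text`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["bites", "dog", "man"]  # toy vocabulary (assumed)
a = bag_of_words("dog bites man", vocabulary)
b = bag_of_words("man bites dog", vocabulary)
c = bag_of_words("cat bites man", vocabulary)  # "cat" is out of vocabulary

print(a, b, a == b)  # word order is lost
print(c)             # the unknown word is silently dropped
```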
Term Frequency-Inverse Document Frequency
Short for TF-IDF.
Definition
A statistical measure used to evaluate the importance of a word to a document in a collection of documents.
Formula: TF-IDF = TF × IDF
TF (Term Frequency): Frequency of the term in the document
IDF (Inverse Document Frequency): log(Total number of documents / Number of documents containing the term)
Higher frequency in current document → Higher TF → More important
More common across all documents → Lower IDF → Less important
Example
Suppose we have 3 documents:
Document 1: “machine learning is interesting”
Document 2: “deep learning is a branch of machine learning”
Document 3: “I like learning”
For words in Document 2:
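“learning”: Highest TF (2 of 8 words), but it appears in all 3 documents, so IDF = log(3/3) = 0 → TF-IDF = 0
“deep”: Moderate TF (1 of 8 words), but high IDF (appears in only 1 document, IDF = log(3/1)) → Higher TF-IDF
“is”/“a”: Stop words like these may have high TF, but in a realistic corpus they appear in almost every document, so IDF is very low → Very low TF-IDF (in this tiny corpus “a” happens to appear only in Document 2)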
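The example can be checked numerically. The sketch below assumes TF is the term count divided by document length, the logarithm is natural, and no smoothing is applied; real libraries often use smoothed variants, so their numbers will differ.

```python
import math

docs = [
    "machine learning is interesting",
    "deep learning is a branch of machine learning",
    "I like learning",
]
tokenised = [d.lower().split() for d in docs]

def tf(term, doc_tokens):
    """Relative frequency of `term` in one document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_docs):
    """log(N / number of documents containing the term); term must occur somewhere."""
    containing = sum(1 for d in all_docs if term in d)
    return math.log(len(all_docs) / containing)

def tf_idf(term, doc_tokens, all_docs):
    return tf(term, doc_tokens) * idf(term, all_docs)

doc2 = tokenised[1]
for term in ["learning", "deep", "is"]:
    print(term, round(tf_idf(term, doc2, tokenised), 3))
```

“learning” scores exactly 0 because it occurs in every document, while “deep” scores highest of the three.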
Lecture 2 Word Embeddings and Representation
Introduction to the concept “Prediction”
Count-based Word Representation
Count-based methods are the ones we have covered so far: One-hot, Bag of Words, TF-IDF.
They are sparse and cannot express semantic similarity between words. As a result, we need dense embeddings.
Prediction-based Word Representation
Core idea: “You shall know a word by the company it keeps”.
So, to get word embeddings, we can train a neural network to predict the context of a given word.
To compare two embeddings, we can use cosine similarity.
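Cosine similarity measures the angle between two vectors and ignores their lengths. A minimal sketch, using made-up 3-dimensional embeddings purely for illustration:

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|); ranges from -1 to 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, not taken from any trained model.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, dog))  # close to 1: similar direction
print(cosine_similarity(cat, car))  # much smaller
```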

Word2Vec
CBOW
CBOW is short for Continuous Bag of Words. The aim is to predict a single target (centre) word from its context words.
The neural network has 3 layers:
- Input Layer
- Hidden Layer
- Output Layer
Here, I will use a simple example given by ChatGPT to understand how those 3 layers work.
Assume we have only 4 words in our vocabulary:
| Word | Index |
|---|---|
| I | 0 |
| love | 1 |
| deep | 2 |
| learning | 3 |
Task:
We want the model to predict the centre word “love”, given the context words “I” and “deep”.
So:
Context words → ["I", "deep"]
Step 1: Input Layer
Each word is a one-hot vector (size = vocabulary size = 4):
| Word | One-hot vector |
|---|---|
| I | [1, 0, 0, 0] |
| love | [0, 1, 0, 0] |
| deep | [0, 0, 1, 0] |
| learning | [0, 0, 0, 1] |
So our input layer is:
x₁ = [1, 0, 0, 0] → “I”
x₂ = [0, 0, 1, 0] → “deep”
Step 2: Hidden Layer
We use a weight matrix W with embedding size N = 2.
Assume W is 4×2 (one row per vocabulary word):
| Word | Dim 1 | Dim 2 |
|---|---|---|
| I | 0.2 | 0.4 |
| love | 0.1 | 0.3 |
| deep | 0.7 | 0.5 |
| learning | 0.6 | 0.9 |
Now multiply each one-hot vector by W:
- For “I”: [1,0,0,0] × W = [0.2, 0.4]
- For “deep”: [0,0,1,0] × W = [0.7, 0.5]
Then we average them (CBOW averages all context embeddings):
v = ([0.2,0.4] + [0.7,0.5]) / 2 = [0.45, 0.45]
This [0.45, 0.45] is our hidden layer output — the context representation.
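The hidden layer computation above can be reproduced in plain Python. This is a sketch of the projection and averaging step only; the output layer and training are not covered, and the helper name `project` is ours.

```python
# Rows of W are the embeddings of "I", "love", "deep", "learning" (values from the example).
W = [
    [0.2, 0.4],  # I
    [0.1, 0.3],  # love
    [0.7, 0.5],  # deep
    [0.6, 0.9],  # learning
]

def project(one_hot, weights):
    """Multiply a 1 x V one-hot row vector by the V x N matrix `weights`."""
    n = len(weights[0])
    return [sum(one_hot[i] * weights[i][j] for i in range(len(one_hot))) for j in range(n)]

context = [[1, 0, 0, 0], [0, 0, 1, 0]]  # "I" and "deep"
embeddings = [project(x, W) for x in context]

# CBOW averages the context embeddings to get the hidden layer output.
hidden = [sum(col) / len(embeddings) for col in zip(*embeddings)]
print([round(h, 2) for h in hidden])  # [0.45, 0.45]
```

Multiplying a one-hot vector by W simply selects one row of W, so in practice the multiplication is replaced by a table lookup.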



After understanding this simple instance, you might be able to understand ⬇️



Skip-Gram
The aim is to predict the context words from a single given centre word.
Limitation
The limitations of Word2Vec:
- Morphological Issues: Related words such as teach/teacher/teaching are treated as unrelated tokens, each with its own independent vector.
- Rare Word Problem: Low-frequency words are insufficiently trained.
- OOV (Out-of-Vocabulary) Problem: New words outside the vocabulary cannot be represented.
FastText
Made by Facebook AI Research.
Breaks words into n-gram subword units for training.
Example (n = 3): apple → { app, ppl, ple }; word vector = sum of subword vectors. FastText also marks word boundaries with < and >, which adds <ap and le> to the set.
Advantages:
- Supports rare words and new words (OOV);
- Captures morphological features (teach/teacher).
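A short sketch of the subword extraction, assuming a fixed n-gram size of 3 (FastText itself uses a range of sizes). The function name `char_ngrams` is illustrative and not part of the FastText library.

```python
def char_ngrams(word: str, n: int = 3, boundaries: bool = False) -> list[str]:
    """Return the character n-grams of `word`.

    With boundaries=True the word is wrapped in "<" and ">" first,
    which is how the FastText paper marks word starts and ends.
    """
    token = f"<{word}>" if boundaries else word
    return [token[i:i + n] for i in range(len(token) - n + 1)]

print(char_ngrams("apple"))                   # ['app', 'ppl', 'ple']
print(char_ngrams("apple", boundaries=True))  # ['<ap', 'app', 'ppl', 'ple', 'le>']

# Related words share subwords, which is what lets their vectors overlap.
shared = set(char_ngrams("teacher")) & set(char_ngrams("teaching"))
print(sorted(shared))  # ['ach', 'eac', 'tea']
```

An unseen word can still be given a vector as long as some of its n-grams were seen during training.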
GloVe
Addresses Word2Vec’s limitation of focusing only on local context windows.
Learns global distribution through P(k|i) (probability of context word k given centre word i).
Performs well on analogy tasks (king − man + woman ≈ queen).
Lecture 3 Word Classification with Machine Learning
Word Embedding Evaluation
Intrinsic Evaluation
Evaluates embeddings directly on linguistic subtasks.
Example: Word Analogies:
“man : woman :: king : ?” → model should output queen using vector arithmetic and cosine similarity.
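The analogy test can be sketched with hand-made vectors. The 2-dimensional embeddings below are hypothetical and chosen so the arithmetic works out; real embeddings have hundreds of dimensions and the result is only approximate.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Hypothetical 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def analogy(a, b, c):
    """Solve a : b :: c : ? by finding the word closest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "woman", "king"))  # queen
```

The three input words are excluded from the candidates, which is the usual convention in analogy benchmarks.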

Extrinsic Evaluation
Assesses embeddings on real NLP tasks: POS tagging, Dependency Parsing, Language Modelling etc.
- Advantage: Measures end-to-end performance (real outcome).
- Disadvantage: Training and interpretation are slower and more complex.

Deep Neural Network for NLP
Perceptron and Neural Network (NN)
What is Perceptron?
A single neuron:
- Inputs = features (e.g. word embeddings).
- Weights & bias adjust during training.
- Activation function f(·) adds non-linearity.
Common Activation Functions:
- Sigmoid: smooth output in (0, 1), interpretable as a probability
- Tanh: output range (-1, 1)
- ReLU: max(0, x), helps mitigate the vanishing gradient problem
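The three activation functions are one-liners. A minimal sketch on scalar inputs:

```python
import math

def sigmoid(x: float) -> float:
    """Squash x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x: float) -> float:
    """Squash x into (-1, 1)."""
    return math.tanh(x)

def relu(x: float) -> float:
    """Pass positive values through, clip negatives to 0."""
    return max(0.0, x)

for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 3), round(tanh(x), 3), relu(x))
```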

Multilayer Perceptron
Input -> Hidden -> Output
Multilayer Perceptron can learn non-linear patterns.
Now we need to understand an important concept: Back-propagation.
Back-propagation is how the weights of a neural net are tuned: the error (i.e. loss) from the forward pass is propagated backwards through the layers using the chain rule, giving the gradient of the loss with respect to each weight. Updating the weights with these gradients in each iteration lowers the error rate, which makes the model more reliable.

To understand Back-propagation, here’s a real example:
Cost Function - Method 1: MSE (Mean Squared Error)
Mean Squared Error (MSE) is the most commonly used regression loss function. MSE is the average of the squared distances between the target values and the predicted values.
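As a quick sketch of the definition (the function name `mse` is ours):

```python
def mse(targets, predictions):
    """Mean of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # (0 + 0 + 4) / 3
```

Squaring means one large error is penalised more than several small ones.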

Cost Function - Method 2: ACE (Averaged Cross Entropy Error)
Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label.
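A minimal sketch for the binary case, assuming labels are 0 or 1 and predicted probabilities lie strictly between 0 and 1 (otherwise the logarithm is undefined):

```python
import math

def binary_cross_entropy(labels, probs):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [1, 0, 1]
confident = binary_cross_entropy(labels, [0.9, 0.1, 0.8])
unsure = binary_cross_entropy(labels, [0.6, 0.4, 0.5])
wrong = binary_cross_entropy(labels, [0.2, 0.9, 0.1])
print(confident, unsure, wrong)  # loss grows as predictions diverge from labels
```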

Here comes another important concept: Gradient Descent (Optimiser)
In neural networks, the key technique to arrive at the optimal weights is the Gradient Descent algorithm. This algorithm relies on a hyperparameter called the learning rate, which allows one to moderate the rate of weight change, such that the cost function is minimised.
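The effect of the learning rate can be seen on a one-parameter toy problem. The cost function below is an assumption chosen for illustration, not a neural network loss.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeatedly move w against the gradient of the cost."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# Toy cost J(w) = (w - 3)^2 with gradient 2 * (w - 3); minimum at w = 3.
def grad(w):
    return 2 * (w - 3)

print(gradient_descent(grad, w0=0.0, learning_rate=0.1, steps=100))  # close to 3
print(gradient_descent(grad, w0=0.0, learning_rate=1.1, steps=100))  # diverges: rate too large
```

A learning rate that is too small converges slowly, while one that is too large overshoots the minimum on every step and moves further away.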

Lecture 4 Sequence Model and Recurrent Neural Networks
Seq2Seq Learning
Goal: Extend standard machine learning (N-to-1, N-to-N tasks) to handle temporal sequences.
The key innovation is introducing the concept of “time” — treating each word as part of a sequence rather than an independent item.
| Task Type | Example | Description |
|---|---|---|
| N→1 | Sentiment Analysis | Input a sentence, output one label |
| N→N | PoS Tagging | Each word has a tag |
| N→M | Translation | Input English, output Chinese |
Examples:
- Speech recognition: Speech → Text
- Video labelling: Frames → Scene Labels
- PoS tagging: Words → Tags
- Machine translation: English → Chinese
- Dialogue / Chatbot: Input sentence → Response
- Sentence completion: Predict missing or next words
Seq2Seq Deep Learning
RNN(Recurrent Neural Network)
Process:
1. Input Sequence - Feed sequential data (text, time series) into the network one element at a time
2. Hidden State Initialisation - Set initial hidden state h₀ (usually the zero vector)
3. Recurrent Processing - At each time step t:
   - Receive current input xₜ and previous hidden state hₜ₋₁
   - Compute new hidden state: hₜ = f(Wₓₕ·xₜ + Wₕₕ·hₜ₋₁ + b)
   - Where f is an activation function (such as tanh or ReLU)
4. Output Generation - Generate output yₜ based on current hidden state hₜ
5. State Propagation - Pass hₜ to the next time step
6. Repeat - Repeat steps 3-5 for each element in the sequence
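The recurrent update can be sketched in plain Python. The weights and inputs below are hypothetical, f is taken to be tanh, and the output layer is left out.

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """Compute h_t = tanh(W_xh . x_t + W_hh . h_prev + b) for one time step."""
    h_t = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

# Hypothetical weights: 2-d inputs, 2-d hidden state.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [0.0, 0.0]  # h_0 is the zero vector
for x_t in sequence:
    h = rnn_step(x_t, h, W_xh, W_hh, b)  # same weights reused at every step
    print([round(v, 3) for v in h])
```

Because the same weight matrices are applied at every step, gradients are multiplied by them repeatedly during training, which is the source of the vanishing and exploding gradient problems.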
Problem:
- Vanishing Gradient Issue
- Exploding Gradient

LSTM(Long Short-Term Memory)
LSTM introduces gates to control what information to keep or forget.
Information from the previous hidden state and the current input is passed through a sigmoid function, which outputs values between 0 and 1: values closer to 0 mean forget, and values closer to 1 mean keep.
LSTM mitigates the vanishing gradient problem by preserving long-term information via the cell state.
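The gating idea can be illustrated with the forget gate alone. This is a sketch of one gate, not a full LSTM cell (the input and output gates are omitted), and the pre-activation values are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forget_gate(cell_prev, gate_inputs):
    """Scale each cell-state entry by a sigmoid gate value in (0, 1)."""
    gates = [sigmoid(z) for z in gate_inputs]
    return [g * c for g, c in zip(gates, cell_prev)]

cell_prev = [2.0, -1.5, 0.7]
# Hypothetical pre-activations; in an LSTM they are computed from
# the previous hidden state and the current input.
gate_inputs = [6.0, -6.0, 0.0]
kept = forget_gate(cell_prev, gate_inputs)
print([round(c, 3) for c in kept])  # first entry kept, second forgotten, third halved
```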

GRU(Gated Recurrent Unit)

Data Transformation for Deep Learning NLP
Purpose: Prepare textual data for input into neural networks.
Examples:
- Image classification → pixels to classes
- Topic classification → document to label
- Visual Q&A → image + question → answer
Data Transformation Techniques

Lecture 5 Language Fundamental
Sequential Text Classification
Used for sentiment analysis, PoS tagging, sequence labelling, etc.
RNN (N -> 1 Task)
Input: sequence of words → Output: single label (e.g., positive/negative sentiment).
Bi-RNN and Bi-LSTM
Process sequence in both directions (forward & backward).
Combine information via concatenation or summation.
Summation is more efficient (the output keeps the same dimension) but may lose information, because the forward and backward states are merged into one vector.
Application: Sentiment Analysis
Sentiment analysis identifies the emotion, attitude, or opinion expressed in text.
Key Challenges:
- Sarcasm Detection
- Positive words used to express negative intent.
- Context understanding is essential.
- Word Ambiguity
- Same word may change polarity by context.
- e.g., “unpredictable story” (positive) vs “unpredictable steering” (negative).
- Multipolarity
- One text may have mixed sentiments about different aspects.
- e.g. “The audio quality of my new laptop is so cool but the display colors are not too good.”
Sentiment Components
| Element | Description |
|---|---|
| Target Object | The entity being evaluated (e.g. iPhone) |
| Attributes | Components (battery) and properties (weight, colour) |
| Attitude Holder | The person expressing opinion |
| Attitude Type | Positive / Negative / Neutral |
| Time | When the sentiment was expressed |
Sentiment Task Levels
| Type | Example |
|---|---|
| Basic | Positive or Negative |
| Likert Scale | Rank 1-5 |
| Advanced | Detect targets/aspects/emotions (joy, anger, fear, etc.) |
Emotion resources: EmoLex, NRC Twitter Emotion Corpus, SenticNet, EmoSenticNet.
Evaluation Methods
Evaluation Metrics:
- Precision: TP / (TP + FP). Prioritise it when false positives are costly (e.g., spam detection).
- Recall: TP / (TP + FN). Prioritise it when false negatives are costly (e.g., medical diagnosis).
- F1-score: Balances precision and recall.
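A minimal sketch of the three metrics for binary labels; the example labels are made up, and the function name is ours.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```

Here there are 3 true positives, 1 false positive and 1 false negative. F1 is the harmonic mean of precision and recall, so it is low whenever either of the two is low.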
Sequential Language Model
Language Model
A Language Model (LM) predicts the probability of the next word given the previous ones.
e.g. P(here | can, you, come)
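The simplest way to estimate such a probability is by counting. The sketch below is a simplification: it conditions on only the single previous word (a bigram model) rather than the full history, and the three-sentence corpus is made up.

```python
from collections import Counter, defaultdict

corpus = [
    "can you come here",
    "can you come back",
    "you come here often",
]  # toy corpus (assumed)

# Count how often each word follows each previous word.
following = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1

def next_word_prob(nxt, prev):
    """Maximum-likelihood estimate of P(nxt | prev) from bigram counts."""
    total = sum(following[prev].values())
    return following[prev][nxt] / total if total else 0.0

print(next_word_prob("here", "come"))  # 2 of the 3 words seen after "come"
```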
Neural Language Model
Traditional Neural LM
Fixed-size input window → cannot capture long dependencies.
RNN-based LM
- Processes variable-length input, shares weights across time.
- Learns long-term sequence patterns.
- Can generate text by sampling next words recursively.
Language Fundamental
| Level | Description | Explanation |
|---|---|---|
| Phonology | Study of sounds in a language | Studies the sound patterns of a language |
| Morphology | Study of word formation and morphemes | Studies the structure and formation of words |
| Syntax | Word order and grammatical structure | Studies the grammatical structure of sentences |
| Semantics | Meaning of words and combinations | Studies word meaning and compositional meaning |
| Pragmatics | Meaning in context | Studies the meaning of language in context |

Text Preprocessing
Every NLP task requires text preprocessing.
- Tokenisation
  • Split text into tokens.
  • Handle multilingual and compound words (German, Japanese, Arabic).
- Normalisation
  • Unify text format (e.g., U.S.A. → USA).
- Case Folding
  • Convert all to lowercase except when case matters (e.g., “US” ≠ “us”).
- Lemmatisation
  • Convert words to dictionary base form using grammar context.
- Stemming
  • Remove affixes crudely; faster but less accurate than lemmatisation.
- Sentence Segmentation
  • Detect sentence boundaries, handle abbreviations like “Dr.”, “Inc.”.
- Regular Expressions
  • Used for token matching and rule-based preprocessing.
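Three of these steps can be sketched with the standard library alone. The suffix-stripping function is a toy stand-in for a real stemmer, and the `keep` set for case folding is an assumption made for this example.

```python
import re

def tokenise(text: str) -> list[str]:
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def case_fold(tokens: list[str], keep: frozenset = frozenset({"US"})) -> list[str]:
    """Lowercase every token except those listed in `keep`."""
    return [t if t in keep else t.lower() for t in tokens]

def crude_stem(token: str) -> str:
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = case_fold(tokenise("The US teams were Teaching us, and we walked."))
print(tokens)
print([crude_stem(t) for t in tokens])
```

The stemmer maps “teaching” to “teach” and “walked” to “walk” but leaves “were” alone; a lemmatiser would map it to “be”, which requires a dictionary and grammatical context.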
Stemming vs Lemmatisation
Lecture 6 Part of Speech Tagging
Concept
Part-of-Speech (PoS) tagging assigns a syntactic category (noun, verb, adjective, etc.) to each word in a sentence.
e.g. Emma has a beautiful flower → NNP VBZ DT JJ NN
Tagging Criteria
- Distributional criteria: Where can the words occur?
- For example, nouns can appear with possession: “his car”, “her idea”
- Morphological criteria: What form does the word have? (E.g. -tion, -ize). What affixes can it take? (E.g. -s, -ing, -est).
- -ness, -tion, -ity, and -ance tend to indicate nouns (happiness, exertion, levity, significance).
- Notional (or semantic) criteria: What sort of concept does the word refer to? (E.g. nouns often refer to ‘people, places or things’). This criterion is more problematic and less useful for us.
- Nouns generally refer to living things (mouse), places (Perth), non-living things (computer), or concepts (marriage).
PoS Tagging Challenges
Given an input text, tag each word correctly:
There/ was/ still/ lemonade/ in/ the/ bottle/
In the sentence above, “bottle” is a noun, not a verb, and “still” could be an adjective or an adverb.
The purpose of PoS Tagging
Essential ingredient in natural language applications.
- Useful in and of itself (more than you’d think)
- Text-to-speech: record, lead
- Lemmatization: saw[v] → see, saw[n] → saw
- Linguistically motivated word clustering
- Useful as a pre-processing step for parsing
- Useful as features to downstream systems.
Baseline Approaches
Rule-based Model
Old POS taggers used to work in two stages, based on hand-written rules:
- The first stage identifies a set of possible POS for each word in the sentence (based on a lexicon), and
- the second uses a set of hand-crafted rules in order to select a POS from each of the lists for each word
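The two stages can be sketched as follows. The lexicon entries and the single disambiguation rule are illustrative assumptions, using Penn Treebank style tags; a real rule-based tagger has a large lexicon and many rules.

```python
# Stage 1 lexicon: every tag each word can take (toy entries, assumed).
LEXICON = {
    "there": ["EX"],
    "was": ["VBD"],
    "still": ["RB", "JJ"],
    "lemonade": ["NN"],
    "in": ["IN"],
    "the": ["DT"],
    "bottle": ["VB", "NN"],
}

def tag(words):
    """Return (word, tag) pairs using a lexicon lookup plus one rule."""
    tags = []
    for word in words:
        candidates = LEXICON.get(word.lower(), ["NN"])  # unknown words default to noun
        choice = candidates[0]
        # Stage 2 rule (hand-written, illustrative): after a determiner,
        # prefer a noun reading over any other reading.
        if tags and tags[-1] == "DT" and "NN" in candidates:
            choice = "NN"
        tags.append(choice)
    return list(zip(words, tags))

print(tag("There was still lemonade in the bottle".split()))
```

Without the rule, “bottle” would receive its first listed tag (verb); the determiner rule selects the noun reading.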

Look-up Table Model

N-Gram Model


Seq2Seq(N-to-N) Approaches
Other N-to-N Sequence Tasks
Named Entity Recognition
Natural Language Understanding: Slot Tagging
Lecture 7 Attention
We will introduce two key advanced NLP topics:
- Question Answering (QA) — how machines answer human questions.
- Attention Mechanism — how deep learning models learn to “focus” on important information during sequence processing.
Question Answering
QA is a field combining information retrieval (IR) and NLP, where systems automatically answer human questions written in natural language.
Type of Questions
| Type | Example |
|---|---|
| Yes/No | Are you a student? |
| Wh- (who, what, where, when, why, how) | When did you get to this lecture? |
| Choice | Which one of these cities is in Australia? |
| Factoid | “Who is the president of France?” — answer exists in text as a short phrase. |
Key Questions in Building QA systems
1. What do the answers look like?
2. Where can we get the answers from?
3. What does our training data look like?
Knowledge-based Question Answering
Convert a natural language question into a logical query that can be executed against a knowledge base (KB) like DBPedia or Freebase.
Example:
Question: When was Justin Bieber born?
Implementation
1. Map question text → logical form (e.g., SQL/SPARQL query).
2. Execute query in the knowledge base.
3. Return factual answer.
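The three steps can be sketched with an in-memory dictionary standing in for the knowledge base. Everything here is a toy assumption: the relation name `birth_date`, the single question pattern, and the use of a regular expression instead of a learned semantic parser or a SPARQL endpoint.

```python
import re

# Toy knowledge base of (entity, relation) -> value entries (assumed contents).
KB = {
    ("Justin Bieber", "birth_date"): "1994-03-01",
}

def to_logical_form(question: str):
    """Map a question to an (entity, relation) query; covers one pattern only."""
    match = re.fullmatch(r"When was (.+) born\?", question.strip())
    if match:
        return (match.group(1), "birth_date")
    return None

def answer(question: str):
    query = to_logical_form(question)        # step 1: question -> logical form
    return KB.get(query) if query else None  # steps 2-3: execute and return

print(answer("When was Justin Bieber born?"))
print(answer("Where was Justin Bieber born?"))  # None: pattern not covered
```

The second call shows the main limitation: a question that does not map to a query the schema supports cannot be answered.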
Pros and Cons
✅ Producing a logical form instead of a direct answer makes the system robust.
✅ The answer is independent of the question and the parsing mechanism.
✅ Answers can come from trustworthy sources.
❌ Constrained to questions that can be queried against the database schema.
❌ Well-structured training datasets are difficult to find.
IR-based Question Answering (Reading Comprehension)
Find relevant passages from a large corpus (e.g., Wikipedia) → extract answer span.
Generic Neural Model for Reading Comprehension
1. Convert text & question to word vectors.
2. Encode both using sequence models (LSTM).
3. Combine question & context via attention.
4. Predict start and end tokens of answer span.
Attention
In Seq2Seq models (e.g., translation), the encoder must compress all input information into one vector, which creates an information bottleneck.
On each decoding step, attention lets the model focus on relevant parts of the input sequence.
How it works?
1. Compute attention scores between decoder hidden state (query) and encoder hidden states (values).
2. Apply softmax to get a distribution.
3. Compute weighted sum to form context vector.
4. Concatenate context vector with decoder hidden state to generate next output.
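The four steps can be sketched in plain Python. The sketch assumes dot-product scoring (other scoring functions exist) and uses made-up hidden states.

```python
import math

def softmax(scores):
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(decoder_state, encoder_states):
    # 1. Dot-product score between the query and every encoder state.
    scores = [sum(q * k for q, k in zip(decoder_state, h)) for h in encoder_states]
    # 2. Softmax turns the scores into a distribution.
    weights = softmax(scores)
    # 3. Context vector = weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]
    # 4. Concatenate the context vector with the decoder state.
    return weights, context + decoder_state

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical values
decoder_state = [2.0, 0.0]
weights, combined = attention(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
print([round(v, 3) for v in combined])
```

The first and third encoder states align with the decoder state and receive most of the weight, so the context vector is dominated by them.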
Benefits of Attention
✅ Helps with the vanishing gradient problem by providing a shortcut to distant states.
✅ Allows decoder to “look back” directly at the input.
✅ Improves translation & comprehension accuracy.
✅ Provides interpretability — shows where model focuses.
Categories
| Category | Definition |
|---|---|
| Global / Local | Attend to all input states / part of them |
| Self-Attention | Each word attends to other words in same sequence (used in Transformer) |
An example of Self-Attention:
In reading comprehension, self-attention helps relate current word to previous sentence context — allowing deeper contextual understanding.
Another example of Bi-LSTM + Attention:
For each question–document pair:
- Encode question and document separately.
- Use attention to connect question representations with document words.
- Predict start and end positions of the answer.
Question Answering with Attention
Some questions require non-textual context (images, videos).
e.g. Question: “What are sitting in the basket on a bicycle?” Answer: “Dogs.”
Open-domain (textual) question answering
Uses a Retriever–Reader framework (Chen et al., 2017):
1. Retriever: fetch top-K relevant passages from large corpus (e.g., Wikipedia).
2. Reader: apply reading comprehension model to extract answer.
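The retriever half can be sketched with simple word overlap. This is a stand-in for a real ranking function such as TF-IDF weighting, the passages are made up, and the reader step is left out because it needs a trained model.

```python
def retrieve(question, passages, k=2):
    """Rank passages by how many question words they share; return the top k."""
    q_words = set(question.lower().split())

    def overlap(passage):
        return len(q_words & set(passage.lower().split()))

    return sorted(passages, key=overlap, reverse=True)[:k]

passages = [
    "The Seine flows through Paris",
    "Canberra is the capital of Australia",
    "Paris is the capital of France",
]  # toy corpus (assumed)

top = retrieve("what is the capital of France", passages, k=1)
print(top)  # the reader would then extract the answer span from this passage
```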
This lets a system answer a wide range of factual questions from vast text collections.

Lecture 8 Transformer
This lecture explains the evolution of Machine Translation (MT) — from Statistical Machine Translation (SMT) to Neural Machine Translation (NMT), and finally to the Transformer model that revolutionised modern NLP. It also introduces the rise of pre-trained language models like BERT and GPT.
Machine Translation
Translate a sentence from one language (source) into another language (target).
e.g. 生命短暂 → life is short
Statistical Machine Translation
Learning a probabilistic model from data.
- Translation Model P(x|y): ensures fidelity
- Language Model P(y): ensures fluency
What is Alignment?

Neural Machine Translation
Transformer
The rise of the Pre-trained Model
Lecture 9 Advanced NLP: Pretrained Models in NLP
…
Lecture 10 Advanced NLP: Natural Language Generation
…
Lecture 11 Advanced NLP: GPT
…
Lecture 12 Course Review and Exam Guide
…
