Tsinghua OpenBMB NLP
These are notes on big models, from a course jointly run by Tsinghua and the OpenBMB community! Talk is cheap, show me the code.
Outline
- Course outline:
- Basic Knowledge of Big Models
- L1-NLP Big Model Basics (GPU server, Linux, Bash, Conda, …)
- L2-Neural Network Basics (PyTorch)
- L3-Transformer and PLMs (Huggingface Transformers)
- Key Technology of Big Models
- L4-Prompt Tuning & Delta Tuning (OpenPrompt, OpenDelta)
- L5-Efficient Training & Model Compression (OpenBMB suite)
- L6-Big-Model-based Text understanding and generation
- Interdisciplinary Application of Big Models
- L7-Big Models X Biomedical Science
- L8-Big Models X Legal Intelligence
- L9-Big Models X Brain and Cognitive Science
L1 NLP Basics
- Some NLP tasks:
- POS tagging: label the part of speech of each word in a sentence
- Named entity recognition: find the named entities in a sentence
- Coreference resolution: determine which entity a pronoun refers to
- Recognizing syntactic structure and dependency relations in a sentence
- Chinese word segmentation
- text matching
- query engine
- Knowledge graphs
- machine reading
- machine translation
- Human-machine dialogue
- Personal Assistant
- Sentiment Analysis and Opinion Mining
- Computational Social Science
- Social change
- Psychological change
- …
Word Representation
The basic problem: how to represent words.
Enable machines to represent words and compute word similarity.
Enable machines to understand the semantic relations between words.
One-hot representation (one dimension per vocabulary word)
- All the vectors are orthogonal. No natural notion of similarity for one-hot vectors
Use context words to represent the current word.
- Increase in size with vocabulary
- Require a lot of storage
- Sparsity issues for those less frequent words -> Subsequent classification models will be less robust
Word Embedding: Distributed Representation
- Build a dense vector for each word learned from large-scale text corpora
- Learning method: Word2Vec (We will learn it in the next class)
Language Models
- Two capabilities
- Compute the joint probability of a sequence of words.
- Predict the upcoming word given the preceding words.
- Assumption: each word depends only on the words before it, so the joint probability is a simple product of conditional probabilities (see the formulas after this list).
- N-gram Model:
- Simple counting: predict the next word from occurrence frequencies (pick the most frequent continuation). It relies on the Markov assumption: only a limited number of preceding words are used when counting.
- Not considering contexts farther than 1 or 2 words
- Not capturing the similarity between words
- Neural Language Model:
- A neural language model is a language model based on neural networks to learn distributed representations of words
- Associate words with distributed vectors
- Compute the joint probability of word sequences in terms of the feature vectors
- Optimize the word feature vectors (embedding matrix E) and the parameters of the loss function (map matrix W)
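As a compact restatement of the assumptions above (standard formulation, added here for reference): the chain rule factorizes the joint probability, and an n-gram model truncates the context,

$$P(w_1,\dots,w_T)=\prod_{t=1}^{T}P(w_t\mid w_1,\dots,w_{t-1})\approx\prod_{t=1}^{T}P(w_t\mid w_{t-n+1},\dots,w_{t-1})$$

with each factor estimated by counting, e.g. for bigrams

$$P(w_t\mid w_{t-1})=\frac{\mathrm{count}(w_{t-1},w_t)}{\mathrm{count}(w_{t-1})}$$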
Big Models
- Why Big Models
- As model size and training data scale up, performance and capabilities improve significantly.
- Capabilities:
- World Knowledge
- Common Sense
- Logical Reasoning
- Interest in big models keeps rising
- Why LLMs work: Large-scale Unlabeled Data (Model Pre-training) -> Task-specific Training Data (Model Fine-tuning) -> Final Model
- The basic paradigm of pre-training and fine-tuning can be traced back to transfer learning. Humans can apply previously learned knowledge to handle new problems faster, and we want machines to have similar abilities.
- Prerequisites
- GPU
- You own
- Rent
- Use Google colab
- SSH
- Linux command
- Vim
- Tmux
- Virtual environment & conda & pip
- Vscode + remote connection
- Git
- Bash
L2 NN Basics
Outline
Neural Network Components
Simple Neuron; Multilayer; Feedforward; Non-linear; …
How to Train
- Objective; Gradients; Backpropagation
Word Representation: Word2Vec
- Common Neural Networks
- RNN
- Sequential Memory; Language Model
- Gradient Problem for RNN
- Variants: GRU; LSTM; Bidirectional;
- CNN
NLP Pipeline Tutorial (PyTorch)
How NN works
A single layer neural network: Hooking together many simple neurons. Multilayer Neural Network: Stacking multiple layers of neural networks.
Forward Propagation & Backward Propagation.
Without non-linearities, deep neural networks cannot do anything more than a linear transform. Extra layers could just be compiled down into a single linear transform. With non-linearities, neural networks can approximate more complex functions with more layers!
Input -> Hidden -> Output; the output layer depends on the task:
Linear output: for predicting continuous values.
Sigmoid output: squashes the output into (0, 1); suitable for binary classification.
Softmax output: for multi-class classification.
Choices of non-linearities: Sigmoid, Tanh, ReLU
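A minimal PyTorch sketch of the three output choices (the layer sizes here are arbitrary placeholders):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 16)                                    # a batch of 4 hidden vectors of size 16
regression = nn.Linear(16, 1)(x)                          # linear output: raw continuous values
binary_prob = torch.sigmoid(nn.Linear(16, 1)(x))          # sigmoid output: squashed into (0, 1)
class_prob = torch.softmax(nn.Linear(16, 5)(x), dim=-1)   # softmax output: 5-way probabilities
```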
Summary
- Simple neuron
- Single layer neural network
- Multilayer neural network
- Stack multiple layers of neural networks
- Non-linearity activation function
- Enable neural nets to represent more complicated features
- Output layer
- For desired output
Training NN
- Loss function: Mean Squared Error (MSE), used to measure how well a regression model fits.
- Loss function: Cross-entropy, used for classification; it measures the negative log probability assigned to the correct class.
- Minimizing the loss: Stochastic Gradient Descent (SGD); see the sketch after this list.
- Chain rule: used to compute gradients in neural networks.
- Backpropagation
- Compute gradients algorithmically
- Used by deep learning frameworks (TensorFlow, PyTorch, etc.)
- Computational Graphs: Representing our neural net equations as a graph
- Source node: inputs
- Interior nodes: operations
- Edges pass along result of the operation
- Go backwards along edges: Pass along gradients
- Single Node:
- Node receives an “upstream gradient”
- Goal is to pass on the correct “downstream gradient”
- Each node has a local gradient: The gradient of its output with respect to its input. [downstream gradient] = [upstream gradient] x [local gradient]
- Summary:
- Forward pass: compute results of operation and save intermediate values
- Backpropagation: recursively apply the chain rule along computational graph to compute gradients
- [downstream gradient] = [upstream gradient] x [local gradient]
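Putting these pieces together, a minimal PyTorch sketch of one training step (toy network sizes and randomly generated data, for illustration only):

```python
import torch
import torch.nn as nn

# A toy multilayer network for 3-way classification.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
loss_fn = nn.CrossEntropyLoss()                          # negative log-probability of the correct class
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # stochastic gradient descent

inputs = torch.randn(8, 10)                              # a mini-batch of 8 examples
labels = torch.randint(0, 3, (8,))                       # gold class indices

logits = model(inputs)                                   # forward pass: intermediate values are saved
loss = loss_fn(logits, labels)                           # compute the loss
loss.backward()                                          # backpropagation via the chain rule
optimizer.step()                                         # update parameters with the gradients
optimizer.zero_grad()                                    # clear gradients for the next step
```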
NN Example: Word2Vec
Word2vec uses shallow neural networks that associate words to distributed representations.
Typical Models: Word2vec can utilize two architectures to produce distributed representations of words:
- Continuous bag-of-words (CBOW)
- Continuous skip-gram
Sliding Window:
Word2vec uses a sliding window of a fixed size moving along a sentence
In each window, the middle word is the target word, other words are the context words
- Given the context words, CBOW predicts the probabilities of the target word
- While given a target word, skip-gram predicts the probabilities of the context words
One direction is Context -> Word (CBOW), the other is Word -> Context (skip-gram).
Continuous Bag-of-Words
- In CBOW architecture, the model predicts the target word given a window of surrounding context words
- According to the bag-of-word assumption: The order of context words does not influence the prediction
Continuous Skip-Gram: In skip-gram architecture, the model predicts the context words from the target word
Problems of Full Softmax: When the vocabulary size is very large
- Softmax for all the words every step depends on a huge number of model parameters, which is computationally impractical
- We need to improve the computation efficiency
Improving Computational Efficiency
- In fact, we do not need a full probabilistic model in word2vec
- There are two main improvement methods for word2vec:
- Negative sampling
- As we discussed before, the vocabulary is very large, which means our model has a tremendous number of weights that need to be updated every step
- The idea of negative sampling is, to only update a small percentage of the weights every step
- Then we can compute the loss, and optimize the weights (not all of the weights) every step
- Suppose we have a weight matrix of size 300×10,000 and only 5 output words are involved in a step
- Then we only need to update 300×5 weights, which is only 0.05% of all the weights (see the sketch after this list)
- Hierarchical softmax
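A minimal sketch of skip-gram with negative sampling in PyTorch (hypothetical word ids and sizes); only the embedding rows of the target word, the true context word, and the sampled negatives receive gradients in a step:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim, num_neg = 10_000, 300, 5              # sizes follow the example above

in_embed = nn.Embedding(vocab_size, dim)               # target-word embeddings
out_embed = nn.Embedding(vocab_size, dim)              # context-word ("output") embeddings

target = torch.tensor([42])                            # hypothetical target word id
context = torch.tensor([128])                          # one true context word
negatives = torch.randint(0, vocab_size, (num_neg,))   # randomly sampled negative words

v = in_embed(target)                                   # (1, dim)
pos_score = (v * out_embed(context)).sum(-1)           # dot product with the true context word
neg_score = (v * out_embed(negatives)).sum(-1)         # dot products with the negative samples

# Maximize the positive score and minimize the negative ones; only 1 + num_neg
# rows of out_embed (plus one row of in_embed) are updated per step.
loss = -F.logsigmoid(pos_score).mean() - F.logsigmoid(-neg_score).mean()
loss.backward()
```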
Other Tips for Learning Word Embeddings
- Sub-sampling: balance the sampling probability of frequent and rare words.
- Soft sliding window: the window size is not fixed; a value is randomly sampled from a range and used as the window size.
RNN
- Key concept for RNNs: Sequential memory during processing sequence data
- Definition: a mechanism that makes it easier for your brain to recognize sequence patterns
- RNNs update the sequential memory recursively for modeling sequence data
- Application Scenarios
- Sequence Labeling
- Given a sentence, the lexical properties of each word are required
- Sequence Prediction
- Given the temperature for seven days a week, predict the weather conditions for each day
- Photograph Description
- Given a photograph, create a sentence that describes the photograph
- Text Classification
- Given a sentence, distinguish whether the sentence has a positive or negative emotion
- Advantages & Disadvantages
- Advantages
- Can process any length input
- Model size does not increase for longer input
- Weights are shared across timesteps
- Computation for step i can (in theory) use information from many steps back
- Disadvantages
- Recurrent computation is slow
- In practice, it’s difficult to access information from many steps back.
- Gradient vanish or explode
GRU
Introduce gating mechanism into RNN
- Update gate
- Reset gate
Gates are used to balance the influence of the past and the input
If the reset gate is close to 0, the previous hidden state is ignored, which indicates that the current activation is irrelevant to the past.
The update gate controls how much of the past state should matter compared to the current activation.
LSTM
Long Short-Term Memory network (LSTM). LSTM is a special kind of RNN, capable of learning long-term dependencies like GRU.
Cell state $c_t$:
- Extra vector for capturing long-term dependency
- Runs straight through the entire chain, with only some minor linear interactions
- Easy to remove or add information to the cell state
Steps:
- The first step is to decide what information to throw away from the cell state: forget gate
- The next step is to decide what information to store in the cell state
- Update the old cell state. Combine the results from the previous two steps.
- The final step is to decide what information to output -> Adjust the sentence information for a specific word representation.
Powerful especially when stacked and made even deeper (each hidden layer is already computed by a deep internal network). Very useful if you have plenty of data.
Bidirectional RNNs
In traditional RNNs, the state at time t only captures information from the past. Problem: in many applications, we want to have an output depending on the whole input sequence. E.g. handwriting recognition & speech recognition
Recurrent Neural Network
- Sequential Memory
- Gradient Problem for RNN
RNN Variants
- Gated Recurrent Unit (GRU)
- Long Short-Term Memory Network (LSTM)
- Bidirectional Recurrent Neural Network
CNN
- Convolutional Neural Networks
- Generally used in Computer Vision
- Achieve promising results in a variety of NLP tasks:
- Sentiment classification
- Relation classification
- CNNs are good at extracting local and position-invariant patterns
- CNNs extract patterns by:
- Computing representations for all possible n-gram phrases in a sentence.
- Without relying on external linguistic tools (e.g., dependency parser)
- Architecture: Input Layer -> Convolutional Layer -> Max-pooling Layer -> Non-linear Layer
- Input Layer: Transform words into input representations x via word embeddings
- Extract feature representation from input representation via a sliding convolving filter.
- Application Scenarios: Object Detection, Video Classification, Speech Recognition, Text Classification
- CNN vs RNN
- CNN:
- Extracting local and position-invariant features
- Less parameters
- Better parallelization within sentences
- RNN:
- Modeling long-range context dependency
- More parameters
- Cannot be parallelized within sentences
Pytorch Demo
Pipeline for Deep Learning: prepare data -> build model -> train model -> evaluate model -> test model
Context
- target: to predict next word
- input: never too old to learn
- output: too old to learn English
- model: LSTM
- loss: cross_entropy
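A minimal sketch of this demo (the vocabulary is built from the sentence itself; layer sizes and the training schedule are placeholders):

```python
import torch
import torch.nn as nn

sentence = "never too old to learn English".split()
vocab = {w: i for i, w in enumerate(sorted(set(sentence)))}
words = list(vocab)                                  # index -> word

# Each prefix predicts the next word: "never" -> "too", ..., "learn" -> "English".
inputs = torch.tensor([[vocab[w] for w in sentence[:-1]]])
targets = torch.tensor([[vocab[w] for w in sentence[1:]]])

class NextWordLSTM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.out(h)                           # (batch, seq_len, vocab)

model = NextWordLSTM(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(200):                              # train model
    logits = model(inputs)
    loss = nn.functional.cross_entropy(logits.view(-1, len(vocab)), targets.view(-1))
    optimizer.zero_grad(); loss.backward(); optimizer.step()

pred = model(inputs).argmax(-1)[0]                   # evaluate / test model
print(sentence[:-1], "->", [words[i] for i in pred.tolist()])
```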
L3 Transformer and PLM
- Transformer
- Attention Mechanism
- Transformer Structure
- Pretrained Language Models
- Language Modeling
- Pre-trained Language Models (PLMs)
- Fine-tuning Approaches
- PLMs after BERT
- Applications of Masked LM
- Frontiers of PLMs
- Transformers Tutorial
- Introduction
- Frequently-used APIs
- Quick Start
- Demo
Transformer
Attention Mechanism
The Bottleneck Problem
- The single vector of source sentence encoding needs to capture all information about the source sentence
- The single vector limits the representation capacity of the encoder: the information bottleneck
Attention
- Attention provides a solution to the bottleneck problem
- Core idea: at each step of the decoder, focus on a particular part of the source sequence
A more general definition of attention: Given a query vector and a set of value vectors, the attention technique computes a weighted sum of the values according to the query
Intuition:
- Based on the query, the weighted sum is a selective summary of the values.
- We can obtain a fixed-size representation of an arbitrary set of representations via the attention mechanism.
Attention Variants: Attention has a lot of variants.
Insights:
- Attention solves the bottleneck problem: The decoder could directly look at source
- Attention helps with vanishing gradient problem: By providing shortcuts to long-distance states
- Attention provides some interpretability:
- We can find out what the decoder was focusing on by the attention map:
- Attention allows the network to align relevant words
Transformer Structure
- Motivations
- Sequential computation in RNNs prevents parallelization
- Despite using GRU or LSTM, RNNs still need an attention mechanism, which provides access to any state
- Maybe we do not need RNNs? -> Attention is all you need
- Transformer
- Architecture: encoder-decoder
- Input: byte pair encoding + positional encoding
- Model: stack of several encoder/decoder blocks
- Output: probability of the translated word
- Loss function: standard cross-entropy loss over a softmax layer
Input
Byte Pair Encoding (BPE)
- A word segmentation algorithm
- Start with a vocabulary of characters
- Repeatedly turn the most frequent pair of adjacent units into a new unit (merge step)
Byte Pair Encoding (BPE)
- Solve the OOV (out of vocabulary) problem by encoding rare and unknown words as sequences of subword units
- For example, the OOV word “lowest” would be segmented into “low est”
- The relation between “low” and “lowest” can be generalized to “smart” and “smartest”
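A toy sketch of the merge loop (the corpus, the end-of-word marker "</w>", and the number of merges are made up for illustration):

```python
from collections import Counter

# Start with a vocabulary of characters; words end with an end-of-word marker.
words = [("l", "o", "w", "</w>")] * 5 + [("l", "o", "w", "e", "s", "t", "</w>")] * 2

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1]); i += 2    # turn the pair into a new unit
            else:
                out.append(w[i]); i += 1
        merged.append(tuple(out))
    return merged

for _ in range(3):                                     # run a few merge steps
    pair = most_frequent_pair(words)
    words = merge(words, pair)
    print(pair, words[0], words[-1])                   # frequent "low" merges; rare "lowest" stays split
```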
Positional Encoding
- Byte Pair Encoding (BPE): Dimension: d
- Positional Encoding (PE): needed because the Transformer block itself is not sensitive to position, i.e. the same word at different positions would otherwise look identical
Input = BPE + PE
Encoder Block
- Two sublayers
- Multi-Head Attention
- Feed-Forward Network (2-layer MLP)
- Two tricks
- Residual connection
- Layer normalization
- Changes input to have mean 0 and variance 1
- General Dot-Product Attention
- Inputs
- A query q and a set of key-value (k, v) pairs
- Queries and keys are vectors with dimension $d_k$
- Values are vectors with dimension $d_v$
- Output
- Weighted sum of values
- Weight of each value is computed by the dot product of the query and corresponding key
- stack multiple queries q in a matrix Q
- Scaled Dot-Product Attention
- Problem
- As the dot products grow large, the softmax gets very peaked and the gradients get smaller, so the model updates slowly
- Solution
- Scale the dot products by $\sqrt{d_k}$, the square root of the query/key dimension (see the sketch after this list)
- Self-attention
- Let the word vectors themselves select each other
- Q, K, V are derived from the stack of word vectors from a sentence
- Multi-head Attention
- Different head: same computation component & different parameters
- Concatenate all outputs and feed into the linear layer
- In each layer, Q, K, V are the same as the previous layer’s output
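A minimal sketch of scaled dot-product self-attention in PyTorch (single head, arbitrary sizes; multi-head attention runs several of these in parallel with different parameters and concatenates the outputs):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # scale so the softmax is not too peaked
    weights = F.softmax(scores, dim=-1)
    return weights @ V                                   # weighted sum of the values

# Self-attention: Q, K, V are all derived from the same stack of word vectors.
x = torch.randn(2, 10, 64)                               # (batch, sentence length, hidden size)
Wq, Wk, Wv = nn.Linear(64, 64), nn.Linear(64, 64), nn.Linear(64, 64)
out = scaled_dot_product_attention(Wq(x), Wk(x), Wv(x))
print(out.shape)                                         # torch.Size([2, 10, 64])
```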
Decoder Block
- Two changes:
- Masked self-attention: The word can only look at previous words
- Encoder-decoder attention: Queries come from the decoder while keys and values come from the encoder
- Blocks are also repeated 6 times
- Other tricks
- Checkpoint averaging
- ADAM optimizer
- Dropout during training at every layer just before adding residual
- Label smoothing
- Auto-regressive decoding with beam search and length penalties
- Multi-head Demo
Summary of Transformer
- Advantage:
- The Transformer is a powerful model and proven to be effective in many NLP tasks
- The Transformer is suitable for parallelization
- It proves the effectiveness of the attention mechanism
- It also gives insights to recent NLP advancements such as BERT and GPT
- Disadvantage:
- The architecture is hard to optimize and sensitive to model modifications
- $O(n^2)$ per-layer complexity makes it hard to use on extremely long documents (the max length is usually set to 512)
PLM
- Language Modeling
- Pre-trained Language Models (PLMs)
- Fine-tuning Approaches
- GPT and BERT
- PLMs after BERT
- Applications of Masked LM
- Cross-lingual and Cross-modal LM Pre-training
- Frontiers of PLMs
- GPT-3, T5 and MoE
LM
- Language Modeling is the task of predicting the upcoming word
- Language Modeling: the most basic and important NLP task
- Contain a variety of knowledge for language understanding, e.g., linguistic knowledge and factual knowledge
- Only require the plain text without any human annotations
- The language knowledge learned by language models can be transferred to other NLP tasks easily
- There are three representative models for transfer learning of NLP
- Word2vec
- Pre-trained RNN
- GPT&BERT
PLM
- We have mentioned several PLMs in the last section: word2vec, GPT, BERT, …
- PLMs: language models having powerful transferability for other NLP tasks
- word2vec is the first PLM
- Nowadays, the PLMs based on Transformers are very popular (e.g. BERT)
- Two Mainstreams of PLMs
- Feature-based approaches
- The most representative model of feature-based approaches is word2vec
- Use the outputs of PLMs as the inputs of our downstream models
- Fine-tuning approaches
- The most representative model of fine-tuning approaches is BERT.
- The language models will also be the downstream models and their parameters will be updated
GPT
GPT-1:
Inspired by the success of Transformers in different NLP tasks, GPT is the first work to pre-train a PLM based on Transformer
Transformer + left-to-right LM
Fine-tuned on downstream tasks
GPT-2:
- A huge Transformer LM
- Trained on 40GB of text
- SOTA perplexities on datasets it’s not even trained on
More than LM
- Zero-Shot Learning: Ask LM to generate from a prompt
- Reading Comprehension
- Summarization
- Question Answering
A very powerful generative model
Also achieve very good transfer learning results on downstream tasks
- Outperform ELMo significantly
The key to success
- Big data (Large unsupervised corpus)
- Deep neural model (Transformer)
BERT
Problem: Language models only use left context or right context, but language understanding is bidirectional
Why are LMs unidirectional
- Reason 1: Directionality is needed to generate a well-formed probability distribution
- Reason 2: Words can “see themselves” in a bidirectional encoder
Unidirectional vs. Bidirectional Models
- Unidirectional context: Build representation incrementally
- Bidirectional context: Words can “see themselves”
Solution: Mask out k% of the input words, and then predict the masked words. k=15% in BERT
- Too little masking: too expensive to train
- Too much masking: not enough context
Masked LM
- Problem: [Mask] token never seen at fine-tuning
- Solution: 15% of the words to predict
- 80% of the time, replace with [MASK]
- went to the store → went to the [MASK]
- 10% of the time, replace with a random word
- went to the store → went to the running
- 10% of the time, keep the word unchanged: went to the store → went to the store (a small sketch follows)
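A small sketch of this 80/10/10 masking rule (token strings instead of ids, and a made-up mini-vocabulary, to keep it readable):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: pick ~15% of tokens as prediction targets, then replace
    80% of them with [MASK], 10% with a random word, and keep 10% unchanged."""
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                       # the model must predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)   # random word, e.g. "running"
            # else: keep the original token
    return masked, targets

tokens = "went to the store".split()
print(mask_tokens(tokens, vocab=["running", "apple", "store"]))
```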
Next Sentence Prediction
- To learn relationships between sentences, predict whether Sentence B is the actual sentence that follows Sentence A, or just a random sentence
- Input Representation
- Use 30,000 WordPiece vocabulary on input.
- Each token is the sum of three embeddings
- Single sequence is much more efficient.
Effect of Pre-training Task:
- Masked LM (compared to left-to-right LM) is very important on some tasks
- Next Sentence Prediction is important for other tasks
Effect of Model Size
- Big models help a lot
- Going from 110M -> 340M params helps even on datasets with 3,600 labeled examples
Empirical results from BERT are great, but biggest impact on the field is: With pre-training, bigger == better, without clear limits (so far)
Excellent performance for researchers and companies building NLP systems
Summary
Feature-based approaches transfer the contextualized word embeddings for downstream tasks
Fine-tuning approaches transfer the whole model for downstream tasks
Experimental results show that fine-tuning approaches are better than feature-based approaches
Hence, current research mainly focuses on fine-tuning approaches
Is BERT really perfect?
- Any optimized pre-training paradigm?
- The gap between pre-training and fine-tuning
- [MASK] token will not appear in fine-tuning
- The efficiency of Masked Language Model
- Only predict 15% words
RoBERTa
- Explore several pre-training approaches for a more robust BERT
- Dynamic Masking
- Model Input Format
- Next Sentence Prediction
- Training with Large Batches
- Text Encoding
- Massive experiments
ELECTRA
- Recall: the efficiency of bi-directional pre-training
- Masked LM: 15% prediction
- Permutation LM: 1/6~1/7 prediction
- Traditional LM: 100% prediction
- Single direction
- Replaced Token Detection
- A new bi-directional pre-training task
- 100% prediction
MLM
- Basic idea: to use bi-direction information to predict the target token
- Beyond token: use multi-modal or multi-lingual information together by masking
- Input the objects from different domains together and predict the target object based on the input objects
Cross-lingual LM Pre-training
- Translation Language Modeling (TLM)
- The TLM objective extends MLM to pairs of parallel sentences (e.g., English-French)
- To predict a masked English word, the model can attend to both the English sentence and its French translation, and is encouraged to align English and French representations.
- The translation language modeling (TLM) objective improves cross-lingual language model pretraining by leveraging parallel data
Cross-Modal LM Pre-training
- Pairs of videos and texts from automatic speech recognition (ASR)
- Generate a sequence of “visual words” by applying hierarchical vector quantization to features derived from the video using a pre-trained model
- Encourages the model to focus on high-level semantics and longer-range temporal dynamics in the video
Summary
- Masked LM inspired a variety of new pre-training tasks
- What’s your idea about transferring Masked LM?
Frontiers
GPT-3
A super large-scale PLM
Excellent few-shot/in-context learning ability
GPT-3: Doesn’t know when to say “I do not know”
T5
Reframe all NLP tasks into a unified text-to-text format where the input and output are always text strings
Encoder-decoder architecture
Larger Model with MoE
- Enhance encoder-decoder with MoE (Mixture of Experts) for billions of parameters
- GShard: 600B parameters
- Switch Transformer: 1,571B parameters
Summary
- The technique of PLMs is very important for NLP (from word2vec to BERT).
- Fine-tuning approaches are widely used after BERT.
- The idea of Masked LM inspired the research on unsupervised learning.
- Consider PLMs first when you plan to construct a new NLP system.
Transformers Tutorial
Introduction
- Various pre-trained language models are being proposed
- Is there any package that helps us:
- Reproduce the results easily
- Deploy the models quickly
- Customize your models freely
Hugging Face:
- Transformers is a package:
- Providing thousands of models
- Supporting PyTorch, TensorFlow, JAX
- Hosting pre-trained models for text, audio and vision
- Fairly easy to use. Low barrier to entry for researchers.
- Almost all the research on pre-trained models is built on Transformers!
Pipeline
- I want to directly use the off-the-shelf model on down-stream tasks -> Use pipeline!
Pipeline automatically uses a fine-tuned model and performs the downstream task.
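A minimal sketch of the pipeline API (the task name "sentiment-analysis" makes the library download a default fine-tuned model on first use; other supported task names work the same way):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # downloads a model fine-tuned for the task
print(classifier("I love this movie. Overall it was a fantastic movie."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```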
Tokenization
- Pre-trained language models have different tokenization
- BPE (Byte-Pair Encoding): GPT, Roberta, …
- WordPiece: BERT, Electra, …
- SentencePiece: ALBERT, T5, …
The tokenizer automatically uses the tokenization strategy of the given model to tokenize your text.
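A minimal sketch with AutoTokenizer (the exact subword split depends on the checkpoint's vocabulary):

```python
from transformers import AutoTokenizer

# BERT checkpoints ship a WordPiece tokenizer; GPT-2/RoBERTa would give BPE instead.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Tokenization strategies differ across models."))
# rare words come out as subword pieces, e.g. ['token', '##ization', ...]
```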
Frequently-used APIs
- Load the pre-trained models in a few lines
- Tokenize the texts
- Run the model
- Save the fine-tuned model in one line
from_pretrained can also load the saved fine-tuned model back.
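A combined sketch of these steps (the checkpoint name, text, and output directory are placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained model and its tokenizer in a few lines.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tokenize the texts.
inputs = tokenizer("I love this movie.", return_tensors="pt")

# Run the model.
with torch.no_grad():
    logits = model(**inputs).logits

# Save the fine-tuned model in one line; from_pretrained can load it back later.
model.save_pretrained("./my-finetuned-bert")
```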
- Train the model with Trainer
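A minimal Trainer sketch; it assumes the `model` loaded above plus already-tokenized `train_dataset` / `eval_dataset` objects, which are not shown here:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./results",            # where checkpoints are written
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
trainer = Trainer(
    model=model,                       # the model loaded above
    args=args,
    train_dataset=train_dataset,       # assumed: a tokenized training set
    eval_dataset=eval_dataset,         # assumed: a tokenized validation set
)
trainer.train()
```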
Demo
- We have provided a demo, which fine-tunes BERT for sentiment analysis task.
- You will be able to use Transformers after going through this demo.
- See https://colab.research.google.com/drive/1tcDiyHIKgEJp4TzGbGp27HYbdFWGolU_?usp=sharing, video is: https://www.bilibili.com/video/BV1UG411p7zv?p=40
L4 Prompt Delta
- Background & Overview
- Prompt-learning
- Template
- Verbalizer
- Learning Strategy
- Applications
- Delta Tuning
- Addition-based Methods
- Specification-based Methods
- Reparameterization-based Methods
- Advanced Topics
- OpenPrompt
- OpenDelta
- Pre-trained Language Models are Infrastructure in NLP. There are Plenty of NLP tasks. How to adapt PLMs to them?
Fine Tuning
Example: BERT
- Token representations for sequence tagging
- [CLS] for text classification
- Feed appropriate representations to output layers
Example: Relation Extraction
- Extract the relation between two marked entities
Example: GPT
- Feed the last hidden state to a linear output layer
Example: T5
- Encoder-decoder with 11 billion parameters
- Cast tasks to seq2seq manner with simple demonstrations
- A decoder is trained to output the desired tokens
Instead of separate task-specific classifiers for different downstream tasks, everything is cast as a seq2seq task plus a few suitable demonstrations and labels.
- When it Comes to GPT-3
- Huge model with 175 billion parameters
- No parameters are updated at all
- Descriptions (Prompts) + Few-shot examples to generate tokens
The model is not fine-tuned at all; this is where the concept of prompts was first proposed: in-context learning with few-shot/zero-shot examples, using prompts to let the big model adapt or learn.
An Irreversible Trend: Model Scaling. Larger PLMs Tend to Lead to Better Performance.
- Better natural language understanding capability
- Better quality for natural language generation
- Better capacity to continually learn novel knowledge
An Irreversible Trend: Difficult Tuning. How to Adapt Large-scale PLMs?
- A Predominant Way — Fine-tuning
- Prohibitive Computing: updating all the parameters;
- Prohibitive Storage: retaining separate instances for different tasks;
- Poor generalization when supervision is insufficient
- Results in scarce use of large-scale PLMs in research
Advanced Model Adaptation, Effective Model Adaptation.
- Task&Data-wise: Use prompt-learning to enhance the few-shot learning capability by bridging the gap between model tuning and pre-training.
- Optimization-wise: Use delta tuning to stimulate models with billions of parameters with optimization of a small portion of parameters.
Prompt-learning
Fine-tuning
- Use PLMs as base encoders
- Add additional neural layers for specific tasks
- Tune all the parameters
- There is a GAP between pre-training and fine-tuning
Prompt-learning
- Use PLMs as base encoders
- Add additional context (template) with a [MASK] position
- Project labels to label words (verbalizer)
- Bridge the GAP between pre-training and fine-tuning
Prompt-learning bridges the gap between pre-training and fine-tuning: the prompt reformulates the downstream task in the same form as pre-training.
- Sentiment Classification
- Prompting with a Template
- Input: x = “I love this movie”
- Template: [x] Overall, it was a [z] movie
- Prompting: x’ = “I love this movie. Overall it was a [z] movie.”
- Predict an answer
- Predicting: x’ = “I love this movie. Overall it was a fantastic movie.”
- Map the answer to a class label with a Verbalizer
- Mapping: fantastic = Positive
- Prompt-learning: Considerations
- Pre-trained Model
- Auto-regressive (GPT-1, GPT-2, GPT-3; OPT…)
- Masked Language Modeling (BERT, RoBERTa, DeBERTa)
- Encoder-Decoder (T5, BART)
- Template
- Manually Design
- Auto Generation
- Textual or Continuous…
- Verbalizer
- Manually Design
- Expanding by external knowledge…
PTM Selection
- Auto-regressive (GPT-1, GPT-2, GPT-3; OPT…) -> Decoder
- Suitable for super-large pre-trained models
- Autoregressive Prompt
Good at generation
- Masked Language Modeling (BERT, RoBERTa, DeBERTa) -> Encoder
- Suitable for natural language understanding (NLU)
- Cloze-style Prompt
Good at NLU
- Encoder-Decoder (T5, BART) -> Encoder + Decoder
- Bidirectional attention for encoder
- Autoregressive for decoder
General-purpose; both kinds of prompts work.
Template
Template Construction
- Manually Design based on the characteristics of the task
- Auto Generation with search or optimization
- Textual or Continuous
- Structured, incorporating with rules
Template: Extract World Knowledge
- Copy the entity in the Template
- Predict fine-grained entity types
- Extract world knowledge
Template: Incorporating Rules and Logic
- Prompt-learning with logic-enhanced templates
Structured Template
- Key-value Pairs for all the prompts
- Organize different tasks to a structured format
Ensembling Templates
- Use multiple different prompts for an input instance
- Alleviate the cost of prompt engineering
- Stabilize performance on tasks
Methods
- Uniform Averaging
- Weighted Averaging
Template: Automatic Search
- Gradient-based search of prompts based on existing words
- Use a encoder-decoder model to generate prompts
Essentially, a prompt is just tokens, so complex prompts that make no sense to humans may still work better than human-designed ones.
Perhaps we could train a model to produce better prompts/templates than humans, or prompts that are both human-readable and effective.
Optimization of Continuous Prompts
- Generative models for NLU by optimizing continuous prompts
- P-tuning v1: prompts to the input layer (with Reparameterization)
- P-tuning v2: prompts to every layer (like prefix-tuning)
Performance of Prompt-learning
- Extraordinary few-shot learning performance
- Huge impact from the templates
Verbalizer
Verbalizer
Mapping: Answer -> Unfixed Labels
Tokens: One or more tokens in the pre-trained language model vocabulary
Chunks: Chunks of words made up of more than one token
Sentence: Sentences in arbitrary length
Construction
- Hand-crafted
- Auto-generation
Verbalizer Construction
- Manually design with human prior knowledge
- Start with an initial label word, paraphrase & expand
- Start with an initial label word, use external knowledge & expand
- Decompose the label with multiple tokens
- Virtual token and optimize the label embedding
Knowledgeable Prompting
- Label -> Words
- Use External Knowledge to expand the label words
Virtual Tokens as Label Words
- Project the hidden states of [MASK] tokens to the embedding space and learn prototypes
- The learned prototypes constitute the verbalizer and map the PLM outputs to corresponding labels.
Learning Strategy
- The Evolvement
- Traditional: Learning from scratch;
- After BERT: Pre-training-then-fine-tuning;
- T5: Pre-training-then-fine-tuning with text-to-text format;
- GPT: Pre-training, then use prompt & in-context for zero- and few- shot;
- Prompt-learning Introduces New Learning Strategies
- Pre-training, prompting, optimizing all the parameters (middle-size models, few-shot setting)
- Pre-training, adding soft prompts, freezing the model and optimizing the prompt embeddings (delta tuning perspective)
- Pre-training with prompted data, zero-shot inference (Instruction tuning& T0)
- Prompt-Tuning
- Injecting soft prompts (embeddings) to the input layer
- Extraordinary power of scale
- Comparable results to fine-tuning conditioned on 11B PLM
- Essentially a parameter efficient (delta tuning) method
Delta tuning departs from the original intuition of fine-tuning: a small set of parameters drives a big model.
Prompting Prompt Tuning
- Injecting Prompts to Pre-training.
- Full data: fine-tuning and prompt tuning are comparable.
- Few data: tuning only the prompts performs poorly.
- Vanilla prompt tuning cannot generalize effectively in low-data situations.
- Injecting soft prompts into pre-training improves the generalization of prompt tuning
Fine-tuning with Prompted Data
- Multi-task Pre-training with Hand-crafted Prompts
- Fine-tuning a 130B PLM with prompts on 60 tasks
- Substantially improves the zero-shot capability
- Use manually written prompts to train an encoder-decoder model
- Zero-shot generalization on unseen tasks
Applications
Biomedical Prompt-learning: Prompt-learning can support Clinical Decision
- Big models in the general domain (like GPT-3) can’t perform well in specific domains like biomedicine
- Prompt-learning shows significant effectiveness
Cross-Modality Prompt-learning: Cross-Modal Prompt-learning
- Create colorful frames in images
- Add color-wise textual prompts to input data
Summary: Prompt-learning
- A comprehensive framework that considers PLMs, downstream tasks, and human prior knowledge
- The design of Template & Verbalizer is crucial
- Prompt-learning has promising performance in the low-data regime, but high variance with respect to the selection of templates
- Prompt-learning has broad applications
Delta-Tuning
How to Adapt Large-scale PLMs?
- An Efficient Way — Delta Tuning
- Only updating a small amount of parameters of PLMs
- Keeping the parameters of the PLM fixed
Why Does Parameter-Efficient Tuning Work?
- In the Past Era
- Parameter-efficient learning could not be realized in the past
- Because all the parameters are randomly initialized
- With Pre-training
- Pre-training can learn Universal Knowledge
- Adaptation of downstream
- Imposing universal knowledge to specific tasks
Delta Tuning: Parameter Efficient Model Tuning
- Addition-based methods introduce extra trainable neural modules or parameters that do not exist in the original model;
- Specification-based methods specify that certain parameters in the original model or process become trainable, while others are frozen;
- Reparameterization-based methods reparameterize existing parameters to a parameter-efficient form by transformation.
Addition-based
- Adapter
- Adapter-Tuning
- Injecting small neural modules (adapters) into Transformer Layer
- Only fine-tuning adapters and keeping other parameters frozen
- Adapters are down-projection and up-projection
- Tunable parameters: 0.5%~8% of the whole model
- Move the Adapter Out of the Backbone
- Bridge a ladder outside the backbone model
- Save computation of backpropagation
- Save memory by shrinking the hidden size
- Prefix-Tuning
- Inject prefixes (soft prompts) to each layer of the Transformer
- Only optimizing the prefixes of the model
- Prompt-Tuning
- Injecting soft prompts (embeddings) only to the input layer
- Extraordinary power of scale
- Comparable results to fine-tuning conditioned on 11B PLM
Specification-based
- BitFit
- A simple strategy: only updating the bias terms
- Comparable performance of full fine tuning
Reparameterization-based
Intrinsic Prompt Tuning
Hypothesis: the optimization process can essentially be carried out in a low-dimensional space.
The model tuning is mapped into a low-dimensional subspace
89% of the full-parameter fine-tuning performance can be achieved in a subspace as low as 5 dimensions across 120 NLP tasks
Manipulate NLP in Low-dimension Space
- Essentially low-rank: factorize the weight update, e.g. decompose a 1000 × 1000 matrix into 1000 × 2 and 2 × 1000 matrices.
- LoRA: Low-Rank Adaptation
- Freeze the model weights
- Injects trainable rank-decomposition matrices to each Transformer layer
- LoRA tunes 4.7 million parameters out of the 175 billion parameters of the GPT-3 model (see the sketch below)
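A minimal sketch of the LoRA idea on a single linear layer (rank, scaling, and initialization follow the common recipe but are illustrative, not the exact paper settings):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA layer: the pre-trained weight is frozen and the update
    is re-parameterized as a low-rank product B @ A (rank r << hidden size)."""
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():           # freeze the original weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(linear.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.linear(x) + self.scale * (x @ self.A.T @ self.B.T)

# A 1000 x 1000 frozen weight; only the 2 x 1000 and 1000 x 2 factors are trainable.
layer = LoRALinear(nn.Linear(1000, 1000), r=2)
```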
Connections
The Reparameterization-based Methods Are Connected
- Based on similar hypothesis
- The optimization process could be transformed to a parameter efficient version
A Unified View
- Adapter, Prefix Tuning and LoRA could be connected
- Function form
- Insertion form
- Modified Representation
- Composition Function
- Adapter, Prefix Tuning, and LoRA could be connected in form
- New variants could be derived under this framework
A unified view -> derive more new and more general methods
- Deep Analysis of Delta Tuning
- Theoretical Analysis
- From optimization
- Low-dimensional representation in solution space
- Low dimensional representation in functional space
- From optimal control
- Seek the optimal controller
- A Rigorous Comparison of Performance
- Experiments on 100+ NLP tasks
- Delta tuning gains no absolute advantage; fine-tuning is still the best model tuning method;
- Power of Scale: The power of scale is observed in all the methods, even random tuning
- Combination of different delta tuning methods
- Implies the existence of Optimal Structure which is not defined manually
- Automatically search the structure
- 1/10000 parameters could work
- Transferability
- Delta Tuning shows non-trivial task-level transferability
- Implies the possibility to construct a sharing platform
- Efficient Tuning with low GPU RAM
- Tune T5-large on 11G single GPU (Nvidia 1080Ti, 2080, etc.)
- Tune T5-3b on 24G single GPU (Nvidia 3090 and V100)
- Tune T5-11b on 40G single GPU (Nvidia A100, with BMTrain)
- Summary
- Delta tuning could effectively work on super-large models -> Optimizing only a small portion of parameters could stimulate big models.
- The structure may become less important as the model scales up
- What’s NeXT?
Further Reading
- Paper List
- PromptPapers: https://github.com/thunlp/PromptPapers
- DeltaPapers: https://github.com/thunlp/DeltaPapers
- Programming Toolkit
- OpenPrompt: https://github.com/thunlp/OpenPrompt
- OpenDelta: https://github.com/thunlp/OpenDelta
OpenPrompt
Please see the video first.
API design
- Modularity
- Flexibility
- Uniformity
How to use OpenPrompt: https://github.com/thunlp/OpenPrompt (a condensed code sketch follows the steps below)
- Step 1: Define a task
- Think about what your data looks like and what you want from the data!
- Step 2: Obtain a PLM
- Choose a PLM to support your task;
- Different models have different attributes;
- Essentially obtain a modeling strategy with pre-trained tasks;
- Many PLMs are supported, with more coming…
- Step 3: Define a Template: A Template is a modifier of the original input text, which is also one of the most important modules in prompt-learning.
- Step 4: Define a Verbalizer (optional): A Verbalizer projects the original labels to a set of label words.
- Step 5: Define a PromptModel
- A PromptModel is responsible for training and inference
- It defines the (complex) interactions of mentioned modules
- Step 6: Train and Inference
- Train and evaluate the PromptModel in PyTorch fashion
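A condensed sketch of these six steps, following the OpenPrompt README; class and argument names may differ slightly between versions, and the checkpoint and label words are placeholders:

```python
from openprompt.data_utils import InputExample
from openprompt.plms import load_plm
from openprompt.prompts import ManualTemplate, ManualVerbalizer
from openprompt import PromptForClassification, PromptDataLoader

# Step 1: define the task as InputExample instances (one unlabeled example here).
dataset = [InputExample(guid=0, text_a="I love this movie.")]

# Step 2: obtain a PLM.
plm, tokenizer, model_config, WrapperClass = load_plm("bert", "bert-base-cased")

# Step 3: define a Template (the {"mask"} slot is what the PLM fills in).
template = ManualTemplate(
    tokenizer=tokenizer,
    text='{"placeholder":"text_a"} Overall, it was a {"mask"} movie.',
)

# Step 4: define a Verbalizer mapping label words to classes.
verbalizer = ManualVerbalizer(tokenizer, num_classes=2,
                              label_words=[["terrible"], ["fantastic"]])

# Step 5: define a PromptModel.
prompt_model = PromptForClassification(plm=plm, template=template, verbalizer=verbalizer)

# Step 6: train / run inference in the usual PyTorch fashion.
loader = PromptDataLoader(dataset=dataset, template=template, tokenizer=tokenizer,
                          tokenizer_wrapper_class=WrapperClass)
for batch in loader:
    logits = prompt_model(batch)
```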
Mixed Template
- Basic hard and soft template
- Incorporation of meta information
- Soft template initialized with textual tokens
- Post-processing
- Fast token duplication
Generation Verbalizer
- Label words defined as part of the input -> works in a similar fashion to the mixed template
- Especially powerful in transforming ALL NLP tasks to generation tasks
Newly Designed Template Language - Mixed Template -> write templates in a flexible way
Implement All Kinds of Prompt-Learning Pipelines
- Modify separate modules and create new methods
- Apply existing methods to other scenarios
1.7k stars for our Github repository
Along with 2.0k stars for referenced paper list
OpenDelta
Please see the video first.
OpenDelta: Toolkit for Delta Tuning
- Clean: No need to edit the backbone PTM’s code.
- Simple: Migrating from full-model tuning to delta tuning needs as little as 3 lines of code (see the sketch below).
- Sustainable: Evolution in external libraries doesn’t require updates.
- Extendable: Various PTMs can share the same delta tuning code.
- Flexible: Able to apply delta tuning to (almost) any position.
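A sketch of the advertised few-line migration, following the OpenDelta README; the exact class and argument names (AdapterModel, freeze_module, log) may vary across versions, and the backbone checkpoint is a placeholder:

```python
from transformers import AutoModelForSequenceClassification
from opendelta import AdapterModel   # LoraModel, BitFitModel, ... work the same way

backbone = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Attach delta modules without touching the backbone code, then freeze everything else.
delta_model = AdapterModel(backbone_model=backbone)
delta_model.freeze_module(exclude=["deltas", "classifier"])
delta_model.log()   # visualize which parameters remain trainable
```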
Apply OpenDelta to Various Models
- Supported models
Adapter Hub
- Need to modify the backbone code.
- Need reimplementation for EVERY PTM.
- Code frozen at transformers version 4.12
- Need constant update to suit Huggingface’s update (to suit new feature)
- Can only apply Adapter under existing mode (e.g. not supporting adding adapters to a fraction of layers or other places in the model)
How do we achieve it?
- Key based addressing: Find the module according to the module/parameter key.
- Three modification operations can cover most delta tuning:
- Replace, Insert after, Insert before.
- The modified model will have the same doc & I/O & address & Signature etc. to the original model.
- Create pseudo data to automatically determine the parameter size of delta models.
How do we achieve it?
- Altering the flow of tensors.
- Use a wrapper function to wrap the original forward function to let the tensor pass the delta models as well.
More than aggregating delta models …
- Visualize the parameters’ location in the PTM.
- Insert delta modules in arbitrary layers.
- Delta center to save fine-tuned delta models
AutoDelta Feature
- Automatically load and define delta modules from a configuration
- Automatically load and define delta modules from pre-trained checkpoints
Multitask Serving
Collaboration
- Collaboration of OpenDelta & OpenPrompt
- OpenDelta is a toolkit for Delta Tuning
- Collaborated with OpenDelta, there is a loop to efficiently stimulate LMs
Demos also exist on Github
L5 BMSystem
BMTrain
- CPU vs. GPU
- CPU: small number of large cores.
- GPU: large number of small cores.
- GPU Memory component
- Parameter
- Gradient
- Intermediate
- The input of each Linear Module needs to be saved for backward
- Each with Shape [Batch, SeqLen, Dim]
- Optimizer:
- Commonly used Adam Optimizer needs to store extra states.
- The number of states is greater than 2 times the number of parameters.
Collective Communication.
- Broadcast: Send data from one GPU to other GPUs
- Reduce: Reduce (Sum/Average) data of all GPUs, send to one GPU.
- All Reduce: Reduce (Sum/Average) data of all GPUs, send to all GPUs.
- Reduce Scatter: Reduce (Sum/Average) data of all GPUs, send portions to all GPUs.
- All Gather: Gather data of all GPUs, send to all GPUs.
Methods:
- Data Parallel
- Model Parallel
- ZeRO
- Pipeline Parallel
Data Parallel
- There is a parameter server.
- Forward:
- The parameter is replicated on each device.
- Each replica handles a portion of the input.
- Backward:
- Gradients from each replica are averaged.
- Averaged gradients are used to update the parameter server.
Distributed Data Parallel
- There is no parameter server.
- Forward:
- Each replica handles a portion of the input.
- Backward:
- Gradients from each replica are averaged using All Reduce.
- Each replica owns the optimizer and updates parameters itself.
- Since gradients are shared, parameters are synced.
The input of each Linear Module needs to be saved for backward. Each with Shape:
- Without Data Parallel [Batch, Len, Dim]
- With Data Parallel -> [Batch/n, Len, Dim]
Batch/n >= 1
Model Parallel
- Partition the matrix parameter into sub-matrices.
- Sub-matrices are separated into different GPUs.
- Each GPU handles the same input.
Intermediates are not partitioned.
ZeRO
Zero Redundancy Optimizer
ZeRO-Stage 1:
- Each replica handles a portion of the input.
- Forward
- Backward
- Average all gradients using Reduce Scatter
- Each replica owns part of optimizer & update part of params
- Updated parameter are synced using All Gather
ZeRO-Stage 2:
- Each replica handles a portion of the input.
- Forward.
- Backward (Average gradients using Reduce Scatter).
- Each replica owns part of the optimizer & updates part of the params.
- Updated parameter are synced using All Gather.
ZeRO-Stage 3
- Each replica handles a portion of the input.
- Forward (Share parameters using All Gather).
- Backward (Average gradients using Reduce Scatter).
- Each replica owns part of the optimizer & updates part of the params.
Pipeline Parallel
- Transformer are partitioned layer by layer.
- Different layers are put on different GPUs.
- Forward : Layer i -> Layer i+1
- Backward: Layer i -> Layer i-1
Techniques
- Mixed precision
- Offloading
- Overlapping
- Checkpointing
Mixed Precision
FP32: 1.18e-38 ~ 3.40e38 with 6–9 significant decimal digits of precision. FP16: 6.10e-5 ~ 65504 with 4 significant decimal digits of precision.
Advantages:
- Math operations run much faster.
- Math operations run even faster with Tensor Core support.
- Data transfer operations require less memory bandwidth.
- Smaller range but not overflow.
Disadvantages:
- Weight update ≈ gradient × lr; with FP16’s smaller range the update easily underflows.
Keep a master copy of FP32 parameters in the optimizer.
During training, an extra FP32 copy of the parameters is kept and updates are accumulated there; the accumulated result is then written back to FP16. For later inference, FP16 alone can be used, which is faster (see the sketch below).
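A minimal sketch using PyTorch's automatic mixed precision, which follows the same recipe (FP16 compute, FP32 master weights, loss scaling against underflow); `model`, `loss_fn`, `optimizer`, and `loader` are assumed to exist:

```python
import torch

scaler = torch.cuda.amp.GradScaler()          # scales the loss so FP16 gradients don't underflow

for inputs, labels in loader:                 # assumed: model, loss_fn, optimizer, loader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # run the forward pass in FP16 where it is safe
        loss = loss_fn(model(inputs.cuda()), labels.cuda())
    scaler.scale(loss).backward()
    scaler.step(optimizer)                    # unscale, then update the FP32 master weights
    scaler.update()
```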
Offloading
- Bind each GPU with multiple CPUs.
- Offload the partitioned optimizer states to CPU.
- Send Gradients from GPU to CPU.
- Update optimizer states on CPU (using OpenMP + SIMD).
- Send back updated parameters from CPU to GPU.
Overlapping
- Memory operations are asynchronous.
- Thus, we can overlap Memory operations with Calculations.
Checkpointing
- Forward:
- Some hidden states (checkpoint) are reserved.
- All other intermediate results are immediately freed.
- Backward:
- Freed intermediates are recomputed.
- And released again after obtaining gradient states.
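A minimal sketch with torch.utils.checkpoint (toy layer sizes); each checkpointed segment frees its intermediate activations and recomputes them during the backward pass:

```python
import torch
from torch.utils.checkpoint import checkpoint

layers = torch.nn.ModuleList([torch.nn.Linear(256, 256) for _ in range(4)])
x = torch.randn(8, 256, requires_grad=True)

h = x
for layer in layers:
    # Only the segment inputs are kept; inner activations are recomputed in backward.
    h = checkpoint(layer, h, use_reentrant=False)
h.sum().backward()
```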
Performance
Speedup with simple code replacement.
BMCook
The model size of PLMs has been growing at a rate of about 10x per year
Huge Computational Cost: The growing size comes with huge computational overhead
- Limits the application of large PLMs in real-world scenarios
- Leads to large carbon emissions
Towards Efficient PLMs
- Model Compression: Compress big models to small ones to meet the demand of real-world scenarios
- Existing Methods
- Knowledge Distillation
- Model Quantization
- Model Pruning
Knowledge Distillation
- Proposed by Hinton at the NIPS 2014 Deep Learning Workshop
- Problem of Ensemble Model
- Cumbersome and may be too computationally expensive
- Similar to current PLMs
- Solution
- The knowledge acquired by a large ensemble of models can be transferred to a single small model
- We call “distillation” to transfer the knowledge from the cumbersome model to a small model that is more suitable for deployment.
- What is knowledge: In a more abstract view, knowledge is a learned mapping from input vectors to output vectors.
- Soft targets provide more information than gold labels.
- Key research question: how to build more soft targets -> Previous methods only use the output from the last layer
- Learn from multiple intermediate layers of the teacher model
- Mean-square loss between the normalized hidden states
- Learn from multiple intermediate layers
- Learn from the embedding layer and output layer
- Learn from attention matrices
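A minimal sketch of the classic soft-target distillation loss (the temperature and mixing weight are illustrative defaults, not values from a specific paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft targets from the teacher (softened by temperature T)
    plus the usual cross-entropy on the gold labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```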
Model Pruning
- Remove the redundant parts of the parameter matrix according to their important scores
- Unstructured pruning and structured pruning
- Weight pruning (unstructured)
- 30-40% of the weights can be discarded without affecting BERT’s universality (prune pre-train)
- Fine-tuning on downstream tasks does not change the nature (prune downstream)
- Attention head pruning (structured)
- Ablating one head
- Define the importance scores of attention heads
- Iteratively prune heads on different models
- Layer pruning (structured)
- Extend dropout from weights to layers
- Training: randomly drop layers
- Test: Select sub-networks with any desired depth
Model Quantization
- Reduce the number of bits used to represent a value -> Floating point representation -> Fixed point representation
- Three steps: 1. Linear scaling 2. Quantize 3. Scaling back
- Models with different precisions -> Extreme quantization (1 bit) is difficult
- Loss landscapes are sharper
- Train a half-sized ternary model
- Initialize a binary model with the ternary model by weight splitting
- Fine-tune the binary model
Other Methods
Weight Sharing
- ALBERT: Two parameter reduction techniques
- Decompose the large vocabulary embedding matrix into two small matrices
- Cross-layer parameter sharing
Low-rank Approximation
- Low-rank Approximation
- Difficult to directly conduct low-rank approximation
- View more at: Here
Architecture Search
- Is the architecture of Transformer perfect?
- Neural architecture search based on Transformer
- Pre-define several simple modules
- Training several hours with each architecture
- Two effective modifications
- Multi-DConv-Head Attention (MDHA)
- Squared ReLU in Feed Forward Block
- Primer learns faster and better
Summary
- Large-scale PLMs are extremely over-parameterized
- Several methods to improve model efficiency
- Knowledge Distillation
- Model Pruning
- Model Quantization
- …
- Our model compression toolkit: BMCook -> Includes these methods for extreme acceleration of big models
Usage Intro
Github link is Here
Compared to existing compression toolkits, BMCook supports all mainstream acceleration methods for PLMs
Implement different compression methods with just a few lines of code
Compression methods can be combined in any way towards extreme acceleration
Core of BMCook: Compression Configuration File
Implement various methods with a few lines. The GitHub repository has multiple demos showing how to use BMCook to apply all mainstream acceleration methods for PLMs.
BMInf
BMInf is the first toolkit released by OpenBMB.
Github repo: https://github.com/OpenBMB/BMInf
BMInf has received 270 stars (hope more after this course XD).
In June 2021, we released CPM-2 with 10 billion parameters.
It is powerful in many downstream tasks.
Background
- high hardware requirements
- For each demo we used 4xA100s for inference.
- inefficient
- Each request takes about 10 seconds to handle.
- costly
- The cost of 4xA100s is ¥1200 per day.
- Another thought: instead of serving the demo on our server, make it possible for everyone to run big models on their own computers.
Difficulties
- How difficult is it?
- High Memory Footprint
- The checkpoint size of CPM-2 model is 22GB.
- It takes about 2 minutes to load the model from disk.
- High Computing Power
- Generating 1 token with A100 takes 0.5 seconds.
- High Memory Footprint
Linear Layer
- The linear layer is actually matrix multiplication.
- Use lower precision for speedup: FP64 -> FP32 -> FP16 -> FP8? INT8
- INT8
- smaller range
- precise (integer) values
Quantization
Using integers to simulate floating-point matrix multiplication
find the largest absolute value in the matrix
scale it to 127 to quantize
multiply by the scaling factor to dequantize
Matrix multiplication after quantization
Row-wise matrix quantization:
- calculate the scaling factor for each row/column
- scale each row/column to -127~127 (see the sketch below)
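A small sketch of row-wise quantized matrix multiplication (random matrices; the quantized values are kept as integer-valued floats here because a real kernel would store them as INT8 and use an INT8 GEMM):

```python
import torch

def rowwise_int8_matmul(A, B):
    """Simulate A @ B with 8-bit integers: scale each row of A and each column of B
    into [-127, 127], multiply, then multiply back by the scaling factors (dequantize)."""
    sa = A.abs().amax(dim=1, keepdim=True) / 127.0      # one scaling factor per row of A
    sb = B.abs().amax(dim=0, keepdim=True) / 127.0      # one scaling factor per column of B
    qa = torch.round(A / sa).clamp(-127, 127)           # quantize
    qb = torch.round(B / sb).clamp(-127, 127)
    return (qa @ qb) * sa * sb                          # dequantize the integer product

A, B = torch.randn(4, 8), torch.randn(8, 3)
print((rowwise_int8_matmul(A, B) - A @ B).abs().max())  # small quantization error
```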
We quantized the linear layer parameters of CPM-2
- model size is reduced by half
- 22GB -> 11 GB
- still too large for GTX 1060 ( 6GB memory )
Memory Scheduling
The idea is like virtual memory: only the parameters currently in use are kept on the GPU; the rest stay in CPU memory.
Not all parameters need to be placed on GPU.
- Move parameters that won’t be used in a short time to CPU.
- Load parameters from CPU before use.
- Calculation and loading are performed in parallel.
Implemented in CUDA 6: Unified Memory
We only need to store two layers of parameters in the GPU.
- one for calculating
- the other for loading
It’s about 500MB for CPM-2.
- In fact, it is much slower to load than to calculate.
- It takes a long time if we only place two layers on GPU.
- Put as many layers as possible on the GPU.
- Assuming that up to n layers can be placed on the GPU.
- n - 2 layers are fixed on the GPU and will not be moved to the CPU.
- 2 layers are used for scheduling.
Which layers are fixed on GPU?
- Consider two layers that need to be placed on the CPU.
- A larger interval is always better than a smaller one.
- Maximize the interval between two layers.
Usage
BMInf runs CPM-2 on a GTX 1060. It also achieves good performance on better GPUs.
Installation: pip install bminf
Hardware Requirements: GTX 1060 or later
OS: both Windows and Linux
L6 BM Application in NLP
Big-model-based Text Understanding and Generation
Introduction
- Typical NLP applications: understanding and generation
- Big models bring revolutions
- NLP Key applications:
- NLU (Natural Language Understanding): Information Retrieval
- NLG (Natural Language Generation): Text Generation
- NLU + NLG: Question Answering
Information retrieval
- Find relevant documents given queries.
- Big models can provide more intelligent and accurate search results.
- PLM-based methods ranked high
Question answering
- Big models can answer more complex questions
Text generation
- Machine translation; poetry generation; dialogue systems…
- Big models can generate more fluent and natural texts
Information Retrieval(IR)
Background
- Information explosion:
- Amount: 40ZB, 50% annual growth rate
- Variety: Update period in minutes
- Rising demand for automatic information retrieval
- 4.39 billion information users
- Annual growth rate of 6~21%
- Requirement: Query -> A sea of information -> A few relevant pieces of information
- Application
- Typical application: search engines. Others: public opinion analysis / fact verification, QA systems, retrieval-augmented text generation
- Examples
- Document Ranking for a Query
- Question Answering
Formulation
How to formulate?
- Given a query
- Given a document collection
- IR system computes the relevance score and ranks all documents based on the scores
Retrieval -> Re-Ranking
Evaluation Metrics
- MRR@k
- MAP@k
- NDCG@k
We only care about the top k results the system retrieves.
MRR (Mean Reciprocal Rank): MRR is the average of the reciprocal ranks of the first relevant results for a query set.
MAP (Mean Average Precision): MAP is the mean of the average precision score for a set of queries.
NDCG (Normalized Discounted Cumulative Gain): divides docs into different levels according to the relevance with the query.
Discounted Cumulative Gain (DCG): You get five results for a query search and classify them into three grades: Good (3), Fair (2) and Bad (1)
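A small sketch of MRR@k and (N)DCG@k on made-up relevance judgments, following the definitions above:

```python
import math

def mrr_at_k(ranked_relevance, k=10):
    """ranked_relevance: per-query lists of 0/1 flags in ranked order."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels[:k], start=1):
            if rel:                              # reciprocal rank of the first relevant result
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def dcg_at_k(grades, k=5):
    """grades: graded relevance in ranked order, e.g. Good=3, Fair=2, Bad=1."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(grades[:k], start=1))

queries = [[0, 1, 0], [1, 0, 0]]                 # hypothetical binary relevance for two queries
print(mrr_at_k(queries))                         # (1/2 + 1/1) / 2 = 0.75
grades = [3, 2, 3, 1, 2]
print(dcg_at_k(grades) / dcg_at_k(sorted(grades, reverse=True)))   # NDCG@5
```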
Traditional IR
- BM25 (Best Matching 25)
- Lexical exact-match model
- Given a query and a document collection
- BM25 computes the relevance score
- TF (Term Frequency): The weight of a term that occurs in a document is simply proportional to the term frequency.
- IDF (Inverse Document Frequency): The specificity of a term can be quantified as an inverse function of the number of documents in which term t appears.
- Problems:
- Vocabulary mismatch: Different vocabulary, same semantics
- Semantic mismatch: Same vocabulary, different semantics
Neural IR
- Neural IR can mitigate traditional IR problems
- Query + Document -> Neural Network -> Vector Space -> Relevance Score
- Neural IR outperform traditional IR significantly
- Being neural has become a tendency for IR
- Architecture
- Re-ranking: cross-encoder; models finer semantics of query and document; superior performance; higher computational cost
- Retrieval: dual-encoder; independent representations for query/document; reduced computational cost
- Cross-Encoder
- Given a query q and a document d
- They are encoded to the token-level representations H
- Get the ranking score
- Training: Training data + Training loss
- Dual-Encoder
- DPR: embed query and documents using dual encoders
- Negative log likelihood (NLL) training loss
- Offline computation of doc representations
- Nearest neighbor search supported by FAISS: Batching & GPU can greatly improve retrieval speed (~1ms per q for 10M documents, KNN)
- Retrieval Performance
- More training examples (from 1k to 59k) further improves the retrieval accuracy consistently
- Bigger model size, better retrieval performance
Advanced Topics
- How to mine negative?
- In-batch negative
- Random negative
- BM25 negative
- Self-retrieved hard negative (ICLR 2021)
- Negative-enhanced Fine-tuning
- ANCE (Approximate nearest neighbor Negative Contrastive Learning) -> Asynchronous Index Refresh: document index goes stale after every gradient update → Refresh the index every k steps
- ANCE (Approximate nearest neighbor Negative Contrastive Learning) -> Performance Beat other dense retrieval
- RocketQA (NAACL 2021) -> Uses cross-encoder to filter hard negatives. Performance beats ANCE.
- IR-oriented Pretraining
- SEED-Encoder (EMNLP 2021)
- pre-trains the autoencoder using a weak decoder to push the encoder to provide better text representations.
- The encoder and decoder are connected only via [CLS]. The decoder is restricted in both param size and attention span.
- beats standard pretrained models.
- ICT (Inverse Cloze Task)
- Given a passage consisting of n sentences
- The query is a sentence randomly drawn from the passage, and the document is the rest of sentences
- ICT pre-training improves retrieval performance
- SEED-Encoder (EMNLP 2021)
- Few-Shot IR
- Many real-world scenarios are “few-shot” where large supervision is hard to obtain
- Weak supervision generation
- Weak supervision selection
- Reinforcement data selection (ReinfoSelect) -> Learn to select training pairs that best weakly supervise the neural ranker
- Meta-learning data selection (MetaAdaptRank) -> Learn to reweight training pairs that best weakly supervise the neural ranker
- MetaAdaptRank beats ReinfoSelect
- Generalizable T5-based dense Retrievers (GTR)
- Conversational IR -> Models multiple rounds of query
- How to use big model to retrieve long documents? -> Long-range dependency
Demo (video): load document representations -> load query representations -> batch search -> visualize retrieved results.
Question Answering(QA)
Background
Why do we need question answering (QA) ?
- When we search for something in Google, it’s usually hard to find answers from the document list
- With QA systems, answers are automatically found from large amount of data
Better search experience
Applications of QA
- IBM Watson: 2011 Winner in Jeopardy
- Defeated two human players (Ken Jennings and Brad Rutter)
- Intelligent assistants
History
- Template-based QA / Expert systems
- IR-based QA
- Community QA
- Machine Reading Comprehension / KBQA
Types of QA
- Machine Reading Comprehension: Read specific documents and answer questions
- Open-domain QA: Search and read relevant documents to answer questions
- Knowledge-based QA: Answer questions based on knowledge graph
- Conversational QA and dialog: Answer questions according to dialog history
- …
Reading Comprehension(RC)
- Task Definition and Dataset
- Definition of RC
- Documents, Questions, Candidate answers
- Types of RC
- Cloze test: CNN/Daily Mail (93k CNN articles, 220k Daily Mail articles)
- Cloze test: CBT (Children’s Book Test), Context: 20 continuous sentences, Question: the 21st sentence, with an entity masked, Answer: the masked entity, 10 candidates
- Multiple choice -> RACE: 100k multiple choice questions collected from English exams in China.
- Extractive RC: Predict a span in documents -> SQuAD: 100k+ human-annotated questions on 536 articles from Wikipedia. Every answer is a span in the article
- Traditional Pipeline
- Model Framework -> General framework in RC: embed, encode, interact, and predict
- Typical components per layer: prediction (bilinear, pointer network), interaction (document-to-query and query-to-document attention), encoding (LSTM, GRU, attention), embedding (GloVe, ELMo, character embeddings)
- An Example of RC Model: BiDAF. Four layers
- Prediction Layer
- Attention Based Interaction Layer
- Context-aware Encoding Layer
- Word Embedding Layer
- Big-model-based Methods
- Use PLMs (like BERT) to replace the first three layers -> BERT model has no RNN modules
- Model change: pre-trained representation model -> prediction layer
- Using BERT for RC:
- Feed the concatenation of the question and the context to BERT. Get the question-aware context representation to predict the start/end of answers.
- Excellent performance on SQuAD
- UnifiedQA, Unifying different QA formats
- Four types: extractive, abstractive, multiple-choice, yes/no
- Text-to-text format
- A single QA system is on par with, and often outperforms, dedicated models
- Using prompt, we can do it easily!
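A short, hedged example of the BERT-for-RC recipe above using the Hugging Face question-answering pipeline; the SQuAD-tuned checkpoint name is just one publicly available choice, not the course's model.

```python
from transformers import pipeline

# Any extractive-QA checkpoint fine-tuned on SQuAD can be plugged in here.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = ("BiDAF consists of four layers: a word embedding layer, a context-aware encoding layer, "
           "an attention-based interaction layer, and a prediction layer.")
result = qa(question="How many layers does BiDAF have?", context=context)
print(result["answer"], result["score"])   # the answer is a span extracted from the context
```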
Open-domain QA
Task Definition
RC assumes that any question has a short piece of relevant text, which is not always true
- In open-domain QA, the model should be able to find relevant texts from a corpus and read them
- Wikipedia can be viewed as a large-scale corpus for factoid questions
Goal: build an end-to-end QA system that can use full Wikipedia to answer any factoid question
Generation-based Methods
- Answer Questions with Big Models:
- GPT-3, T5, etc. can generate answers directly
- Fine-tune T5 on open-domain QA
- Achieve competitive performance
- Bigger models perform better
- “Power of scale”
Retrieval-based Methods
Document Retriever + Document Reader
- Document retriever: finding relevant articles from 5 million Wikipedia articles
- Document reader (reading comprehension system): identifying the answer spans from those articles
Document Retriever
- Return 5 Wikipedia articles given any question
- Features:
- TF-IDF bag-of-words vectors
- Efficient bigram hashing (Weinberger et al., 2009)
- Better performance than the built-in Wikipedia search (measured by hit@5)
Document Reader
- Simple reading comprehension model
- Features:
- Word embeddings
- Exact match features: whether the word appears in the question
- Token features: POS, NER, term frequency
- Aligned question embedding
- Using Shared-Norm for multiple documents
Distant Supervision: for a given question, automatically associate paragraphs that include the answer span with this question.
Results
- Reasonable performance across all four datasets
- Models using DS outperform models trained on SQuAD -> Multi-task: Training on SQuAD + DS data
Retrieval-Augmented Language Model PreTraining, REALM:
- Augment language pre-training with a neural knowledge retriever that retrieves knowledge from a textual knowledge corpus (e.g., Wikipedia)
- Allow the model to attend documents from a large corpus during pre-training, fine-tuning and inference
- Pre-training of REALM: The knowledge retriever and knowledge-augmented encoder are jointly pre-trained on the unsupervised language modeling task
- Fine-tuning of REALM: The pre-trained retriever (θ) and encoder (φ) are fine-tuned on a task of primary interest, in a supervised way
- Excellent performance for open-domain QA
Document Retrieval and Synthesis with GPT3
- WebGPT
- Outsource document retrieval to the Microsoft Bing Web Search API
- Utilize unsupervised pre-training to achieve high-quality document synthesis by fine-tuning GPT-3
- Create a text-based web-browsing environment that both humans and language models can interact with
- Pipeline:
- Fine-tune GPT-3 to imitate human behaviors when using the web-browser
- Write down key references when browsing
- After browsing, generate answers with references
- WebGPT-produced answers are preferred over human-generated ones
- Better coherence and factual accuracy
- WebGPT
Demo
- QA with T5 using OpenPrompt: zero-shot inference. The video is here.
- QA with T5 using OpenPrompt and OpenDelta: Delta tuning.
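The course demo uses OpenPrompt/OpenDelta; as a rough stand-in, here is a plain Hugging Face sketch of zero-shot QA with a QA-tuned T5 checkpoint. The UnifiedQA model name and its "question \n context" input format are assumptions based on that project's documentation, not the demo itself.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

name = "allenai/unifiedqa-t5-small"         # one publicly released QA-tuned T5 checkpoint
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

# UnifiedQA expects lowercased "question \n context" as a single input string
prompt = "who wrote hamlet?\nshakespeare wrote many plays, including hamlet."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```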
Text Generation(TG)
TG
Introduction to text generation
- Formal Definition: Produce understandable texts in human languages from some underlying non-linguistic representation of information. [Reiter et al., 1997]
- Text-to-text generation and data-to-text generation are both instances of TG [Reiter et al., 1997]
- Applications under umbrella of text generation
Tasks of text generation: Data-To-Text (image, table, graph), Dialogue, Machine Translation, Poetry Generation, Style Transfer, Storytelling, Summarization
Data-to-Text -> Various data forms: image, table, graph, …
Dialogue -> Generate conversations that meet the purpose in response to specific user input
Machine Translation -> Translate natural language sentences into a target language
Poetry Generation -> Generate texts that meet the rhythmic requirements of the poem, based on keywords, or emotional control, etc
Style Transfer -> Control the style of the input text while preserving its meaning
Storytelling -> Generate a story that meets the attribute requirements based on the given keywords, story line, etc.
Summarization -> Summarize the input text with selected part of input text (extractive) or with generated text (abstractive)
Neural text generation
- Language Modeling
- Predict next word given the words so far
- A system that produces this probability distribution is called a Language Model
- We use language models every day, such as …
- Conditional Language Modeling
- The task of predicting the next word, given the words so far, and also some other input
- x input/source
- y output/target sequence
- Seq2seq(Encoder -> Decoder)
- Seq2seq is an example of conditional language model
- Encoder produces a representation of the source sentence
- Decoder is a language model that generates target sentence conditioned on encoding
- seq2seq can be easily modeled using a single neural network and trained in an end-to-end fashion
- seq2seq training by teacher forcing
- Training: predict next word based on previous ground-truth tokens, instead of predicted tokens
- Testing: predict next word based on previous predicted tokens
- Exposure Bias: the gap between the training & testing distributions (see the sketch at the end of this list)
- Text-to-Text-Transfer-Transformer (T5):
- A Shared Text-To-Text Framework: reframing all NLP tasks into a unified text-to-text-format where the input and output are always text strings
- Training objective -> Colossal Clean Crawled Corpus (C4) dataset, a cleaned version of Common Crawl (deduplication, discarding incomplete sentences, and removing offensive or noisy content), Unlabeled data.
- Autoregressive Generation: Generate future values from past values.
- Generative Pre-Trained Transformer (GPT)
- GPT-1: Improving language understanding by generative pretraining
- GPT-2: Language models are unsupervised multitask learners
- GPT-3: Language models are few shot learners
- GPT-2
- GPT-2: Language models are unsupervised multitask learners
- Train the language model with unlabeled data, then fine-tune the model with labeled data according to corresponding tasks
- Non-Autoregressive Generation: Given a source, Generate in parallel.
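Before moving on to decoding, here is a minimal PyTorch sketch of the teacher-forcing scheme described earlier in this section: the decoder is always conditioned on the gold prefix during training, which is exactly what creates the exposure-bias gap at test time (the shapes and the GRU decoder are illustrative assumptions, not the course's model).

```python
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 64
embed = nn.Embedding(vocab_size, hidden)
decoder = nn.GRU(hidden, hidden, batch_first=True)
head = nn.Linear(hidden, vocab_size)
loss_fn = nn.CrossEntropyLoss()

target = torch.randint(0, vocab_size, (8, 20))     # gold target token ids (batch, seq_len)
inputs, labels = target[:, :-1], target[:, 1:]     # teacher forcing: condition on the gold prefix

states, _ = decoder(embed(inputs))
logits = head(states)                              # (batch, seq_len - 1, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), labels.reshape(-1))
loss.backward()                                    # at test time the model must feed back its own predictions instead
```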
Decoding
- Greedy decoding
- Beam search
- Sampling methods
- Pure sampling
- Top-n sampling
- Nucleus sampling
- Greedy Decoding: Generate the target sentence by taking argmax on each step of the decoder.
- Beam Search Decoding:
- Find a high-probability sequence
- Beam search
- On each step of decoder, keep track of the k most probable partial sequences
- After you reach some stopping criterion, choose the sequence with the highest probability
- Not necessarily the optimal sequence
- What’s the effect of changing beam size k
- Small k has similar problems to greedy decoding
- Ungrammatical, unnatural, nonsensical, incorrect
- Larger k means you consider more hypotheses
- Reduces some of the problems above
- More computationally expensive
- But increasing k can introduce other problems
- For neural machine translation (NMT): Increasing k too much decreases BLEU score (Tu et al., Koehn et al.)
- chit-chat dialogue: Large k can make output more generic
- Sampling-based Decoding:
- Pure sampling: On each step t, randomly sample from the probability distribution $P_{t}$ to obtain your next word
- Top-n sampling:
- On each step t, randomly sample from $P_{t}$, restricted to just the top-n most probable words
- $n = 1$ is greedy search, $n = V$ is pure sampling
- Nucleus sampling (Top-p sampling)
- On each step t, randomly sample from $P_{t}$, restricted to the top words that cover probability ≥ $p$
- $p = 1$ is pure sampling
- Sample with temperature: Before applying the final softmax, its inputs are divided by the temperature τ
- Increase n/p/temperature to get more diverse/risky output
- Decrease n/p/temperature to get more generic/safe output
- Both of these are more efficient than Beam search
In summary
- Greedy decoding
- A simple method
- Gives low quality output
- Beam search
- Delivers better quality than greedy
- If the beam size is too high, it will return unsuitable output (e.g., generic, short responses)
- Sampling methods
- Get more diversity and randomness
- Good for open-ended/creative generation (poetry, stories)
- Top-n/p/temperature sampling allows you to control diversity
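The decoding strategies above map directly onto arguments of the Hugging Face `generate` API; a hedged sketch with GPT-2, where the parameter values are arbitrary examples:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The meaning of life is", return_tensors="pt")
eos = tokenizer.eos_token_id

greedy = model.generate(**inputs, max_new_tokens=20, do_sample=False, pad_token_id=eos)              # greedy decoding
beam = model.generate(**inputs, max_new_tokens=20, num_beams=5, do_sample=False, pad_token_id=eos)   # beam search, k=5
top_k = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50, pad_token_id=eos)      # top-n (top-k) sampling
nucleus = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.9,
                         temperature=0.7, pad_token_id=eos)                                          # nucleus sampling + temperature
print(tokenizer.decode(nucleus[0], skip_special_tokens=True))
```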
Controllable Text Generation
- Control text generation: avoid repeating, more diverse, …
- Prompt methods
- Horror xxx
- Reviews xxx
- Add a prefix and train only that prefix. P-tuning / prefix tuning: prefix + LM, with only the prefix being trained.
- Modifying the probability distribution: move the output distribution closer to an "angel" (desired) model and away from a "devil" (undesired) model to control generation.
- Reconstructing model architecture:
- Modify the model architecture by adding Transformer components dedicated to encoding the control signal/relations. Before cross-attending to the source text, the decoder first cross-attends to the guidance signal so that it is aware of the control signal.
- Specialized encoder for guidance signal
- Decoder: self-attention -> (+guidance signal)cross-attention -> (+source document)cross-attention -> FFN
Text generation evaluation
- Common metrics
- BLEU (Bilingual evaluation understudy)
- easy to compute
- doesn’t consider semantics & sentence structure
- PPL (perplexity)
- Evaluate how well a probability model predicts a sample.
- Overlap-based Metric
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): recall-oriented; addresses the problem of missing content (low recall)
- NIST: consider the amount of n-gram information
- METEOR: based on the harmonic mean of precision and recall
- Distance-based Metrics
- Edit Dist (cosine similarity); SMD (embedding distance); YiSi (weighted similarity)
- Diversity Metrics
- Distinct (n-gram diversity); Entropy; KL divergence
- Task-oriented Metrics
- SPICE (Semantic Propositional Image Caption Evaluation)
- Human Evaluation
- Intrinsic (fluency, internal relevance, correctness)
- Extrinsic (performance on downstream subtasks)
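As a small worked example of the automatic metrics above, BLEU can be computed with the sacrebleu package (one common implementation; the sentences below are made up):

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]   # one reference stream, aligned with the hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # corpus-level BLEU in [0, 100]
```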
TG Tasks: Challenges
Challenges
Training model strategy
- Always generate repeated words
- Exposure bias
Commonsense
- Lack of logical consistency
Controllability
- Difficult to ensure both language quality and control quality
Evaluation: reasonable metrics and datasets
Demo: GPT-2
- Task
- The WebNLG challenge consists in mapping data to text
- The training data consists of Data/Text pairs where the data is a set of triples extracted from DBpedia and the text is a verbalization of these triples.
- Example:
- a. (John_E_Blaha birthDate 1942_08_26) (John_E_Blaha birthPlace San_Antonio) (John_E_Blaha occupation Fighter_pilot) b. John E Blaha, born in San Antonio on 1942-08-26, worked as a fighter pilot
- Text generated with untuned GPT-2
- Loss
- Text generated with tuned GPT-2
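A hedged sketch of one fine-tuning step for this demo: a WebNLG triple set and its reference text are linearized into a single sequence (the " = " separator is our own convention, not necessarily the demo's), and GPT-2 is trained with the standard causal-LM loss.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

data = "(John_E_Blaha birthDate 1942_08_26) (John_E_Blaha birthPlace San_Antonio)"
text = "John E Blaha was born in San Antonio on 1942-08-26."
batch = tokenizer(data + " = " + text, return_tensors="pt")

# Causal-LM fine-tuning step: the model shifts the labels internally and computes cross-entropy
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
print(float(outputs.loss))
```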
L7 BM x Biomedical
Introduction
Outline
- Brief Introduction of Biomedical NLP
- Biomedical Text Mining: Tasks, PLMs, Knowledge, Application
- Diagnosis Assistance: Text Classification, Conversation
- Substance Representation: DNA, Protein, Chemicals
- Project: BioSeq PLMs and Benchmark
- Biomedical NLP: Future Directions
What does biomedical NLP study?
- Searching and reading large volumes of long literature? → Obtain ready-made knowledge directly!
- Lining up at the door of the consulting room? → Ask an automatic diagnosis system for efficiency!
- Predicting the properties of an organic substance? → Use AI models to gain deeper insights into biomedical substances!
What does biomedical NLP study?
- For knowledge and efficiency: biomedical literature, drug instructions, clinical records, experimental operation guide, …
- For practical applications: diagnosis assistance, meta-analysis, exploration for new drugs, pharmacy, …
- For insights into domain-specific data: molecules, proteins, DNA, …
Biomedical NLP can go far beyond the traditional ‘language’.
What characteristics does biomedical NLP have?
- Mass of raw data / Little golden annotated data
- Unsupervised and Weakly supervised / Supervised
- Resources: PubMed, ChemProt
What characteristics does biomedical NLP have?
- High knowledge threshold
- knowledge-enhanced learning
Text Mining: Tasks
Entities -> BioNER/BioNEN
Traditional: Dictionary-based; Semantic; Statistical. DL-based: End2end. https://www.ncbi.nlm.nih.gov/research/pubtator/
Rule-based; CRF…
Highlighted words are recognized entity mentions.
- Link entities to various KBs.
Literature -> topic recognition/indexing
- Supervised machine learning models;
- Ranking models; Ontology matching models.
- PubMed literature search interface
Relations & Events -> BioRE/RD, Event Extraction
- Template/rule-based; Statistical
- NLP(parsing)-based; Sequence Labeling
Pathways & Hypotheses -> pathway extraction / literature-based discovery
- Rule-based; ML-based; Hybrid.
- ABC co-occurrence model based
A common pipeline of biomedical text mining
- Named entity recognition (NER) -> Named entity normalization (NEN) -> Relation Extraction (RE)
- Simple but effective baselines for NER (including entity typing): CNNs, BiLSTM + CRF
- With PLMs as the backbone: BERT + CRF, BERT + Prompt
- Common scenario for NEN: representation "distance"
- Key for NEN: entity disambiguation (context + knowledge in the KB)
- SciSpacy: a Python package tailored for biomedical semantic analysis, including NER and NEN pipelines
- PubTator: a Web-based system providing automatic NER and NEN annotations (PubMed + PMC)
BERT + BiLSTM + CRF (A common Method for NER)
- BERT + Prompt (Entity Typing)
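A tiny hedged example of running the SciSpacy pipeline mentioned above for biomedical NER; it assumes the `en_core_sci_sm` model has been installed from the scispacy release wheels.

```python
import spacy  # pip install scispacy, plus the en_core_sci_sm model wheel

nlp = spacy.load("en_core_sci_sm")
doc = nlp("Rosiglitazone decreased plasma resistin levels in patients with type 2 diabetes mellitus.")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char)   # recognized entity mentions with character offsets
```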
A common pipeline of biomedical text mining
- Named entity recognition (NER) → Named entity normalization (NEN) → Relation Extraction (RE)
- RE: sentence-level / document-level
- Benchmarks: ChemProt, PPI / BC5CDR, GDA
- Common Methods: BERT-based and graph-based methods
- Relation types: from binary to complex
A simple BERT-based document-level RE model
A GCN-based document-level RE model
Data characteristics of biomedical text mining
- The cost of professional data labeling is extremely high
- Problems concerned with data: small scale and incomplete categories
- ChemProt: chemical – proteins, 1820 / BC5CDR: chemical – diseases, 1500
- Unsupervised: PLMs; Weakly Supervised: distant supervision (denoise)
- An example of labeling PubMed with CTD
- Common labeling strategy: NER + NEN tools + KG; model-based methods
Model-based denoising
Self-Training denoising
Text Mining: PLMs
- PLMs have shown their power in a wide variety of tasks (the power of unsupervised learning)
- Domain-specific PLM:
- domain corpus (Sci-BERT, BioBERT, clinical BERT, …)
- special pretraining task (MC-BERT, KeBioLM, …)
Text Mining: Knowledge
- Knowledge Bases (KBs)/Knowledge Graphs (KGs)
- An important application of text mining: unstructured -> structured
- Famous KBs: MeSH, UMLS, NCBI Gene, UniProt, …
- KGs: CTD, DisGeNet, HuRI, …
- Challenges: KBs all have their own limitations and are far from unified; KGs are small in scale and incomplete
- Conversely, KBs/KGs can also help the model to better handle downstream tasks
- Knowledge-Enhanced:
- shallow (entity disambiguation)
- deep (semantic information in intricate KGs)
- Methods to integrate knowledge into PLMs: Adapters, Customized pretraining tasks, Prompt Tuning, Delta Tuning, …
- Enhanced NER for proteins and genes
- SMedBERT: Enhanced PLM
Text Mining: Application
NER and NEN:
- Easy access to knowledge when reading literature
- Bridge the gap between documents and KBs/KGs
- Map colloquial expressions (e.g., patient consultations) to standard technical terminology
- triage / QA assistance
Building of KBs/KGs:
- Obtain Knowledge within several clicks
- Is that enough?
- Search for entity “aspirin” in CTD
- Diseases and evidences related to “aspirin”
Relation Extraction:
- Building of knowledge graphs
- Relation-aware literature retrieval
NER + NEN + RE (sometimes Event Extraction, …):
- Clinical analysis: Automatically extract and analyze valid information from clinical records and integrate experimental conclusions
- Lead to new biomedical discovery and hypothesis
30 patients with type 2 diabetes mellitus who showed poor glycemic control with glimepiride (4 mg/d) were randomized to rosiglitazone (4 mg/d) and metformin (500 mg bid) treatment groups. The plasma concentrations of resistin were measured at baseline and at 6 months of treatment for both groups. The resistin levels decreased in the rosiglitazone group (2.49 ± 1.93 vs 1.95 ± 1.59 ng/ml; P < .05) but increased in the metformin group (2.61 ± 1.69 vs 5.13 ± 2.81 ng/ml; P < .05)…
Diagnosis Assistance
- Biomedical NLP for the crowd
- Scarce medical resources / Flourishing online services
- Reduce the pressure on doctors and improve the work efficiency of hospital systems
Diagnosis Assistance: Text Classification
- Common tasks: automatic triage&medicine prescription
- Datasets: annotated entities prediction
- Backbones: SVM, LSTM; BERT; GPT…
- Classify as a matching/retrieval process
We may try to inject more knowledge (e.g., descriptions from KBs)
Diagnosis Assistance: Dialogue
AI systems: replace the doctor’s role to complete more operations including communicating with the patients
Datasets: MedDialog (a large-scale Chinese dataset)
Dialogue as a typical text generation task:
- Different from QA: usually multi-turn; no candidate answer
- Chat-box; task-oriented …… many practical systems
Retrieval-based Dialogue System: traditional method
Fluent but not always related
Combine with generation-based DS
Knowledge-based Dialogue System: More logical
In the real world: …
Incorporate knowledge
Human thinking process
- Language models capture knowledge and generate language
- Dialogue Generation from KGs with Graph Transformers
Medical Dialogue: Safety( Plenty of knowledge + Interpretability)
A typical application for medical knowledge interactivity:
- Users -> Models: extract empirical knowledge
- Models -> Users: query existing knowledge
Stylized language: the gap between patients' colloquial style and the standard terms and structured items in KBs/KGs
- Entity Linking / Standardization for diagnosis
- Privacy protection
Summarize the key concepts
Ready for the further KB enhancing
Patient states & Physician policies
KL loss for state distribution
Clear and understandable
- 1st: States training
- 2nd: States+Actions training
Our exploration:
- Multi-task & soft prompt learning during pre-training
- 2-stage framework for the medical dialogue task
Diagnosis Assistance
- Something about Big Models:
- Externally, we integrate KBs/KGs during the encoding of medical dialogue text
- Internally, we regard the PLM itself as a KB, hoping to query corresponding information from it
- Prompt/Cloze? CoT?
- How to protect privacy?
Substance Representation
- NLP systems can process natural language text
- What if we want to process biomedical substances?
- NLP systems can process not only natural language text
- To represent biomedical substances as linear text
- Background knowledge review
- Nucleic acid sequence: A, G, C, T (U)
- Amino acid sequence: 20 for human
- Protein: Quaternary structure
Substance Representation: DNA
Major research object: non-coding DNA
Tasks:
- predict gene expression
- predict proximal and core promoter regions
- identify transcription factor binding sites
- figure out important regions, contexts and sequence motifs
- …
Datasets: plenty of open-access resources
- Homo sapiens genome assembly (CRCh38/hg38)
- Cap Analysis Gene Expression (CAGE) Databases
- Descartes: Human Chromatin Accessibility During Development
- ……
Natural language models are good at capturing patterns from masses of sequence data
From simple frameworks (e.g., CNN & LSTM) to Transformers
The token vocabulary is much smaller than in natural language -> less information in individual token embeddings
- position is important
- k-mer sliding window input
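The k-mer sliding-window input mentioned above can be sketched in a few lines; this is the tokenization scheme used by DNA language models such as DNABERT (the values of k and stride are illustrative):

```python
def kmer_tokenize(sequence, k=6, stride=1):
    """Split a DNA sequence into overlapping k-mers with a sliding window."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

print(kmer_tokenize("ATGCGTAC", k=3))
# ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
```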
Substance Representation: Protein
- We mainly focus on the amino acid sequences
- Tasks:
- Structure Prediction
- Evolutionary Understanding
- Protein Engineering
- …
- Datasets:
- Uniref: provide clustered sets of sequences from the UniProt Knowledgebase
- GO annotations: capture statements about how a gene functions at the molecular level
- Protein Data Bank……
- Methods: BiLSTM + CRFs, Autoencoder models …
- Big Model:
- Models with larger-scale parameters are better at capturing features from biological sequences.
- Pre-training has proved to be especially helpful!
- Alpha-Fold: One of the most inspiring research results!
- Predict 3D structure with the help of molecular dynamics
- MSA + EvoFormer + end-to-end training: a perfect combination of biomedical knowledge and NLP techniques
- A breakthrough for the 3D structure prediction accuracy (comparable to human level)
- Inspired by AlphaFold: MSA Transformer
- Column/Row attention structure
- Mean attention better than individual?
- EvoFormer: unsupervised MSA mask learning for initialization
- Structure: annotated data for the initial network; predict the unannotated data and apply noisy-student training
- MSA row/column attention; templates
- A representation for each pair of residues
- pairwise repr graph iterative update
- single repr and pair repr; blackhole initialize; Peptide bond angles and distances
- Interaction is here
Substance Representation: Chemicals
- Molecular fingerprints: essential cheminformatics tools for virtual screening and mapping chemical space
- Get fingerprint representations with deep-learning models?
- Molecular graphs -> GCNs; SMILES strings -> LMs
- Tasks: molecule property classification, chemical reaction classification, …
- Datasets: MoleculeNet, USPTO 1k TPL, …
- Case: KV-PLM
- Bridging chemicals with general text
- Complementary features of heterogeneous data
- Inspired by human observing and learning mapping correlation
- PLM integrating chemical structure & text
- Comprehensively processing both SMILES strings and general text
- Model finishing chemical exam: property prediction
- Conversely, it provides help for drug discovery
Project: BioSeq PLMs and Benchmark
- Background
- NLP technologies are widely applied to processing biological sequences
- There exist differences between natural language and bio-sequences; better PLMs are expected to be proposed.
- Long-term Goals
- Propose a robust and comprehensive benchmark for DNA data process
- Explore better model structure and pre-train method for DNAs
- Projects
- Reproduce and improve DNA pre-trained baseline methods
- Build down-stream DNA tasks from open-source databases
Biomedical NLP: Future Directions
- Knowledgeable big model: models with more expert knowledge achieving better performance
- AI for science: user-friendly assistant tools with lower barriers to entry; unleash human researcher productivity
- Cross-modal processing: bridging vision-language information or different forms of data (e.g., graphs)
- Low-resource learning: lack of annotated data
L8 BM x Legal Intelligence
Background
Challenges
- In US, roughly 86% of low-income individuals with civil legal problems report receiving inadequate or no legal help
- In China, roughly 80% of cases have no access to the support of lawyers
Legal Artificial Intelligence (LegalAI)
- AI for Law: Apply the technology of artificial intelligence, especially natural language processing, to benefit tasks in the legal domain
- Law for AI: Use laws to regulate the development, deployment, and use of AI
AI for Law
- Reduce the time consumption of tedious jobs and improve work efficiency for legal professionals
- Provide a reliable reference to those who are unfamiliar with the legal domain
Challenges
- Lack of labeled data -> There are only limited high-quality human-annotated data for legal tasks, and data labeling is costly
- High demand for professional knowledge -> Legal tasks usually involve many legal concepts and knowledge
Legal Intelligence Applications
Legal Judgement Prediction -> Given the fact description, legal judgement prediction aims to predict the judgement results, such as relevant law articles, charges, prison terms
Legal Judgement Prediction
- Multiple subtasks
- Criminal cases: relevant law article prediction, charge prediction, prison term prediction, fine prediction …
- Civil cases: relevant law article prediction, cause of action prediction, ……
- Task formalization
- Inputs: the fact description
- Relevant law article: classification
- Charge/Cause of action: classification
- Prison term/Fine: regression
- Challenges
- Confusing charges
- Interpretability
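To make the task formalization above concrete, here is a hedged sketch of charge prediction as multi-class text classification over the fact description; `bert-base-uncased` is only a placeholder backbone, and a legal PLM (e.g., Legal-BERT or OpenCLaP, discussed later in this lecture) would be dropped in the same way.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

num_charges = 118   # e.g., the number of criminal charges in the event dataset mentioned below
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_charges)

fact = "The defendant entered the store at night and took goods worth 3,000 yuan."
inputs = tokenizer(fact, truncation=True, max_length=512, return_tensors="pt")
logits = model(**inputs).logits                 # (1, num_charges)
print(int(torch.argmax(logits, dim=-1)))        # meaningless until the classification head is fine-tuned
```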
Similar Case Retrieval
- Given a query case, similar case retrieval aims to retrieve relevant supporting cases
- Task formalization
- Query case: q
- Candidate cases: C
- Outputs: relevance score for each query-candidate pair $(q, c_i)$
- Challenges
- Long document matching
- Relevance definition
- Diverse user intention
Legal Question Answering
- Legal question answering aims to provide explanations, advice, or answers for legal questions.
- Task formalization
- Inputs: question
- Step 1: retrieve the relevant knowledge (law articles, legal concepts) from the knowledge base
- Step 2: answer the question based on the relevant knowledge
- Challenges
- Concept-Fact Matching
- Multi-hop reasoning
- Numerical reasoning
Court View Generation
- Given the fact description and plaintiff’s claim, court view generation aims to generate the rationales and results of the cases.
- Task formalization
- Inputs: claim and fact description
- Outputs: The decisions (Accept/Reject) and the corresponding reasons
Other applications
- Legal Cases Retrieval
- Legal Information Recommendation
- Risk Warning
- Legal Judgment Prediction
- Legal Documents Translation
- Legal Text Mining
- Legal Documents Generation
- Legal Question-Answering
- Compliance Review
Two Lines of Research
- Data-Driven Methods
- Legal cases
- Trademarks / Patents
- Court Trial
- Knowledge-Guided Methods
- Legal Regulations
- Judicial Interpretation
- Legal Literature
Data-Driven Methods
- Utilize deep neural networks to capture semantic representations from large-scale data
- Large-scale open-source legal corpora
- 130 million legal case documents
- 160 million patent/trademark documents
- 19 million court trial records
- Typical data-driven methods
- Word embeddings
- Pre-trained language models
- Open-domain PLMs are suboptimal for the legal domain
- Differences in narrative habits and writing styles
- Abundant terminology and concepts specific to legal documents
- Train PLMs based on large-scale unlabeled legal documents
- Masked Language Model
- PLMs in the legal domain
- Don’t stop pre-training!
- Additional pre-training on target corpora can lead to performance improvement
- Legal-BERT: pretrained on English legal documents
- OpenCLaP: pretrained on Chinese legal documents
- PLMs for long documents in the legal domain
- Legal documents usually involve complex facts and contain 1,260.2 tokens on average
- Most existing PLMs can only handle documents with no more than 512 tokens
- PLMs for legal long documents in the legal domain
- Lawformer utilizes the sparse self-attention mechanism instead of full self-attention mechanism to encode the long documents
- Pre-training Data, Model Parameters, Tasks
- Lawformer can achieve significant performance improvement
- Legal PLMs: Learning Responsible Data Filtering from the Law
- Privacy Filtering
- the law provides a number of useful heuristics that researchers could deploy to sanitize data
- juvenile names, dates of birth, account, and identity number
Knowledge-Guided Methods
Knowledge-Guided Methods
- Enhance the data-driven neural models with the legal domain knowledge to improve the performance and interpretability on downstream tasks
- Knowledge in open-domain
- Knowledge Graphs
Typical legal knowledge
- Events that occurred in the cases
- Decision-making elements
- Legal logic
- Legal regulations
LegalAI Applications
- Legal Judgement Prediction -> Given the fact description, legal judgement prediction aims to predict the judgement results, such as relevant law articles, charges, prison terms
- Legal Event Knowledge
- Key of Legal Case Analysis: Identifying occurred events and causal relations between these events
- Legal events can serve as high-quality case representations
- Existing Legal Event Datasets
- Incomprehensive event schema
- Limited coverage: only contain tens of event types with a narrow scope of charges
- Inappropriately defined: only contain charge-oriented events and ignore general events
- Limited data annotations
- Only contain thousands of event mentions
Our Goal
- Large-scale: 8,116 legal documents with 118 criminal charges and 150,977 mentions
- High coverage: 108 event types, including 64 charge-oriented events and 44 general events
Legal Events for Downstream Tasks
- Combine the pretrained models with the legal event knowledge
- Add occurred events as additional features to generate the document representation
Legal Events for Judgement Prediction
- Combine the pretrained models with the legal event knowledge
- Utilize occurred events as features to represent legal cases
- low-resource setting
- full-data setting
Legal Events for Similar Case Retrieval
- Combine the pretrained models with the legal event knowledge
- Utilize occurred events as features to represent legal cases
- unsupervised setting
- supervised setting
Legal Element Knowledge
- Legal elements refer to crucial attributes of legal cases, which are summarized by legal experts
- Long-tail distribution -> Top 10 charges cover 78.1% cases
- Confusing charges -> Theft vs. Robbery
Legal Elements for few-shot and confusing charges
- Combine data-driven deep learning methods with legal element knowledge
- Utilize elements as additional supervision signals to improve the performance on low-frequency charges
Legal Elements for interpretable prediction
- Existing methods usually suffer from the lack of interpretability, which may lead to ethical issues
- Following the principle of elemental trial, QAJudge is proposed to visualize the prediction process and give interpretable judgments
- QAJudge can achieve comparable results with SOTA models, and provide explanation for the prediction results
Legal Logic Knowledge
- Topological dependencies between subtasks
- There exists a strict order among the subtasks of legal judgment
- Capture the dependencies with recurrent neural network unit
Legal Regulations
- Legal regulations are one of the most important knowledge bases for legal intelligence systems
- Compared to structured legal knowledge, unstructured legal regulations do not require manual knowledge summarization, so the cost of acquiring such knowledge is much lower
Legal Regulations for Judgement Prediction
- The judgement results are predicted based on both the fact descriptions and relevant law articles
- The aggregation is performed via the attention mechanism
Legal Regulations for Question Answering
- Legal QA requires textual legal regulations, semantic retrieval, and cognitive reasoning
Legal Knowledge-Guided Methods
- Legal Event Knowledge
- Legal Element Knowledge
- Legal Logic Knowledge
- Legal Regulation Knowledge
- ……
Advantages
- Learn from limited labelled data
- Improve the reasoning ability
Demo: https://law.thunlp.org/
Quantitative Analysis for Legal Theory
Mining patterns from a large number of case documents to improve or supplement legal theory
Common Law System
- The outcome of a new case is determined mostly by precedent cases, rather than by existing statutes
- Halsbury believes that the arguments of the precedent are the main determinant of the outcome.
- Goodhart believes that what matters most is the precedent’s facts.
Mutual information test
Legal Fairness Analysis
- Motivation: Fairness is one of the most important principles of justice. The ability to quantitatively analyze the fairness of cases can help to implement judicial supervision and promote fairness and justice.
- Goal: to perform fairness analysis on large-scale real-world data
- Similar cases should be judged similarly!
- Train different virtual judges (sentence prediction models) and calculate their disagreements using standard deviations
- Synthetic datasets: we construct biased datasets by keeping facts the same and perturbing the term of penalty randomly with $\beta$ as the inconsistency factor
- The proposed method achieves a high correlation with the gold inconsistency factor
- Inconsistency is negatively correlated with the severity of the charges, i.e., felonies are sentenced more consistently than misdemeanors
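A toy sketch of the disagreement measure described above: several "virtual judges" predict the term of penalty for the same cases, and the standard deviation across judges serves as the inconsistency score (the predictions below are synthetic).

```python
import numpy as np

predictions = np.array([
    [12.0, 36.0, 6.0],   # judge 1: predicted months for three cases
    [10.0, 40.0, 7.0],   # judge 2
    [15.0, 33.0, 6.5],   # judge 3
])
inconsistency_per_case = predictions.std(axis=0)   # disagreement per case
print(inconsistency_per_case, inconsistency_per_case.mean())
```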
Future Directions
More Data
- Legal case documents: 120 million
- Trademarks and patents: tens of millions
- Legal consultation: tens of millions of LegalQA pairs
More Knowledge
- Laws and regulations: 1,000+
- Judicial interpretations: 1,000+
- Legal literature: hundreds of legal journals
More Interpretability: Providing explanation for answers
More Intelligence: Manipulating tools for cognitive intelligence
L9 BM x Brain Science
Magic Shared by the Brain and PLMs
Knowledge: Language-derived representation -> Sensory-derived representation
Shared computational principles for language processing
- Principle 1: next-word prediction before word onset.
- Principle 2: pre-onset predictions are used to calculate post-word-onset surprise.
- Principle 3: contextual vectorial representation in the brain.
Revealing the magic of language
- Function
- Representation: Note that the semantic representations derived from language input do not possess feelings or experiences of the world. Such representations do reflect perceptual (size), abstract (danger), and even affective (arousal and valence) properties of concepts. Semantic Representation is similar to human Mental Representation.
- Structure: machine model vs. human brain model
The next question – Towards an understanding of intelligence
- computational models
- symbolic models
- connectionist models
- biological neural models
- brain-activity data
- cell recordings
- fMRI
- EEG,MEG
- behavioral data
- reaction time
- errors
- explicit judgements
Neuron Activation
Neurons in PLMs
Background: Neurons in FFNs
- Transformer Architecture
- Feed Forward Neural Network
Sparse activation phenomenon
- Sparse Activation Phenomenon in Large PLMs
- 80% of inputs activate fewer than 5% of the FFN neurons
- No useless neuron that keeps inactive for all inputs
- Related to Conditional Computation
- Constrains a model to selectively activate parts of the neural network according to input
Cumulative distribution function (CDF) of the ratio of activated neurons in FFNs. Use T5-large (700 million parameters).
Conditional computation
- Deep Learning of Representations: Looking Forward (Bengio, 2013)
- Pathways (Jeff Dean, 2021)
- Today’s models are dense and inefficient
- Pathways will make them sparse and efficient
MoEfication
- Mixture-of-experts (MoE)
- Use MoE to increase model parameters with tiny extra computational cost
- Split existing models into multiple experts while keeping model size unchanged
- Expert Construction
- Group the neurons that are often activated simultaneously
- Parameter Clustering Split
- Treat the columns of $W_1$ as a collection of vectors
- K-means
- Co-Activation Graph Split
- Construct a coactivation graph
- Each neuron is represented by a node
- Edge weight between two nodes is their co-activation value
- Assign a score to each expert and select the experts with high scores
- Groundtruth Selection: Calculate the number of positive neurons in each expert as $s_i$
- Parameter Center: Average all columns of $W_1$ and use it as the center
- Learnable Router: Learn a router from the groundtruth on the training set
- Sparsity of different T5 models; MoEfication with different T5 models -> selecting 20% of the xlarge model's parameters retains 98% of its performance, and the effect improves as the model scales up.
- Observations on routing patterns: some experts are selected far more often than others (the load is not balanced), and different experts specialize differently.
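The Parameter Clustering Split above can be sketched with scikit-learn's k-means: each FFN neuron is represented by its input weight vector (a column of $W_1$), and neurons in the same cluster form one expert. All sizes and the random weight matrix here are placeholders, not course code.

```python
import numpy as np
from sklearn.cluster import KMeans

d_model, d_ff, num_experts = 512, 2048, 32

W1 = np.random.randn(d_model, d_ff)      # stand-in for a trained FFN weight; columns correspond to neurons
neuron_vectors = W1.T                    # one vector per neuron
labels = KMeans(n_clusters=num_experts, n_init=10).fit_predict(neuron_vectors)

experts = [np.where(labels == e)[0] for e in range(num_experts)]   # neuron ids assigned to each expert
print([len(e) for e in experts[:5]])
```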
Analyze PLMs through neurons
Specific Function
Expert units
- Identify whether the activation of a specific neuron can classify a concept
- $N_c^+$ positive sentences that contain concept c and $N_c^-$ negative sentences that do not contain concept c.
Concept expertise: give each unit an index m and treat the unit as a binary classifier for the input sentences to compute AP (average precision)
Concept distribution
Expertise and generalization results: Detect the model’s ability without fine-tuning
Concept overlap: Let the overlap between concepts q and v be…
Conditioned text generation: Selected expert units to compute
Compositional explanations of neurons
- Neurons learn compositional concepts
- Compositional explanations allow users to predictably manipulate model behavior
Find neurons
- For an individual neuron, thresholding its activation
- Compare with the mask of concepts
- Search for the most similar concept
- Find logical forms induced from the concepts: compose concepts via compositional operations (AND, OR, NOT)
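A small synthetic sketch of this search: threshold a neuron's activations into a binary mask, then score candidate concepts and their logical compositions (AND / OR / NOT) by intersection-over-union with that mask. All data here is made up.

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection-over-union between two binary masks over the same inputs."""
    union = np.logical_or(mask_a, mask_b).sum()
    return np.logical_and(mask_a, mask_b).sum() / union if union else 0.0

activations = np.random.randn(1000)                         # synthetic per-input activations of one neuron
neuron_mask = activations > np.quantile(activations, 0.95)  # threshold the activation

concept_a = np.random.rand(1000) > 0.9                      # synthetic concept masks
concept_b = np.random.rand(1000) > 0.9
candidates = {
    "A": concept_a,
    "B": concept_b,
    "A AND NOT B": np.logical_and(concept_a, np.logical_not(concept_b)),
    "A OR B": np.logical_or(concept_a, concept_b),
}
best = max(candidates, key=lambda name: iou(neuron_mask, candidates[name]))
print(best, round(iou(neuron_mask, candidates[best]), 3))
```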
Tasks
- Image Classification
- Scene Recognition
- ResNet-18
- NLI
- SNLI
- BiLSTM+MLP
- Probe neurons in MLP, input is premise-hypothesis pairs
- Concepts:
- Penn Treebank POS tags + 2000 most common words
- Appear in premise or hypothesis
- Whether premise and hypothesis have more than 0%, 25%, 50%, or 75% word overlap
- Additional Operator
- NEIGHBORS(C): the union of the 5 words closest to C
- Judged by cosine similarity of GloVe embeddings
Neuron Activation
Transferability Indicator
Recap prompt tuning:
- Training
- Transferability: Cross-Task Transfer
Prompt transfer -> Cross-Task Transfer (Zero-shot) -> For the tasks within the same type, transferring prompts between them can generally perform well.
Transferability indicator
- Motivation: Explore why the soft prompts can transfer across tasks and what decides the transferability between them
- Embedding Similarity
- Euclidean similarity
- Cosine similarity
- Model Stimulation Similarity (ON)
- Activated Neurons
- ON has a higher Spearman's correlation with transferability
- ON works worse on larger PLMs because of their higher redundancy
Activated neurons in a PLM -> Distribution of Activated Neuron -> The activated neurons are common in the bottom layers but more task-specific in top layers.
Activated Neurons Can Reflect Human-Like Emotional Attributes
Question: can PLMs learn human-like emotional attributes during pre-training?
How do humans recognize different emotions ?
- Human
- PLM (Activated Neurons)
Correlation -> Represent 27 emotions with human attributes and activated neurons
Activated neurons for every attribute
Remove neurons for an attribute
Demo: https://github.com/thunlp/Prompt-Transferability Find: Activated Neurons Demo [Colab link]
Activated Neurons
- Load Pre-trained Language Model (Roberta)
- Load the prompts (checkpoints) - 27 Emotion Tasks
- Activate Neurons
- Activated neurons in each layer -> Input: ['realization', 'surprise', …, 'remorse']
- Cosine Similarity of Activated Neurons -> Input: [‘realization’, ‘surprise’, …, ‘remorse’]
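The last two demo steps can be sketched with synthetic data: represent each task by its mean FFN activation pattern and compare tasks via cosine similarity (in the real demo the activations come from RoBERTa with the 27 emotion-task prompts).

```python
import torch

neurons_task_a = torch.relu(torch.randn(3072))   # synthetic mean activations for one emotion task
neurons_task_b = torch.relu(torch.randn(3072))   # synthetic mean activations for another task
similarity = torch.nn.functional.cosine_similarity(neurons_task_a, neurons_task_b, dim=0)
print(float(similarity))
```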
Cognitive Abilities of Big Models
Task generalization of PLM
Question: why can PLMs easily adapt to various NLP tasks even with small-scale data?
PLM acquires versatile knowledge during pre-training, which can be leveraged to solve various tasks
Cognitive abilities of PLMs
- Recent studies have shown that PLMs also have cognitive abilities, and can manipulate existing tools to complete a series of complex tasks
Fundamentals & framework
- Imitation learning in RL
- Learning from behaviors instead of accumulating rewards
- State-action pairs
- State as features and action as labels
- Target: imitate the trajectory of behaviors
- Large-scale pre-trained models
- Universal knowledge learned from pre-training
- Interactive space
- An environment that models could interact with
- State space: display states and changes
- Action space: a set of pre-defined actions in the environment
- Given a goal, we model each action and state to achieve the goal in a unified space by a PLM
- Tokenization
- Tokenize human behaviors (actions in the action space) and states in the state space to a same space
- The tokenized information could be processed by PLM
- Directly training
- The behaviors could be autoregressively predicted
- Imitation learning in RL
Interactive space:
- Search engine
- WebShop
- Sandbox
Search engine
- A rising challenge in NLP is long-form QA -> A paragraph-length answer is generated in response to an open-ended question
- The task has two core components: information retrieval and information synthesis
- WebGPT
- Outsource document retrieval to the Microsoft Bing Web Search API
- Utilize unsupervised pre-training to achieve high-quality document synthesis by fine-tuning GPT-3
- Create a text-based web-browsing environment that both humans and language models can interact with
- Text-based web-browser
- WebGPT-produced answers are preferred over human-generated ones
- Better coherence and factual accuracy
- An example -> How does neural networks work?
WebShop
- WebShop for online shopping (HTML mode)
- Simple mode which strips away extraneous meta-data from raw HTML into a simpler format
- Actions in WebShop
- Item rank in search results when the instruction is directly used as search query
- Model implementation
- Results
Sandbox
- Video PreTraining (VPT) on Minecraft -> A sandbox like Minecraft is a good interactive space
- Video PreTraining (VPT) on Minecraft -> Define discrete actions in the interactive space
- Cost
- Use behavior model to annotate unlabeled 70 hours video
- Reduce the cost: 1,400,000 -> 130,000
- Annotation trick
- At first, casually playing MineCraft
- Play specific tasks (Equip Capabilities)
- Results -> VPT accomplishes tasks impossible to learn with RL alone, such as crafting planks and crafting tables (tasks that require roughly 970 consecutive actions, even for a proficient human)
- Results -> An example for killing a cow
Challenges & limitations
- Building interactive space is time-consuming
- Labeling is expensive and labor-intensive
- The goal must be clear and simple
- Only discrete actions and states are supported
- A clean interactive space is required