Tsinghua OpenBMB NLP
These are notes on big models, from a course jointly run by Tsinghua and the OpenBMB community! Talk is cheap, show me the code.
Outline
- Course outline:
- Basic Knowledge of Big Models
- L1-NLP Big Model Basics (GPU server, Linux, Bash, Conda, …)
- L2-Neural Network Basics (PyTorch)
- L3-Transformer and PLMs (Huggingface Transformers)
- Key Technology of Big Models
- L4-Prompt Tuning & Delta Tuning (OpenPrompt, OpenDelta)
- L5-Efficient Training & Model Compression (OpenBMB suite)
- L6-Big-Model-based Text understanding and generation
- Interdisciplinary Application of Big Models
- L7-Big Models X Biomedical Science
- L8-Big Models X Legal Intelligence
- L9-Big Models X Brain and Cognitive Science
L1 NLP Basics
- Some NLP tasks:
- POS tagging: label the part of speech of each word in a sentence
- Named entity recognition: find the named entities in a sentence
- Coreference resolution: determine which entity a pronoun refers to
- Recognizing syntactic structure and dependency relations in a sentence
- Chinese word segmentation
- text matching
- query engine
- Knowledge graphs
- machine reading
- machine translation
- Human-machine dialogue
- Personal Assistant
- Sentiment Analysis and Opinion Mining
- Computational Social Science
- Social change
- Psychological change
- …
Word Representation
The basic problem: how to represent words.
Enable machines to represent words and compute word similarity.
Enable machines to understand the semantic relations between words.
One-hot representation (one dimension per vocabulary word)
- All the vectors are orthogonal. No natural notion of similarity for one-hot vectors
Use context words to represent the current word.
- Increase in size with vocabulary
- Require a lot of storage
- Sparsity issues for those less frequent words -> Subsequent classification models will be less robust
Word Embedding: Distributed Representation
- Build a dense vector for each word learned from large-scale text corpora
- Learning method: Word2Vec (We will learn it in the next class)
Language Models
- Two capabilities
- Compute the joint probability of a sequence of words.
- Predict the upcoming word given the preceding words.
- Assumption: each word depends only on the words before it, so the joint probability is a simple product of conditional probabilities (see the formulas after this list).
- N-gram Model:
- Simple counting: predict the next word from occurrence frequencies (pick the most frequent continuation). It relies on the Markov assumption: only a limited number of preceding words are used when counting.
- Not considering contexts farther than 1 or 2 words
- Not capturing the similarity between words
- Neural Language Model:
- A neural language model is a language model based on neural networks to learn distributed representations of words
- Associate words with distributed vectors
- Compute the joint probability of word sequences in terms of the feature vectors
- Optimize the word feature vectors (embedding matrix E) and the parameters of the loss function (map matrix W)
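As a compact restatement of the assumptions above (standard formulation, added here for reference): the chain rule factorizes the joint probability, and an n-gram model truncates the context,

$$P(w_1,\dots,w_T)=\prod_{t=1}^{T}P(w_t\mid w_1,\dots,w_{t-1})\approx\prod_{t=1}^{T}P(w_t\mid w_{t-n+1},\dots,w_{t-1})$$

with each factor estimated by counting, e.g. for bigrams

$$P(w_t\mid w_{t-1})=\frac{\mathrm{count}(w_{t-1},w_t)}{\mathrm{count}(w_{t-1})}$$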
Big Models
- Why Big Models
- As model size and training data scale up, performance and capabilities improve significantly.
- Capabilities:
- World Knowledge
- Common Sense
- Logical Reasoning
- Interest in big models keeps rising
- Why LLMs work: Large-scale Unlabeled Data (Model Pre-training) -> Task-specific Training Data (Model Fine-tuning) -> Final Model
- The basic paradigm of pre-training and fine-tuning can be traced back to transfer learning. Humans can apply previously learned knowledge to handle new problems faster, and we want machines to have similar abilities.
- Prerequisites
- GPU
- You own
- Rent
- Use Google colab
- SSH
- Linux command
- Vim
- Tmux
- Virtual environment & conda & pip
- Vscode + remote connection
- Git
- Bash
L2 NN Basics
Outline
Neural Network Components
Simple Neuron; Multilayer; Feedforward; Non-linear; …
How to Train
- Objective; Gradients; Backpropagation
Word Representation: Word2Vec
- Common Neural Networks
- RNN
- Sequential Memory; Language Model
- Gradient Problem for RNN
- Variants: GRU; LSTM; Bidirectional;
- CNN
NLP Pipeline Tutorial (PyTorch)
How NN works
A single layer neural network: Hooking together many simple neurons. Multilayer Neural Network: Stacking multiple layers of neural networks.
Forward Propagation & Backward Propagation.
Without non-linearities, deep neural networks cannot do anything more than a linear transform. Extra layers could just be compiled down into a single linear transform. With non-linearities, neural networks can approximate more complex functions with more layers!
Input -> Hidden -> Output; the output layer depends on the task:
Linear output: for predicting continuous values.
Sigmoid output: squashes the output into (0, 1); suitable for binary classification.
Softmax output: for multi-class classification.
Choices of non-linearities: Sigmoid, Tanh, ReLU
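A minimal PyTorch sketch of the three output choices (the layer sizes here are arbitrary placeholders):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 16)                                    # a batch of 4 hidden vectors of size 16
regression = nn.Linear(16, 1)(x)                          # linear output: raw continuous values
binary_prob = torch.sigmoid(nn.Linear(16, 1)(x))          # sigmoid output: squashed into (0, 1)
class_prob = torch.softmax(nn.Linear(16, 5)(x), dim=-1)   # softmax output: 5-way probabilities
```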
Summary
- Simple neuron
- Single layer neural network
- Multilayer neural network
- Stack multiple layers of neural networks
- Non-linearity activation function
- Enable neural nets to represent more complicated features
- Output layer
- For desired output
Training NN
- Loss function: Mean Squared Error (MSE), used to measure how well a regression model fits.
- Loss function: Cross-entropy, used for classification; it measures the negative log probability assigned to the correct class.
- Minimizing the loss: Stochastic Gradient Descent (SGD); see the sketch after this list.
- Chain rule: used to compute gradients in neural networks.
- Backpropagation
- Compute gradients algorithmically
- Used by deep learning frameworks (TensorFlow, PyTorch, etc.)
- Computational Graphs: Representing our neural net equations as a graph
- Source node: inputs
- Interior nodes: operations
- Edges pass along result of the operation
- Go backwards along edges: Pass along gradients
- Single Node:
- Node receives an “upstream gradient”
- Goal is to pass on the correct “downstream gradient”
- Each node has a local gradient: The gradient of its output with respect to its input. [downstream gradient] = [upstream gradient] x [local gradient]
- Summary:
- Forward pass: compute results of operation and save intermediate values
- Backpropagation: recursively apply the chain rule along computational graph to compute gradients
- [downstream gradient] = [upstream gradient] x [local gradient]
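Putting these pieces together, a minimal PyTorch sketch of one training step (toy network sizes and randomly generated data, for illustration only):

```python
import torch
import torch.nn as nn

# A toy multilayer network for 3-way classification.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
loss_fn = nn.CrossEntropyLoss()                          # negative log-probability of the correct class
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # stochastic gradient descent

inputs = torch.randn(8, 10)                              # a mini-batch of 8 examples
labels = torch.randint(0, 3, (8,))                       # gold class indices

logits = model(inputs)                                   # forward pass: intermediate values are saved
loss = loss_fn(logits, labels)                           # compute the loss
loss.backward()                                          # backpropagation via the chain rule
optimizer.step()                                         # update parameters with the gradients
optimizer.zero_grad()                                    # clear gradients for the next step
```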
NN Example: Word2Vec
Word2vec uses shallow neural networks that associate words to distributed representations.
Typical Models: Word2vec can utilize two architectures to produce distributed representations of words:
- Continuous bag-of-words (CBOW)
- Continuous skip-gram
Sliding Window:
Word2vec uses a sliding window of a fixed size moving along a sentence
In each window, the middle word is the target word, other words are the context words
- Given the context words, CBOW predicts the probabilities of the target word
- While given a target word, skip-gram predicts the probabilities of the context words
One direction is Context -> Word (CBOW), the other is Word -> Context (skip-gram).
Continuous Bag-of-Words
- In CBOW architecture, the model predicts the target word given a window of surrounding context words
- According to the bag-of-word assumption: The order of context words does not influence the prediction
Continuous Skip-Gram: In skip-gram architecture, the model predicts the context words from the target word
Problems of Full Softmax: When the vocabulary size is very large
- Softmax for all the words every step depends on a huge number of model parameters, which is computationally impractical
- We need to improve the computation efficiency
Improving Computational Efficiency
- In fact, we do not need a full probabilistic model in word2vec
- There are two main improvement methods for word2vec:
- Negative sampling
- As we discussed before, the vocabulary is very large, which means our model has a tremendous number of weights that need to be updated every step
- The idea of negative sampling is, to only update a small percentage of the weights every step
- Then we can compute the loss, and optimize the weights (not all of the weights) every step
- Suppose we have a weight matrix of size 300×10,000 and only 5 output words are involved in a step
- Then we only need to update 300×5 weights, which is only 0.05% of all the weights (see the sketch after this list)
- Hierarchical softmax
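A minimal sketch of skip-gram with negative sampling in PyTorch (hypothetical word ids and sizes); only the embedding rows of the target word, the true context word, and the sampled negatives receive gradients in a step:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim, num_neg = 10_000, 300, 5              # sizes follow the example above

in_embed = nn.Embedding(vocab_size, dim)               # target-word embeddings
out_embed = nn.Embedding(vocab_size, dim)              # context-word ("output") embeddings

target = torch.tensor([42])                            # hypothetical target word id
context = torch.tensor([128])                          # one true context word
negatives = torch.randint(0, vocab_size, (num_neg,))   # randomly sampled negative words

v = in_embed(target)                                   # (1, dim)
pos_score = (v * out_embed(context)).sum(-1)           # dot product with the true context word
neg_score = (v * out_embed(negatives)).sum(-1)         # dot products with the negative samples

# Maximize the positive score and minimize the negative ones; only 1 + num_neg
# rows of out_embed (plus one row of in_embed) are updated per step.
loss = -F.logsigmoid(pos_score).mean() - F.logsigmoid(-neg_score).mean()
loss.backward()
```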
Other Tips for Learning Word Embeddings
- Sub-sampling: balance the sampling probability of frequent and rare words.
- Soft sliding window: the window size is not fixed; a value is randomly sampled from a range and used as the window size.
RNN
- Key concept for RNNs: Sequential memory during processing sequence data
- Definition: a mechanism that makes it easier for your brain to recognize sequence patterns
- RNNs update the sequential memory recursively for modeling sequence data
- Application Scenarios
- Sequence Labeling
- Given a sentence, the lexical properties of each word are required
- Sequence Prediction
- Given the temperature for seven days a week, predict the weather conditions for each day
- Photograph Description
- Given a photograph, create a sentence that describes the photograph
- Text Classification
- Given a sentence, distinguish whether the sentence has a positive or negative emotion
- Advantages & Disadvantages
- Advantages
- Can process any length input
- Model size does not increase for longer input
- Weights are shared across timesteps
- Computation for step i can (in theory) use information from many steps back
- Disadvantages
- Recurrent computation is slow
- In practice, it’s difficult to access information from many steps back.
- Gradient vanish or explode
GRU
Introduce gating mechanism into RNN
- Update gate
- Reset gate
Gates are used to balance the influence of the past and the input
If the reset gate is close to 0, the previous hidden state is ignored, which indicates that the current activation is irrelevant to the past.
The update gate controls how much of the past state should matter compared to the current activation.
LSTM
Long Short-Term Memory network (LSTM). LSTM is a special kind of RNN, capable of learning long-term dependencies like GRU.
Cell state $c_t$:
- Extra vector for capturing long-term dependency
- Runs straight through the entire chain, with only some minor linear interactions
- Easy to remove or add information to the cell state
Steps:
- The first step is to decide what information to throw away from the cell state: forget gate
- The next step is to decide what information to store in the cell state
- Update the old cell state. Combine the results from the previous two steps.
- The final step is to decide what information to output -> Adjust the sentence information for a specific word representation.
Powerful especially when stacked and made even deeper (each hidden layer is already computed by a deep internal network). Very useful if you have plenty of data.
Bidirectional RNNs
In traditional RNNs, the state at time t only captures information from the past. Problem: in many applications, we want to have an output depending on the whole input sequence. E.g. handwriting recognition & speech recognition
Recurrent Neural Network
- Sequential Memory
- Gradient Problem for RNN
RNN Variants
- Gated Recurrent Unit (GRU)
- Long Short-Term Memory Network (LSTM)
- Bidirectional Recurrent Neural Network
CNN
- Convolutional Neural Networks
- Generally used in Computer Vision
- Achieve promising results in a variety of NLP tasks:
- Sentiment classification
- Relation classification
- CNNs are good at extracting local and position-invariant patterns
- CNNs extract patterns by:
- Computing representations for all possible n-gram phrases in a sentence.
- Without relying on external linguistic tools (e.g., dependency parser)
- Architecture: Input Layer -> Convolutional Layer -> Max-pooling Layer -> Non-linear Layer
- Input Layer: Transform words into input representations x via word embeddings
- Extract feature representation from input representation via a sliding convolving filter.
- Application Scenarios: Object Detection, Video Classification, Speech Recognition, Text Classification
- CNN vs RNN
- CNN:
- Extracting local and position-invariant features
- Less parameters
- Better parallelization within sentences
- RNN:
- Modeling long-range context dependency
- More parameters
- Cannot be parallelized within sentences
Pytorch Demo
Pipeline for Deep Learning: prepare data -> build model -> train model -> evaluate model -> test model
Context
- target: to predict next word
- input: never too old to learn
- output: too old to learn English
- model: LSTM
- loss: cross_entropy
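A minimal sketch of this demo (the vocabulary is built from the sentence itself; layer sizes and the training schedule are placeholders):

```python
import torch
import torch.nn as nn

sentence = "never too old to learn English".split()
vocab = {w: i for i, w in enumerate(sorted(set(sentence)))}
words = list(vocab)                                  # index -> word

# Each prefix predicts the next word: "never" -> "too", ..., "learn" -> "English".
inputs = torch.tensor([[vocab[w] for w in sentence[:-1]]])
targets = torch.tensor([[vocab[w] for w in sentence[1:]]])

class NextWordLSTM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.out(h)                           # (batch, seq_len, vocab)

model = NextWordLSTM(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(200):                              # train model
    logits = model(inputs)
    loss = nn.functional.cross_entropy(logits.view(-1, len(vocab)), targets.view(-1))
    optimizer.zero_grad(); loss.backward(); optimizer.step()

pred = model(inputs).argmax(-1)[0]                   # evaluate / test model
print(sentence[:-1], "->", [words[i] for i in pred.tolist()])
```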
L3 Transformer and PLM
- Transformer
- Attention Mechanism
- Transformer Structure
- Pretrained Language Models
- Language Modeling
- Pre-trained Language Models (PLMs)
- Fine-tuning Approaches
- PLMs after BERT
- Applications of Masked LM
- Frontiers of PLMs
- Transformers Tutorial
- Introduction
- Frequently-used APIs
- Quick Start
- Demo
Transformer
Attention Mechanism
The Bottleneck Problem
- The single vector of source sentence encoding needs to capture all information about the source sentence
- The single vector limits the representation capacity of the encoder: the information bottleneck
Attention
- Attention provides a solution to the bottleneck problem
- Core idea: at each step of the decoder, focus on a particular part of the source sequence
A more general definition of attention: Given a query vector and a set of value vectors, the attention technique computes a weighted sum of the values according to the query
Intuition:
- Based on the query, the weighted sum is a selective summary of the values.
- We can obtain a fixed-size representation of an arbitrary set of representations via the attention mechanism.
Attention Variants: Attention has a lot of variants.
Insights:
- Attention solves the bottleneck problem: The decoder could directly look at source
- Attention helps with vanishing gradient problem: By providing shortcuts to long-distance states
- Attention provides some interpretability:
- We can find out what the decoder was focusing on by the attention map:
- Attention allows the network to align relevant words
Transformer Structure
- Motivations
- Sequential computation in RNNs prevents parallelization
- Despite using GRU or LSTM, RNNs still need an attention mechanism, which provides access to any state
- Maybe we do not need RNNs? -> Attention is all you need
- Transformer
- Architecture: encoder-decoder
- Input: byte pair encoding + positional encoding
- Model: stack of several encoder/decoder blocks
- Output: probability of the translated word
- Loss function: standard cross-entropy loss over a softmax layer
Input
Byte Pair Encoding (BPE)
- A word segmentation algorithm
- Start with a vocabulary of characters
- Repeatedly turn the most frequent pair of adjacent units into a new unit (merge step)
Byte Pair Encoding (BPE)
- Solve the OOV (out of vocabulary) problem by encoding rare and unknown words as sequences of subword units
- For example, the OOV word “lowest” would be segmented into “low est”
- The relation between “low” and “lowest” can be generalized to “smart” and “smartest”
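A toy sketch of the merge loop (the corpus, the end-of-word marker "</w>", and the number of merges are made up for illustration):

```python
from collections import Counter

# Start with a vocabulary of characters; words end with an end-of-word marker.
words = [("l", "o", "w", "</w>")] * 5 + [("l", "o", "w", "e", "s", "t", "</w>")] * 2

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1]); i += 2    # turn the pair into a new unit
            else:
                out.append(w[i]); i += 1
        merged.append(tuple(out))
    return merged

for _ in range(3):                                     # run a few merge steps
    pair = most_frequent_pair(words)
    words = merge(words, pair)
    print(pair, words[0], words[-1])                   # frequent "low" merges; rare "lowest" stays split
```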
Positional Encoding
- Byte Pair Encoding (BPE): Dimension: d
- Positional Encoding (PE): needed because the Transformer block itself is not sensitive to position, i.e. the same word at different positions would otherwise look identical
Input = BPE + PE
Encoder Block
- Two sublayers
- Multi-Head Attention
- Feed-Forward Network (2-layer MLP)
- Two tricks
- Residual connection
- Layer normalization
- Changes input to have mean 0 and variance 1
- General Dot-Product Attention
- Inputs
- A query q and a set of key-value (k, v) pairs
- Queries and keys are vectors with dimension $d_k$
- Values are vectors with dimension $d_v$
- Output
- Weighted sum of values
- Weight of each value is computed by the dot product of the query and corresponding key
- stack multiple queries q in a matrix Q
- Scaled Dot-Product Attention
- Problem
- As the dot products grow large, the softmax gets very peaked and the gradients get smaller, so the model updates slowly
- Solution
- Scale the dot products by $\sqrt{d_k}$, the square root of the query/key dimension (see the sketch after this list)
- Self-attention
- Let the word vectors themselves select each other
- Q, K, V are derived from the stack of word vectors from a sentence
- Multi-head Attention
- Different head: same computation component & different parameters
- Concatenate all outputs and feed into the linear layer
- In each layer, Q, K, V are the same as the previous layer’s output
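A minimal sketch of scaled dot-product self-attention in PyTorch (single head, arbitrary sizes; multi-head attention runs several of these in parallel with different parameters and concatenates the outputs):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # scale so the softmax is not too peaked
    weights = F.softmax(scores, dim=-1)
    return weights @ V                                   # weighted sum of the values

# Self-attention: Q, K, V are all derived from the same stack of word vectors.
x = torch.randn(2, 10, 64)                               # (batch, sentence length, hidden size)
Wq, Wk, Wv = nn.Linear(64, 64), nn.Linear(64, 64), nn.Linear(64, 64)
out = scaled_dot_product_attention(Wq(x), Wk(x), Wv(x))
print(out.shape)                                         # torch.Size([2, 10, 64])
```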
Decoder Block
- Two changes:
- Masked self-attention: The word can only look at previous words
- Encoder-decoder attention: Queries come from the decoder while keys and values come from the encoder
- Blocks are also repeated 6 times
- Other tricks
- Checkpoint averaging
- ADAM optimizer
- Dropout during training at every layer just before adding residual
- Label smoothing
- Auto-regressive decoding with beam search and length penalties
- Multi-head Demo
Summary of Transformer
- Advantage:
- The Transformer is a powerful model and proven to be effective in many NLP tasks
- The Transformer is suitable for parallelization
- It proves the effectiveness of the attention mechanism
- It also gives insights to recent NLP advancements such as BERT and GPT
- Disadvantage:
- The architecture is hard to optimize and sensitive to model modifications
- $O(n^2)$ per-layer complexity makes it hard to use on extremely long documents (the max length is usually set to 512)
PLM
- Language Modeling
- Pre-trained Language Models (PLMs)
- Fine-tuning Approaches
- GPT and BERT
- PLMs after BERT
- Applications of Masked LM
- Cross-lingual and Cross-modal LM Pre-training
- Frontiers of PLMs
- GPT-3, T5 and MoE
LM
- Language Modeling is the task of predicting the upcoming word
- Language Modeling: the most basic and important NLP task
- Contain a variety of knowledge for language understanding, e.g., linguistic knowledge and factual knowledge
- Only require the plain text without any human annotations
- The language knowledge learned by language models can be transferred to other NLP tasks easily
- There are three representative models for transfer learning of NLP
- Word2vec
- Pre-trained RNN
- GPT&BERT
PLM
- We have mentioned several PLMs in the last section: word2vec, GPT, BERT, …
- PLMs: language models having powerful transferability for other NLP tasks
- word2vec is the first PLM
- Nowadays, the PLMs based on Transformers are very popular (e.g. BERT)
- Two Mainstreams of PLMs
- Feature-based approaches
- The most representative model of feature-based approaches is word2vec
- Use the outputs of PLMs as the inputs of our downstream models
- Fine-tuning approaches
- The most representative model of fine-tuning approaches is BERT.
- The language models will also be the downstream models and their parameters will be updated
GPT
GPT-1:
Inspired by the success of Transformers in different NLP tasks, GPT is the first work to pre-train a PLM based on Transformer
Transformer + left-to-right LM
Fine-tuned on downstream tasks
GPT-2:
- A huge Transformer LM
- Trained on 40GB of text
- SOTA perplexities on datasets it’s not even trained on
More than LM
- Zero-Shot Learning: Ask LM to generate from a prompt
- Reading Comprehension
- Summarization
- Question Answering
A very powerful generative model
Also achieve very good transfer learning results on downstream tasks
- Outperform ELMo significantly
The key to success
- Big data (Large unsupervised corpus)
- Deep neural model (Transformer)
BERT
Problem: Language models only use left context or right context, but language understanding is bidirectional
Why are LMs unidirectional
- Reason 1: Directionality is needed to generate a well-formed probability distribution
- Reason 2: Words can “see themselves” in a bidirectional encoder
Unidirectional vs. Bidirectional Models
- Unidirectional context: Build representation incrementally
- Bidirectional context: Words can “see themselves”
Solution: Mask out k% of the input words, and then predict the masked words. k=15% in BERT
- Too little masking: too expensive to train
- Too much masking: not enough context
Masked LM
- Problem: [Mask] token never seen at fine-tuning
- Solution: 15% of the words to predict
- 80% of the time, replace with [MASK]
- went to the store → went to the [MASK]
- 10% of the time, replace with a random word
- went to the store → went to the running
- 10% of the time, keep the word unchanged: went to the store → went to the store (a small sketch follows)
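A small sketch of this 80/10/10 masking rule (token strings instead of ids, and a made-up mini-vocabulary, to keep it readable):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: pick ~15% of tokens as prediction targets, then replace
    80% of them with [MASK], 10% with a random word, and keep 10% unchanged."""
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                       # the model must predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)   # random word, e.g. "running"
            # else: keep the original token
    return masked, targets

tokens = "went to the store".split()
print(mask_tokens(tokens, vocab=["running", "apple", "store"]))
```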
Next Sentence Prediction
- To learn relationships between sentences, predict whether Sentence B is the actual sentence that follows Sentence A, or just a random sentence
- Input Representation
- Use 30,000 WordPiece vocabulary on input.
- Each token is the sum of three embeddings
- Single sequence is much more efficient.
Effect of Pre-training Task:
- Masked LM (compared to left-to-right LM) is very important on some tasks
- Next Sentence Prediction is important for other tasks
Effect of Model Size
- Big models help a lot
- Going from 110M -> 340M params helps even on datasets with 3,600 labeled examples
Empirical results from BERT are great, but biggest impact on the field is: With pre-training, bigger == better, without clear limits (so far)
Excellent performance for researchers and companies building NLP systems
Summary
Feature-based approaches transfer the contextualized word embeddings for downstream tasks
Fine-tuning approaches transfer the whole model for downstream tasks
Experimental results show that fine-tuning approaches are better than feature-based approaches
Hence, current research mainly focuses on fine-tuning approaches
Is BERT really perfect?
- Any optimized pre-training paradigm?
- The gap between pre-training and fine-tuning
- [MASK] token will not appear in fine-tuning
- The efficiency of Masked Language Model
- Only predict 15% words
RoBERTa
- Explore several pre-training approaches for a more robust BERT
- Dynamic Masking
- Model Input Format
- Next Sentence Prediction
- Training with Large Batches
- Text Encoding
- Massive experiments
ELECTRA
- Recall: the efficiency of bi-directional pre-training
- Masked LM: 15% prediction
- Permutation LM: 1/6~1/7 prediction
- Traditional LM: 100% prediction
- Single direction
- Replaced Token Detection
- A new bi-directional pre-training task
- 100% prediction
MLM
- Basic idea: to use bi-direction information to predict the target token
- Beyond token: use multi-modal or multi-lingual information together by masking
- Input the objects from different domains together and predict the target object based on the input objects
Cross-lingual LM Pre-training
- Translation Language Modeling (TLM)
- The TLM objective extends MLM to pairs of parallel sentences (e.g., English-French)
- To predict a masked English word, the model can attend to both the English sentence and its French translation, and is encouraged to align English and French representations.
- The translation language modeling (TLM) objective improves cross-lingual language model pretraining by leveraging parallel data
Cross-Modal LM Pre-training
- Pairs of videos and texts from automatic speech recognition (ASR)
- Generate a sequence of “visual words” by applying hierarchical vector quantization to features derived from the video using a pre-trained model
- Encourages the model to focus on high-level semantics and longer-range temporal dynamics in the video
Summary
- Masked LM inspired a variety of new pre-training tasks
- What’s your idea about transferring Masked LM?
Frontiers
GPT-3
A super large-scale PLM
Excellent few-shot/in-context learning ability
GPT-3: Doesn’t know when to say “I do not know”
T5
Reframe all NLP tasks into a unified text-to-text format where the input and output are always text strings
Encoder-decoder architecture
Larger Model with MoE
- Enhance encoder-decoder with MoE (Mixture of Experts) for billions of parameters
- GShard: 600B parameters
- Switch Transformer: 1,571B parameters
Summary
- The technique of PLMs is very important for NLP (from word2vec to BERT).
- Fine-tuning approaches are widely used after BERT.
- The idea of Masked LM inspired the research on unsupervised learning.
- Consider PLMs first when you plan to construct a new NLP system.
Transformers Tutorial
Introduction
- Various pre-trained language models are being proposed
- Is there any package that helps us:
- Reproduce the results easily
- Deploy the models quickly
- Customize your models freely
Hugging Face:
- Transformers is a package:
- Providing thousands of models
- Supporting PyTorch, TensorFlow, JAX
- Hosting pre-trained models for text, audio and vision
- Fairly easy to use. Low barrier to entry for researchers.
- Almost all the research on pre-trained models is built on Transformers!
Pipeline
- I want to directly use the off-the-shelf model on down-stream tasks -> Use pipeline!
Pipeline automatically uses a fine-tuned model and performs the downstream task.
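A minimal sketch of the pipeline API (the task name "sentiment-analysis" makes the library download a default fine-tuned model on first use; other supported task names work the same way):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # downloads a model fine-tuned for the task
print(classifier("I love this movie. Overall it was a fantastic movie."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```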
Tokenization
- Pre-trained language models have different tokenization
- BPE (Byte-Pair Encoding): GPT, Roberta, …
- WordPiece: BERT, Electra, …
- SentencePiece: ALBERT, T5, …
The tokenizer automatically uses the tokenization strategy of the given model to tokenize your text.
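A minimal sketch with AutoTokenizer (the exact subword split depends on the checkpoint's vocabulary):

```python
from transformers import AutoTokenizer

# BERT checkpoints ship a WordPiece tokenizer; GPT-2/RoBERTa would give BPE instead.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Tokenization strategies differ across models."))
# rare words come out as subword pieces, e.g. ['token', '##ization', ...]
```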
Frequently-used APIs
- Load the pre-trained models in a few lines
- Tokenize the texts
- Run the model
- Save the fine-tuned model in one line
from_pretrained can also load the saved fine-tuned model back.
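A combined sketch of these steps (the checkpoint name, text, and output directory are placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained model and its tokenizer in a few lines.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tokenize the texts.
inputs = tokenizer("I love this movie.", return_tensors="pt")

# Run the model.
with torch.no_grad():
    logits = model(**inputs).logits

# Save the fine-tuned model in one line; from_pretrained can load it back later.
model.save_pretrained("./my-finetuned-bert")
```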
- Train the model with Trainer
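A minimal Trainer sketch; it assumes the `model` loaded above plus already-tokenized `train_dataset` / `eval_dataset` objects, which are not shown here:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./results",            # where checkpoints are written
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
trainer = Trainer(
    model=model,                       # the model loaded above
    args=args,
    train_dataset=train_dataset,       # assumed: a tokenized training set
    eval_dataset=eval_dataset,         # assumed: a tokenized validation set
)
trainer.train()
```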
Demo
- We have provided a demo, which fine-tunes BERT for sentiment analysis task.
- You will be able to use Transformers after going through this demo.
- See https://colab.research.google.com/drive/1tcDiyHIKgEJp4TzGbGp27HYbdFWGolU_?usp=sharing, video is: https://www.bilibili.com/video/BV1UG411p7zv?p=40
L4 Prompt Delta
- Background & Overview
- Prompt-learning
- Template
- Verbalizer
- Learning Strategy
- Applications
- Delta Tuning
- Addition-based Methods
- Specification-based Methods
- Reparameterization-based Methods
- Advanced Topics
- OpenPrompt
- OpenDelta
- Pre-trained Language Models are Infrastructure in NLP. There are Plenty of NLP tasks. How to adapt PLMs to them?
Fine Tuning
Example: BERT
- Token representations for sequence tagging
- [CLS] for text classification
- Feed appropriate representations to output layers
Example: Relation Extraction
- Extract the relation between two marked entities
Example: GPT
- Feed the last hidden state to a linear output layer
Example: T5
- Encoder-decoder with 11 billion parameters
- Cast tasks to seq2seq manner with simple demonstrations
- A decoder is trained to output the desired tokens
Instead of separate task-specific classifiers for different downstream tasks, everything is cast as a seq2seq task plus a few suitable demonstrations and labels.
- When it Comes to GPT-3
- Huge model with 175 billion parameters
- No parameters are updated at all
- Descriptions (Prompts) + Few-shot examples to generate tokens
The model is not fine-tuned at all; this is where the concept of prompts was first proposed: in-context learning with few-shot/zero-shot examples, using prompts to let the big model adapt or learn.
An Irreversible Trend: Model Scaling. Larger PLMs Tend to Lead to Better Performance.
- Better natural language understanding capability
- Better quality for natural language generation
- Better capacity to continually learn novel knowledge
An Irreversible Trend: Difficult Tuning. How to Adapt Large-scale PLMs?
- A Predominant Way — Fine-tuning
- Prohibitive Computing: updating all the parameters;
- Prohibitive Storage: retaining separate instances for different tasks;
- Poor generalization when supervision is insufficient
- Results in scarce use of large-scale PLMs in research
Advanced Model Adaptation, Effective Model Adaptation.
- Task&Data-wise: Use prompt-learning to enhance the few-shot learning capability by bridging the gap between model tuning and pre-training.
- Optimization-wise: Use delta tuning to stimulate models with billions of parameters with optimization of a small portion of parameters.
Prompt-learning
Fine-tuning
- Use PLMs as base encoders
- Add additional neural layers for specific tasks
- Tune all the parameters
- There is a GAP between pre-training and fine-tuning
Prompt-learning
- Use PLMs as base encoders
- Add additional context (template) with a [MASK] position
- Project labels to label words (verbalizer)
- Bridge the GAP between pre-training and fine-tuning
Prompt-learning bridges the gap between pre-training and fine-tuning: the prompt reformulates the downstream task in the same form as pre-training.
- Sentiment Classification
- Prompting with a Template
- Input: x = “I love this movie”
- Template: [x] Overall, it was a [z] movie
- Prompting: x’ = “I love this movie. Overall it was a [z] movie.”
- Predict an answer
- Predicting: x’ = “I love this movie. Overall it was a fantastic movie.”
- Map the answer to a class label with a Verbalizer
- Mapping: fantastic = Positive
- Prompt-learning: Considerations
- Pre-trained Model
- Auto-regressive (GPT-1, GPT-2, GPT-3; OPT…)
- Masked Language Modeling (BERT, RoBERTa, DeBERTa)
- Encoder-Decoder (T5, BART)
- Template
- Manually Design
- Auto Generation
- Textual or Continuous…
- Verbalizer
- Manually Design
- Expanding by external knowledge…
PTM Selection
- Auto-regressive (GPT-1, GPT-2, GPT-3; OPT…) -> Decoder
- Suitable for super-large pre-trained models
- Autoregressive Prompt
Good at generation
- Masked Language Modeling (BERT, RoBERTa, DeBERTa) -> Encoder
- Suitable for natural language understanding (NLU)
- Cloze-style Prompt
Good at NLU
- Encoder-Decoder (T5, BART) -> Encoder + Decoder
- Bidirectional attention for encoder
- Autoregressive for decoder
General-purpose; both kinds of prompts work.
Template
Template Construction
- Manually Design based on the characteristics of the task
- Auto Generation with search or optimization
- Textual or Continuous
- Structured, incorporating with rules
Template: Extract World Knowledge
- Copy the entity in the Template
- Predict fine-grained entity types
- Extract world knowledge
Template: Incorporating Rules and Logic
- Prompt-learning with logic-enhanced templates
Structured Template
- Key-value Pairs for all the prompts
- Organize different tasks to a structured format
Ensembling Templates
- Use multiple different prompts for an input instance
- Alleviate the cost of prompt engineering
- Stabilize performance on tasks
Methods
- Uniform Averaging
- Weighted Averaging
Template: Automatic Search
- Gradient-based search of prompts based on existing words
- Use a encoder-decoder model to generate prompts
Essentially, a prompt is just tokens, so complex prompts that make no sense to humans may still work better than human-designed ones.
Perhaps we could train a model to produce better prompts/templates than humans, or prompts that are both human-readable and effective.
Optimization of Continuous Prompts
- Generative models for NLU by optimizing continuous prompts
- P-tuning v1: prompts to the input layer (with Reparameterization)
- P-tuning v2: prompts to every layer (like prefix-tuning)
Performance of Prompt-learning
- Extraordinary few-shot learning performance
- Huge impact from the templates
Verbalizer
Verbalizer
Mapping: Answer -> Unfixed Labels
Tokens: One or more tokens in the pre-trained language model vocabulary
Chunks: Chunks of words made up of more than one token
Sentence: Sentences in arbitrary length
Construction
- Hand-crafted
- Auto-generation
Verbalizer Construction
- Manually design with human prior knowledge
- Start with an initial label word, paraphrase & expand
- Start with an initial label word, use external knowledge & expand
- Decompose the label with multiple tokens
- Virtual token and optimize the label embedding
Knowledgeable Prompting
- Label -> Words
- Use External Knowledge to expand the label words
Virtual Tokens as Label Words
- Project the hidden states of [MASK] tokens to the embedding space and learn prototypes
- The learned prototypes constitute the verbalizer and map the PLM outputs to corresponding labels.
Learning Strategy
- The Evolvement
- Traditional: Learning from scratch;
- After BERT: Pre-training-then-fine-tuning;
- T5: Pre-training-then-fine-tuning with text-to-text format;
- GPT: Pre-training, then use prompt & in-context for zero- and few- shot;
- Prompt-learning Introduces New Learning Strategies
- Pre-training, prompting, optimizing all the parameters (middle-size models, few-shot setting)
- Pre-training, adding soft prompts, freezing the model and optimizing the prompt embeddings (delta tuning perspective)
- Pre-training with prompted data, zero-shot inference (Instruction tuning& T0)
- Prompt-Tuning
- Injecting soft prompts (embeddings) to the input layer
- Extraordinary power of scale
- Comparable results to fine-tuning conditioned on 11B PLM
- Essentially a parameter efficient (delta tuning) method
Delta tuning departs from the original intuition of fine-tuning: a small set of parameters drives a big model.
Prompting Prompt Tuning
- Injecting Prompts to Pre-training.
- Full data: fine-tuning and prompt tuning are comparable.
- Few data: tuning only the prompts performs poorly.
- Vanilla prompt tuning cannot generalize effectively in low-data situations.
- Injecting soft prompts into pre-training improves the generalization of prompt tuning
Fine-tuning with Prompted Data
- Multi-task Pre-training with Hand-crafted Prompts
- Fine-tuning a 130B PLM with prompts on 60 tasks
- Substantially improves the zero-shot capability
- Use manually written prompts to train an encoder-decoder model
- Zero-shot generalization on unseen tasks
Applications
Biomedical Prompt-learning: Prompt-learning can support Clinical Decision
- Big models in the general domain (like GPT-3) can’t perform well in specific domains like biomedicine
- Prompt-learning shows significant effectiveness
Cross-Modality Prompt-learning: Cross-Modal Prompt-learning
- Create colorful frames in images
- Add color-wise textual prompts to input data
Summary: Prompt-learning
- A comprehensive framework that considers PLMs, downstream tasks, and human prior knowledge
- The design of Template & Verbalizer is crucial
- Prompt-learning has promising performance in the low-data regime, but high variance with respect to the selection of templates
- Prompt-learning has broad applications
Delta-Tuning
How to Adapt Large-scale PLMs?
- An Efficient Way — Delta Tuning
- Only updating a small amount of parameters of PLMs
- Keeping the parameters of the PLM fixed
Why Does Parameter-Efficient Tuning Work?
- In the Past Era
- Parameter-efficient learning could not be realized in the past
- Because all the parameters are randomly initialized
- With Pre-training
- Pre-training can learn Universal Knowledge
- Adaptation of downstream
- Imposing universal knowledge to specific tasks
Delta Tuning: Parameter Efficient Model Tuning
- Addition-based methods introduce extra trainable neural modules or parameters that do not exist in the original model;
- Specification-based methods specify that certain parameters in the original model or process become trainable, while others are frozen;
- Reparameterization-based methods reparameterize existing parameters to a parameter-efficient form by transformation.
Addition-based
- Adapter
- Adapter-Tuning
- Injecting small neural modules (adapters) into Transformer Layer
- Only fine-tuning adapters and keeping other parameters frozen
- Adapters are down-projection and up-projection
- Tunable parameters: 0.5%~8% of the whole model
- Move the Adapter Out of the Backbone
- Bridge a ladder outside the backbone model
- Save computation of backpropagation
- Save memory by shrinking the hidden size
- Prefix-Tuning
- Inject prefixes (soft prompts) to each layer of the Transformer
- Only optimizing the prefixes of the model
- Prompt-Tuning
- Injecting soft prompts (embeddings) only to the input layer
- Extraordinary power of scale
- Comparable results to fine-tuning conditioned on 11B PLM
Specification-based
- BitFit
- A simple strategy: only updating the bias terms
- Comparable performance of full fine tuning
Reparameterization-based
Intrinsic Prompt Tuning
Hypothesis: the optimization process can essentially be carried out in a low-dimensional space.
The model tuning is mapped into a low-dimensional subspace
89% of the full-parameter fine-tuning performance can be achieved in a subspace as low as 5 dimensions across 120 NLP tasks
Manipulate NLP in Low-dimension Space
- Essentially low-rank: factorize the weight update, e.g. decompose a 1000 × 1000 matrix into 1000 × 2 and 2 × 1000 matrices.
- LoRA: Low-Rank Adaptation
- Freeze the model weights
- Injects trainable rank-decomposition matrices to each Transformer layer
- LoRA tunes 4.7 million parameters out of the 175 billion parameters of the GPT-3 model (see the sketch below)
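A minimal sketch of the LoRA idea on a single linear layer (rank, scaling, and initialization follow the common recipe but are illustrative, not the exact paper settings):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA layer: the pre-trained weight is frozen and the update
    is re-parameterized as a low-rank product B @ A (rank r << hidden size)."""
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():           # freeze the original weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(linear.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.linear(x) + self.scale * (x @ self.A.T @ self.B.T)

# A 1000 x 1000 frozen weight; only the 2 x 1000 and 1000 x 2 factors are trainable.
layer = LoRALinear(nn.Linear(1000, 1000), r=2)
```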
Connections
The Reparameterization-based Methods Are Connected
- Based on similar hypothesis
- The optimization process could be transformed to a parameter efficient version
A Unified View
- Adapter, Prefix Tuning and LoRA could be connected
- Function form
- Insertion form
- Modified Representation
- Composition Function
- Adapter, Prefix Tuning, and LoRA could be connected in form
- New variants could be derived under this framework
A unified view -> derive more new and more general methods
- Deep Analysis of Delta Tuning
- Theoretical Analysis
- From optimization
- Low-dimensional representation in solution space
- Low dimensional representation in functional space
- From optimal control
- Seek the optimal controller
- A Rigorous Comparison of Performance
- Experiments on 100+ NLP tasks
- Delta tuning gains no absolute advantage; fine-tuning is still the best model tuning method;
- Power of Scale: The power of scale is observed in all the methods, even random tuning
- Combination of different delta tuning methods
- Implies the existence of Optimal Structure which is not defined manually
- Automatically search the structure
- 1/10000 parameters could work
- Transferability
- Delta Tuning shows non-trivial task-level transferability
- Implies the possibility to construct a sharing platform
- Efficient Tuning with low GPU RAM
- Tune T5-large on 11G single GPU (Nvidia 1080Ti, 2080, etc.)
- Tune T5-3b on 24G single GPU (Nvidia 3090 and V100)
- Tune T5-11b on 40G single GPU (Nvidia A100, with BMTrain)
- Summary
- Delta tuning could effectively work on super-large models -> Optimizing only a small portion of parameters could stimulate big models.
- The structure may become less important as the model scales up
- What’s NeXT?
Further Reading
- Paper List
- PromptPapers: https://github.com/thunlp/PromptPapers
- DeltaPapers: https://github.com/thunlp/DeltaPapers
- Programming Toolkit
- OpenPrompt: https://github.com/thunlp/OpenPrompt
- OpenDelta: https://github.com/thunlp/OpenDelta
OpenPrompt
Please see the video first.
API design
- Modularity
- Flexibility
- Uniformity
How to use OpenPrompt: https://github.com/thunlp/OpenPrompt (a condensed code sketch follows the steps below)
- Step 1: Define a task
- Think about what your data looks like and what you want from the data!
- Step 2: Obtain a PLM
- Choose a PLM to support your task;
- Different models have different attributes;
- Essentially obtain a modeling strategy with pre-trained tasks;
- Many PLMs are supported, with more coming…
- Step 3: Define a Template: A Template is a modifier of the original input text, which is also one of the most important modules in prompt-learning.
- Step 4: Define a Verbalizer (optional): A Verbalizer projects the original labels to a set of label words.
- Step 5: Define a PromptModel
- A PromptModel is responsible for training and inference
- It defines the (complex) interactions of mentioned modules
- Step 6: Train and Inference
- Train and evaluate the PromptModel in PyTorch fashion
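A condensed sketch of these six steps, following the OpenPrompt README; class and argument names may differ slightly between versions, and the checkpoint and label words are placeholders:

```python
from openprompt.data_utils import InputExample
from openprompt.plms import load_plm
from openprompt.prompts import ManualTemplate, ManualVerbalizer
from openprompt import PromptForClassification, PromptDataLoader

# Step 1: define the task as InputExample instances (one unlabeled example here).
dataset = [InputExample(guid=0, text_a="I love this movie.")]

# Step 2: obtain a PLM.
plm, tokenizer, model_config, WrapperClass = load_plm("bert", "bert-base-cased")

# Step 3: define a Template (the {"mask"} slot is what the PLM fills in).
template = ManualTemplate(
    tokenizer=tokenizer,
    text='{"placeholder":"text_a"} Overall, it was a {"mask"} movie.',
)

# Step 4: define a Verbalizer mapping label words to classes.
verbalizer = ManualVerbalizer(tokenizer, num_classes=2,
                              label_words=[["terrible"], ["fantastic"]])

# Step 5: define a PromptModel.
prompt_model = PromptForClassification(plm=plm, template=template, verbalizer=verbalizer)

# Step 6: train / run inference in the usual PyTorch fashion.
loader = PromptDataLoader(dataset=dataset, template=template, tokenizer=tokenizer,
                          tokenizer_wrapper_class=WrapperClass)
for batch in loader:
    logits = prompt_model(batch)
```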
Mixed Template
- Basic hard and soft template
- Incorporation of meta information
- Soft template initialized with textual tokens
- Post-processing
- Fast token duplication
Generation Verbalizer
- Label words defined as part of the input -> works in a similar fashion to the mixed template
- Especially powerful in transforming ALL NLP tasks to generation tasks
Newly Designed Template Language - Mixed Template -> write templates in a flexible way
Implement All Kinds of Prompt-Learning Pipelines
- Modify separate modules and create new methods
- Apply existing methods to other scenarios
1.7k stars for our Github repository
Along with 2.0k stars for referenced paper list
OpenDelta
Please see the video first.
OpenDelta: Toolkit for Delta Tuning
- Clean: No need to edit the backbone PTM’s code.
- Simple: Migrating from full-model tuning to delta tuning needs as little as 3 lines of code (see the sketch below).
- Sustainable: Evolution in external libraries doesn’t require updates.
- Extendable: Various PTMs can share the same delta tuning code.
- Flexible: Able to apply delta tuning to (almost) any position.
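A sketch of the advertised few-line migration, following the OpenDelta README; the exact class and argument names (AdapterModel, freeze_module, log) may vary across versions, and the backbone checkpoint is a placeholder:

```python
from transformers import AutoModelForSequenceClassification
from opendelta import AdapterModel   # LoraModel, BitFitModel, ... work the same way

backbone = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Attach delta modules without touching the backbone code, then freeze everything else.
delta_model = AdapterModel(backbone_model=backbone)
delta_model.freeze_module(exclude=["deltas", "classifier"])
delta_model.log()   # visualize which parameters remain trainable
```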
Apply OpenDelta to Various Models
- Supported models
Adapter Hub
- Need to modify the backbone code.
- Need reimplementation for EVERY PTM.
- Code frozen at transformers version 4.12
- Need constant update to suit Huggingface’s update (to suit new feature)
- Can only apply Adapter under existing mode (e.g. not supporting adding adapters to a fraction of layers or other places in the model)
How do we achieve it?
- Key based addressing: Find the module according to the module/parameter key.
- Three modification operations can cover most delta tuning:
- Replace, Insert after, Insert before.
- The modified model will have the same doc & I/O & address & Signature etc. to the original model.
- Create pseudo data to automatically determine the parameter size of delta models.
How do we achieve it?
- Altering the flow of tensors.
- Use a wrapper function to wrap the original forward function to let the tensor pass the delta models as well.
More than aggregating delta models …
- Visualize the parameters’ location in the PTM.
- Insert delta modules in arbitrary layers.
- Delta center to save fine-tuned delta models
AutoDelta Feature
- Automatically load and define delta modules from a configuration
- Automatically load and define delta modules from pre-trained checkpoints
Multitask Serving
Collaboration
- Collaboration of OpenDelta & OpenPrompt
- OpenDelta is a toolkit for Delta Tuning
- Collaborated with OpenDelta, there is a loop to efficiently stimulate LMs
Demos also exist on Github
L5 BMSystem
BMTrain
- CPU vs. GPU
- CPU: small number of large cores.
- GPU: large number of small cores.
- GPU Memory component
- Parameter
- Gradient
- Intermediate
- The input of each Linear Module needs to be saved for backward
- Each with Shape [Batch, SeqLen, Dim]
- Optimizer:
- Commonly used Adam Optimizer needs to store extra states.
- The number of states is greater than 2 times the number of parameters.
Collective Communication.
- Broadcast: Send data from one GPU to other GPUs
- Reduce: Reduce (Sum/Average) data of all GPUs, send to one GPU.
- All Reduce: Reduce (Sum/Average) data of all GPUs, send to all GPUs.
- Reduce Scatter: Reduce (Sum/Average) data of all GPUs, send portions to all GPUs.
- All Gather: Gather data of all GPUs, send to all GPUs.
Methods:
- Data Parallel
- Model Parallel
- ZeRO
- Pipeline Parallel
Data Parallel
- There is a parameter server.
- Forward:
- The parameter is replicated on each device.
- Each replica handles a portion of the input.
- Backward:
- Gradients from each replica are averaged.
- Averaged gradients are used to update the parameter server.
Distributed Data Parallel
- There is no parameter server.
- Forward:
- Each replica handles a portion of the input.
- Backward:
- Gradients from each replica are averaged using All Reduce.
- Each replica owns the optimizer and updates parameters itself.
- Since gradients are shared, parameters are synced.
The input of each Linear Module needs to be saved for backward. Each with Shape:
- Without Data Parallel [Batch, Len, Dim]
- With Data Parallel -> [Batch/n, Len, Dim]
Batch/n >= 1
Model Parallel
- Partition the matrix parameter into sub-matrices.
- Sub-matrices are separated into different GPUs.
- Each GPU handles the same input.
Intermediates are not partitioned.
ZeRO
Zero Redundancy Optimizer
ZeRO-Stage 1:
- Each replica handles a portion of the input.
- Forward
- Backward
- Average all gradients using Reduce Scatter
- Each replica owns part of optimizer & update part of params
- Updated parameter are synced using All Gather
ZeRO-Stage 2:
- Each replica handles a portion of the input.
- Forward.
- Backward (Average gradients using Reduce Scatter).
- Each replica owns part of the optimizer & updates part of the params.
- Updated parameter are synced using All Gather.
ZeRO-Stage 3
- Each replica handles a portion of the input.
- Forward (Share parameters using All Gather).
- Backward (Average gradients using Reduce Scatter).
- Each replica owns part of the optimizer & updates part of the params.
Pipeline Parallel
- Transformer are partitioned layer by layer.
- Different layers are put on different GPUs.
- Forward : Layer i -> Layer i+1
- Backward: Layer i -> Layer i-1
Techniques
- Mixed precision
- Offloading
- Overlapping
- Checkpointing
Mixed Precision
FP32: 1.18e-38 ~ 3.40e38 with 6–9 significant decimal digits of precision. FP16: 6.10e-5 ~ 65504 with 4 significant decimal digits of precision.
Advantages:
- Math operations run much faster.
- Math operations run even faster with Tensor Core support.
- Data transfer operations require less memory bandwidth.
- Smaller range but not overflow.
Disadvantages:
- Weight update ≈ gradient × lr; with FP16’s smaller range the update easily underflows.
Keep a master copy of FP32 parameters in the optimizer.
During training, an extra FP32 copy of the parameters is kept and updates are accumulated there; the accumulated result is then written back to FP16. For later inference, FP16 alone can be used, which is faster (see the sketch below).
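A minimal sketch using PyTorch's automatic mixed precision, which follows the same recipe (FP16 compute, FP32 master weights, loss scaling against underflow); `model`, `loss_fn`, `optimizer`, and `loader` are assumed to exist:

```python
import torch

scaler = torch.cuda.amp.GradScaler()          # scales the loss so FP16 gradients don't underflow

for inputs, labels in loader:                 # assumed: model, loss_fn, optimizer, loader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # run the forward pass in FP16 where it is safe
        loss = loss_fn(model(inputs.cuda()), labels.cuda())
    scaler.scale(loss).backward()
    scaler.step(optimizer)                    # unscale, then update the FP32 master weights
    scaler.update()
```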
Offloading
- Bind each GPU with multiple CPUs.
- Offload the partitioned optimizer states to CPU.
- Send Gradients from GPU to CPU.
- Update optimizer states on CPU (using OpenMP + SIMD).
- Send back updated parameters from CPU to GPU.
Overlapping
- Memory operations are asynchronous.
- Thus, we can overlap Memory operations with Calculations.
Checkpointing
- Forward:
- Some hidden states (checkpoint) are reserved.
- All other intermediate results are immediately freed.
- Backward:
- Freed intermediates are recomputed.
- And released again after obtaining gradient states.
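A minimal sketch with torch.utils.checkpoint (toy layer sizes); each checkpointed segment frees its intermediate activations and recomputes them during the backward pass:

```python
import torch
from torch.utils.checkpoint import checkpoint

layers = torch.nn.ModuleList([torch.nn.Linear(256, 256) for _ in range(4)])
x = torch.randn(8, 256, requires_grad=True)

h = x
for layer in layers:
    # Only the segment inputs are kept; inner activations are recomputed in backward.
    h = checkpoint(layer, h, use_reentrant=False)
h.sum().backward()
```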
Performance
Speedup with simple code replacement.
BMCook
The model size of PLMs has been growing at a rate of about 10x per year
Huge Computational Cost: The growing size comes with huge computational overhead
- Limits the application of large PLMs in real-world scenarios
- Leads to large carbon emissions
Towards Efficient PLMs
- Model Compression: Compress big models to small ones to meet the demand of real-world scenarios
- Existing Methods
- Knowledge Distillation
- Model Quantization
- Model Pruning
Knowledge Distillation
- Proposed by Hinton at the NIPS 2014 Deep Learning Workshop
- Problem of Ensemble Model
- Cumbersome and may be too computationally expensive
- Similar to current PLMs
- Solution
- The knowledge acquired by a large ensemble of models can be transferred to a single small model
- We call “distillation” to transfer the knowledge from the cumbersome model to a small model that is more suitable for deployment.
- What is knowledge: In a more abstract view, knowledge is a learned mapping from input vectors to output vectors.
- Soft targets provide more information than gold labels.
- Key research question: how to build more soft targets -> Previous methods only use the output from the last layer
- Learn from multiple intermediate layers of the teacher model
- Mean-square loss between the normalized hidden states
- Learn from multiple intermediate layers
- Learn from the embedding layer and output layer
- Learn from attention matrices
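A minimal sketch of the classic soft-target distillation loss (the temperature and mixing weight are illustrative defaults, not values from a specific paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft targets from the teacher (softened by temperature T)
    plus the usual cross-entropy on the gold labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```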
Model Pruning
- Remove the redundant parts of the parameter matrix according to their important scores
- Unstructured pruning and structured pruning
- Weight pruning (unstructured)
- 30-40% of the weights can be discarded without affecting BERT’s universality (prune pre-train)
- Fine-tuning on downstream tasks does not change the nature (prune downstream)
- Attention head pruning (structured)
- Ablating one head
- Define the importance scores of attention heads
- Iteratively prune heads on different models
- Layer pruning (structured)
- Extend dropout from weights to layers
- Training: randomly drop layers
- Test: Select sub-networks with any desired depth
Model Quantization
- Reduce the number of bits used to represent a value -> Floating point representation -> Fixed point representation
- Three steps: 1. Linear scaling 2. Quantize 3. Scaling back
- Models with different precisions -> Extreme quantization (1 bit) is difficult
- Loss landscapes are sharper
- Train a half-sized ternary model
- Initialize a binary model with the ternary model by weight splitting
- Fine-tune the binary model
Other Methods
Weight Sharing
- ALBERT: Two parameter reduction techniques
- Decompose the large vocabulary embedding matrix into two small matrices
- Cross-layer parameter sharing
Low-rank Approximation
- Low-rank Approximation
- Difficult to directly conduct low-rank approximation
- View more at: Here
Architecture Search
- Is the architecture of Transformer perfect?
- Neural architecture search based on Transformer
- Pre-define several simple modules
- Training several hours with each architecture
- Two effective modifications
- Multi-DConv-Head Attention (MDHA)
- Squared ReLU in Feed Forward Block
- Primer learns faster and better
Summary
- Large-scale PLMs are extremely over-parameterized
- Several methods to improve model efficiency
- Knowledge Distillation
- Model Pruning
- Model Quantization
- …
- Our model compression toolkit: BMCook -> Includes these methods for extreme acceleration of big models
Usage Intro
Github link is Here
Compared to existing compression toolkits, BMCook supports all mainstream acceleration methods for PLMs
Implement different compression methods with just a few lines of code
Compression methods can be combined in any way towards extreme acceleration
Core of BMCook: Compression Configuration File
Implement various methods with a few lines. The GitHub repository has multiple demos showing how to use BMCook to apply all mainstream acceleration methods for PLMs.
BMInf
BMInf is the first toolkit released by OpenBMB.
Github repo: https://github.com/OpenBMB/BMInf
BMInf has received 270 stars (hope more after this course XD).
In June 2021, we released CPM-2 with 10 billion parameters.
It is powerful in many downstream tasks.
Background
- high hardware requirements
- For each demo we used 4xA100s for inference.
- inefficient
- Each request takes about 10 seconds to handle.
- costly
- The cost of 4xA100s is ¥1200 per day.
- Another thought: instead of serving the demo on our server, make it possible for everyone to run big models on their own computers.
Difficulties
- How difficult is it?
- High Memory Footprint
- The checkpoint size of CPM-2 model is 22GB.
- It takes about 2 minutes to load the model from disk.
- High Computing Power
- Generating 1 token with A100 takes 0.5 seconds.
- High Memory Footprint
Linear Layer
- The linear layer is actually matrix multiplication.
- Use lower precision for speedup: FP64 -> FP32 -> FP16 -> FP8? INT8
- INT8
- smaller range
- precise (integer) values
Quantization
Using integers to simulate floating-point matrix multiplication
find the largest absolute value in the matrix
scale it to 127 to quantize
multiply by the scaling factor to dequantize
Matrix multiplication after quantization
Row-wise matrix quantization:
- calculate the scaling factor for each row/column
- scale each row/column to -127~127 (see the sketch below)
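A small sketch of row-wise quantized matrix multiplication (random matrices; the quantized values are kept as integer-valued floats here because a real kernel would store them as INT8 and use an INT8 GEMM):

```python
import torch

def rowwise_int8_matmul(A, B):
    """Simulate A @ B with 8-bit integers: scale each row of A and each column of B
    into [-127, 127], multiply, then multiply back by the scaling factors (dequantize)."""
    sa = A.abs().amax(dim=1, keepdim=True) / 127.0      # one scaling factor per row of A
    sb = B.abs().amax(dim=0, keepdim=True) / 127.0      # one scaling factor per column of B
    qa = torch.round(A / sa).clamp(-127, 127)           # quantize
    qb = torch.round(B / sb).clamp(-127, 127)
    return (qa @ qb) * sa * sb                          # dequantize the integer product

A, B = torch.randn(4, 8), torch.randn(8, 3)
print((rowwise_int8_matmul(A, B) - A @ B).abs().max())  # small quantization error
```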
We quantized the linear layer parameters of CPM-2
- model size is reduced by half
- 22GB -> 11 GB
- still too large for GTX 1060 ( 6GB memory )
Memory Scheduling
The idea is like virtual memory: only the parameters currently in use are kept on the GPU; the rest stay in CPU memory.
Not all parameters need to be placed on GPU.
- Move parameters that won’t be used in a short time to CPU.
- Load parameters from CPU before use.
- Calculation and loading are performed in parallel.
Implemented in CUDA 6: Unified Memory
We only need to store two layers of parameters in the GPU.
- one for calculating
- the other for loading
It’s about 500MB for CPM-2.
- In fact, it is much slower to load than to calculate.
- It takes a long time if we only place two layers on GPU.
- Put as many layers as possible on the GPU.
- Assuming that up to n layers can be placed on the GPU.
- n - 2 layers are fixed on the GPU and will not be moved to the CPU.
- 2 layers are used for scheduling.
Which layers are fixed on GPU?
- Consider two layers that need to be placed on the CPU.
- A larger interval is always better than a smaller one.
- Maximize the interval between two layers.
Usage
BMInf runs CPM-2 on a GTX 1060. It also achieves good performance on better GPUs.
Installation: pip install bminf
Hardware Requirements: GTX 1060 or later
OS: both Windows and Linux
L6 BM Application in NLP
Big-model-based Text Understanding and Generation
Introduction
- Typical NLP applications: understanding and generation
- Big models bring revolutions
- NLP Key applications:
- NLU (Natural Language Understanding): Information Retrieval
- NLG (Natural Language Generation): Text Generation
- NLU + NLG: Question Answering
Information retrieval
- Find relevant documents given queries.
- Big models can provide more intelligent and accurate search results.
- PLM-based methods ranked high
Question answering
- Big models can answer more complex questions
Text generation
- Machine translation; poetry generation; dialogue systems…
- Big models can generate more fluent and natural texts
Information Retrieval(IR)
Background
- Information explosion:
- Amount: 40ZB, 50% annual growth rate
- Variety: Update period in minutes
- Rising demand for automatic information retrieval
- 4.39 billion information users
- Annual growth rate of 6~21%
- Requirement: Query -> A sea of information -> A few relevant pieces of information
- Application
- Typical application: search engines. Others: public opinion analysis / fact verification, QA systems, retrieval-augmented text generation
- Examples
- Document Ranking for a Query
- Question Answering
Formulation
How to formulate?
- Given a query
- Given a document collection
- IR system computes the relevance score and ranks all documents based on the scores
Retrieval -> Re-Ranking
Evaluation Metrics
- MRR@k
- MAP@k
- NDCG@k
We only care about the top k results the system retrieves.
MRR (Mean Reciprocal Rank): MRR is the average of the reciprocal ranks of the first relevant results for a query set.
MAP (Mean Average Precision): MAP is the mean of the average precision score for a set of queries.
NDCG (Normalized Discounted Cumulative Gain): divides docs into different levels according to the relevance with the query.
Discounted Cumulative Gain (DCG): You get five results for a query search and classify them into three grades: Good (3), Fair (2) and Bad (1)
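A small sketch of MRR@k and (N)DCG@k on made-up relevance judgments, following the definitions above:

```python
import math

def mrr_at_k(ranked_relevance, k=10):
    """ranked_relevance: per-query lists of 0/1 flags in ranked order."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels[:k], start=1):
            if rel:                              # reciprocal rank of the first relevant result
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def dcg_at_k(grades, k=5):
    """grades: graded relevance in ranked order, e.g. Good=3, Fair=2, Bad=1."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(grades[:k], start=1))

queries = [[0, 1, 0], [1, 0, 0]]                 # hypothetical binary relevance for two queries
print(mrr_at_k(queries))                         # (1/2 + 1/1) / 2 = 0.75
grades = [3, 2, 3, 1, 2]
print(dcg_at_k(grades) / dcg_at_k(sorted(grades, reverse=True)))   # NDCG@5
```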
Traditional IR
- BM25 (Best Matching 25)
- Lexical exact-match model
- Given a query and a document collection
- BM25 computes the relevance score
- TF (Term Frequency): The weight of a term that occurs in a document is simply proportional to the term frequency.
- IDF (Inverse Document Frequency): The specificity of a term can be quantified as an inverse function of the number of documents in which term t appears.
- Problems:
- Vocabulary mismatch: Different vocabulary, same semantics
- Semantic mismatch: Same vocabulary, different semantics
Neural IR
- Neural IR can mitigate traditional IR problems
- Query + Document -> Neural Network -> Vector Space -> Relevance Score
- Neural IR outperform traditional IR significantly
- Being neural has become a tendency for IR
- Architecture
- Re-ranking: cross-encoder; models finer semantics of query and document; superior performance; higher computational cost
- Retrieval: dual-encoder; independent representations for query/document; reduced computational cost
- Cross-Encoder
- Given a query q and a document d
- They are encoded to the token-level representations H
- Get the ranking score
- Training: Training data + Training loss
- Dual-Encoder
- DPR: embed query and documents using dual encoders
- Negative log likelihood (NLL) training loss
- Offline computation of doc representations
- Nearest neighbor search supported by FAISS: Batching & GPU can greatly improve retrieval speed (~1ms per q for 10M documents, KNN)
- Retrieval Performance
- More training examples (from 1k to 59k) further improves the retrieval accuracy consistently
- Bigger model size, better retrieval performance
Advanced Topics
- How to mine negative?
- In-batch negative
- Random negative
- BM25 negative
- Self-retrieved hard negative (ICLR 2021)
- Negative-enhanced Fine-tuning
- ANCE (Approximate nearest neighbor Negative Contrastive Learning) -> Asynchronous Index Refresh: document index goes stale after every gradient update → Refresh the index every k steps
- ANCE (Approximate nearest neighbor Negative Contrastive Learning) -> Performance Beat other dense retrieval
- RocketQA (NAACL 2021) -> Uses cross-encoder to filter hard negatives. Performance beats ANCE.
- IR-oriented Pretraining
- SEED-Encoder (EMNLP 2021)
- pre-trains the autoencoder using a weak decoder to push the encoder to provide better text representations.
- The encoder and decoder are connected only via [CLS]. The decoder is restricted in both param size and attention span.
- beats standard pretrained models.
- ICT (Inverse Cloze Task)
- Given a passage consisting of n sentences
- The query is a sentence randomly drawn from the passage, and the document is the rest of sentences
- ICT pre-training improves retrieval performance
- SEED-Encoder (EMNLP 2021)
- Few-Shot IR
- Many real-world scenarios are “few-shot” where large supervision is hard to obtain
- Weak supervision generation
- Weak supervision selection
- Reinforcement data selection (ReinfoSelect) -> Learn to select training pairs that best weakly supervise the neural ranker
- Meta-learning data selection (MetaAdaptRank) -> Learn to reweight training pairs that best weakly supervise the neural ranker
- MetaAdaptRank beats ReinfoSelect
- Generalizable T5-based dense Retrievers (GTR)
- Conversational IR -> Models multiple rounds of query
- How to use big model to retrieve long documents? -> Long-range dependency
Demo (video): load document representations -> load query representations -> batch search -> visualize retrieved results.
Question Answering(QA)
Background
Why do we need question answering (QA) ?
- When we search for something in Google, it’s usually hard to find answers from the document list
- With QA systems, answers are automatically found from large amount of data
Better search experience
Applications of QA
- IBM Watson: 2011 Winner in Jeopardy
- Defeated two human players (Ken Jennings and Brad Rutter)
- Intelligent assistants
History
- Template-based QA / Expert systems
- IR-based QA
- Community QA
- Machine Reading Comprehension / KBQA
Types of QA
- Machine Reading Comprehension: Read specific documents and answer questions
- Open-domain QA: Search and read relevant documents to answer questions
- Knowledge-based QA: Answer questions based on knowledge graph
- Conversational QA and dialog: Answer questions according to dialog history
- …
Reading Comprehension(RC)
- Task Definition and Dataset
- Definition of RC
- Documents, Questions, Candidate answers
- Types of RC
- Cloze test: CNN/Daily Mail (93k CNN articles, 220k Daily Mail articles)
- Cloze test: CBT (Children’s Book Test), Context: 20 continuous sentences, Question: the 21st sentence, with an entity masked, Answer: the masked entity, 10 candidates
- Multiple choice -> RACE: 100k multiple choice questions collected from English exams in China.
- Extractive RC: Predict a span in documents -> SQuAD: 100k+ human-annotated questions on 536 articles from Wikipedia. Every answer is a span in the article
- Traditional Pipeline
- Model Framework -> General framework in RC: embed, encode, interact, and predict
- Typical components per layer: prediction (bilinear, pointer network), interaction (document-to-query and query-to-document attention), encoding (LSTM, GRU, attention), embedding (GloVe, ELMo, character embeddings)
- An Example of RC Model: BiDAF. Four layers
- Prediction Layer
- Attention Based Interaction Layer
- Context-aware Encoding Layer
- Word Embedding Layer
- Big-model-based Methods
- Use PLMs (like BERT) to replace the first three layers -> BERT model has no RNN modules
- Model change: pre-trained representation model -> prediction layer
- Using BERT for RC:
- Feed the concatenation of the question and the context to BERT. Get the question-aware context representation to predict the start/end of answers.
- Excellent performance on SQuAD
- UnifiedQA, Unifying different QA formats
- Four types: extractive, abstractive, multiple-choice, yes/no
- Text-to-text format
- A single QA system is on par with, and often outperforms, dedicated models
- Using prompt, we can do it easily!
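A short, hedged example of the BERT-for-RC recipe above using the Hugging Face question-answering pipeline; the SQuAD-tuned checkpoint name is just one publicly available choice, not the course's model.

```python
from transformers import pipeline

# Any extractive-QA checkpoint fine-tuned on SQuAD can be plugged in here.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = ("BiDAF consists of four layers: a word embedding layer, a context-aware encoding layer, "
           "an attention-based interaction layer, and a prediction layer.")
result = qa(question="How many layers does BiDAF have?", context=context)
print(result["answer"], result["score"])   # the answer is a span extracted from the context
```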
Open-domain QA
Task Definition
RC assumes that any question has a short piece of relevant text, which is not always true
- In open-domain QA, the model should be able to find relevant texts from a corpus and read them
- Wikipedia can be viewed as a large-scale corpus for factoid questions
Goal: build an end-to-end QA system that can use full Wikipedia to answer any factoid question
Generation-based Methods
- Answer Questions with Big Models:
- GPT-3, T5, etc. can generate answers directly
- Fine-tune T5 on open-domain QA
- Achieve competitive performance
- Bigger models perform better
- “Power of scale”
Retrieval-based Methods
Document Retriever + Document Reader
- Document retriever: finding relevant articles from 5 million Wikipedia articles
- Document reader (reading comprehension system): identifying the answer spans from those articles
Document Retriever
- Return 5 Wikipedia articles given any question
- Features:
- TF-IDF bag-of-words vectors
- Efficient bigram hashing (Weinberger et al., 2009)
- Better performance than the built-in Wikipedia search (measured by hit@5)
Document Reader
- Simple reading comprehension model
- Features:
- Word embeddings
- Exact match features: whether the word appears in the question
- Token features: POS, NER, term frequency
- Aligned question embedding
- Using Shared-Norm for multiple documents
Distant Supervision: for a given question, automatically associate paragraphs that include the answer span with this question.
Results
- Reasonable performance across all four datasets
- Models using DS outperform models trained on SQuAD -> Multi-task: Training on SQuAD + DS data
Retrieval-Augmented Language Model PreTraining, REALM:
- Augment language pre-training with a neural knowledge retriever that retrieves knowledge from a textual knowledge corpus (e.g., Wikipedia)
- Allow the model to attend documents from a large corpus during pre-training, fine-tuning and inference
- Pre-training of REALM: The knowledge retriever and knowledge-augmented encoder are jointly pre-trained on the unsupervised language modeling task
- Fine-tuning of REALM: The pre-trained retriever (θ) and encoder (φ) are fine-tuned on a task of primary interest, in a supervised way
- Excellent performance for open-domain QA
Document Retrieval and Synthesis with GPT3
- WebGPT
- Outsource document retrieval to the Microsoft Bing Web Search API
- Utilize unsupervised pre-training to achieve high-quality document synthesis by fine-tuning GPT-3
- Create a text-based web-browsing environment that both humans and language models can interact with
- Pipeline:
- Fine-tune GPT-3 to imitate human behaviors when using the web-browser
- Write down key references when browsing
- After browsing, generate answers with references
- WebGPT-produced answers are preferred over human-generated ones
- Better coherence and factual accuracy
- WebGPT
Demo
- QA with T5 using OpenPrompt: zero-shot inference. The video is here.
- QA with T5 using OpenPrompt and OpenDelta: Delta tuning.
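The course demo uses OpenPrompt/OpenDelta; as a rough stand-in, here is a plain Hugging Face sketch of zero-shot QA with a QA-tuned T5 checkpoint. The UnifiedQA model name and its "question \n context" input format are assumptions based on that project's documentation, not the demo itself.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

name = "allenai/unifiedqa-t5-small"         # one publicly released QA-tuned T5 checkpoint
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

# UnifiedQA expects lowercased "question \n context" as a single input string
prompt = "who wrote hamlet?\nshakespeare wrote many plays, including hamlet."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```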
Text Generation(TG)
TG
Introduction to text generation
- Formal Definition: Produce understandable texts in human languages from some underlying non-linguistic representation of information. [Reiter et al., 1997]
- Text-to-text generation and data-to-text generation are both instances of TG [Reiter et al., 1997]
- Applications under umbrella of text generation
Tasks of text generation: Data-To-Text (image, table, graph), Dialogue, Machine Translation, Poetry Generation, Style Transfer, Storytelling, Summarization
Data-to-Text -> Various data forms: image, table, graph, …
Dialogue -> Generate conversations that meet the purpose in response to specific user input
Machine Translation -> Translate natural language sentences into a target language
Poetry Generation -> Generate texts that meet the rhythmic requirements of the poem, based on keywords, or emotional control, etc
Style Transfer -> Control the style of the input text while preserving its meaning
Storytelling -> Generate a story that meets the attribute requirements based on the given keywords, story line, etc.
Summarization -> Summarize the input text with selected part of input text (extractive) or with generated text (abstractive)
Neural text generation
- Language Modeling
- Predict next word given the words so far
- A system that produces this probability distribution is called a Language Model
- We use language models every day, such as …
- Conditional Language Modeling
- The task of predicting the next word, given the words so far, and also some other input
- x input/source
- y output/target sequence
- Seq2seq(Encoder -> Decoder)
- Seq2seq is an example of conditional language model
- Encoder produces a representation of the source sentence
- Decoder is a language model that generates target sentence conditioned on encoding
- seq2seq can be easily modeled using a single neural network and trained in an end-to-end fashion
- seq2seq training by teacher forcing
- Training: predict next word based on previous ground-truth tokens, instead of predicted tokens
- Testing: predict next word based on previous predicted tokens
- Exposure Bias: the gap between the training & testing distributions (see the sketch at the end of this list)
- Text-to-Text-Transfer-Transformer (T5):
- A Shared Text-To-Text Framework: reframing all NLP tasks into a unified text-to-text-format where the input and output are always text strings
- Training objective -> Colossal Clean Crawled Corpus (C4) dataset, a cleaned version of Common Crawl (deduplication, discarding incomplete sentences, and removing offensive or noisy content), Unlabeled data.
- Autoregressive Generation: Generate future values from past values.
- Generative Pre-Trained Transformer (GPT)
- GPT-1: Improving language understanding by generative pretraining
- GPT-2: Language models are unsupervised multitask learners
- GPT-3: Language models are few shot learners
- GPT-2
- GPT-2: Language models are unsupervised multitask learners
- Train the language model with unlabeled data, then fine-tune the model with labeled data according to corresponding tasks
- Non-Autoregressive Generation: Given a source, Generate in parallel.
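Before moving on to decoding, here is a minimal PyTorch sketch of the teacher-forcing scheme described earlier in this section: the decoder is always conditioned on the gold prefix during training, which is exactly what creates the exposure-bias gap at test time (the shapes and the GRU decoder are illustrative assumptions, not the course's model).

```python
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 64
embed = nn.Embedding(vocab_size, hidden)
decoder = nn.GRU(hidden, hidden, batch_first=True)
head = nn.Linear(hidden, vocab_size)
loss_fn = nn.CrossEntropyLoss()

target = torch.randint(0, vocab_size, (8, 20))     # gold target token ids (batch, seq_len)
inputs, labels = target[:, :-1], target[:, 1:]     # teacher forcing: condition on the gold prefix

states, _ = decoder(embed(inputs))
logits = head(states)                              # (batch, seq_len - 1, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), labels.reshape(-1))
loss.backward()                                    # at test time the model must feed back its own predictions instead
```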
Decoding
- Greedy decoding
- Beam search
- Sampling methods
- Pure sampling
- Top-n sampling
- Nucleus sampling
- Greedy Decoding: Generate the target sentence by taking argmax on each step of the decoder.
- Beam Search Decoding:
- Find a high-probability sequence
- Beam search
- On each step of decoder, keep track of the k most probable partial sequences
- After you reach some stopping criterion, choose the sequence with the highest probability
- Not necessarily the optimal sequence
- What’s the effect of changing beam size k
- Small k has similar problems to greedy decoding
- Ungrammatical, unnatural, nonsensical, incorrect
- Larger k means you consider more hypotheses
- Reduces some of the problems above
- More computationally expensive
- But increasing k can introduce other problems
- For neural machine translation (NMT): Increasing k too much decreases BLEU score (Tu et al., Koehn et al.)
- chit-chat dialogue: Large k can make output more generic
- Sampling-based Decoding:
- Pure sampling: On each step t, randomly sample from the probability distribution $P_{t}$ to obtain your next word
- Top-n sampling:
- On each step t, randomly sample from $P_{t}$, restricted to just the top-n most probable words
- $n = 1$ is greedy search, $n = V$ is pure sampling
- Nucleus sampling (Top-p sampling)
- On each step t, randomly sample from $P_{t}$, restricted to the top words that cover probability ≥ $p$
- $p = 1$ is pure sampling
- Sample with temperature: Before applying the final softmax, its inputs are divided by the temperature τ
- Increase n/p/temperature to get more diverse/risky output
- Decrease n/p/temperature to get more generic/safe output
- Both of these are more efficient than Beam search
In summary
- Greedy decoding
- A simple method
- Gives low quality output
- Beam search
- Delivers better quality than greedy
- If the beam size is too high, it will return unsuitable output (e.g., generic, short responses)
- Sampling methods
- Get more diversity and randomness
- Good for open-ended/creative generation (poetry, stories)
- Top-n/p/temperature sampling allows you to control diversity
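The decoding strategies above map directly onto arguments of the Hugging Face `generate` API; a hedged sketch with GPT-2, where the parameter values are arbitrary examples:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The meaning of life is", return_tensors="pt")
eos = tokenizer.eos_token_id

greedy = model.generate(**inputs, max_new_tokens=20, do_sample=False, pad_token_id=eos)              # greedy decoding
beam = model.generate(**inputs, max_new_tokens=20, num_beams=5, do_sample=False, pad_token_id=eos)   # beam search, k=5
top_k = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50, pad_token_id=eos)      # top-n (top-k) sampling
nucleus = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.9,
                         temperature=0.7, pad_token_id=eos)                                          # nucleus sampling + temperature
print(tokenizer.decode(nucleus[0], skip_special_tokens=True))
```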
Controllable Text Generation
- Control text generation: avoid repeating, more diverse, …
- Prompt methods
- Horror xxx
- Reviews xxx
- Add a prefix and train only that prefix. P-tuning / prefix tuning: prefix + LM, with only the prefix being trained.
- Modifying the probability distribution: move the output distribution closer to an "angel" (desired) model and away from a "devil" (undesired) model to control generation.
- Reconstructing model architecture:
- Modify the model architecture by adding Transformer components dedicated to encoding the control signal/relations. Before cross-attending to the source text, the decoder first cross-attends to the guidance signal so that it is aware of the control signal.
- Specialized encoder for guidance signal
- Decoder: self-attention -> (+guidance signal)cross-attention -> (+source document)cross-attention -> FFN
Text generation evaluation
- Common metrics
- BLEU (Bilingual evaluation understudy)
- easy to compute
- doesn’t consider semantics & sentence structure
- PPL (perplexity)
- Evaluate how well a probability model predicts a sample.
- Overlap-based Metric
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): recall-oriented; addresses the problem of missing content (low recall)
- NIST: consider the amount of n-gram information
- METEOR: based on the harmonic mean of precision and recall
- Distance-based Metrics
- Edit Dist (cosine similarity); SMD (embedding distance); YiSi (weighted similarity)
- Diversity Metrics
- Distinct (n-gram diversity); Entropy; KL divergence
- Task-oriented Metrics
- SPICE (Semantic Propositional Image Caption Evaluation)
- Human Evaluation
- Intrinsic (fluency, internal relevance, correctness)
- Extrinsic (performance on downstream subtasks)
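As a small worked example of the automatic metrics above, BLEU can be computed with the sacrebleu package (one common implementation; the sentences below are made up):

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]   # one reference stream, aligned with the hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # corpus-level BLEU in [0, 100]
```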
TG Tasks: Challenges
Challenges
Training model strategy
- Always generate repeated words
- Exposure bias
Commonsense
- Lack of logical consistency
Controllability
- Difficult to ensure both language quality and control quality
Evaluation: reasonable metrics and datasets
Demo: GPT-2
- Task
- The WebNLG challenge consists in mapping data to text
- The training data consists of Data/Text pairs where the data is a set of triples extracted from DBpedia and the text is a verbalization of these triples.
- Example:
- a. (John_E_Blaha birthDate 1942_08_26) (John_E_Blaha birthPlace San_Antonio) (John_E_Blaha occupation Fighter_pilot) b. John E Blaha, born in San Antonio on 1942-08-26, worked as a fighter pilot
- Text generated with untuned GPT-2
- Loss
- Text generated with tuned GPT-2
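A hedged sketch of one fine-tuning step for this demo: a WebNLG triple set and its reference text are linearized into a single sequence (the " = " separator is our own convention, not necessarily the demo's), and GPT-2 is trained with the standard causal-LM loss.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

data = "(John_E_Blaha birthDate 1942_08_26) (John_E_Blaha birthPlace San_Antonio)"
text = "John E Blaha was born in San Antonio on 1942-08-26."
batch = tokenizer(data + " = " + text, return_tensors="pt")

# Causal-LM fine-tuning step: the model shifts the labels internally and computes cross-entropy
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
print(float(outputs.loss))
```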
L7 BM x Biomedical
Introduction
Outline
- Brief Introduction of Biomedical NLP
- Biomedical Text Mining: Tasks, PLMs, Knowledge, Application
- Diagnosis Assistance: Text Classification, Conversation
- Substance Representation: DNA, Protein, Chemicals
- Project: BioSeq PLMs and Benchmark
- Biomedical NLP: Future Directions
What does biomedical NLP study?
- Searching and reading large volumes of long literature? → Obtain ready-made knowledge directly!
- Lining up at the door of the consulting room? → Ask an automatic diagnosis system for efficiency!
- Predicting the properties of an organic substance? → Use AI models to gain deeper insights into biomedical substances!
What does biomedical NLP study?
- For knowledge and efficiency: biomedical literature, drug instructions, clinical records, experimental operation guide, …
- For practical applications: diagnosis assistance, meta-analysis, exploration for new drugs, pharmacy, …
- For insights into domain-specific data: molecules, proteins, DNA, …
Biomedical NLP can go far beyond the traditional ‘language’.
What characteristics does biomedical NLP have?
- Mass of raw data / Little golden annotated data
- Unsupervised and Weakly supervised / Supervised
- Resources: PubMed, ChemProt
What characteristics does biomedical NLP have?
- High knowledge threshold
- knowledge-enhanced learning
Text Mining: Tasks
Entities -> BioNER/BioNEN
Traditional: Dictionary-based; Semantic; Statistical. DL-based: End2end. https://www.ncbi.nlm.nih.gov/research/pubtator/
Rule-based; CRF…
Highlighted words are recognized entity mentions.
- Link entities to various KBs.
Literature -> topic recognition/indexing
- Supervised machine learning models;
- Ranking models; Ontology matching models.
- PubMed literature search interface
Relations & Events -> BioRE/RD, Event Extraction
- Template/rule-based; Statistical
- NLP(parsing)-based; Sequence Labeling
Pathways & Hypotheses -> pathway extraction / literature-based discovery
- Rule-based; ML-based; Hybrid.
- ABC co-occurrence model based
A common pipeline of biomedical text mining
- Named entity recognition (NER) -> Named entity normalization (NEN) -> Relation Extraction (RE)
- Simple but effective baselines for NER (including entity typing): CNNs, BiLSTM + CRF
- With PLMs as the backbone: BERT + CRF, BERT + Prompt
- Common scenario for NEN: representation "distance"
- Key for NEN: entity disambiguation (context + knowledge in the KB)
- SciSpacy: a Python package tailored for biomedical semantic analysis, including NER and NEN pipelines
- PubTator: a Web-based system providing automatic NER and NEN annotations (PubMed + PMC)
BERT + BiLSTM + CRF (A common Method for NER)
- BERT + Prompt (Entity Typing)
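A tiny hedged example of running the SciSpacy pipeline mentioned above for biomedical NER; it assumes the `en_core_sci_sm` model has been installed from the scispacy release wheels.

```python
import spacy  # pip install scispacy, plus the en_core_sci_sm model wheel

nlp = spacy.load("en_core_sci_sm")
doc = nlp("Rosiglitazone decreased plasma resistin levels in patients with type 2 diabetes mellitus.")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char)   # recognized entity mentions with character offsets
```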
A common pipeline of biomedical text mining
- Named entity recognition (NER) → Named entity normalization (NEN) → Relation Extraction (RE)
- RE: sentence-level / document-level
- Benchmarks: ChemProt, PPI / BC5CDR, GDA
- Common Methods: BERT-based and graph-based methods
- Relation types: from binary to complex
A simple BERT-based document-level RE model
A GCN-based document-level RE model
Data characteristics of biomedical text mining
- The cost of professional data labeling is extremely high
- Problems concerned with data: small scale and incomplete categories
- ChemProt: chemical – proteins, 1820 / BC5CDR: chemical – diseases, 1500
- Unsupervised: PLMs; Weakly Supervised: distant supervision (denoise)
- An example of labeling PubMed with CTD
- Common labeling strategy: NER + NEN tools + KG; model-based methods
Model-based denoising
Self-Training denoising
Text Mining: PLMs
- PLMs have shown their power in a wide variety of tasks (the power of unsupervised learning)
- Domain-specific PLM:
- domain corpus (Sci-BERT, BioBERT, clinical BERT, …)
- special pretraining task (MC-BERT, KeBioLM, …)
Text Mining: Knowledge
- Knowledge Bases (KBs)/Knowledge Graphs (KGs)
- An important application of text mining: unstructured -> structured
- Famous KBs: MeSH, UMLS, NCBI Gene, UniProt, …
- KGs: CTD, DisGeNet, HuRI, …
- Challenges: KBs all have their own limitations and are far from unified; KGs are small in scale and incomplete
- Conversely, KBs/KGs can also help the model to better handle downstream tasks
- Knowledge-Enhanced:
- shallow (entity disambiguation)
- deep (semantic information in intricate KGs)
- Methods to integrate knowledge into PLMs: Adapters, Customized pretraining tasks, Prompt Tuning, Delta Tuning, …
- Enhanced NER for proteins and genes
- SMedBERT: Enhanced PLM
Text Mining: Application
NER and NEN:
- Easy access to knowledge when reading literature
- Bridge the gap between documents and KBs/KGs
- Map colloquial expressions (e.g., patient consultations) to standard technical terminology
- triage / QA assistance
Building of KBs/KGs:
- Obtain Knowledge within several clicks
- Is that enough?
- Search for entity “aspirin” in CTD
- Diseases and evidences related to “aspirin”
Relation Extraction:
- Building of knowledge graphs
- Relation-aware literature retrieval
NER + NEN + RE (sometimes Event Extraction, …):
- Clinical analysis: Automatically extract and analyze valid information from clinical records and integrate experimental conclusions
- Lead to new biomedical discovery and hypothesis
30 patients with type 2 diabetes mellitus who showed poor glycemic control with glimepiride (4 mg/d) were randomized to rosiglitazone (4 mg/d) and metformin (500 mg bid) treatment groups. The plasma concentrations of resistin were measured at baseline and at 6 months of treatment for both groups. The resistin levels decreased in the rosiglitazone group (2.49 ± 1.93 vs 1.95 ± 1.59 ng/ml; P < .05) but increased in the metformin group (2.61 ± 1.69 vs 5.13 ± 2.81 ng/ml; P < .05)…
Diagnosis Assistance
- Biomedical NLP for the crowd
- Scarce medical resources / Flourishing online services
- Reduce the pressure on doctors and improve the work efficiency of hospital systems
Diagnosis Assistance: Text Classification
- Common tasks: automatic triage&medicine prescription
- Datasets: annotated entities prediction
- Backbones: SVM, LSTM; BERT; GPT…
- Classify as a matching/retrieval process
We may try to inject more knowledge (e.g., descriptions from KBs)
Diagnosis Assistance: Dialogue
AI systems: replace the doctor’s role to complete more operations including communicating with the patients
Datasets: MedDialog (a large-scale Chinese dataset)
Dialogue as a typical text generation task:
- Different from QA: usually multi-turn; no candidate answer
- Chat-box; task-oriented …… many practical systems
Retrieval-based Dialogue System: traditional method
Fluent but not always related
Combine with generation-based DS
Knowledge-based Dialogue System: More logical
In the real world: …
Incorporate knowledge
Human thinking process
- Language models capture knowledge and generate language
- Dialogue Generation from KGs with Graph Transformers
Medical Dialogue: Safety( Plenty of knowledge + Interpretability)
A typical application for medical knowledge interactivity:
- Users -> Models: extract empirical knowledge
- Models -> Users: query existing knowledge
Stylized language: the gap between patients' colloquial style and the standard terms and structured items in KBs/KGs
- Entity Linking / Standardization for diagnosis
- Privacy protection
Summarize the key concepts
Ready for the further KB enhancing
Patient states & Physician policies
KL loss for state distribution
Clear and understandable
- 1st: States training
- 2nd: States+Actions training
Our exploration:
- Multi-task & soft prompt learning during pre-training
- 2-stage framework for the medical dialogue task
Diagnosis Assistance
- Something about Big Models:
- Externally, we integrate KBs/KGs during the encoding of medical dialogue text
- Internally, we regard the PLM itself as a KB, hoping to query corresponding information from it
- Prompt/Cloze? CoT?
- How to protect privacy?
Substance Representation
- NLP systems can process natural language text
- What if we want to process biomedical substances?
- NLP systems can process not only natural language text
- To represent biomedical substances as linear text
- Background knowledge review
- Nucleic acid sequence: A, G, C, T (U)
- Amino acid sequence: 20 for human
- Protein: Quaternary structure
Substance Representation: DNA
Major research object: non-coding DNA
Tasks:
- predict gene expression
- predict proximal and core promoter regions
- identify transcription factor binding sites
- figure out important regions, contexts and sequence motifs
- …
Datasets: plenty of open-access resources
- Homo sapiens genome assembly (CRCh38/hg38)
- Cap Analysis Gene Expression (CAGE) Databases
- Descartes: Human Chromatin Accessibility During Development
- ……
Natural language models are good at capturing patterns from masses of sequence data
From simple frameworks (e.g., CNN & LSTM) to Transformers
The token vocabulary is much smaller than in natural language -> less information in individual token embeddings
- position is important
- k-mer sliding window input
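The k-mer sliding-window input mentioned above can be sketched in a few lines; this is the tokenization scheme used by DNA language models such as DNABERT (the values of k and stride are illustrative):

```python
def kmer_tokenize(sequence, k=6, stride=1):
    """Split a DNA sequence into overlapping k-mers with a sliding window."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

print(kmer_tokenize("ATGCGTAC", k=3))
# ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
```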
Substance Representation: Protein
- We mainly focus on the amino acid sequences
- Tasks:
- Structure Prediction
- Evolutionary Understanding
- Protein Engineering
- …
- Datasets:
- Uniref: provide clustered sets of sequences from the UniProt Knowledgebase
- GO annotations: capture statements about how a gene functions at the molecular level
- Protein Data Bank……
- Methods: BiLSTM + CRFs, Autoencoder models …
- Big Model:
- Models with larger-scale parameters are better at capturing features from biological sequences.
- Pre-training has proved to be especially helpful!
- Alpha-Fold: One of the most inspiring research results!
- Predict 3D structure with the help of molecular dynamics
- MSA + EvoFormer + end-to-end training: a perfect combination of biomedical knowledge and NLP techniques
- A breakthrough for the 3D structure prediction accuracy (comparable to human level)
- Inspired by AlphaFold: MSA Transformer
- Column/Row attention structure
- Mean attention better than individual?
- EvoFormer: unsupervised MSA mask learning for initialization
- Structure: annotated data for the initial network; predict the unannotated data and apply noisy-student training
- MSA row/column attention; templates
- A representation for each pair of residues
- pairwise repr graph iterative update
- single repr and pair repr; blackhole initialize; Peptide bond angles and distances
- Interaction is here
Substance Representation: Chemicals
- Molecular fingerprints: essential cheminformatics tools for virtual screening and mapping chemical space
- Get fingerprint representations with deep-learning models?
- Molecular graphs -> GCNs; SMILES strings -> LMs
- Tasks: molecule property classification, chemical reaction classification, …
- Datasets: MoleculeNet, USPTO 1k TPL, …
- Case: KV-PLM
- Bridging chemicals with general text
- Complementary features of heterogeneous data
- Inspired by human observing and learning mapping correlation
- PLM integrating chemical structure & text
- Comprehensively processing both SMILES strings and general text
- Model finishing chemical exam: property prediction
- Conversely, it provides help for drug discovery
Project: BioSeq PLMs and Benchmark
- Background
- NLP technologies are widely applied to processing biological sequences
- There exist differences between natural language and bio-sequences; better PLMs are expected to be proposed.
- Long-term Goals
- Propose a robust and comprehensive benchmark for DNA data process
- Explore better model structure and pre-train method for DNAs
- Projects
- Reproduce and improve DNA pre-trained baseline methods
- Build down-stream DNA tasks from open-source databases
Biomedical NLP: Future Directions
- Knowledgeable big model: models with more expert knowledge achieving better performance
- AI for science: user-friendly assistant tools with lower barriers to entry; unleash human researcher productivity
- Cross-modal processing: bridging vision-language information or different forms of data (e.g., graphs)
- Low-resource learning: lack of annotated data
L8 BM x Legal Intelligence
Background
Challenges
- In US, roughly 86% of low-income individuals with civil legal problems report receiving inadequate or no legal help
- In China, roughly 80% of cases have no access to the support of lawyers
Legal Artificial Intelligence (LegalAI)
- AI for Law: Apply the technology of artificial intelligence, especially natural language processing, to benefit tasks in the legal domain
- Law for AI: Use laws to regulate the development, deployment, and use of AI
AI for Law
- Reduce the time consumption of tedious jobs and improve work efficiency for legal professionals
- Provide a reliable reference to those who are unfamiliar with the legal domain
Challenges
- Lack of labeled data -> There are only limited high-quality human-annotated data for legal tasks, and data labeling is costly
- High demand for professional knowledge -> Legal tasks usually involve many legal concepts and knowledge
Legal Intelligence Applications
Legal Judgement Prediction -> Given the fact description, legal judgement prediction aims to predict the judgement results, such as relevant law articles, charges, prison terms
Legal Judgement Prediction
- Multiple subtasks
- Criminal cases: relevant law article prediction, charge prediction, prison term prediction, fine prediction …
- Civil cases: relevant law article prediction, cause of action prediction, ……
- Task formalization
- Inputs: the fact description
- Relevant law article: classification
- Charge/Cause of action: classification
- Prison term/Fine: regression
- Challenges
- Confusing charges
- Interpretability
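To make the task formalization above concrete, here is a hedged sketch of charge prediction as multi-class text classification over the fact description; `bert-base-uncased` is only a placeholder backbone, and a legal PLM (e.g., Legal-BERT or OpenCLaP, discussed later in this lecture) would be dropped in the same way.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

num_charges = 118   # e.g., the number of criminal charges in the event dataset mentioned below
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_charges)

fact = "The defendant entered the store at night and took goods worth 3,000 yuan."
inputs = tokenizer(fact, truncation=True, max_length=512, return_tensors="pt")
logits = model(**inputs).logits                 # (1, num_charges)
print(int(torch.argmax(logits, dim=-1)))        # meaningless until the classification head is fine-tuned
```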
Similar Case Retrieval
- Given a query case, similar case retrieval aims to retrieve relevant supporting cases
- Task formalization
- Query case: q
- Candidate cases: C
- Outputs: relevance score for each query-candidate pair $(q, c_i)$
- Challenges
- Long document matching
- Relevance definition
- Diverse user intention
Legal Question Answering
- Legal question answering aims to provide explanations, advice, or answers for legal questions.
- Task formalization
- Inputs: question
- Step 1: retrieve the relevant knowledge (law articles, legal concepts) from the knowledge base
- Step 2: answer the question based on the relevant knowledge
- Challenges
- Concept-Fact Matching
- Multi-hop reasoning
- Numerical reasoning
Court View Generation
- Given the fact description and plaintiff’s claim, court view generation aims to generate the rationales and results of the cases.
- Task formalization
- Inputs: claim and fact description
- Outputs: The decisions (Accept/Reject) and the corresponding reasons
Other applications
- Legal Cases Retrieval
- Legal Information Recommendation
- Risk Warning
- Legal Judgment Prediction
- Legal Documents Translation
- Legal Text Mining
- Legal Documents Generation
- Legal Question-Answering
- Compliance Review
Two Lines of Research
- Data-Driven Methods
- Legal cases
- Trademarks / Patents
- Court Trial
- Knowledge-Guided Methods
- Legal Regulations
- Judicial Interpretation
- Legal Literature
Data-Driven Methods
- Utilize deep neural networks to capture semantic representations from large-scale data
- Large-scale open-source legal corpora
- 130 million legal case documents
- 160 million patent/trademark documents
- 19 million court trial records
- Typical data-driven methods
- Word embeddings
- Pre-trained language models
- Open-domain PLMs are suboptimal for the legal domain
- Differences in narrative habits and writing styles
- Abundant terminology and concepts specific to legal documents
- Train PLMs based on large-scale unlabeled legal documents
- Masked Language Model
- PLMs in the legal domain
- Don’t stop pre-training!
- Additional pre-training on target corpora can lead to performance improvement
- Legal-BERT: pretrained on English legal documents
- OpenCLaP: pretrained on Chinese legal documents
- PLMs for long documents in the legal domain
- Legal documents usually involve complex facts and contain 1,260.2 tokens on average
- Most existing PLMs can only handle documents with no more than 512 tokens
- PLMs for legal long documents in the legal domain
- Lawformer utilizes the sparse self-attention mechanism instead of full self-attention mechanism to encode the long documents
- Pre-training Data, Model Parameters, Tasks
- Lawformer can achieve significant performance improvement
- Legal PLMs: Learning Responsible Data Filtering from the Law
- Privacy Filtering
- the law provides a number of useful heuristics that researchers could deploy to sanitize data
- juvenile names, dates of birth, account, and identity number
Knowledge-Guided Methods
Knowledge-Guided Methods
- Enhance the data-driven neural models with the legal domain knowledge to improve the performance and interpretability on downstream tasks
- Knowledge in open-domain
- Knowledge Graphs
Typical legal knowledge
- Events that occurred in the cases
- Decision-making elements
- Legal logic
- Legal regulations
LegalAI Applications
- Legal Judgement Prediction -> Given the fact description, legal judgement prediction aims to predict the judgement results, such as relevant law articles, charges, prison terms
- Legal Event Knowledge
- Key of Legal Case Analysis: Identifying occurred events and causal relations between these events
- Legal events can serve as high-quality case representations
- Existing Legal Event Datasets
- Incomprehensive event schema
- Limited coverage: only contain tens of event types with a narrow scope of charges
- Inappropriately defined: only contain charge-oriented events and ignore general events
- Limited data annotations
- Only contain thousands of event mentions
Our Goal
- Large-scale: 8,116 legal documents with 118 criminal charges and 150,977 mentions
- High coverage: 108 event types, including 64 charge-oriented events and 44 general events
Legal Events for Downstream Tasks
- Combine the pretrained models with the legal event knowledge
- Add occurred events as additional features to generate the document representation
Legal Events for Judgement Prediction
- Combine the pretrained models with the legal event knowledge
- Utilize occurred events as features to represent legal cases
- low-resource setting
- full-data setting
Legal Events for Similar Case Retrieval
- Combine the pretrained models with the legal event knowledge
- Utilize occurred events as features to represent legal cases
- unsupervised setting
- supervised setting
Legal Element Knowledge
- Legal elements refer to crucial attributes of legal cases, which are summarized by legal experts
- Long-tail distribution -> Top 10 charges cover 78.1% cases
- Confusing charges -> Theft vs. Robbery
Legal Elements for few-shot and confusing charges
- Combine data-driven deep learning methods with legal element knowledge
- Utilize elements as additional supervision signals to improve the performance on low-frequency charges
Legal Elements for interpretable prediction
- Existing methods usually suffer from the lack of interpretability, which may lead to ethical issues
- Following the principle of elemental trial, QAJudge is proposed to visualize the prediction process and give interpretable judgments
- QAJudge can achieve comparable results with SOTA models, and provide explanation for the prediction results
Legal Logic Knowledge
- Topological dependencies between subtasks
- There exists a strict order among the subtasks of legal judgment
- Capture the dependencies with recurrent neural network unit
Legal Regulations
- Legal regulations are one of the most important knowledge bases for legal intelligence systems
- Compared to structured legal knowledge, unstructured legal regulations do not require manual knowledge summarization, so the cost of acquiring such knowledge is much lower
Legal Regulations for Judgement Prediction
- The judgement results are predicted based on both the fact descriptions and relevant law articles
- The aggregation is performed via the attention mechanism
Legal Regulations for Question Answering
- Legal QA requires textual legal regulations, semantic retrieval, and cognitive reasoning
Legal Knowledge-Guided Methods
- Legal Event Knowledge
- Legal Element Knowledge
- Legal Logic Knowledge
- Legal Regulation Knowledge
- ……
Advantages
- Learn from limited labelled data
- Improve the reasoning ability
Demo: https://law.thunlp.org/
Quantitative Analysis for Legal Theory
Mining patterns from a large number of case documents to improve or supplement legal theory
Common Law System
- The outcome of a new case is determined mostly by precedent cases, rather than by existing statutes
- Halsbury believes that the arguments of the precedent are the main determinant of the outcome.
- Goodhart believes that what matters most is the precedent’s facts.
Mutual information test
Legal Fairness Analysis
- Motivation: Fairness is one of the most important principles of justice. The ability to quantitatively analyze the fairness of cases can help to implement judicial supervision and promote fairness and justice.
- Goal: to perform fairness analysis on large-scale real-world data
- Similar cases should be judged similarly!
- Train different virtual judges (sentence prediction models) and calculate their disagreements using standard deviations
- Synthetic datasets: we construct biased datasets by keeping facts the same and perturbing the term of penalty randomly with $\beta$ as the inconsistency factor
- The proposed method achieves a high correlation with the gold inconsistency factor
- Inconsistency is negatively correlated with the severity of the charges, i.e., felonies are sentenced more consistently than misdemeanors
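A toy sketch of the disagreement measure described above: several "virtual judges" predict the term of penalty for the same cases, and the standard deviation across judges serves as the inconsistency score (the predictions below are synthetic).

```python
import numpy as np

predictions = np.array([
    [12.0, 36.0, 6.0],   # judge 1: predicted months for three cases
    [10.0, 40.0, 7.0],   # judge 2
    [15.0, 33.0, 6.5],   # judge 3
])
inconsistency_per_case = predictions.std(axis=0)   # disagreement per case
print(inconsistency_per_case, inconsistency_per_case.mean())
```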
Future Directions
More Data
- Legal case documents: 120 million
- Trademarks and patents: tens of millions
- Legal consultation: tens of millions of LegalQA pairs
More Knowledge
- Laws and regulations: 1,000+
- Judicial interpretations: 1,000+
- Legal literature: hundreds of legal journals
More Interpretability: Providing explanation for answers
More Intelligence: Manipulating tools for cognitive intelligence
L9 BM x Brain Science
Magic Shared by the Brain and PLMs
Knowledge: Language-derived representation -> Sensory-derived representation
Shared computational principles for language processing
- Principle 1: next-word prediction before word onset.
- Principle 2: pre-onset predictions are used to calculate post-word-onset surprise.
- Principle 3: contextual vectorial representation in the brain.
Revealing the magic of language
- Function
- Representation: Note that the semantic representations derived from language input do not possess feelings or experiences of the world. Such representations do reflect perceptual (size), abstract (danger), and even affective (arousal and valence) properties of concepts. Semantic Representation is similar to human Mental Representation.
- Structure: machine model vs. human brain model
The next question – Towards an understanding of intelligence
- computational models
- symbolic models
- connectionist models
- biological neural models
- brain-activity data
- cell recordings
- fMRI
- EEG,MEG
- behavioral data
- reaction time
- errors
- explicit judgements
Neuron Activation
Neurons in PLMs
Background: Neurons in FFNs
- Transformer Architecture
- Feed Forward Neural Network
Sparse activation phenomenon
- Sparse Activation Phenomenon in Large PLMs
- 80% of inputs activate fewer than 5% of the FFN neurons
- No useless neuron that keeps inactive for all inputs
- Related to Conditional Computation
- Constrains a model to selectively activate parts of the neural network according to input
Cumulative distribution function (CDF) of the ratio of activated neurons in FFNs. Use T5-large (700 million parameters).
Conditional computation
- Deep Learning of Representations: Looking Forward (Bengio, 2013)
- Pathways (Jeff Dean, 2021)
- Today’s models are dense and inefficient
- Pathways will make them sparse and efficient
MoEfication
- Mixture-of-experts (MoE)
- Use MoE to increase model parameters with tiny extra computational cost
- Split existing models into multiple experts while keeping model size unchanged
- Expert Construction
- Group the neurons that are often activated simultaneously
- Parameter Clustering Split
- Treat the columns of $W_1$ as a collection of vectors
- K-means
- Co-Activation Graph Split
- Construct a coactivation graph
- Each neuron is represented by a node
- Edge weight between two nodes is their co-activation value
- Assign a score to each expert and select the experts with high scores
- Groundtruth Selection: Calculate the number of positive neurons in each expert as $s_i$
- Parameter Center: Average all columns of $W_1$ and use it as the center
- Learnable Router: Learn a router from the groundtruth on the training set
- Sparsity of different T5 models; MoEfication with different T5 models -> selecting 20% of the xlarge model's parameters retains 98% of its performance, and the effect improves as the model scales up.
- Observations on routing patterns: some experts are selected far more often than others (the load is not balanced), and different experts specialize differently.
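The Parameter Clustering Split above can be sketched with scikit-learn's k-means: each FFN neuron is represented by its input weight vector (a column of $W_1$), and neurons in the same cluster form one expert. All sizes and the random weight matrix here are placeholders, not course code.

```python
import numpy as np
from sklearn.cluster import KMeans

d_model, d_ff, num_experts = 512, 2048, 32

W1 = np.random.randn(d_model, d_ff)      # stand-in for a trained FFN weight; columns correspond to neurons
neuron_vectors = W1.T                    # one vector per neuron
labels = KMeans(n_clusters=num_experts, n_init=10).fit_predict(neuron_vectors)

experts = [np.where(labels == e)[0] for e in range(num_experts)]   # neuron ids assigned to each expert
print([len(e) for e in experts[:5]])
```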
Analyze PLMs through neurons
Specific Function
Expert units
- Identify whether the activation of a specific neuron can classify a concept
- $N_c^+$ positive sentences that contain concept c and $N_c^-$ negative sentences that do not contain concept c.
Concept expertise: give each unit an index m and treat the unit as a binary classifier for the input sentences to compute AP (average precision)
Concept distribution
Expertise and generalization results: Detect the model’s ability without fine-tuning
Concept overlap: Let the overlap between concepts q and v be…
Conditioned text generation: Selected expert units to compute
Compositional explanations of neurons
- Neurons learn compositional concepts
- Compositional explanations allow users to predictably manipulate model behavior
Find neurons
- For an individual neuron, thresholding its activation
- Compare with the mask of concepts
- Search for the most similar concept
- Find logical forms induced from the concepts: compose concepts via compositional operations (AND, OR, NOT)
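A small synthetic sketch of this search: threshold a neuron's activations into a binary mask, then score candidate concepts and their logical compositions (AND / OR / NOT) by intersection-over-union with that mask. All data here is made up.

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection-over-union between two binary masks over the same inputs."""
    union = np.logical_or(mask_a, mask_b).sum()
    return np.logical_and(mask_a, mask_b).sum() / union if union else 0.0

activations = np.random.randn(1000)                         # synthetic per-input activations of one neuron
neuron_mask = activations > np.quantile(activations, 0.95)  # threshold the activation

concept_a = np.random.rand(1000) > 0.9                      # synthetic concept masks
concept_b = np.random.rand(1000) > 0.9
candidates = {
    "A": concept_a,
    "B": concept_b,
    "A AND NOT B": np.logical_and(concept_a, np.logical_not(concept_b)),
    "A OR B": np.logical_or(concept_a, concept_b),
}
best = max(candidates, key=lambda name: iou(neuron_mask, candidates[name]))
print(best, round(iou(neuron_mask, candidates[best]), 3))
```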
Tasks
- Image Classification
- Scene Recognition
- ResNet-18
- NLI
- SNLI
- BiLSTM+MLP
- Probe neurons in MLP, input is premise-hypothesis pairs
- Concepts:
- Penn Treebank POS tags + 2000 most common words
- Appear in premise or hypothesis
- Whether premise and hypothesis have more than 0%, 25%, 50%, or 75% word overlap
- Additional Operator
- NEIGHBORS(C): the union of the 5 words closest to C
- Judged by cosine similarity of GloVe embeddings
Neuron Activation
Transferability Indicator
Recap prompt tuning:
- Training
- Transferability: Cross-Task Transfer
Prompt transfer -> Cross-Task Transfer (Zero-shot) -> For the tasks within the same type, transferring prompts between them can generally perform well.
Transferability indicator
- Motivation: Explore why the soft prompts can transfer across tasks and what decides the transferability between them
- Embedding Similarity
- Euclidean similarity
- Cosine similarity
- Model Stimulation Similarity (ON)
- Activated Neurons
- ON has a higher Spearman's correlation with transferability
- ON works worse on larger PLMs because of their higher redundancy
Activated neurons in a PLM -> Distribution of Activated Neuron -> The activated neurons are common in the bottom layers but more task-specific in top layers.
Activated Neurons Can Reflect Human-Like Emotional Attributes
Question: can PLMs learn human-like emotional attributes during pre-training?
How do humans recognize different emotions ?
- Human
- PLM (Activated Neurons)
Correlation -> Represent 27 emotions with human attributes and activated neurons
Activated neurons for every attribute
Remove neurons for an attribute
Demo: https://github.com/thunlp/Prompt-Transferability Find: Activated Neurons Demo [Colab link]
Activated Neurons
- Load Pre-trained Language Model (Roberta)
- Load the prompts (checkpoints) - 27 Emotion Tasks
- Activate Neurons
- Activated neurons in each layer -> Input: ['realization', 'surprise', …, 'remorse']
- Cosine Similarity of Activated Neurons -> Input: [‘realization’, ‘surprise’, …, ‘remorse’]
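The last two demo steps can be sketched with synthetic data: represent each task by its mean FFN activation pattern and compare tasks via cosine similarity (in the real demo the activations come from RoBERTa with the 27 emotion-task prompts).

```python
import torch

neurons_task_a = torch.relu(torch.randn(3072))   # synthetic mean activations for one emotion task
neurons_task_b = torch.relu(torch.randn(3072))   # synthetic mean activations for another task
similarity = torch.nn.functional.cosine_similarity(neurons_task_a, neurons_task_b, dim=0)
print(float(similarity))
```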
Cognitive Abilities of Big Models
Task generalization of PLM
Question: why can PLMs easily adapt to various NLP tasks even with small-scale data?
PLM acquires versatile knowledge during pre-training, which can be leveraged to solve various tasks
Cognitive abilities of PLMs
- Recent studies have shown that PLMs also have cognitive abilities, and can manipulate existing tools to complete a series of complex tasks
Fundamentals & framework
- Imitation learning in RL
- Learning from behaviors instead of accumulating rewards
- State-action pairs
- State as features and action as labels
- Target: imitate the trajectory of behaviors
- Large-scale pre-trained models
- Universal knowledge learned from pre-training
- Interactive space
- An environment that models could interact with
- State space: display states and changes
- Action space: a set of pre-defined actions in the environment
- Given a goal, we model each action and state to achieve the goal in a unified space by a PLM
- Tokenization
- Tokenize human behaviors (actions in the action space) and states in the state space to a same space
- The tokenized information could be processed by PLM
- Directly training
- The behaviors could be autoregressively predicted
- Imitation learning in RL
Interactive space:
- Search engine
- WebShop
- Sandbox
Search engine
- A rising challenge in NLP is long-form QA -> A paragraph-length answer is generated in response to an open-ended question
- The task has two core components: information retrieval and information synthesis
- WebGPT
- Outsource document retrieval to the Microsoft Bing Web Search API
- Utilize unsupervised pre-training to achieve high-quality document synthesis by fine-tuning GPT-3
- Create a text-based web-browsing environment that both humans and language models can interact with
- Text-based web-browser
- WebGPT-produced answers are preferred over human-generated ones
- Better coherence and factual accuracy
- An example -> How does neural networks work?
WebShop
- WebShop for online shopping (HTML mode)
- Simple mode which strips away extraneous meta-data from raw HTML into a simpler format
- Actions in WebShop
- Item rank in search results when the instruction is directly used as search query
- Model implementation
- Results
Sandbox
- Video PreTraining (VPT) on Minecraft -> A sandbox like Minecraft is a good interactive space
- Video PreTraining (VPT) on Minecraft -> Define discrete actions in the interactive space
- Cost
- Use behavior model to annotate unlabeled 70 hours video
- Reduce the cost: 1,400,000 -> 130,000
- Annotation trick
- At first, casually playing MineCraft
- Play specific tasks (Equip Capabilities)
- Results -> VPT accomplishes tasks impossible to learn with RL alone, such as crafting planks and crafting tables (tasks that require roughly 970 consecutive actions, even for a proficient human)
- Results -> An example for killing a cow
Challenges & limitations
- Building interactive space is time-consuming
- Labeling is expensive and labor-intensive
- The goal must be clear and simple
- Only discrete actions and states are supported
- A clean interactive space is required