Tan, Zhen and Beigi, Alimohammad and Wang, Song and Guo, Ruocheng and Bhattacharjee, Amrita and Jiang, Bohan and Karami, Mansooreh and Li, Jundong and Cheng, Lu and Liu, Huan ::: Large Language Models for Data Annotation: A Survey

Table of Contents

(screenshot of the paper's table of contents)

1. core aspects

  • LLM-Based Data Annotation
  • Assessing LLM-generated annotations
  • learning with LLM-generated annotations

2. other content

  • taxonomy of methods for using LLMs for data annotation
  • review of learning strategies for models trained with LLM-generated annotations
  • challenges and limitations of using LLMs for data annotation

3. typical Data annotation tasks

  1. [basic classification] categorizing raw data -> class/task label
  2. [depth] intermediate labels for contextual depth (Yu et al. 2022) [1]
  3. [reliability] assigning confidence scores to gauge annotation reliability (Lin et al. 2022) [2] (see the record sketch after this list)
  4. [output engineering] applying alignment/preference labels to tailor outputs to requirements (industrial criteria, user needs)
  5. annotating entity relationships
  6. marking semantic roles - the role an entity plays in a sentence
  7. tagging temporal sequences to capture the order of events
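
A minimal sketch of what one annotated record could look like if the first three task types above were combined; field names and values are illustrative, not from the survey:

  # hypothetical annotation record: class label (task 1), intermediate
  # rationale (task 2), and confidence score (task 3)
  annotated_example = {
      "text": "The driver honked for a full minute at a green light.",
      "label": "angry driver",                                # basic classification
      "rationale": "Prolonged honking signals frustration.",  # contextual depth
      "confidence": 0.87,                                     # reliability
  }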

4. prompt and tuning techniques for LLMs

  • Input-Output Prompting - plain prompt in, answer out
  • In-Context Learning - give demonstrations in the prompt
  • Chain-of-Thought Prompting - elicit a reasoning pathway before the answer (see the template sketch after this list)
  • Instruction Tuning - fine-tune the model on (instruction, response) pairs
  • Alignment Tuning - generate a bunch of outputs, have humans label the good ones, and tune on that feedback
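
The first three techniques differ only in what the prompt contains. A rough sketch with made-up wording (not templates from the survey):

  # illustrative prompt styles for an annotation task
  input_output = "Classify the sentiment of: 'Great battery life.'\nLabel:"

  in_context = (
      "Review: 'Terrible screen.' Label: negative\n"   # demonstration 1
      "Review: 'Love the camera.' Label: positive\n"   # demonstration 2
      "Review: 'Great battery life.' Label:"           # the query
  )

  chain_of_thought = (
      "Review: 'Great battery life.'\n"
      "Think step by step about the reviewer's attitude, "
      "then give a one-word sentiment label."
  )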

5. LLM-Based Data Annotation

The annotation task can be described as \(F(x) = y\), where

  • \(F\) represents the LLM receiving a prompt \(x\) and generating an output \(y\)
  • \(x\) is the dataset being annotated, or a single data point from it, e.g. \((1, 22.3, 0.1)\)
  • \(y\) is the label the LLM produces for the data point(s), e.g. class 1, or (class 1, class 1, class 2) for a batch

The key point is that the generated labels should align with common sense, i.e. the annotation should split the dataset into categories a human would also use, like “angry driver” and “Sunday driver”.
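
A minimal sketch of \(F(x) = y\) in code; `query_llm` is a placeholder for whatever completion API is available, not a real library call:

  # F(x) = y: build a prompt from data point x, let the LLM return label y
  def annotate(x, classes=("class 1", "class 2")):
      prompt = (
          f"Assign exactly one of the labels {list(classes)} "
          f"to the following data point:\n{x}\nLabel:"
      )
      y = query_llm(prompt).strip()  # hypothetical LLM call
      return y

  # labels = [annotate(x) for x in dataset]
  # e.g. ["class 1", "class 1", "class 2"]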

5.1. manually engineered prompts

5.1.1. zero-shot

no demonstration

  • ZeroGen [3] - generate a dataset from scratch with pretrained language models (PLMs)
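
A rough sketch of the ZeroGen idea [3]: instead of labelling existing data, prompt the PLM with a label and let it generate the input text, yielding synthetic (text, label) pairs with no demonstrations. `query_llm` is the same hypothetical call as above:

  # zero-shot dataset generation: the label conditions the generation
  def zerogen_pair(label):
      prompt = f"Write a movie review expressing {label} sentiment:\n"
      text = query_llm(prompt)  # hypothetical LLM call
      return text, label

  # synthetic_dataset = [zerogen_pair(l) for l in ("positive", "negative") * 500]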

5.1.2. few-shot

with demonstrations, where

  • selection of demonstration samples is crucial
    • let GPT-3 select random samples from the training set as demonstrations [4]
    • use another LLM to score the potential usefulness of demonstration samples (sketched after this list)
    • incorporate other types of annotations into ICL
      • SuperICL - adds confidence scores from a small language model to the demonstrations
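
A sketch of the scorer-LLM strategy above; `score_with_llm` stands in for a prompt that asks a second LLM to rate a candidate demonstration, e.g. on a 1-10 usefulness scale:

  # keep the k candidate demonstrations the scorer LLM rates highest
  def select_demonstrations(candidates, k=4):
      scored = [(score_with_llm(c), c) for c in candidates]  # hypothetical scorer
      scored.sort(key=lambda pair: pair[0], reverse=True)
      return [demo for _, demo in scored[:k]]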

5.2. alignment via pairwise feedback

align LLM behaviour with human preferences.

5.2.1. human feedback

  • humans give feedback on / rate LLM responses - quite expensive, takes a lot of effort

5.2.2. automated feedback

  • an LLM functioning as a reward model

Furthermore, Askell et al. (2021) [5] evaluated different training goals for the reward model, discovering that ranked preference modeling tends to improve with model size more effectively than imitation learning.
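
Ranked preference modeling typically trains the reward model to score the preferred response above the rejected one. A minimal sketch of that pairwise (Bradley-Terry style) loss; shapes and names are illustrative:

  import torch
  import torch.nn.functional as F

  def pairwise_preference_loss(r_chosen, r_rejected):
      # r_chosen, r_rejected: reward-model scores for the preferred and
      # rejected responses, shape (batch,); minimising this pushes
      # r_chosen above r_rejected
      return -F.logsigmoid(r_chosen - r_rejected).mean()

  loss = pairwise_preference_loss(torch.tensor([1.2]), torch.tensor([0.3]))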

6. Assessing LLM-Generated Annotations

6.1. Evaluating LLM-generated annotations

6.1.1. general approaches

6.1.2. task-specific evaluations

6.2. data selection via active learning

6.2.1. LLM as Acquisition Functions

6.2.2. LLM as Oracle Annotators

7. learning with LLM-generated annotations

7.1. target domain inference

7.1.1. Predicting Labels

7.1.2. inferring additional attributes

7.2. knowledge distillation

7.2.1. model enhancement

7.2.2. KD innovations

7.3. harnessing LLM annotation for fine-tuning and prompting

7.3.1. In-Context Learning

7.3.2. Chain-of-Thought Prompting

7.3.3. Instruction Tuning

7.3.4. Alignment Tuning

Bibliography

[1]
W. Yu et al., “Generate rather than retrieve: Large language models are strong context generators,” Arxiv, vol. abs/2209.10063, 2022, Available: https://api.semanticscholar.org/CorpusID:252408513
[2]
S. C. Lin, J. Hilton, and O. Evans, “Teaching models to express their uncertainty in words,” Transactions on Machine Learning Research, 2022, Available: https://api.semanticscholar.org/CorpusID:249191391
[3]
J. Ye et al., “ZeroGen: Efficient Zero-shot Learning via Dataset Generation,” Proceedings of the 2022 conference on empirical methods in natural language processing, 2022, Available: https://arxiv.org/abs/2202.07922
[4]
R. Shin et al., “Constrained Language Models Yield Few-Shot Semantic Parsers,” Proceedings of the 2021 conference on empirical methods in natural language processing, pp. 7699–7715, 2021, doi: 10.18653/v1/2021.emnlp-main.608.
[5]
A. Askell et al., “A general language assistant as a laboratory for alignment,” Arxiv, vol. abs/2112.00861, 2021, Available: https://api.semanticscholar.org/CorpusID:244799619
[6]
P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing,” Acm computing surveys, vol. 55, no. 9, pp. 1–35, Sep. 2023, doi: 10.1145/3560815.


Author: Linfeng He

Created: 2024-04-03 Wed 20:16