Maxime Méloux Christophe Cerisara
Université de Lorraine, CNRS, LORIA, Nancy, France
Abstract
Teaching new information to pre-trained large language models (PLMs) is a crucial but challenging task. Model adaptation techniques, such as fine-tuning and parameter-efficient training, have been shown to store new facts at a slow rate; continual learning is an option but is costly and prone to catastrophic forgetting. This work studies and quantifies how PLMs may learn and remember new world knowledge facts that do not occur in their pre-training corpus, which only contains world knowledge up to a certain date. To that purpose, we first propose Novel-WD, a new dataset consisting of sentences containing novel facts extracted from recent Wikidata updates, along with two evaluation tasks in the form of causal language modeling and multiple choice questions (MCQ). We make this dataset freely available to the community, and release a procedure to later build new versions of similar datasets with up-to-date information. We also explore the use of prefix-tuning for novel information learning, and analyze how much information can be stored within a given prefix. We show that a single fact can reliably be encoded within a single prefix, and that the prefix capacity increases with its length and with the base model size.
Novel-WD: Exploring acquisition of Novel World Knowledge in LLMs Using Prefix-Tuning
1 Introduction
Pre-trained language models (PLMs or LLMs) (Chiang et al., 2022) are typically trained on raw texts with a self-supervised loss and further adapted to downstream tasks with, e.g., fine-tuning (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2019). Hence, the world knowledge that PLMs have acquired predates the cut-off date of their pre-training corpus (Alivanistos et al., 2022; Kucharavy et al., 2023). A major challenge is then how to reliably teach PLMs novel factual knowledge.
Fine-tuning has been one of the main proposed approaches to adapt pre-trained models to new tasks and domains. However, full model fine-tuning can lead to catastrophic forgetting (French, 1999; Kirkpatrick et al., 2017), and can be costly when performed on large models (Strubell et al., 2020). Furthermore, Wei et al. (2023) showed that when fine-tuning a model on a small corpus with new information, the model may instead learn to hallucinate unseen facts.
Parameter-efficient fine-tuning (PEFT) methods have emerged as a lightweight alternative to full model fine-tuning, in which only a small fraction of the parameters of the original model are modified, using methods such as prefix-tuning (Li and Liang, 2021), adapter-tuning (He et al., 2021) or LoRA (Hu et al., 2021). In-context learning (Logan IV et al., 2022), prompting (Liu et al., 2023b) and prompt-tuning (Lester et al., 2021) are currently amongst the most reliable ways to inject new knowledge into PLMs.
In this study, we focus on prefix-tuning (Li and Liang, 2021), a fine-tuning method in which the pre-trained model parameters are kept frozen, but a few small continuous vectors called the prefix are optimized. Based on the idea that context can steer a language model without changing its parameters, prefix-tuning optimizes the model's context as one or several continuous vectors corresponding either to embeddings or to key-value pairs in attention layers, whose effects are propagated to all activation layers and subsequent tokens.
Wang et al. (2022) and Liu et al. (2022a) showed that novel knowledge can efficiently be contextually fed into large language models through prompting. However, the size of a prompt in a given model is limited by the context size of that model. In this paper, we view prefix-tuning as a generalized form of prompting that takes continuous values and has controllable depth and length, and as such, we hypothesize that this method can reliably store significant amounts of factual information. This is backed by the findings of Kossen et al. (2023), who argue that in-context learning enables a model to learn information. Our goal is therefore to investigate this question in the case of prefix-tuning, and more specifically how much knowledge can be compressed into the prefix. In addition, by using prefix-tuning rather than LoRA, fine-tuning or adapters, we hope to avoid the hallucination problem mentioned in Wei et al. (2023) by working with (generalized) prompts without modifying the existing model weights.
Figure 1 summarizes our proposed approach, which exploits recent Wikidata updates to automatically generate a corpus of new facts: Novel-WD. We then propose a nearly automatic procedure to create a dynamic benchmark from this corpus of facts that evaluates updated LLMs in terms of perplexity, new facts generation and accuracy on multiple-choice question-answering. We then evaluate and show that prefix-tuning performs better than LoRA for new facts learning on this dataset.
2 Related work
Adapting models to new tasks is a relatively old problem. Yoon et al. (2018) showed that dynamically expandable networks can obtain good performance in this setting by slowly increasing model capacity. Lin et al. (2022a) explored the task of improving the accuracy of Transformer models on out-of-distribution data streams using continual model refinement (CMR) to maximize the diversity of training samples in a non-stationary distribution. Razdaibiedina et al. (2023) showed that using a collection of progressively growing prompts alleviates catastrophic forgetting and increases model generalization capacities across tasks.
Many studies have explored how information storage functions within the Transformer architecture. Elhage et al. (2022) gave a comprehensive overview of the Transformer architecture under the lens of mechanistic interpretability. Geva et al. (2021) showed that the feedforward layers of Transformer models act similarly to key-value memories in information retrieval systems. Based on that work, Mitchell et al. (2021) introduced MEND, a framework that leverages a group of small networks to successfully perform local factual edits within the feedforward layers of a large Transformer model. Meng et al. (2022b, a) expanded on this idea by using causal inference to locate the attention feedforward layer containing a given fact and editing the corresponding matrix as a constrained optimization problem.
In contrast, several approaches for storing new information within a language model have been proposed. One such approach is the use of flexible external memories, as exemplified in Wu et al. (2021, 2022). Another, more dynamic method is that of retrieval systems, which can leverage external knowledge bases, including the Web, for that purpose. Examples of such works include Guu et al. (2020), Lewis et al. (2020), Borgeaud et al. (2021) and Liu et al. (2023a). Finally, new information can be stored in the short term through methods such as prompt-tuning (Liu et al., 2021, 2022b).
In terms of evaluation, Petroni et al. (2019) is an early attempt at measuring relational and factual knowledge within PLMs. Zhu et al. (2020) proposed new, information-theory based evaluation metrics for factual knowledge. Kadavath et al. (2022) and Lin et al. (2022b) focused on measuring model uncertainty as a way to distinguish known facts from hallucinated ones. Jang et al. (2021, 2022) introduced the framework TemporalWiki, which, like our work, includes a process to generate datasets and benchmarks from information extracted from Wikipedia. However, their framework targets large-scale continual learning while we focus on the factual knowledge acquisition point of view (detailed next). This difference in perspective leads to important differences in terms of types of inputs (facts vs. texts), number of inputs, type and learning efficiency of the tested adaptation methods with respect to the number of parameters, and evaluation metrics (perplexity vs. factual MCQ accuracy). Yu et al. (2023) detailed the creation of a large and refined benchmark, specifically tailored to measure world knowledge within PLMs. Kasai et al. (2022) proposed a continual MCQ benchmark for world knowledge, updated every week with new questions about recent events extracted from news websites.
Yang and Liu (2021) successfully used prefix-tuning to adapt a PLM for text classification, while Ma et al. (2022) used the same method for speech-to-text translation. Prefix-tuning was also shown to obtain good performance in natural language understanding (Lester et al., 2021), summarization (Chen et al., 2023) and sentiment analysis (Balakrishnan et al., 2022), inter alia. Zhao et al. (2022) showed that prefix-tuning may also be used for efficient domain adaptation.
Parameter-efficient training methods, such as LoRA and prefix-tuning, are often used both to continue pre-training an LLM and to adapt it to a domain. However, recent works suggest that, with LoRA and full fine-tuning, very little new factual knowledge is actually learned (Liu et al., 2024). We propose in this work to investigate this question with prefix-tuning, which is based on principles similar to those of in-context learning, a method that is known to be able to inject new knowledge. Compared to the past literature on prefix-tuning, we focus on its properties with regard to factual knowledge learning, and give concrete answers to the questions of whether and when prefix-tuning learns new factual knowledge.
3 Methodology
3.1 Research questions
As shown in the related work section, there is still no clear understanding of what is really learnt by fine-tuning methods like LoRA. In this study, we argue that prefix-tuning is a better solution to inject a small number of new facts into the LLM, which may potentially be extended (in future work) to support many facts either by retrieving the best prefix from a prefix store (à la RAG), by selecting prefixes with gating networks (à la mixture-of-experts), or by generating prefixes with a dedicated model. Concretely, the target research questions of this work are:
(i) Can a single prefix vector on the first layer learn a single fact? Does this learning generalize to reformulations of this fact?
(ii) Can a longer prefix learn multiple facts? What effect does prefix size have on learning and generalization? In-context learning suggests that the answers to this question and the previous one are positive.
(iii) In the existing literature, the prefix is usually spread across all layers of the model. However, Simoulin and Crabbé (2021) suggest that the deeper layers in Transformer models are associated with abstract and high-level capabilities, while factual information is stored in the lower layers. Does restricting the prefix depth therefore affect the learning and generalization capacities of the model?
(iv) Do the answers to the previous questions remain true with bigger models?
3.2 Facts learning
We model a fact as a semantic triple of the form (subject, predicate, object), in which the subject and object are typically noun phrases, and the predicate a verb phrase. We consider the following important properties, largely adapted from Meng et al. (2022a):
Learning: The updated LLM has learnt the fact when it can predict the object from a sentence containing the subject and predicate after being updated, while it could not predict the object before.
Generalization: The LLM is able to generalize the learned fact when it can predict the object from a paraphrase of the subject and predicate.
Specificity: The updated LLM is specific when, given a slightly different subject and predicate, it still generates the expected object for that input, which differs from the object of the learned triple.
Non-forgetting: The updated LLM still generates the correct objects that were already known by the baseline LLM.
3.3 Evaluation
Given a baseline LLM and a list of recent facts (triples), we first build a training set containing simple sentences generated from these triples (see Figure 1). We then update the model on this training set, either with prefix-tuning (our proposal) or LoRA (the baseline). The perplexities of the updated LLMs are computed on the same training set and compared: although its use is widely debated in the community, we nevertheless consider perplexity a relevant indicator of whether the LLM has learnt this training set. We then evaluate generalization by measuring the perplexity of the updated LLMs on complex, creative sentences created by reformulating the training sentences. We finally measure specificity and non-forgetting by evaluating the LLMs on existing MCQ benchmarks.
4 Dataset
In this section, we describe the steps used to create Novel-WD and give an overview of the resulting dataset. A sample output of each step of the full process is given in Table 1.
Element | Value |
---|---|
Triple | (Frances Allen, spouse, Jacob Schwartz) |
Training sentence | Frances Allen is married to Jacob Schwartz. |
Test sentence 1 | Frances Allen’s spouse is |
Test sentence 2 | The spouse of Frances Allen was |
Test sentence 3 | Frances Allen was married to |
Test sentence 4 | Frances Allen has been married to |
Test sentence 5 | The name of Frances Allen’s spouse is |
Question | Who was Frances Allen’s spouse? |
Distractor 1 | Charles Householder |
Distractor 2 | David Padua |
Distractor 3 | John Cocke |
Triple extraction
We begin by extracting RDF triples that were newly added to Wikidata. To do so, we retrieve new triples from a daily incremental database dump. We restrict ourselves to items and exclude lexemes, which represent lexicographical data. We also exclude complex triples, in which the subject or object is a Wikimedia template, as well as triples in which the subject is a numerical identifier, a filename or a URI. We then resolve any internal Wikidata links in the subject, predicate or object by replacing them with the English name of the associated item. Finally, when multiple triples share the same subject and predicate, we randomly select one such triple and discard the others, so as to limit the risk of models trying to learn multiple conflicting facts.
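The filtering and de-duplication steps above can be sketched as follows. This is only an illustration under stated assumptions: the regular expressions and the function names `is_plain_subject`, `deduplicate` and `filter_triples` are hypothetical, triples are assumed to be already resolved to English labels, and template filtering is left out.

```python
import random
import re

# Hypothetical patterns: they only approximate the filters described in the
# text (numerical identifier, URI, filename) and are not the authors' rules.
_NUMERIC_ID = re.compile(r"^\d+$")
_URI = re.compile(r"^[a-z][a-z0-9+.-]*://", re.IGNORECASE)
_FILENAME = re.compile(r"\.(jpg|jpeg|png|svg|gif|pdf|ogg)$", re.IGNORECASE)


def is_plain_subject(subject):
    """Reject subjects that look like numerical identifiers, URIs or filenames."""
    return not (
        _NUMERIC_ID.match(subject)
        or _URI.match(subject)
        or _FILENAME.search(subject)
    )


def deduplicate(triples, rng):
    """Keep one randomly chosen triple per (subject, predicate) pair."""
    groups = {}
    for subject, predicate, obj in triples:
        groups.setdefault((subject, predicate), []).append((subject, predicate, obj))
    return [rng.choice(group) for group in groups.values()]


def filter_triples(triples, seed=0):
    """Apply the subject filter, then resolve conflicting triples at random."""
    rng = random.Random(seed)
    kept = [t for t in triples if is_plain_subject(t[0])]
    return deduplicate(kept, rng)
```

Seeding the random generator is a design choice made here for reproducibility; the source does not say whether the selection was seeded.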
Training set
To generate a training set, we convert each triple into a simple sentence, by querying an 8-bit quantized version of Vicuna-13b (Chiang et al., 2023) with a two-shot prompt. For each triple, we generate one such sentence.
Two evaluation tasks
The first evaluation is a causal language modeling task (perplexity): for each triple, we ask 8-bit Vicuna-13b in a two-shot setting to generate 5 sentences in which the object of the triple is missing. In order to test for generalization capabilities and to avoid repeating the training sentence, we specifically prompt Vicuna for "creative sentences". Manual editing may then be applied to the output sentences in the infrequent situation (occurring for fewer than 10 facts) where full sentences are generated rather than incomplete ones.
The second task is a multiple choice question answering task (MCQ). For each triple, a two-shot 8-bit Vicuna-13b prompt is first applied to generate a question asking for the object of the triple. Then, a similar prompt is applied to generate 4 "likely answers" to the question. Among the 4 generated answers, we remove the ground-truth one if it is present, and select the first 3 remaining ones as distractors. After manually checking and editing the generated answers in the rare cases (3 occurrences) where they semantically overlap, we then add in the correct answer. We therefore obtain a question with 4 possible choices, exactly one of which is correct.
After all the steps above have been applied, Novel-WD consists of 338 distinct triples, and each triple has one associated training sentence, five incomplete validation sentences, one question and three distractors.
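The distractor selection described for the MCQ task can be sketched as below. The function name `build_mcq`, the case-insensitive comparison used to detect the ground truth, and the shuffling of the final choices are assumptions made for illustration; the source does not specify them.

```python
import random


def build_mcq(question, correct, generated_answers, seed=0):
    """Assemble a 4-choice question from LLM-generated candidate answers."""
    # Drop the ground truth if the generator produced it. Case-insensitive
    # matching is an assumption; the text does not give the matching rule.
    candidates = [
        a for a in generated_answers
        if a.strip().lower() != correct.strip().lower()
    ]
    # Keep the first three remaining answers as distractors.
    distractors = candidates[:3]
    if len(distractors) < 3:
        raise ValueError("need at least 3 distractors")
    choices = distractors + [correct]
    # Shuffling the answer position is an assumption, not stated in the text.
    random.Random(seed).shuffle(choices)
    return {
        "question": question,
        "choices": choices,
        "answer_index": choices.index(correct),
    }
```

With the sample from Table 1, calling `build_mcq` with the correct answer "Jacob Schwartz" and four generated answers that include it would yield the three listed distractors plus the correct answer, in shuffled order.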
5 Experimental setup
The baseline model chosen for our experiments is BLOOMZ-7b1 (Muennighoff et al., 2023). BLOOMZ-7b1 is a relatively old LLM, but one that was particularly well designed: all the fundamental architectural choices that equip recent LLMs were already there, including a large vocabulary size, which has also been adopted for instance in Gemma2. The few differences, such as grouped query attention, are designed to improve speed rather than performance, so it is reasonable to assume that the behaviour observed for BLOOMZ-7b1 translates to similar LLMs. Recent studies have also shown that, when appropriately fine-tuned, its performance matches that of state-of-the-art LLMs (Li et al., 2023). Its main drawback is its small training data, but this is largely compensated by the fact that all of its data is known, which in our view is a major advantage when aiming at rigorous scientific research.
Training was run for up to 450 epochs using the AdamW optimizer with a weight decay of 0.1 and an initial learning rate of , decreasing by a factor of 10 after 10 epochs of non-decreasing training loss. We did not project the prefix through an intermediate MLP as mentioned in Li and Liang (2021), as we found that it did not increase training stability and generally resulted in lower performance. For all of our models, prefix-tuning was implemented by learning the value of the previous key and value vectors in attention layers, resulting in two vectors per layer and per virtual token being learned, for a total of 2 × d × n vectors, where d is the prefix depth (number of layers spanned) and n the prefix length (number of virtual tokens).
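To make the role of the learned key and value vectors concrete, here is a minimal single-head, single-query sketch in pure Python. It is not the authors' implementation: the function names are hypothetical, and real models apply this per head and per layer on tensors. It only shows that the prefix enters attention as extra key/value pairs placed before the real ones, while the model's own weights stay untouched.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def attend_with_prefix(query, keys, values, prefix_keys, prefix_values):
    """Prepend the learned prefix key/value vectors to the real ones.

    Only prefix_keys and prefix_values would be trained; everything else
    comes from the frozen model.
    """
    return attend(query, prefix_keys + keys, prefix_values + values)
```

In this picture, a prefix of length n on one layer contributes n key vectors and n value vectors, which is where the count of two vectors per layer and per virtual token comes from.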
For each macro-experiment and number of facts k, we divided the 338 facts of Novel-WD into non-fully overlapping subsets of size k, and trained one copy of the baseline model on each subset. For a given k, the number of subsets was computed as . For example, for k = 3, we sampled 112 subsets of 3 facts, and trained a separate copy of BLOOMZ-7b1 on each of those 112 subsets. Training subsets were generated for values of k in {1, 2, 3, 4, 5, 8, 10, 20, 50, 100, 200}.
5.1 Evaluation
To evaluate our models in the text prediction setting, we prompt them with each of the five incomplete sentences associated with each fact from the training set, and generate the following ten tokens without sampling and with a temperature of 1. We only count an answer as correct if the model's output contains the exact answer's text, capitalization excepted, and we report the accuracy over every sentence of the test set for a given model. We also measure the proportion of learning models (pLM) for a given k, by selecting only the facts of the test set for which the baseline model does not output any correct prediction, and counting the proportion of the prefix-tuned models trained on those facts for which the test set accuracy is non-zero. In other words, learning models are models which are able to correctly predict at least one sentence completion for facts that were not known by the baseline.
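The two metrics of this section can be sketched as follows. The data layout (one list of `(baseline_hits, tuned_hits)` pairs per trained model, one pair per fact) and the decision to skip models whose facts were all already known to the baseline are assumptions made for this illustration.

```python
def is_correct(output, answer):
    """Exact-match check: the answer text must appear in the output,
    ignoring capitalization."""
    return answer.lower() in output.lower()


def accuracy(outputs, answers):
    """Fraction of test sentences whose completion contains the answer."""
    hits = sum(is_correct(o, a) for o, a in zip(outputs, answers))
    return hits / len(answers)


def proportion_learning_models(runs):
    """Proportion of models that get at least one completion right on a
    fact the baseline did not know.

    `runs` holds one entry per trained model; each entry is a list of
    (baseline_hits, tuned_hits) pairs, one pair per training fact.
    Models with no fact unknown to the baseline are skipped (assumption).
    """
    eligible = 0
    learning = 0
    for per_fact in runs:
        unknown = [tuned for base, tuned in per_fact if base == 0]
        if not unknown:
            continue
        eligible += 1
        learning += any(t > 0 for t in unknown)
    return learning / eligible if eligible else 0.0
```

Because matching is a substring test, a completion such as "jacob schwartz, a mathematician" counts as correct for the answer "Jacob Schwartz".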
To perform regression tests, we selected the SciQ (Welbl et al., 2017) and MMLU (Hendrycks et al., 2020a, b) datasets. For SciQ, we measure the accuracy of the baseline and prefix-tuned models in the MCQ setting, by using the same prompt as for Novel-WD, and selecting the choice with the lowest per-token perplexity. We apply this method to all 1,000 questions of the test set. For MMLU, we append each of the four possible completions to each sentence, and then select the one with the lowest per-token perplexity as the model's answer. This is applied to the test sets from each of the 57 categories found in the dataset. Due to computational costs, regression tests were run on a random sample of 5 prefix-tuned models for each value of k.
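Selecting an answer by lowest per-token perplexity can be illustrated with the sketch below. It assumes that the natural-log probabilities of each choice's tokens have already been obtained from the model; how they are extracted is outside this sketch, and the function names are hypothetical.

```python
import math


def per_token_perplexity(token_logprobs):
    """Exponential of the negative mean log-probability of the tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_choice(choice_logprobs):
    """Return the index of the choice with the lowest per-token perplexity.

    `choice_logprobs` holds one list of token log-probabilities per choice.
    """
    ppl = [per_token_perplexity(lp) for lp in choice_logprobs]
    return min(range(len(ppl)), key=ppl.__getitem__)
```

Normalizing by the number of tokens keeps longer answers from being penalized merely for their length, which matters when the choices differ in token count.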
6 Results and analysis
6.1 Base setup
Our initial experiment focuses on a single prefix (n = 1, d = 1), corresponding to 8,192 trainable parameters, or 0.000116% of the baseline model's parameters. For comparison, we also perform the same experiment using LoRA (rank 8) instead of prefix-tuning. We use the same training hyperparameters for both LoRA and prefix-tuning.
The proportion of prefix-tuned models with increased accuracy in the prediction setting is given in Figure 2, along with the mean accuracy (see Appendix B, Figure 4) obtained in the prediction setting for different numbers of facts.
For k ≤ 3, between 54.1% and 55.4% of the models are able to learn at least one fact over the baseline. This proportion stays fairly stable for k ≤ 10, ranging from 40.5% to 55.4%. For k = 20, it drops to 18.8%, and none of the models trained for k ≥ 50 achieved any accuracy gains over the baseline. Note that a recent work applying control theory to LLMs has shown that WikiText can be nearly perfectly predicted (at 97%) with fewer than 10 additional prompt tokens (Bhargava et al., 2024), which, from a different point of view, is broadly consistent with the limit that we have found.
The baseline model obtains a consistent accuracy ranging from 3.0% to 6.3%, suggesting that a small number of facts found in the dataset are either already known or easily deducible by the model. In contrast, the prefix-tuned models obtain a mean accuracy peaking at 29.1% for k = 3, and gradually decreasing for larger k, until the results are no longer significantly better than the baseline. This initial result suggests that during training, the prefix is usually able to select and remember 1 to 3 facts well, and up to 20 with decreasing accuracy. Furthermore, this learning is conditional on having a low enough number of facts present in the training data; having more than 10 facts seems to hamper the model's ability to learn even a single fact.
In comparison, models trained with LoRA systematically underperform prefix-tuned ones for all values of k, with a prediction accuracy peaking at 20.4% and ranging from 4.6% to 14.4% for the other values of k. Furthermore, they typically obtain pLM scores that are similar to or lower than those of prefix-tuned models. This may be due to the low rank value of 8 used in our experiments; however, rank 8 LoRA adds 3,932,160 parameters to the base model, a number which is 480 times higher than the number of parameters contained in a single prefix. We therefore argue that while LoRA may outperform prefix-tuning at higher matrix ranks, it does so in a much less cost-efficient manner than prefix-tuning.
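The parameter counts quoted above can be reproduced with a short calculation. The hidden size of 4096 and the 30 layers are taken from the public BLOOM-7b1 configuration, and the LoRA placement (the fused query/key/value projection of every layer only) is an assumption: the source does not state which matrices were adapted, but this configuration is consistent with the reported figure.

```python
HIDDEN_SIZE = 4096  # BLOOM-7b1 hidden dimension (assumed from the public config)
NUM_LAYERS = 30     # number of Transformer layers, i.e. the full depth d=30


def prefix_params(depth, length, hidden=HIDDEN_SIZE):
    """Two vectors (key and value) per layer and per virtual token."""
    return 2 * depth * length * hidden


def lora_params(rank, d_in, d_out, num_layers=NUM_LAYERS):
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted matrix."""
    return num_layers * rank * (d_in + d_out)


single_prefix = prefix_params(depth=1, length=1)
# Assumption: LoRA on the fused QKV projection (4096 -> 3 * 4096) only.
lora_r8 = lora_params(rank=8, d_in=HIDDEN_SIZE, d_out=3 * HIDDEN_SIZE)

print(single_prefix, lora_r8, lora_r8 // single_prefix)
# prints: 8192 3932160 480
```

The same helper gives 245,760 parameters for a full-depth prefix of length 1 (`prefix_params(30, 1)`) and 163,840 for a first-layer prefix of length 20 (`prefix_params(1, 20)`), both far below the rank 8 LoRA count under these assumptions.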
6.1.1 Error analysis
With k = 1, about half of the facts found in Novel-WD were not learned by a single prefix. While we could not identify meaningful semantic or content differences between the types of facts that were learned and those that were not, we report in Table 6 in Appendix A quantitative statistics comparing those two categories. For each reported statistic, the non-learned value was found to be significantly larger than the learned one, as measured using a one-sided Welch's t-test (p < 0.05). This suggests that the facts that were not successfully learned are typically longer and farther from the baseline model's distribution, both in their sentence form and in the text completion setting, which might result in an inability for prefix-tuning to sufficiently steer the model towards learning them.
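For readers unfamiliar with Welch's t-test, the statistic and its degrees of freedom can be computed as in the sketch below, using only the standard library. The function name `welch_t` is hypothetical and the samples passed to it are up to the caller; no value from the paper is reproduced here.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of
    freedom for two samples with possibly unequal variances."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of each sample mean (unbiased variances).
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

For a one-sided test that the first sample has the larger mean, the p-value is the upper tail of Student's t distribution with `df` degrees of freedom evaluated at `t`. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False, alternative="greater")` performs the whole test in one call.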
6.2 Detecting overfitting and forgetting
We report the training loss in Figure 3 and the norm of the two prefix vectors in Appendix B, Figure 5, both measured post-training in each experiment.
We observe that for k = 1, almost all experiments end with a training loss approaching zero, with the exception of a few outliers for which the loss remains high. This confirms our previous finding that the prefix is almost always able to learn a single fact, but may not be able to generalize in the prediction setting. When increasing k, the losses increase linearly up to (median value: ). For , the loss increases sharply and quickly approaches the baseline model's loss of 4.38. We interpret this inflection as consistent with our previous observations, suggesting that a change of learning mode occurs in the vicinity of : For lower values, the model is efficiently able to learn and generalize novel information, while for higher values, the model may no longer be able to store all facts and instead unsuccessfully attempts to learn a combined representation of the training set. These findings are also consistent with the evolution of the prefix norm: For , we observe a linear increase in prefix norm, which may indicate that the model does not make full use of the available prefix capacity. For , the prefix norm is nearly constant and may signal increasing compression within the prefix. Finally, for , the prefix norm decreases rapidly.
k | SciQ acc. | MMLU acc. (min) | MMLU acc. (max) | MMLU acc. (avg) |
---|---|---|---|---|
Baseline | 0.757 | 0.130 | 0.463 | 0.307 |
1 | 0.833 | 0.184 | 0.512 | 0.343 |
2 | 0.864 | 0.189 | 0.517 | 0.341 |
3 | 0.840 | 0.189 | 0.517 | 0.340 |
4 | 0.838 | 0.184 | 0.517 | 0.339 |
5 | 0.827 | 0.191 | 0.509 | 0.339 |
8 | 0.833 | 0.184 | 0.509 | 0.341 |
10 | 0.834 | 0.193 | 0.509 | 0.341 |
20 | 0.808 | 0.185 | 0.515 | 0.328 |
50 | 0.835 | 0.190 | 0.518 | 0.335 |
100 | 0.826 | 0.192 | 0.512 | 0.340 |
200 | 0.828 | 0.189 | 0.524 | 0.342 |
Finally, we report in Table 2 the results of the evaluation over SciQ and MMLU, which shows that the prefix-tuned models do not seem to forget facts learned during pre-training or incur any loss of reasoning capabilities, for any value of k. Surprisingly, our prefix-tuned models even perform consistently and significantly better than the baseline for all values of k. Our hypothesis is that, by "fine-tuning" (through a prefix) the LLM on Wikipedia-like sentences, we specialize the LLM to interpret its inputs in a more "factual way" and in the Wikipedia domain, which is useful for the type of factual MCQ questions that occur in SciQ and MMLU. However, we did not study this hypothesis in detail and leave this question open for future work.
6.3 Impact of prefix size
Table 3 contains the results obtained when prefix-tuning BLOOMZ-7b1 while varying the number of virtual tokens contained in the prefix.
k | Acc (n=1) | pLM (n=1) | Acc (n=20) | pLM (n=20) | Acc (n=100) | pLM (n=100) |
---|---|---|---|---|---|---|
1 | 0.274 | 0.541 | 0.353 | 0.601 | 0.365 | 0.619 |
2 | 0.279 | 0.548 | 0.333 | 0.613 | 0.357 | 0.607 |
3 | 0.291 | 0.554 | 0.315 | 0.589 | 0.358 | 0.616 |
4 | 0.247 | 0.464 | 0.321 | 0.607 | 0.337 | 0.619 |
5 | 0.227 | 0.493 | 0.316 | 0.582 | 0.304 | 0.612 |
8 | 0.177 | 0.405 | 0.256 | 0.524 | 0.270 | 0.452 |
10 | 0.159 | 0.485 | 0.245 | 0.601 | 0.268 | 0.512 |
20 | 0.123 | 0.188 | 0.199 | 0.500 | 0.218 | 0.500 |
50 | 0.076 | 0 | 0.116 | 0.167 | 0.113 | 0.167 |
100 | 0.053 | 0 | 0.086 | 0.400 | 0.096 | 0.400 |
200 | 0.055 | 0 | 0.063 | 0 | 0.070 | 0 |
We observe a significant improvement in accuracy for nearly all values of k when increasing the prefix size from 1 to 20, as well as significant gains in the proportion of learning models for . Similar results are obtained when increasing the prefix size from 1 to 100. However, none of the variations in accuracy or proportion of learning models between n = 20 and n = 100 is statistically significant.
We interpret those results as follows: Increasing the prefix size only modestly increases the chances for a model to be able to learn at least one fact. However, such an increase has a strong impact on the prediction capabilities of the model, which suggests that the model is able to learn more facts and to generalize better.
We hypothesize that the former may stem from the varying complexity of the facts in our dataset: for some facts, the base model may already contain information about the subject and predicate, and prefix-tuning might only be needed to learn the value of the object. A typical example of this situation can be found in facts of the type "[historical figure] was born on [date]". Conversely, there exist more complex facts for which the subject and predicate themselves might be novel, and for which the base model might not contain information. We also note that increasing the prefix size past n = 20 brings no further improvement to the learning and generalization capacities of our model, which may indicate that prefixes are inherently limited in terms of information capacity.
6.4 Impact of prefix depth
We report in Table 4 the results obtained by increasing the number of layers spanned by the prefix in our initial setup from d = 1 (minimal depth) to d = 30 (full-depth prefix).
k | Acc (d=1) | pLM (d=1) | Acc (d=30) | pLM (d=30) |
---|---|---|---|---|
1 | 0.274 | 0.541 | 0.354 | 0.590 |
2 | 0.279 | 0.548 | 0.441 | 0.667 |
3 | 0.291 | 0.554 | 0.520 | 0.768 |
4 | 0.247 | 0.464 | 0.467 | 0.690 |
5 | 0.227 | 0.493 | 0.470 | 0.731 |
8 | 0.177 | 0.405 | 0.487 | 0.690 |
10 | 0.159 | 0.485 | 0.476 | 0.789 |
20 | 0.123 | 0.188 | 0.401 | 0.813 |
50 | 0.076 | 0 | 0.275 | 0.333 |
100 | 0.053 | 0 | 0.130 | 0.800 |
200 | 0.055 | 0 | 0.101 | 0.000 |
We observe that increasing the prefix depth has a significant effect on both the accuracy and the proportion of learning models. For all values of , the average accuracy is increased by 8 to 31 percentage points, with the highest increase reached for . The highest average accuracy is obtained for k = 3, which once more suggests that up to three facts can be efficiently stored within a prefix, but performance stays comparable up to .
The second main observation is that the proportion of learning models significantly increases for all values of k except k = 200, with gains of up to 80 percentage points for k = 100. We hypothesize that increasing the prefix depth allows for more complex information to be learned and enables the model to learn at least one fact for all but the highest number of training facts. Increasing the value of d from 1 to 30 effectively multiplies the number of trainable parameters by 30, yet the resulting gains far surpass those obtained by increasing the prefix length by a factor of 100. We therefore remark that prefix depth seems to have a much stronger effect on model performance than prefix length.
6.5 Impact of base model
To investigate the effect that the type and size of the base model may have on prefix-tuning, we repeat our initial experiments on BLOOMZ-1b7, the 1.7 billion parameter version of BLOOMZ, chosen for scale comparisons.We measure the accuracy of the baseline models in the prediction setting over the entirety of Novel-WD. BLOOMZ-1b7 obtained an overall accuracy of 4.4%, while BLOOMZ-7b1 reached a similarly low value of 5.0%.
The results obtained after prefix-tuning are reported in Table 5.
k | Acc (BLOOMZ-1b7) | pLM (BLOOMZ-1b7) | Acc (BLOOMZ-7b1) | pLM (BLOOMZ-7b1) |
---|---|---|---|---|
1 | 0.293 | 0.565 | 0.274 | 0.541 |
2 | 0.273 | 0.556 | 0.279 | 0.548 |
3 | 0.262 | 0.589 | 0.291 | 0.554 |
4 | 0.213 | 0.464 | 0.247 | 0.464 |
5 | 0.189 | 0.403 | 0.227 | 0.493 |
8 | 0.152 | 0.286 | 0.177 | 0.405 |
10 | 0.112 | 0.394 | 0.159 | 0.485 |
20 | 0.085 | 0.189 | 0.123 | 0.188 |
50 | 0.053 | 0 | 0.076 | 0 |
100 | 0.045 | 0 | 0.053 | 0 |
200 | 0.039 | 0 | 0.055 | 0 |
In terms of scaling, we first note that there are no significant improvements in terms of the proportion of learning models between BLOOMZ-1b7 and BLOOMZ-7b1. This strengthens the intuition that this may be due to the inherent complexity of some facts in the dataset, and to the fact that the ability to learn a fact is already present in smaller models. However, increasing the model size has a noticeable effect on the prediction accuracy, which increases by several percentage points for . We believe that this is partially due to the scaling generalization capabilities of the models. However, as the number of trainable parameters almost doubles between BLOOMZ-1b7 and BLOOMZ-7b1, these improvements may also be explained by an increase in prefix capacity.
Finally, to give an idea of the extracted facts, the quality of the synthetic generated sentences and which facts are correctly classified by the baseline model, Table 7 in Appendix C shows a random extract of known facts and generated sentences: some facts may "leak" from the LLM pre-training corpus (e.g., Frederik Storm in Denmark), or may be guessed (e.g., Vitale Faliero, language spoken, Italian), or may be answered by chance (e.g., A View to a Kill, MPA rating, PG). This question of leakage vs. actual forecasting is discussed in more detail in Halawi et al. (2024).
7 Conclusion
In this study, we have developed a dataset for novel fact learning in pre-trained language models. We have shown that prefix-tuning can be used to learn new facts, and investigated the effect of various factors on prefix-tuning performance. Our main recommendation is to use full-depth prefixes, but to limit the prefix length to 20 virtual tokens.
We see several major avenues for future research based on this work. While we measured the effect of different factors independently, their combined effect might be different. In particular, it is hard to predict how prefix length and depth may interact. Another research direction is the use of different and more recent baseline architectures such as Mixtral (Jiang et al., 2024). Finally, a long-term goal could be to scale our approach to larger datasets, for example by using a mixture of prefixes at capacity along with a routing module. This could allow the use of a small, regular stream of new information to continually update a model.
8 Limitations
While this paper addresses the challenge of updating LLMs with novel facts, there are other types of "updates", such as language and topic drifts, that should be achieved to make the updated LLM as useful as a new LLM pretrained from scratch on an up-to-date corpus. The method described in this work cannot solve this issue. More generally, representing knowledge with triples is very limited, and can hardly encode, for instance, time-dependent and location-dependent cultural preferences, common sense and beliefs. This work is thus strongly limited in terms of the type of knowledge it can capture, but it is only a first step towards a more general LLM updating paradigm.
Another limitation is that only a few facts are injected in the LLM with our method, while continual updating of the LLM would require a constantly increasing number of facts to be added. To achieve this, our method would require an additional step to select or generate the appropriate prefixes, depending on the observed context, in a similar way to what is done with RAG or, alternatively, mixture of experts. We have not tested such an enhancement in this work, and we have only focused so far on studying the usefulness of prefix-tuning as an alternative to RAG and LoRA.
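The prefix selection step mentioned above was not implemented or tested in this work. As a purely illustrative sketch, one hypothetical way to route between stored prefixes is nearest-neighbour matching between an embedding of the observed context and one key embedding per prefix. The function names, the prefix identifiers, the toy 3-dimensional embeddings and the similarity threshold below are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_prefix(context_embedding, prefix_keys, min_similarity=0.5):
    """Return the id of the stored prefix whose key best matches the context.

    Returns None when no key reaches `min_similarity`, in which case the
    base model would be used without any prefix (an assumption of this sketch).
    """
    best_id, best_score = None, -1.0
    for prefix_id, key in prefix_keys.items():
        score = cosine(context_embedding, key)
        if score > best_score:
            best_id, best_score = prefix_id, score
    return best_id if best_score >= min_similarity else None


# Hypothetical store: one key embedding per trained prefix.
prefix_keys = {
    "facts_a": [1.0, 0.0, 0.0],
    "facts_b": [0.0, 1.0, 0.0],
}

chosen = select_prefix([0.9, 0.1, 0.0], prefix_keys)
unmatched = select_prefix([0.0, 0.0, 1.0], prefix_keys)
```

In this toy setting, a context close to the first key selects `facts_a`, while a context orthogonal to every key selects no prefix. A learned router, as in mixture-of-experts models, would replace the fixed similarity rule.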
Finally, an apparent limitation may be the size of Novel-WD, which is quite small. This is mainly because of the high cost of running the large number of experiments required in this study. Moreover, since 2020, Wikidata has grown at a rate of 7 million entities per year (see https://en.wikisource.org/wiki/Wikidata_The_Making_Of), and the filtering that we apply leads to about 32000 remaining new facts per day (as checked for 14th March 2024), so getting data at scale should not be an issue. Furthermore, although we made a few manual interventions to check for generation errors when creating the dataset and benchmarks, we are convinced that such interventions could be avoided by using better LLMs, such as Llama3-70b or Qwen-72b.
References
- Alivanistos et al. (2022) Dimitrios Alivanistos, Selene Báez Santamaría, Michael Cochez, Jan-Christoph Kalo, Emile van Krieken, and Thiviyan Thanapalasingam. 2022. Prompting as Probing: Using Language Models for Knowledge Base Construction. arXiv preprint (version 3).
- Balakrishnan et al. (2022) Sudhandar Balakrishnan, Yihao Fang, and Xiaodan Zhu. 2022. Exploring Robustness of Prefix Tuning in Noisy Data: A Case Study in Financial Sentiment Analysis. In Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP), pages 78–88, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Bhargava et al. (2024) Aman Bhargava, Cameron Witkowski, Manav Shah, and Matt Thomson. 2024. What’s the magic word? A control theory of LLM prompting. Preprint, arXiv:2310.04444.
- Borgeaud et al. (2021) Sebastian Borgeaud, A. Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, J. Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, T. Hennigan, Saffron Huang, Lorenzo Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, G. Irving, Oriol Vinyals, Simon Osindero, K. Simonyan, Jack W. Rae, Erich Elsen, and L. Sifre. 2021. Improving language models by retrieving from trillions of tokens.
- Chen et al. (2023) Chen Chen, Wei Emma Zhang, and Alireza Seyed Shakeri. 2023. Incorporating Knowledge into Document Summarization: an Application of Prefix-Tuning on GPT-2. Preprint, arXiv:2301.11719 [cs].
- Chiang et al. (2022) Cheng-Han Chiang, Yung-Sung Chuang, and Hung-yi Lee. 2022. Recent Advances in Pre-trained Language Models: Why Do They Work and How Do They Work. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Tutorial Abstracts, pages 8–15, Taipei. Association for Computational Linguistics.
- Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. LMSYS Org.
- Dai and Le (2015) Andrew M Dai and Quoc V Le. 2015. Semi-supervised Sequence Learning. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
- Elhage et al. (2022) Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2022. A Mathematical Framework for Transformer Circuits.
- French (1999) R. M. French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135.
- Geva et al. (2021) Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer Feed-Forward Layers Are Key-Value Memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: retrieval-augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of ICML’20, pages 3929–3938. JMLR.org.
- Halawi et al. (2024) Danny Halawi, Fred Zhang, Chen Yueh-Han, and Jacob Steinhardt. 2024. Approaching human-level forecasting with language models. Preprint, arXiv:2402.18563.
- He et al. (2021) Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2021. Towards a Unified View of Parameter-Efficient Transfer Learning.
- Hendrycks et al. (2020a) Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2020a. Aligning AI With Shared Human Values.
- Hendrycks et al. (2020b) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020b. Measuring Massive Multitask Language Understanding.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.
- Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models.
- Jang et al. (2022) Joel Jang, Seonghyeon Ye, Changho Lee, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, and Minjoon Seo. 2022. TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6237–6250, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Jang et al. (2021) Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Stanley Jungkyu Choi, and Minjoon Seo. 2021. Towards Continual Knowledge Learning of Language Models.
- Jiang et al. (2024) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024. Mixtral of experts. Preprint, arXiv:2401.04088.
- Kadavath et al. (2022) Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, and Jared Kaplan. 2022. Language Models (Mostly) Know What They Know. Preprint, arXiv:2207.05221 [cs].
- Kasai et al. (2022) Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir Radev, Noah A. Smith, Yejin Choi, and Kentaro Inui. 2022. RealTime QA: What’s the Answer Right Now? Preprint, arXiv:2207.13332 [cs].
- Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.
- Kossen et al. (2023) Jannik Kossen, Tom Rainforth, and Yarin Gal. 2023. In-Context Learning in Large Language Models Learns Label Relationships but Is Not Conventional Learning. Preprint, arXiv:2307.12375 [cs].
- Kucharavy et al. (2023) Andrei Kucharavy, Zachary Schillaci, Loïc Maréchal, Maxime Würsch, Ljiljana Dolamic, Remi Sabonnadiere, Dimitri Percia David, Alain Mermoud, and Vincent Lenders. 2023. Fundamentals of Generative Large Language Models and Perspectives in Cyber-Defense.
- Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
- Li et al. (2023) Cheng Li, Jindong Wang, Yixuan Zhang, Kaijie Zhu, Wenxin Hou, Jianxun Lian, Fang Luo, Qiang Yang, and Xing Xie. 2023. Large language models understand and can be enhanced by emotional stimuli. Preprint, arXiv:2307.11760.
- Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.
- Lin et al. (2022a) Bill Yuchen Lin, Sida Wang, Xi Lin, Robin Jia, Lin Xiao, Xiang Ren, and Scott Yih. 2022a. On Continual Model Refinement in Out-of-Distribution Data Streams. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3128–3139, Dublin, Ireland. Association for Computational Linguistics.
- Lin et al. (2022b) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022b. Teaching Models to Express Their Uncertainty in Words. Transactions on Machine Learning Research.
- Liu et al. (2024) James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, and Tianle Cai. 2024. Bitdelta: Your fine-tune may only be worth one bit. Preprint, arXiv:2402.10193.
- Liu et al. (2022a) Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, and Hannaneh Hajishirzi. 2022a. Generated Knowledge Prompting for Commonsense Reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3154–3169, Dublin, Ireland. Association for Computational Linguistics.
- Liu et al. (2023a) Jiongnan Liu, Jiajie Jin, Zihan Wang, Jiehan Cheng, Zhicheng Dou, and Ji-Rong Wen. 2023a. RETA-LLM: A Retrieval-Augmented Large Language Model Toolkit.
- Liu et al. (2023b) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023b. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys, 55(9):195:1–195:35.
- Liu et al. (2022b) Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. 2022b. P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 61–68, Dublin, Ireland. Association for Computational Linguistics.
- Liu et al. (2021) Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. GPT Understands, Too. Preprint, arXiv:2103.10385 [cs].
- Logan IV et al. (2022) Robert Logan IV, Ivana Balazevic, Eric Wallace, Fabio Petroni, Sameer Singh, and Sebastian Riedel. 2022. Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2824–2835, Dublin, Ireland. Association for Computational Linguistics.
- Ma et al. (2022) Yukun Ma, Trung Hieu Nguyen, and Bin Ma. 2022. CPT: Cross-Modal Prefix-Tuning for Speech-To-Text Translation. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6217–6221. ISSN: 2379-190X.
- Meng et al. (2022a) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022a. Locating and Editing Factual Associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372.
- Meng et al. (2022b) Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. 2022b. Mass-Editing Memory in a Transformer. Preprint, arXiv:2210.07229 [cs].
- Mitchell et al. (2021) Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. 2021. Fast Model Editing at Scale.
- Muennighoff et al. (2023) Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2023. Crosslingual Generalization through Multitask Finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15991–16111, Toronto, Canada. Association for Computational Linguistics.
- Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language Models as Knowledge Bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.
- Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, D. Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners.
- Razdaibiedina et al. (2023) Anastasia Razdaibiedina, Yuning Mao, Rui Hou, Madian Khabsa, Mike Lewis, and Amjad Almahairi. 2023. Progressive Prompts: Continual Learning for Language Models. Preprint, arXiv:2301.12314 [cs].
- Simoulin and Crabbé (2021) Antoine Simoulin and Benoit Crabbé. 2021. How Many Layers and Why? An Analysis of the Model Depth in Transformers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 221–228, Online. Association for Computational Linguistics.
- Strubell et al. (2020) Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2020. Energy and Policy Considerations for Modern Deep Learning Research. Proceedings of the AAAI Conference on Artificial Intelligence, 34(09):13693–13696.
- Wang et al. (2022) Jianing Wang, Wenkang Huang, Minghui Qiu, Qiuhui Shi, Hongbin Wang, Xiang Li, and Ming Gao. 2022. Knowledge Prompting in Pre-trained Language Model for Natural Language Understanding. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3164–3177, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Wei et al. (2023) Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V. Le. 2023. Simple synthetic data reduces sycophancy in large language models. Preprint, arXiv:2308.03958 [cs].
- Welbl et al. (2017) Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. Crowdsourcing Multiple Choice Science Questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, Copenhagen, Denmark. Association for Computational Linguistics.
- Wu et al. (2021) Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. 2021. Memorizing Transformers.
- Wu et al. (2022) Yuxiang Wu, Yu Zhao, Baotian Hu, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2022. An Efficient Memory-Augmented Transformer for Knowledge-Intensive NLP Tasks. Pages 5184–5196.
- Yang and Liu (2021) Zonghan Yang and Yang Liu. 2021. On Robust Prefix-Tuning for Text Classification.
- Yoon et al. (2018) Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. 2018. Lifelong Learning with Dynamically Expandable Networks.
- Yu et al. (2023) Jifan Yu, Xiaozhi Wang, Shangqing Tu, Shulin Cao, Daniel Zhang-Li, Xin Lv, Hao Peng, Zijun Yao, Xiaohan Zhang, Hanming Li, Chunyang Li, Zheyuan Zhang, Yushi Bai, Yantao Liu, Amy Xin, Nianyi Lin, Kaifeng Yun, Linlu Gong, Jianhui Chen, Zhili Wu, Yunjia Qi, Weikai Li, Yong Guan, Kaisheng Zeng, Ji Qi, Hailong Jin, Jinxin Liu, Yu Gu, Yuan Yao, Ning Ding, Lei Hou, Zhiyuan Liu, Bin Xu, Jie Tang, and Juanzi Li. 2023. KoLA: Carefully Benchmarking World Knowledge of Large Language Models. Preprint, arXiv:2306.09296 [cs].
- Zhao et al. (2022) Lulu Zhao, Fujia Zheng, Weihao Zeng, Keqing He, Weiran Xu, Huixing Jiang, Wei Wu, and Yanan Wu. 2022. Domain-Oriented Prefix-Tuning: Towards Efficient and Generalizable Fine-tuning for Zero-Shot Dialogue Summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4848–4862, Seattle, United States. Association for Computational Linguistics.
- Zhu et al. (2020) Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, and Sanjiv Kumar. 2020. Modifying Memories in Transformer Models.
Appendix A Learned and non-learned facts
Table 6 gives some statistics about the facts that were learned or not learned in our experiments.
Metric | Train set facts, non-learned | Train set facts, learned | Test set facts, non-learned | Test set facts, learned |
Length (characters) | 57.8 | 51.0 | 73.5 | 66.2 |
Length (tokens) | 15.5 | 13.3 | 18.2 | 15.9 |
Length of (characters) | 17.8 | 15.6 | - | - |
BLOOMZ-7b1 per-token perplexity | 4.56 | 4.30 | 4.26 | 4.18 |
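For reference, the per-token perplexity row can be read under the standard definition, namely the exponential of the mean negative log-likelihood per token. The sketch below assumes natural-log token probabilities as input; the function name is hypothetical, and how scores are aggregated across facts in Table 6 is not specified here.

```python
import math


def per_token_perplexity(token_logprobs):
    """Perplexity of one sequence from its per-token natural-log probabilities.

    Standard definition: exp of the mean negative log-likelihood per token.
    """
    if not token_logprobs:
        raise ValueError("at least one token log-probability is required")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Toy check: a model assigning probability 1/4 to every token
# has a per-token perplexity of 4 (up to floating-point rounding).
uniform_ppl = per_token_perplexity([math.log(0.25)] * 4)
```

Under this definition, lower values mean that the model finds the sentence more predictable.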
Appendix B Impact from the number of novel facts
Figure 4 complements Figure 2 by showing the mean accuracy of the models as a function of the number of facts, confirming the diminishing returns when increasing the number of new facts beyond 10.
Figure 5 complements Figure 3 by showing the two phases observed with fewer and more than 10 new facts.
Appendix C Qualitative examples
Table 7 shows examples of both generated sentences and facts that are already known by the model. All of these samples have been randomly extracted, without any cherry-picking.
Generated sentences |
The Lesser hairy-footed dunnart is also known as S. youngsoni. |
Milady de Winter died by homicide. |
Garden Warbler is also known as S. borin. |
Dylan and Cole Sprouse were born on 4 August 1992. |
Yannick Aguemon is 180 centimetres tall. |
Heinrich Hoffmann died of natural causes. |
Facts already known by the model (subject, property, value) |
Chen Lin, occupation, writer |
White Flag, language of work or name, English |
A View to a Kill, MPA rating, PG |
Corey Hart, language spoken, English |
Extinction, mitigated by, conservation efforts |
Frederik Storm, country for sport, Denmark |