PROMPT-CAP: Prompt-based Capability Assessment Protocol | Making Sure Your Model Benchmarking is Cap or Not-Cap (WIP)
Main Idea: When assessing LLMs w/ prompts, not only do test datasets need iteration and versioning, but so do the prompts.
Hence: Prompt Schema Definition
twitter.com/NLPurr/status/1647769477600399361
What does usual model training look like?
1. We divide data into 3 categories: Train, Validation and Test
2. Pick the best model based on Validation
3. Report performance on Test set
Each of these "splits" has pre-established norms.
For example, the training split should be diverse and heterogeneous.
The validation and test splits should not have any leakage or spurious correlations from the train split, etc.
But when we are testing closed LLMs, we do not know what went into the training dataset, and it is very likely that the test dataset we have or generate is now contaminated.
Add prompt based benchmarking to the mix, and we have another issue on our hands.
If we are not directly testing for "input sample"->"output", then how we write the prompt and what we ask for as an output significantly changes how we benchmark the ability of any model.
If all we have is observations, we need experimental design to "judge" any capability of a model.
We specifically need:
1. Controls
2. Experiments
3. Variables
For details, look at the quoted thread.
twitter.com/mcxfrank/status/1643296168276033538
A good way to do this is to define Prompt Based Experimental Design Schemas.
These are similar to templates, with one major difference:
They aim to vary not only the input sample but also the prompt specification, and to monitor changes from one prompt to another.
These have 4 major categories:
1. Prompts
2. Samples
3. Outputs
4. Metadata
Output evaluation or metric design is a complex problem of its own, whether we use prompt based systems or not. So, in this thread, we will touch upon schemas for the other three.
I will use the basic Theory of Mind example from the Sparks of AGI paper.
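To keep the rest of the thread concrete, here is a minimal sketch of what one record under such a schema could look like. All class and field names are my own illustrative choices, not a fixed spec.

```python
from dataclasses import dataclass

# Hypothetical record for one benchmarking run of one prompt.
# The four fields mirror the four categories above; what goes inside
# each of them is filled in as the thread goes on.
@dataclass
class PromptCapRecord:
    prompt: dict    # steers: start / process / output
    sample: dict    # examples + input (sample, question, concept)
    output: dict    # model response + evaluation (set aside in this thread)
    metadata: dict  # connectors, iteration, intuition, variation, experimental design
```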
Let's go over these one by one:
1. Prompts
We define 3 types of steering in prompt:
A. Start
B. Process
C. Output
An output steer has two components:
C1. Instructions (Constraints and Format) &
C2. Options.
For example, the Sparks of AGI prompt has only a start steer.
We can modify it to have both a start steer and a process steer (add the sentence "Let's think about it step by step").
Going further, we can also add an output instruction steer ("Respond only with the folder path and nothing else").
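As a sketch of how the steers could be recorded (the start-steer text is the paper's scenario, abbreviated here as a placeholder; the process and output steers are the sentences quoted above):

```python
# Hypothetical "prompt" entry with all three steer types annotated.
prompt = {
    "start_steer": "<original Theory of Mind scenario text from the paper>",
    "process_steer": "Let's think about it step by step.",
    "output_steer": {
        "instructions": {                       # C1: constraints and format
            "constraints": "Respond only with the folder path and nothing else.",
            "format": "a single folder path",
        },
        "options": None,                        # C2: no answer options given
    },
}
```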
2. Sample
A sample has 2 components: A. Examples and B. Input.
We do not need examples if we aim for zero-shot prompting.
Examples can be A1. Positive or A2. Negative
Input has 3 parts: B1. Sample, B2. Question, and, B3. Concept
For example, in the Theory of Mind (false-belief test) prompt, there are no examples. The prompt has a sample, a question, and an implicit concept.
We can annotate it as shown in the picture.
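A sketch of the corresponding "sample" entry (the scenario and question texts are placeholders, not the paper's exact wording):

```python
# Hypothetical "sample" entry for the false-belief prompt: zero-shot,
# with the input split into sample / question / concept.
sample = {
    "examples": {"positive": [], "negative": []},  # zero-shot: no examples
    "input": {
        "sample": "<the scenario text>",
        "question": "<the question asked about the scenario>",
        "concept": "false belief",                 # implicit: never named in the prompt
    },
}
```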
Which brings us to the last part of the proposed schema:
4. Metadata
The metadata has 5 parts:
A. Connectors
B. Iteration
C. Intuition
D. Variation
E. Experimental Design
A. Connectors
Connectors are internal to a single prompt. They A1. add, A2. negate, or A3. except (make an exception to) information present in the prompt or being tested.
For example, "He says nothing about this to Alice, and Dropbox also does not notify Alice." is a reinforcement of the belief.
B. Iteration
Iteration is changing the prompt, not the sample or the desired output.
For example, adding the words "Let's think about this step by step" is an iteration with the aim of adding a "Process steer".
We define these as:
B1. Multi-instruction
B2. Rewording
B3. Chaining
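A sketch of an iteration record for the change above (which B-label applies is itself an annotation choice; B1 is used here purely for illustration):

```python
# Hypothetical iteration record: the prompt changed, the sample and the
# desired output did not.
iteration = {
    "type": "multi-instruction",   # B1 / B2 rewording / B3 chaining
    "change": "added 'Let's think about this step by step'",
    "aim": "add a process steer",
}
```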
C. Intuition
We define intuition as the "reasoning behind the prompt". For example, the researchers aim to test Theory of Mind, specifically the false-belief test, using the prompt.
We define intuition as:
C1. Implicit
C2. Explicit
C3. Test
For example, for the Theory of Mind prompt in the Sparks of AGI paper, we can discretize it as:
C1. Implicit (Theory of Mind)
C2. Explicit (Modernization -> Unseen Photos Because of Online Service)
C3. Test (False Belief Test)
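Written out as a record, that discretization could look like this (labels copied from above):

```python
# Intuition annotation for the Sparks of AGI false-belief prompt.
intuition = {
    "implicit": "Theory of Mind",
    "explicit": "modernization -> unseen photos because of an online service",
    "test": "false-belief test",
}
```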
Why do we care? Because we need to define concepts to define variations:
D. Variation
Variations are across input samples.
D1. Output Specification -> {Generative, Discriminative}
D2. Concept -> {Similar, Opposite, Control}
D3. Task -> {Objectivity -> [Subjective, Objective]}
Why does this matter? Because we cannot have any judgement of capability without a comprehensive enough variation set.
For example,
C2. Explicit (Modernization -> Unseen Photos Because of Online Service)
with D2. Similar Concept (Unseen Object Because of Sleeping)
twitter.com/NLPurr/status/1645566810664865793
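A sketch of a variation record linking the two samples (the D1/D3 values here are illustrative guesses for this particular prompt, not something the thread pins down):

```python
# Hypothetical variation record across two input samples that probe the
# same intuition with a similar concept.
variation = {
    "output_specification": "generative",      # D1: generative / discriminative
    "concept": {
        "relation": "similar",                 # D2: similar / opposite / control
        "original": "unseen photos because of an online service",
        "varied": "unseen object because of sleeping",
    },
    "task": {"objectivity": "objective"},      # D3
}
```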
E. Experimental Design
And here, we define the basics of the experiment and the corresponding prompt.
E1. Control
E2. Previous Prompt
E2a. Prompt
E2b. Directionality
E2b1. Leakage
E2b2. Ambiguity
E2b3. Specificity
E2b4. Coverage
E3. Date
E4. Model
What would be a control? A control is a prompt run against a base model or a different kind of model where, to the best of our knowledge, we can say that the model does not exhibit the capability in question.
For example, if we can say that GPT-3.5 does not "pass" the false-belief test.
Previous prompt, Date and Model specify the testing environment and can be considered similar to controlled variables in any experimental design.
The previous-prompt record allows us to (a) note the differences we made, and (b) annotate those differences as iterations/concepts.
The last part of this is directionality.
Defining "exact" specificity, ambiguity etc is hard.
But that is what we intuitively do. We write prompts to reduce ambiguity, for example.
The schema aims to codify direction of these variables as compared to last prompt.
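Putting E1-E4 together, a sketch of an experimental-design record could look like this, with directionality stored as a direction relative to the previous prompt (the concrete values: dates, model names, directions, are illustrative):

```python
# Hypothetical experimental-design entry for one prompt.
experimental_design = {
    "control": "a model that, to our best knowledge, does not pass the false-belief test",  # E1
    "previous_prompt": {                                      # E2
        "prompt": "<full text of the previous prompt>",       # E2a
        "directionality": {                                   # E2b, relative to previous prompt
            "leakage": "decreased",
            "ambiguity": "unchanged",
            "specificity": "increased",
            "coverage": "unchanged",
        },
    },
    "date": "2023-04-16",                                     # E3 (illustrative)
    "model": "gpt-4",                                         # E4 (illustrative)
}
```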
twitter.com/NLPurr/status/1645742396402204673
For example, in this one, we could believe that the model is correct and passes the false-belief test.
But. Wait.
There is leakage. Using the word "think" or "believe" correlates directly with the belief test.
Replacing it with "look for" makes the prompt stop working.
Annotate direction!
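For this particular edit, the direction annotation could be as small as:

```python
# Hypothetical directionality note for the "think"/"believe" -> "look for" rewording.
directionality_update = {
    "change": 'replaced "think"/"believe" with "look for"',
    "leakage": "decreased",   # the wording no longer cues the belief test
}
```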
twitter.com/NLPurr/status/1645747195805827074
Ongoing work:
How do we integrate existing LLMs to make this metadata processing easier?
Thoughts and suggestions welcome.
Any particular reproducible directions to look into given your constrained mapping classification experience @srush_nlp?
If you want to learn more about the topic, here are some papers to look at.
While other papers look at either templates or PL approaches to standardize prompting these models, they do not aim to standardize experimental design.
Hence, PROMPT-CAP!
Final Note: I am on the job market & I am looking for research positions in Eval/Benchmarking/Red Teaming/Aligning/Analysis/Interpretability of LLMs.
I'd really appreciate being able to work with you all, so send me a DM in case you have an opening and we can talk.