Sampling at negative temperature

Summary: Inspired by the definition of temperature in statistical mechanics and the possibility for it to be below zero, we try sampling LLaMA at $T=-0.001$. The results are maximally weird.

The notion of temperature comes from statistical mechanics. Consider a system that has states with energies $E_1, \dots, E_n$. If the system is in thermal equilibrium, the probability distribution over states is given by the Boltzmann distribution:

$$p_i = \frac{e^{-E_i/k_BT}}{\sum\limits_j e^{-E_j/k_BT}}$$

The distribution is parameterized by a single number, the temperature $T$. At lower temperatures the lowest-energy states predominate; at higher temperatures there is a more even mix.
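To make the two limits concrete, here is a minimal C++ sketch of the Boltzmann distribution for a small set of energy levels. The function name `boltzmann` and the parameter `kT` (the product $k_BT$, in the same units as the energies) are illustrative choices rather than anything from a library, and the sketch assumes $T > 0$.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Boltzmann probabilities p_i = exp(-E_i / kT) / sum_j exp(-E_j / kT).
// `kT` stands for k_B * T (illustrative name); this sketch assumes kT > 0.
std::vector<double> boltzmann(const std::vector<double>& energies, double kT) {
    assert(!energies.empty() && kT > 0.0);
    // Shift by the lowest energy so the largest exponent is 0. The shift
    // cancels in the ratio and only guards against overflow.
    const double e_min = *std::min_element(energies.begin(), energies.end());
    std::vector<double> p(energies.size());
    double z = 0.0;  // partition function (after the shift)
    for (std::size_t i = 0; i < energies.size(); ++i) {
        p[i] = std::exp(-(energies[i] - e_min) / kT);
        z += p[i];
    }
    for (double& x : p) x /= z;
    return p;
}
```

For energies `{0, 1, 2}`, calling `boltzmann` with `kT = 0.05` puts essentially all of the probability on the lowest-energy state, while `kT = 100` gives each state close to one third.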

Temperature in neural nets

At the last layer of a neural net, we apply the softmax function to the neuron activations $\{z_i\}$ to get a vector of probabilities that sum to 1:

$$p_i = \frac{e^{z_i/T}}{\sum\limits_j e^{z_j/T}}$$

Wait — this is just the Boltzmann distribution, up to a constant!^[1]

[1]	There's no minus sign in the exponent because while higher-energy states are less likely, larger logits are more likely.
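Spelled out, the correspondence is a substitution. This is a sketch that assumes we identify each logit with a negative energy, $E_i = -z_i$, and work in units where $k_B = 1$:

```latex
p_i
  = \frac{e^{-E_i/k_BT}}{\sum_j e^{-E_j/k_BT}}
  \;\xrightarrow{\;E_i = -z_i,\ k_B = 1\;}\;
  \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}
```

Under that identification the two expressions agree term by term, so the softmax temperature plays exactly the role of the thermodynamic one.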

In a language model, temperature controls how random, and hence how "creative", the generated text is. In the zero-temperature limit, the model deterministically generates the most likely token. In the infinite-temperature limit, all tokens are equally likely and the model output is random noise. For an interactive explanation, see here.

Negative temperature

What would it mean to have a temperature that is below zero? (This isn't the same as the negative Fahrenheit or Celsius temperatures we get on a cold day in Vermont — I mean below zero on an absolute scale like Kelvin).

Isn't it weird that $T=\infty$ and $-\infty$ are the same, but there's a huge discontinuity around 0? This is because temperature is not the most natural quantity to work with. It makes more sense to speak in terms of the quantity $1/k_BT$, which we call $\beta$.

Looking at the equations above, if $T < 0$ then the sign of the exponent flips. That means that the states that were previously the least likely are now the most likely, and vice versa. As temperature approaches zero from the negative side, the model output will again be deterministic — but this time, the least likely tokens will be output.

Most physical systems have an infinite number of possible states at increasingly higher energy levels. As such, there is no least likely state. So negative temperatures really only make sense in systems with a finite state space. That includes neural nets — there are a finite number of neurons in the last layer.
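Working in terms of $\beta$ makes the picture continuous: $T = \pm\infty$ both correspond to $\beta = 0$, $T \to 0^+$ is $\beta \to +\infty$, and $T \to 0^-$ is $\beta \to -\infty$. Here is a minimal sketch of a softmax parameterized by $\beta$ over a finite set of logits. The name `softmax_beta` is hypothetical, and $k_B$ is taken to be 1 so that $\beta = 1/T$.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// p_i proportional to exp(beta * z_i), with beta = 1/T (k_B set to 1).
// Hypothetical helper for illustration; works for any finite beta.
std::vector<double> softmax_beta(const std::vector<double>& logits, double beta) {
    assert(!logits.empty() && std::isfinite(beta));
    // Shift by the logit that maximizes beta * z, so every exponent is <= 0:
    // the largest logit when beta >= 0, the smallest when beta < 0.
    const double shift = beta >= 0.0
        ? *std::max_element(logits.begin(), logits.end())
        : *std::min_element(logits.begin(), logits.end());
    std::vector<double> p(logits.size());
    double total = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp(beta * (logits[i] - shift));
        total += p[i];
    }
    for (double& x : p) x /= total;
    return p;
}
```

With logits `{2, 1, 0}`, `beta = 1000` (that is, $T = 0.001$) concentrates the probability on the first entry, `beta = -1000` ($T = -0.001$) concentrates it on the last, and a tiny `beta` of either sign gives a nearly uniform distribution. The sign flip only works because the list of logits is finite, so a smallest logit exists.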

Methodology

Unfortunately, OpenAI models only allow sampling with temperatures between 0.0 and 2.0. So if we want to try this, we need a language model we can run locally. We'll use Meta's LLaMA model with llama.cpp.

Below is the function llama.cpp uses to apply the temperature to the logits before a token is sampled, slightly simplified for ease of understanding:

void llama_sample_temperature(llama_token_data_array * candidates_p, float temp) {
    for (size_t i = 0; i < candidates_p->size; ++i) {
        candidates_p->data[i].logit /= temp;
    }
}

So we can just pass in --temp -0.001? Not quite; in examples/main/main.cpp there is a check that applies greedy (most-likely) sampling for any temperature less than or equal to zero. After applying the following diff and recompiling, we're good to go:

@@ -486,7 +486,7 @@ int main(int argc, char ** argv) {
                     logits[llama_token_nl()] = nl_logit;
                 }
-                if (temp <= 0) {
+                if (temp == 0) {
                     // Greedy sampling
                     id = llama_sample_token_greedy(ctx, &candidates_p);
                 } else {

We will also want to disable repetition penalty, top-k, and top-p sampling. Here's the command we'll run:

./main -m models/7B/ggml-model-q4_0.bin --temp -0.001 --repeat-penalty 1.0 --top-k 0 --top-p 1.0 -p "Temperature is a concept"
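With those filters switched off, the patched sampling path should amount to softmax sampling over the temperature-scaled logits. Here is a self-contained toy version of that control flow. `sample_token` is a hypothetical stand-in written for this post, not llama.cpp's code, and it works on a bare vector of logits instead of `llama_token_data_array`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Toy stand-in for the patched branch (hypothetical, not llama.cpp's code).
// Returns the index of the chosen token.
int sample_token(std::vector<float> logits, float temp, std::mt19937& rng) {
    assert(!logits.empty());
    if (temp == 0.0f) {
        // Greedy sampling: index of the largest logit.
        return static_cast<int>(
            std::max_element(logits.begin(), logits.end()) - logits.begin());
    }
    // Same operation as llama_sample_temperature: divide each logit by temp.
    // A negative temp reverses the ordering of the logits.
    for (float& l : logits) l /= temp;
    // Softmax weights, shifted by the largest scaled logit for stability.
    const float max_l = *std::max_element(logits.begin(), logits.end());
    std::vector<double> weights(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        weights[i] = std::exp(static_cast<double>(logits[i] - max_l));
    }
    // discrete_distribution normalizes the weights internally.
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}
```

For logits `{5, 1, -3}`, both `temp = 0` and `temp = 0.001` return index 0 (the most likely token), while `temp = -0.001` returns index 2 (the least likely one), because the other weights underflow to zero.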

Results

When running this prompt at $T=0.001$, here is the output:

Temperature is a concept that is used to describe the degree of hotness or coldness of a substance. The temperature of a substance is measured by the kinetic energy of its molecules. The higher the temperature of a substance, the more kinetic energy its molecules have.

Now, running it at $T = -0.001$:

Temperature is a concept Хронологија

This means that Хронологија is the least likely token to follow this prompt. After one token, the program seems to hang. Perhaps the LLM is trying to speak an unspeakable sentence?

Let's try it with LLaMA-13B. At $T=0.001$ :

Temperature is a concept that is used to describe the degree of hotness or coldness of a substance. The temperature of a substance is measured by a thermometer.

At $T = 1000000$ (which should be close to entirely random):

Temperature is a concept fixesля conven Beng aer situation ton '\Cr villa known vide among entities Ukraine keeps水 Newton Betty Mih різ killedтельство Victoria WeekRefreshbodyunct vir Ja Демо remporte settembre excell succeed fitted))); moy PC highlight located Referencias extendsconfigure\\ incidentWilajagateсия bibli journalist rec cont sovientlyillery恋 finishingც政 rotationintonosti orbiteditor

And at $T = -0.001$:

Temperature is a concept]& ]{'archividonnées Kontrola⊤ Kontrola Außer Хронологија costa Хронологија Хронологија Mitchell ez entfernesterd bidZyg entferne osc accom Begriffsklärлист Bedeut WendarchiviicanINCTpenastown Krieg Хронологија loyal vallIAL listade GemeinsBUGiskoshiftpenas ligapenas Хронологијаisko jú Marian Хронологија governor(* Kontrolapenasouw entferne Хронологија Хронологија Dic hornрем earliestантаpenas Promiseriatrarout23;'archividonnées Kontrola⊤ Kontrola Außer Хронологија costa Хронологија Хронологија Mitchell ez entfernesterd bidZyg entferne osc accom Begriffsklärлист Bedeut WendarchiviicanINCTpenastown Krieg Хронологија loyal vallIAL listade GemeinsBUGiskoshiftpenas ligapenas Хронологијаisko jú Marian Хронологија governor(* Kontrolapenasouw entferne Хронологија Хронологија Dic hornрем earliestантаpenas Promiseriatrarout

The generation continues on and on. At first glance it seems random, but in a sense it should be even less comprehensible than a random sequence of tokens! There are repeated words, sure, but at every step the sampler picks the token the model considers least likely given what came before.

Why are certain tokens like Хронологија and entferne repeated? Searching for these words, I found this comment by scottviteri on LessWrong. He points out that these are some of the tokens that are closest to the centroid in LLaMA's embedding space. That means that LLaMA has very little idea what these tokens mean. In the case of ChatGPT, tokens near the centroid have anomalous properties, so these tokens are likely to have similar effects on LLaMA. Let's give that a try, at $T=0.001$.

LLaMA is perfectly capable of repeating most words, even nonsense ones:

Human: Repeat the word " antferne".
Assistant: Okay, I will repeat the word " antferne".

But it is incapable of outputting this anomalous token:

Human: Repeat the word " entferne".
Assistant: Okay, I will repeat the word "get".

The anomalous tokens that are the most likely completions at negative temperatures are the least likely completions at positive temperatures, so much so that the model refuses to generate them even in cases where they would be appropriate.

If you'd like to cite this article, you can use this:

@misc{Kauffman2023negative-temperature,
  author = "Derik Kauffman",
  title = "Sampling at negative temperature",
  year = 2023,
  howpublished = "Blog post",
  url = "https://cavendishlabs.org/blog/negative-temperature/"
}
