Temperature is a hyperparameter that controls the model’s randomness.
Temperature is named after the concept of temperature in statistical mechanics, where higher temperature leads to more random states.
1. Temperature
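Neural networks commonly use a softmax layer to convert the model’s raw output scores (logits) into a probability distribution over classes. With temperature, the probability of class $i$ is computed from the logits $z_1, \dots, z_n$ as:
$$p_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{n} \exp(z_j / T)}$$
Here, $T$ is an adjustable parameter known as temperature. The temperature is usually set between 0 and 1, but in some cases a value larger than 1 can also be a reasonable choice.
When $T = 1$, this formula reduces to the standard softmax function:
$$p_i = \frac{\exp(z_i)}{\sum_{j=1}^{n} \exp(z_j)}$$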
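The temperature-scaled softmax can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual form in which each logit is divided by the temperature before exponentiation; the function name `softmax_with_temperature` and the example logits are hypothetical, not taken from any particular library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities after dividing each logit by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtracting the maximum before exponentiating avoids overflow
    # and does not change the resulting probabilities.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probs = softmax_with_temperature(logits, temperature=1.0)  # standard softmax
print(probs)
```

With `temperature=1.0` the function behaves as the standard softmax; other positive values rescale the logits first.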
As the temperature increases, the differences between the exponentiated values become smaller. This results in a smoother probability distribution, where the probabilities are less concentrated on the most likely class.
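For example, using a temperature $T > 1$, each logit $z_i$ is divided by $T$ before exponentiation, and the calculation becomes:
$$p_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{n} \exp(z_j / T)}$$
By increasing $T$, the probability distribution becomes less peaked, and the gap between the highest probability and the others narrows.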
Conversely, decreasing the temperature (towards 0) leads to a sharper probability distribution. The probability of the most likely class increases significantly, making the model appear more “confident” in its top prediction.
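The effect of raising and lowering the temperature can be checked numerically. The sketch below uses made-up logits and three illustrative temperature values (0.5, 1.0, 2.0), all of which are assumptions chosen only for demonstration; it records the probability of the most likely class at each setting.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits that have first been divided by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three classes

top_probability = {}
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    top_probability[temperature] = max(probs)
    print(temperature, [round(p, 3) for p in probs])
```

The top probability shrinks as the temperature grows, while the ranking of the classes stays the same, since dividing by a positive constant does not reorder the logits.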
When sampling the next token, adjusting the temperature parameter controls the output:
- Lower temperatures lead to more conservative, predictable, and focused text.
- Higher temperatures produce more diverse, creative, and sometimes unexpected results.
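To make the two cases above concrete, here is a small sampling sketch. The three-word vocabulary, the logits, and the helper `sample_token` are hypothetical; a real language model would produce logits over a much larger vocabulary, but the sampling step has the same shape.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled softmax distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]  # unnormalized probabilities
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

vocab = ["the", "a", "purple"]  # hypothetical vocabulary
logits = [2.0, 1.0, 0.1]        # hypothetical next-token scores
rng = random.Random(0)          # seeded for reproducibility

low = [vocab[sample_token(logits, 0.01, rng)] for _ in range(200)]
high = [vocab[sample_token(logits, 5.0, rng)] for _ in range(200)]
print(set(low))   # tokens seen at a very low temperature
print(set(high))  # tokens seen at a high temperature
```

At a temperature close to 0 the sampler effectively always returns the highest-scoring token, whereas at a high temperature the lower-scoring tokens are drawn regularly.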
The choice depends heavily on the specific generation task requirements.
For example:
- With the temperature set to 0.9, the model generates more imaginative ideas that can serve as inspiration.
- With the temperature set to 0.1, the model generates straightforward, factual results. Such a low temperature is suitable for technical documentation and customer support responses.
Setting the temperature too high causes the model to generate less coherent or even nonsensical text.
There is no single best temperature setting; it depends on what you expect from the model’s output.