Introduce /upload and /forget commands to chatbot
The /upload command allows you to share an image or a text file with the
assistant. In the case of images, it'll be printed in the terminal, then
fed through LLaVA so that --mmproj vision models can analyze it for you.

The /forget command may be used to erase the oldest chat messages from a
context window. This is useful for salvaging a conversation that got too
long. When it runs out of context, the chatbot will no longer exit, but
instead return control to the REPL so you can free up resources.
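
For illustration only, a session using the new commands might look
something like this (the prompt marker, file name, and replies are
hypothetical, not captured from the actual chatbot):

>>> /upload diagram.png
>>> what does this image show?
(the model describes the image, provided an --mmproj projector is loaded)
>>> /forget
(the oldest messages are dropped, freeing up context)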
jart committed Nov 10, 2024
1 parent 21af0bf commit d25c077
Showing 20 changed files with 1,242 additions and 281 deletions.
7 changes: 4 additions & 3 deletions llama.cpp/common.cpp
@@ -168,6 +168,8 @@ bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params) {

FLAGS_READY = true;
params.n_gpu_layers = llamafile_gpu_layers(params.n_gpu_layers);
FLAG_threads = params.n_threads; // [jart]
FLAG_threads_batch = params.n_threads_batch; // [jart]

return true;
}
@@ -275,7 +277,6 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
if (params.n_threads <= 0) {
params.n_threads = MIN(cpu_get_num_math(), 20); // [jart]
}
FLAG_threads = params.n_threads; // [jart]
return true;
}
if (arg == "-tb" || arg == "--threads-batch") {
@@ -284,7 +285,6 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
if (params.n_threads_batch <= 0) {
params.n_threads_batch = cpu_get_num_math(); // [jart]
}
FLAG_threads_batch = params.n_threads_batch; // [jart]
return true;
}
if (arg == "-td" || arg == "--threads-draft") {
@@ -691,9 +691,10 @@ bool gpt_params_find_arg(int argc, char ** argv, const std::string & arg, gpt_pa
params.control_vector_layer_end = std::stoi(argv[i]);
return true;
}
if (arg == "--mmproj") {
if (arg == "-mm" || arg == "--mmproj") { // [jart]
CHECK_ARG
params.mmproj = argv[i];
FLAG_mmproj = argv[i]; // [jart]
return true;
}
if (arg == "--image") {
5 changes: 3 additions & 2 deletions llama.cpp/common.h
@@ -194,8 +194,9 @@ struct gpt_params {
bool warmup = true; // warmup run
bool check_tensors = false; // validate tensor data

std::string cache_type_k = X86_HAVE(AVX512_BF16) ? "bf16" : "f16"; // KV cache data type for the K [jart]
std::string cache_type_v = X86_HAVE(AVX512_BF16) ? "bf16" : "f16"; // KV cache data type for the V [jart]
// [jart] warning: rope only supports f32 and f16
std::string cache_type_k = "f16"; // KV cache data type for the K
std::string cache_type_v = "f16"; // KV cache data type for the V

// multimodal models (see examples/llava)
std::string mmproj = ""; // path to multimodal projector
1 change: 1 addition & 0 deletions llama.cpp/ggml-cuda.cu
@@ -13341,6 +13341,7 @@ void ggml_cuda_op_rope(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
float * dst_d = (float *)dst->data;
cudaStream_t stream = ctx.stream();

// TODO[jart]: support bf16
GGML_ASSERT(ggml_is_contiguous(src0));
GGML_ASSERT(src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16);
GGML_ASSERT( dst->type == GGML_TYPE_F32 || dst->type == GGML_TYPE_F16);
2 changes: 2 additions & 0 deletions llama.cpp/ggml.c
@@ -13043,6 +13043,7 @@ static void ggml_compute_forward_rope(

const struct ggml_tensor * src0 = dst->src[0];

// TODO[jart]: support bf16
switch (src0->type) {
case GGML_TYPE_F16:
{
@@ -13067,6 +13068,7 @@ static void ggml_compute_forward_rope_back(

const struct ggml_tensor * src0 = dst->src[0];

// TODO[jart]: support bf16
switch (src0->type) {
case GGML_TYPE_F16:
{
296 changes: 230 additions & 66 deletions llama.cpp/main/README.md

Large diffs are not rendered by default.

190 changes: 137 additions & 53 deletions llama.cpp/main/main.1
@@ -147,7 +147,33 @@ flag:
.Pp
Default: empty
.It Fl n Ar N , Fl Fl n-predict Ar N
Number of tokens to predict.
Sets number of tokens to predict when generating text.
.Pp
This option controls the number of tokens the model generates in
response to the input prompt. By adjusting this value, you can influence
the length of the generated text. A higher value will result in longer
text, while a lower value will produce shorter text.
.Pp
A value of -1 will enable infinite text generation, even though we have
a finite context window. When the context window is full, some of the
earlier tokens (half of the tokens after
.Fl Fl n-keep )
will be discarded. The context must then be re-evaluated before
generation can resume. On large models and/or large context windows,
this will result in significant pause in output.
.Pp
If the pause is undesirable, a value of -2 will stop generation
immediately when the context is filled.
.Pp
It is important to note that the generated text may be shorter than the
specified number of tokens if an End-of-Sequence (EOS) token or a
reverse prompt is encountered. In interactive mode text generation will
pause and control will be returned to the user. In non-interactive mode,
the program will end. In both cases, the text generation may stop before
reaching the specified `n-predict` value. If you want the model to keep
going without ever producing End-of-Sequence on its own, you can use the
.Fl Fl ignore-eos
parameter.
.Pp
.Bl -dash -compact
.It
@@ -164,56 +190,91 @@ mode, this value sets a hard limit on how long your conversation can be.
The default is 8192 tokens. If this value is zero, then it'll be set to
the maximum context size the model allows.
.It Fl b Ar N , Fl Fl batch-size Ar N
Batch size for prompt processing.
Set batch size for prompt processing.
.Pp
Default: 2048
.It Fl Fl top-k Ar N
Top-k sampling.
Limits next token selection to K most probable tokens.
.Pp
.Bl -dash -compact
.It
0 = disabled
.El
Top-k sampling is a text generation method that selects the next token
only from the top k most likely tokens predicted by the model. It helps
reduce the risk of generating low-probability or nonsensical tokens, but
it may also limit the diversity of the output. A higher value for top-k
(e.g., 100) will consider more tokens and lead to more diverse text,
while a lower value (e.g., 10) will focus on the most probable tokens
and generate more conservative text.
.Pp
Default: 40
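As a rough sketch of the idea only (this is not llama.cpp's sampler
code; the Cand type and function name are made up), top-k filtering over
a candidate list could be written in C++ like this:

#include <algorithm>
#include <cstddef>
#include <vector>

struct Cand { int id; float p; };  // hypothetical candidate: token id + probability

// Keep only the k most probable candidates and renormalize them.
static void top_k_filter(std::vector<Cand> &cand, std::size_t k) {
    if (k == 0 || k >= cand.size()) return;  // 0 is treated as "disabled" here
    std::partial_sort(cand.begin(), cand.begin() + k, cand.end(),
                      [](const Cand &a, const Cand &b) { return a.p > b.p; });
    cand.resize(k);
    float sum = 0.0f;
    for (const Cand &c : cand) sum += c.p;
    for (Cand &c : cand) c.p /= sum;
}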
.It Fl Fl top-p Ar N
Top-p sampling.
.Pp
.Bl -dash -compact
.It
1.0 = disabled
.El
Limits next token selection to a subset of tokens with a cumulative
probability above a threshold P.
.Pp
Top-p sampling, also known as nucleus sampling, is another text
generation method that selects the next token from a subset of tokens
that together have a cumulative probability of at least p. This method
provides a balance between diversity and quality by considering both the
probabilities of tokens and the number of tokens to sample from. A
higher value for top-p (e.g., 0.95) will lead to more diverse text,
while a lower value (e.g., 0.5) will generate more focused and
conservative text.
.Pp
Default: 0.9
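Again as an illustrative sketch rather than the actual implementation
(the Cand type is the same hypothetical struct as in the top-k sketch
above), top-p keeps the smallest prefix of most-probable tokens whose
cumulative probability reaches p:

#include <algorithm>
#include <cstddef>
#include <vector>

struct Cand { int id; float p; };  // hypothetical candidate: token id + probability

static void top_p_filter(std::vector<Cand> &cand, float p) {
    if (p >= 1.0f || cand.empty()) return;  // 1.0 disables the filter
    std::sort(cand.begin(), cand.end(),
              [](const Cand &a, const Cand &b) { return a.p > b.p; });
    float cum = 0.0f;
    std::size_t keep = cand.size();
    for (std::size_t i = 0; i < cand.size(); i++) {
        cum += cand[i].p;
        if (cum >= p) { keep = i + 1; break; }
    }
    cand.resize(keep);
    float sum = 0.0f;
    for (const Cand &c : cand) sum += c.p;
    for (Cand &c : cand) c.p /= sum;
}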
.It Fl Fl min-p Ar N
Min-p sampling.
Sets minimum base probability threshold for token selection.
.Pp
.Bl -dash -compact
.It
0.0 = disabled
.El
The Min-P sampling method was designed as an alternative to Top-P, and
aims to ensure a balance of quality and variety. The parameter p
represents the minimum probability for a token to be considered,
relative to the probability of the most likely token. For example, with
p=0.05 and the most likely token having a probability of 0.9, logits
with a value less than 0.045 are filtered out.
.Pp
Default: 0.1
Default: 0.05
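The arithmetic above maps directly onto a small filter; a hedged sketch
(not the real sampler code, with an invented Cand type):

#include <algorithm>
#include <vector>

struct Cand { int id; float p; };  // hypothetical candidate: token id + probability

// Drop candidates whose probability is below p times the probability of
// the single most likely token, e.g. p=0.05 with a 0.9 top token gives a
// cutoff of 0.045, matching the example above.
static void min_p_filter(std::vector<Cand> &cand, float p) {
    if (p <= 0.0f || cand.empty()) return;  // 0.0 disables the filter
    float max_p = 0.0f;
    for (const Cand &c : cand) max_p = std::max(max_p, c.p);
    const float cutoff = p * max_p;
    cand.erase(std::remove_if(cand.begin(), cand.end(),
                              [cutoff](const Cand &c) { return c.p < cutoff; }),
               cand.end());
}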
.It Fl Fl tfs Ar N
Tail free sampling, parameter z.
.Pp
.Bl -dash -compact
.It
1.0 = disabled
.El
.Pp
Default: 1.0
Enables tail free sampling with parameter z.
.Pp
Tail free sampling (TFS) is a text generation technique that aims to
reduce the impact of less likely tokens, which may be less relevant,
less coherent, or nonsensical, on the output. Similar to Top-P it tries
to determine the bulk of the most likely tokens dynamically. But TFS
filters out logits based on the second derivative of their
probabilities. Adding tokens is stopped after the sum of the second
derivatives reaches the parameter z. In short: TFS looks how quickly the
probabilities of the tokens decrease and cuts off the tail of unlikely
tokens using the parameter z. Typical values for z are in the range of
0.9 to 0.95. A value of 1.0 would include all tokens, and thus disables
the effect of TFS.
.Pp
Default: 1.0 (which means disabled)
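A rough, assumption-laden sketch of the same idea in C++ (not the
actual implementation; renormalization of the survivors is omitted):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Cand { int id; float p; };  // hypothetical candidate: token id + probability

// Sort by probability, take absolute second differences, normalize them,
// and cut off the tail once their cumulative weight exceeds z.
static void tfs_filter(std::vector<Cand> &cand, float z) {
    if (z >= 1.0f || cand.size() <= 2) return;  // 1.0 disables the filter
    std::sort(cand.begin(), cand.end(),
              [](const Cand &a, const Cand &b) { return a.p > b.p; });
    std::vector<float> d2(cand.size() - 2);
    float total = 0.0f;
    for (std::size_t i = 0; i + 2 < cand.size(); i++) {
        const float first1 = cand[i].p - cand[i + 1].p;
        const float first2 = cand[i + 1].p - cand[i + 2].p;
        d2[i] = std::fabs(first1 - first2);
        total += d2[i];
    }
    if (total <= 0.0f) return;
    float cum = 0.0f;
    std::size_t keep = cand.size();
    for (std::size_t i = 0; i < d2.size(); i++) {
        cum += d2[i] / total;
        if (cum > z) { keep = i + 1; break; }
    }
    cand.resize(keep);
}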
.It Fl Fl typical Ar N
Locally typical sampling, parameter p.
Enables locally typical sampling with parameter p.
.Pp
Locally typical sampling promotes the generation of contextually
coherent and diverse text by sampling tokens that are typical or
expected based on the surrounding context. By setting the parameter p
between 0 and 1, you can control the balance between producing text that
is locally coherent and diverse. A value closer to 1 will promote more
contextually coherent tokens, while a value closer to 0 will promote
more diverse tokens. A value equal to 1 disables locally typical
sampling.
.Pp
Default: 1.0 (which means disabled)
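A hedged sketch of the mechanism (assuming all candidate probabilities
are positive; this is not the real sampler code):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Cand { int id; float p; };  // hypothetical candidate: token id + probability

// Rank tokens by how close their surprisal (-log p) is to the entropy of
// the whole distribution, then keep the smallest such set whose
// cumulative probability reaches p.
static void typical_filter(std::vector<Cand> &cand, float p) {
    if (p >= 1.0f || cand.empty()) return;  // 1.0 disables the filter
    float entropy = 0.0f;
    for (const Cand &c : cand) entropy += -c.p * std::log(c.p);
    std::sort(cand.begin(), cand.end(), [entropy](const Cand &a, const Cand &b) {
        return std::fabs(-std::log(a.p) - entropy) <
               std::fabs(-std::log(b.p) - entropy);
    });
    float cum = 0.0f;
    std::size_t keep = cand.size();
    for (std::size_t i = 0; i < cand.size(); i++) {
        cum += cand[i].p;
        if (cum >= p) { keep = i + 1; break; }
    }
    cand.resize(keep);
}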
.It Fl Fl repeat-penalty Ar N
Controls repetition of token sequences in generated text.
.Pp
.Bl -dash -compact
.It
1.0 = disabled
.El
This can help prevent the model from generating repetitive or monotonous
text. A higher value (e.g., 1.5) will penalize repetitions more
strongly, while a lower value (e.g., 0.9) will be more lenient.
.Pp
Default: 1.0
Default: 1.1
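As a simplified sketch of how such a penalty can be applied to raw
logits (names and signature are invented for illustration):

#include <cstddef>
#include <unordered_set>
#include <vector>

// Tokens that already appear in the recent history have their logit
// pushed away from selection; the scaling direction depends on the sign
// so both positive and negative logits become less likely.
static void apply_repeat_penalty(std::vector<float> &logits,
                                 const std::vector<int> &last_tokens,
                                 float penalty) {
    if (penalty == 1.0f) return;  // 1.0 disables the penalty
    const std::unordered_set<int> recent(last_tokens.begin(), last_tokens.end());
    for (int id : recent) {
        if (id < 0 || static_cast<std::size_t>(id) >= logits.size()) continue;
        if (logits[id] > 0.0f)
            logits[id] /= penalty;
        else
            logits[id] *= penalty;
    }
}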
.It Fl Fl repeat-last-n Ar N
Last n tokens to consider for penalize.
Last n tokens to consider for penalizing repetition.
.Pp
This controls the number of tokens in the history to consider for
penalizing repetition. A larger value will look further back in the
generated text to prevent repetitions, while a smaller value will only
consider recent tokens. A value of 0 disables the penalty, and a value
of -1 sets the number of tokens considered equal to the context size.
.Pp
.Bl -dash -compact
.It
@@ -223,15 +284,15 @@ Last n tokens to consider for penalize.
.El
.Pp
Default: 64
.It Fl Fl repeat-penalty Ar N
Penalize repeat sequence of tokens.
.Pp
.Bl -dash -compact
.It
1.0 = disabled
.El
.Pp
Default: 1.1
.It Fl Fl no-penalize-nl
Disables penalization of newline tokens when applying the repeat
penalty.
.Pp
This option is particularly useful for generating chat conversations,
dialogues, code, poetry, or any text where newline tokens play a
significant role in structure and formatting. Disabling newline
penalization helps maintain the natural flow and intended formatting in
these specific use cases.
.It Fl Fl presence-penalty Ar N
Repeat alpha presence penalty.
.Pp
@@ -251,7 +312,16 @@ Repeat alpha frequency penalty.
.Pp
Default: 0.0
.It Fl Fl mirostat Ar N
Use Mirostat sampling. Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used..
Use Mirostat sampling.
.Pp
Mirostat is an algorithm that actively maintains the quality of
generated text within a desired range during text generation. It aims to
strike a balance between coherence and diversity, avoiding low-quality
output caused by excessive repetition (boredom traps) or incoherence
(confusion traps).
.Pp
Using Mirostat causes the Top K, Nucleus, Tail Free and Locally Typical
sampler parameters to be ignored.
.Pp
.Bl -dash -compact
.It
@@ -264,11 +334,22 @@ Use Mirostat sampling. Top K, Nucleus, Tail Free and Locally Typical samplers ar
.Pp
Default: 0
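A very rough sketch of the feedback loop behind Mirostat (this follows
the general v2 formulation and is not the actual implementation; names
are invented, and candidates are assumed to be sorted by descending
probability):

#include <cmath>
#include <cstddef>
#include <vector>

struct Cand { int id; float p; };      // hypothetical candidate: token id + probability
struct MirostatState { float mu; };    // running surprise bound, often started at 2*tau

// Keep only tokens whose surprisal (-log2 p) stays below the current mu.
static void mirostat_truncate(std::vector<Cand> &cand, float mu) {
    std::size_t keep = 1;  // always keep the most likely token
    while (keep < cand.size() && -std::log2(cand[keep].p) <= mu) keep++;
    cand.resize(keep);
}

// After sampling one of the survivors, nudge mu toward the target
// entropy tau at learning rate eta.
static void mirostat_update(MirostatState &st, float picked_p, float tau, float eta) {
    const float observed_surprise = -std::log2(picked_p);
    st.mu -= eta * (observed_surprise - tau);
}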
.It Fl Fl mirostat-lr Ar N
Mirostat learning rate, parameter eta.
Sets the Mirostat learning rate (eta).
.Pp
The learning rate influences how quickly the algorithm responds to
feedback from the generated text. A lower learning rate will result in
slower adjustments, while a higher learning rate will make the algorithm
more responsive.
.Pp
Default: 0.1
.It Fl Fl mirostat-ent Ar N
Mirostat target entropy, parameter tau.
Sets the Mirostat target entropy (tau).
.Pp
This represents the desired perplexity value for the generated text.
Adjusting the target entropy allows you to control the balance between
coherence and diversity in the generated text. A lower value will result
in more focused and coherent text, while a higher value will lead to
more diverse and potentially less coherent text.
.Pp
Default: 5.0
.It Fl l Ar TOKEN_ID(+/-)BIAS , Fl Fl logit-bias Ar TOKEN_ID(+/-)BIAS
@@ -352,10 +433,17 @@ Default: 32.0
.It Fl Fl ignore-eos
Ignore end of stream token and continue generating (implies
.Fl Fl logit-bias Ar 2-inf )
.It Fl Fl no-penalize-nl
Do not penalize newline token.
.It Fl Fl temp Ar N
Temperature.
Adjust the randomness of the generated text.
.Pp
Temperature is a hyperparameter that controls the randomness of the
generated text. It affects the probability distribution of the model's
output tokens. A higher temperature (e.g., 1.5) makes the output more
random and creative, while a lower temperature (e.g., 0.5) makes the
output more focused, deterministic, and conservative. The default value
is 0.8, which provides a balance between randomness and determinism. At
the extreme, a temperature of 0 will always pick the most likely next
token, leading to identical outputs in each run.
.Pp
Default: 0.8
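The mechanism amounts to dividing the logits by the temperature before
the softmax; a minimal, self-contained sketch (not the actual code):

#include <algorithm>
#include <cmath>
#include <vector>

// Lower temperatures sharpen the distribution toward the most likely
// token; higher temperatures flatten it. Assumes temp > 0.
static std::vector<float> softmax_with_temperature(std::vector<float> logits, float temp) {
    if (logits.empty() || temp <= 0.0f) return logits;
    float max_logit = logits[0];
    for (float l : logits) max_logit = std::max(max_logit, l);  // numerical stability
    float sum = 0.0f;
    for (float &l : logits) {
        l = std::exp((l - max_logit) / temp);
        sum += l;
    }
    for (float &l : logits) l /= sum;
    return logits;
}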
.It Fl Fl logits-all
@@ -525,10 +613,6 @@ Run in chatml mode (use with ChatML-compatible models)
Verbose print of the KV cache.
.It Fl nkvo , Fl Fl no-kv-offload
Disable KV offload.
.It Fl ctk Ar TYPE , Fl Fl cache-type-k Ar TYPE
KV cache data type for K.
.It Fl ctv Ar TYPE , Fl Fl cache-type-v Ar TYPE
KV cache data type for V.
.It Fl gan Ar N , Fl Fl grp-attn-n Ar N
Group-attention factor.
.Pp
