An API to query local language models using different backends. Supported backends:
- Llama.cpp Python: the local Python bindings for Llama.cpp
- Kobold.cpp: the Koboldcpp API server
- Ollama: the Ollama API server
pip install locallm
from locallm import LocalLm, InferenceParams, LmParams
lm = LocalLm(
    LmParams(
        models_dir="/home/me/my/models/dir"
    )
)
lm.load_model("mistral-7b-instruct-v0.1.Q4_K_M.gguf", 8192)
template = "<s>[INST] {prompt} [/INST]"
lm.infer(
    "list the planets in the solar system",
    InferenceParams(
        template=template,
        temperature=0.2,
        stream=True,
        max_tokens=512,
    ),
)
from locallm import KoboldcppLm, LmParams, InferenceParams
lm = KoboldcppLm(
    LmParams(is_verbose=True)
)
lm.load_model("", 8192) # sets the context window size to 8196 tokens
template = "<s>[INST] {prompt} [/INST]"
lm.infer(
    "list the planets in the solar system",
    InferenceParams(
        template=template,
        stream=True,
        max_tokens=512,
    ),
)
from locallm import OllamaLm, LmParams, InferenceParams
lm = OllamaLm(
    LmParams(is_verbose=True)
)
lm.load_model("mistral-7b-instruct-v0.1.Q4_K_M.gguf", 8192)
template = "<s>[INST] {prompt} [/INST]"
lm.infer(
    "list the planets in the solar system",
    InferenceParams(
        stream=True,
        template=template,
        temperature=0.5,
    ),
)
Providers:
- Llama.cpp Python provider
- Kobold.cpp provider
- Ollama provider
Other:
An abstract base class to describe a language model provider. All the providers implement this API.
- llm (Optional[Llama]): the language model.
- models_dir (str): the directory where the models are stored.
- api_key (str): the API key for the language model.
- server_url (str): the URL of the language model server.
- is_verbose (bool): whether to print more information.
- threads (Optional[int]): the number of threads to use.
- gpu_layers (Optional[int]): the number of layers to offload to the GPU.
- embedding (Optional[bool]): use embeddings or not.
- on_token (OnTokenType): the function to be called when a token is generated. Default: outputs the token to the terminal.
- on_start_emit (OnStartEmitType): the function to be called when the model starts emitting tokens.
lm = OllamaLm(LmParams(is_verbose=True))
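Since every provider exposes this same interface, switching backends is mostly a matter of swapping the provider class. A minimal sketch, assuming the Kobold.cpp server already has a model loaded as in the quickstart above:

from locallm import KoboldcppLm, OllamaLm, LmParams, InferenceParams

# any provider class can be used here: they all implement the same API
lm = KoboldcppLm(LmParams(is_verbose=True))
# lm = OllamaLm(LmParams(is_verbose=True))  # switch backend by swapping the class

lm.load_model("", 8192)  # empty name as in the Kobold.cpp quickstart; only the context size is set
lm.infer(
    "list the planets in the solar system",
    InferenceParams(
        template="<s>[INST] {prompt} [/INST]",
        max_tokens=512,
    ),
)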
Methods:
Constructs all the necessary attributes for the LmProvider object.
- params (LmParams): the parameters for the language model.
lm = KoboldcppLm(LmParams())
Loads a language model.
- model_name (str): The name of the model to load.
- ctx (int): The context window size for the model.
- gpu_layers (Optional[int]): The number of layers to offload to the GPU for the model.
lm.load_model("my_model.gguf", 2048, 32)
Run an inference query.
- prompt (str): the prompt to generate text from.
- params (InferenceParams): the parameters for the inference query.

Returns:
- result (InferenceResult): the generated text and stats.
>>> lm.infer("<s>[INST] List the planets in the solar system [/INST>")
The planets in the solar system are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune.
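A hedged sketch of a non-streaming call, assuming that with stream=False the returned InferenceResult holds the full generated text and stats (the result is simply printed here rather than assuming its exact field names):

result = lm.infer(
    "<s>[INST] Give a one sentence definition of a planet [/INST]",
    InferenceParams(stream=False, temperature=0.2, max_tokens=128),
)
print(result)  # InferenceResult: the generated text and stats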
Parameters for inference.
- stream (bool, Optional): Whether to stream the output.
- template (str, Optional): The template to use for the inference.
- threads (int, Optional): The number of threads to use for the inference.
- max_tokens (int, Optional): The maximum number of tokens to generate.
- temperature (float, Optional): The temperature for the model.
- top_p (float, Optional): The cumulative probability cutoff for nucleus (top p) sampling.
- top_k (int, Optional): The top k tokens to sample from.
- min_p (float, Optional): The minimum probability for a token to be considered.
- stop (List[str], Optional): A list of words to stop the model from generating.
- frequency_penalty (float, Optional): The frequency penalty for the model.
- presence_penalty (float, Optional): The presence penalty for the model.
- repeat_penalty (float, Optional): The repeat penalty for the model.
- tfs (float, Optional): The tail free sampling parameter for the model.
- grammar (str, Optional): A GBNF grammar to constrain the model's output.
InferenceParams(stream=True, template="<s>[INST] {prompt} [/INST]")
{
    "stream": True,
    "template": "<s>[INST] {prompt} [/INST]"
}
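The stop and grammar options can be combined with the sampling settings above; a small sketch using a minimal GBNF grammar (whether grammars are honored depends on the backend in use):

yes_no_grammar = 'root ::= "yes" | "no"'  # minimal GBNF grammar: the output must be "yes" or "no"

params = InferenceParams(
    template="<s>[INST] {prompt} [/INST]",
    temperature=0,
    stop=["</s>"],           # stop generating if this sequence is produced
    grammar=yes_no_grammar,  # constrain the model's output with the grammar
)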
Parameters for a language model.
- models_dir (str, Optional): The directory containing the language models.
- api_key (str, Optional): The API key for the language model.
- server_url (str, Optional): The server URL for the language model.
- is_verbose (bool, Optional): Whether to enable verbose output.
- on_token (Callable[[str], None], Optional): A callback function to be called on each token generated. If not provided, the default outputs the tokens to the command line as they arrive.
- on_start_emit (Callable[[Optional[Any]], None], Optional): A callback function to be called at the start of the emission.
LmParams(
    models_dir="/home/me/models",
    api_key="abc123",
)
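A custom on_token callback can replace the default terminal output, for example to collect the streamed tokens in memory; a small sketch:

tokens = []

def collect_token(token: str) -> None:
    # called for each generated token instead of printing it
    tokens.append(token)

LmParams(
    models_dir="/home/me/models",
    on_token=collect_token,
)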
To configure the tests, create a tests/localconf.py file containing some local config info to run the tests:
# absolute path to your models dir
MODELS_DIR = "/home/me/my/models/dir"
# the model to use in the tests
MODEL = "q5_1-gguf-mamba-gpt-3B_v4.gguf"
# the context window size for the tests
CTX = 2048
Be sure to have the corresponding backend up before running a test.