This Rust program provides a simple HTTP API for tokenizing text with Hugging Face's `tokenizers` library. It is primarily intended for use from languages that lack a good or fast tokenizer implementation. The API lets you measure the token length of a string and trim strings to a specified token length.
To use this program, you will need the following:
- Rust (latest stable version)
- Cargo (latest stable version)
- Docker (optional)
To build and run from source:

- Clone the repository:

  ```sh
  git clone https://github.com/josephhajduk/howmanytokens.git
  cd howmanytokens
  ```

- Build the project:

  ```sh
  cargo build --release
  ```

- Run the program:

  ```sh
  cargo run --release
  ```
To run it with Docker instead:

- Clone the repository:

  ```sh
  git clone https://github.com/josephhajduk/howmanytokens.git
  cd howmanytokens
  ```

- Build the Docker image:

  ```sh
  docker build -t howmanytokens .
  ```

- Run the Docker container:

  ```sh
  docker run -p 30000:30000 howmanytokens
  ```
By default, the program listens on `0.0.0.0:30000` and uses the `gpt2` tokenizer. You can change these settings with the `SERVER_ADDR` and `PRETRAINED` environment variables, respectively.
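For example, to serve on a different address with a different tokenizer (the `bert-base-uncased` model id below is purely illustrative; `PRETRAINED` presumably accepts any tokenizer name the Hugging Face hub can resolve):

```sh
# Run locally on port 8080 with a different pretrained tokenizer
SERVER_ADDR=0.0.0.0:8080 PRETRAINED=bert-base-uncased cargo run --release

# The same overrides, passed to the Docker container
docker run -p 8080:8080 \
  -e SERVER_ADDR=0.0.0.0:8080 \
  -e PRETRAINED=bert-base-uncased \
  howmanytokens
```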
The API exposes three endpoints:

- `POST /len`: Get the token length of the input string.
- `POST /trim/<length>`: Trim the input string to the specified token length.
- `POST /trimw/<length>`: Trim the input string to the specified token length while ensuring that the last word is not cut off.
```sh
# Get token length
curl -X POST -d "This is a test string." http://localhost:30000/len

# Trim string to 5 tokens
curl -X POST -d "This is a test string." http://localhost:30000/trim/5

# Trim string to 5 tokens without cutting off the last word
curl -X POST -d "This is a test string." http://localhost:30000/trimw/5
```
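For larger inputs, curl can read the request body from a file, and if (as the examples above suggest) the endpoints return their result as plain text, the output can be captured directly in a script. Both `--data-binary` and the command substitution below are standard curl/shell idioms, not features of this service:

```sh
# Tokenize the contents of a file (curl sends input.txt as the request body)
curl -X POST --data-binary @input.txt http://localhost:30000/len

# Capture the trimmed text, assuming the response body is the plain-text result
TRIMMED=$(curl -s -X POST -d "This is a test string." http://localhost:30000/trimw/5)
echo "$TRIMMED"
```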
This project is licensed under the MIT License.
GPT-4 wrote this MD file for me and I barely skimmed it, so it could all be nonsense.