An implementation of "kNNBE: Incorporating Labeled Sentences in Bi-encoder Inference for Fast and Accurate Skill Mapping" (CIKM '25).
⚠️ Important: Building faiss and Flash Attention 2 from source can be complex and time-consuming. We strongly recommend using the Dev Container for the easiest setup experience.
The easiest way to get started is using VS Code with Dev Containers:
- Install VS Code and Docker
- Install the Dev Containers extension
- Clone and open the repository:
  git clone https://github.com/megagonlabs/knnbe
  cd knnbe
  code .
- When prompted, click "Reopen in Container" (or run `Dev Containers: Reopen in Container` from the command palette)
- Wait for the container to build (first time only)
That's it! All dependencies including faiss and Flash Attention 2 will be automatically installed.
If you're not using VS Code, you can use Docker directly:
git clone https://github.com/megagonlabs/knnbe
cd knnbe
# Build the Docker image
docker build -t knnbe:latest -f .devcontainer/gpu/Dockerfile .
# Run the container interactively
docker run -it --gpus all \
-v $(pwd):/workspace \
-w /workspace \
knnbe:latest bash
# Inside the container, install Flash Attention 2 and faiss
uv sync --group flash-attn --no-build-isolation
uv pip install --no-deps /opt/wheels/*.whl
# Verify installation
uv run python -c "import faiss; import flash_attn; print('✓ All dependencies installed')"

Useful Docker commands:
# Run with Jupyter port forwarding
docker run -it --gpus all \
-v $(pwd):/workspace \
-p 8888:8888 \
knnbe:latest bash
# Run a specific command without entering the container
docker run --gpus all \
-v $(pwd):/workspace \
knnbe:latest \
bash -c "uv sync --group flash-attn --no-build-isolation && uv pip install --no-deps /opt/wheels/*.whl && uv run python -m knnbe.evaluate --help"

For advanced users who want to set up the environment locally:
- Python 3.10 or 3.11
- CUDA Toolkit (for GPU support)
- C++ compiler (gcc/g++ or clang)
- CMake 3.17+
- Git
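The prerequisites above can be checked before starting a local build. The following is a minimal sketch, not part of knnbe: the function name `check_prerequisites` is hypothetical, and it assumes the tools are on `PATH` under their usual executable names (`nvcc` for the CUDA Toolkit, `g++` or `clang++` for the compiler). It does not verify the CMake version.

```python
import shutil
import sys


def check_prerequisites() -> dict[str, bool]:
    """Report which of the listed prerequisites appear to be available."""
    return {
        # The prerequisites list Python 3.10 or 3.11.
        "python": sys.version_info[:2] in {(3, 10), (3, 11)},
        # Assumption: nvcc on PATH indicates an installed CUDA Toolkit.
        "cuda_toolkit": shutil.which("nvcc") is not None,
        # Either g++ or clang++ counts as a C++ compiler.
        "cxx_compiler": any(shutil.which(c) for c in ("g++", "clang++")),
        "cmake": shutil.which("cmake") is not None,
        "git": shutil.which("git") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Anything reported as missing would need to be installed before building faiss or Flash Attention 2 from source.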
git clone https://github.com/megagonlabs/knnbe
cd knnbe
# Install base dependencies (without faiss and Flash Attention 2)
uv sync

Note: faiss is not included in pyproject.toml and must be installed separately via `uv pip install` after building from source (see Step 3).

Warning: `uv sync --group flash-attn` may uninstall faiss.
# Uninstall dev and flash-attn groups first
uv sync --no-group dev --no-group flash-attn
# Reinstall dev dependencies
uv sync --group dev
# Install Flash Attention 2 (requires CUDA)
uv sync --group flash-attn --no-build-isolation

Build faiss with GPU support:
# Install build dependencies
# Ubuntu/Debian:
sudo apt-get install -y cmake build-essential swig
# Clone faiss
git clone https://github.com/facebookresearch/faiss.git
cd faiss
# Build with GPU support
cmake -B build \
-DFAISS_ENABLE_GPU=ON \
-DFAISS_ENABLE_PYTHON=ON \
-DBUILD_SHARED_LIBS=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
# Build Python wheel
uv build --wheel --out-dir ../.faiss-wheel build/faiss/python
cd ..
# Install the wheel
uv pip install .faiss-wheel/faiss-*.whl

uv run torchrun --nproc_per_node=2 -m knnbe.train_sentence_transformer --config config/skill_mapping/biencoder.yaml

uv run python -m knnbe.evaluation.cross_validation --memory TechWolf/Synthetic-ESCO-skill-sentences --model_name ./models/lightonai/modernbert-embed-large/ --attn_implementation sdpa --ontology_file ./data/skills_en.csv

Download skills_en.csv from https://esco.ec.europa.eu/en/use-esco/download in advance.
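Because the commands here read the ontology from `./data/skills_en.csv`, it can help to confirm that the downloaded file is in place and contains the `preferredLabel` column used as `label_key` in this README. The following is a minimal sketch using only the standard library. The helper name `check_ontology_csv` is hypothetical and not part of knnbe, and UTF-8 encoding is an assumption.

```python
import csv
from pathlib import Path


def check_ontology_csv(path: str, label_key: str = "preferredLabel") -> int:
    """Return the number of rows with a non-empty label.

    Raises ValueError if the label column is missing from the header.
    """
    with Path(path).open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if label_key not in (reader.fieldnames or []):
            raise ValueError(f"column {label_key!r} not found in {path}")
        # Short rows yield None for missing fields, so guard before strip().
        return sum(1 for row in reader if (row[label_key] or "").strip())


if __name__ == "__main__":
    print(check_ontology_csv("./data/skills_en.csv"))
```

A `FileNotFoundError` here would mean the file still needs to be downloaded to the expected path.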
uv run python -m knnbe.evaluate --model_name ./models/bi_encoder/lightonai/modernbert-embed-large/ --dataset TechWolf/skill-extraction-tech --ontology_file ./data/skills_en.csv --attn_implementation sdpa --knn_weight 0.6 --num_neighbors 32 --memory TechWolf/Synthetic-ESCO-skill-sentences

from datasets import load_dataset, Dataset
from knnbe import KnnBiEncoder, ModelOutputs
model_name = "sentence-transformers/distiluse-base-multilingual-cased-v2"
model = KnnBiEncoder(model_name=model_name)
# Pre-compute skill embeddings
model.create_skill_embeddings(
ontology_file="/path/to/skills_en.csv",
label_key="preferredLabel",
)
# Building the memory with a dataset
ds: Dataset = load_dataset("TechWolf/Synthetic-ESCO-skill-sentences", split="train")
memory_samples: list[dict] = ds.to_list()
model.build_memory(
memory_samples,
label_key="preferredLabel",
memory_text_column_name="sentence",
memory_label_column_name="skill",
)
# Inference
texts = ["strong Python skills", "experience with data analysis", "ability to work in a team"]
model_outputs: ModelOutputs = model(
texts,
top_n=3,
batch_size=256,
output_scores=True,
num_neighbors=32,
knn_weight=0.6,
)
# Convert skill IDs to skill records
records: list[list[dict]] = model.ids_to_skills(model_outputs.skill_ids)
for i in range(len(texts)):
text = texts[i]
print(text)
print("---")
for j, record in enumerate(records[i]):
score = model_outputs.scores[i, j]
label = record["preferredLabel"]
print(f'{score:.2f} {label}')
print()

strong Python skills
---
0.32 Python (computer programming)
0.20 apply basic programming skills
0.20 teach survival skills
experience with data analysis
---
0.35 analyse experimental laboratory data
0.34 perform data analysis
0.33 analyse test data
ability to work in a team
---
0.37 work in teams
0.31 work in a land-based team
0.30 work in a hospitality team

This software may include, incorporate, or access open source software (OSS) components, datasets and other third party components, including those identified below. The license terms respectively governing the datasets and third-party components continue to govern those portions, and you agree to those license terms may limit any distribution, use, and copying. You may use any OSS components under the terms of their respective licenses, which may include BSD 3, Apache 2.0, and other licenses. In the event of conflicts between Megagon Labs, Inc. (“Megagon”) license conditions and the OSS license conditions, the applicable OSS conditions governing the corresponding OSS components shall prevail. You agree not to, and are not permitted to, distribute actual datasets used with the OSS components listed below. You agree and are limited to distribute only links to datasets from known sources by listing them in the datasets overview table below. You agree that any right to modify datasets originating from parties other than Megagon are governed by the respective third party’s license conditions. You agree that Megagon grants no license as to any of its intellectual property and patent rights. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS (INCLUDING MEGAGON) “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. You agree to cease using, incorporating, and distributing any part of the provided materials if you do not agree with the terms or the lack of any warranty herein. While Megagon makes commercially reasonable efforts to ensure that citations in this document are complete and accurate, errors may occur. If you see any error or omission, please help us improve this document by sending information to [email protected].
All datasets used within the product are listed below (including their copyright holders and the license information). For Datasets having different portions released under different licenses, please refer to the included source link specified for each of the respective datasets for identifications of dataset files released under the identified licenses.
| ID | Dataset Name | Modified | Copyright Holder | Upstream Link | License |
|---|---|---|---|---|---|
| 1 | TechWolf/Synthetic-ESCO-skill-sentences | No | TechWolf | link | Creative Commons Attribution 4.0 |
| 2 | TechWolf/skill-extraction-tech | No | TechWolf | link | Creative Commons Attribution 4.0 |
| 3 | TechWolf/skill-extraction-house | No | TechWolf | link | Creative Commons Attribution 4.0 |
| 4 | TechWolf/skill-extraction-techwolf | No | TechWolf | link | Creative Commons Attribution 4.0 |
All OSS components used within the product are listed below (including their copyright holders and the license information). For OSS components having different portions released under different licenses, please refer to the included Upstream link(s) specified for each of the respective OSS components for identifications of code files released under the identified licenses.
| ID | OSS Component Name | Copyright Holder | Upstream Link | License |
|---|---|---|---|---|
| 1 | bitsandbytes | Tim Dettmers | link | MIT License * |
| 2 | coverage | Ned Batchelder and 243 others | link | Apache-2.0 |
| 3 | datasets | HuggingFace Inc. | link | Apache Software License |
| 4 | diskcache | Grant Jenks | link | Apache Software License |
| 5 | editables | Paul Moore | link | MIT License |
| 6 | evaluate | HuggingFace Inc. | link | Apache Software License |
| 7 | faiss | Matthijs Douze, Jeff Johnson, Herve Jegou, Lucas Hosseini | link | MIT |
| 8 | flash-attn | Tri Dao | link | BSD License |
| 9 | hatchling | Ofek Lev | link | MIT License |
| 10 | kaleido | Jon Mease | link | MIT |
| 11 | loguru | Delgan | link | MIT License |
| 12 | more-itertools | Erik Rose | link | MIT License * |
| 13 | mypy | Jukka Lehtosalo | link | MIT License |
| 14 | numpy | Travis E. Oliphant et al. | link | BSD License |
| 15 | openpyxl | Eric Gazoni, Charlie Clark | link | MIT License |
| 16 | pandas | The Pandas Development Team | link | BSD License |
| 17 | peft | The HuggingFace team | link | Apache Software License |
| 18 | pip-licenses | raimon | link | MIT |
| 19 | plotly | Chris P | link | MIT License |
| 20 | pre-commit | Anthony Sottile | link | MIT |
| 21 | protobuf | [email protected] | link | 3-Clause BSD License |
| 22 | pydantic | Samuel Colvin | link | MIT License |
| 23 | pytest | Holger Krekel, Bruno Oliveira, Ronny Pfannschmidt, Floris Bruynooghe, Brianna Laugher, Florian Bruhin, Others | link | MIT License |
| 24 | pytest-cov | Marc Schlaich | link | MIT License |
| 25 | PyYAML | Kirill Simonov | link | MIT License |
| 26 | ruff | "Astral Software Inc." | link | MIT License |
| 27 | scipy | SciPy Developers | link | BSD License |
| 28 | sentence-transformers | Nils Reimers | link | Apache Software License |
| 29 | sentencepiece | Taku Kudo | link | Apache License 2.0 * |
| 30 | tensorboard | Google Inc. | link | Apache Software License |
| 31 | tensorboardX | Tzu-Wei Huang | link | MIT License * |
| 32 | torch | PyTorch Team | link | BSD License |
| 33 | tqdm | tqdm developers | link | MIT License; Mozilla Public License 2.0 (MPL 2.0) |
| 34 | transformers | The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors) | link | Apache Software License |
| 35 | types-PyYAML | typeshed maintainers | link | Apache License 2.0 * |
| 36 | wheel | Daniel Holth | link | MIT License |
License Information Sources:
- Without '*': Extracted from package metadata via pip-licenses
- With '*': Manually verified from source repositories
@inproceedings{10.1145/3746252.3760896,
author = {Makino, Takuya},
title = {kNNBE: Incorporating Labeled Sentences in Bi-encoder Inference for Fast and Accurate Skill Mapping},
year = {2025},
isbn = {9798400720406},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3746252.3760896},
doi = {10.1145/3746252.3760896},
booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
pages = {5021–5025},
numpages = {5},
keywords = {bi-encoder, knn, skill-mapping},
location = {Seoul, Republic of Korea},
series = {CIKM '25}
}