An implementation of ``kNNBE: Incorporating Labeled Sentences in Bi-encoder Inference for Fast and Accurate Skill Mapping'', CIKM'25.

kNNBE

An implementation of ``kNNBE: Incorporating Labeled Sentences in Bi-encoder Inference for Fast and Accurate Skill Mapping'', CIKM'25.

Install

⚠️ Important: Building faiss and Flash Attention 2 from source can be complex and time-consuming. We strongly recommend using the Dev Container for the easiest setup experience.

Option 1: Dev Container (Recommended)

The easiest way to get started is using VS Code with Dev Containers:

  1. Install VS Code and Docker
  2. Install the Dev Containers extension
  3. Clone and open the repository:
    git clone https://github.com/megagonlabs/knnbe
    cd knnbe
    code .
  4. When prompted, click "Reopen in Container" (or run Dev Containers: Reopen in Container from the command palette)
  5. Wait for the container to build (first time only)

That's it! All dependencies including faiss and Flash Attention 2 will be automatically installed.


Option 2: Docker

If you're not using VS Code, you can use Docker directly:

git clone https://github.com/megagonlabs/knnbe
cd knnbe

# Build the Docker image
docker build -t knnbe:latest -f .devcontainer/gpu/Dockerfile .

# Run the container interactively
docker run -it --gpus all \
  -v $(pwd):/workspace \
  -w /workspace \
  knnbe:latest bash

# Inside the container, install Flash Attention 2 and faiss
uv sync --group flash-attn --no-build-isolation
uv pip install --no-deps /opt/wheels/*.whl

# Verify installation
uv run python -c "import faiss; import flash_attn; print('✓ All dependencies installed')"

Useful Docker commands:

# Run with Jupyter port forwarding
docker run -it --gpus all \
  -v $(pwd):/workspace \
  -p 8888:8888 \
  knnbe:latest bash

# Run a specific command without entering the container
docker run --gpus all \
  -v $(pwd):/workspace \
  knnbe:latest \
  bash -c "uv sync --group flash-attn --no-build-isolation && uv pip install --no-deps /opt/wheels/*.whl && uv run python -m knnbe.evaluate --help"

Option 3: Local Setup

For advanced users who want to set up the environment locally:

Prerequisites

  • Python 3.10 or 3.11
  • CUDA Toolkit (for GPU support)
  • C++ compiler (gcc/g++ or clang)
  • CMake 3.17+
  • Git

Step 1: Basic Installation

git clone https://github.com/megagonlabs/knnbe
cd knnbe

# Install base dependencies (without faiss and Flash Attention 2)
uv sync

Note: faiss is not included in pyproject.toml and must be installed separately via uv pip install after building from source (see Step 3).

Step 2: Installing Flash Attention 2

⚠️ Important: Install Flash Attention 2 before faiss, as uv sync --group flash-attn may uninstall faiss.

# Uninstall dev and flash-attn groups first
uv sync --no-group dev --no-group flash-attn

# Reinstall dev dependencies
uv sync --group dev

# Install Flash Attention 2 (requires CUDA)
uv sync --group flash-attn --no-build-isolation

Step 3: Installing faiss from source

Build faiss with GPU support:

# Install build dependencies
# Ubuntu/Debian:
sudo apt-get install -y cmake build-essential swig

# Clone faiss
git clone https://github.com/facebookresearch/faiss.git
cd faiss

# Build with GPU support
cmake -B build \
  -DFAISS_ENABLE_GPU=ON \
  -DFAISS_ENABLE_PYTHON=ON \
  -DBUILD_SHARED_LIBS=ON \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build -j$(nproc)

# Build Python wheel
uv build --wheel --out-dir ../.faiss-wheel build/faiss/python
cd ..

# Install the wheel
uv pip install .faiss-wheel/faiss-*.whl
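
The Docker instructions end with an import check, and the same idea applies after a local build. The sketch below uses only the standard library to report which of the two compiled dependencies cannot be found in the current environment. The helper name `missing_modules` is illustrative and is not part of knnbe.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # faiss and flash_attn are the two packages built in Steps 2 and 3.
    missing = missing_modules(["faiss", "flash_attn"])
    print("missing:", ", ".join(missing) if missing else "none")
```

Saved under a name of your choice (for example a hypothetical `check_deps.py`), it can be run with `uv run python check_deps.py` so that it sees the same environment as the other commands here.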

Training

uv run torchrun --nproc_per_node=2 -m knnbe.train_sentence_transformer --config config/skill_mapping/biencoder.yaml

Hyperparameter optimization for kNNBE

uv run python -m knnbe.evaluation.cross_validation --memory TechWolf/Synthetic-ESCO-skill-sentences --model_name ./models/lightonai/modernbert-embed-large/ --attn_implementation sdpa --ontology_file ./data/skills_en.csv
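
As a rough illustration of what a cross-validated search over `num_neighbors` and `knn_weight` can look like, the sketch below holds out one fold of the labeled sentences as queries, uses the remaining folds as memory, and keeps the pair with the best mean score. Whether `knnbe.evaluation.cross_validation` proceeds this way is an assumption; the function names, the grids, and the placeholder metric `toy_evaluate` are all hypothetical.

```python
from itertools import product


def k_fold_indices(n, num_folds):
    """Yield (memory_indices, held_out_indices) for each fold."""
    for fold in range(num_folds):
        held_out = list(range(fold, n, num_folds))
        held = set(held_out)
        yield [i for i in range(n) if i not in held], held_out


def select_hyperparameters(samples, evaluate, neighbor_grid, weight_grid, num_folds=5):
    """Return (mean_score, num_neighbors, knn_weight) with the best mean score."""
    best = None
    for num_neighbors, knn_weight in product(neighbor_grid, weight_grid):
        fold_scores = []
        for memory_idx, held_out_idx in k_fold_indices(len(samples), num_folds):
            memory = [samples[i] for i in memory_idx]
            held_out = [samples[i] for i in held_out_idx]
            fold_scores.append(evaluate(memory, held_out, num_neighbors, knn_weight))
        mean_score = sum(fold_scores) / len(fold_scores)
        if best is None or mean_score > best[0]:
            best = (mean_score, num_neighbors, knn_weight)
    return best


def toy_evaluate(memory, held_out, num_neighbors, knn_weight):
    # Placeholder metric with an arbitrary peak at (16, 0.4). A real metric
    # would build the memory, map the held-out sentences, and score the result.
    return -abs(knn_weight - 0.4) - abs(num_neighbors - 16) / 100


samples = [{"sentence": f"sentence {i}", "skill": i % 3} for i in range(10)]
best = select_hyperparameters(samples, toy_evaluate, [8, 16, 32], [0.2, 0.4, 0.6])
print(best[1:])  # the selected (num_neighbors, knn_weight) pair
```

The selected pair would then be passed to evaluation through `--num_neighbors` and `--knn_weight`.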

Evaluation

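Download skills_en.csv from https://esco.ec.europa.eu/en/use-esco/download in advance and save it as ./data/skills_en.csv, the path that the --ontology_file option points to in the commands here.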
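
Before running evaluation, it can be worth confirming that the downloaded file exposes the `preferredLabel` column that the usage example passes as `label_key`. The following is a minimal sketch using the standard `csv` module; the function name `read_skill_labels` and the toy `id` column are hypothetical and not part of knnbe or ESCO.

```python
import csv
import io


def read_skill_labels(csv_file, label_key="preferredLabel"):
    """Return the non-empty values of the label column from an open CSV file."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or label_key not in reader.fieldnames:
        raise KeyError(f"column {label_key!r} not found; got {reader.fieldnames}")
    return [row[label_key] for row in reader if row[label_key]]


# Toy stand-in for the real file; only the preferredLabel column matters here.
toy_csv = io.StringIO("id,preferredLabel\n1,work in teams\n2,perform data analysis\n")
print(read_skill_labels(toy_csv))
```

For the real file, open it with `open("./data/skills_en.csv", newline="", encoding="utf-8")` and pass the file object to the same function.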

uv run python -m knnbe.evaluate --model_name ./models/bi_encoder/lightonai/modernbert-embed-large/ --dataset TechWolf/skill-extraction-tech --ontology_file ./data/skills_en.csv --attn_implementation sdpa --knn_weight 0.6 --num_neighbors 32 --memory TechWolf/Synthetic-ESCO-skill-sentences

Usage

from datasets import load_dataset, Dataset
from knnbe import KnnBiEncoder, ModelOutputs


model_name = "sentence-transformers/distiluse-base-multilingual-cased-v2"
model = KnnBiEncoder(model_name=model_name)

# Pre-compute skill embeddings
model.create_skill_embeddings(
    ontology_file="/path/to/skills_en.csv",
    label_key="preferredLabel",
)

# Building the memory with a dataset
ds: Dataset = load_dataset("TechWolf/Synthetic-ESCO-skill-sentences", split="train")
memory_samples: list[dict] = ds.to_list()
model.build_memory(
    memory_samples,
    label_key="preferredLabel",
    memory_text_column_name="sentence",
    memory_label_column_name="skill",
)

# Inference
texts = ["strong Python skills", "experience with data analysis", "ability to work in a team"]
model_outputs: ModelOutputs = model(
    texts,
    top_n=3,
    batch_size=256,
    output_scores=True,
    num_neighbors=32,
    knn_weight=0.6,
)

# Convert skill IDs to skill records
records: list[list[dict]] = model.ids_to_skills(model_outputs.skill_ids)

for i, text in enumerate(texts):
    print(text)
    print("---")
    for j, record in enumerate(records[i]):
        score = model_outputs.scores[i, j]
        label = record["preferredLabel"]
        print(f'{score:.2f} {label}')
    print()

Output:

strong Python skills
---
0.32 Python (computer programming)
0.20 apply basic programming skills
0.20 teach survival skills

experience with data analysis
---
0.35 analyse experimental laboratory data
0.34 perform data analysis
0.33 analyse test data

ability to work in a team
---
0.37 work in teams
0.31 work in a land-based team
0.30 work in a hospitality team
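
The `num_neighbors` and `knn_weight` arguments in the inference call suggest that the final score mixes a bi-encoder similarity with evidence from the nearest labeled sentences in memory. The sketch below shows one plausible way to do that: a linear interpolation where each skill's kNN score is the highest similarity among its retrieved sentences. This is an assumption made for illustration, not the repository's or the paper's exact formulation, and `knn_interpolated_scores` is a hypothetical name. Plain lists keep the sketch dependency-free.

```python
def dot(a, b):
    """Dot product of two equal-length vectors (cosine if both are unit length)."""
    return sum(x * y for x, y in zip(a, b))


def knn_interpolated_scores(query_emb, skill_embs, memory_embs, memory_labels,
                            num_neighbors, knn_weight):
    """Mix bi-encoder scores with kNN scores from labeled memory sentences."""
    # Bi-encoder part: similarity between the query and every skill label.
    be_scores = [dot(skill, query_emb) for skill in skill_embs]

    # kNN part: retrieve the most similar memory sentences, then give each
    # skill the best similarity among its retrieved sentences (floor of 0).
    mem_sims = [dot(mem, query_emb) for mem in memory_embs]
    nearest = sorted(range(len(mem_sims)), key=lambda i: mem_sims[i],
                     reverse=True)[:num_neighbors]
    knn_scores = [0.0] * len(skill_embs)
    for i in nearest:
        label = memory_labels[i]
        knn_scores[label] = max(knn_scores[label], mem_sims[i])

    return [(1.0 - knn_weight) * be + knn_weight * kn
            for be, kn in zip(be_scores, knn_scores)]


# Toy 2-d example: the bi-encoder alone prefers skill 1, but a memory
# sentence labeled with skill 0 matches the query exactly.
query = [0.6, 0.8]
skills = [[1.0, 0.0], [0.0, 1.0]]
memory = [[0.6, 0.8], [0.0, 1.0]]
labels = [0, 1]

without_knn = knn_interpolated_scores(query, skills, memory, labels, 1, 0.0)
with_knn = knn_interpolated_scores(query, skills, memory, labels, 1, 0.6)
print(without_knn.index(max(without_knn)))  # 1
print(with_knn.index(max(with_knn)))        # 0
```

With `knn_weight=0.0` the ranking is the plain bi-encoder ranking; raising the weight lets labeled sentences override it when a close match exists in memory.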

Software Agreement for knnbe

1. Disclosure

This software may include, incorporate, or access open source software (OSS) components, datasets and other third party components, including those identified below. The license terms respectively governing the datasets and third-party components continue to govern those portions, and you agree to those license terms may limit any distribution, use, and copying. You may use any OSS components under the terms of their respective licenses, which may include BSD 3, Apache 2.0, and other licenses. In the event of conflicts between Megagon Labs, Inc. (“Megagon”) license conditions and the OSS license conditions, the applicable OSS conditions governing the corresponding OSS components shall prevail. You agree not to, and are not permitted to, distribute actual datasets used with the OSS components listed below. You agree and are limited to distribute only links to datasets from known sources by listing them in the datasets overview table below. You agree that any right to modify datasets originating from parties other than Megagon are governed by the respective third party’s license conditions. You agree that Megagon grants no license as to any of its intellectual property and patent rights. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS (INCLUDING MEGAGON) “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
You agree to cease using, incorporating, and distributing any part of the provided materials if you do not agree with the terms or the lack of any warranty herein. While Megagon makes commercially reasonable efforts to ensure that citations in this document are complete and accurate, errors may occur. If you see any error or omission, please help us improve this document by sending information to [email protected].

2. Datasets

All datasets used within the product are listed below (including their copyright holders and the license information). For Datasets having different portions released under different licenses, please refer to the included source link specified for each of the respective datasets for identifications of dataset files released under the identified licenses.

| ID | Dataset Name | Modified | Copyright Holder | Upstream Link | License |
|---|---|---|---|---|---|
| 1 | TechWolf/Synthetic-ESCO-skill-sentences | No | TechWolf | link | Creative Commons Attribution 4.0 |
| 2 | TechWolf/skill-extraction-tech | No | TechWolf | link | Creative Commons Attribution 4.0 |
| 3 | TechWolf/skill-extraction-house | No | TechWolf | link | Creative Commons Attribution 4.0 |
| 4 | TechWolf/skill-extraction-techwolf | No | TechWolf | link | Creative Commons Attribution 4.0 |

3. Open Source Software (OSS) Components

All OSS components used within the product are listed below (including their copyright holders and the license information). For OSS components having different portions released under different licenses, please refer to the included Upstream link(s) specified for each of the respective OSS components for identifications of code files released under the identified licenses.

| ID | OSS Component Name | Copyright Holder | Upstream Link | License |
|---|---|---|---|---|
| 1 | bitsandbytes | Tim Dettmers | link | MIT License * |
| 2 | coverage | Ned Batchelder and 243 others | link | Apache-2.0 |
| 3 | datasets | HuggingFace Inc. | link | Apache Software License |
| 4 | diskcache | Grant Jenks | link | Apache Software License |
| 5 | editables | Paul Moore | link | MIT License |
| 6 | evaluate | HuggingFace Inc. | link | Apache Software License |
| 7 | faiss | Matthijs Douze, Jeff Johnson, Herve Jegou, Lucas Hosseini | link | MIT |
| 8 | flash-attn | Tri Dao | link | BSD License |
| 9 | hatchling | Ofek Lev | link | MIT License |
| 10 | kaleido | Jon Mease | link | MIT |
| 11 | loguru | Delgan | link | MIT License |
| 12 | more-itertools | Erik Rose | link | MIT License * |
| 13 | mypy | Jukka Lehtosalo | link | MIT License |
| 14 | numpy | Travis E. Oliphant et al. | link | BSD License |
| 15 | openpyxl | Eric Gazoni, Charlie Clark | link | MIT License |
| 16 | pandas | The Pandas Development Team | link | BSD License |
| 17 | peft | The HuggingFace team | link | Apache Software License |
| 18 | pip-licenses | raimon | link | MIT |
| 19 | plotly | Chris P | link | MIT License |
| 20 | pre-commit | Anthony Sottile | link | MIT |
| 21 | protobuf | [email protected] | link | 3-Clause BSD License |
| 22 | pydantic | Samuel Colvin | link | MIT License |
| 23 | pytest | Holger Krekel, Bruno Oliveira, Ronny Pfannschmidt, Floris Bruynooghe, Brianna Laugher, Florian Bruhin, Others | link | MIT License |
| 24 | pytest-cov | Marc Schlaich | link | MIT License |
| 25 | PyYAML | Kirill Simonov | link | MIT License |
| 26 | ruff | "Astral Software Inc." | link | MIT License |
| 27 | scipy | SciPy Developers | link | BSD License |
| 28 | sentence-transformers | Nils Reimers | link | Apache Software License |
| 29 | sentencepiece | Taku Kudo | link | Apache License 2.0 * |
| 30 | tensorboard | Google Inc. | link | Apache Software License |
| 31 | tensorboardX | Tzu-Wei Huang | link | MIT License * |
| 32 | torch | PyTorch Team | link | BSD License |
| 33 | tqdm | tqdm developers | link | MIT License; Mozilla Public License 2.0 (MPL 2.0) |
| 34 | transformers | The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors) | link | Apache Software License |
| 35 | types-PyYAML | typeshed maintainers | link | Apache License 2.0 * |
| 36 | wheel | Daniel Holth | link | MIT License |

License Information Sources:

  • Without '*': Extracted from package metadata via pip-licenses
  • With '*': Manually verified from source repositories

Citation

@inproceedings{10.1145/3746252.3760896,
author = {Makino, Takuya},
title = {kNNBE: Incorporating Labeled Sentences in Bi-encoder Inference for Fast and Accurate Skill Mapping},
year = {2025},
isbn = {9798400720406},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3746252.3760896},
doi = {10.1145/3746252.3760896},
booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
pages = {5021–5025},
numpages = {5},
keywords = {bi-encoder, knn, skill-mapping},
location = {Seoul, Republic of Korea},
series = {CIKM '25}
}
