Skip to content
/ LCSvec Public

Package solving the Longest Common Subsequence/Substring (LCS) for integer vectors (numpy, PyTorch, Jax, Tensorflow)

License

Notifications You must be signed in to change notification settings

Natooz/LCSvec

Repository files navigation

Python 3.8 GitHub CI GitHub license Downloads Code style

LCSvec

Longest Common Subsequence/Substring (LCS) solving package for vectors.

Why LCSvec

While looking for fast implementations solving the Longest Common Subsequence and Longest Common Substring (i.e. Contiguous Subsequence) (LCCS) problems, I only found pieces of code for strings, while I needed something working for vectors of integers. Yet, string LCS implementations 1) work at the character level, which might not be suitable for certain use-cases where one wants to work at word or sentence level; 2) only work with strings, other modalities will not work and might not be designed to be converted to bytes.

LCSvec aims to solve this gap by providing user-friendly and fast implementations of the LCS and LCCS problems for vectors. It works with numpy, pytorch, tensorflow and jax arrays/tensors! The code is written in C++ and the methods are bind with Python with nanobind for optimal performances.

Example

You can install the package with pip: pip install lcsvec. Here we will use numpy, the usage for other deep learning libraries is the same.

from lcsvec import lccs, lccs_length, lcs, lcs_length
import numpy as np

seq1 = np.arange(0, 12)
seq2 = np.array([8, 0, 1, 2, 8, 2, 3, 4, 5, 6], dtype=np.int64)

lcs_ = lcs(seq1, seq2)  # [0, 1, 2, 3, 4, 5, 6]
lcs_len = lcs_length(seq1, seq2)  # 7, more efficient than calling len(lcs(seq1, seq2))

lccs = lccs(seq1, seq2)  # [2, 3, 4, 5, 6]
lccs_len = lccs_length(seq1, seq2)  # 5, more efficient than calling len(lccs(seq1, seq2))

TODOs

  • implement lccs with suffix tree;
  • batch methods, i.e. supporting 2D arrays;
  • batch methods with any number of dimensions (nD array) and add a dim argument;
  • make it work with an unlimited number of sequences, and set dim and pad_token as kwargs only;

About

Package solving the Longest Common Subsequence/Substring (LCS) for integer vectors (numpy, PyTorch, Jax, Tensorflow)

Resources

License

Stars

Watchers

Forks

Packages

No packages published