Longest Common Subsequence/Substring (LCS) solving package for vectors.
While looking for fast implementations solving the Longest Common Subsequence and Longest Common Substring (i.e. Contiguous Subsequence) (LCCS) problems, I only found pieces of code for strings, while I needed something working for vectors of integers. Yet, string LCS implementations 1) work at the character level, which might not be suitable for certain use-cases where one wants to work at word or sentence level; 2) only work with strings, other modalities will not work and might not be designed to be converted to bytes.
LCSvec aims to solve this gap by providing user-friendly and fast implementations of the LCS and LCCS problems for vectors. It works with numpy, pytorch, tensorflow and jax arrays/tensors! The code is written in C++ and the methods are bind with Python with nanobind for optimal performances.
You can install the package with pip: pip install lcsvec
.
Here we will use numpy, the usage for other deep learning libraries is the same.
from lcsvec import lccs, lccs_length, lcs, lcs_length
import numpy as np
seq1 = np.arange(0, 12)
seq2 = np.array([8, 0, 1, 2, 8, 2, 3, 4, 5, 6], dtype=np.int64)
lcs_ = lcs(seq1, seq2) # [0, 1, 2, 3, 4, 5, 6]
lcs_len = lcs_length(seq1, seq2) # 7, more efficient than calling len(lcs(seq1, seq2))
lccs = lccs(seq1, seq2) # [2, 3, 4, 5, 6]
lccs_len = lccs_length(seq1, seq2) # 5, more efficient than calling len(lccs(seq1, seq2))
- implement lccs with suffix tree;
- batch methods, i.e. supporting 2D arrays;
- batch methods with any number of dimensions (nD array) and add a
dim
argument; - make it work with an unlimited number of sequences, and set
dim
andpad_token
as kwargs only;