Search posts on X that you have liked, using the archive download files
- GitHub repository: https://github.com/cast42/search-x-likes/
- Documentation: https://cast42.github.io/search-x-likes/
- PyPI package: https://pypi.org/project/search-x-likes/
First, create a repository on GitHub with the same name as this project, and then run the following commands:
```shell
git init -b main
git add .
git commit -m "init commit"
git remote add origin git@github.com:cast42/search-x-likes.git
git push -u origin main
```
Then, install the environment and the pre-commit hooks with

```shell
make install
```

This will also generate your `uv.lock` file.
Initially, the CI/CD pipeline might fail due to formatting issues. To resolve them, run:

```shell
uv run pre-commit run -a
```
Lastly, commit the changes made by the two steps above to your repository:

```shell
git add .
git commit -m 'Fix formatting issues'
git push origin main
```

To run the scripts that call the OpenAI API, export your API key first:

```shell
export OPENAI_API_KEY=<your key>
```
You are now ready to start development on your project! The CI/CD pipeline will be triggered when you open a pull request, merge to main, or when you create a new release.
To finalize the set-up for publishing to PyPI, see here. For activating the automatic documentation with MkDocs, see here. To enable the code coverage reports, see here.
- Create an API Token on PyPI.
- Add the API Token to your project's secrets with the name `PYPI_TOKEN` by visiting this page.
- Create a new release on GitHub.
- Create a new tag in the form `*.*.*`.

For more details, see here.
Use `ruff` for linting and formatting, `mypy` for static code analysis, and `pytest` for testing. The documentation is built with `mkdocs`, `mkdocs-material` and `mkdocstrings`.
Retrieve the first k exact matches. This approach is implemented as a textual TUI in `search_x_likes/exact_search.py`:

```python
k = 5  # retrieve at most k matching documents
retrieved = []
for document in documents:
    if query in document:
        retrieved.append(document)
        if len(retrieved) >= k:  # stop once k matches are found
            break
```
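For instance, with a toy corpus (a hypothetical example; the real script reads the X archive files), the loop collects at most k documents containing the query string:

```python
# Hypothetical toy corpus for illustration only.
documents = [
    "python tips and tricks",
    "pasta recipes",
    "python packaging with uv",
]
query = "python"

k = 5  # retrieve at most k matching documents
retrieved = []
for document in documents:
    if query in document:
        retrieved.append(document)
        if len(retrieved) >= k:  # stop once k matches are found
            break

print(retrieved)  # the two documents containing "python"
```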
BM25S is an efficient Python-based implementation of BM25 that depends only on NumPy and SciPy. BM25S achieves up to a 500x speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them in sparse matrices. This approach is implemented in `search_x_likes/bm25_search.py`.

- BM25 for Python: Achieving high performance while simplifying dependencies with BM25S ⚡
- Xing Han Lù, BM25S: Orders of magnitude faster lexical search via eager sparse scoring
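The eager-scoring idea behind BM25S can be sketched in plain Python. This is a simplified illustration, not the actual BM25S implementation (which stores the scores in SciPy sparse matrices); the parameter defaults k1=1.5 and b=0.75 are assumed:

```python
import math
from collections import Counter

def build_index(corpus: list[str]) -> dict:
    """Eagerly compute BM25 scores at indexing time: for every
    (term, document) pair, store the full BM25 contribution, so a
    query reduces to summing precomputed scores."""
    k1, b = 1.5, 0.75  # assumed standard BM25 parameters
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    index: dict[str, dict[int, float]] = {}  # term -> {doc_id: score}
    for i, d in enumerate(docs):
        for t, f in Counter(d).items():
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score = idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
            index.setdefault(t, {})[i] = score
    return index

def search(index: dict, query: str, k: int = 5) -> list[int]:
    """Sum the precomputed per-term scores and return the top-k doc ids."""
    scores: Counter = Counter()
    for t in query.lower().split():
        for doc_id, s in index.get(t, {}).items():
            scores[doc_id] += s
    return [doc_id for doc_id, _ in scores.most_common(k)]
```

Because all per-term scores are computed up front, query time is just a sparse sum over the query's terms, which is where the speedup comes from.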
Given the embedding of a document A and the embedding of a query B, score their similarity as the normalized dot product (cosine similarity) of the two vectors:

cos(A, B) = (A · B) / (‖A‖ ‖B‖)

This approach is implemented in `search_x_likes/cosine_search.py`.
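A minimal NumPy sketch of this scoring (the helper name `cosine_scores` is hypothetical; the actual code in `cosine_search.py` may differ):

```python
import numpy as np

def cosine_scores(doc_embeddings: np.ndarray, query_embedding: np.ndarray) -> np.ndarray:
    """Cosine similarity between each document embedding (one per row)
    and the query: normalize the rows and the query, then the dot
    product equals cos(A, B)."""
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    return docs @ q
```

The scores lie in [-1, 1]; documents are ranked by descending score.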
The code to generate the synthetic dataset with gpt-4o-mini is in `search_x_likes/generate_synthetic_eval_dataset.py`. It takes as input the dataset that contains the posts on X that were liked.
The evaluation results are:
| Model | MRR | Recall@5 | NDCG@5 | Wall Time (CPU) | Wall Time (GPU) |
|---|---|---|---|---|---|
| BM25s | 0.7711 | 0.8367 | 0.3376 | 0.2s | 0.4s |
| sentence-transformers/all-MiniLM-L6-v2 | 0.6517 | 0.9246 | 0.3964 | 20s | 4.09s |
| nomic-ai/modernbert-embed-base | 0.6654 | 0.9472 | 0.4044 | 3m01s | 6.82s |
| intfloat/multilingual-e5-large | 0.7063 | 0.9246 | 0.3823 | 7m57s | 12.5s |
| minishlab/potion-retrieval-32M | 0.6346 | 0.8894 | 0.3813 | 2s | 1.64s |
All contributions are welcome: more documentation, examples, code, tests, and even questions.
The package is open-sourced under the terms of the MIT license.
Repository initiated with fpgmaas/cookiecutter-uv.