Intra-Search

A tool for performing semantic search within PDF documents using pre-trained Sentence Transformers (aka SBERT) models to find contextually relevant text.

Features:

- Meaning-Based Search: Retrieve passages of text within documents that are semantically related to your query.
- Local Caching: Embeddings are cached locally, so documents do not need to be processed again.
- Flexible Model Support: Use pre-trained Sentence Transformer models from Hugging Face or from your local machine.
- Interactive UI: A web app that highlights search results and lets you navigate to the relevant sections of the document.

Note:

Currently, only PDF documents are supported.

Requirements

Python >= 3.11

Installation

Using `pipx` (Recommended):

Install pipx (if not already installed)

```
python3 -m pip install --user pipx
python3 -m pipx ensurepath
```

Install intra-search

```
pipx install intra-search
```

Using `pip`:

```
# create a new virtual environment
python3 -m venv .venv

# activate the virtual environment
source .venv/bin/activate

pip install intra-search
```

Usage

```
intra-search [OPTIONS] COMMAND [ARGS]...

Options:
  -d, --show-dir  Show the directory where document embeddings are cached.
  --help          Show this message and exit.

Commands:
  create  Create document embeddings
  list    List all cached embeddings
  remove  Remove embeddings
  start   Start the flask application (which serves both API and web app)
```

Start by creating vector embeddings for one or more documents using the `create` subcommand.
```
intra-search create doc1.pdf doc2.pdf doc3.pdf
```
Options:
- -m, --model: Name of the Sentence Transformer model used to generate embeddings (default = msmarco-distilbert-cos-v5).

  Any pre-trained community Sentence Transformer model (6000+ models) listed on the Hugging Face Hub can be used (e.g. "sentence-transformers/all-MiniLM-L6-v2"). Check out SBERT and All Sentence Transformer models on Hugging Face for a list of models that can be used.

  To use a model from your local machine, provide the path to the model.
- -n, --chunks: The number of words per chunk (default = 50).
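
To illustrate what a words-per-chunk setting means, here is a minimal sketch of fixed-size word chunking. It assumes plain whitespace tokenization and no overlap between chunks; the function name `chunk_words` is hypothetical and the code is not taken from Intra-Search itself, which may split text differently.

```python
def chunk_words(text: str, words_per_chunk: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most `words_per_chunk` words.

    Hypothetical helper for illustration only; not part of Intra-Search.
    """
    if words_per_chunk < 1:
        raise ValueError("words_per_chunk must be at least 1")
    words = text.split()  # assumption: words are separated by whitespace
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


# Toy usage: five words split into chunks of two words each.
chunks = chunk_words("one two three four five", words_per_chunk=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Smaller chunks give more precise matches but carry less context per result; larger chunks do the opposite.
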

Launch the web application, which runs on http://localhost:5000 by default.
```
intra-search start
```
Options:
- -p, --port: Specify the port for the web application (default = 5000).

Select a document embedding to query from the list of cached embeddings. Each option displays the embedding's details, such as the source document filename, the model used, and the chunk size.
Type a query into the text area and click Search.
- Each result is assigned a similarity score ranging from 0 to 1, where a higher score indicates greater relevance to the search query. The results are sorted in descending order of similarity.
- The number of results to be shown can be adjusted using the range slider input.
- All the texts corresponding to the search results are highlighted in the document.
- Clicking on a search result will scroll the document to the location of the highlighted text.
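
The ranking behaviour described above can be sketched as follows. This is an illustration only: the README does not state how Intra-Search computes its scores, so the sketch assumes plain cosine similarity over toy vectors with non-negative components (which keeps scores between 0 and 1). The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3):
    """Return up to k (score, chunk_index) pairs, highest similarity first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy embeddings standing in for real model output (assumed values).
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k([1.0, 0.0], chunk_vecs, k=2)
for score, index in results:
    print(f"chunk {index}: {score:.2f}")
```

Here `k` plays the role of the result-count slider: raising it returns more, lower-scoring matches.
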

Delete cached embeddings

```
# Example:
# delete all embeddings created using doc1.pdf, doc2.pdf, & doc3.pdf
intra-search remove doc1.pdf doc2.pdf doc3.pdf
```

monish-prabhu/Intra-Search

Folders and files

Latest commit

History

Repository files navigation

Intra-Search

Features:

Note:

Requirements

Installation

Using pipx (Recommended):

Using pip:

Usage

Delete cached embeddings

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Using `pipx` (Recommended):

Using `pip`:

Packages