Add vllm support #254
base: develop
Conversation
Reviewer's Guide

This PR adds support for vLLM and llama-cpp-based metadata enrichment backends and enhances the CLI with batch processing options via input files and output directories, along with minor formatting cleanups in the SRAweb module.

Sequence diagram for CLI batch metadata enrichment with input file and output directory:

```mermaid
sequenceDiagram
    actor User
    participant CLI
    participant SRAweb
    participant MetadataExtractor
    User->>CLI: Run metadata command with --input-file and --output-dir
    CLI->>CLI: Read IDs from input file
    CLI->>SRAweb: For each ID, call geo_metadata or sra_metadata
    SRAweb->>MetadataExtractor: Enrich metadata (may use vLLM or llama-cpp)
    MetadataExtractor-->>SRAweb: Return enriched metadata
    SRAweb-->>CLI: Return DataFrame
    CLI->>CLI: Save each DataFrame to output_dir/{ID}_metadata.csv
    CLI->>User: Print progress and completion
```
Class diagram for new and updated metadata enrichment backends:

```mermaid
classDiagram
    class MetadataExtractor {
        <<abstract>>
        +extract_metadata(text, fields)
        +enrich_dataframe(df, text_column, fields, prefix, show_progress)
    }
    class LlamaCppMetadataExtractor {
        +model_id: str
        +model_file: str
        +extract_metadata(text, fields)
        +enrich_dataframe(df, text_column, fields)
    }
    class VLLMMetadataExtractor {
        +model_id: str
        +port: int
        +extract_metadata(text, fields)
        +enrich_dataframe(df, text_column, fields, prefix, show_progress)
        +close()
    }
    class VLLMServerManager {
        +model_id: str
        +port: int
        +start(max_retries)
        +stop()
    }
    MetadataExtractor <|-- LlamaCppMetadataExtractor
    MetadataExtractor <|-- VLLMMetadataExtractor
    VLLMMetadataExtractor "1" *-- "1" VLLMServerManager
```
Hey there - I've reviewed your changes - here's some feedback:
Blocking issues:
- Detected subprocess function 'Popen' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)
General comments:
- Add explicit validation to error out when --output-dir is provided without --input-file to prevent confusing behavior in the metadata CLI.
- Remove or guard the interactive pip‐install prompts in VLLMServerManager and llama-cpp initializers, since they can block non‐interactive workflows—raise a clear error instead.
- There’s a lot of duplicated logic between the vLLM and llama-cpp extractors (prompt construction, confidence scoring); consider refactoring shared pieces into common helper methods or a base class.
## Individual Comments
### Comment 1
<location> `pysradb/cli.py:206` </location>
<code_context>
sradb = SRAweb()
+ # If input_file is provided, read IDs from file
+ if input_file:
+ if not os.path.exists(input_file):
+ console.print(f"[red]Error: Input file '{input_file}' not found[/red]")
</code_context>
<issue_to_address>
**suggestion:** Input file error handling could be more robust for permission issues.
Wrap the file opening in a try/except block to catch permission errors and display a clear error message if the file cannot be read.
</issue_to_address>
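A minimal sketch of the suggested guard, written as a hypothetical helper (the PR reads the file inline, and this function name is not from the PR):

```python
import os


def read_accession_ids(input_file):
    """Read one accession ID per non-blank line of `input_file`."""
    if not os.path.exists(input_file):
        raise FileNotFoundError(f"Input file '{input_file}' not found")
    try:
        with open(input_file) as f:
            ids = [line.strip() for line in f if line.strip()]
    except OSError as e:
        # Surface permission problems with a clear message instead of a traceback.
        raise RuntimeError(f"Cannot read '{input_file}': {e}") from e
    if not ids:
        raise ValueError(f"No IDs found in '{input_file}'")
    return ids
```

The CLI would catch these exceptions at one place and print the corresponding `[red]` message, rather than scattering `os.path.exists` checks.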
### Comment 2
<location> `pysradb/cli.py:220-226` </location>
<code_context>
+
+ # If output_dir is specified, process each ID separately
+ if output_dir:
+ os.makedirs(output_dir, exist_ok=True)
+ console.print(
+ f"[blue]Processing {len(file_ids)} IDs from '{input_file}'[/blue]"
</code_context>
<issue_to_address>
**suggestion:** Output directory creation does not handle permission or invalid path errors.
Wrap os.makedirs in a try/except block to handle and report errors if directory creation fails.
```suggestion
# If output_dir is specified, process each ID separately
if output_dir:
try:
os.makedirs(output_dir, exist_ok=True)
except Exception as e:
console.print(f"[red]Error: Failed to create output directory '{output_dir}': {e}[/red]")
sradb.close()
return
console.print(
f"[blue]Processing {len(file_ids)} IDs from '{input_file}'[/blue]"
)
console.print(f"[blue]Output directory: '{output_dir}'[/blue]")
```
</issue_to_address>
### Comment 3
<location> `pysradb/cli.py:267-270` </location>
<code_context>
+ output_file = os.path.join(
+ output_dir, f"{accession_id}_metadata.csv"
+ )
+ df.to_csv(output_file, index=False)
+ console.print(
+ f"[green] → Saved to {output_file} ({len(df)} rows)[/green]"
</code_context>
<issue_to_address>
**suggestion:** Saving CSV files does not handle write errors.
Wrap df.to_csv in a try/except block to catch and report file write errors for each ID, preventing unexpected termination.
```suggestion
try:
df.to_csv(output_file, index=False)
console.print(
f"[green] → Saved to {output_file} ({len(df)} rows)[/green]"
)
except Exception as write_err:
console.print(
f"[red] → Error saving to {output_file}: {str(write_err)}[/red]"
)
```
</issue_to_address>
### Comment 4
<location> `pysradb/llamacpp_enrichment.py:388` </location>
<code_context>
+
+ return enriched_df
+
+ def __del__(self):
+ """Cleanup llama.cpp model."""
+ if hasattr(self, "llm") and self.llm is not None:
</code_context>
<issue_to_address>
**suggestion:** Reliance on __del__ for resource cleanup may be unreliable.
Explicitly implement a close() method and support context management to ensure reliable resource cleanup.
</issue_to_address>
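One shape the suggested cleanup could take: an explicit `close()` plus context-manager protocol, with `__del__` demoted to a best-effort fallback. The class name and the stand-in `llm` attribute mirror the snippet above but are illustrative, not the PR's implementation:

```python
class LlamaCppExtractorSketch:
    """Sketch of explicit cleanup plus context management."""

    def __init__(self):
        self.llm = object()  # stands in for the loaded llama.cpp model

    def close(self):
        # Deterministic cleanup that callers can invoke directly.
        if getattr(self, "llm", None) is not None:
            self.llm = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()

    def __del__(self):
        # Best-effort fallback only; close() is the reliable path, since
        # __del__ timing is unspecified and may never run at interpreter exit.
        self.close()
```

Callers then get guaranteed cleanup with `with LlamaCppExtractorSketch() as ex: ...`, even when an exception escapes the block.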
### Comment 5
<location> `pysradb/vllm_enrichment.py:245-250` </location>
<code_context>
self.process = subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
)
</code_context>
<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'Popen' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
*Source: opengrep*
</issue_to_address>
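The finding above is tool output quoted verbatim (note: the stdlib helper for shell contexts is `shlex.quote`; `shlex.escape` does not exist). Since the PR passes a list to `Popen`, no shell is involved, so the practical mitigation is validating the values that end up in the argument list. A hedged sketch with an illustrative validator name:

```python
import re


def validate_server_args(model_id, port):
    """Validate inputs before they reach subprocess.Popen.

    Hypothetical pre-flight check: with list-form Popen there is no shell
    to inject into, so the residual risk is attacker-controlled argument
    values (e.g. a model_id that vLLM misinterprets).
    """
    # Hugging Face repo IDs are org/name with a restricted character set.
    if not re.fullmatch(r"[\w.\-]+(/[\w.\-]+)?", model_id):
        raise ValueError(f"Suspicious model id: {model_id!r}")
    if not (1024 <= int(port) <= 65535):
        raise ValueError(f"Port out of range: {port}")
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_id, "--port", str(port),
    ]
```

Passing the returned list to `subprocess.Popen` (never a joined string with `shell=True`) keeps the arguments from being interpreted by a shell.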
### Comment 6
<location> `pysradb/cli.py:246` </location>
<code_context>
enrich_backend if enrich_backend else "ollama/phi3"
</code_context>
<issue_to_address>
**suggestion (code-quality):** Replace if-expression with `or` ([`or-if-exp-identity`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/or-if-exp-identity))
```suggestion
enrich_backend or "ollama/phi3"
```
<br/><details><summary>Explanation</summary>Here a value is used when it is truthy, with a fallback default otherwise.
The `or` form is easier to read and avoids duplicating `enrich_backend`.
It works because the left-hand side is evaluated first: if `enrich_backend`
is truthy, it is used and the right-hand side is never evaluated; otherwise
the default `"ollama/phi3"` is used.
</details>
</issue_to_address>
### Comment 7
<location> `pysradb/cli.py:258` </location>
<code_context>
enrich_backend if enrich_backend else "ollama/phi3"
</code_context>
<issue_to_address>
**suggestion (code-quality):** Replace if-expression with `or` ([`or-if-exp-identity`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/or-if-exp-identity))
```suggestion
enrich_backend or "ollama/phi3"
```
<br/><details><summary>Explanation</summary>Here a value is used when it is truthy, with a fallback default otherwise.
The `or` form is easier to read and avoids duplicating `enrich_backend`.
It works because the left-hand side is evaluated first: if `enrich_backend`
is truthy, it is used and the right-hand side is never evaluated; otherwise
the default `"ollama/phi3"` is used.
</details>
</issue_to_address>
### Comment 8
<location> `pysradb/llamacpp_enrichment.py:166-168` </location>
<code_context>
if not self._check_llama_cpp_installed():
if not self._install_llama_cpp():
raise RuntimeError("llama-cpp-python is required but not installed")
</code_context>
<issue_to_address>
**suggestion (code-quality):** Merge nested if conditions ([`merge-nested-ifs`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/merge-nested-ifs))
```suggestion
if not self._check_llama_cpp_installed() and not self._install_llama_cpp():
raise RuntimeError("llama-cpp-python is required but not installed")
```
<br/><details><summary>Explanation</summary>Too much nesting can make code difficult to understand, and this is especially
true in Python, where there are no brackets to help out with the delineation of
different nesting levels.
Reading deeply nested code is confusing, since you have to keep track of which
conditions relate to which levels. We therefore strive to reduce nesting where
possible, and the situation where two `if` conditions can be combined using
`and` is an easy win.
</details>
</issue_to_address>
### Comment 9
<location> `pysradb/vllm_enrichment.py:165-167` </location>
<code_context>
if not self._check_vllm_installed():
if not self._install_vllm():
raise RuntimeError("vLLM is required but not installed")
</code_context>
<issue_to_address>
**suggestion (code-quality):** Merge nested if conditions ([`merge-nested-ifs`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/merge-nested-ifs))
```suggestion
if not self._check_vllm_installed() and not self._install_vllm():
raise RuntimeError("vLLM is required but not installed")
```
<br/><details><summary>Explanation</summary>Too much nesting can make code difficult to understand, and this is especially
true in Python, where there are no brackets to help out with the delineation of
different nesting levels.
Reading deeply nested code is confusing, since you have to keep track of which
conditions relate to which levels. We therefore strive to reduce nesting where
possible, and the situation where two `if` conditions can be combined using
`and` is an easy win.
</details>
</issue_to_address>
### Comment 10
<location> `pysradb/cli.py:191` </location>
<code_context>
def metadata(
srp_id,
assay,
desc,
detailed,
expand,
saveto,
enrich=False,
enrich_backend=None,
input_file=None,
output_dir=None,
):
sradb = SRAweb()
# If input_file is provided, read IDs from file
if input_file:
if not os.path.exists(input_file):
console.print(f"[red]Error: Input file '{input_file}' not found[/red]")
sradb.close()
return
with open(input_file, "r") as f:
file_ids = [line.strip() for line in f if line.strip()]
if not file_ids:
console.print(f"[red]Error: No IDs found in '{input_file}'[/red]")
sradb.close()
return
# If output_dir is specified, process each ID separately
if output_dir:
os.makedirs(output_dir, exist_ok=True)
console.print(
f"[blue]Processing {len(file_ids)} IDs from '{input_file}'[/blue]"
)
console.print(f"[blue]Output directory: '{output_dir}'[/blue]")
for idx, accession_id in enumerate(file_ids, 1):
console.print(
f"[green]Processing {idx}/{len(file_ids)}: {accession_id}[/green]"
)
# Determine if GSE or SRP/SRX/etc
is_gse = isinstance(
accession_id, str
) and accession_id.upper().startswith("GSE")
try:
if is_gse:
df = sradb.geo_metadata(
accession_id,
sample_attribute=desc,
detailed=detailed,
enrich=enrich,
enrich_backend=(
enrich_backend if enrich_backend else "ollama/phi3"
),
)
else:
df = sradb.sra_metadata(
accession_id,
assay=assay,
detailed=detailed,
sample_attribute=desc,
expand_sample_attributes=expand,
enrich=enrich,
enrich_backend=(
enrich_backend if enrich_backend else "ollama/phi3"
),
)
if df is not None and not df.empty:
# Save to individual file
output_file = os.path.join(
output_dir, f"{accession_id}_metadata.csv"
)
df.to_csv(output_file, index=False)
console.print(
f"[green] → Saved to {output_file} ({len(df)} rows)[/green]"
)
else:
console.print(
f"[yellow] → No metadata found for {accession_id}[/yellow]"
)
except Exception as e:
console.print(
f"[red] → Error processing {accession_id}: {str(e)}[/red]"
)
console.print(
f"[blue]Batch processing complete. Files saved to '{output_dir}'[/blue]"
)
sradb.close()
return
else:
# output_dir not specified, use file IDs as regular input
srp_id = file_ids
# Original single/multiple ID processing (when no input_file or no output_dir)
srp_ids = []
gse_ids = []
for accession in srp_id:
if isinstance(accession, str) and accession.upper().startswith("GSE"):
gse_ids.append(accession)
else:
srp_ids.append(accession)
metadata_frames = []
if srp_ids:
srp_metadata = sradb.sra_metadata(
srp_ids if len(srp_ids) > 1 else srp_ids[0],
assay=assay,
detailed=detailed,
sample_attribute=desc,
expand_sample_attributes=expand,
enrich=enrich,
enrich_backend=enrich_backend if enrich_backend else "ollama/phi3",
)
if srp_metadata is not None:
metadata_frames.append(srp_metadata)
if gse_ids:
geo_metadata_df = sradb.geo_metadata(
gse_ids if len(gse_ids) > 1 else gse_ids[0],
sample_attribute=desc,
detailed=detailed,
enrich=enrich,
enrich_backend=enrich_backend if enrich_backend else "ollama/phi3",
)
if not geo_metadata_df.empty:
metadata_frames.append(geo_metadata_df)
if metadata_frames:
df = pd.concat(metadata_frames, ignore_index=True, sort=False)
else:
df = pd.DataFrame()
_print_save_df(df, saveto)
sradb.close()
</code_context>
<issue_to_address>
**issue (code-quality):** Low code quality found in metadata - 6% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))
<br/><details><summary>Explanation</summary>The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.</details>
</issue_to_address>
### Comment 11
<location> `pysradb/llamacpp_enrichment.py:133` </location>
<code_context>
def _download_model(self) -> str:
"""Download model from Hugging Face and return the GGUF file path."""
from huggingface_hub import hf_hub_download
logger.info(f"Downloading model: {self.model_id}")
if self.model_file:
model_path = hf_hub_download(
repo_id=self.model_id, filename=self.model_file
)
logger.info(f"Downloaded model file: {model_path}")
return model_path
preferred_files = [
"llama3-openbiollm-8b.Q4_K_M.gguf",
"*Q4_K_M*.gguf",
"*Q5_K_M*.gguf",
"*Q4*.gguf",
"*.gguf",
]
from huggingface_hub import list_repo_files
try:
files = list_repo_files(repo_id=self.model_id)
gguf_files = [f for f in files if f.endswith(".gguf")]
if not gguf_files:
raise RuntimeError(f"No GGUF files found in {self.model_id}")
selected_file = None
for preferred in preferred_files:
if "*" in preferred:
pattern = preferred.replace("*", "")
matches = [f for f in gguf_files if pattern in f]
if matches:
selected_file = matches[0]
break
else:
if preferred in gguf_files:
selected_file = preferred
break
if not selected_file:
selected_file = gguf_files[0]
logger.info(f"Selected GGUF file: {selected_file}")
model_path = hf_hub_download(repo_id=self.model_id, filename=selected_file)
logger.info(f"Downloaded model: {model_path}")
return model_path
except Exception as e:
raise RuntimeError(f"Failed to download model: {e}")
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Extract code out into method ([`extract-method`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/extract-method/))
- Merge else clause's nested if statement into elif ([`merge-else-if-into-elif`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/merge-else-if-into-elif/))
- Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Explicitly raise from a previous error ([`raise-from-previous-error`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/raise-from-previous-error/))
</issue_to_address>
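The selection logic quoted above can absorb the extract-method, merge-else-if, and named-expression suggestions at once; a sketch with hypothetical names, mirroring the preference order in the PR:

```python
def select_gguf_file(gguf_files, preferred_patterns=None):
    """Pick the best GGUF file from a repo listing (extracted from
    _download_model as pure selection logic; name is illustrative)."""
    preferred_patterns = preferred_patterns or [
        "llama3-openbiollm-8b.Q4_K_M.gguf",
        "*Q4_K_M*.gguf", "*Q5_K_M*.gguf", "*Q4*.gguf", "*.gguf",
    ]
    for preferred in preferred_patterns:
        if "*" in preferred:
            pattern = preferred.replace("*", "")
            # Named expression collapses the build-then-check pair.
            if matches := [f for f in gguf_files if pattern in f]:
                return matches[0]
        elif preferred in gguf_files:  # merge-else-if-into-elif
            return preferred
    return gguf_files[0]


# In the caller, preserve the original exception (raise-from-previous-error):
#     except Exception as e:
#         raise RuntimeError(f"Failed to download model: {e}") from e
```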
### Comment 12
<location> `pysradb/llamacpp_enrichment.py:216-226` </location>
<code_context>
def _create_extraction_prompt(
self, metadata: str, fields: Optional[List[str]] = None
) -> str:
"""Create extraction prompt for the model."""
if fields is None:
fields = [
"cell_type",
"tissue",
"disease",
"organism",
"developmental_stage",
]
field_descriptions = {
"cell_type": "the specific cell type or cell line used",
"tissue": "the tissue or organ of origin",
"disease": "any disease or condition being studied",
"organism": "the organism or species",
"developmental_stage": "the developmental stage or age",
"treatment": "any treatments, drugs, or interventions applied",
"assay": "the experimental assay or technique used",
}
fields_text = "\n".join(
[f"- {field}: {field_descriptions.get(field, field)}" for field in fields]
)
prompt = f"""You are a bioinformatics expert. Extract the following information from the metadata below and return it as a JSON object.
Fields to extract:
{fields_text}
Metadata:
{metadata}
Return ONLY a valid JSON object with the extracted fields. Use null if information is not available. Do not include any other text.
JSON:"""
return prompt
</code_context>
<issue_to_address>
**issue (code-quality):** Inline variable that is immediately returned ([`inline-immediately-returned-variable`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/inline-immediately-returned-variable/))
</issue_to_address>
### Comment 13
<location> `pysradb/llamacpp_enrichment.py:244` </location>
<code_context>
def _compute_confidence_score(self, logprobs: List[float]) -> float:
"""
Compute confidence score from log probabilities.
Uses perplexity-based scoring:
- Lower perplexity = higher confidence
- Formula: confidence = 1 / (1 + perplexity / k) where k=10
Args:
logprobs: List of log probabilities for each token
Returns:
Confidence score between 0 and 1
"""
if not logprobs or len(logprobs) == 0:
return 0.5 # Default confidence when no logprobs
# Compute mean negative log-likelihood
mean_nll = -np.mean(logprobs)
# Compute perplexity
perplexity = np.exp(mean_nll)
# Convert to confidence score (0-1 range)
# Lower perplexity = higher confidence
confidence = 1.0 / (1.0 + perplexity / 10.0)
return float(confidence)
</code_context>
<issue_to_address>
**suggestion (code-quality):** We've found these issues:
- Remove redundant conditional ([`remove-redundant-if`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-redundant-if/))
- Simplify sequence length comparison ([`simplify-len-comparison`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/simplify-len-comparison/))
```suggestion
if not logprobs:
```
</issue_to_address>
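With the simplified guard, the scoring described in the docstring reduces to a few lines. A worked sketch of the formula (a hypothetical standalone function, using `math` instead of numpy since the inputs are a flat list):

```python
import math


def confidence_from_logprobs(logprobs, k=10.0):
    """confidence = 1 / (1 + perplexity / k), perplexity = exp(mean NLL)."""
    if not logprobs:  # covers None-like and empty, per the suggestion above
        return 0.5
    mean_nll = -sum(logprobs) / len(logprobs)
    return 1.0 / (1.0 + math.exp(mean_nll) / k)


# Near-certain tokens (logprob ~ 0) give perplexity ~ 1, so confidence
# approaches 1 / 1.1; the score is bounded away from 1 by construction.
```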
### Comment 14
<location> `pysradb/llamacpp_enrichment.py:303-309` </location>
<code_context>
def extract_metadata(
self, text: str, fields: Optional[List[str]] = None
) -> _MetadataExtraction:
"""
Extract metadata from text using llama.cpp.
Args:
text: Input text to extract metadata from
fields: List of fields to extract
Returns:
_MetadataExtraction object with extracted data and confidence scores
"""
prompt = self._create_extraction_prompt(text, fields)
try:
# Generate response with logprobs
response = self.llm(
prompt,
max_tokens=self.max_tokens,
temperature=self.temperature,
logprobs=1, # Request log probabilities
echo=False,
)
generated_text = response["choices"][0]["text"].strip()
logprobs_data = response["choices"][0].get("logprobs", {})
token_logprobs = logprobs_data.get("token_logprobs", [])
valid_logprobs = [lp for lp in token_logprobs if lp is not None]
confidence = self._compute_confidence_score(valid_logprobs)
json_str = generated_text
if "```json" in json_str:
json_str = json_str.split("```json")[1].split("```")[0].strip()
elif "```" in json_str:
json_str = json_str.split("```")[1].split("```")[0].strip()
if "{" in json_str and "}" in json_str:
start = json_str.find("{")
end = json_str.rfind("}") + 1
json_str = json_str[start:end]
extracted_data = json.loads(json_str)
field_confidences = {}
for field in extracted_data.keys():
if extracted_data[field] is not None:
field_confidences[f"{field}_confidence"] = confidence
else:
field_confidences[f"{field}_confidence"] = 0.0
return _MetadataExtraction(
success=True,
extracted_metadata=extracted_data,
confidence_scores=field_confidences,
error=None,
)
except json.JSONDecodeError as e:
logger.error(f"Failed to parse JSON response: {generated_text}")
return _MetadataExtraction(
success=False,
extracted_metadata={},
confidence_scores={},
error=f"JSON parsing error: {str(e)}",
)
except Exception as e:
logger.error(f"Extraction failed: {str(e)}")
return _MetadataExtraction(
success=False,
extracted_metadata={},
confidence_scores={},
error=str(e),
)
</code_context>
<issue_to_address>
**suggestion (code-quality):** We've found these issues:
- Convert for loop into dictionary comprehension ([`dict-comprehension`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/dict-comprehension/))
- Replace if statement with if expression ([`assign-if-exp`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/assign-if-exp/))
```suggestion
field_confidences = {
f"{field}_confidence": (
confidence if extracted_data[field] is not None else 0.0
)
for field in extracted_data.keys()
}
```
</issue_to_address>
### Comment 15
<location> `pysradb/metadata_enrichment.py:323-325` </location>
<code_context>
def _init_llamacpp_backend(self, model: Optional[str], **kwargs):
"""Initialize llama-cpp backend."""
from .llamacpp_enrichment import LlamaCppMetadataExtractor
model_id = self.provider[9:]
if model:
model_id = model
if not model_id:
model_id = "MoMonir/Llama3-OpenBioLLM-8B-GGUF"
self.logger.info(f"Using llama-cpp backend with model: {model_id}")
self._llamacpp_extractor = LlamaCppMetadataExtractor(
model_id=model_id, **kwargs
)
</code_context>
<issue_to_address>
**suggestion (code-quality):** We've found these issues:
- Move setting of default value for variable into `else` branch ([`introduce-default-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/introduce-default-else/))
- Replace if statement with if expression ([`assign-if-exp`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/assign-if-exp/))
```suggestion
model_id = model if model else self.provider[9:]
```
</issue_to_address>
### Comment 16
<location> `pysradb/metadata_enrichment.py:338-340` </location>
<code_context>
def _init_vllm_backend(self, model: Optional[str], **kwargs):
"""Initialize vLLM backend."""
from .vllm_enrichment import VLLMMetadataExtractor
model_id = self.provider[5:]
if model:
model_id = model
if not model_id:
model_id = "Qwen/Qwen2.5-1.5B-Instruct"
self.logger.info(f"Using vLLM backend with model: {model_id}")
self._vllm_extractor = VLLMMetadataExtractor(model_id=model_id, **kwargs)
</code_context>
<issue_to_address>
**suggestion (code-quality):** We've found these issues:
- Move setting of default value for variable into `else` branch ([`introduce-default-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/introduce-default-else/))
- Replace if statement with if expression ([`assign-if-exp`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/assign-if-exp/))
```suggestion
model_id = model if model else self.provider[5:]
```
</issue_to_address>
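Comments 15 and 16 flag the same three-way fallback in both backend initializers. Note the suggested one-liners keep only the first two branches; folding in the hard-coded default as well keeps behavior identical. A sketch with a hypothetical helper name shared by both initializers:

```python
DEFAULT_LLAMACPP = "MoMonir/Llama3-OpenBioLLM-8B-GGUF"


def resolve_model_id(explicit_model, provider_suffix, default):
    """Collapse the three-way fallback into one chained `or`:
    an explicit model wins, then the provider suffix, then the default."""
    return explicit_model or provider_suffix or default
```

Each `_init_*_backend` would call it as `resolve_model_id(model, self.provider[9:], DEFAULT_LLAMACPP)` (with the appropriate slice and default per backend).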
### Comment 17
<location> `pysradb/vllm_enrichment.py:98` </location>
<code_context>
def _download_model(self) -> str:
"""
Download model from Hugging Face.
Returns:
Path to the downloaded model
"""
try:
from huggingface_hub import snapshot_download
except ImportError:
raise ImportError(
"huggingface_hub is required for downloading models. "
"Install with: pip install huggingface_hub"
)
logger.info(f"Downloading model {self.model_id}...")
try:
model_path = snapshot_download(
repo_id=self.model_id,
allow_patterns=[
"*.gguf",
"*.safetensors",
"*.bin",
"*.json",
"*.txt",
"*.model",
"tokenizer*",
"config.json",
"generation_config.json",
],
)
logger.info(f"Model downloaded to {model_path}")
return model_path
except Exception as e:
raise RuntimeError(f"Failed to download model {self.model_id}: {e}")
</code_context>
<issue_to_address>
**issue (code-quality):** Explicitly raise from a previous error [×2] ([`raise-from-previous-error`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/raise-from-previous-error/))
</issue_to_address>
### Comment 18
<location> `pysradb/vllm_enrichment.py:155` </location>
<code_context>
def start(self, max_retries: int = 3) -> bool:
"""
Start vLLM server with automatic port retry on conflicts.
Args:
max_retries: Maximum number of port retries (default: 3)
Returns:
True if server started successfully
"""
if not self._check_vllm_installed():
if not self._install_vllm():
raise RuntimeError("vLLM is required but not installed")
# Check if port is already in use
if self._is_server_running():
logger.info(f"vLLM server already running on port {self.port}")
return True
model_path = self._download_model()
gguf_file = self._find_gguf_file(model_path)
if gguf_file:
logger.info(f"Using GGUF file: {gguf_file}")
model_arg = gguf_file
tokenizer = self._get_tokenizer_path()
else:
logger.info(f"Using model ID directly (non-GGUF): {self.model_id}")
model_arg = self.model_id
tokenizer = None
original_port = self.port
for attempt in range(max_retries):
current_port = original_port + attempt
logger.info(
f"Attempting to start vLLM server on port {current_port} (attempt {attempt + 1}/{max_retries})"
)
cmd = [
"python",
"-m",
"vllm.entrypoints.openai.api_server",
"--model",
model_arg,
"--port",
str(current_port),
"--max-logprobs",
str(self.max_logprobs),
"--gpu-memory-utilization",
str(self.gpu_memory_utilization),
]
if tokenizer:
cmd.extend(["--tokenizer", tokenizer])
if self.trust_remote_code:
cmd.append("--trust-remote-code")
# Add chat template to fix "default chat template is no longer allowed" error
if (
"llama" in self.model_id.lower()
or "openbiollm" in self.model_id.lower()
):
llama3_chat_template = (
"{% set loop_messages = messages %}"
"{% for message in loop_messages %}"
"{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' %}"
"{% if loop.index0 == 0 %}"
"{% set content = bos_token + content %}"
"{% endif %}"
"{{ content }}"
"{% endfor %}"
"{{ '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}"
)
cmd.extend(["--chat-template", llama3_chat_template])
elif "qwen" in self.model_id.lower():
qwen_chat_template = (
"{% for message in messages %}"
"{% if loop.first and messages[0]['role'] != 'system' %}"
"{{ '<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n' }}"
"{% endif %}"
"{{ '<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>\\n' }}"
"{% endfor %}"
"{{ '<|im_start|>assistant\\n' }}"
)
cmd.extend(["--chat-template", qwen_chat_template])
logger.info(f"Starting vLLM server: {' '.join(cmd)}")
self.process = subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
)
logger.info("Waiting for vLLM server to start...")
max_wait = 300
start_time = time.time()
while time.time() - start_time < max_wait:
self.port = current_port
self.base_url = f"http://localhost:{current_port}/v1"
if self._is_server_running():
logger.info(f"vLLM server started successfully on port {self.port}")
return True
if self.process.poll() is not None:
try:
stdout, stderr = self.process.communicate(timeout=5)
except:
logger.warning(
f"Process died on port {current_port}, but couldn't read output (timeout)"
)
stdout, stderr = "", ""
if stderr and (
"Address already in use" in stderr
or "OSError: [Errno 98]" in stderr
):
logger.warning(
f"Port {current_port} is already in use, trying next port..."
)
break
elif stderr:
logger.error(
f"vLLM server failed to start.\nStdout: {stdout}\nStderr: {stderr}"
)
raise RuntimeError(
f"vLLM server failed with error: {stderr[:500]}"
)
else:
logger.warning(
f"Process died on port {current_port}, retrying..."
)
break
time.sleep(2)
if not self._is_server_running():
stdout, stderr = "", ""
if self.process and self.process.poll() is None:
logger.warning(
f"Server didn't start within timeout on port {current_port}"
)
self.process.kill()
try:
stdout, stderr = self.process.communicate(timeout=5)
except:
self.process.terminate()
try:
stdout, stderr = self.process.communicate(timeout=2)
except:
pass
elif self.process:
try:
stdout, stderr = self.process.communicate(timeout=5)
except:
logger.warning(
f"Timed out reading process output on port {current_port}"
)
pass
if stderr:
logger.error(
f"vLLM server died during startup.\nStdout: {stdout[-1000:] if stdout else ''}\nStderr: {stderr[-1000:]}"
)
# Check for port conflict
if (
stderr
and "Address already in use" not in stderr
and "OSError: [Errno 98]" not in stderr
):
# Not a port conflict - this is a real error
if attempt < max_retries - 1:
logger.warning(
f"Server failed on port {current_port}, trying next port..."
)
else:
raise RuntimeError(
f"vLLM server failed with error: {stderr[-500:]}"
)
if attempt < max_retries - 1:
logger.info(f"Retrying with next port...")
continue
else:
# Last attempt failed
logger.error(f"Failed to start server after {max_retries} attempts")
raise RuntimeError(
f"vLLM server failed to start within timeout after {max_retries} port attempts"
)
# Should not reach here
raise RuntimeError("vLLM server failed to start")
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Hoist statements out of for/while loops ([`hoist-statement-from-loop`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/hoist-statement-from-loop/))
- Remove redundant pass statement ([`remove-redundant-pass`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-redundant-pass/))
- Remove redundant continue statement ([`remove-redundant-continue`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-redundant-continue/))
- Use `except Exception:` rather than bare `except:` [×4] ([`do-not-use-bare-except`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/do-not-use-bare-except/))
- Low code quality found in VLLMServerManager.start - 7% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))
<br/><details><summary>Explanation</summary>
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.</details>
</issue_to_address>
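One way to act on the refactoring hints above is to pull the stderr classification out of the retry loop into a small helper with guard clauses. The helper name and marker strings below are illustrative, not code from this PR:

```python
# Hypothetical helper extracted from VLLMServerManager.start: classify a
# failed startup so the retry loop only has to branch on the result.
PORT_CONFLICT_MARKERS = ("Address already in use", "OSError: [Errno 98]")


def classify_startup_failure(stderr: str) -> str:
    """Return 'port_conflict' when the server died on a busy port, else 'fatal'."""
    if not stderr:
        return "fatal"
    if any(marker in stderr for marker in PORT_CONFLICT_MARKERS):
        return "port_conflict"
    return "fatal"
```

With this in place, the loop body shrinks to "classify, then either try the next port or raise", which addresses both the nesting and the working-memory complaints.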
### Comment 19
<location> `pysradb/vllm_enrichment.py:360` </location>
<code_context>
def _is_server_running(self) -> bool:
"""Check if vLLM server is running."""
try:
response = requests.get(f"{self.base_url}/models", timeout=5)
return response.status_code == 200
except:
return False
</code_context>
<issue_to_address>
**suggestion (code-quality):** Use `except Exception:` rather than bare `except:` ([`do-not-use-bare-except`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/do-not-use-bare-except/))
```suggestion
except Exception:
```
</issue_to_address>
### Comment 20
<location> `pysradb/vllm_enrichment.py:469-474` </location>
<code_context>
def _compute_confidence_score(self, logprobs: List[Dict[str, Any]]) -> float:
"""
Compute confidence score from token logprobs.
Uses mean negative log-likelihood (NLL) and converts to a 0-1 score.
Args:
logprobs: List of logprob dictionaries from vLLM
Returns:
Confidence score (0-1, higher is better)
"""
if not logprobs:
return 0.0
# Extract log probabilities for the top token
log_probs = []
for token_logprob in logprobs:
if token_logprob and "logprob" in token_logprob:
log_probs.append(token_logprob["logprob"])
if not log_probs:
return 0.0
# Compute mean NLL
mean_nll = -np.mean(log_probs)
# Convert to perplexity
perplexity = np.exp(mean_nll)
# Convert perplexity to 0-1 confidence score
# Lower perplexity = higher confidence
# Use sigmoid-like transformation: confidence = 1 / (1 + perplexity/k)
# where k is a scaling factor (e.g., 10)
k = 10.0
confidence = 1.0 / (1.0 + perplexity / k)
return float(confidence)
</code_context>
<issue_to_address>
**suggestion (code-quality):** Convert for loop into list comprehension ([`list-comprehension`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/list-comprehension/))
```suggestion
log_probs = [
token_logprob["logprob"]
for token_logprob in logprobs
if token_logprob and "logprob" in token_logprob
]
```
</issue_to_address>
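Aside from the comprehension, the confidence transform in this function is easy to check in isolation. A minimal sketch of the same mean-NLL-to-perplexity-to-score formula (with the PR's scaling factor k=10):

```python
import math


def confidence_from_logprobs(log_probs, k=10.0):
    """Mean NLL -> perplexity -> bounded 0-1 score, mirroring the formula above."""
    if not log_probs:
        return 0.0
    mean_nll = -sum(log_probs) / len(log_probs)
    perplexity = math.exp(mean_nll)
    # Lower perplexity => higher confidence; score approaches 1 as perplexity -> 0
    return 1.0 / (1.0 + perplexity / k)
```

Tokens the model was certain about (logprob near 0, i.e. probability near 1) give perplexity 1 and a score near 0.91; increasingly negative logprobs push the score toward 0.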
### Comment 21
<location> `pysradb/vllm_enrichment.py:500` </location>
<code_context>
def _call_vllm(self, prompt: str) -> Tuple[Dict[str, Any], Dict[str, float]]:
"""
Call vLLM API with the prompt.
Returns:
Tuple of (extracted_data, confidence_scores)
"""
try:
response = requests.post(
f"{self.base_url}/chat/completions",
json={
"model": self.served_model_name,
"messages": [{"role": "user", "content": prompt}],
"temperature": self.temperature,
"max_tokens": self.max_tokens,
"logprobs": True,
"top_logprobs": 5,
},
timeout=60,
)
response.raise_for_status()
result = response.json()
content = result["choices"][0]["message"]["content"]
logprobs = result["choices"][0].get("logprobs", {}).get("content", [])
try:
if "```json" in content:
json_start = content.find("```json") + 7
json_end = content.find("```", json_start)
content = content[json_start:json_end].strip()
elif "```" in content:
json_start = content.find("```") + 3
json_end = content.find("```", json_start)
content = content[json_start:json_end].strip()
else:
import re
json_match = re.search(r'\{[^}]*"organ"[^}]*\}', content, re.DOTALL)
if json_match:
content = json_match.group(0)
data = json.loads(content)
except json.JSONDecodeError as e:
logger.error(f"Failed to parse JSON response: {content[:500]}")
raise RuntimeError(f"Failed to parse LLM response as JSON: {e}")
overall_confidence = self._compute_confidence_score(logprobs)
confidence_scores = {field: overall_confidence for field in data.keys()}
return data, confidence_scores
except requests.exceptions.RequestException as e:
raise RuntimeError(f"vLLM API call failed: {e}")
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Explicitly raise from a previous error [×2] ([`raise-from-previous-error`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/raise-from-previous-error/))
</issue_to_address>
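The `raise-from-previous-error` suggestion applies to the JSON-parsing branch above. A minimal sketch of the chained form (function name is illustrative):

```python
import json


def parse_llm_json(content: str) -> dict:
    """Parse model output as JSON, chaining the original error per the review note."""
    try:
        return json.loads(content)
    except json.JSONDecodeError as e:
        # 'from e' attaches the original traceback as __cause__ instead of hiding it
        raise RuntimeError(f"Failed to parse LLM response as JSON: {e}") from e
```

The chained exception preserves the original `JSONDecodeError` (line and column of the parse failure) for anyone debugging a malformed model response.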
### Comment 22
<location> `pysradb/vllm_enrichment.py:561` </location>
<code_context>
def extract_metadata(
self, text: str, fields: Optional[List[str]] = None
) -> Dict[str, Any]:
"""
Extract metadata from text using vLLM.
Args:
text: Input text
fields: List of fields to extract
Returns:
Dictionary with extracted metadata and confidence scores
"""
if not text or text.strip() == "":
default_fields = {
"organ": "Unknown",
"tissue": "Unknown",
"anatomical_system": "Unknown",
"cell_type": "Unknown",
"disease": "Unknown",
"sex": "Unknown",
"development_stage": "Unknown",
"assay": "Unknown",
"organism": "Unknown",
}
confidence_fields = {f"{k}_confidence": 0.0 for k in default_fields.keys()}
return {**default_fields, **confidence_fields}
prompt = self._create_extraction_prompt(text, fields)
data, confidence_scores = self._call_vllm(prompt)
# Add confidence scores to the result
result = {}
for field, value in data.items():
result[field] = value
result[f"{field}_confidence"] = confidence_scores.get(field, 0.0)
return result
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Replaces an empty collection equality with a boolean operation ([`simplify-empty-collection-comparison`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/simplify-empty-collection-comparison/))
- Remove unnecessary call to keys() ([`remove-dict-keys`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-dict-keys/))
- Applies the dictionary union operator where applicable ([`use-dictionary-union`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-dictionary-union/))
</issue_to_address>
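The dictionary-union suggestion boils down to replacing `{**a, **b}` with `a | b` (PEP 584, Python 3.9+). A sketch of the empty-text default path using that operator, with the field list parameterized for brevity:

```python
def empty_result(fields):
    """Default 'Unknown' payload plus zero confidences, built with dict union."""
    defaults = {field: "Unknown" for field in fields}
    confidences = {f"{field}_confidence": 0.0 for field in fields}
    return defaults | confidences  # equivalent to {**defaults, **confidences}
```

Note that `|` requires Python 3.9+; if pysradb still supports 3.8, the `{**a, **b}` spelling in the PR is the safer choice.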
### Comment 23
<location> `pysradb/vllm_enrichment.py:609-615` </location>
<code_context>
def enrich_dataframe(
self,
df: pd.DataFrame,
text_column: Optional[str] = None,
fields: Optional[List[str]] = None,
prefix: str = "guessed_",
show_progress: bool = True,
) -> pd.DataFrame:
"""
Enrich a DataFrame with extracted metadata and confidence scores.
Args:
df: Input DataFrame
text_column: Column containing text to analyze
fields: List of metadata fields to extract
prefix: Prefix for new columns
show_progress: Show progress bar
Returns:
DataFrame with additional metadata columns and confidence scores
"""
# Reuse parent class logic for text column detection
enriched_df = super().enrich_dataframe(
df=df,
text_column=text_column,
fields=fields,
prefix=prefix,
show_progress=show_progress,
)
return enriched_df
</code_context>
<issue_to_address>
**issue (code-quality):** Inline variable that is immediately returned ([`inline-immediately-returned-variable`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/inline-immediately-returned-variable/))
</issue_to_address>
    # If output_dir is specified, process each ID separately
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
        console.print(
            f"[blue]Processing {len(file_ids)} IDs from '{input_file}'[/blue]"
        )
        console.print(f"[blue]Output directory: '{output_dir}'[/blue]")
suggestion: Output directory creation does not handle permission or invalid path errors.
Wrap os.makedirs in a try/except block to handle and report errors if directory creation fails.
```suggestion
    # If output_dir is specified, process each ID separately
    if output_dir:
        try:
            os.makedirs(output_dir, exist_ok=True)
        except Exception as e:
            console.print(
                f"[red]Error: Failed to create output directory '{output_dir}': {e}[/red]"
            )
            sradb.close()
            return
        console.print(
            f"[blue]Processing {len(file_ids)} IDs from '{input_file}'[/blue]"
        )
        console.print(f"[blue]Output directory: '{output_dir}'[/blue]")
```
    self.process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
security (python.lang.security.audit.dangerous-subprocess-use-audit): Detected subprocess function 'Popen' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
Source: opengrep
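Since `cmd` here is a Python list passed without `shell=True`, the arguments never pass through a shell, so shell metacharacters in a model ID stay literal. A quick demonstration of that property (using the running interpreter as a stand-in child process):

```python
import subprocess
import sys

# List-form argv is handed to the OS directly; the child sees the raw string.
untrusted = "model; rm -rf /"
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted],
    capture_output=True,
    text=True,
)
# The ';' is never interpreted by a shell; the child just echoes the argument.
```

The audit finding is still worth a look if any element of `cmd` is user-controlled, but list-form `Popen` without `shell=True` is the accepted mitigation.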
        Returns:
            Dictionary with extracted metadata and confidence scores
        """
        if not text or text.strip() == "":
issue (code-quality): We've found these issues:
- Replaces an empty collection equality with a boolean operation (`simplify-empty-collection-comparison`)
- Remove unnecessary call to keys() (`remove-dict-keys`)
- Applies the dictionary union operator where applicable (`use-dictionary-union`)
    enriched_df = super().enrich_dataframe(
        df=df,
        text_column=text_column,
        fields=fields,
        prefix=prefix,
        show_progress=show_progress,
    )
issue (code-quality): Inline variable that is immediately returned (`inline-immediately-returned-variable`)
Pull Request Overview
This PR adds support for vLLM and llama.cpp backends for metadata enrichment, along with confidence scoring based on token log probabilities. It also introduces batch processing capabilities for CLI metadata extraction.
- Adds new vLLM backend with GGUF model support and confidence scoring
- Adds new llama.cpp backend with local model support
- Extends CLI to support batch processing from input files with individual output files
- Minor formatting improvements in existing code
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| pysradb/vllm_enrichment.py | New module implementing vLLM-based metadata extraction with server management and confidence scoring |
| pysradb/llamacpp_enrichment.py | New module implementing llama.cpp-based metadata extraction with GGUF model support |
| pysradb/metadata_enrichment.py | Extended LLMMetadataExtractor to integrate vLLM and llama.cpp backends |
| pysradb/cli.py | Added batch processing support with --input-file and --output-dir options; fixed import ordering |
| pysradb/sraweb.py | Code formatting improvements (whitespace, line breaks) |
Comments suppressed due to low confidence (5)
- pysradb/vllm_enrichment.py:267 - Except block directly handles BaseException (`except:`).
- pysradb/vllm_enrichment.py:306 - Except block directly handles BaseException (`except:`).
- pysradb/vllm_enrichment.py:310 - Except block directly handles BaseException (`except:`).
- pysradb/vllm_enrichment.py:315 - Except block directly handles BaseException (`except:`).
- pysradb/vllm_enrichment.py:360 - Except block directly handles BaseException (`except:`).
    if self.process.poll() is not None:
        try:
            stdout, stderr = self.process.communicate(timeout=5)
        except:
Copilot AI · Nov 8, 2025
Bare except clause catches all exceptions including KeyboardInterrupt and SystemExit. Use except Exception: or a more specific exception type (e.g., subprocess.TimeoutExpired) instead.
```suggestion
        except subprocess.TimeoutExpired:
```
    self.process.kill()
    try:
        stdout, stderr = self.process.communicate(timeout=5)
    except:
Copilot AI · Nov 8, 2025
Bare except clause catches all exceptions including KeyboardInterrupt and SystemExit. Use except Exception: or a more specific exception type (e.g., subprocess.TimeoutExpired) instead.
    elif self.process:
        try:
            stdout, stderr = self.process.communicate(timeout=5)
        except:
Copilot AI · Nov 8, 2025
Bare except clause catches all exceptions including KeyboardInterrupt and SystemExit. Use except Exception: or a more specific exception type (e.g., subprocess.TimeoutExpired) instead.
```suggestion
        except Exception:
```
pysradb/vllm_enrichment.py (Outdated)
    try:
        response = requests.get(f"{self.base_url}/models", timeout=5)
        return response.status_code == 200
    except:
Copilot AI · Nov 8, 2025
Bare except clause catches all exceptions including KeyboardInterrupt and SystemExit. Use except Exception: or a more specific exception type (e.g., requests.RequestException) instead.
```suggestion
    except requests.RequestException:
```
    import json
    import logging
    import os
    from typing import Any, Dict, List, Optional, Tuple
Copilot AI · Nov 8, 2025
Import of 'Any' is not used.
Import of 'Dict' is not used.
Import of 'Tuple' is not used.
```suggestion
from typing import List, Optional
```
    from tqdm.autonotebook import tqdm
Copilot AI · Nov 8, 2025
Import of 'tqdm' is not used.
```suggestion
```
pysradb/vllm_enrichment.py (Outdated)
    except:
        pass
Copilot AI · Nov 8, 2025
'except' clause does nothing but pass and there is no explanatory comment.
```suggestion
    except Exception as e:
        logger.warning(
            f"Exception while reading process output after terminate: {e}"
        )
```
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com> Signed-off-by: Saket Choudhary <[email protected]>
Co-authored-by: Copilot <[email protected]> Signed-off-by: Saket Choudhary <[email protected]>
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com> Signed-off-by: Saket Choudhary <[email protected]>
Co-authored-by: Copilot <[email protected]> Signed-off-by: Saket Choudhary <[email protected]>
Co-authored-by: Copilot <[email protected]> Signed-off-by: Saket Choudhary <[email protected]>
Validate that at least one accession ID is provided when --input-file is not used. This ensures the CLI contract is maintained now that srp_id uses nargs='*' instead of nargs='+'. Users receive a clear error message and usage instructions if they attempt to run 'pysradb metadata' without providing any IDs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
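The CLI contract described in that commit can be sketched with a minimal argparse parser: `srp_id` is now optional positionally (`nargs='*'`), so the validation step must enforce that IDs arrive from at least one source. Names below are illustrative, not the actual pysradb CLI code:

```python
import argparse


def build_parser():
    """Sketch of the metadata subcommand's two ID sources."""
    parser = argparse.ArgumentParser(prog="metadata-sketch")
    parser.add_argument("srp_id", nargs="*")  # was nargs='+' before this change
    parser.add_argument("--input-file", default=None)
    return parser


def validate(args):
    """Require IDs from the positional list or from --input-file."""
    if not args.srp_id and not args.input_file:
        raise SystemExit("error: provide at least one accession ID or use --input-file")
    return True
```

With `nargs='+'` argparse itself rejected an empty ID list, so moving to `nargs='*'` shifts that responsibility onto an explicit check like `validate`.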
Summary by Sourcery
Add support for local vLLM and llama-cpp metadata enrichment backends with confidence scoring and extend the metadata command to accept file-based batch inputs with per-ID output directories.
New Features:
Enhancements: