Add vllm support #254
base: develop
Conversation
Reviewer's Guide

This PR adds support for vLLM and llama-cpp-based metadata enrichment backends and enhances the CLI with batch processing options via input files and output directories, along with minor formatting cleanups in the SRAweb module.

Sequence diagram for CLI batch metadata enrichment with input file and output directory:

```mermaid
sequenceDiagram
    actor User
    participant CLI
    participant SRAweb
    participant MetadataExtractor
    User->>CLI: Run metadata command with --input-file and --output-dir
    CLI->>CLI: Read IDs from input file
    CLI->>SRAweb: For each ID, call geo_metadata or sra_metadata
    SRAweb->>MetadataExtractor: Enrich metadata (may use vLLM or llama-cpp)
    MetadataExtractor-->>SRAweb: Return enriched metadata
    SRAweb-->>CLI: Return DataFrame
    CLI->>CLI: Save each DataFrame to output_dir/{ID}_metadata.csv
    CLI->>User: Print progress and completion
```
Class diagram for new and updated metadata enrichment backends:

```mermaid
classDiagram
    class MetadataExtractor {
        <<abstract>>
        +extract_metadata(text, fields)
        +enrich_dataframe(df, text_column, fields, prefix, show_progress)
    }
    class LlamaCppMetadataExtractor {
        +model_id: str
        +model_file: str
        +extract_metadata(text, fields)
        +enrich_dataframe(df, text_column, fields)
    }
    class VLLMMetadataExtractor {
        +model_id: str
        +port: int
        +extract_metadata(text, fields)
        +enrich_dataframe(df, text_column, fields, prefix, show_progress)
        +close()
    }
    class VLLMServerManager {
        +model_id: str
        +port: int
        +start(max_retries)
        +stop()
    }
    MetadataExtractor <|-- LlamaCppMetadataExtractor
    MetadataExtractor <|-- VLLMMetadataExtractor
    VLLMMetadataExtractor "1" *-- "1" VLLMServerManager
```
Hey there - I've reviewed your changes - here's some feedback:
Blocking issues:
- Detected subprocess function 'Popen' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)
General comments:
- Add explicit validation to error out when --output-dir is provided without --input-file to prevent confusing behavior in the metadata CLI.
- Remove or guard the interactive pip‐install prompts in VLLMServerManager and llama-cpp initializers, since they can block non‐interactive workflows—raise a clear error instead.
- There’s a lot of duplicated logic between the vLLM and llama-cpp extractors (prompt construction, confidence scoring); consider refactoring shared pieces into common helper methods or a base class.
## Individual Comments
### Comment 1
<location> `pysradb/cli.py:206` </location>
<code_context>
sradb = SRAweb()
+ # If input_file is provided, read IDs from file
+ if input_file:
+ if not os.path.exists(input_file):
+ console.print(f"[red]Error: Input file '{input_file}' not found[/red]")
</code_context>
<issue_to_address>
**suggestion:** Input file error handling could be more robust for permission issues.
Wrap the file opening in a try/except block to catch permission errors and display a clear error message if the file cannot be read.
</issue_to_address>
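A minimal sketch of the suggested guard, written as a hypothetical helper (the PR reads the file inline, and this function name is not from the PR):

```python
import os


def read_accession_ids(input_file):
    """Read one accession ID per non-blank line of `input_file`."""
    if not os.path.exists(input_file):
        raise FileNotFoundError(f"Input file '{input_file}' not found")
    try:
        with open(input_file) as f:
            ids = [line.strip() for line in f if line.strip()]
    except OSError as e:
        # Surface permission problems with a clear message instead of a traceback.
        raise RuntimeError(f"Cannot read '{input_file}': {e}") from e
    if not ids:
        raise ValueError(f"No IDs found in '{input_file}'")
    return ids
```

The CLI would catch these exceptions at one place and print the corresponding `[red]` message, rather than scattering `os.path.exists` checks.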
### Comment 2
<location> `pysradb/cli.py:220-226` </location>
<code_context>
+
+ # If output_dir is specified, process each ID separately
+ if output_dir:
+ os.makedirs(output_dir, exist_ok=True)
+ console.print(
+ f"[blue]Processing {len(file_ids)} IDs from '{input_file}'[/blue]"
</code_context>
<issue_to_address>
**suggestion:** Output directory creation does not handle permission or invalid path errors.
Wrap os.makedirs in a try/except block to handle and report errors if directory creation fails.
```suggestion
# If output_dir is specified, process each ID separately
if output_dir:
try:
os.makedirs(output_dir, exist_ok=True)
except Exception as e:
console.print(f"[red]Error: Failed to create output directory '{output_dir}': {e}[/red]")
sradb.close()
return
console.print(
f"[blue]Processing {len(file_ids)} IDs from '{input_file}'[/blue]"
)
console.print(f"[blue]Output directory: '{output_dir}'[/blue]")
```
</issue_to_address>
### Comment 3
<location> `pysradb/cli.py:267-270` </location>
<code_context>
+ output_file = os.path.join(
+ output_dir, f"{accession_id}_metadata.csv"
+ )
+ df.to_csv(output_file, index=False)
+ console.print(
+ f"[green] → Saved to {output_file} ({len(df)} rows)[/green]"
</code_context>
<issue_to_address>
**suggestion:** Saving CSV files does not handle write errors.
Wrap df.to_csv in a try/except block to catch and report file write errors for each ID, preventing unexpected termination.
```suggestion
try:
df.to_csv(output_file, index=False)
console.print(
f"[green] → Saved to {output_file} ({len(df)} rows)[/green]"
)
except Exception as write_err:
console.print(
f"[red] → Error saving to {output_file}: {str(write_err)}[/red]"
)
```
</issue_to_address>
### Comment 4
<location> `pysradb/llamacpp_enrichment.py:388` </location>
<code_context>
+
+ return enriched_df
+
+ def __del__(self):
+ """Cleanup llama.cpp model."""
+ if hasattr(self, "llm") and self.llm is not None:
</code_context>
<issue_to_address>
**suggestion:** Reliance on __del__ for resource cleanup may be unreliable.
Explicitly implement a close() method and support context management to ensure reliable resource cleanup.
</issue_to_address>
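One shape the suggested cleanup could take: an explicit `close()` plus context-manager protocol, with `__del__` demoted to a best-effort fallback. The class name and the stand-in `llm` attribute mirror the snippet above but are illustrative, not the PR's implementation:

```python
class LlamaCppExtractorSketch:
    """Sketch of explicit cleanup plus context management."""

    def __init__(self):
        self.llm = object()  # stands in for the loaded llama.cpp model

    def close(self):
        # Deterministic cleanup that callers can invoke directly.
        if getattr(self, "llm", None) is not None:
            self.llm = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()

    def __del__(self):
        # Best-effort fallback only; close() is the reliable path, since
        # __del__ timing is unspecified and may never run at interpreter exit.
        self.close()
```

Callers then get guaranteed cleanup with `with LlamaCppExtractorSketch() as ex: ...`, even when an exception escapes the block.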
### Comment 5
<location> `pysradb/vllm_enrichment.py:245-250` </location>
<code_context>
self.process = subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
)
</code_context>
<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'Popen' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
*Source: opengrep*
</issue_to_address>
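The finding above is tool output quoted verbatim (note: the stdlib helper for shell contexts is `shlex.quote`; `shlex.escape` does not exist). Since the PR passes a list to `Popen`, no shell is involved, so the practical mitigation is validating the values that end up in the argument list. A hedged sketch with an illustrative validator name:

```python
import re


def validate_server_args(model_id, port):
    """Validate inputs before they reach subprocess.Popen.

    Hypothetical pre-flight check: with list-form Popen there is no shell
    to inject into, so the residual risk is attacker-controlled argument
    values (e.g. a model_id that vLLM misinterprets).
    """
    # Hugging Face repo IDs are org/name with a restricted character set.
    if not re.fullmatch(r"[\w.\-]+(/[\w.\-]+)?", model_id):
        raise ValueError(f"Suspicious model id: {model_id!r}")
    if not (1024 <= int(port) <= 65535):
        raise ValueError(f"Port out of range: {port}")
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_id, "--port", str(port),
    ]
```

Passing the returned list to `subprocess.Popen` (never a joined string with `shell=True`) keeps the arguments from being interpreted by a shell.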
### Comment 6
<location> `pysradb/cli.py:246` </location>
<code_context>
enrich_backend if enrich_backend else "ollama/phi3"
</code_context>
<issue_to_address>
**suggestion (code-quality):** Replace if-expression with `or` ([`or-if-exp-identity`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/or-if-exp-identity))
```suggestion
enrich_backend or "ollama/phi3"
```
<br/><details><summary>Explanation</summary>Here a value is used when it is truthy, with a fallback default otherwise.
The `or` form is easier to read and avoids duplicating `enrich_backend`.
It works because the left-hand side is evaluated first: if `enrich_backend`
is truthy, it is used and the right-hand side is never evaluated; otherwise
the default `"ollama/phi3"` is used.
</details>
</issue_to_address>
### Comment 7
<location> `pysradb/cli.py:258` </location>
<code_context>
enrich_backend if enrich_backend else "ollama/phi3"
</code_context>
<issue_to_address>
**suggestion (code-quality):** Replace if-expression with `or` ([`or-if-exp-identity`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/or-if-exp-identity))
```suggestion
enrich_backend or "ollama/phi3"
```
<br/><details><summary>Explanation</summary>Here a value is used when it is truthy, with a fallback default otherwise.
The `or` form is easier to read and avoids duplicating `enrich_backend`.
It works because the left-hand side is evaluated first: if `enrich_backend`
is truthy, it is used and the right-hand side is never evaluated; otherwise
the default `"ollama/phi3"` is used.
</details>
</issue_to_address>
### Comment 8
<location> `pysradb/llamacpp_enrichment.py:166-168` </location>
<code_context>
if not self._check_llama_cpp_installed():
if not self._install_llama_cpp():
raise RuntimeError("llama-cpp-python is required but not installed")
</code_context>
<issue_to_address>
**suggestion (code-quality):** Merge nested if conditions ([`merge-nested-ifs`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/merge-nested-ifs))
```suggestion
if not self._check_llama_cpp_installed() and not self._install_llama_cpp():
raise RuntimeError("llama-cpp-python is required but not installed")
```
<br/><details><summary>Explanation</summary>Too much nesting can make code difficult to understand, and this is especially
true in Python, where there are no brackets to help out with the delineation of
different nesting levels.
Reading deeply nested code is confusing, since you have to keep track of which
conditions relate to which levels. We therefore strive to reduce nesting where
possible, and the situation where two `if` conditions can be combined using
`and` is an easy win.
</details>
</issue_to_address>
### Comment 9
<location> `pysradb/vllm_enrichment.py:165-167` </location>
<code_context>
if not self._check_vllm_installed():
if not self._install_vllm():
raise RuntimeError("vLLM is required but not installed")
</code_context>
<issue_to_address>
**suggestion (code-quality):** Merge nested if conditions ([`merge-nested-ifs`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/merge-nested-ifs))
```suggestion
if not self._check_vllm_installed() and not self._install_vllm():
raise RuntimeError("vLLM is required but not installed")
```
<br/><details><summary>Explanation</summary>Too much nesting can make code difficult to understand, and this is especially
true in Python, where there are no brackets to help out with the delineation of
different nesting levels.
Reading deeply nested code is confusing, since you have to keep track of which
conditions relate to which levels. We therefore strive to reduce nesting where
possible, and the situation where two `if` conditions can be combined using
`and` is an easy win.
</details>
</issue_to_address>
### Comment 10
<location> `pysradb/cli.py:191` </location>
<code_context>
def metadata(
srp_id,
assay,
desc,
detailed,
expand,
saveto,
enrich=False,
enrich_backend=None,
input_file=None,
output_dir=None,
):
sradb = SRAweb()
# If input_file is provided, read IDs from file
if input_file:
if not os.path.exists(input_file):
console.print(f"[red]Error: Input file '{input_file}' not found[/red]")
sradb.close()
return
with open(input_file, "r") as f:
file_ids = [line.strip() for line in f if line.strip()]
if not file_ids:
console.print(f"[red]Error: No IDs found in '{input_file}'[/red]")
sradb.close()
return
# If output_dir is specified, process each ID separately
if output_dir:
os.makedirs(output_dir, exist_ok=True)
console.print(
f"[blue]Processing {len(file_ids)} IDs from '{input_file}'[/blue]"
)
console.print(f"[blue]Output directory: '{output_dir}'[/blue]")
for idx, accession_id in enumerate(file_ids, 1):
console.print(
f"[green]Processing {idx}/{len(file_ids)}: {accession_id}[/green]"
)
# Determine if GSE or SRP/SRX/etc
is_gse = isinstance(
accession_id, str
) and accession_id.upper().startswith("GSE")
try:
if is_gse:
df = sradb.geo_metadata(
accession_id,
sample_attribute=desc,
detailed=detailed,
enrich=enrich,
enrich_backend=(
enrich_backend if enrich_backend else "ollama/phi3"
),
)
else:
df = sradb.sra_metadata(
accession_id,
assay=assay,
detailed=detailed,
sample_attribute=desc,
expand_sample_attributes=expand,
enrich=enrich,
enrich_backend=(
enrich_backend if enrich_backend else "ollama/phi3"
),
)
if df is not None and not df.empty:
# Save to individual file
output_file = os.path.join(
output_dir, f"{accession_id}_metadata.csv"
)
df.to_csv(output_file, index=False)
console.print(
f"[green] → Saved to {output_file} ({len(df)} rows)[/green]"
)
else:
console.print(
f"[yellow] → No metadata found for {accession_id}[/yellow]"
)
except Exception as e:
console.print(
f"[red] → Error processing {accession_id}: {str(e)}[/red]"
)
console.print(
f"[blue]Batch processing complete. Files saved to '{output_dir}'[/blue]"
)
sradb.close()
return
else:
# output_dir not specified, use file IDs as regular input
srp_id = file_ids
# Original single/multiple ID processing (when no input_file or no output_dir)
srp_ids = []
gse_ids = []
for accession in srp_id:
if isinstance(accession, str) and accession.upper().startswith("GSE"):
gse_ids.append(accession)
else:
srp_ids.append(accession)
metadata_frames = []
if srp_ids:
srp_metadata = sradb.sra_metadata(
srp_ids if len(srp_ids) > 1 else srp_ids[0],
assay=assay,
detailed=detailed,
sample_attribute=desc,
expand_sample_attributes=expand,
enrich=enrich,
enrich_backend=enrich_backend if enrich_backend else "ollama/phi3",
)
if srp_metadata is not None:
metadata_frames.append(srp_metadata)
if gse_ids:
geo_metadata_df = sradb.geo_metadata(
gse_ids if len(gse_ids) > 1 else gse_ids[0],
sample_attribute=desc,
detailed=detailed,
enrich=enrich,
enrich_backend=enrich_backend if enrich_backend else "ollama/phi3",
)
if not geo_metadata_df.empty:
metadata_frames.append(geo_metadata_df)
if metadata_frames:
df = pd.concat(metadata_frames, ignore_index=True, sort=False)
else:
df = pd.DataFrame()
_print_save_df(df, saveto)
sradb.close()
</code_context>
<issue_to_address>
**issue (code-quality):** Low code quality found in metadata - 6% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))
<br/><details><summary>Explanation</summary>The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.</details>
</issue_to_address>
### Comment 11
<location> `pysradb/llamacpp_enrichment.py:133` </location>
<code_context>
def _download_model(self) -> str:
"""Download model from Hugging Face and return the GGUF file path."""
from huggingface_hub import hf_hub_download
logger.info(f"Downloading model: {self.model_id}")
if self.model_file:
model_path = hf_hub_download(
repo_id=self.model_id, filename=self.model_file
)
logger.info(f"Downloaded model file: {model_path}")
return model_path
preferred_files = [
"llama3-openbiollm-8b.Q4_K_M.gguf",
"*Q4_K_M*.gguf",
"*Q5_K_M*.gguf",
"*Q4*.gguf",
"*.gguf",
]
from huggingface_hub import list_repo_files
try:
files = list_repo_files(repo_id=self.model_id)
gguf_files = [f for f in files if f.endswith(".gguf")]
if not gguf_files:
raise RuntimeError(f"No GGUF files found in {self.model_id}")
selected_file = None
for preferred in preferred_files:
if "*" in preferred:
pattern = preferred.replace("*", "")
matches = [f for f in gguf_files if pattern in f]
if matches:
selected_file = matches[0]
break
else:
if preferred in gguf_files:
selected_file = preferred
break
if not selected_file:
selected_file = gguf_files[0]
logger.info(f"Selected GGUF file: {selected_file}")
model_path = hf_hub_download(repo_id=self.model_id, filename=selected_file)
logger.info(f"Downloaded model: {model_path}")
return model_path
except Exception as e:
raise RuntimeError(f"Failed to download model: {e}")
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Extract code out into method ([`extract-method`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/extract-method/))
- Merge else clause's nested if statement into elif ([`merge-else-if-into-elif`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/merge-else-if-into-elif/))
- Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Explicitly raise from a previous error ([`raise-from-previous-error`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/raise-from-previous-error/))
</issue_to_address>
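The selection logic quoted above can absorb the extract-method, merge-else-if, and named-expression suggestions at once; a sketch with hypothetical names, mirroring the preference order in the PR:

```python
def select_gguf_file(gguf_files, preferred_patterns=None):
    """Pick the best GGUF file from a repo listing (extracted from
    _download_model as pure selection logic; name is illustrative)."""
    preferred_patterns = preferred_patterns or [
        "llama3-openbiollm-8b.Q4_K_M.gguf",
        "*Q4_K_M*.gguf", "*Q5_K_M*.gguf", "*Q4*.gguf", "*.gguf",
    ]
    for preferred in preferred_patterns:
        if "*" in preferred:
            pattern = preferred.replace("*", "")
            # Named expression collapses the build-then-check pair.
            if matches := [f for f in gguf_files if pattern in f]:
                return matches[0]
        elif preferred in gguf_files:  # merge-else-if-into-elif
            return preferred
    return gguf_files[0]


# In the caller, preserve the original exception (raise-from-previous-error):
#     except Exception as e:
#         raise RuntimeError(f"Failed to download model: {e}") from e
```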
### Comment 12
<location> `pysradb/llamacpp_enrichment.py:216-226` </location>
<code_context>
def _create_extraction_prompt(
self, metadata: str, fields: Optional[List[str]] = None
) -> str:
"""Create extraction prompt for the model."""
if fields is None:
fields = [
"cell_type",
"tissue",
"disease",
"organism",
"developmental_stage",
]
field_descriptions = {
"cell_type": "the specific cell type or cell line used",
"tissue": "the tissue or organ of origin",
"disease": "any disease or condition being studied",
"organism": "the organism or species",
"developmental_stage": "the developmental stage or age",
"treatment": "any treatments, drugs, or interventions applied",
"assay": "the experimental assay or technique used",
}
fields_text = "\n".join(
[f"- {field}: {field_descriptions.get(field, field)}" for field in fields]
)
prompt = f"""You are a bioinformatics expert. Extract the following information from the metadata below and return it as a JSON object.
Fields to extract:
{fields_text}
Metadata:
{metadata}
Return ONLY a valid JSON object with the extracted fields. Use null if information is not available. Do not include any other text.
JSON:"""
return prompt
</code_context>
<issue_to_address>
**issue (code-quality):** Inline variable that is immediately returned ([`inline-immediately-returned-variable`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/inline-immediately-returned-variable/))
</issue_to_address>
### Comment 13
<location> `pysradb/llamacpp_enrichment.py:244` </location>
<code_context>
def _compute_confidence_score(self, logprobs: List[float]) -> float:
"""
Compute confidence score from log probabilities.
Uses perplexity-based scoring:
- Lower perplexity = higher confidence
- Formula: confidence = 1 / (1 + perplexity / k) where k=10
Args:
logprobs: List of log probabilities for each token
Returns:
Confidence score between 0 and 1
"""
if not logprobs or len(logprobs) == 0:
return 0.5 # Default confidence when no logprobs
# Compute mean negative log-likelihood
mean_nll = -np.mean(logprobs)
# Compute perplexity
perplexity = np.exp(mean_nll)
# Convert to confidence score (0-1 range)
# Lower perplexity = higher confidence
confidence = 1.0 / (1.0 + perplexity / 10.0)
return float(confidence)
</code_context>
<issue_to_address>
**suggestion (code-quality):** We've found these issues:
- Remove redundant conditional ([`remove-redundant-if`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-redundant-if/))
- Simplify sequence length comparison ([`simplify-len-comparison`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/simplify-len-comparison/))
```suggestion
if not logprobs:
```
</issue_to_address>
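With the simplified guard, the scoring described in the docstring reduces to a few lines. A worked sketch of the formula (a hypothetical standalone function, using `math` instead of numpy since the inputs are a flat list):

```python
import math


def confidence_from_logprobs(logprobs, k=10.0):
    """confidence = 1 / (1 + perplexity / k), perplexity = exp(mean NLL)."""
    if not logprobs:  # covers None-like and empty, per the suggestion above
        return 0.5
    mean_nll = -sum(logprobs) / len(logprobs)
    return 1.0 / (1.0 + math.exp(mean_nll) / k)


# Near-certain tokens (logprob ~ 0) give perplexity ~ 1, so confidence
# approaches 1 / 1.1; the score is bounded away from 1 by construction.
```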
### Comment 14
<location> `pysradb/llamacpp_enrichment.py:303-309` </location>
<code_context>
def extract_metadata(
self, text: str, fields: Optional[List[str]] = None
) -> _MetadataExtraction:
"""
Extract metadata from text using llama.cpp.
Args:
text: Input text to extract metadata from
fields: List of fields to extract
Returns:
_MetadataExtraction object with extracted data and confidence scores
"""
prompt = self._create_extraction_prompt(text, fields)
try:
# Generate response with logprobs
response = self.llm(
prompt,
max_tokens=self.max_tokens,
temperature=self.temperature,
logprobs=1, # Request log probabilities
echo=False,
)
generated_text = response["choices"][0]["text"].strip()
logprobs_data = response["choices"][0].get("logprobs", {})
token_logprobs = logprobs_data.get("token_logprobs", [])
valid_logprobs = [lp for lp in token_logprobs if lp is not None]
confidence = self._compute_confidence_score(valid_logprobs)
json_str = generated_text
if "```json" in json_str:
json_str = json_str.split("```json")[1].split("```")[0].strip()
elif "```" in json_str:
json_str = json_str.split("```")[1].split("```")[0].strip()
if "{" in json_str and "}" in json_str:
start = json_str.find("{")
end = json_str.rfind("}") + 1
json_str = json_str[start:end]
extracted_data = json.loads(json_str)
field_confidences = {}
for field in extracted_data.keys():
if extracted_data[field] is not None:
field_confidences[f"{field}_confidence"] = confidence
else:
field_confidences[f"{field}_confidence"] = 0.0
return _MetadataExtraction(
success=True,
extracted_metadata=extracted_data,
confidence_scores=field_confidences,
error=None,
)
except json.JSONDecodeError as e:
logger.error(f"Failed to parse JSON response: {generated_text}")
return _MetadataExtraction(
success=False,
extracted_metadata={},
confidence_scores={},
error=f"JSON parsing error: {str(e)}",
)
except Exception as e:
logger.error(f"Extraction failed: {str(e)}")
return _MetadataExtraction(
success=False,
extracted_metadata={},
confidence_scores={},
error=str(e),
)
</code_context>
<issue_to_address>
**suggestion (code-quality):** We've found these issues:
- Convert for loop into dictionary comprehension ([`dict-comprehension`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/dict-comprehension/))
- Replace if statement with if expression ([`assign-if-exp`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/assign-if-exp/))
```suggestion
field_confidences = {
f"{field}_confidence": (
confidence if extracted_data[field] is not None else 0.0
)
for field in extracted_data.keys()
}
```
</issue_to_address>
### Comment 15
<location> `pysradb/metadata_enrichment.py:323-325` </location>
<code_context>
def _init_llamacpp_backend(self, model: Optional[str], **kwargs):
"""Initialize llama-cpp backend."""
from .llamacpp_enrichment import LlamaCppMetadataExtractor
model_id = self.provider[9:]
if model:
model_id = model
if not model_id:
model_id = "MoMonir/Llama3-OpenBioLLM-8B-GGUF"
self.logger.info(f"Using llama-cpp backend with model: {model_id}")
self._llamacpp_extractor = LlamaCppMetadataExtractor(
model_id=model_id, **kwargs
)
</code_context>
<issue_to_address>
**suggestion (code-quality):** We've found these issues:
- Move setting of default value for variable into `else` branch ([`introduce-default-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/introduce-default-else/))
- Replace if statement with if expression ([`assign-if-exp`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/assign-if-exp/))
```suggestion
model_id = model if model else self.provider[9:]
```
</issue_to_address>
### Comment 16
<location> `pysradb/metadata_enrichment.py:338-340` </location>
<code_context>
def _init_vllm_backend(self, model: Optional[str], **kwargs):
"""Initialize vLLM backend."""
from .vllm_enrichment import VLLMMetadataExtractor
model_id = self.provider[5:]
if model:
model_id = model
if not model_id:
model_id = "Qwen/Qwen2.5-1.5B-Instruct"
self.logger.info(f"Using vLLM backend with model: {model_id}")
self._vllm_extractor = VLLMMetadataExtractor(model_id=model_id, **kwargs)
</code_context>
<issue_to_address>
**suggestion (code-quality):** We've found these issues:
- Move setting of default value for variable into `else` branch ([`introduce-default-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/introduce-default-else/))
- Replace if statement with if expression ([`assign-if-exp`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/assign-if-exp/))
```suggestion
model_id = model if model else self.provider[5:]
```
</issue_to_address>
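Comments 15 and 16 flag the same three-way fallback in both backend initializers. Note the suggested one-liners keep only the first two branches; folding in the hard-coded default as well keeps behavior identical. A sketch with a hypothetical helper name shared by both initializers:

```python
DEFAULT_LLAMACPP = "MoMonir/Llama3-OpenBioLLM-8B-GGUF"


def resolve_model_id(explicit_model, provider_suffix, default):
    """Collapse the three-way fallback into one chained `or`:
    an explicit model wins, then the provider suffix, then the default."""
    return explicit_model or provider_suffix or default
```

Each `_init_*_backend` would call it as `resolve_model_id(model, self.provider[9:], DEFAULT_LLAMACPP)` (with the appropriate slice and default per backend).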
### Comment 17
<location> `pysradb/vllm_enrichment.py:98` </location>
<code_context>
def _download_model(self) -> str:
"""
Download model from Hugging Face.
Returns:
Path to the downloaded model
"""
try:
from huggingface_hub import snapshot_download
except ImportError:
raise ImportError(
"huggingface_hub is required for downloading models. "
"Install with: pip install huggingface_hub"
)
logger.info(f"Downloading model {self.model_id}...")
try:
model_path = snapshot_download(
repo_id=self.model_id,
allow_patterns=[
"*.gguf",
"*.safetensors",
"*.bin",
"*.json",
"*.txt",
"*.model",
"tokenizer*",
"config.json",
"generation_config.json",
],
)
logger.info(f"Model downloaded to {model_path}")
return model_path
except Exception as e:
raise RuntimeError(f"Failed to download model {self.model_id}: {e}")
</code_context>
<issue_to_address>
**issue (code-quality):** Explicitly raise from a previous error [×2] ([`raise-from-previous-error`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/raise-from-previous-error/))
</issue_to_address>
### Comment 18
<location> `pysradb/vllm_enrichment.py:155` </location>
<code_context>
def start(self, max_retries: int = 3) -> bool:
"""
Start vLLM server with automatic port retry on conflicts.
Args:
max_retries: Maximum number of port retries (default: 3)
Returns:
True if server started successfully
"""
if not self._check_vllm_installed():
if not self._install_vllm():
raise RuntimeError("vLLM is required but not installed")
# Check if port is already in use
if self._is_server_running():
logger.info(f"vLLM server already running on port {self.port}")
return True
model_path = self._download_model()
gguf_file = self._find_gguf_file(model_path)
if gguf_file:
logger.info(f"Using GGUF file: {gguf_file}")
model_arg = gguf_file
tokenizer = self._get_tokenizer_path()
else:
logger.info(f"Using model ID directly (non-GGUF): {self.model_id}")
model_arg = self.model_id
tokenizer = None
original_port = self.port
for attempt in range(max_retries):
current_port = original_port + attempt
logger.info(
f"Attempting to start vLLM server on port {current_port} (attempt {attempt + 1}/{max_retries})"
)
cmd = [
"python",
"-m",
"vllm.entrypoints.openai.api_server",
"--model",
model_arg,
"--port",
str(current_port),
"--max-logprobs",
str(self.max_logprobs),
"--gpu-memory-utilization",
str(self.gpu_memory_utilization),
]
if tokenizer:
cmd.extend(["--tokenizer", tokenizer])
if self.trust_remote_code:
cmd.append("--trust-remote-code")
# Add chat template to fix "default chat template is no longer allowed" error
if (
"llama" in self.model_id.lower()
or "openbiollm" in self.model_id.lower()
):
llama3_chat_template = (
"{% set loop_messages = messages %}"
"{% for message in loop_messages %}"
"{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' %}"
"{% if loop.index0 == 0 %}"
"{% set content = bos_token + content %}"
"{% endif %}"
"{{ content }}"
"{% endfor %}"
"{{ '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}"
)
cmd.extend(["--chat-template", llama3_chat_template])
elif "qwen" in self.model_id.lower():
qwen_chat_template = (
"{% for message in messages %}"
"{% if loop.first and messages[0]['role'] != 'system' %}"
"{{ '<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n' }}"
"{% endif %}"
"{{ '<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>\\n' }}"
"{% endfor %}"
"{{ '<|im_start|>assistant\\n' }}"
)
cmd.extend(["--chat-template", qwen_chat_template])
logger.info(f"Starting vLLM server: {' '.join(cmd)}")
self.process = subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
)
logger.info("Waiting for vLLM server to start...")
max_wait = 300
start_time = time.time()
while time.time() - start_time < max_wait:
self.port = current_port
self.base_url = f"http://localhost:{current_port}/v1"
if self._is_server_running():
logger.info(f"vLLM server started successfully on port {self.port}")
return True
if self.process.poll() is not None:
try:
stdout, stderr = self.process.communicate(timeout=5)
except:
logger.warning(
f"Process died on port {current_port}, but couldn't read output (timeout)"
)
stdout, stderr = "", ""
if stderr and (
"Address already in use" in stderr
or "OSError: [Errno 98]" in stderr
):
logger.warning(
f"Port {current_port} is already in use, trying next port..."
)
break
elif stderr:
logger.error(
f"vLLM server failed to start.\nStdout: {stdout}\nStderr: {stderr}"
)
raise RuntimeError(
f"vLLM server failed with error: {stderr[:500]}"
)
else:
logger.warning(
f"Process died on port {current_port}, retrying..."
)
break
time.sleep(2)
if not self._is_server_running():
stdout, stderr = "", ""
if self.process and self.process.poll() is None:
logger.warning(
f"Server didn't start within timeout on port {current_port}"
)
self.process.kill()
try:
stdout, stderr = self.process.communicate(timeout=5)
except:
self.process.terminate()
try:
stdout, stderr = self.process.communicate(timeout=2)
except:
pass
elif self.process:
try:
stdout, stderr = self.process.communicate(timeout=5)
except:
logger.warning(
f"Timed out reading process output on port {current_port}"
)
pass
if stderr:
logger.error(
f"vLLM server died during startup.\nStdout: {stdout[-1000:] if stdout else ''}\nStderr: {stderr[-1000:]}"
)
# Check for port conflict
if (
stderr
and "Address already in use" not in stderr
and "OSError: [Errno 98]" not in stderr
):
# Not a port conflict - this is a real error
if attempt < max_retries - 1:
logger.warning(
f"Server failed on port {current_port}, trying next port..."
)
else:
raise RuntimeError(
f"vLLM server failed with error: {stderr[-500:]}"
)
if attempt < max_retries - 1:
logger.info(f"Retrying with next port...")
continue
else:
# Last attempt failed
logger.error(f"Failed to start server after {max_retries} attempts")
raise RuntimeError(
f"vLLM server failed to start within timeout after {max_retries} port attempts"
)
# Should not reach here
raise RuntimeError("vLLM server failed to start")
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Hoist statements out of for/while loops ([`hoist-statement-from-loop`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/hoist-statement-from-loop/))
- Remove redundant pass statement ([`remove-redundant-pass`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-redundant-pass/))
- Remove redundant continue statement ([`remove-redundant-continue`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-redundant-continue/))
- Use `except Exception:` rather than bare `except:` [×4] ([`do-not-use-bare-except`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/do-not-use-bare-except/))
- Low code quality found in VLLMServerManager.start - 7% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))
<br/><details><summary>Explanation</summary>
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.</details>
</issue_to_address>
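One way to act on the refactoring hints above is to pull the stderr classification out of the retry loop into a small helper with guard clauses. The helper name and marker strings below are illustrative, not code from this PR:

```python
# Hypothetical helper extracted from VLLMServerManager.start: classify a
# failed startup so the retry loop only has to branch on the result.
PORT_CONFLICT_MARKERS = ("Address already in use", "OSError: [Errno 98]")


def classify_startup_failure(stderr: str) -> str:
    """Return 'port_conflict' when the server died on a busy port, else 'fatal'."""
    if not stderr:
        return "fatal"
    if any(marker in stderr for marker in PORT_CONFLICT_MARKERS):
        return "port_conflict"
    return "fatal"
```

With this in place, the loop body shrinks to "classify, then either try the next port or raise", which addresses both the nesting and the working-memory complaints.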
### Comment 19
<location> `pysradb/vllm_enrichment.py:360` </location>
<code_context>
def _is_server_running(self) -> bool:
"""Check if vLLM server is running."""
try:
response = requests.get(f"{self.base_url}/models", timeout=5)
return response.status_code == 200
except:
return False
</code_context>
<issue_to_address>
**suggestion (code-quality):** Use `except Exception:` rather than bare `except:` ([`do-not-use-bare-except`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/do-not-use-bare-except/))
```suggestion
except Exception:
```
</issue_to_address>
### Comment 20
<location> `pysradb/vllm_enrichment.py:469-474` </location>
<code_context>
def _compute_confidence_score(self, logprobs: List[Dict[str, Any]]) -> float:
"""
Compute confidence score from token logprobs.
Uses mean negative log-likelihood (NLL) and converts to a 0-1 score.
Args:
logprobs: List of logprob dictionaries from vLLM
Returns:
Confidence score (0-1, higher is better)
"""
if not logprobs:
return 0.0
# Extract log probabilities for the top token
log_probs = []
for token_logprob in logprobs:
if token_logprob and "logprob" in token_logprob:
log_probs.append(token_logprob["logprob"])
if not log_probs:
return 0.0
# Compute mean NLL
mean_nll = -np.mean(log_probs)
# Convert to perplexity
perplexity = np.exp(mean_nll)
# Convert perplexity to 0-1 confidence score
# Lower perplexity = higher confidence
# Use sigmoid-like transformation: confidence = 1 / (1 + perplexity/k)
# where k is a scaling factor (e.g., 10)
k = 10.0
confidence = 1.0 / (1.0 + perplexity / k)
return float(confidence)
</code_context>
<issue_to_address>
**suggestion (code-quality):** Convert for loop into list comprehension ([`list-comprehension`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/list-comprehension/))
```suggestion
log_probs = [
token_logprob["logprob"]
for token_logprob in logprobs
if token_logprob and "logprob" in token_logprob
]
```
</issue_to_address>
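Aside from the comprehension, the confidence transform in this function is easy to check in isolation. A minimal sketch of the same mean-NLL-to-perplexity-to-score formula (with the PR's scaling factor k=10):

```python
import math


def confidence_from_logprobs(log_probs, k=10.0):
    """Mean NLL -> perplexity -> bounded 0-1 score, mirroring the formula above."""
    if not log_probs:
        return 0.0
    mean_nll = -sum(log_probs) / len(log_probs)
    perplexity = math.exp(mean_nll)
    # Lower perplexity => higher confidence; score approaches 1 as perplexity -> 0
    return 1.0 / (1.0 + perplexity / k)
```

Tokens the model was certain about (logprob near 0, i.e. probability near 1) give perplexity 1 and a score near 0.91; increasingly negative logprobs push the score toward 0.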
### Comment 21
<location> `pysradb/vllm_enrichment.py:500` </location>
<code_context>
def _call_vllm(self, prompt: str) -> Tuple[Dict[str, Any], Dict[str, float]]:
"""
Call vLLM API with the prompt.
Returns:
Tuple of (extracted_data, confidence_scores)
"""
try:
response = requests.post(
f"{self.base_url}/chat/completions",
json={
"model": self.served_model_name,
"messages": [{"role": "user", "content": prompt}],
"temperature": self.temperature,
"max_tokens": self.max_tokens,
"logprobs": True,
"top_logprobs": 5,
},
timeout=60,
)
response.raise_for_status()
result = response.json()
content = result["choices"][0]["message"]["content"]
logprobs = result["choices"][0].get("logprobs", {}).get("content", [])
try:
if "```json" in content:
json_start = content.find("```json") + 7
json_end = content.find("```", json_start)
content = content[json_start:json_end].strip()
elif "```" in content:
json_start = content.find("```") + 3
json_end = content.find("```", json_start)
content = content[json_start:json_end].strip()
else:
import re
json_match = re.search(r'\{[^}]*"organ"[^}]*\}', content, re.DOTALL)
if json_match:
content = json_match.group(0)
data = json.loads(content)
except json.JSONDecodeError as e:
logger.error(f"Failed to parse JSON response: {content[:500]}")
raise RuntimeError(f"Failed to parse LLM response as JSON: {e}")
overall_confidence = self._compute_confidence_score(logprobs)
confidence_scores = {field: overall_confidence for field in data.keys()}
return data, confidence_scores
except requests.exceptions.RequestException as e:
raise RuntimeError(f"vLLM API call failed: {e}")
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Explicitly raise from a previous error [×2] ([`raise-from-previous-error`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/raise-from-previous-error/))
</issue_to_address>
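The `raise-from-previous-error` suggestion applies to the JSON-parsing branch above. A minimal sketch of the chained form (function name is illustrative):

```python
import json


def parse_llm_json(content: str) -> dict:
    """Parse model output as JSON, chaining the original error per the review note."""
    try:
        return json.loads(content)
    except json.JSONDecodeError as e:
        # 'from e' attaches the original traceback as __cause__ instead of hiding it
        raise RuntimeError(f"Failed to parse LLM response as JSON: {e}") from e
```

The chained exception preserves the original `JSONDecodeError` (line and column of the parse failure) for anyone debugging a malformed model response.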
### Comment 22
<location> `pysradb/vllm_enrichment.py:561` </location>
<code_context>
def extract_metadata(
self, text: str, fields: Optional[List[str]] = None
) -> Dict[str, Any]:
"""
Extract metadata from text using vLLM.
Args:
text: Input text
fields: List of fields to extract
Returns:
Dictionary with extracted metadata and confidence scores
"""
if not text or text.strip() == "":
default_fields = {
"organ": "Unknown",
"tissue": "Unknown",
"anatomical_system": "Unknown",
"cell_type": "Unknown",
"disease": "Unknown",
"sex": "Unknown",
"development_stage": "Unknown",
"assay": "Unknown",
"organism": "Unknown",
}
confidence_fields = {f"{k}_confidence": 0.0 for k in default_fields.keys()}
return {**default_fields, **confidence_fields}
prompt = self._create_extraction_prompt(text, fields)
data, confidence_scores = self._call_vllm(prompt)
# Add confidence scores to the result
result = {}
for field, value in data.items():
result[field] = value
result[f"{field}_confidence"] = confidence_scores.get(field, 0.0)
return result
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Replaces an empty collection equality with a boolean operation ([`simplify-empty-collection-comparison`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/simplify-empty-collection-comparison/))
- Remove unnecessary call to keys() ([`remove-dict-keys`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-dict-keys/))
- Applies the dictionary union operator where applicable ([`use-dictionary-union`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-dictionary-union/))
</issue_to_address>
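The dictionary-union suggestion boils down to replacing `{**a, **b}` with `a | b` (PEP 584, Python 3.9+). A sketch of the empty-text default path using that operator, with the field list parameterized for brevity:

```python
def empty_result(fields):
    """Default 'Unknown' payload plus zero confidences, built with dict union."""
    defaults = {field: "Unknown" for field in fields}
    confidences = {f"{field}_confidence": 0.0 for field in fields}
    return defaults | confidences  # equivalent to {**defaults, **confidences}
```

Note that `|` requires Python 3.9+; if pysradb still supports 3.8, the `{**a, **b}` spelling in the PR is the safer choice.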
### Comment 23
<location> `pysradb/vllm_enrichment.py:609-615` </location>
<code_context>
def enrich_dataframe(
self,
df: pd.DataFrame,
text_column: Optional[str] = None,
fields: Optional[List[str]] = None,
prefix: str = "guessed_",
show_progress: bool = True,
) -> pd.DataFrame:
"""
Enrich a DataFrame with extracted metadata and confidence scores.
Args:
df: Input DataFrame
text_column: Column containing text to analyze
fields: List of metadata fields to extract
prefix: Prefix for new columns
show_progress: Show progress bar
Returns:
DataFrame with additional metadata columns and confidence scores
"""
# Reuse parent class logic for text column detection
enriched_df = super().enrich_dataframe(
df=df,
text_column=text_column,
fields=fields,
prefix=prefix,
show_progress=show_progress,
)
return enriched_df
</code_context>
<issue_to_address>
**issue (code-quality):** Inline variable that is immediately returned ([`inline-immediately-returned-variable`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/inline-immediately-returned-variable/))
</issue_to_address>
    # If output_dir is specified, process each ID separately
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
        console.print(
            f"[blue]Processing {len(file_ids)} IDs from '{input_file}'[/blue]"
        )
        console.print(f"[blue]Output directory: '{output_dir}'[/blue]")
suggestion: Output directory creation does not handle permission or invalid path errors.
Wrap os.makedirs in a try/except block to handle and report errors if directory creation fails.
```suggestion
    # If output_dir is specified, process each ID separately
    if output_dir:
        try:
            os.makedirs(output_dir, exist_ok=True)
        except Exception as e:
            console.print(
                f"[red]Error: Failed to create output directory '{output_dir}': {e}[/red]"
            )
            sradb.close()
            return
        console.print(
            f"[blue]Processing {len(file_ids)} IDs from '{input_file}'[/blue]"
        )
        console.print(f"[blue]Output directory: '{output_dir}'[/blue]")
```
    self.process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
security (python.lang.security.audit.dangerous-subprocess-use-audit): Detected subprocess function 'Popen' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
Source: opengrep
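Since `cmd` here is a Python list passed without `shell=True`, the arguments never pass through a shell, so shell metacharacters in a model ID stay literal. A quick demonstration of that property (using the running interpreter as a stand-in child process):

```python
import subprocess
import sys

# List-form argv is handed to the OS directly; the child sees the raw string.
untrusted = "model; rm -rf /"
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted],
    capture_output=True,
    text=True,
)
# The ';' is never interpreted by a shell; the child just echoes the argument.
```

The audit finding is still worth a look if any element of `cmd` is user-controlled, but list-form `Popen` without `shell=True` is the accepted mitigation.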
        Returns:
            Dictionary with extracted metadata and confidence scores
        """
        if not text or text.strip() == "":
issue (code-quality): We've found these issues:
- Replaces an empty collection equality with a boolean operation (`simplify-empty-collection-comparison`)
- Remove unnecessary call to keys() (`remove-dict-keys`)
- Applies the dictionary union operator where applicable (`use-dictionary-union`)
    enriched_df = super().enrich_dataframe(
        df=df,
        text_column=text_column,
        fields=fields,
        prefix=prefix,
        show_progress=show_progress,
    )
issue (code-quality): Inline variable that is immediately returned (`inline-immediately-returned-variable`)
Pull Request Overview
This PR adds support for vLLM and llama.cpp backends for metadata enrichment, along with confidence scoring based on token log probabilities. It also introduces batch processing capabilities for CLI metadata extraction.
- Adds new vLLM backend with GGUF model support and confidence scoring
- Adds new llama.cpp backend with local model support
- Extends CLI to support batch processing from input files with individual output files
- Minor formatting improvements in existing code
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| pysradb/vllm_enrichment.py | New module implementing vLLM-based metadata extraction with server management and confidence scoring |
| pysradb/llamacpp_enrichment.py | New module implementing llama.cpp-based metadata extraction with GGUF model support |
| pysradb/metadata_enrichment.py | Extended LLMMetadataExtractor to integrate vLLM and llama.cpp backends |
| pysradb/cli.py | Added batch processing support with --input-file and --output-dir options; fixed import ordering |
| pysradb/sraweb.py | Code formatting improvements (whitespace, line breaks) |
Comments suppressed due to low confidence (5)
- pysradb/vllm_enrichment.py:267 - Except block directly handles BaseException (`except:`).
- pysradb/vllm_enrichment.py:306 - Except block directly handles BaseException (`except:`).
- pysradb/vllm_enrichment.py:310 - Except block directly handles BaseException (`except:`).
- pysradb/vllm_enrichment.py:315 - Except block directly handles BaseException (`except:`).
- pysradb/vllm_enrichment.py:360 - Except block directly handles BaseException (`except:`).
    if self.process.poll() is not None:
        try:
            stdout, stderr = self.process.communicate(timeout=5)
        except:
Copilot AI · Nov 8, 2025
Bare except clause catches all exceptions including KeyboardInterrupt and SystemExit. Use except Exception: or a more specific exception type (e.g., subprocess.TimeoutExpired) instead.
```suggestion
        except subprocess.TimeoutExpired:
```
    self.process.kill()
    try:
        stdout, stderr = self.process.communicate(timeout=5)
    except:
Copilot AI · Nov 8, 2025
Bare except clause catches all exceptions including KeyboardInterrupt and SystemExit. Use except Exception: or a more specific exception type (e.g., subprocess.TimeoutExpired) instead.
    elif self.process:
        try:
            stdout, stderr = self.process.communicate(timeout=5)
        except:
Copilot AI · Nov 8, 2025
Bare except clause catches all exceptions including KeyboardInterrupt and SystemExit. Use except Exception: or a more specific exception type (e.g., subprocess.TimeoutExpired) instead.
```suggestion
        except Exception:
```
pysradb/vllm_enrichment.py (Outdated)
    try:
        response = requests.get(f"{self.base_url}/models", timeout=5)
        return response.status_code == 200
    except:
Copilot AI · Nov 8, 2025
Bare except clause catches all exceptions including KeyboardInterrupt and SystemExit. Use except Exception: or a more specific exception type (e.g., requests.RequestException) instead.
```suggestion
    except requests.RequestException:
```
    import json
    import logging
    import os
    from typing import Any, Dict, List, Optional, Tuple
Copilot AI · Nov 8, 2025
Import of 'Any' is not used.
Import of 'Dict' is not used.
Import of 'Tuple' is not used.
```suggestion
from typing import List, Optional
```
    from tqdm.autonotebook import tqdm
Copilot AI · Nov 8, 2025
Import of 'tqdm' is not used.
```suggestion
```
pysradb/vllm_enrichment.py (Outdated)
    except:
        pass
Copilot AI · Nov 8, 2025
'except' clause does nothing but pass and there is no explanatory comment.
```suggestion
    except Exception as e:
        logger.warning(
            f"Exception while reading process output after terminate: {e}"
        )
```
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com> Signed-off-by: Saket Choudhary <[email protected]>
Co-authored-by: Copilot <[email protected]> Signed-off-by: Saket Choudhary <[email protected]>
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com> Signed-off-by: Saket Choudhary <[email protected]>
Co-authored-by: Copilot <[email protected]> Signed-off-by: Saket Choudhary <[email protected]>
Co-authored-by: Copilot <[email protected]> Signed-off-by: Saket Choudhary <[email protected]>
Validate that at least one accession ID is provided when --input-file is not used. This ensures the CLI contract is maintained now that srp_id uses nargs='*' instead of nargs='+'. Users receive a clear error message and usage instructions if they attempt to run 'pysradb metadata' without providing any IDs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
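The CLI contract described in that commit can be sketched with a minimal argparse parser: `srp_id` is now optional positionally (`nargs='*'`), so the validation step must enforce that IDs arrive from at least one source. Names below are illustrative, not the actual pysradb CLI code:

```python
import argparse


def build_parser():
    """Sketch of the metadata subcommand's two ID sources."""
    parser = argparse.ArgumentParser(prog="metadata-sketch")
    parser.add_argument("srp_id", nargs="*")  # was nargs='+' before this change
    parser.add_argument("--input-file", default=None)
    return parser


def validate(args):
    """Require IDs from the positional list or from --input-file."""
    if not args.srp_id and not args.input_file:
        raise SystemExit("error: provide at least one accession ID or use --input-file")
    return True
```

With `nargs='+'` argparse itself rejected an empty ID list, so moving to `nargs='*'` shifts that responsibility onto an explicit check like `validate`.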
Summary by Sourcery
Add support for local vLLM and llama-cpp metadata enrichment backends with confidence scoring and extend the metadata command to accept file-based batch inputs with per-ID output directories.
New Features:
Enhancements: