update randomizer, folder organization and stacking shell script
ChongLC committed Feb 3, 2022
1 parent 892eaa0 commit ade9b94
Showing 7 changed files with 35 additions and 19 deletions.
19 changes: 16 additions & 3 deletions PythonScript/README.md
@@ -1,3 +1,5 @@
+[To the main README >](https://github.com/ChongLC/MinimalSetofViralPeptidome-UNIQmin/blob/master/README.md)

# **Step-by-step of UNIQmin**

Table of Contents
@@ -267,6 +269,14 @@ with open(result, "w") as f:
#### Step 5 - Identification of the final minimal set of sequences
Match the remaining unique, multi-occurring *k*-mers of *B#* against the remaining sequences of *A#*, and identify the sequence with the maximal *k*-mer coverage. That sequence is deposited into the previously defined file *Z* (minimal set); the deposited sequence and its inherent *k*-mers are then removed from file *A#* and file *B#*, respectively. This process is repeated until the *k*-mers in file *B#* are exhausted. This step is carried out using the "U5.1_RemainingMinSet" and "U5.2_MinSet" scripts. The output for the sample input file (exampleinput.fas) is provided as an example (exampleoutput.fasta).
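
The selection loop described above is a greedy covering procedure. As a minimal in-memory sketch (not the file-based implementation in "U5.1_RemainingMinSet"), assuming the *k*-mers of each sequence are already available as sets; the function name `greedy_minimal_set` and its arguments are hypothetical and chosen only for illustration:

```python
def greedy_minimal_set(seq_kmers, remaining_kmers):
    """Greedily pick sequences until every remaining k-mer is covered.

    seq_kmers: dict mapping sequence ID -> set of k-mers in that sequence
               (stands in for file A#; hypothetical in-memory form).
    remaining_kmers: set of k-mers still to be covered (stands in for file B#).
    Returns the selected sequence IDs in the order they were picked.
    """
    remaining = set(remaining_kmers)
    candidates = {sid: set(kmers) for sid, kmers in seq_kmers.items()}
    selected = []
    while remaining:
        # Sequence covering the most remaining k-mers; on a tie, the first
        # ID in dict order wins.
        best_id = max(
            candidates,
            key=lambda sid: len(candidates[sid] & remaining),
            default=None,
        )
        if best_id is None or not (candidates[best_id] & remaining):
            break  # no candidate covers any further k-mer
        selected.append(best_id)
        # Remove the chosen sequence and the k-mers it covers.
        remaining -= candidates.pop(best_id)
    return selected
```

For example, with three toy sequences where one covers three of the five remaining 3-mers, that sequence is picked first and the other two follow in dictionary order. The early `break` is an addition of this sketch so that it terminates even if some *k*-mer occurs in no candidate sequence.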

```
+#create a new directory named `minimalSet` for all the final output files
+mkdir minimalSet
+#copy the pre-qualified minimal set into a new file named `fileZ.txt`, which is appended to later to give the final minimal set
+cp seqfileZ.txt minimalSet/fileZ.txt
```

```
#create a new directory named `match` for all the intermediate files
mkdir match
@@ -321,7 +331,7 @@ while(len(remain_kmer) != 0):
df = pd.read_csv(matching_file, delimiter=';', names=['sequence_id', 'matched_kmer', 'count']).sort_values(by='count',ascending=False, kind='mergesort')
df['matched_kmer'] = df['matched_kmer'].str.replace(r"\[|\]|'","")
-fileZ = open('fileZ.txt', 'a') # file Z is an example for output (exampleoutput.txt)
+fileZ = open('minimalSet/fileZ.txt', 'a') # file Z is an example for output (exampleoutput.txt)
fileZ.write(df['sequence_id'].iloc[0] + '\n')
kmer_to_remove = df['matched_kmer'].iloc[0].split(', ')
@@ -337,8 +347,8 @@ while(len(remain_kmer) != 0):
from Bio import SeqIO
fasta_file = "inputfile.fas" # Input fasta file
-wanted_file = "fileZ.txt" # Input interesting sequence IDs, one per line
-result_file = "FileZ.fasta" # Output fasta file
+wanted_file = "minimalSet/fileZ.txt" # Input interesting sequence IDs, one per line
+result_file = "minimalSet/fileZ.fasta" # Output fasta file
wanted = set()
with open (wanted_file) as f:
@@ -353,6 +363,9 @@ with open (result_file, "w") as f:
if seq.id in wanted:
SeqIO.write([seq], f, "fasta")
```
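
Because the selection loop now appends to `minimalSet/fileZ.txt`, the `open(..., 'a')` call fails if the `minimalSet` folder has not been created first. A small defensive sketch, assuming a hypothetical helper named `append_sequence_id` (it is not part of the UNIQmin scripts), creates the folder on demand and closes the file after each write:

```python
from pathlib import Path


def append_sequence_id(seq_id, path="minimalSet/fileZ.txt"):
    """Append one sequence ID to the minimal-set file (hypothetical helper)."""
    target = Path(path)
    # Create the output folder if it is missing; no error if it already exists.
    target.parent.mkdir(parents=True, exist_ok=True)
    # The with-block closes the file even if the write raises.
    with target.open("a") as handle:
        handle.write(seq_id + "\n")
```

Inside the loop, a call such as `append_sequence_id(df['sequence_id'].iloc[0])` could stand in for the `open`/`write` pair; whether that fits the rest of the script is left as an assumption.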

+[To the main README >](https://github.com/ChongLC/MinimalSetofViralPeptidome-UNIQmin/blob/master/README.md)

---
## Figure Summary
<img src="Summary.png" width="640" height="1075">
2 changes: 1 addition & 1 deletion PythonScript/U5.1_RemainingMinSet.py
@@ -48,7 +48,7 @@ def find_matching(line, A):
df['matched_kmer'] = df['matched_kmer'].str.replace(r"\[|\]|'","")

# save highest count id to file
-fileZ = open('fileZ.txt', 'a')
+fileZ = open('minimalSet/fileZ.txt', 'a')
fileZ.write(df['sequence_id'].iloc[0] + '\n')

# remove highest count kmer
4 changes: 2 additions & 2 deletions PythonScript/U5.2_MinSet.py
@@ -1,8 +1,8 @@
from Bio import SeqIO

fasta_file = "inputfile.fas" # Input fasta file
-wanted_file = "fileZ.txt" # Input interesting sequence IDs, one per line
-result_file = "FileZ.fasta" # Output fasta file
+wanted_file = "minimalSet/fileZ.txt" # Input interesting sequence IDs, one per line
+result_file = "minimalSet/fileZ.fasta" # Output fasta file

wanted = set()
with open (wanted_file) as f:
6 changes: 3 additions & 3 deletions randpseqgen/README.md → randomizer/README.md
@@ -1,17 +1,17 @@
-# randpSeqGen: A tool to generate a random protein sequence dataset
+# randomizer: A tool to generate a random protein sequence dataset
[![Run in Google Colab](https://img.shields.io/badge/Colab-Run_in_Google_Colab-blue?logo=Google&logoColor=FDBA18)](https://colab.research.google.com/drive/1IwNPKaRKGgPzqiOBuEo8S0VbUpe3XqVh?usp=sharing) <br>

A random protein sequence dataset can be useful in various sequence analyses as a baseline for evaluating and correcting for background noise. Herein, we offer a tool that generates a dataset of random protein sequences.

---

### Usage
-`python randpseqgen.py [-h] [-o OUTPUT] [-l SEQLEN] [-n SEQNUM]`
+`python randomizer.py [-h] [-o OUTPUT] [-l SEQLEN] [-n SEQNUM]`

In the usage example below, the randpSeqGen tool generates a random protein sequence dataset named `randomproteinseq.fasta`, consisting of 1,000 sequences, each 1,000 amino acids long. The amino acid composition of the random sequences is based on all reported viral sequences retrieved from the NCBI Protein database (as of May 2021; `allVirus080521.fasta`). <br>

```
-python randpseqgen.py -o randomproteinseq.fasta -l 1000 -n 1000
+python randomizer.py -o randomproteinseq.fasta -l 1000 -n 1000
```
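
The source of `randomizer.py` is not part of this diff, so the following is only a sketch of how composition-weighted random sequences could be drawn. The function name `random_protein_fasta`, the `>random_N` header format and the uniform default weights are assumptions; the actual tool derives its composition from `allVirus080521.fasta`.

```python
import random

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_protein_fasta(seq_num, seq_len, weights=None, seed=None):
    """Return seq_num random sequences of length seq_len as FASTA text.

    weights: optional list of 20 relative frequencies, in AMINO_ACIDS order
             (assumption: uniform when omitted).
    seed: optional seed for reproducible output.
    """
    rng = random.Random(seed)
    records = []
    for i in range(1, seq_num + 1):
        # Draw each residue independently according to the given weights.
        residues = rng.choices(AMINO_ACIDS, weights=weights, k=seq_len)
        records.append(f">random_{i}\n{''.join(residues)}")
    return "\n".join(records) + "\n"
```

Under these assumptions, `random_protein_fasta(1000, 1000)` would correspond to `-l 1000 -n 1000`, with the returned text written to the file given by `-o`.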

#### Command-line Arguments
File renamed without changes.
File renamed without changes.
23 changes: 13 additions & 10 deletions uniqmin.sh
@@ -3,23 +3,26 @@

dir=/backup/user/ext/perdana/lichuin/cdhitObj2/testingStitchScript/
cd $dir
-python p1.py
+python3 U1_KmerGenerator.py
wait
-python p2.py
+python3 U2.1_Singletons.py
wait
-python p2_2.py
+python3 U2.2_Multitons.py
wait
-python p3.py
+python3 U3.1_PreQualifiedMinSet.py
wait
-python p3_2.py
+python3 U3.2_UnmatchedSingletons.py
wait
-python p4_1.py
+python3 U4.1_Non-SingletonsDedup.py
wait
-python p4_2.py
+python3 U4.2_Multi-OccurringPreMinSet.py
wait
-python p4_3.py
+python3 U4.3_UnmatchedMulti-Occurring.py
wait
-cp seqfileZ.txt fileZ.txt
+mkdir minimalSet
+cp seqfileZ.txt minimalSet/fileZ.txt
mkdir match
wait
-python p5.py
+python3 U5.1_RemainingMinSet.py
+wait
+python3 U5.2_MinSet.py
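
In the script above, each `python3` command runs in the foreground, so the `wait` lines have nothing to wait for. Unless the lines not shown in this hunk enable `set -e`, a failing step also does not stop the later ones. As a hedged alternative sketch in Python, assuming the scripts live in one working directory and that `seqfileZ.txt` exists by the time Step 5 starts (the names `STEPS` and `run_pipeline`, and the injectable `run` parameter, are hypothetical):

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Script order taken from uniqmin.sh.
STEPS = [
    "U1_KmerGenerator.py",
    "U2.1_Singletons.py",
    "U2.2_Multitons.py",
    "U3.1_PreQualifiedMinSet.py",
    "U3.2_UnmatchedSingletons.py",
    "U4.1_Non-SingletonsDedup.py",
    "U4.2_Multi-OccurringPreMinSet.py",
    "U4.3_UnmatchedMulti-Occurring.py",
    "U5.1_RemainingMinSet.py",
    "U5.2_MinSet.py",
]


def run_pipeline(workdir=".", run=subprocess.run):
    """Run the UNIQmin steps in order, stopping at the first failure."""
    workdir = Path(workdir)
    for script in STEPS:
        if script == "U5.1_RemainingMinSet.py":
            # Same preparation as the mkdir/cp lines in uniqmin.sh,
            # but tolerant of folders left over from a previous run.
            (workdir / "minimalSet").mkdir(exist_ok=True)
            (workdir / "match").mkdir(exist_ok=True)
            shutil.copyfile(
                workdir / "seqfileZ.txt",
                workdir / "minimalSet" / "fileZ.txt",
            )
        # check=True raises CalledProcessError on a non-zero exit code.
        run([sys.executable, script], cwd=workdir, check=True)
```

Calling `run_pipeline("/path/to/scripts")` would then replace the `cd` and the sequence of `python3` lines; the `run` parameter exists only so the driver can be exercised without executing the real scripts.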
