update randomizer, folder organization and stacking shell script

ChongLC · Feb 3, 2022 · ade9b94
1 parent 892eaa0
Showing 7 changed files with 35 additions and 19 deletions.
diff --git a/PythonScript/README.md b/PythonScript/README.md
@@ -1,3 +1,5 @@
+[To the main README >](https://github.com/ChongLC/MinimalSetofViralPeptidome-UNIQmin/blob/master/README.md)
+
 # **Step-by-step of UNIQmin**
 
 Table of Contents
@@ -267,6 +269,14 @@ with open(result, "w") as f:
 #### Step 5 - Identification of the final minimal set of sequences
 Match the remaining unique, multi-occurring *k*-mers of *B#* against the remaining sequences of *A#*, and identify the sequence with the maximal *k*-mer coverage, which is then deposited into the previously defined file *Z* (minimal set). The deposited sequence and its inherent *k*-mers are removed from file *A#* and file *B#*, respectively. This process is repeated until the *k*-mers in file *B#* are exhausted. This step is carried out using the "U5.1_RemainingMinSet" and "U5.2_MinSet" scripts. The output for the sample input file (exampleinput.fas) is provided as an example (exampleoutput.fasta).
 
+```
+#create a new directory for all the final output files (named `minimalSet`)
+mkdir minimalSet
+
+#copy the pre-qualified minimal set into a new file (named `fileZ.txt`) that will be appended to later, yielding the final minimal set
+cp seqfileZ.txt minimalSet/fileZ.txt
+```
+
 ```
 #create a new directory for all the intermediate files (named `match`)
 mkdir match
@@ -321,7 +331,7 @@ while(len(remain_kmer) != 0):
     df = pd.read_csv(matching_file, delimiter=';', names=['sequence_id', 'matched_kmer', 'count']).sort_values(by='count',ascending=False, kind='mergesort')
     df['matched_kmer'] = df['matched_kmer'].str.replace(r"\[|\]|'","")
     
-    fileZ = open('fileZ.txt', 'a') # file Z is an example for output (exampleoutput.txt)
+    fileZ = open('minimalSet/fileZ.txt', 'a') # file Z is an example for output (exampleoutput.txt)
     fileZ.write(df['sequence_id'].iloc[0] + '\n')
     
     kmer_to_remove = df['matched_kmer'].iloc[0].split(', ')
@@ -337,8 +347,8 @@ while(len(remain_kmer) != 0):
 from Bio import SeqIO
 
 fasta_file = "inputfile.fas" # Input fasta file
-wanted_file = "fileZ.txt" # Input interesting sequence IDs, one per line
-result_file = "FileZ.fasta" # Output fasta file
+wanted_file = "minimalSet/fileZ.txt" # Input interesting sequence IDs, one per line
+result_file = "minimalSet/fileZ.fasta" # Output fasta file
 
 wanted = set()
 with open (wanted_file) as f: 
@@ -353,6 +363,9 @@ with open (result_file, "w") as f:
     if seq.id in wanted: 
       SeqIO.write([seq], f, "fasta")
 ```
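
The Step 5 loop described above is a greedy selection: pick the sequence that covers the most remaining *k*-mers, record it, drop the covered *k*-mers, and repeat. The following is a minimal in-memory sketch of that idea, not the code in `U5.1_RemainingMinSet.py`. The function names `greedy_minimal_set` and `append_ids`, and the dictionary-of-sets input, are assumptions made for illustration; the real script reads matches from intermediate files with pandas.

```python
from pathlib import Path


def greedy_minimal_set(seq_kmers, remaining_kmers):
    """Return sequence IDs chosen greedily by remaining k-mer coverage.

    seq_kmers: dict mapping a sequence ID to the set of k-mers it contains
               (hypothetical in-memory stand-in for files A# and B#).
    remaining_kmers: iterable of k-mers that still need to be covered.
    """
    remaining = set(remaining_kmers)
    chosen = []
    while remaining:
        # Sequence with the largest overlap with the uncovered k-mers.
        # max() returns the first maximum, so ties follow insertion order.
        best_id = max(seq_kmers, key=lambda sid: len(seq_kmers[sid] & remaining))
        covered = seq_kmers[best_id] & remaining
        if not covered:
            break  # nothing covers the leftover k-mers; avoid looping forever
        chosen.append(best_id)
        remaining -= covered
    return chosen


def append_ids(ids, out_path="minimalSet/fileZ.txt"):
    """Append one sequence ID per line, creating the output folder if needed."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("a") as handle:  # the context manager closes the file
        for sid in ids:
            handle.write(sid + "\n")
```

For example, with `{"s1": {"AAA", "AAC"}, "s2": {"AAC", "ACG", "CGT"}, "s3": {"CGT"}}` and all four *k*-mers remaining, the sketch selects `s2` first (three *k*-mers covered) and then `s1` (the one left over). Opening the output with `with` is a design choice of this sketch; it avoids leaving a file handle open on each loop iteration.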
+
+[To the main README >](https://github.com/ChongLC/MinimalSetofViralPeptidome-UNIQmin/blob/master/README.md)
+
 ---
 ## Figure Summary
 <img src="Summary.png" width="640" height="1075">
diff --git a/PythonScript/U5.1_RemainingMinSet.py b/PythonScript/U5.1_RemainingMinSet.py
@@ -48,7 +48,7 @@ def find_matching(line, A):
     df['matched_kmer'] = df['matched_kmer'].str.replace(r"\[|\]|'","")
 
     # save highest count id to file
-    fileZ = open('fileZ.txt', 'a')
+    fileZ = open('minimalSet/fileZ.txt', 'a')
     fileZ.write(df['sequence_id'].iloc[0] + '\n')
 
     # remove highest count kmer

diff --git a/PythonScript/U5.2_MinSet.py b/PythonScript/U5.2_MinSet.py
@@ -1,8 +1,8 @@
 from Bio import SeqIO
 
 fasta_file = "inputfile.fas" # Input fasta file
-wanted_file = "fileZ.txt" # Input interesting sequence IDs, one per line
-result_file = "FileZ.fasta" # Output fasta file
+wanted_file = "minimalSet/fileZ.txt" # Input interesting sequence IDs, one per line
+result_file = "minimalSet/fileZ.fasta" # Output fasta file
 
 wanted = set()
 with open (wanted_file) as f: 

diff --git a/randpseqgen/README.md b/randomizer/README.md
@@ -1,17 +1,17 @@
-# randpSeqGen: A tool to generate a random protein sequence dataset
+# randomizer: A tool to generate a random protein sequence dataset
 [![Run in Google Colab](https://img.shields.io/badge/Colab-Run_in_Google_Colab-blue?logo=Google&logoColor=FDBA18)](https://colab.research.google.com/drive/1IwNPKaRKGgPzqiOBuEo8S0VbUpe3XqVh?usp=sharing) <br>
 
 A random protein sequence dataset can be useful in various sequence analyses, to evaluate and correct the analysis for background noise. Herein, we offer a tool that can generate a dataset of random protein sequences.
 
 ---
 
 ### Usage
-`python randpseqgen.py [-h] [-o OUTPUT] [-l SEQLEN] [-n SEQNUM]`
+`python randomizer.py [-h] [-o OUTPUT] [-l SEQLEN] [-n SEQNUM]`
 
 In the usage case below, the randpSeqGen tool is applied to generate a random protein sequence dataset, named `randomproteinseq.fasta`, consisting of 1,000 sequences, each 1,000 amino acids long. The amino acid composition of the random sequences is based on all reported viral sequences retrieved from the NCBI Protein database (as of May 2021; `allVirus080521.fasta`). <br>
 
 ```
-python randpseqgen.py -o randomproteinseq.fasta -l 1000 -n 1000
+python randomizer.py -o randomproteinseq.fasta -l 1000 -n 1000
 ```
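
To make the `-o`, `-l` and `-n` options concrete, here is a rough sketch of how such a generator could be put together. It is not the renamed script from this commit. It assumes a uniform amino acid composition unless `weights` is supplied, whereas the README states that the real tool uses a composition derived from `allVirus080521.fasta`. The function names, default values and the `random_N` header format are all hypothetical.

```python
import argparse
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def random_proteins(seq_num, seq_len, weights=None, seed=None):
    """Return seq_num random sequences of seq_len residues each.

    weights: optional per-residue weights in AMINO_ACIDS order; None means
             uniform composition (an assumption of this sketch).
    """
    rng = random.Random(seed)
    return [
        "".join(rng.choices(AMINO_ACIDS, weights=weights, k=seq_len))
        for _ in range(seq_num)
    ]


def write_fasta(seqs, path):
    """Write sequences as FASTA records with hypothetical 'random_N' IDs."""
    with open(path, "w") as handle:
        for i, seq in enumerate(seqs, start=1):
            handle.write(f">random_{i}\n{seq}\n")


def main(argv=None):
    # Short flags mirror the usage line above; the defaults are assumptions.
    parser = argparse.ArgumentParser(description="Random protein dataset sketch")
    parser.add_argument("-o", dest="output", default="randomproteinseq.fasta")
    parser.add_argument("-l", dest="seqlen", type=int, default=1000)
    parser.add_argument("-n", dest="seqnum", type=int, default=1000)
    args = parser.parse_args(argv)
    write_fasta(random_proteins(args.seqnum, args.seqlen), args.output)
```

Calling `main(["-o", "randomproteinseq.fasta", "-l", "1000", "-n", "1000"])` mirrors the command shown above. Passing a `seed` to `random_proteins` makes a run reproducible, which can help when the random dataset serves as a background control.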
 
 #### Command-line Arguments

diff --git a/randpseqgen/randomproteinseq.fasta b/randomizer/randomproteinseq.fasta
diff --git a/randpseqgen/randpseqgen.py b/randomizer/v_randomizer.py
diff --git a/uniqmin.sh b/uniqmin.sh
@@ -3,23 +3,26 @@
 
 dir=/backup/user/ext/perdana/lichuin/cdhitObj2/testingStitchScript/
 cd $dir
-python p1.py
+python3 U1_KmerGenerator.py
 wait 
-python p2.py
+python3 U2.1_Singletons.py
 wait
-python p2_2.py
+python3 U2.2_Multitons.py
 wait
-python p3.py
+python3 U3.1_PreQualifiedMinSet.py
 wait 
-python p3_2.py
+python3 U3.2_UnmatchedSingletons.py
 wait 
-python p4_1.py
+python3 U4.1_Non-SingletonsDedup.py
 wait 
-python p4_2.py
+python3 U4.2_Multi-OccurringPreMinSet.py
 wait 
-python p4_3.py
+python3 U4.3_UnmatchedMulti-Occurring.py 
 wait
-cp seqfileZ.txt fileZ.txt
+mkdir minimalSet
+cp seqfileZ.txt minimalSet/fileZ.txt
 mkdir match
 wait 
-python p5.py
+python3 U5.1_RemainingMinSet.py
+wait
+python3 U5.2_MinSet.py
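
For comparison, the same stitching order can be sketched in Python. In the shell script, each `wait` follows a foreground command, so it has no background job to wait for, and a failing step does not stop the later ones. The sketch below stops at the first failing script and tolerates pre-existing `minimalSet` and `match` folders. The script names are taken from the diff above; the function names and the split into two stage lists are assumptions made for illustration.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Script order as listed in uniqmin.sh after this commit.
STAGE_1_TO_4 = [
    "U1_KmerGenerator.py",
    "U2.1_Singletons.py",
    "U2.2_Multitons.py",
    "U3.1_PreQualifiedMinSet.py",
    "U3.2_UnmatchedSingletons.py",
    "U4.1_Non-SingletonsDedup.py",
    "U4.2_Multi-OccurringPreMinSet.py",
    "U4.3_UnmatchedMulti-Occurring.py",
]
STAGE_5 = ["U5.1_RemainingMinSet.py", "U5.2_MinSet.py"]


def run_scripts(scripts, workdir):
    """Run each script in order; check=True raises on a non-zero exit code."""
    for script in scripts:
        subprocess.run([sys.executable, script], cwd=workdir, check=True)


def prepare_step5(workdir):
    """Create minimalSet/ and match/, then seed fileZ.txt from seqfileZ.txt."""
    workdir = Path(workdir)
    (workdir / "minimalSet").mkdir(exist_ok=True)  # no error if already present
    (workdir / "match").mkdir(exist_ok=True)
    shutil.copy(workdir / "seqfileZ.txt", workdir / "minimalSet" / "fileZ.txt")


def run_pipeline(workdir="."):
    """Hypothetical driver mirroring the order of commands in uniqmin.sh."""
    run_scripts(STAGE_1_TO_4, workdir)
    prepare_step5(workdir)
    run_scripts(STAGE_5, workdir)
```

Calling `run_pipeline("/path/to/workdir")` would run all ten scripts from that folder. Because `prepare_step5` copies `seqfileZ.txt` afresh each time, a rerun starts from the pre-qualified minimal set rather than appending to the IDs from a previous run.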