-
Notifications
You must be signed in to change notification settings - Fork 0
HOWTO:_Getting_Genomic_Sequences
Naohisa Goto edited this page Jun 14, 2016
·
1 revision
Developped a quick script following this biostar question:
#!~/bin/ruby
require 'bio'
Bio::NCBI.default_email = "[email protected]" # required by NCBI
ncbi = Bio::NCBI::REST.new
hymenopteranSequenceIDs = ncbi.esearch("Hymenoptera[organism]",
{ "db"=>"protein", "rettype"=>"gb", "retmax"=> 10000000})
outputFile = 'genbankHymenopteranProts.fasta'
raise IOError, 'File exists!!' + outputFile if File.exists?(outputFile)
# get sequences, 100 at a time to show progression for many sequences. Writing to file in realtime in case the connection dies.
while (hymenopteranSequenceIDs.length > 1)
puts hymenopteranSequenceIDs.length.to_s + ' sequences left to download'
sequenceIdSlice = hymenopteranSequenceIDs.slice!(1..100)
sequences = ncbi.efetch(ids = sequenceIdSlice,
hash ={"db"=>"protein", "rettype"=>"fasta", "retmax"=> 10000000})
sequences.gsub!("\n\n", "\n") # ncbi returns a single big string with records separated by two newlines
File.open(outputFile, 'a') {|f| f.write(sequences + "\n") }
end