Skip to content

SampleCodes

Naohisa Goto edited this page Jun 14, 2016 · 1 revision

This document is based on the original 'BioJava in Anger' by Mark Schreiber et al. "BioJava in Anger, A Tutorial and Recipe Book for Those in a Hurry".

TOC

Introduction

BioRuby can be both big and intimidating. For those of us who are in a hurry there really is a whole lot there to get your head around. This document is designed to help you develop BioRuby programs that do 99% of common tasks without needing to read and understand 99% of the BioRuby API.

The page was inspired by various programming cookbooks and follows a "How do I...?" type approach. Each How do I is linked to some example code that does what you want and sometimes more. Basically if you find the code you want and copy and paste it into your program you should be up and running quickly. I have endeavoured to over document the code to make it more obvious what I am doing so some of the code might look a bit bloated.

'BioRuby in Anger' is maintained by Toshiaki Katayama and Pjotr Prins. If you have any suggestions, questions or comments contact the BioRuby mailing list.

These demos are tested with BioRuby 0.5.3 and Ruby 1.6.8 (Partly 1.8.1).

Alphabets and Symbols

In BioRuby, Sequence class inherits String so you can treat Sequence object as a String with various powerful methods implemented in Ruby's String class. You can easily generate DNA and/or Amino Acid sequences to edit, extract subsequence, regexp pattern match on it with usual methods for String object. Sequence class also has methods for splicing, translation, calculate stastical values, window search etc.

There are nothing equivarent to BioJava's Alphabet and/or Symbols in BioRuby, however, BioRuby provides lists of nucleic acids, amino acids and codon tables and use it transparently in appropreate methods?as needed.

How can I make an ambiguous Symbol like Y or R?

The IBU defines standard codes for symbols that are ambiguous such as Y to indicate C or T and R to indicate G or C or N to indicate any nucleotide. BioRuby represents these symbols as the same Bio::Sequence::NA object which can be easily converted to Regular expression that matches components of the ambiguous symbols. In turn, Bio::Sequence::NA object can contain symbols matching one or more component symbols that are valid members of the same alphabet as the Bio::Sequence::NA and are therefore capable of being ambiguous.

Generally an ambiguity symbol is converted to a Regexp object by calling the to_re method from the Bio::Sequence::NA that contains the symbol itself. You don't need to make symbol 'Y' by yourself because it is already built in the Bio::NucleicAcid class as a hash named Bio::NucleicAcid::Names.

#!/usr/bin/env ruby

require 'bio'

# creating a Bio::Sequence::NA object containing ambiguous alphabets
ambiguous_seq = Bio::Sequence::NA.new("atgcyrwskmbdhvn")

# show the contents and class of the DNA sequence object
p ambiguous_seq              # => "atgcyrwskmbdhvn"
p ambiguous_seq.class        # => Bio::Sequence::NA

# convert the sequence to a Regexp object
p ambiguous_seq.to_re        # => /atgc[tc][ag][at][gc][tg][ac][tgc][atg][atc][agc][atgc]/
p ambiguous_seq.to_re.class  # => Regexp

# example to match an ambiguous sequence to the rigid sequence
att_or_atc = Bio::Sequence::NA.new("aty").to_re
puts "match" if att_or_atc.match(Bio::Sequence::NA.new("att"))
puts "match, too" if Bio::Sequence::NA.new("att").match(att_or_atc)
if Bio::Sequence::NA.new("atc") =~ att_or_atc
  puts "also match"
end

Basic Sequence Manipulation

How do I make a Sequence from a String or make a Sequence Object back into a String?

A lot of the time we see sequence represented as a String of characters eg "atgccgtggcatcgaggcatatagc". It's a convenient method for viewing and succinctly representing a more complex biological polymer. BioRuby makes use of a Ruby's String class to represent these biological polymers as Objects. Unlike BioJava's SymbolList, BioRuby's Bio::Sequence inherits String and provide extra methods for the sequence manipulation. We don't have a container class like a BioJava's Sequence class, to store things like the name of the sequence and any features it might have, you can think of to use other container classes such as a Bio::FastaFormat, Bio::GFF, Bio::Features etc. for now (We have a plan to prepare a general container class for this to be compatible with a Sequence class in other Open Bio* projects).

Bio::Sequence class has same capability as a Ruby's String class, it is simple easy to use. You can represent a DNA sequence within the Bio::Sequence::NA class and a protein sequence within the Bio::Sequence::AA class. You can translate DNA sequence into protein sequence with a single method call and can concatenate them with the same method '+' as a String class's.

String to Bio::Sequence object

Simply pass the sequence string to the constructor.

#!/usr/bin/env ruby

require 'bio'

# create a DNA sequence object from a String
dna = Bio::Sequence::NA.new("atcggtcggctta")

# create a RNA sequence object from a String
rna = Bio::Sequence::NA.new("auugccuacauaggc")

# create a Protein sequence from a String
aa = Bio::Sequence::AA.new("AGFAVENDSA")

# you can check if the sequence contains illegal characters
# that is not an accepted IUB character for that symbol
# (should prepare a Bio::Sequence::AA#illegal_symbols method also)
puts dna.illegal_bases

# translate and concatenate a DNA sequence to Protein sequence
newseq = aa + dna.translate
puts newseq      # => "AGFAVENDSAIGRL"

String to Sequence with comments

Yes, we should prepare a better container class for this. Temporally, you can do this as:

#!/usr/bin/env ruby

require 'bio'

# store a DNA sequence with the name dna_1 in a Bio::FastaFormat object
dna = Bio::Sequence::NA.new("atgctg").to_fasta("dna_1")

# store a RNA sequence with the name rna_1 in a Bio::FastaFormat object
rna = Bio::Sequence::NA.new("augcug").to_fasta("rna_1")

# store a Protein sequence with the name prot_1 in a Bio::FastaFormat object
prot = Bio::Sequence::AA.new("AFHS").to_fasta("prot_1")

# you can extract a name and a sequence stored in a Bio::FastaFormat object.
fasta_seq = Bio::FastaFormat.new(dna)
puts fasta_seq.entry_id  # => "dna_1"
puts fasta_seq.naseq     # => "atgctg"

Bio::Sequence to String

You don't need to call any method to convert a Bio::Sequence object to use as a String object because it can behave as a String, although you can call a to_s method to stringify explicitly.

#!/usr/bin/env ruby

# you can use Bio::Sequence object as a String object to print, seamlessly
dna = Bio::Sequence::NA.new("atgc")
puts dna        # => "atgc"
str = dna.to_s
puts str        # => "atgc"

How do I get a subsection of a Sequence?

Given a Sequence object we might only be interested in examining the first 10 bases or we might want to get a region between two points. You might also want to print a subsequence to a file or to STDOUT how could you do this?

BioRuby partly uses a biological coordinate system for identifying bases. You can use Bio::Sequence#subseq method to extract subsequence as the first base is numbered 1 and the last base index is equal to the length of the sequence. Other methods that are inherited from a String class use a normal String indexing which starts at 0 and proceedes to length -1. If you attempt to access a region outside of 1..length with a subseq method nil will be returned. Other methods in a String class will behave as a same.

Getting a Subsequence

# sample DNA sequence
seq = Bio::Sequence::NA.new("atgcatgc")

# get the first symbol
sym1 = seq.subseq(1,1)  # => "a"
sym2 = seq[0]           # => 97 (ascii code for "a", ruby's default behavior)

# get the first three bases
seq1 = seq.subseq(1,3)  # => "atg"
seq2 = seq[0,3]         # => "atg"
seq3 = seq[0..2]        # => "atg"

# get the last three bases
seq4 = seq.subseq(seq.length - 2, seq.length)  # => "tgc"
seq5 = seq[-3,3]                               # => "tgc", this is probably the most elegant solution
seq6 = seq[-3..-1]                             # => "tgc"

Printing a Subsequence

# print the last three bases of a SymbolList or Sequence
#  // Is the '-3' true for BioJava?
#  String s = symL.subStr(symL.length() - 3, symL.length());
puts seq.subseq(seq.length - 2, seq.length)

Complete Listing

#!/usr/bin/env ruby

require 'bio'

# generate an RNA sequence
seq = Bio::Sequence::NA.new("auggcaccguccagauu")

# get the first Symbol
sym = seq.subseq(1,1)    # => "a"

# get the first three bases
seq2 = seq.subseq(1,3)

# get the last three bases
seq3 = seq.subseq(seq.length - 2, seq.length)

# print the last three bases
s = seq.subseq(seq.length - 2, seq.length)
puts s

Iteration

You can iterate on every subsequences easily with BioRuby.

#!/usr/bin/env ruby

require 'bio'

seq = Bio::Sequence::NA.new("auggcaccguccagauu")

window_size = 3
step_size   = 3

# this will print all the codons as "aug", "gca", "ccg", "ucc", "aga"
seq.window_search(window_size, step_size) do |subseq|
  puts subseq
end

# this will calculate molecular weight for each codon
seq.window_search(window_size, step_size) do |subseq|
  puts subseq.molecular_weight
end

How do I transcribe a DNA Sequence to a RNA Sequence?

In BioRuby, DNA and RNA sequences are stored in the same Bio::Sequence::NA class just using different Alphabets, you can convert from DNA to RNA or RNA to DNA using the rna or dna methods, respectively.

#!/usr/bin/env ruby

require 'bio'

# make a DNA sequence
dna = Bio::Sequence::NA.new("atgccgaatcgtaa")

# transcribe it to RNA
rna = dna.rna

# just to prove it worked
puts dna       # => "atgccgaatcgtaa"
puts rna       # => "augccgaaucguaa"

# revert to the DNA again
puts rna.dna   # => "atgccgaatcgtaa"

How do I reverse complement a DNA or RNA Sequence?

To reverse complement a DNA sequence simply use the complement method.

#!/usr/bin/env ruby

require 'bio'

# make a DNA sequence
seq = Bio::Sequence::NA.new("atgcacgggaactaa")

# reverse complement it
rev = seq.complement

# prove that it worked
puts seq  # => "atgcacgggaactaa"
puts rev  # => "ttagttcccgtgcat"

Sequences are immutable so how can I change it's name?

Sequences are not immutable in BioRuby - just use the freeze method to make sequence unchangable.

How can I edit a Sequence or SymbolList?

Sometimes you will want to modify the order of Symbols in a sequence. For example you may wish to delete some bases, insert some bases or overwrite some bases in a DNA sequence. BioRuby's Bio::Sequence object can be edited by any methods inherited from Ruby's String class.

#!/usr/bin/env ruby

require 'bio'

# create a DNA sequence
seq = Bio::Sequence::NA.new("atggct")

# add "cc" to the end
seq += Bio::Sequence::NA.new("cc")

# should now be atggctcc
puts seq  # => "atggctcc"

# insert "tt" at the start
seq = Bio::Sequence::NA.new("tt") + seq

# should now be ttatggctcc
puts seq  # => "ttatggctcc"

# insert "aca" at position 4
seq[3,0] = Bio::Sequence::NA.new("aca")

# should now be ttaacatggctcc
puts seq  # => "ttaacatggctcc"

# overwrite at position 2, 3 bases with "ggg"
seq[1,3] = Bio::Sequence::NA.new("ggg")

# should now be tgggcatggctcc
puts seq  # => "tgggcatggctcc"

# delete from the start 5 bases (overwrite 5 bases with nothing)
seq[0,5] = ""

# should now be atggctcc
puts seq  # => "atggctcc"

# now a more complex example

# overwrite positions two and three with aa and then insert tt
seq[1,2] = Bio::Sequence::NA.new("aa") + Bio::Sequence::NA.new("tt")

# should now be aaattgctcc
puts seq  # => "aaattgctcc"

How can I make a sequence motif into a regular expression?

In BioRuby, to make a Sequence into a Ruby's regular expression pattern is done by calling to_re method for the Bio::Sequence::NA object. The generated pattern can even be from an ambiguous sequence such as "acgytnwacrs". For the protein sequences, you can do this by just wrapping a Bio::Sequence::AA object with // as usual for the String to make it Regexp (or use Regexp.new). You can then use this pattern to search Strings for the existence of that pattern.

#!/usr/bin/env ruby

# Translated from an example in "BioJava in Anger"
#
# * Lists all instances of a motif in specified (dna/rna/protein) fasta file.
# * The motif can contain Ambiguity symbols
# * Lists the ORF title and position of motif
# * Outputs a list of counts to stdout.

require 'bio'

STDERR.puts <<END if ARGV.size < 4
Usage: #{$0} type motif frame target_files

Example:
% #{$0} dna AAAAAAG 3 ecoli*.fasta > output.txt

would search for A AAA AAG in the third frame in the fasta format
dna sequence files matched to ecoli*.fasta and print the results
to file output.txt.

'type' can be dna, rna or protein.
'frame' can be integers 0 through 3.
0 counts any instance of the motif.
1,2,3 counts only instances of the motif in the specified frame.
END

def generate_regexp(type, motif)
  case type
  when 'dna'
    pat = Bio::Sequence::NA.new(motif).to_re
  when 'rna'
    pat = Bio::Sequence::NA.new(motif).rna.to_re
  when 'protein'
    pat = /#{Bio::Sequence::AA.new(motif)}/
  end
  return pat
end

def check_frame(placement)
  frame = placement.to_i
  if frame < 0 or frame > 3
    STDERR.puts "Only frames 0 through 3 are allowed"
    STDERR.puts "frame zero searches all frames."
    exit 0
  end
  return frame
end

def motif_search(type, pattern, frame, entry)
  count = 0
  start = 0

  case type
  when 'dna'
    seq = entry.naseq
  when 'rna'
    seq = entry.naseq.rna
  when 'protein'
    seq = entry.aaseq
  end

  while seq.index(pattern, start)
    start, stop = Regexp.last_match.offset(0)
    if frame == 0 or start % 3 + 1 == frame
      puts "#{entry.entry_id} : [#{start + 1},#{stop}]"
      count += 1
    end
    start += 1
  end
  return count
end

### main

type = ARGV.shift
motif = ARGV.shift
placement = ARGV.shift

frame = check_frame(placement)
pattern = generate_regexp(type, motif)

ARGV.each do |file|
  puts "Searching file #{file} for the motif '#{motif}' in frame #{frame}."
  count = 0
  Bio::FlatFile.open(Bio::FastaFormat, file) do |fasta_file|
    fasta_file.each do |entry|
    count += motif_search(type, pattern, frame, entry)
  end
end
puts "Total hits = #{count}"
end

Translation

How do I translate a DNA or RNA Sequence or SymbolList to Protein?

All you need is to call a translate method for a Bio::Sequence::NA object. In BioRuby, you don't need to convert DNA to RNA before its translation.

#!/usr/bin/env ruby

require 'bio'

# create a DNA sequence
seq = Bio::Sequence::NA.new("atggccattgaatga")

# translate to protein
prot = seq.translate

# prove that it worked
puts seq   # => "atggccattgaatga"
puts prot  # => "MAIE*"

How do I translate a single codon to a single amino acid?

The general translation example shows how to use the translate method of Bio::Sequence::NA object but most of what goes on is hidden behind the convenience method. If you only want to translate a single codon into a single amino acid you get exposed to a bit more of the gory detail but you also get a chance to figure out more of what is going on under the hood.

#!/usr/bin/env ruby

require 'bio'

# make a 'codon'
codon = Bio::Sequence::NA.new("uug")

# you can translate the codon as described in the previous section.
puts codon.translate  # => "L"

Here's the other way

#!/usr/bin/env ruby

require 'bio'

# make a 'codon'
codon = Bio::Sequence::NA.new("uug")

# select the standard codon table
codon_table = Bio::CodonTable[1]

# You need to convert RNA codon to DNA alphabets because the
# CodonTable in BioRuby is implemented as a static Hash with keys
# expressed in DNA alphabets (not RNA alphabets).
codon2 = codon.dna

# get the representation of that codon and translate to amino acid.
amino_acid = codon_table[codon2]
puts amino_acid        # => "L"

How do I use a non standard translation table?

The convenient translate method in Bio::Sequence::NA, used in the general translation example, is not limited to use the "Universal" translation table. The translate method also accepts a translation starting frame and a codon table number as its arguments.

The following translation tables are available:

  • 1 - Standard (Eukaryote)
  • 2 - Vertebrate Mitochondrial
  • 3 - Yeast mitochondorial
  • 4 - Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma
  • 5 - Invertebrate Mitochondrial
  • 6 - Ciliate Macronuclear and Dasycladacean
  • 9 - Echinoderm Mitochondrial
  • 10 - Euplotid Nuclear
  • 11 - Bacteria
  • 12 - Alternative Yeast Nuclear
  • 13 - Ascidian Mitochondrial
  • 14 - Flatworm Mitochondrial
  • 15 - Blepharisma Macronuclear
  • 16 - Chlorophycean Mitochondrial
  • 21 - Trematode Mitochondrial
  • 22 - Scenedesmus obliquus mitochondrial
  • 23 - Thraustochytrium Mitochondrial Code

The following program shows the use of the Euplotid Nuclear translation table (where UGA = Cys).

#!/usr/bin/env ruby

require 'bio'

# make a DNA sequence including the 'tga' codon
seq = Bio::Sequence::NA.new("atgggcccatgaaaaggcttggagtaa")

# translate from the frame 1 with codon table 10 (Euplotid Nuclear)
protein = seq.translate(1, 10)

# print out the protein
puts protein        # => "MGPCKGLE*"

# compared to the universal translation table
puts seq.translate  # => "MGP*KGLE*"

How do I translate a nucleotide sequence in all six frames?

#!/usr/bin/env ruby

require 'bio'

# Create an empty hash to place the results within
data = Hash.new

seq = Bio::Sequence::NA.new("atgggcccatgaaaaggcttggagtaa")

(1..6).each do |frame|
  # Store the results within the data hash
  data[frame] = seq.translate(frame)
end

# Print the data hash
puts data # => {1=>"MGP*KGLE*", 2=>"WAHEKAWS", 3=>"GPMKRLGV", 4=>"LLQAFSWAH", 5=>"YSKPFHGP", 6=>"TPSLFMGP"}

Sequence I/O

How do I write Sequences in Fasta format?

FASTA format is a fairly standard bioinformatics output that is convenient and easy to read. BioRuby's Sequence class has a to_fasta method for formatting sequence in FASTA format.

Printing any Bio::Sequence sequence object in FASTA format.

#!/usr/bin/env ruby

require 'bio'

# Generates a sample 100bp sequence.
seq1 = Bio::Sequence::NA.new("aatgacccgt" * 10)

# Naming this sequence as "testseq" and print in FASTA format
# (folded by 60 chars per line).
puts seq1.to_fasta("testseq", 60)

How do I read in a Fasta file?

This program opens FASTA format file for reading and iterates on each sequence in the file.

#!/usr/bin/env ruby

require 'bio'

file = Bio::FastaFormat.open(ARGV.shift)
file.each do |entry|
# do something on each fasta sequence entry
end

This program automatically detects and reads FASTA format files given as its arguments.

#!/usr/bin/env ruby

require 'bio'

Bio::FlatFile.auto(ARGF) do |ff|
ff.each do |entry|
  # do something on each fasta sequence entry
end
end

Similar but specify FASTA format explicitly.

#!/usr/bin/env ruby

require 'bio'

Bio::FlatFile.open(Bio::FastaFormat, ARGV[0]) do |ff|
  ff.each do |entry|
    # do something on each fasta sequence entry
  end
end

How do I read a GenBank, EMBL or SwissProt file?

The Bio::FlatFile class contains methods for reading GenBank, SwissProt and EMBL files. Because any of these files can contain more than one sequence entry Bio::FlatFile can be used with a block to iterate through the individual sequences. One of the attractive features of this model is that the Sequences are only parsed and created as needed so very large collections of sequences can be handled with moderate resources.

Information in the file is stored in the Sequence (for now, this is one of the Bio::GenBank, Bio::EMBL or Bio::SwissProt classes, please wait for our future implementation of the generic Sequence class for this) as Bio::Features or where there is location information as Bio::Locations. Reading GenBank, EMBL or SwissProt

Save the following script as the readflat.rb and execute it with files in GenBank, EMBL or SwissProt format as arguments like:

ruby ./readflat.rb hoge*.seq

The file format is auto detected.

#!/usr/bin/env ruby
require 'bio'

# read the sequence entry by entry through the files listed in ARGV.
sequences = Bio::FlatFile.auto(ARGF)

sequences.each do |seq|
  # do stuff with the sequence entry
  puts seq.entry_id
end

How do I extract GenBank, EMBL or Swissprot sequences and write them as Fasta?

To perform this task we are going to extend the general reader from the previous demo and include in it the ability to write sequence data in fasta format.

#!/usr/bin/env ruby
require 'bio'

# read the sequence entry by entry through the files listed in ARGV.
entries = Bio::FlatFile.auto(ARGF)

# iterates on each entry to print the fasta formated string.
entries.each do |entry|
  name = entry.entry_id
  seq  = entry.naseq     # use aaseq method in the case of protein database
  puts seq.to_fasta(name)
end

How do I turn an ABI sequence trace into a BioJava Sequence?

BioRuby doesn't support ABI trace at the moment...

Annotations

How do I list the Annotations in a Sequence?

To be done...

How do I filter a Sequences based on their species (or another Annotation property)?

To be done...

Locations and Features

How do I specify a PointLocation?

To be done...

How do I specify a RangeLocation?

To be done...

How do CircularLocations work?

To be done...

How can I make a Feature?

To be done...

How can I filter Features by type?

To be done...

How can I remove features?

To be done...

BLAST and FASTA

How do I set up a BLAST parser?

A frequent task in bioinformatics is the generation of BLAST search results.

Save the following script as a blastparse.rb and execute it with a file in BLAST XML output format (-m 7) like:

blastall -p blastp -d yourdb -i yourquery -m 7 > blastoutput.xml
ruby blastparse.rb blastoutput.xml

This script will print BLAST hits and the scores having E-value smaller than 0.001.

#!/usr/bin/env ruby
require 'bio'

Bio::Blast.reports(ARGF) do |report|
  puts "Hits for " + report.query_def + " against " + report.db
  report.each do |hit|
    print hit.target_id, " -> ", hit.evalue if hit.evalue < 0.001
  end
end

We recommend you to install Ruby's xmlparser and expat library to make parsing much faster.


You can also parse blast output with one of the following two approaches:

  • Example (to use File object):
#...
f = File.open(path)
Bio::Blast.reports(f) do |report|
  #...
end
  • Example (to use String containing file content):
#...
str = File.read(path)
Bio::Blast.reports(str) do |report|
  #...
end

How do I set up a FASTA parser? (or GenBank, Kegg etc.)

Search fasta file tags using a regular expression (regex) and print to stdout

Usage: fastasearch [-q query] filename(s)

Example:

ruby fastasearch -q '/([Hh]uman|[Hh]omo sapiens)/' nr.fa > hs.fa

Source:

if !query
  print "Give query: "
  query = $stdin.gets.chomp
end

ARGV.each do | fn |
  f = Bio::FlatFile.auto(fn)
  f.each_entry do | e |
    if e.definition =~ /#{query}/
      print '>',e.definition,e.data
    end
  end
end

But in fact the BioRuby parser is a lot more powerful than that and is not limitied to Fasta. You can run

re_grep_def.rb 'serine.* kinase' genbank/gb*.seq
re_grep_def.rb 'serine.* kinase' kegg/genes/*.ent
re_grep_def.rb 'serine.* kinase' kegg/sequences/*.pep

Using the following

require 'bio'

re = /#{ARGV.shift}/i

Bio::FlatFile.auto(ARGF) do |ff|
  ff.each do |entry|
    if re.match(entry.definition)
      puts ff.entry_raw
    end
  end
end

How do I extract information from parsed results?

To be done...

Counts and Distributions

How do I count the residues in a Sequence?

#!/usr/bin/env ruby

require 'rubygems'
require 'bio'

seq = Bio::Sequence::NA.new("atgcatgcaaaa")
seq.composition.each do |nuc, count|
  puts "#{nuc}:#{count}"
end

How do I calculate the frequency of a Symbol in a Sequence?

To be done...

How can I turn a Count into a Distribution?

Using a function.

#!/usr/bin/env ruby

require 'rubygems'
require 'bio'

def distr(seq)
  l=seq.length.to_f
  dist={}
  seq.composition.each do |nuc, count|
    dist[nuc]=count/l.to_f
  end
  return dist
end

seq = Bio::Sequence::NA.new("atgcatgcaaaa")

p seq
p seq.composition
p distr(seq)

Output

"atgcatgcaaaa"
{"a"=>6, "c"=>2, "g"=>2, "t"=>2}
{"a"=>0.5, "c"=>0.166666666666667, "g"=>0.166666666666667, "t"=>0.166666666666667}

Using a class method.

#!/usr/bin/env ruby

require 'rubygems'
require 'bio'

class Bio::Sequence::NA
  def distribution
    length=self.length.to_f
    dist={}
    self.composition.each do |nuc, count|
      dist[nuc]=count/length
    end
    dist
  end
end

seq = Bio::Sequence::NA.new("atgcatgcaaaa")

p seq
p seq.composition
p seq.distribution

Output

"atgcatgcaaaa"
{"a"=>6, "c"=>2, "g"=>2, "t"=>2}
{"a"=>0.5, "c"=>0.166666666666667, "g"=>0.166666666666667, "t"=>0.166666666666667}

How can I generate a random sequence from a Distribution?

To be done...

How can I find the amount of information or entropy in a Distribution?

To be done...

What is an easy way to tell if two Distributions have equal weights?

To be done...

How can I make an OrderNDistribution over a custom Alphabet?

To be done...

How can I write a Distribution as XML?

To be done...

Weight Matrices and Dynamic Programming

How do I use a WeightMatrix to find a motif?

To be done...

How do I make a HMMER like profile HMM?

To be done...

How do I set up a custom HMM?

To be done...

User Interfaces

How can I visualize Annotations and Features as a tree?

To be done...

How can I display a Sequence in a GUI?

To be done...

How do I display Sequence coordinates?

To be done...

How can I display features?

To be done...

OBDA

How do I set up BioSQL?

To be done...

Structure I/O and Manipulation

How do I read a structure file?

The current implementation only reads PDB format files, there will hopefully be code for parsing mmCIF and PDB XML files in the future. Adding these new formats will probably change the API for reading and writing structure files.

Given the above caveat, the current way to read a PDB file is to slurp the file into a string which is fed to the constructor of the Bio::PDB class:

require 'bio/db/pdb'
string = IO.read('pdb1tim.ent')
structure = Bio::PDB.new(string)

The new Bio::PDB object now contains all the annotations and coordinate data.

How do I write a structure file?

As with reading, the current implementation can only write in PDB format.

All the coordinate classes in the Bio::PDB heirarchy have .to_s methods which return the object in PDB format. This makes output as simple as:

require 'bio/db/pdb'
file = File.new('pdb1tim.ent').gets(nil)
structure = Bio::PDB.new(file)
puts structure.to_s # Writes whole structure
puts structure[nil]['A'].to_s # Writes chain A only

How do I access annotations?

The annotations from the PDB file are stored in Bio::PDB::Record objects. You can retrieve a specific annotation by calling the '.record' method of a Bio::PDB object with the name of the annotation as the argument:

structure.record("HEADER")

In fact '.record' returns an array of all Bio::PDB::Records of the given type, so you'll need to call '.first' to get to the actual Bio::PDB::Record. Bio::PDB::Record provides methods to get to the individual data in each record. E.g. The 'HEADER' record provides classification, depDate and idCode methods. You can look at the PDB format to see what fields are available in each record, or just look in the pdb.rb file which has a hash of all the definitions.

So, to get the title of a PDB file you would do it like this:

structure.record('TITLE').first.title

Some records have their own special methods:

structure.remark(num)  #Returns hash of the REMARK records based on 'num'
structure.jrnl         #Returns a hash of the JRNL records
structure.jrnl('TITL') #Returns an array of all the TITL subfields from
#the JRNL records
structure.helix(id)    #Returns the HELIX records based on 'id'
#Returns an array if no 'id' is given
structure.turn         #Same interface as '.helix'
structure.sheet        #Same interface as '.sheet'
structure.seqres(id)   #Returns the sequence given in the SEQRES record
#as a Bio::Sequence::AA object
structure.keywords     #returns a flattened lsit of keywords

And some methods are included for Bio::DB compatability:

structure.entry_id     #Returns idCode
structure.accession    #Same as entry_id
structure.definition   #Title
structure.version      #REVDAT

How do I access the coordinate data?

Coordinate data is stroed in a heirarchy of classes with Bio::PDB at the top. A Bio::PDB object contains one or more Bio::PDB::Model objects which contain one or more Bio::PDB::Chain objects which contain one or more Bio::PDB::Residue objects which contain Bio::PDB::Atom objects.

There are two ways to access the coordinate data in a Bio::PDB object: keyed access and iterators.

Keyed access provides the simplest way to get to data if you know which residues or atoms you're interested in. For instance if you wanted to get hold of chain 'A' you can do it like this:

chain = structure[nil]['A']

The nil refers to the model, which will be nil in crystal structures but may have a number in NMR structures. Every level in the heirarchy can be accessed this way. E.g. to get hold of the CA atom of residue 100 in chain 'A':

structure[nil]['A']['100']['CA']

or if you still have your chain object:

chain['100']['CA']

The residue identifier is not just the residue number. PDB residues can have insertion codes and even negative numbers.

chain['100A']['CA']

could be valid.

Iterators are also provided so you can easily go through all objects in the heirarchy. E.g:

structure.each{ |model|
  model.each{ |chain|
    chain.each{ |residue|
      residue.each{ |atom|
        puts atom
      }
    }
  }
}

Goes through every atom in the structure and outputs it. All the classes in the heirarchy implement the standard Enumerable and Comparable mixins as well.

Each class has a number of accessors to allow access to the data, most is in the Bio::PDB::Atom class:

  • Bio::PDB::Model has a 'model_serial' accessor only.
  • Bio::PDB::Chain has an 'id' accessor for getting and setting the chain id.
  • Bio::PDB::Residue has accessors for resName, resSeq and iCode.
  • Bio::PDB::Atom has accessors for serial, element, alt_loc, x, y, z, occ, and bfac

To find the x coordinate of the CA atom of residue 100 is then:

structure[nil]['A']['100']['CA'].x

How do I make a deep copy of a structure object?

You can't! There seem to be problems implementing a deep copy function in BioRuby. This is to be fixed in future versions.

Note to developers: have we tried the time-old hack of Marshal.load(Marshal.dump(obj)) to do the trick?

How do I measure distances and angles?

Methods for measuring distance and dihedral angles are provided in the Utils module. To measure distance, use the Bio::PDB::Utils.distance method:

res1 = structure[nil]['A']['100']
res2 = structure[nil]['A']['101']

distance = Bio::PDB::Utils.distance(res1['CA'].xyz,
                                    res2['CA'].xyz)

Bio::PDB::Utils.distance requires two Vectors as its arguments. Bio::PDB::Coordinate objects, which are produced by the .xyz call to the Bio::PDB::Atom object, are Vectors. You can also provide a Vector directly:

distance = Bio::PDB::Utils.distance(res1['CA'].xyz,
                                    [10,10,10])

And use the .centreOfGravity and .geometricCentre methods, both of which provide a Bio::PDB::Coordinate:

distance = Bio::PDB::Utils.distance(res1.centreOfGravity,
                                    res1.geometricCentre)

All objects in the Bio::PDB heirarchy implement the centreOfGravity and geometricCentre methods.

Dihedral angles are calculated from four Vectors / Bio::PDB::Coordinates:

phi = Bio::PDB::Utils.dihedral_angle(res1['C'].xyz,
                                     res2['N'].xyz,
                                     res2['CA'].xyz,
                                     res2['C'].xyz)
psi = Bio::PDB::Utils.dihedral_angle(res2['N'].xyz,
                                     res2['CA'].xyz,
                                     res2['C'].xyz,
                                     res3['N'].xyz)

How do I find specific atoms, residues, chains or models?

If the iterators and keyed access aren't powerful enough for you, the finder method is also provided. All the objects in the Bio::PDB hierarchy implement finder. It requires the class of object you wish to find as the first argument, and then a block which is evaluated for each object. If the block returns true then that object is added to an array of 'found' things.

For example, to find all the atoms within a box:

atoms_in_box = structure.finder(Bio::PDB::Atom){ |atom|
  atom.x.between?(10,20) &&
  atom.y.between?(20,30) &&
  atom.z.between?(30,40)
}

Or, to find all the HIS residues, in chain 'A' can be written like this:

his_residues = structure[nil]['A'].finder(Bio::PDB::Residue){ |res|
  res.resName == 'HIS'
}

or like this:

his_residues = structure.finder(Bio::PDB::Residue){ |res|
  res.resName == 'HIS' && res.chain.id == 'A'
}

Since the results are returned in arrays, it is very simple to do set operations, such as finding the overlap or difference between two selections:

sel1 = structure.finder(Bio::PDB::Atom){ |atom|
  atom.residue.resName == 'HIS'
}
sel2 =  structure.finder(Bio::PDB::Atom){ |atom|
  atom.x.between?(10,20) &&
  atom.y.between?(20,30) &&
  atom.z.between?(30,40)
}
sel3 = sel1 & sel2
sel1 = sel1 - sel3

Selection 3 now contains all the HIS atoms within the box and selection 1 contains all the HIS atoms outside the box.

How do I work with ligand data?

Because some PDB files reuse residue numbers that they already used in the ATOM records in the HETATM records, we have to add an extra flag when we want to access HETATM data. The extra flag is simply adding 'LIGAND' to the residue id. E.g Given the PDB file

ATOM      1  N   GLY A   2      -5.308  22.647  33.657  1.00 59.93
ATOM      2  CA  GLY A   2      -5.090  23.965  33.005  1.00 53.36
ATOM      3  C   GLY A   2      -6.209  24.197  32.021  1.00 52.56
ATOM      4  O   GLY A   2      -7.000  25.134  32.176  1.00 55.02
....
HETATM 2097  O8  PHT A   2     -18.122  40.603  18.146  1.00 16.14
HETATM 2098  O9  PHT A   2     -17.348  39.940  20.109  1.00 16.06
HETATM 2099  C10 PHT A   2     -15.622  41.753  16.662  1.00 13.34
HETATM 2100  O11 PHT A   2     -16.077  42.457  15.721  1.00 16.27

Asking for

structure[nil]['A']['2']

is ambiguous, so if you want the ligand data you must use

structure[nil]['A']['LIGAND2']

There should also be an .each_ligand method in the Bio::PDB::Chain class for iterating through the ligands, but this is not implemented in the current cvs version.

How do I work with solvent?

Solvent is defined in the current parser as any HETATM record with HOH as the residue type. This may not be ideal for some files, but deals with 95% of cases which is good enough for me at the moment!

Solvent residues are stored in a separate Bio::PDB::Chain attached to each Bio::PDB::Model. Currently the only methods available are 'addSolvent' which adds a residue to the solvent chain and 'removeSolvent', which sets the whole solvent chain to nil.

An each_solvent method for Bio::PDB::Model has been implemented in my copy but is not in cvs at the moment.

How do I get the sequence of a chain?

There are two types of sequence in a PDB file: the real sequence of whatever protein has been crystalised, and the sequence observable in the structure.

The first type of sequence is available through the SEQRES record (if this is provided in the PDB file). You can access this through the top-level Bio::PDB object:

structure.seqres['A']

provides the sequence given in the SEQRES record for chain A. The observed sequence is obtained from the Bio::PDB::Chain object. E.g.

structure[nil]['A'].atom_seq

Both methods return a Bio::Sequence::AA object.

Disclaimer

This code is generously donated by people who probably have better things to do. Where possible we test it but errors may have crept in. As such, all code and advice here in has no warranty or guarantee of any sort. You didn't pay for it and if you use it we are not responsible for anything that goes wrong. Be a good programmer and test it yourself before unleashing it on your corporate database.

Copyright

The documentation on this site is the property of the people who contributed it. If you wish to use it in a publication please make a request through the BioRuby mailing list.The original version was based on the 'BioJava in Anger' document by Mark Schreider et al.

The code is open-source. A good definition of open-source can be found here. If you agree with that definition then you can use it.

Clone this wiki locally