test_shotgun

Test_shotgun is a Fortran program which simulates a shotgun sequencing project. It is a handy tool for estimating not only how much work must be done (or remains to be done) for a given project, but also how much it will cost. It simulates projects up to 1 Mbp directly in memory. For projects larger than that it simulates it in 1 Mbp region and then scales the results. It is known to compile and run on OpenVMS, Digital Unix, and IRIX. Probably it will work on other platforms, or at most, require only minimal code changes. The source code is included in: http://seqaxp.bio.caltech.edu/pub/software/shotgun.zip If you build a Win32 or Macintosh version please let me have a copy of the binary to serve from my site. Please report any bugs that you encounter.

Below you will find an example test_shotgun session and the output it produced. Note that if you run the exact same session repeatedly you will get slightly different results (since the program uses a pseudorandom number generator), which is similar to what would happen in reality (where the positions of clones that are recovered are approximately random.)

The human genome project is the biggest sequencing project around, and I always wondered what it was really going to cost. Using test_shotgun, it was a simple enough matter to run a series of cost estimates with varying sequencing strategies and process costs. The results seemed fairly interesting for policy reasons, if not scientific ones, so I wrote it up as a short technical comment and submitted it to Nature (who rejected it without review) and Science (who rejected it after a review, and eight months of sitting on it.) The Science review was, well, I'll let you all judge for yourselves, but even sending them a rebuttal to the review did not change their minds. So, for your enjoyment and edification, I'm "publishing" the Science version here. The .zip distribution mentioned above includes the raw datafiles used to construct the figure in this paper, as well as the routines used to generate those datafiles.

David Mathog
mathog@seqaxp.bio.caltech.edu
Manager, sequence analysis facility
Biology division 156-29
Caltech
1200 E. California Blvd.
Pasadena, CA 91125


Example test_shotgun session
(Responses to prompts are in bold.)

Test_Shotgun
  This program tests various cloning strategies on a
  project.  Internally, it models this project with an area
  of      1000000 bases of DNA.  If your project is larger
  than this the results will be scaled to match.  If your
  project is smaller then only part of this region will
  be utilized.
  
  The total number of fragments (after scaling) that
  can be modeled is       50000,(which corresponds to
  a maximum density of one clone start every 40 base pairs.
  
  This program uses the term Mesh in addition to Contig. Usually
  Contigs refer to overlapping stretches of sequence.
  During the simulation the program can tell which
  clones overlap even if there is not enough common
  sequence to form a classical Contig.  In order to avoid
  confusion, each such region of overlapping clones
  (not usually the same as overlapping sequences) is called
  a Mesh.
  
  You may vary at each phase:
      the number of clones added
      the size (unit cost) of these clones
      the number of bases (cost) for each end sequence
  
  After all phases have been completed the program will
  estimate the costs and steps needed to finish the project.
  The project is finished when every region has been 
  sequenced at least once in each direction and the minimum
  end overlaps are satisfied.
  
  Note that Mesh/Contig numbers only scale if the number
  is more than one, and the scale value is greater than 1
  That is, multiple Mesh/Contigs are found and the project is
  bigger than the internal work area.
  
  You may optionally specify costs:
    per clone         (isolation, purification, storage)
    per sequenced end (sequencing, data entry)
  
  In those locations where costs are entered, the prompt
    will include (,cost) and you would respond
    with (being sure to include the decimal point):  
       400, 12.35
    for "400 bases at a cost of 12.35"
    If no cost is provided it defaults initially to 0.0
    and subsequently to the previous assigned value
  
Input the name of the output file
test_shotgun.out

Input one text line that describes this project (<=60 characters)
Mythical sequencing project
 
Input the amount of project DNA, in base pairs
3000000000

The next set of questions apply to finishing off a 
project once all initial phases have been completed.
 
Input the cost of obtaining an oligonucleotide primer
15.00
 
Input the amount of sequence recoverable from a
  primer walking sequencing step (,cost)
500,15.0
 
Input the size of clones that will be used to close
  gaps between Mesh/Contigs (,cost).  (NOTE.  if clones
  are to be selected by hybridization with a single
  probe then this size should be 1/2 the actual size
  of the clones as each clone will be randomly positioned
  with respect to that probe so, on average, only 1/2 of
  each clone will be new sequence.)
4000,5.00
 
Input the minimum overlap to merge a Mesh/Contig
AND for overlap in primer walking
50
 
 
DNA sequencing project phase:            1
 
Input the size of each clone (,cost per clone)
2000,0.10
Input the amount of sequence from each end(,cost)
500,15.0
Input the number of such clones to process
7000000
0 exits, any other number to add more fragments
0


Example test_shotgun output, from the session above

 
 Project: Mythical sequencing project
 
 Project constants
 
   Genome size (base pairs):                        3000000000.
   Scaling from internal subsequence:                  3000.00
   Required overlap for Mesh merge:                         50
   Bases of sequence from a custom primer:                 450
   Bases in clones used to close mesh gaps:               4000
   Cost to obtain a clone outside a Mesh/contig:          5.00
   Cost to make a custom primer:                         15.00
   Cost to sequence from a custom primer:                15.00
 
 
 Phase :                                                     1
   Number of clones added:                             7000000
   Size of each clone added:                              2000
   Bases of sequence from each end primer:                 500
   Number of genome units of coverage:                    4.67
   Number of genome units of sequence:                    2.33
 
   Unit cost: clone isolation and preparation             0.10
   Unit cost: end primer sequencing:                     15.00
 
   Total phase cost for clone preparation:           700000.00
   Total phase cost for end sequencing:           210000000.00
 
 Project status after phase:                                 1
 
 Summary of coverage fractions:
   included in clones:                               0.9924380
   sequenced:                                        0.9004280
     both directions:                                0.4903880
     multiply, one direction:                        0.1840810
     once:                                           0.2259590
   gaps:                                             0.0995720
     IntraMesh:                                      0.0920100
     InterMesh:                                      0.0075620
 
 Mesh information
   number (Meshes):                                      69000
   max size (clones):                                      403
   avg. size (clones):                                     101
   max size (bp):                                       178783
   avg. size (bp):                                       43363
 
 Contig information
   number (Contigs):                                   1740000
   max size (sequences):                                    63
   avg. size (sequences):                                    8
   max size (bp):                                         9468
   avg. size (bp):                                        1558
 
 IntraMesh Gaps (inside clones)
   fraction:                                         0.0920100
   number (internal gaps):                             1290000
   max size (bp):                                         1000
   avg. size: (bp)                                         213
 
 InterMesh Gaps (between clones)
   fraction:                                         0.0075620
   number (external gaps):                               66000
   max size (bp):                                          889
   avg. size (bp):                                         343
   minimum new clones to close:                          66000
 
 Regions sequenced more than once, one direction only
   fraction:                                         0.1840810
   number (regions):                                   2847000
   max size (bp):                                         1239
   avg. size (bp):                                         193
 
 Regions sequenced both directions
   fraction:                                         0.4903880
   number (regions):                                   3039000
   max size (bp):                                         3142
   avg. size (bp):                                         484
 
 Regions sequenced once only
   fraction:                                         0.2259590
   number (regions):                                   4422000
   max size (bp):                                          682
   avg. size (bp):                                         153
 
 Sequencing steps required to close
   IntraMesh gaps (X):                                 1461000
   InterMesh gaps (Y):                                   87000
   IntraMesh, one strand -> both strands(Z):           4638000
   close all, both strands (Z+2*(X+Y)):                7734000
 
 Total costs
   Shotgun clone isolation and preparation (A):      699900.00
   Shotgun clone end sequencing (B):              209970000.00
   Custom primer synthesis (C+D+E):               116010000.00
      IntraMesh gaps (C):                          43830000.00
      InterMesh gaps (D):                           2610000.00
      IntraMesh 1->2 strands (E):                  69570000.00
   Custom primer sequencing (F+G+H):              116010000.00
      IntraMesh gaps (F):                          43830000.00
      InterMesh gaps (G):                           2610000.00
      IntraMesh 1->2 strands (H):                  69570000.00
   InterMesh clone isolation (I):                    330000.00
 
   Grand Total cost (A+B+C+D+E+F+G+H+I):          443019900.00