Test_shotgun is a Fortran program which simulates a shotgun sequencing project. It is a handy tool for estimating not only how much work must be done (or remains to be done) for a given project, but also how much it will cost. It simulates projects up to 1 Mbp directly in memory. For projects larger than that it simulates it in 1 Mbp region and then scales the results. It is known to compile and run on OpenVMS, Digital Unix, and IRIX. Probably it will work on other platforms, or at most, require only minimal code changes. The source code is included in: http://seqaxp.bio.caltech.edu/pub/software/shotgun.zip If you build a Win32 or Macintosh version please let me have a copy of the binary to serve from my site. Please report any bugs that you encounter.
Below you will find an example test_shotgun session and the output it produced. Note that if you run the exact same session repeatedly you will get slightly different results (since the program uses a pseudorandom number generator), which is similar to what would happen in reality (where the positions of clones that are recovered are approximately random.)
The human genome project is the biggest sequencing project around, and I always wondered what it was really going to cost. Using test_shotgun, it was a simple enough matter to run a series of cost estimates with varying sequencing strategies and process costs. The results seemed fairly interesting for policy reasons, if not scientific ones, so I wrote it up as a short technical comment and submitted it to Nature (who rejected it without review) and Science (who rejected it after a review, and eight months of sitting on it.) The Science review was, well, I'll let you all judge for yourselves, but even sending them a rebuttal to the review did not change their minds. So, for your enjoyment and edification, I'm "publishing" the Science version here. The .zip distribution mentioned above includes the raw datafiles used to construct the figure in this paper, as well as the routines used to generate those datafiles.
David Mathog
mathog@seqaxp.bio.caltech.edu
Manager, sequence analysis facility
Biology division 156-29
Caltech
1200 E. California Blvd.
Pasadena, CA 91125
Example test_shotgun session
(Responses to prompts are in bold.)
Test_Shotgun
This program tests various cloning strategies on a
project. Internally, it models this project with an area
of 1000000 bases of DNA. If your project is larger
than this the results will be scaled to match. If your
project is smaller then only part of this region will
be utilized.
The total number of fragments (after scaling) that
can be modeled is 50000,(which corresponds to
a maximum density of one clone start every 40 base pairs.
This program uses the term Mesh in addition to Contig. Usually
Contigs refer to overlapping stretches of sequence.
During the simulation the program can tell which
clones overlap even if there is not enough common
sequence to form a classical Contig. In order to avoid
confusion, each such region of overlapping clones
(not usually the same as overlapping sequences) is called
a Mesh.
You may vary at each phase:
the number of clones added
the size (unit cost) of these clones
the number of bases (cost) for each end sequence
After all phases have been completed the program will
estimate the costs and steps needed to finish the project.
The project is finished when every region has been
sequenced at least once in each direction and the minimum
end overlaps are satisfied.
Note that Mesh/Contig numbers only scale if the number
is more than one, and the scale value is greater than 1
That is, multiple Mesh/Contigs are found and the project is
bigger than the internal work area.
You may optionally specify costs:
per clone (isolation, purification, storage)
per sequenced end (sequencing, data entry)
In those locations where costs are entered, the prompt
will include (,cost) and you would respond
with (being sure to include the decimal point):
400, 12.35
for "400 bases at a cost of 12.35"
If no cost is provided it defaults initially to 0.0
and subsequently to the previous assigned value
Input the name of the output file
test_shotgun.out
Input one text line that describes this project (<=60 characters)
Mythical sequencing project
Input the amount of project DNA, in base pairs
3000000000
The next set of questions apply to finishing off a
project once all initial phases have been completed.
Input the cost of obtaining an oligonucleotide primer
15.00
Input the amount of sequence recoverable from a
primer walking sequencing step (,cost)
500,15.0
Input the size of clones that will be used to close
gaps between Mesh/Contigs (,cost). (NOTE. if clones
are to be selected by hybridization with a single
probe then this size should be 1/2 the actual size
of the clones as each clone will be randomly positioned
with respect to that probe so, on average, only 1/2 of
each clone will be new sequence.)
4000,5.00
Input the minimum overlap to merge a Mesh/Contig
AND for overlap in primer walking
50
DNA sequencing project phase: 1
Input the size of each clone (,cost per clone)
2000,0.10
Input the amount of sequence from each end(,cost)
500,15.0
Input the number of such clones to process
7000000
0 exits, any other number to add more fragments
0
Project: Mythical sequencing project
Project constants
Genome size (base pairs): 3000000000.
Scaling from internal subsequence: 3000.00
Required overlap for Mesh merge: 50
Bases of sequence from a custom primer: 450
Bases in clones used to close mesh gaps: 4000
Cost to obtain a clone outside a Mesh/contig: 5.00
Cost to make a custom primer: 15.00
Cost to sequence from a custom primer: 15.00
Phase : 1
Number of clones added: 7000000
Size of each clone added: 2000
Bases of sequence from each end primer: 500
Number of genome units of coverage: 4.67
Number of genome units of sequence: 2.33
Unit cost: clone isolation and preparation 0.10
Unit cost: end primer sequencing: 15.00
Total phase cost for clone preparation: 700000.00
Total phase cost for end sequencing: 210000000.00
Project status after phase: 1
Summary of coverage fractions:
included in clones: 0.9924380
sequenced: 0.9004280
both directions: 0.4903880
multiply, one direction: 0.1840810
once: 0.2259590
gaps: 0.0995720
IntraMesh: 0.0920100
InterMesh: 0.0075620
Mesh information
number (Meshes): 69000
max size (clones): 403
avg. size (clones): 101
max size (bp): 178783
avg. size (bp): 43363
Contig information
number (Contigs): 1740000
max size (sequences): 63
avg. size (sequences): 8
max size (bp): 9468
avg. size (bp): 1558
IntraMesh Gaps (inside clones)
fraction: 0.0920100
number (internal gaps): 1290000
max size (bp): 1000
avg. size: (bp) 213
InterMesh Gaps (between clones)
fraction: 0.0075620
number (external gaps): 66000
max size (bp): 889
avg. size (bp): 343
minimum new clones to close: 66000
Regions sequenced more than once, one direction only
fraction: 0.1840810
number (regions): 2847000
max size (bp): 1239
avg. size (bp): 193
Regions sequenced both directions
fraction: 0.4903880
number (regions): 3039000
max size (bp): 3142
avg. size (bp): 484
Regions sequenced once only
fraction: 0.2259590
number (regions): 4422000
max size (bp): 682
avg. size (bp): 153
Sequencing steps required to close
IntraMesh gaps (X): 1461000
InterMesh gaps (Y): 87000
IntraMesh, one strand -> both strands(Z): 4638000
close all, both strands (Z+2*(X+Y)): 7734000
Total costs
Shotgun clone isolation and preparation (A): 699900.00
Shotgun clone end sequencing (B): 209970000.00
Custom primer synthesis (C+D+E): 116010000.00
IntraMesh gaps (C): 43830000.00
InterMesh gaps (D): 2610000.00
IntraMesh 1->2 strands (E): 69570000.00
Custom primer sequencing (F+G+H): 116010000.00
IntraMesh gaps (F): 43830000.00
InterMesh gaps (G): 2610000.00
IntraMesh 1->2 strands (H): 69570000.00
InterMesh clone isolation (I): 330000.00
Grand Total cost (A+B+C+D+E+F+G+H+I): 443019900.00