Just the Facts: A Basic Introduction
to the Science Underlying NCBI Resources
SYSTEMATICS
AND MOLECULAR PHYLOGENETICS
Classifying Organisms
Have you ever noticed that when you see an insect
or a bird, there is real satisfaction in giving it
a name, and an uncomfortable uncertainty when you
can't? Along these same lines, consider the bewildering
number and variety of organisms that live, or have
lived, on this earth. If we did not know what to call
these organisms, how could we communicate ideas about
them, let alone the history of life? Thanks to taxonomy,
the field of science that classifies life into groups,
we can discuss just about any organism, from bacteria
to man.
Carolus Linnaeus pioneered the grouping of organisms
based on scientific names using Latin. His system
of giving an organism a scientific name of two parts,
sometimes more, is called binomial nomenclature, or
"two-word naming". His scheme was based
on physical similarities and differences, referred
to as characters. Today, taxonomic classification
is much more complex and takes into account cellular
types and organization, biochemical similarities,
and genetic similarities. Taxonomy is but one aspect
of a much larger field called systematics.
Taxonomic Classification
Taxonomic ranks approximate evolutionary distances
among groups of organisms. For example, species belonging
to two different superkingdoms are most distantly
related (their common ancestor diverged in the distant
past), with progressively more exclusive groups indicated
by phylum, class and so on, down to infraspecific
ranks, or ranks occurring within a species. Infraspecific
ranks, such as subspecies, varietas, and forma, denote
the closest evolutionary relationship. See the simplified
classification of humans below.
Taxonomists, scientists who classify living organisms,
define a species as any group of closely related organisms
that can produce fertile offspring. Two organisms
are more closely "related" as they approach
the level of species, that is, they have more genes
in common. The level of species can be further divided
into smaller segments. A population is the smallest
unit of a species and is made up of organisms of the
same species. Sometimes, a population will physically
alter over time to suit the needs of its environment.
This is called a cline and can make members of the
same species look different.
Taxonomic Classification of Man
Homo sapiens
Superkingdom: Eukaryota
Kingdom: Metazoa
Phylum: Chordata
Class: Mammalia
Order: Primates
Family: Hominidae
Genus: Homo
Species: sapiens
What Is Phylogenetic Systematics?
Carolus Linnaeus is also credited with pioneering
systematics, the field of science dealing with the
diversity of life and the relationship between life's
components. Systematics reaches beyond taxonomy to
elucidate new methods and theories that can be used
to classify species based on similarity of traits
and possible mechanisms of evolution, a change in
the gene pool of a population over time.
Phylogenetic systematics is the field of biology
that deals with identifying and understanding
the evolutionary relationships among the many different
kinds of life on earth, both living (extant) and dead
(extinct). Evolutionary theory states that similarity
among individuals or species is attributable to common
descent, or inheritance from a common ancestor. Thus,
the relationships established by phylogenetic systematics
often describe a species' evolutionary history and,
hence, its phylogeny, the historical relationships
among lineages or organisms or their parts, such as
their genes.
Charles Darwin was the first to recognize that the
systematic hierarchy represented a rough approximation
of evolutionary history. However, it was not until
the 1950s that the German entomologist Willi Hennig
proposed that systematics should reflect the known
evolutionary history of lineages as closely as possible,
an approach he called phylogenetic systematics. The
followers of Hennig were disparagingly referred to
as "cladists" by his opponents, because
of the emphasis on recognizing only monophyletic groups
(an ancestor plus all of its descendants), or clades. However,
the cladists quickly adopted that term as a helpful
label, and nowadays, cladistic approaches to systematics
are used routinely.
Understanding the Evolutionary Process
Genetic Variation: Changes in a Gene Pool
Evolution is not always discrete with clearly defined
boundaries that pinpoint the origin of a new species,
nor is it a steady continuum. Evolution requires genetic
variation which results from changes within a gene
pool, the genetic make-up of a specific population.
A gene pool is the combination of all the alleles
—alternative forms of a genetic locus—for
all traits that population may exhibit. Changes in
a gene pool can result from mutation—variation
within a particular gene—or from changes in
gene frequency—the proportion of an allele in
a given population.
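As a minimal sketch of what "gene frequency" means in practice, the
snippet below counts alleles at a single locus. The population, the
allele labels "A" and "a", and the function name are illustrative
assumptions, not data from any real study.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return the proportion of each allele in a population.

    `genotypes` holds one string per diploid individual; each character
    is one allele at a single locus, for example "Aa".
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

# Hypothetical population of five individuals (ten alleles in the pool).
population = ["AA", "Aa", "Aa", "aa", "AA"]
frequencies = allele_frequencies(population)
print(frequencies)
```

Here six of the ten alleles are "A" and four are "a", so the frequencies
are 0.6 and 0.4. A change in these proportions from one generation to the
next is a change in the gene pool.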
How Does Genetic Variation Occur?
Every organism possesses a genome that contains all
of the biological information needed to construct
and maintain a living example of that organism. The
biological information contained in a genome is encoded
in the nucleotide sequence of its DNA or RNA molecules
and is divided into discrete units called genes. The
information stored in a gene is read by proteins,
which attach to the genome and initiate a series of
reactions called gene expression.
Every time a cell divides, it must make a complete
copy of its genome, a process called DNA replication.
DNA replication must be extremely accurate to avoid
introducing mutations, or changes in the nucleotide
sequence of a short region of the genome. Inevitably,
some mutations do occur, usually in one of two ways:
either from errors in DNA replication or from damaging
effects of chemical agents or radiation that react
with DNA and change the structure of individual nucleotides.
Many of these mutations result in a change that has
no effect on the functioning of the genome, referred
to as silent mutations. Silent mutations include virtually
all changes that happen in the non-coding components
of genes and gene-related sequences.
Mutations in the coding regions of genes are much
more important. Here we must consider the importance
of the same mutation in a somatic cell compared with
a germ line cell. A somatic cell is any cell of an
organism other than a reproductive cell, such as a
sperm or egg cell. A germ cell line is any line of
cells that gives rise to gametes and is continuous
through the generations. Because a somatic cell does
not pass on copies of its genome to the next generation,
a somatic cell mutation is important only for the
organism in which it occurs and has no potential evolutionary
impact. In fact, most somatic mutations have no significant
effect because there are many other identical cells
in the same tissue.
On the other hand, mutations in germ cells can be
transmitted to the next generation and will then be
present in all of the cells of an individual who inherits
that mutation. Even so, mutations within germ line
cells may not change the phenotype of the organism
in any significant way. Those mutations that do have
an evolutionary effect can be divided into two categories,
loss-of-function mutations and gain-of-function mutations.
A loss-of-function mutation results in reduced or
abolished protein function. Gain-of-function mutations,
which are much less common, confer an abnormal activity
on a protein.
The randomness with which mutations can occur is an
important concept in biology and is a requirement
of the Darwinian view of evolution, which holds that
changes in the characteristics of an organism occur
by chance and are not influenced by the environment
in which the organism lives. Beneficial changes within
an organism are then positively selected for, whereas
harmful changes are negatively selected.
The Drivers of Evolution: Selection, Drift, and Founder
Effects
We just discussed that new alleles appear in a population
because of mutations that occur in the reproductive
cells of an organism. This means that many genes are
polymorphic, that is, two or more alleles for that
gene are present in a population. Each of these alleles
has its own allele or gene frequency, a measure of
how common an allele is in a population. Allele frequencies
vary over time because of two conditions, natural
selection and random drift.
Natural Selection
Natural selection is the process whereby one genotype,
the hereditary constitution of an individual, leaves
more offspring than another genotype because of superior
life attributes, termed fitness. Natural selection
acts on genetic variation by conferring a survival
advantage to those individuals harboring a particular
mutation that tends to favor a changing environmental
condition. These individuals then reproduce and pass
on this "new" gene, altering their gene
pool. Natural selection, therefore, decreases the
frequencies of alleles that reduce the fitness of
an organism and increases the frequency of alleles
that improve fitness.
"Natural Selection" is the principle by
which each slight variation, if useful, is preserved.
Charles Darwin
It is important to point out that natural selection
does not always represent progress, only adaptation
to changing surroundings. Evolution attributable
to natural selection is devoid of intent: something
does not evolve to better itself, only to adapt. Because
environments are always changing, what was once an
advantageous mutation can often become a liability
further down the evolutionary line.
Random Drift
The term random drift actually encompasses a number
of distinct processes, sometimes referred to as outcomes.
They include indiscriminate parent sampling, the founder
effect, and fluctuations in the rate of evolutionary
processes such as selection, migration, and mutation.
Parent sampling is the process of determining which
organisms of one generation will be the parents of
the next generation. Parent sampling may be discriminate,
that is, with regard to fitness differences, or indiscriminate,
without regard to fitness differences. Discriminate
parent sampling is generally considered natural selection,
whereas indiscriminate parent sampling is considered
random drift.
What Is Sampling?
Suppose a population of red and brown squirrels share
a habitat with a color blind predator. Although the
predator is color blind, the brown squirrels seem
to die in greater numbers than the red squirrels,
suggesting that the brown squirrels just seem to be
unlucky enough to come into contact with the predator
more often. As a result, the frequency of brown squirrels
in the next generation is reduced. More red squirrels
survive to reproduce, or are sampled, but it is without
regard to any differences in fitness between the two
groups. The physical differences of the groups do
not play a causal role in the differences in reproductive
success.
Now, let's say that the predator is not color blind
and can now see the red squirrels better than the
brown squirrels, resulting in a better survival rate
for the brown squirrels. This would be a case of discriminate
parent sampling, or natural selection.
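The squirrel example can be sketched as a small simulation. This is an
illustration under simplifying assumptions (a fixed population size,
non-overlapping generations, and hypothetical survival values), not a
model taken from the text. With equal survival values the sampling is
indiscriminate (drift); with unequal values it is discriminate
(selection).

```python
import random

def next_generation(red_fraction, size, rng, red_survival, brown_survival):
    """Sample the parents of the next generation and return the new
    fraction of red squirrels. Survival values must be non-negative and
    at least one color must be able to survive."""
    weight_red = red_fraction * red_survival
    weight_brown = (1.0 - red_fraction) * brown_survival
    p_red = weight_red / (weight_red + weight_brown)
    reds = sum(1 for _ in range(size) if rng.random() < p_red)
    return reds / size

def simulate(generations, size, seed, red_survival=1.0, brown_survival=1.0):
    """Return the red fraction in each generation, starting from 0.5."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    fraction = 0.5
    history = [fraction]
    for _ in range(generations):
        fraction = next_generation(fraction, size, rng,
                                   red_survival, brown_survival)
        history.append(fraction)
    return history

# Color-blind predator: frequencies wander by chance alone.
drift = simulate(generations=20, size=50, seed=1)

# Predator that spots red squirrels more easily (hypothetical values).
selection = simulate(generations=20, size=50, seed=1,
                     red_survival=0.8, brown_survival=1.0)
```

In the drift run the red fraction moves up or down with no consistent
direction; in the selection run the survival difference pushes it
downward on average.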
Founder Effect
Another important cause of genetic drift is the founder
effect, the difference between the gene pool of a
population as a whole and that of a newly isolated
population of the same species. The founder effect
occurs when populations are started from a small number
of pioneer individuals of one original population.
Because of small sample size, the new population could
have a much different genetic ratio than the original
population. An example of the founder effect would
be when a plant population results from a single seed.
Thus far, we have discussed natural selection and
random drift as events that occur in isolation from
one another. However, in most populations, the two
processes will be occurring at the same time. Furthermore,
there is great debate over whether, in particular
instances and in general, natural selection is more
prevalent than random drift.
Phylogenetic Trees: Presenting Evolutionary Relationships
Systematics describes the pattern of relationships
among taxa and is intended to help us understand the
history of all life. But history is not something
we can see—it has happened once and leaves only
clues as to the actual events. Scientists use these
clues to build hypotheses, or models, of life's history.
In phylogenetic studies, the most convenient way of
visually presenting evolutionary relationships among
a group of organisms is through illustrations called
phylogenetic trees.
Figure: the parts of a phylogenetic tree, defined below.
* Node: represents a taxonomic unit. This can be
either an existing species or an ancestor.
* Branch: defines the relationship between the taxa
in terms of descent and ancestry.
* Topology: the branching patterns of the tree.
* Branch length: represents the number of changes
that have occurred in the branch.
* Root: the common ancestor of all taxa.
* Distance scale: scale that represents the number
of differences between organisms or sequences.
* Clade: a group of two or more taxa or DNA sequences
that includes both their common ancestor and all of
their descendants.
* Operational Taxonomic Unit (OTU): taxonomic level
of sampling selected by the user to be used in a study,
such as individuals, populations, species, genera,
or bacterial strains.
A phylogenetic tree is composed of nodes, each representing
a taxonomic unit (species, populations, individuals),
and branches, which define the relationship between
the taxonomic units in terms of descent and ancestry.
Only one branch can connect any two adjacent nodes.
The branching pattern of the tree is called the topology,
and the branch length usually represents the number
of changes that have occurred in the branch. This
is called a scaled branch. Scaled trees are often
calibrated to represent the passage of time. Such
trees have a theoretical basis in the particular gene
or genes under analysis. Branches can also be unscaled,
which means that the branch length is not proportional
to the number of changes that has occurred, although
the actual number may be indicated numerically somewhere
on the branch. Phylogenetic trees may also be either
rooted or unrooted. In rooted trees, there is a particular
node, called the root, representing a common ancestor,
from which a unique path leads to any other node.
An unrooted tree only specifies the relationship among
species, without identifying a common ancestor, or
evolutionary path.
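One simple way to make these terms concrete is to write a rooted tree as
nested tuples. The representation and the taxon names A to E are
assumptions chosen for illustration; real software usually uses richer
structures that also store branch lengths.

```python
# A leaf (external node) is a string; an internal node is a tuple of its
# children. The outermost tuple is the root.
tree = (("A", "B"), ("C", ("D", "E")))

def leaves(node):
    """List the external nodes in left-to-right order."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]

def count_branches(node):
    """Each child hangs from its parent by exactly one branch."""
    if isinstance(node, str):
        return 0
    return sum(1 + count_branches(child) for child in node)

def path_from_root(node, target):
    """Return the child indices leading from the root to `target`,
    or None if `target` is not in the tree."""
    if node == target:
        return []
    if isinstance(node, str):
        return None
    for index, child in enumerate(node):
        rest = path_from_root(child, target)
        if rest is not None:
            return [index] + rest
    return None

print(leaves(tree))
print(count_branches(tree))
print(path_from_root(tree, "D"))
```

The nesting itself is the topology. Because every node has a single
parent, there is exactly one path from the root to any leaf, which is
what `path_from_root` recovers.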
Figure 1. Possible ways of drawing a tree.
Phylogenetic trees, a convenient way of representing
evolutionary relationships among a group of organisms,
can be drawn in various ways. Branches on phylogenetic
trees may be scaled (top panel) representing the amount
of evolutionary change, time, or both, when there
is a molecular clock, or they may be unscaled (middle
panel) and have no direct correspondence with either
time or amount of evolutionary change. Phylogenetic
trees may be rooted (top and middle panels) or unrooted
(bottom panels). In the case of unrooted trees, branching
relationships between taxa are specified by the way
they are connected to each other, but the position
of the common ancestor is not. For example, on an
unrooted tree with four species, there are five branches
(four external, one internal) on which the tree can
be rooted. Rooting on each of the five branches has
different implications for evolutionary relationships.
Text and figures adapted with permission from A. Vierstraete,
University of Ghent, Belgium.
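The branch count for an unrooted tree can be checked with a short
sketch. It assumes a fully bifurcating tree, in which every internal
node joins exactly three branches; the node labels "e" and "f" are
hypothetical names for the two internal nodes.

```python
def unrooted_branches(n_taxa):
    """Return (external, internal) branch counts for a fully
    bifurcating unrooted tree with n_taxa tips."""
    if n_taxa < 3:
        raise ValueError("the counts below assume at least three taxa")
    external = n_taxa        # one branch leads to each tip
    internal = n_taxa - 3    # branches that join two internal nodes
    return external, internal

# The four-taxon unrooted tree written out as a list of branches.
branches = [("A", "e"), ("B", "e"), ("C", "f"), ("D", "f"), ("e", "f")]
tips = {"A", "B", "C", "D"}
external = [b for b in branches if tips & set(b)]
internal = [b for b in branches if not tips & set(b)]

print(unrooted_branches(4), len(external), len(internal))
```

With four taxa there are four external branches and one internal
branch, so the root can be placed in five different positions, each
giving a different rooted tree.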
Methods of Phylogenetic Analysis
Two major groups of analyses exist to examine phylogenetic
relationships: phenetic methods and cladistic methods.
It is important to note that phenetics and cladistics
have had an uneasy relationship over the last 40 years
or so. Most of today's evolutionary biologists favor
cladistics, although a strictly cladistic approach
may result in counterintuitive results.
Phenetic Method of Analysis
Phenetics, also known as numerical taxonomy, involves
the use of various measures of overall similarity
for the ranking of species. There is no restriction
on the number or type of characters (data) that can
be used, although all data must be first converted
to a numerical value, without any character "weighting".
Each organism is then compared with every other for
all characters measured, and the number of similarities
(or differences) is calculated. The organisms are
then clustered in such a way that the most similar
are grouped close together and the more different
ones are linked more distantly. The taxonomic clusters,
called phenograms, that result from such an analysis
do not necessarily reflect genetic similarity or evolutionary
relatedness. The lack of evolutionary significance
in phenetics has meant that this system has had little
impact on animal classification, and as a consequence,
interest in and use of phenetics has been declining
in recent years.
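The phenetic procedure described above can be sketched as follows. The
organisms, the 0/1 character scores, and the choice of average-linkage
clustering are all assumptions made for illustration; numerical
taxonomy has used many different similarity measures and clustering
rules.

```python
from itertools import combinations

# Hypothetical character matrix: each organism is scored 0 or 1 for
# five unweighted characters.
characters = {
    "sparrow": [1, 1, 0, 1, 0],
    "robin":   [1, 1, 0, 1, 1],
    "bat":     [1, 0, 1, 0, 1],
    "mouse":   [0, 0, 1, 0, 1],
}

def differences(a, b):
    """Count the characters on which two organisms differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster(data):
    """Repeatedly join the two most similar groups (average linkage)
    and return the result as nested tuples of names."""
    clusters = {(name,): name for name in data}  # members -> nested label

    def mean_distance(pair):
        left, right = pair
        total = sum(differences(data[a], data[b])
                    for a in left for b in right)
        return total / (len(left) * len(right))

    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=mean_distance)
        merged = (clusters.pop(left), clusters.pop(right))
        clusters[left + right] = merged
    return next(iter(clusters.values()))

print(cluster(characters))
```

The output groups organisms purely by overall similarity. Nothing in
the calculation asks whether a shared character was inherited from a
common ancestor, which is the limitation the paragraph above describes.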
Cladistic Method of Analysis
An alternative approach to diagramming relationships
between taxa is called cladistics. The basic assumption
behind cladistics is that members of a group share
a common evolutionary history. Thus, they are more
closely related to one another than they are to other
groups of organisms. Related groups of organisms are
recognized because they share a set of unique features
(apomorphies) that were not present in distant ancestors
but which are shared by most or all of the organisms
within the group. These shared derived characteristics
are called synapomorphies. Therefore, in contrast
to phenetics, cladistics groupings do not depend on
whether organisms share physical traits but depend
on their evolutionary relationships. Indeed, in cladistic
analyses two organisms may share numerous characteristics
but still be considered members of different groups.
Cladistic analysis entails a number of assumptions.
For example, species are assumed to arise primarily
by bifurcation, or separation, of the ancestral lineage;
species are often considered to become extinct upon
hybridization (crossbreeding); and hybridization is
assumed to be rare or absent. In addition, cladistic
groupings must possess the following characteristics:
all species in a grouping must share a common ancestor
and all species derived from a common ancestor must
be included in the taxon. The application of these
requirements results in the following terms being
used to describe the different ways in which groupings
can be made:
* A monophyletic grouping is one in which all species
share a common ancestor, and all species derived from
that common ancestor are included. This is the only
form of grouping accepted as valid by cladists.
* A paraphyletic grouping is one in which all species
share a common ancestor, but not all species derived
from that common ancestor are included.
* A polyphyletic grouping is one in which species
that do not share an immediate common ancestor are
lumped together, while excluding other members that
would link them.
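Whether a grouping is monophyletic can be tested mechanically once a
rooted tree is given. The sketch below assumes a nested-tuple tree and
an illustrative great-ape topology; it only distinguishes monophyletic
groups from all others and does not try to separate paraphyletic from
polyphyletic groupings.

```python
# Rooted tree as nested tuples; leaves are strings. The topology is an
# illustrative assumption.
tree = ((("human", "chimpanzee"), "gorilla"), "orangutan")

def leaf_set(node):
    """Return the set of species descended from a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))

def clades(node):
    """Yield the leaf set of every node in the tree."""
    yield leaf_set(node)
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)

def is_monophyletic(tree, group):
    """A group is monophyletic if it is exactly the set of species
    descended from some node of the tree."""
    return frozenset(group) in set(clades(tree))

print(is_monophyletic(tree, {"human", "chimpanzee", "gorilla"}))
print(is_monophyletic(tree, {"gorilla", "orangutan"}))
```

The first group is a clade on this tree. The second is not: the common
ancestor of gorilla and orangutan is also an ancestor of human and
chimpanzee, which the group leaves out.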
The Origins of Molecular Phylogenetics
Macromolecular data, meaning gene (DNA) and protein
sequences, are accumulating at an increasing rate
because of recent advances in molecular biology. For
the evolutionary biologist, the rapid accumulation
of sequence data from whole genomes has been a major
advance, because the very nature of DNA allows it
to be used as a "document" of evolutionary
history. Comparisons of the DNA sequences of various
genes between different organisms can tell a scientist
a lot about the relationships of organisms that cannot
otherwise be inferred from morphology, or an organism's
outer form and inner structure. Because genomes evolve
by the gradual accumulation of mutations, the amount
of nucleotide sequence difference between a pair of
genomes from different organisms should indicate how
recently those two genomes shared a common ancestor.
Two genomes that diverged in the recent past should
have fewer differences than two genomes whose common
ancestor is more ancient. Therefore, by comparing
different genomes with each other, it should be possible
to derive evolutionary relationships between them,
the major objective of molecular phylogenetics.
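The idea that fewer differences imply a more recent common ancestor can
be sketched by counting mismatches between aligned sequences. The three
short sequences below are invented for illustration, and the function
assumes the sequences are already aligned and of equal length.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(a != b for a, b in zip(seq_a, seq_b))
    return diffs / len(seq_a)

# Hypothetical aligned sequences from three organisms.
x = "ACGTACGTAC"
y = "ACGTACGTAT"
z = "ACTTACGAAT"

print(p_distance(x, y))  # 1 difference in 10 positions
print(p_distance(x, z))  # 3 differences in 10 positions
```

On this toy data x and y differ less than x and z, so under the
reasoning above x and y would be inferred to share the more recent
common ancestor.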
Molecular phylogenetics attempts to determine the
rates and patterns of change occurring in DNA and
proteins and to reconstruct the evolutionary history
of genes and organisms. Two general approaches may
be taken to obtain this information. In the first
approach, scientists use DNA to study the evolution
of an organism. In the second approach, different
organisms are used to study the evolution of DNA.
Whatever the approach, the general goal is to infer
process from pattern: the processes of organismal
evolution deduced from patterns of DNA variation and
processes of molecular evolution inferred from the
patterns of variations in the DNA itself.
Molecular Phylogenetic Analysis: Fundamental Elements
Nucleotide and protein sequences can also be used
to generate trees. DNA, RNA, and protein sequences
can be considered as phenotypic traits. The sequences
depict the relationship of genes and usually of the
organism in which the genes are found.
As we just discussed, macromolecules, especially
gene and protein sequences, have surpassed morphological
and other organismal characters as the most popular
forms of data for phylogenetic analyses. Therefore,
this next section will concentrate only on molecular
data.
It is important to point out that a single, all-purpose
recipe does not exist for phylogenetic analysis of
molecular data. Although numerous algorithms, procedures,
and computer programs have been developed, their reliability
and practicality are, in all cases, dependent upon
the size and structure of the dataset under analysis.
The merits and shortfalls of these various methods
are subject to much scientific debate, because the
danger of generating incorrect results is greater
in computational molecular phylogenetics than in many
other fields of science. Occasionally, the limiting
factor in such analyses is not so much the computational
method used, but the users' understanding of what
the method is actually doing with the data. Therefore,
the goal of this section is to demonstrate to the
reader that practical analysis should be thought of
both as a search for a correct model (analysis) as
well as a search for the correct tree (outcome).
Phylogenetic tree-building models presume particular
evolutionary models. For any given set of data, these
models may be violated because of various occurrences,
such as the transfer of genetic material between organisms.
Therefore, when interpreting a given analysis, a person
should always consider the model used and entertain
possible explanations for the results obtained. For
example, models used in molecular phylogenetic analysis
methods make "default" assumptions, including:
* The sequence is correct and originates from the
specified source.
* The sequences are homologous—all descended
in some way from a shared ancestral sequence.
* Each position in a sequence alignment is homologous
with every other in that alignment.
* Each of the multiple sequences included in a common
analysis has a common phylogenetic history with the
other sequences.
* The sampling of taxa is adequate to resolve the
problem under study.
* Sequence variation among the samples is representative
of the broader group.
* The sequence variability in the sample contains
phylogenetic signal adequate to resolve the problem
under study.
The Four Steps of Phylogenetic Analysis
A straightforward phylogenetic analysis consists
of four steps:
1. Alignment—building the data model and extracting
a dataset.
2. Determining the substitution model—consider
sequence variation.
3. Tree building.
4. Tree evaluation.
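Step 2 can be illustrated with one of the simplest substitution models.
The text above does not name a model; the Jukes-Cantor correction used
here is an assumption chosen because it is compact. It supposes that
all nucleotide changes are equally likely and corrects the observed
proportion of differences for repeated changes at the same site.

```python
import math

def jukes_cantor_distance(p):
    """Estimate substitutions per site from the observed proportion of
    differing sites p, using d = -(3/4) * ln(1 - (4/3) * p)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

for observed in (0.0, 0.1, 0.3, 0.5):
    print(observed, round(jukes_cantor_distance(observed), 4))
```

The corrected distance is always at least as large as the observed
proportion, and the gap widens as sequences become more different,
because more of the changes are hidden by later changes at the same
position.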
Tree Building: Key Features of DNA-based Phylogenetic
Trees
Studies of gene and protein evolution often involve
the comparison of homologs, sequences that have common
origins but may or may not have common activity. Sequences
that share an arbitrary level of similarity determined
by alignment of matching bases are homologous. These
sequences are inherited from a common ancestor that
possessed similar structure, although the ancestor
may be difficult to determine because it has been
modified through descent.
Homologs are most commonly defined as orthologs,
paralogs, or xenologs.
Orthologs are homologs produced by speciation—they
represent genes derived from a common ancestor that
diverged because of divergence of the organism. Orthologs
tend to have similar function.
Paralogs are homologs produced by gene duplication
and represent genes derived from a common ancestral
gene that duplicated within an organism and then diverged.
Paralogs tend to have different functions.
Xenologs are homologs resulting from the horizontal
transfer of a gene between two organisms. The function
of xenologs can be variable, depending on how significant
the change in context was for the horizontally moving
gene. In general, though, the function tends to be
similar.
A typical gene-based phylogenetic tree is depicted
below. This tree shows the relationship between four
homologous genes: A, B, C, and D. The topology of
this tree consists of four external nodes (A, B, C,
and D), each one representing one of the four genes,
and two internal nodes (e and f) representing ancestral
genes. The branch lengths indicate the degree of evolutionary
differences between the genes. This particular tree
is unrooted—it is only an illustration of the
relationships between genes A, B, C, and D and does
not signify anything about the series of evolutionary
events that led to these genes.
Image depicting a typical gene-based phylogenetic
tree.
The second panel, below, depicts three rooted trees
that can be drawn from the unrooted tree shown above,
each representing the different evolutionary pathways
possible between these four genes. A rooted tree is
often referred to as an inferred tree. This is to
emphasize that this type of illustration depicts only
the series of evolutionary events that are inferred
from the data under study and may not be the same
as the true tree or the tree that depicts the actual
series of evolutionary events that occurred.
Three examples of rooted trees which can be drawn
from the unrooted tree described above.
To distinguish between the pathways, the phylogenetic
analysis must include at least one outgroup, a gene
that is less closely related to A, B, C, and D than
these genes are to each other (panel below). Outgroups
enable the root of the tree to be located and the
correct evolutionary pathway to be identified. Let's
say that the four homologous genes used in the previous
tree examples come from human, chimpanzee, gorilla,
and orangutan. In this case, an outgroup could be
a gene from another primate, such as baboon, whose
lineage is known to have branched off before the four
species above diverged from their common ancestor.
A diagram of a phylogenetic tree including an outgroup.
Gene Trees Versus Species Trees—Why Are They
Different?
It is assumed that a gene tree, because it is based
on molecular data, will be a more accurate and less
ambiguous representation of the species tree than
that obtainable by morphological comparisons. This
may indeed be the case, but it does not mean that
the gene tree is the same as the species tree. For
this to be true, the internal nodes in both trees
would have to be precisely equivalent, and they are
not. An internal node in a gene tree indicates the
divergence of an ancestral gene into two genes with
different DNA sequences, usually resulting from a
mutation of one sort or another. An internal node
in a species tree represents what is called a speciation
event, whereby the population of the ancestral species
splits into two groups that are no longer able to
interbreed. These two events, mutation and speciation,
do not always occur at the same time.
Molecular Phylogenetics Terminology
* Monophyletic: two or more DNA sequences that are
derived from a single common ancestral DNA sequence.
* Clade: a group of monophyletic DNA sequences that
make up all of the sequences included in the analysis
that are descended from a particular common ancestral
sequence.
* Parsimony: an approach that decides between different
tree topologies by identifying the one that involves
the shortest evolutionary pathway. This is the pathway
that requires the smallest number of nucleotide changes
to go from the ancestral sequence, at the root of
the tree, to all of the present-day sequences that
have been compared.
* Molecular Clock Hypothesis: states that nucleotide
substitutions, or amino acid substitutions if proteins
are being compared, occur at a constant rate, that
is, the degree of difference between two sequences
can be used to assign a date to the time at which
their ancestral sequence diverged. The rate of molecular
change differs among groups of organisms, among genes,
and even among different parts of the same gene. Furthermore,
molecular clocks require calibration with fossils
to determine timing of origin of clades, and thus
their accuracy is crucially dependent on the fossil
record, or lack thereof, for the groups under study.
Fossil DNA older than about 25,000–50,000 years
is virtually empty of phylogenetic signal except in
rare instances, and therefore traditional morphological
studies of extinct and extant organisms remain a crucial
component of phylogenetic analysis.
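The parsimony criterion defined above can be sketched with Fitch's
counting method, one standard way of finding the minimum number of
changes a tree requires. The alignment and the two candidate trees are
invented for illustration, and the sketch assumes rooted, strictly
bifurcating trees written as nested tuples.

```python
def fitch(node, states):
    """Return (possible ancestral states, minimum number of changes)
    for one alignment column on a rooted bifurcating tree."""
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:
        return shared, left_cost + right_cost
    # No state in common: at least one change is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total

# Hypothetical aligned sequences for four genes.
alignment = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "ACT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(parsimony_score(tree_1, alignment))  # 2 changes
print(parsimony_score(tree_2, alignment))  # 4 changes
```

The first topology explains the data with fewer nucleotide changes, so
parsimony would prefer it over the second.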
Systematics and NCBI
The Taxonomy Project
The purpose of NCBI's Taxonomy Project is to build
a consistent phylogenetic taxonomy for the NCBI sequence
databases. The Taxonomy Database contains the names
and lineages of every organism represented by at least
one nucleotide or protein sequence in the NCBI genetic
databases. As of February 2003, this total is over
250,000 taxa. For current information, visit NCBI's
Taxonomy Statistics Web page. The database is recognized
as the standard reference by the international sequence
database collaboration (GenBank, EMBL, DDBJ, and Swiss-Prot).
The Taxonomy Browser is an NCBI-derived search tool
that allows an individual to search the Taxonomy database.
Using the browser, information may be retrieved on
available nucleotide, protein, and structure records
for a particular species or higher taxon. The Taxonomy
Browser can be used to view the taxonomic position
or retrieve sequence and structural data for a particular
organism or group of organisms. Searches may be made
on the basis of whole, partial, or phonetically spelled
organism names, and direct links to organisms commonly
used in biological research are also provided. The
Entrez Taxonomy system has the ability to display
custom taxonomic trees representing user-defined subsets
of the full NCBI taxonomy.
TaxPlot, another component of the Taxonomy project,
is a research tool for conducting three-way comparisons
of different genomes. Comparisons are based on the
sequences of the proteins encoded in that organism's
genome. To use TaxPlot, one selects a reference genome
to which two other genomes will be compared. The TaxPlot
tool then uses a pre-computed BLAST result to plot
a point for each protein predicted to be included
in the reference genome.
BLAST: Detecting New Sequence Similarities
Currently, the characters most widely used for phylogenetic
analysis are DNA and protein sequences. DNA sequences
may be compared directly, or for those regions that
code for a known protein, translated into protein
sequences. Creating phylogenies from nucleotide or
amino acid sequences first requires aligning the bases
so that the differences between the sequences being
studied are easier to spot.
The introduction of NCBI's BLAST, or the Basic Local
Alignment Search Tool, in 1990 made it easier to rapidly
scan huge databases for overt homologies, or sequence
similarity, and to statistically evaluate the resulting
matches. BLAST works by comparing a user's unknown
sequence against the database of all known sequences
to determine likely matches. In a matter of seconds,
the BLAST server compares the user's sequence with
up to a million known sequences and determines the
closest matches.
Specialized BLASTs are also available for human,
mouse, microbial, and many other genomes. A single
BLAST search can compare a sequence of interest to
all other sequences stored in GenBank, NCBI's nucleotide
sequence database. In this step, a researcher has
the option of limiting the search to a specific taxonomic
group. If the full scientific name or relationship
of species of interest is not known, the user can
search for such details using NCBI's Taxonomy Browser,
which provides direct links to some of the organisms
commonly used in molecular research projects, such
as the zebrafish, fruit fly, baker's yeast, nematode,
and many more.
BLAST next tallies the differences between sequences
and assigns a "score" based on sequence
similarity. The scores assigned in a BLAST search
have a well-defined statistical interpretation, making
real sequence matches easier to distinguish from random
background hits. This is because BLAST uses a special
algorithm, or computational procedure, that seeks local
as opposed to global alignments and is therefore able
to detect relationships among sequences that share
only isolated regions of similarity. Taxonomy-related
BLAST results are presented in three formats based
on the information found in NCBI's Taxonomy database.
The Organism Report sorts BLAST comparisons, also
called hits, by species such that all hits to a given
organism are grouped together. The Lineage Report
provides a view of the relationships between the organisms
based on NCBI's Taxonomy database. The Taxonomy Report
provides in-depth details on the relationship between
all the organisms in the BLAST hit list.
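The idea of a local alignment, as opposed to a global one, can be illustrated with the classic Smith-Waterman dynamic-programming method. Note that this is not BLAST itself: BLAST uses a faster heuristic to approximate this kind of search. The scoring values below (match, mismatch, gap) are arbitrary assumptions chosen for the example.

```python
def local_alignment_score(s, t, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of s and t (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # The floor of 0 lets an alignment restart anywhere, which is
            # what makes the alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two hypothetical sequences that share only one isolated region.
seq_a = "TTTTGATTACATTTT"
seq_b = "CCCCGATTACACCCC"
print(local_alignment_score(seq_a, seq_b))
```

Because the score never drops below zero, the dissimilar flanking regions do not penalize the shared region, so sequences with only an isolated stretch of similarity can still produce a high-scoring hit.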
COGs: Phylogenetic Classification of Proteins
The database of Clusters of Orthologous Groups of
proteins (COGs) represents an attempt at the phylogenetic
classification of proteins from complete genomes,
a scheme that indicates the evolutionary relationships
between organisms. Each COG includes proteins
that are thought to be orthologous, or connected through
vertical evolutionary descent. COGs may be used to
detect similarities and differences between species,
to identify protein families and predict new
protein functions, and to point to potential drug
targets in disease-causing species. The database is
accompanied by the COGnitor program, which assigns
new proteins, typically from newly sequenced genomes,
to pre-existing COGs. A Web page containing additional
structural and functional information is now associated
with each COG. These hyperlinked information pages
include: systematic classification of the COG members
under the different classification systems; indications
of which COG members (if any) have been characterized
genetically and biochemically; information on the
domain architecture of the proteins constituting the
COG and the three-dimensional structure of the domains
if known or predictable; a succinct summary of the
common structural and functional features of the COG
members, as well as peculiarities of individual members;
and key references.
HomoloGene
HomoloGene is a database of both curated and calculated
orthologs and homologs for the organisms represented
in NCBI's UniGene database. Curated orthologs include
gene pairs from the Mouse Genome Database (MGD) at
the Jackson Laboratory, the Zebrafish Information
Network (ZFIN) database at the University of Oregon, and from
published reports. Computed orthologs and homologs
are identified from BLAST nucleotide sequence comparisons
between all UniGene clusters for each pair of organisms.
HomoloGene also contains a set of triplet clusters
in which orthologous clusters in two organisms are
both orthologous to the same cluster in a third organism.
HomoloGene can be searched via the Entrez retrieval
system.
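The notion of a triplet cluster can be made concrete with a small sketch. It assumes ortholog relationships are available as pairs of (organism, cluster ID) tuples; the cluster identifiers below are made up, and the function name `find_triplets` is hypothetical. This is a brute-force illustration for small inputs, not a description of how HomoloGene is actually computed.

```python
from itertools import combinations


def find_triplets(ortholog_pairs):
    """Return triplets of clusters, one per organism, that are all
    pairwise orthologous to each other."""
    # Store each pair as an unordered set so (a, b) and (b, a) both match.
    pairs = {frozenset(p) for p in ortholog_pairs}
    clusters = sorted({c for p in pairs for c in p})
    triplets = []
    for a, b, c in combinations(clusters, 3):
        organisms = {a[0], b[0], c[0]}
        if len(organisms) == 3 and all(
            frozenset(q) in pairs for q in ((a, b), (a, c), (b, c))
        ):
            triplets.append((a, b, c))
    return triplets


# Hypothetical ortholog pairs between clusters of three organisms.
example_pairs = [
    (("human", "Hs.1"), ("mouse", "Mm.7")),
    (("human", "Hs.1"), ("zebrafish", "Dr.3")),
    (("mouse", "Mm.7"), ("zebrafish", "Dr.3")),
    (("human", "Hs.1"), ("mouse", "Mm.9")),  # no matching zebrafish link
]
triplets = find_triplets(example_pairs)
print(triplets)
```

Only the first three pairs close a triangle across three organisms; the fourth pair is left out because its mouse cluster has no ortholog link to the zebrafish cluster.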
UniGene is a system for automatically partitioning
GenBank sequences into a non-redundant set of gene-oriented
clusters. Each UniGene cluster contains sequences
that represent a unique gene, as well as related information,
such as the tissue types in which the gene has been
expressed and map location.
Entrez Genome
The whole genomes of over 1,200 organisms can be
found in Entrez Genomes. The genomes represent both
completely sequenced organisms and those for which
sequencing is in progress. All three main domains
of life—bacteria, archaea, and eukaryotes—
are represented, as well as many viruses, viroids,
plasmids, and eukaryotic organelles. Data can be accessed
hierarchically starting from either an alphabetical
listing or a phylogenetic tree for complete genomes
in each of six principal taxonomic groups. One can
follow the hierarchy to a variety of graphical overviews,
including that of the whole genome of a single organism,
a single chromosome, or even a single gene. At each
level, one can access multiple views of the data,
pre-computed summaries, and links to analyses appropriate
for that level. In addition, any gene product (protein)
that is a member of a COG is linked to the COGs database.
A summary of COG functional groups is also presented
in tabular and graphical formats at the genome level.
For complete microbial genomes, pre-computed BLAST
neighbors for protein sequences, including their taxonomic
distribution and links to 3D structures, are given
in TaxTables and PDBTables, respectively. Pairwise
sequence alignments are presented graphically and
linked to NCBI's Cn3D macromolecular viewer that
allows the interactive display of three-dimensional
structures and sequence alignments.
PDBeast: Taxonomy in MMDB
NCBI's Structure Group, in collaboration with NCBI
taxonomists, has undertaken taxonomy annotation for
the three-dimensional structure data stored in the
Molecular Modeling Database (MMDB). A semi-automated
approach has been implemented in which a human expert
checks, corrects, and validates automatic taxonomic
assignments in MMDB. The PDBeast software tool was
developed by NCBI for this purpose. It pulls text
descriptions of "Source Organisms" from
either the original entries or user-specified information
and looks for matches in the NCBI Taxonomy database
to record taxonomy assignments.
The Molecular Modeling Database (MMDB) is a compilation
of three-dimensional structures of biomolecules obtained
from the Protein Data Bank (PDB). The PDB, managed
and maintained by the Research Collaboratory for Structural
Bioinformatics, is a collection of all publicly available
three-dimensional structures of proteins, nucleic
acids, carbohydrates, and a variety of other complexes
experimentally determined by X-ray crystallography
and NMR. The difference between the two databases
is that MMDB records reorganize and validate the information
stored in the PDB in a way that enables cross-referencing
between the chemistry and the three-dimensional structure
of macromolecules. By integrating chemical, sequence,
and structure information, MMDB is designed to serve
as a resource for structure-based homology modeling
and protein structure prediction.
The Importance of Molecular Phylogenetics
The field of molecular phylogenetics has grown, both
in size and in importance, since its inception in
the early 1990s, owing mostly to advances in
molecular biology and more rigorous methods for phylogenetic
tree building. The importance of phylogenetics has
also been greatly enhanced by the successful application
of tree reconstruction, as well as other phylogenetic
techniques, to more diverse and perplexing issues
in biology. Today, a survey of the scientific literature
will show that molecular biology, genetics, evolution,
development, behavior, epidemiology, ecology, systematics,
conservation biology, and forensics are but a few
examples of the many disparate fields conceptually
united by the methods and theories of molecular phylogenetics.
Phylogenies are used essentially the same way in all
of these fields, by drawing inferences either from
the structure of the tree or from the way the character
states map onto the tree. Biologists can then use
these clues to build hypotheses and models of important
events in history. Broadly speaking, the relationships
established by phylogenetic trees often describe a
species' evolutionary history and, hence, its phylogeny—the
historical relationships among lineages or organisms
or their parts, such as their genes. Phylogenies may
be thought of as a natural and meaningful way to order
data, with an enormous amount of evolutionary information
contained within their branches. Scientists working
in these different areas can then use these phylogenies
to study and elucidate the biological processes occurring
at many levels of life's hierarchy.
Revised: April 1, 2004.