UPGMA (Unweighted Pair Group Method with Arithmetic mean) is a simple agglomerative or bottom-up data clustering method used in bioinformatics for the creation of phylogenetic trees. UPGMA assumes a constant rate of evolution (molecular clock hypothesis), and is not a well-regarded method for inferring phylogenetic trees unless this assumption has been tested and justified for the data set being used. UPGMA was initially designed for use in protein electrophoresis studies, but is currently most often used to produce guide trees for more sophisticated phylogenetic reconstruction algorithms.
The algorithm uses the structure present in a pairwise distance matrix to construct a rooted tree (dendrogram).
At each step, the two nearest clusters are combined into a higher-level cluster. The distance between any two clusters A and B is taken to be the average of all distances between pairs of objects "a" in A and "b" in B.
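The merging rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the function name `upgma`, the nested-tuple tree format, and the input layout (a dict keyed by `frozenset` pairs of labels) are assumptions made for this example. The size-weighted update used here is equivalent to averaging over all pairs of objects in the two clusters.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a rooted tree from pairwise distances by average linkage.

    names: list of leaf labels (hypothetical input format).
    dist:  dict mapping frozenset({a, b}) -> distance for every pair of labels.
    Returns (tree, height): tree is a nested tuple of labels, height is the
    distance from the root down to the leaves.
    """
    # Each cluster id maps to (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {
        frozenset((i, j)): dist[frozenset((names[i], names[j]))]
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    height = 0.0
    while len(clusters) > 1:
        # Pick the two nearest clusters.
        pair = min(d, key=d.get)
        i, j = sorted(pair)
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        # The new node sits halfway between the two merged clusters.
        height = d.pop(pair) / 2.0
        # Size-weighted mean of the old distances, which equals the mean
        # over all leaf pairs between the merged cluster and cluster k.
        for k in clusters:
            d_ik = d.pop(frozenset((i, k)))
            d_jk = d.pop(frozenset((j, k)))
            d[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, height
```

With four hypothetical taxa A, B, C, D where A and B are closest, C is equidistant from both, and D is furthest from all three, the sketch joins A with B first, then adds C, then D. Because every leaf ends up at the same depth, the result only makes sense when the constant-rate assumption mentioned above is reasonable.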
Thursday, November 5, 2009
Systematics
Biological systematics is the study of the diversity of life on Earth, both past and present, and the relationships among living things through time. Relationships are visualized as evolutionary trees (synonyms: cladograms, phylogenetic trees, phylogenies). Phylogenies have two components: branching order (showing group relationships) and branch length (showing amount of evolution). Phylogenetic trees of species and higher taxa are used to study the evolution of traits and the distribution of organisms. Systematics, in other words, is used to understand the evolutionary history of life on Earth.
A comparison of phylogenetic and phenetic concepts
The term "systematics" is sometimes used synonymously with "taxonomy" and may be confused with "scientific classification." However, taxonomy is properly the describing, identifying, classifying, and naming of organisms, while "classification" is focused on placing organisms within groups that show their relationships to other organisms. All of these biological disciplines can be involved with extinct and extant organisms. However, systematics alone deals specifically with relationships through time, requiring recognition of the fossil record when dealing with the systematics of organisms.
Phylogenetic Comparative Methods
Phylogenetic comparative methods (PCMs) use information on the evolutionary relationships of organisms (phylogenetic trees) to analyze the origin and maintenance of biodiversity. Biodiversity is most commonly discussed in terms of the number of species, but it can also be phrased in terms of the amount of "morphospace" that a given set of species occupies (see also Cambrian explosion). PCMs focus more on the latter. Although most studies that employ PCMs focus on extant organisms, the methods can also be applied to extinct taxa and can incorporate information from the fossil record.
Owing to their computational requirements, PCMs are usually implemented as computer programs. PCMs can be viewed as part of evolutionary biology, systematics, phylogenetics, bioinformatics or even statistics, as most methods involve statistical procedures and principles for estimating various parameters and drawing inferences about evolutionary processes.
What distinguishes PCMs from most traditional approaches in systematics and phylogenetics is that they typically do not attempt to infer the phylogenetic relationships of the species under study. Rather, they use an independent estimate of the phylogenetic tree (topology plus branch lengths) that is derived from a separate phylogenetic analysis, such as comparative DNA sequences that have been analyzed by maximum parsimony or maximum likelihood methods. PCMs are consumers of phylogenetic trees, not primary producers of them. Accordingly, the list of phylogenetics software shows little overlap with the programs for PCMs.
Comparison of species to elucidate aspects of biology has a long history. Charles Darwin relied on such comparisons as a major source of evidence when writing The Origin of Species. Many other fields of biology use interspecific comparison as well, including behavioral ecology, ethology, ecophysiology, comparative physiology, evolutionary physiology, functional morphology, comparative biomechanics, and the study of sexual selection.
Phylogeography
Phylogeography is the study of the historical processes that may be responsible for the contemporary geographic distributions of individuals. This is accomplished by considering the geographic distribution of individuals in light of the patterns associated with a gene genealogy. This term was introduced to describe geographically structured genetic signals within and among species. An explicit focus on a species' biogeography/biogeographical past sets phylogeography apart from classical population genetics and phylogenetics. Past events that can be inferred include population expansion, population bottlenecks, vicariance and migration. Recently developed approaches integrating coalescent theory or the genealogical history of alleles and distributional information can more accurately address the relative roles of these different historical forces in shaping current patterns.
Bioinformatics
Bioinformatics and computational biology involve the use of techniques including applied mathematics, informatics, statistics, computer science, artificial intelligence, chemistry, and biochemistry to solve biological problems usually on the molecular level. Research in computational biology often overlaps with systems biology. Major research efforts in the field include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, and the modeling of evolution.
Distance Matrix Methods
Distance-matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore they require a multiple sequence alignment (MSA) as input. Distance is often defined as the fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatches. Distance methods attempt to construct an all-to-all matrix from the sequence query set describing the distance between each sequence pair. From this matrix a phylogenetic tree is constructed that places closely related sequences under the same interior node and whose branch lengths closely reproduce the observed distances between sequences. Distance-matrix methods may produce either rooted or unrooted trees, depending on the algorithm used to calculate them. They are frequently used as the basis for progressive and iterative types of multiple sequence alignment. The main disadvantage of distance-matrix methods is their inability to efficiently use information about local high-variation regions that appear across multiple subtrees.
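As a rough illustration of the distance definition above (fraction of mismatches at aligned positions), the following Python sketch computes an all-to-all matrix from an already aligned set of sequences. The function names, the `gaps_as_mismatch` switch, and the choice to skip columns where both sequences have a gap are assumptions made for this example, not part of any particular phylogenetics package.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-", gaps_as_mismatch=False):
    """Fraction of mismatching positions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(seq_a, seq_b):
        if x == gap and y == gap:
            # Assumption: a column that is a gap in both sequences
            # carries no information and is always skipped.
            continue
        if x == gap or y == gap:
            if not gaps_as_mismatch:
                continue  # gaps ignored
            compared += 1
            mismatches += 1  # gaps counted as mismatches
        else:
            compared += 1
            if x != y:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return mismatches / compared


def distance_matrix(alignment, **kwargs):
    """All-to-all distances for a dict of name -> aligned sequence."""
    return {
        frozenset((a, b)): p_distance(alignment[a], alignment[b], **kwargs)
        for a, b in combinations(alignment, 2)
    }
```

The two gap policies can give different matrices for the same alignment, so whichever is chosen should be applied consistently to every pair. Real analyses usually also apply a substitution-model correction to the raw mismatch fraction, which this sketch leaves out.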
Horizontal Gene Transfer
Horizontal gene transfer (HGT), also Lateral gene transfer (LGT), is any process in which an organism transfers genetic material to another cell that is not its offspring. By contrast, vertical transfer occurs when an organism receives genetic material from its ancestor, e.g. its parent or a species from which it evolved. Most thinking in genetics has focused on the more prevalent vertical transfer, but there is a recent awareness that horizontal gene transfer is a significant phenomenon.
Comparative Genomics
Comparative genomics is the study of relationships between the genomes of different species or strains. Comparative genomics is an attempt to take advantage of the information provided by the signatures of selection to understand the function and evolutionary processes that act on genomes. While it is still a young field, it holds great promise to yield insights into many aspects of the evolution of modern species. The sheer amount of information contained in modern genomes (750 megabytes in the case of humans) necessitates that the methods of comparative genomics are automated. Gene finding is an important application of comparative genomics, as is discovery of new, non-coding functional elements of the genome.
Comparative genomics exploits both similarities and differences in the proteins, RNA, and regulatory regions of different organisms to infer how selection has acted upon these elements. Those elements that are responsible for similarities between different species should be conserved through time (stabilizing selection), while those elements responsible for differences among species should be divergent (positive selection). Finally, those elements that are unimportant to the evolutionary success of the organism will be unconserved (selection is neutral).
Identifying the mechanisms of eukaryotic genome evolution by comparative genomics is one of the important goals of the field. It is however often complicated by the multiplicity of events that have taken place throughout the history of individual lineages, leaving only distorted and superimposed traces in the genome of each living organism. For this reason comparative genomics studies of small model organisms (for example yeast) are of great importance to advance our understanding of general mechanisms of evolution.
Having come a long way from its initial use of finding functional proteins, comparative genomics is now concentrating on finding regulatory regions and siRNA molecules. Recently, it has been discovered that distantly related species often share long co
Gene Transfer
Organisms can generally inherit genes in two ways: from parent to offspring (vertical gene transfer), or by horizontal or lateral gene transfer, in which genes jump between unrelated organisms, a common phenomenon in prokaryotes.
Lateral gene transfer has complicated the determination of phylogenies of organisms since inconsistencies have been reported depending on the gene chosen.
Carl Woese proposed the three-domain theory of life (eubacteria, archaea and eukaryotes) based on his discovery that the genes encoding ribosomal RNA are ancient and distributed over all lineages of life with little or no lateral gene transfer. rRNA genes are therefore commonly recommended as molecular clocks for reconstructing phylogenies.
This has been particularly useful for the phylogeny of microorganisms, to which the species concept does not apply and which are too morphologically simple to be classified based on phenotypic traits.
Ontogeny
Ontogeny (also ontogenesis or morphogenesis) describes the origin and the development of an organism from the fertilized egg to its mature form. Ontogeny is studied in developmental biology, developmental psychology, and developmental psychobiology.
In more general terms, ontogeny is defined as the history of structural change in a unity, which can be a cell, an organism, or a society of organisms, without the loss of the organization that allows that unity to exist.
History of Molecular Phylogeny
Molecular systematics was pioneered by Charles G. Sibley (birds), Herbert C. Dessauer (herpetology), and Morris Goodman (primates), followed by Allan C. Wilson, Robert K. Selander, and John C. Avise (who studied various groups). Work with protein electrophoresis began around 1956. Although the results were not quantitative and did not initially improve on morphological classification, they provided tantalizing hints that long-held notions of the classifications of birds, for example, needed substantial revision. In the period of 1974–1986, DNA-DNA hybridization was the dominant technique.
Cluster Analysis
Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure. Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. The computational task of classifying the data set into k clusters is often referred to as k-clustering.
Besides the term data clustering (or just clustering), there are a number of terms with similar meanings, including cluster analysis, automatic classification, numerical taxonomy, botryology and typological analysis.
DNA-DNA Hybridization
DNA-DNA hybridization generally refers to a molecular biology technique that measures the degree of genetic similarity between pools of DNA sequences. It is usually used to determine the genetic distance between two species. When several species are compared that way, the similarity values allow the species to be arranged in a phylogenetic tree; it is therefore one possible approach to carrying out molecular systematics.
Charles Sibley and Jon Ahlquist, pioneers of the technique, used DNA-DNA hybridization to examine the phylogenetic relationships of birds (the Sibley-Ahlquist taxonomy) and primates. Critics argue that the technique is inaccurate for comparison of closely related species, as any attempt to measure differences between orthologous sequences between organisms is overwhelmed by the hybridization of paralogous sequences within an organism's genome. DNA sequencing and computational comparison of sequences are now generally the method for determining genetic distance, although the technique is still used in microbiology to help identify bacteria.
Genotype
The genotype is the genetic constitution of an individual, that is, the specific allele makeup of the individual, usually with reference to a specific character under consideration. For instance, the human albino gene has two allelic forms, dominant A and recessive a, and there are three possible genotypes: AA (homozygous dominant), Aa (heterozygous), and aa (homozygous recessive).
It is a generally accepted theory that inherited genotype, transmitted epigenetic factors, and non-hereditary environmental variation contribute to the phenotype of an individual.
Non-hereditary DNA mutations are not classically understood as representing the individual's genotype. Hence, scientists and doctors sometimes talk, for example, about the (geno)type of a particular cancer, that is the genotype of the disease as distinct from the diseased
Nucleic Acid
A nucleic acid is a macromolecule composed of chains of monomeric nucleotides. In biochemistry these molecules carry genetic information or form structures within cells. The most common nucleic acids are deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). Nucleic acids are universal in living things, as they are found in all cells and viruses. Nucleic acid was first discovered by Friedrich Miescher.
Artificial nucleic acids include peptide nucleic acid (PNA), Morpholino and locked nucleic acid (LNA), as well as glycol nucleic acid (GNA) and threose nucleic acid (TNA). Each of these is distinguished from naturally-occurring DNA or RNA by changes to the backbone of the molecule.
The term "nucleic acid" is the generic name for a family of biopolymers, named for their role in the cell nucleus. The monomers from which nucleic acids are constructed are called nucleotides.
Each nucleotide consists of three components: a nitrogenous heterocyclic base, which is either a purine or a pyrimidine; a pentose sugar; and a phosphate group. Nucleic acid types differ in the structure of the sugar in their nucleotides: DNA contains 2-deoxyribose while RNA contains ribose (the only difference being the presence of a hydroxyl group). The nitrogenous bases found in the two nucleic acid types also differ: adenine, cytosine, and guanine are found in both RNA and DNA, while thymine occurs only in DNA and uracil occurs only in RNA. Other, rare nucleic acid bases can occur, for example inosine in strands of mature transfer RNA.
Nucleic acids are usually either single-stranded or double-stranded, though structures with three or more strands can form. A double-stranded nucleic acid consists of two single-stranded nucleic acids held together by hydrogen bonds, such as in the DNA double helix. In contrast, RNA is usually single-stranded, but any given strand may fold back upon itself to form secondary structure as in tRNA and rRNA. Within cells, DNA is usually double-stranded, though some viruses have single-stranded DNA as their genome. Retroviruses have single-stranded RNA as their genome.
Genetic Material
Genetic material is used to store the genetic information of an organic life form. For all currently known living organisms, the genetic material is almost exclusively deoxyribonucleic acid (DNA). Some viruses use ribonucleic acid (RNA) as their genetic material.
The first genetic material is generally believed to have been RNA, initially manifested by self-replicating RNA molecules floating on bodies of water. This hypothetical period in the evolution of cellular life is known as the RNA world. This hypothesis is based on RNA's ability to act both as genetic material and as a catalyst (a catalytic RNA is known as a ribozyme). However, once proteins, which can form enzymes, came into existence, the more stable molecule DNA became the dominant genetic material, a situation that continues today. DNA's double-stranded nature allows for correction of mutations, whereas RNA is inherently less stable. Modern cells use RNA mainly for the building of proteins from DNA instructions, in the form of messenger RNA, ribosomal RNA, and transfer RNA.
Both RNA and DNA are macromolecules composed of nucleotides, of which there are four kinds in each molecule. Three nucleotides compose a codon, a sort of "genetic word" that specifies an amino acid in a protein. The cod
Phylogenetic Nomenclature
Phylogenetic nomenclature is formulated in terms of evolution and common descent rather than type specimens, categorical ranks, and morphological characters. The latter are used most commonly in cladistic analysis. Taxon names are strictly connected to phylogenetic tree topology and evolutionary history. Each name is attached to a clade, that is, a taxonomic group containing a common ancestor and all its descendants. Phylogenetic nomenclature discards categorical ranks. The problems with ranks are evident when one considers biodiversity in terms of lineages and clades. Questions like "how many lineages are there?" or "how many clades are there?" become pointless, since there are no answers. These are relative concepts, illustrating the fractal nature of the tree of life and the need to let a phylogenetic hypothesis, rather than the categories, be the focus when biodiversity is quantified. Phylogenetic nomenclature helps to put the focus on phylogenetic trees by offering an explicit link between names and parts of species history, that is, clades.
Phylogenetics
In biology, phylogenetics (Greek: phyle = tribe, race and genetikos = relative to birth, from genesis = birth) is the study of evolutionary relatedness among various groups of organisms (e.g., species, populations). Also known as phylogenetic systematics or cladistics, phylogenetics treats each species as a group of lineage-connected individuals. Taxonomy, the classification of organisms according to similarity, has been richly informed by phylogenetics but remains methodologically and logically distinct.
Evolution is regarded as a branching process, whereby populations are altered over time and may speciate into separate branches, hybridize together, or terminate by extinction. This may be visualized as a multidimensional character-space that a population moves through over time. The problem posed by phylogenetics is that genetic data are only available for the present, and fossil records (osteometric data) are sporadic and less reliable. Our knowledge of how evolution operates is used to reconstruct the full tree.
Cladistics provides a simplified method of understanding phylogenetic trees. There are some terms that describe the nature of a grouping. For instance, all birds and reptiles are believed to have descended from a single common ancestor, so this taxonomic grouping (yellow in the diagram) is called monophyletic. "Modern reptile" (cyan in the diagram) is a grouping that contains a common ancestor, but does not contain all descendants of that ancestor (birds are excluded). This is an example of a paraphyletic group. A grouping such as warm-blooded animals would include only mammals and birds (red/orange in the diagram) and is called polyphyletic because the members of this grouping do not include their most recent common ancestor.
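The monophyly criterion above can be checked mechanically: a set of living taxa is monophyletic exactly when it matches the full leaf set of some subtree. The Python sketch below assumes a nested-tuple tree format and hypothetical function names. The example topology is a simplified illustration chosen for this sketch and is not taken from the original diagram.

```python
def leaves(tree):
    """Return the set of leaf labels under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the leaf set of every subtree, including the whole tree."""
    yield leaves(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, group):
    """True if the group is exactly the leaf set of one subtree."""
    return frozenset(group) in set(clades(tree))


# Illustrative topology only (an assumption for this example).
TREE = ("mammals", ("turtles", (("lizards", "snakes"), ("crocodiles", "birds"))))

REPTILES_AND_BIRDS = {"turtles", "lizards", "snakes", "crocodiles", "birds"}
MODERN_REPTILES = {"turtles", "lizards", "snakes", "crocodiles"}
WARM_BLOODED = {"mammals", "birds"}
```

On this tree, `is_monophyletic(TREE, REPTILES_AND_BIRDS)` is true, while the calls for `MODERN_REPTILES` and `WARM_BLOODED` are false. Telling a paraphyletic group from a polyphyletic one needs more than this test, since it depends on whether the common ancestor is counted as a member of the group.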
