Browsing by Subject "biochemical research methods"
Now showing 1 - 20 of 30
Item 3-Dimensional Structure Of A Hemichrome Hemoglobin From Caudina arenicola (1995-09)
Mitchell, David T.; Ernst, Stephen R.; Wu, Wei-Xin; Hackert, Marvin L.
The structure of a monomeric hemichrome form of an invertebrate hemoglobin, Hb-C chain, from Caudina arenicola has been refined to an R value of 0.16 using the data from 5.0 to 2.5 Angstrom resolution (R = 0.21 from 10.0 to 2.5 Angstrom resolution). Hb-C crystallizes in space group P2(1) with cell constants a = 45.74, b = 45.23 and c = 40.92 Angstrom and beta = 104.4 degrees, with two monomers packed in the unit cell (V_m = 2.34 Angstrom^3 Da^-1). The phases were determined by the multiple isomorphous replacement method with Hg2+ the major derivative. The structure consists of 157 amino acids with N- and C-terminal regions and eight α-helices forming a heme pocket. The unique feature of this structure is the hemichrome form with the proximal and distal histidines coordinated to the heme Fe atom, which is nearly in the plane of the porphyrin ring. A total of 111 solvent molecules were added to the structure using difference density peaks of at least 3 sigma over background. Interestingly, all the heme groups present in the crystal are nearly coplanar.

Item Accuracy of RNA-Seq and its Dependence on Sequencing Depth (2012-08)
Cai, Guoshuai; Li, Hua; Lu, Yue; Huang, Xuelin; Lee, Juhee; Muller, Peter; Ji, Yuan; Liang, Shoudan
The cost of DNA sequencing has undergone a dramatic reduction in the past decade. As a result, sequencing technologies have been increasingly applied to genomic research. RNA-Seq is becoming a common technique for surveying gene expression based on DNA sequencing. As it is not clear how increased sequencing capacity has affected measurement accuracy of mRNA, we sought to investigate that relationship. Result: We empirically evaluate the accuracy of repeated gene expression measurements using RNA-Seq.
We identify library preparation steps prior to DNA sequencing as the main source of error in this process. Studying three datasets, we show that the accuracy indeed improves with the sequencing depth. However, the rate of improvement as a function of sequence reads is generally slower than predicted by the binomial distribution. We therefore used the beta-binomial distribution to model the overdispersion. The overdispersion parameters we introduced depend explicitly on the number of reads so that the resulting statistical uncertainty is consistent with the empirical data that measurement accuracy increases with the sequencing depth. The overdispersion parameters were determined by maximizing the likelihood. We show that our modified beta-binomial model had a lower false discovery rate than the binomial or the pure beta-binomial models. Conclusion: We proposed a novel form of overdispersion guaranteeing that the accuracy improves with sequencing depth. We demonstrated that the new form provides a better fit to the data.

Item ADaM: Augmenting Existing Approximate Fast Matching Algorithms with Efficient and Exact Range Queries (2014-05)
Clement, Nathan L.; Thompson, Lee P.; Miranker, Daniel P.
Drug discovery, disease detection, and personalized medicine are fast-growing areas of genomic research. With the advancement of next-generation sequencing techniques, researchers can obtain an abundance of data for many different biological assays in a short period of time. When this data is error-free, the result is a high-quality base-pair resolution picture of the genome. However, when the data is lossy, the heuristic algorithms currently used to align next-generation sequences cause the corresponding accuracy to drop. Results: This paper describes a program, ADaM (APF DNA Mapper), which significantly increases final alignment accuracy.
ADaM works by first using an existing program to align "easy" sequences, and then using an algorithm with accuracy guarantees (the APF) to align the remaining sequences. The final result is a technique that increases the mapping accuracy from only 60% to over 90% for harder-to-align sequences.

Item Antibody-Independent Isolation of Circulating Tumor Cells by Continuous-Flow Dielectrophoresis (2013-01)
Shim, Sangjo; Stemke-Hale, Katherine; Tsimberidou, Apostolia M.; Noshari, Jamileh; Anderson, Thomas E.; Gascoyne, Peter R. C.
Circulating tumor cells (CTCs) are prognostic markers for the recurrence of cancer and may carry molecular information relevant to cancer diagnosis. Dielectrophoresis (DEP) has been proposed as a molecular marker-independent approach for isolating CTCs from blood and has been shown to be broadly applicable to different types of cancers. However, existing batch-mode microfluidic DEP methods have been unable to process 10 ml clinical blood specimens rapidly enough. To achieve the required processing rates of 10^6 nucleated cells/min, we describe a continuous flow microfluidic processing chamber into which the peripheral blood mononuclear cell fraction of a clinical specimen is slowly injected, deionized by diffusion, and then subjected to a balance of DEP, sedimentation and hydrodynamic lift forces. These forces cause tumor cells to be transported close to the floor of the chamber, while blood cells are carried about three cell diameters above them. The tumor cells are isolated by skimming them from the bottom of the chamber while the blood cells flow to waste. The principles, design, and modeling of the continuous-flow system are presented. To illustrate operation of the technology, we demonstrate the isolation of circulating colon tumor cells from clinical specimens and verify the tumor origin of these cells by molecular analysis. (C) 2013 American Institute of Physics.
[http://dx.doi.org/10.1063/1.4774304]

Item The APEX Quantitative Proteomics Tool: Generating Protein Quantitation Estimates from LC-MS/MS Proteomics Results (2008-12)
Braisted, John C.; Kuntumalla, Srilatha; Vogel, Christine; Marcotte, Edward M.; Rodrigues, Alan R.; Wang, Rong; Huang, Shih-Ting; Ferlanti, Erik S.; Saeed, Alexander I.; Fleischmann, Robert D.; Peterson, Scott N.; Pieper, Rembert
Mass spectrometry (MS) based label-free protein quantitation has mainly focused on analysis of ion peak heights and peptide spectral counts. Most analyses of tandem mass spectrometry (MS/MS) data begin with an enzymatic digestion of a complex protein mixture to generate smaller peptides that can be separated and identified by an MS/MS instrument. Peptide spectral counting techniques attempt to quantify protein abundance by counting the number of detected tryptic peptides and their corresponding MS spectra. However, spectral counting is confounded by the fact that peptide physicochemical properties severely affect MS detection, resulting in each peptide having a different detection probability. Lu et al. (2007) described a modified spectral counting technique, Absolute Protein Expression (APEX), which improves on basic spectral counting methods by including a correction factor for each protein (called the O(i) value) that accounts for variable peptide detection by MS techniques. The technique uses machine learning classification to derive peptide detection probabilities that are used to predict the number of tryptic peptides expected to be detected for one molecule of a particular protein (O(i)). This predicted spectral count is compared to the protein's observed MS total spectral count during APEX computation of protein abundances. Results: The APEX Quantitative Proteomics Tool, introduced here, is a free open source Java application that supports the APEX protein quantitation technique.
The APEX tool uses data from standard tandem mass spectrometry proteomics experiments and provides computational support for APEX protein abundance quantitation through a set of graphical user interfaces that partition the parameter controls for the various processing tasks. The tool also provides a Z-score analysis for identification of significant differential protein expression, a utility to assess APEX classifier performance via cross validation, and a utility to merge multiple APEX results into a standardized format in preparation for further statistical analysis. Conclusion: The APEX Quantitative Proteomics Tool provides a simple means to quickly derive hundreds to thousands of protein abundance values from standard liquid chromatography-tandem mass spectrometry proteomics datasets. The APEX tool provides a straightforward intuitive interface design overlaying a highly customizable computational workflow to produce protein abundance values from LC-MS/MS datasets.

Item Base Calling for High-Throughput Short-Read Sequencing: Dynamic Programming Solutions (2013-04)
Das, Shreepriya; Vikalo, Haris
Next-generation DNA sequencing platforms are capable of generating millions of reads in a matter of days at rapidly reducing costs. Despite its proliferation and technological improvements, the performance of next-generation sequencing remains adversely affected by the imperfections in the underlying biochemical and signal acquisition procedures. To this end, various techniques, including statistical methods, are used to improve read lengths and accuracy of these systems. Development of high performing base calling algorithms that are computationally efficient and scalable is an ongoing challenge. Results: We develop model-based statistical methods for fast and accurate base calling in Illumina's next-generation sequencing platforms.
In particular, we propose a computationally tractable parametric model which enables dynamic programming formulation of the base calling problem. Forward-backward and soft-output Viterbi algorithms are developed, and their performance and complexity are investigated and compared with the existing state-of-the-art base calling methods for this platform. A C code implementation of our algorithm named Softy can be downloaded from https://sourceforge.net/projects/dynamicprog. Conclusion: We demonstrate high accuracy and speed of the proposed methods on reads obtained using Illumina's Genome Analyzer II and HiSeq2000. In addition to performing reliable and fast base calling, the developed algorithms enable incorporation of prior knowledge which can be utilized for parameter estimation and is potentially beneficial in various downstream applications.

Item Binding Of Flexible And Constrained Ligands To The Grb2 SH2 Domain: Structural Effects Of Ligand Preorganization (2010-10)
Clements, John H.; DeLorbe, John E.; Benfield, Aaron P.; Martin, Stephen F.
Structures of the Grb2 SH2 domain complexed with a series of pseudopeptides containing flexible (benzyl succinate) and constrained (aryl cyclopropanedicarboxylate) replacements of the phosphotyrosine (pY) residue in tripeptides derived from Ac-pYXN-NH(2) (where X = V, I, E and Q) were elucidated by X-ray crystallography. Complexes of flexible/constrained pairs having the same pY + 1 amino acid were analyzed in order to ascertain what structural differences might be attributed to constraining the phosphotyrosine replacement. In this context, a given structural dissimilarity between complexes was considered to be significant if it was greater than the corresponding difference in complexes coexisting within the same asymmetric unit.
The backbone atoms of the domain generally adopt a similar conformation and orientation relative to the ligands in the complexes of each flexible/constrained pair, although there are some significant differences in the relative orientations of several loop regions, most notably in the BC loop that forms part of the binding pocket for the phosphate group in the tyrosine replacements. These variations are greater in the set of complexes of constrained ligands than in the set of complexes of flexible ligands. The constrained ligands make more direct polar contacts to the domain than their flexible counterparts, whereas the more flexible ligand of each pair makes more single-water-mediated contacts to the domain; there was no correlation between the total number of protein-ligand contacts and whether the phosphotyrosine replacement of the ligand was preorganized. The observed differences in hydrophobic interactions between the complexes of each flexible/constrained ligand pair were generally similar to those observed upon comparing such contacts in coexisting complexes. The average adjusted B factors of the backbone atoms of the domain and loop regions are significantly greater in the complexes of constrained ligands than in the complexes of the corresponding flexible ligands, suggesting greater thermal motion in the crystalline state in the former complexes. There was no apparent correlation between variations in crystal packing and observed structural differences or similarities in the complexes of flexible and constrained ligands, but the possibility that crystal packing might result in structural variations cannot be rigorously excluded. 
Overall, it appears that there are more variations in the three-dimensional structure of the protein and the ligand in complexes of the constrained ligands than in those of their more flexible counterparts.

Item BM-BC: A Bayesian Method of Base Calling for Solexa Sequence Data (2012-08)
Ji, Yuan; Mitra, Riten; Quintana, Fernando; Jara, Alejandro; Mueller, Peter; Liu, Ping; Lu, Yue; Liang, Shoudan
Base calling is a critical step in the Solexa next-generation sequencing procedure. It compares the position-specific intensity measurements that reflect the signal strength of four possible bases (A, C, G, T) at each genomic position, and outputs estimates of the true sequences for short reads of DNA or RNA. We present a Bayesian method of base calling, BM-BC, for Solexa-GA sequencing data. The Bayesian method builds on a hierarchical model that accounts for three sources of noise in the data, which are known to affect the accuracy of the base calls: fading, phasing, and cross-talk between channels. We show that the new method improves the precision of base calling compared with currently leading methods. Furthermore, the proposed method provides a probability score that measures the confidence of each base call. This probability score can be used to estimate the false discovery rate of the base calling or to rank the precision of the estimated DNA sequences, which in turn can be useful for downstream analysis such as sequence alignment.

Item Critical Assessment of Sequence-Based Protein-Protein Interaction Prediction Methods that do not Require Homologous Protein Sequences (2009-12)
Park, Yungki
Protein-protein interactions underlie many important biological processes. Computational prediction methods can nicely complement experimental approaches for identifying protein-protein interactions.
Recently, a unique category of sequence-based prediction methods has been put forward, unique in the sense that it does not require homologous protein sequences. This makes it universally applicable to all protein sequences, unlike many previous sequence-based prediction methods. If effective as claimed, these new sequence-based, universally applicable prediction methods would have far-reaching utility in many areas of biology research. Results: Upon close survey, I realized that many of these new methods were ill-tested. In addition, newer methods were often published without performance comparison with previous ones. Thus, it is not clear how good they are and whether there are significant performance differences among them. In this study, I have implemented and thoroughly tested 4 different methods on large-scale, non-redundant data sets. The tests reveal several important points. First, significant performance differences are noted among different methods. Second, data sets typically used for training prediction methods appear significantly biased, limiting the general applicability of prediction methods trained with them. Third, there is still ample room for further developments. In addition, my analysis illustrates the importance of complementary performance measures coupled with right-sized data sets for meaningful benchmark tests.
Conclusions: The current study reveals the potentials and limits of the new category of sequence-based protein-protein interaction prediction methods, which in turn provides a firm ground for future endeavours in this important area of contemporary bioinformatics.

Item Crystallization And Preliminary X-Ray Analysis Of A Chitinase From The Fungal Pathogen Coccidioides Immitis (1998-11)
Hollis, Thomas; Monzingo, Arthur F.; Bortone, Kara; Schelp, Elisabeth; Cox, Rebecca; Robertus, Jon D.
Chitinase is necessary for fungal growth and cell division and, therefore, is an ideal target for the design of inhibitors which may act as antifungal agents. A chitinase from the fungal pathogen Coccidioides immitis has been expressed as a fusion protein with glutathione-S-transferase (GST), which aids in purification. After cleavage from GST, chitinase was crystallized from 30% PEG 4000 in 0.1 M sodium acetate pH 4.6. The crystals have a tetragonal crystal lattice and belong to space group P4(1)2(1)2 or P4(3)2(1)2 and diffract to 2.2 Angstrom resolution. The unit-cell parameters are a = b = 91.2, c = 95.4 Angstrom; there is only one chitinase molecule in the asymmetric unit.

Item Crystallographic Study Of The Phosphoethanolamine Transferase EptC Required For Polymyxin Resistance And Motility In Campylobacter jejuni (2014-10)
Fage, Christopher D.; Brown, Dusty B.; Boll, Joseph M.; Keatinge-Clay, Adrian T.; Trent, M. Stephen
The foodborne enteric pathogen Campylobacter jejuni decorates a variety of its cell-surface structures with phosphoethanolamine (pEtN).
Modifying lipid A with pEtN promotes cationic antimicrobial peptide resistance, whereas post-translationally modifying the flagellar rod protein FlgG with pEtN promotes flagellar assembly and motility, which are processes that are important for intestinal colonization. EptC, the pEtN transferase required for all known pEtN cell-surface modifications in C. jejuni, is a predicted inner-membrane metalloenzyme with a five-helix N-terminal transmembrane domain followed by a soluble sulfatase-like catalytic domain in the periplasm. The catalytic domain of EptC (cEptC) was crystallized and its atomic structure solved to a resolution of 2.40 angstrom. cEptC adopts the alpha/beta/alpha fold of the sulfatase protein family and harbors a zinc-binding site. A phosphorylated Thr266 residue was observed that was hypothesized to mimic a covalent pEtN-enzyme intermediate. The requirement for Thr266 as well as the nearby residues Asn308, Ser309, His358 and His440 was ascertained via in vivo activity assays on mutant strains. The results establish a basis for the design of pEtN transferase inhibitors.

Item Dielectrophoresis has Broad Applicability to Marker-Free Isolation of Tumor Cells from Blood by Microfluidic Systems (2013-01)
Shim, Sangjo; Stemke-Hale, Katherine; Noshari, Jamileh; Becker, Frederick F.; Gascoyne, Peter R. C.
The number of circulating tumor cells (CTCs) found in blood is known to be a prognostic marker for recurrence of primary tumors; however, most current methods for isolating CTCs rely on cell surface markers that are not universally expressed by CTCs. Dielectrophoresis (DEP) can discriminate and manipulate cancer cells in microfluidic systems and has been proposed as a molecular marker-independent approach for isolating CTCs from blood.
To investigate the potential applicability of DEP to different cancer types, the dielectric and density properties of the NCI-60 panel of tumor cell types have been measured by dielectrophoretic field-flow fractionation (DEP-FFF) and compared with like properties of the subpopulations of normal peripheral blood cells. We show that all of the NCI-60 cell types, regardless of tissue of origin, exhibit dielectric properties that facilitate their isolation from blood by DEP. Cell types derived from solid tumors that grew in adherent cultures exhibited dielectric properties that were strikingly different from those of peripheral blood cell subpopulations, while leukemia-derived lines that grew in non-adherent cultures exhibited dielectric properties that were closer to those of peripheral blood cell types. Our results suggest that DEP methods have wide applicability for the surface-marker independent isolation of viable CTCs from blood as well as for the concentration of leukemia cells from blood. (C) 2013 American Institute of Physics. [http://dx.doi.org/10.1063/1.4774307]

Item Efficient Parallel and Out of Core Algorithms for Constructing Large Bi-Directed De Bruijn Graphs (2010-11)
Kundeti, Varmsi K.; Rajasekaran, Sanguthevar; Dinh, Hieu; Vaughn, Matthew; Thapar, Vishal
Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories based on the data structures which they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However, with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are essential in large sequencing projects based on short reads.
In an earlier work, an O(n/p) time parallel algorithm was given for this problem. Here n is the size of the input and p is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Theta(n Sigma) messages (Sigma being the size of the alphabet). Results: In this paper we present a Theta(n/p) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Sigma. The generality of our algorithm makes it very easy to extend it even to the out-of-core model and in this case it has an optimal I/O complexity of Theta(n log(n/B)/B log(M/B)) (M being the main memory size and B being the size of the disk block). We demonstrate the scalability of our parallel algorithm on an SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster - both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem. Conclusions: The bi-directed de Bruijn graph is a fundamental data structure for any sequence assembly program based on the Eulerian approach. Our algorithms for constructing bi-directed de Bruijn graphs are efficient in parallel and out-of-core settings. These algorithms can be used in building large scale bi-directed de Bruijn graphs. Furthermore, our algorithms do not employ any all-to-all communications in a parallel setting and perform better than the prior algorithms.
Finally, our out-of-core algorithm is extremely memory efficient and can replace the existing graph construction algorithm in VELVET.

Item Epifire: An Open Source C++ Library and Application for Contact Network Epidemiology (2012-05)
Hladish, Thomas; Melamud, Eugene; Barrera, Luis Alberto; Galvani, Alison; Meyers, Lauren Ancel
Contact network models have become increasingly common in epidemiology, but we lack a flexible programming framework for the generation and analysis of epidemiological contact networks and for the simulation of disease transmission through such networks. Results: Here we present EpiFire, an applications programming interface and graphical user interface implemented in C++, which includes a fast and efficient library for generating, analyzing and manipulating networks. Network-based percolation and chain-binomial simulations of susceptible-infected-recovered disease transmission, as well as traditional non-network mass-action simulations, can be performed using EpiFire. Conclusions: EpiFire provides an open-source programming interface for the rapid development of network models with a focus on contact network epidemiology. EpiFire also provides a point-and-click interface for generating networks, conducting epidemic simulations, and creating figures. This interface is particularly useful as a pedagogical tool.

Item Exploring Biological Network Structure with Clustered Random Networks (2009-12)
Bansal, Shweta; Khandelwal, Shashank; Meyers, Lauren Ancel
Complex biological systems are often modeled as networks of interacting units. Networks of biochemical interactions among proteins, epidemiological contacts among hosts, and trophic interactions in ecosystems, to name a few, have provided useful insights into the dynamical processes that shape and traverse these systems.
The degrees of nodes (numbers of interactions) and the extent of clustering (the tendency for a set of three nodes to be interconnected) are two of many well-studied network properties that can fundamentally shape a system. Disentangling the interdependent effects of the various network properties, however, can be difficult. Simple network models can help us quantify the structure of empirical networked systems and understand the impact of various topological properties on dynamics. Results: Here we develop and implement a new Markov chain simulation algorithm to generate simple, connected random graphs that have a specified degree sequence and level of clustering, but are random in all other respects. The implementation of the algorithm (ClustRNet: Clustered Random Networks) provides the generation of random graphs optimized according to a local or global, and relative or absolute measure of clustering. We compare our algorithm to other similar methods and show that ours more successfully produces desired network characteristics. Finding appropriate null models is crucial in bioinformatics research, and is often difficult, particularly for biological networks. As we demonstrate, the networks generated by ClustRNet can serve as random controls when investigating the impacts of complex network features beyond the byproduct of degree and clustering in empirical networks. Conclusion: ClustRNet generates ensembles of graphs of specified edge structure and clustering. These graphs allow for systematic study of the impacts of connectivity and redundancies on network function and dynamics. This process is a key step in unraveling the functional consequences of the structural properties of empirical biological systems and uncovering the mechanisms that drive these systems.

Item Expression, Crystallization And Preliminary X-Ray Crystallographic Analysis Of Cystathionine Gamma-Synthase (XometB) From Xanthomonas Oryzae Pv. Oryzae (2012-12)
Ngo, Ho-Phuong-Thuy; Kim, Jin-Kwang; Kim, Seung-Hwan; Pham, Tan-Viet; Tran, Thi-Huyen; Nguyen, Dinh-Duc; Kim, Jeong-Gu; Chung, Sumi; Ahn, Yeh-Jin; Kang, Lin-Woo
Cystathionine gamma-synthase (CGS) catalyzes the first step in the transsulfuration pathway leading to the formation of cystathionine from O-succinylhomoserine and L-cysteine through a gamma-replacement reaction. As an antibacterial drug target against Xanthomonas oryzae pv. oryzae (Xoo), CGS from Xoo (XometB) was cloned, expressed, purified and crystallized. The XometB crystal diffracted to 2.4 angstrom resolution and belonged to the tetragonal space group I4(1), with unit-cell parameters a = b = 165.4, c = 241.7 angstrom. There were four protomers in the asymmetric unit, with a corresponding solvent content of 73.9%.

Item Fast and Accurate Methods for Phylogenomic Analyses (2011-10)
Yang, Jimmy; Warnow, Tandy
Species phylogenies are not estimated directly, but rather through phylogenetic analyses of different gene datasets. However, true gene trees can differ from the true species tree (and hence from one another) due to biological processes such as horizontal gene transfer, incomplete lineage sorting, and gene duplication and loss, so that no single gene tree is a reliable estimate of the species tree. Several methods have been developed to estimate species trees from estimated gene trees, differing according to the specific algorithmic technique used and the biological model used to explain differences between species and gene trees. Relatively little is known about the relative performance of these methods. Results: We report on a study evaluating several different methods for estimating species trees from sequence datasets, simulating sequence evolution under a complex model including indels (insertions and deletions), substitutions, and incomplete lineage sorting.
The most important finding of our study is that some fast and simple methods are nearly as accurate as the most accurate methods, which employ sophisticated statistical methods and are computationally quite intensive. We also observe that methods that explicitly consider errors in the estimated gene trees produce more accurate trees than methods that assume the estimated gene trees are correct. Conclusions: Our study shows that highly accurate estimations of species trees are achievable, even when gene trees differ from each other and from the species tree, and that these estimations can be obtained using fairly simple and computationally tractable methods.

Item Genome-Scale Cluster Analysis of Replicated Microarrays Using Shrinkage Correlation Coefficient (2008-06)
Yao, Jianchao; Chang, Chunqi; Salmi, Mari L.; Hung, Yeung S.; Loraine, Ann; Roux, Stanley J.
Clustering with some form of correlation coefficient as the gene similarity metric has become a popular method for profiling genomic data. The Pearson correlation coefficient and the standard deviation (SD)-weighted correlation coefficient are the two correlations most widely used as similarity metrics in clustering microarray data. However, these two correlations are not optimal for analyzing replicated microarray data generated by most laboratories. An effective correlation coefficient is needed to provide statistically sufficient analysis of replicated microarray data. Results: In this study, we describe a novel correlation coefficient, shrinkage correlation coefficient (SCC), that fully exploits the similarity between the replicated microarray experimental samples. The methodology considers both the number of replicates and the variance within each experimental group in clustering expression data, and provides a robust statistical estimation of the error of replicated microarray data.
The value of SCC is revealed by its comparison with two other correlation coefficients that are currently the most widely used (Pearson correlation coefficient and SD-weighted correlation coefficient) using statistical measures on both synthetic expression data and real gene expression data from Saccharomyces cerevisiae. Two leading clustering methods, hierarchical and k-means clustering, were applied for the comparison. The comparison indicated that using SCC achieves better clustering performance. Applying SCC-based hierarchical clustering to the replicated microarray data obtained from germinating spores of the fern Ceratopteris richardii, we discovered two clusters of genes with shared expression patterns during spore germination. Functional analysis suggested that some of the genetic mechanisms that control germination in such diverse plant lineages as mosses and angiosperms are also conserved among ferns. Conclusion: This study shows that SCC is an alternative to the Pearson correlation coefficient and the SD-weighted correlation coefficient, and is particularly useful for clustering replicated microarray data. This computational approach should be generally useful for proteomic data or other high-throughput analysis methodology.

Item Joint Haplotype Assembly and Genotype Calling via Sequential Monte Carlo Algorithm (2015-07)
Ahn, Soyeon; Vikalo, Haris
Genetic variations predispose individuals to hereditary diseases, play an important role in the development of complex diseases, and impact drug metabolism. The full information about the DNA variations in the genome of an individual is given by haplotypes, the ordered lists of single nucleotide polymorphisms (SNPs) located on chromosomes. Affordable high-throughput DNA sequencing technologies enable routine acquisition of data needed for the assembly of single individual haplotypes.
However, state-of-the-art high-throughput sequencing platforms generate erroneous data, which induces uncertainty in the SNP and genotype calling procedures and ultimately reduces the accuracy of haplotyping. When inferring haplotype phase information, the vast majority of the existing techniques for haplotype assembly assume that the genotype information is correct. This motivates the development of methods capable of joint genotype calling and haplotype assembly. Results: We present a haplotype assembly algorithm, ParticleHap, that relies on a probabilistic description of the sequencing data to jointly infer genotypes and assemble the most likely haplotypes. Our method employs a deterministic sequential Monte Carlo algorithm that associates single nucleotide polymorphisms with haplotypes by exhaustively exploring all possible extensions of the partial haplotypes. The algorithm relies on genotype likelihoods rather than on often erroneously called genotypes, thus ensuring a more accurate assembly of the haplotypes. Results on both the 1000 Genomes Project experimental data and simulation studies demonstrate that the proposed approach enables highly accurate solutions to the haplotype assembly problem while being computationally efficient and scalable, generally outperforming existing methods in terms of both accuracy and speed. Conclusions: The developed probabilistic framework and sequential Monte Carlo algorithm enable joint haplotype assembly and genotyping in a computationally efficient manner.
Our results demonstrate fast and highly accurate haplotype assembly aided by the re-examination of erroneously called genotypes.
Item Microfluidic Enrichment of Small Proteins from Complex Biological Mixture on Nanoporous Silica Chip (2011-03) Hu, Ye; Gopal, Ashwini; Lin, Kevin; Peng, Yang; Tasciotti, Ennio; Zhang, Xiojing John; Ferrari, Mauro
The growing field of miniaturized diagnostics is hindered by a lack of pre-analysis treatments that are capable of processing small sample volumes for the detection of low-concentration analytes in a high-throughput manner. This letter presents a novel, highly efficient method for the extraction of low-molecular-weight (LMW) proteins from biological fluids, represented by a mixture of standard proteins, using integrated microfluidic systems. We bound a polydimethylsiloxane layer patterned with a microfluidic channel onto a well-defined nanoporous silica substrate. Using rapid, pressure-driven fractionation steps, this system utilizes the size-exclusion properties of the silica nanopores to remove high-molecular-weight proteins while simultaneously isolating and enriching LMW proteins present in the biological sample. The introduction of the microfluidic component offers important advantages such as high reproducibility, a simple user interface, a controlled environment, the ability to process small sample volumes, and precise quantification. This solution streamlines high-throughput proteomics research on many fronts and may find broad acceptance and application in clinical diagnostics and point-of-care detection. © 2011 American Institute of Physics. [doi: 10.1063/1.3528237]