Functionality and efficiency improvement of miRCheck, a popular program for microRNA structure prediction
Plants and animals both contain non-coding small RNAs that play important roles in growth, development, and responses to biotic and abiotic stresses. MicroRNAs (miRNAs) are a well-characterized class of short (20-24 nt), endogenously expressed small RNAs derived by processing longer, hairpin-like precursors. Most miRNAs are discovered by one of two methods: i) molecular cloning of small RNAs, or ii) computational prediction of miRNA genes based on the conserved sequences and secondary structures of known miRNAs. miRCheck is a popular computational tool chain, composed of several Perl scripts, for identifying and profiling plant miRNA genes. The program serves two purposes: 1. identifying miRNA homologs in target genomic or cDNA sequences using a given small RNA library, and 2. searching the target genome for all potential miRNA precursor loci based on evolutionarily conserved structural features, without a reference small RNA library for comparison. miRCheck builds upon several popular tools, such as patscan, RNAfold, and einverted, and connects them into a complete tool chain for identifying miRNAs. Although miRCheck is well designed, several issues must be addressed to enhance its functionality and efficiency. This work analyzes the working mechanism of miRCheck, proposes methods to enhance its efficiency and functionality, and implements them in Python as a modified tool chain, py-miRCheck. To process a long genome sequence, miRCheck divides it into small segments and evaluates them serially, which leads to long run times. Even on a high-performance computing node, it takes days to process a standard-sized reference genome obtained from the NCBI repository, which highlights the inefficiency of the program.
On the functionality side, several issues need to be addressed to improve usability: i) lack of a parameterized design, ii) a procedural design, iii) lack of a graphical user interface for running the tool chain, and iv) deployment-related problems. In this work, I address all of these areas and also parallelize the tool chain, improving its efficiency by more than a factor of 3. I also provide a Django-based prototype web front end for submitting queries on a genome sequence. In summary, this work substantially improves the usability of the tool chain.
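The general idea of replacing serial segment evaluation with parallel evaluation can be illustrated with a minimal Python sketch. It is not the py-miRCheck implementation: the function names (`split_windows`, `evaluate_segment`, `scan`), the window and step sizes, and the GC-fraction score are all hypothetical placeholders, and the score merely stands in for the structural evaluation that the real tool chain delegates to external tools such as RNAfold.

```python
from concurrent.futures import ProcessPoolExecutor


def split_windows(genome, window, step):
    """Yield (start, segment) pairs that cover the sequence with overlapping windows."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = max(len(genome) - window, 0)
    starts = list(range(0, last_start + 1, step))
    # Add a final window so that the tail of the sequence is always covered.
    if starts[-1] != last_start:
        starts.append(last_start)
    for start in starts:
        yield start, genome[start:start + window]


def evaluate_segment(item):
    """Placeholder score (GC fraction); stands in for a real hairpin evaluation."""
    start, segment = item
    if not segment:
        return start, 0.0
    gc_count = sum(base in "GC" for base in segment)
    return start, gc_count / len(segment)


def scan(genome, window=300, step=100, workers=1):
    """Evaluate every window, serially or across worker processes."""
    items = list(split_windows(genome, window, step))
    if workers <= 1:
        # Serial path: one segment after another, as in the original design.
        return [evaluate_segment(item) for item in items]
    # Parallel path: segments are independent, so they can be mapped over a
    # process pool; map() returns results in the original window order.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_segment, items, chunksize=64))
```

When `workers` is greater than 1, the call to `scan` should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Because each window is scored independently, the achievable speed-up depends mainly on the number of cores and on how expensive the per-segment evaluation is relative to the cost of dispatching work.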