By Prajkta Chivte, Ph.D., DNASTAR Technical Sales Scientist
May 23, 2024 | Molecular Biology
Prajkta Chivte recently completed her doctoral studies in Biochemistry, where she established novel biomarkers for diagnosing COVID-19 using mass spectrometry. As a Technical Sales Scientist at DNASTAR, she works directly with a diverse range of domestic and international customers to understand their research needs and guide them to the appropriate Lasergene software workflows.
This blog post answers some common questions about gene homology that I receive in my role as a Technical Sales Scientist. Whether you are a curious undergraduate, a lab-based scientist, or simply fascinated by the process of evolution, I hope you’ll learn something new about this topic.
In the first part of this blog post, I’ll answer some basic questions related to gene homology. Next, I’ll show how easy it is to use Lasergene software to reveal evolutionary connections between genomes. Finally, I’ll share a real-world application of this workflow that was pioneered by DNASTAR scientists and one of our customers at the University of Groningen.
The concept of phenotypical homology was first introduced by Richard Owen (1804-1892) when describing homologous structures between species. Once Charles Darwin (1809-1882) published his works on evolutionary theory, these homologous structures were reinterpreted as showing derivation from a common ancestral structure. As we now know, similar structures may indeed point to a common ancestor (homology), but they can also arise independently in virtually unrelated species (analogy).
With the advent of nucleotide and protein sequencing technologies and the development of specialized bioinformatics software, it is now possible to go beyond phenotypes and instead determine whether organisms share a common ancestor by comparing their DNA, RNA, and protein sequences. This type of analysis is called “sequence homology” or “gene homology.”
The link between sequence homology and sequence similarity (or identity) is often misunderstood. Simply put, sequence similarity is the percentage of matching residues between two aligned sequences. Sequence similarity is a quantitative parameter, so we can say that two sequences “share 55% similarity.”
By contrast, sequence homology is an inference drawn from the results of sequence similarity, and always involves a qualitative statement. Sequences are either homologous or nonhomologous. An analogy is that a person can either be pregnant or not pregnant; they can’t be 55% pregnant. Because sequence homology is qualitative, it is not possible to calculate a “percent homology” for a pair of sequences. They can have a “percent similarity,” but they either do, or do not, share a common ancestor.
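To make the quantitative side of this distinction concrete, here is a minimal Python sketch that computes percent identity, the strictest form of similarity, for two sequences that have already been aligned. The function name, the toy sequences, and the convention of skipping gapped columns are assumptions made for illustration; alignment tools differ in how they treat gaps and conservative substitutions, so their reported values may not match this calculation.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free alignment columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: columns with a gap in either sequence are ignored
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy, made-up aligned sequences ("-" marks a gap).
seq_a = "ATGCTAGC-A"
seq_b = "ATGATAGCTA"
print(f"{percent_identity(seq_a, seq_b):.1f}% identity")  # 88.9% identity
```

The result is a number on a continuous scale. Whether that number is high enough to infer homology is a separate, yes-or-no judgment that the calculation itself cannot make.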
Genes that share a common evolutionary origin are referred to as homologs, which are further categorized into three classes: orthologs (speciation), paralogs (duplication), and xenologs (horizontal gene transfer).
Sequence homology not only helps in performing phylogenetic analysis and understanding evolutionary relationships, but also assists in inferring or predicting gene function and offers insights into various genetic diseases. Recently, this knowledge has been applied with great success in drug discovery pipelines. Because the results are multi-dimensional, researchers from many other fields can benefit from gene homology analysis as well:
– Biotechnologists and pharmacists can use homology analysis to further drug discovery and protein engineering, as well as to identify therapeutic targets.
– Evolutionary biologists can trace evolutionary relationships between species.
– Molecular biologists can discover structural similarities between the genes of closely related species.
– Microbiologists and virologists can examine pathogenicity and virulence by studying the homologs within a certain family or genus.
– Environmental scientists and ecologists can evaluate the genetic diversity and population structure of a given ecosystem.
– Anthropologists and archaeologists can determine human migration patterns.
As you can see, gene homology has become an essential workflow for researchers across a wide range of biological, environmental, medical, and social sciences fields.
The ability to do gene homology analysis was added in Lasergene 17.6, which was released in 2024. Setup only takes a minute or two, and the workflow supports phylogenetic analysis of much larger nucleotide sequences than those supported by most multiple sequence alignment algorithms.
Step 1: In MegAlign Pro, use Align > Align by Gene Homology to launch the wizard at the Reference Sequence screen (Figure 1).
Use the buttons on the right to add a reference sequence from your computer or from the DNASTAR Cloud Data Drive; or to download a genome template from the DNASTAR website or from NCBI’s Entrez database. Then click Next.
Step 2: In the Input Sequences wizard screen (Figure 2), add the sequences that you wish to align to the reference.
This screen also offers a number of options for grouping and naming replicate sets. When finished, click Next.
(Note that if your starting point is raw unassembled data, particularly long-read data obtained from PacBio HiFi, you should first de novo assemble it in SeqMan NGen before uploading it here.)
Step 3: In the Analysis Options screen (Figure 3), customize options related to post-assembly analysis, if desired.
Step 4: Click Next to proceed to a screen where you can name the project and choose a phylogenetic tree-building algorithm. Then click Next again to choose whether to run the alignment on your local computer or on the cloud.
Once alignment begins, MegAlign Pro starts by identifying a subset of homologous genes in bacterial genomes or eukaryotic chromosomes. This step enables the homology alignment algorithm to better differentiate between closely related species, rather than relying on a single gene for phylogenetic analysis. The algorithm then creates a concatenated protein from each genome and aligns these against a reference sequence using the powerful and highly accurate MAFFT algorithm.
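The concatenation idea described above can be sketched in a few lines of Python. This is not MegAlign Pro's implementation; it is a simplified illustration that assumes the homologs have already been identified and are keyed by gene name. The function name, gene names, and three-residue "proteins" are all made up for the example.

```python
def concatenate_shared_homologs(reference_order, genomes):
    """
    reference_order: gene names in the order they occur in the reference genome.
    genomes: dict mapping genome name -> dict of gene name -> protein sequence.
    Returns the shared genes (in reference order) and one concatenated
    protein string per genome.
    """
    # Keep only genes that are present in every genome.
    shared = [
        gene for gene in reference_order
        if all(gene in proteins for proteins in genomes.values())
    ]
    # Join each genome's shared proteins in reference order.
    concatenated = {
        name: "".join(proteins[gene] for gene in shared)
        for name, proteins in genomes.items()
    }
    return shared, concatenated


# Toy, made-up data: isolate_1 lacks gyrB, so gyrB is dropped everywhere.
reference_order = ["dnaA", "gyrB", "rpoB", "recA"]
genomes = {
    "reference": {"dnaA": "MSL", "gyrB": "MSN", "rpoB": "MVY", "recA": "MAI"},
    "isolate_1": {"dnaA": "MSL", "rpoB": "MAY", "recA": "MAI"},
    "isolate_2": {"dnaA": "MTL", "gyrB": "MSN", "rpoB": "MVY", "recA": "MAV"},
}

shared, concatenated = concatenate_shared_homologs(reference_order, genomes)
print(shared)        # ['dnaA', 'rpoB', 'recA']
print(concatenated)  # one concatenated string per genome
```

In a real analysis, the concatenated sequences would then be passed to a multiple sequence aligner such as MAFFT. Using many genes at once is what gives the resulting tree more resolving power than a single-gene comparison.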
After successfully completing the alignment, MegAlign Pro generates the usual distance table and aligned sequences view, but also creates a phylogenetic tree and a homologs view (Figure 4). The homologs view contains two customizable tables with a summary of all the identified homologs (or “unique to reference”) along with their % coverage and % similarity. These last two statistics are valuable in assessing the quality of the alignment.
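To give a feel for what a coverage statistic measures, the Python sketch below uses one common definition: the share of reference residues that are aligned to an actual residue, rather than a gap, in the query. This definition is an assumption made for illustration; the exact formula MegAlign Pro uses for % coverage may differ. The function name and sequences are made up.

```python
def percent_coverage(aligned_ref: str, aligned_query: str) -> float:
    """Percent of reference residues aligned to a non-gap residue in the query."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")
    ref_positions = 0
    covered = 0
    for r, q in zip(aligned_ref, aligned_query):
        if r == "-":
            continue  # insertion relative to the reference; not a reference residue
        ref_positions += 1
        if q != "-":
            covered += 1
    if ref_positions == 0:
        return 0.0
    return 100.0 * covered / ref_positions


# Toy, made-up protein alignment: the query is missing two reference residues.
ref = "MKTAYIAKQR"
query = "MKT--IAKQR"
print(f"{percent_coverage(ref, query):.1f}% coverage")  # 80.0% coverage
```

A homolog with high similarity but low coverage matches only part of the reference gene, which is one reason the two statistics are worth reading together.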
In addition, several useful tab-delimited text files are automatically created and can be used for additional downstream analysis. For example, one important folder that is generated is the Concatenated Proteins folder. This folder contains files for all translated proteins of all the Shared Homologs, concatenated into the same genomic order as they occur in the reference genome.
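As an example of downstream analysis, the sketch below reads a tab-delimited homolog summary with Python's standard csv module and keeps the genes that pass two thresholds. The column names (gene, coverage_pct, similarity_pct), the threshold values, and the sample rows are hypothetical placeholders rather than the actual headers or contents of the files Lasergene writes, so adjust them to match your own output.

```python
import csv
import io

# Made-up example data; the column names are hypothetical placeholders,
# not the headers of any real Lasergene output file.
SAMPLE_TSV = (
    "gene\tcoverage_pct\tsimilarity_pct\n"
    "dnaA\t99.5\t97.2\n"
    "gyrB\t62.0\t91.4\n"
    "rpoB\t98.1\t74.9\n"
)


def load_homolog_table(handle):
    """Parse a tab-delimited table into a list of dicts with numeric statistics."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        {
            "gene": row["gene"],
            "coverage_pct": float(row["coverage_pct"]),
            "similarity_pct": float(row["similarity_pct"]),
        }
        for row in reader
    ]


def well_supported(rows, min_coverage=90.0, min_similarity=80.0):
    """Return gene names that meet both (arbitrary, illustrative) thresholds."""
    return [
        r["gene"]
        for r in rows
        if r["coverage_pct"] >= min_coverage and r["similarity_pct"] >= min_similarity
    ]


rows = load_homolog_table(io.StringIO(SAMPLE_TSV))
print(well_supported(rows))  # ['dnaA']
```

To read a real file instead of the inline sample, open it with `open(path, newline="")` and pass the file handle to `load_homolog_table`.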
Our new gene homology analysis workflow was used to solve a medical mystery even before the workflow was formally released in Lasergene 17.6. This innovative research was recently described in Frontiers in Cellular and Infection Microbiology. Click here to read the full article online.
Here’s a brief summary of the situation and the solution:
A lung transplant patient in Europe had an infection, but multiple attempts to culture bacteria from pus samples had failed. This meant that PCR was out of the question. To find a solution, Artur J. Sabat and his team of researchers in the Netherlands and Germany collaborated with DNASTAR scientist Tim Durfee and DNASTAR software developer Schuyler Baldwin.
First, samples from the lung infection were used to isolate the bacterium. Next-generation sequencing was then performed to obtain Illumina and Oxford Nanopore raw sequencing data. The raw data was assembled and aligned to a human genome using DNASTAR’s SeqMan NGen, a step that easily separated the human sequences from the non-human (bacterial) sequences.
The unaligned non-human reads were then assembled using the de novo assembly workflow in SeqMan NGen, which also performed genome polishing. Manual editing was carried out in SeqMan Ultra.
Finally, MegAlign Pro was used to analyze gene homology against bacterial references that included 42 Mycoplasma species. The phylogenetic trees generated using the homologs clearly showed enhanced resolution (i.e., improved differentiation) of closely related species within the Mycoplasma genus. The team successfully identified the patient’s bacterium as Mycoplasma faucium and was able to publish the first complete circular bacterial genome of that species. They also identified three mobile genetic elements in M. faucium that had never been reported before.
Another result of this study was the identification of resistance of the bacterium to tetracyclines, attributed to horizontal gene transfer. This finding is critical for the treatment of Mycoplasma infections, as fully antibiotic-resistant M. faucium can be anticipated in the near future. Lastly, such in-depth sequence analyses can also help us understand the virulence factors and defense mechanisms of these pathogens.
This collaboration is a great example of a quick and efficient use of bioinformatics applications to address challenges in the medical field. According to the authors, “this study represents the first-ever acquisition of a complete circularized bacterial genome directly from a patient sample obtained from invasive infection of a primary sterile site using culture-independent, PCR-free clinical metagenomics.” This approach could also open doors for analyzing other complex or unknown pathogens.
I highly recommend watching our webinar, Gene Homology at Scale, where my colleague, Matt Keyser, does a live demonstration of this workflow and discusses how to interpret the results.