Last month DNASTAR’s Senior Product Manager Matt Keyser participated in DNASTAR’s second Ask Me Anything, or AMA, on Reddit. An AMA is a crowdsourced interview where Reddit users can leave questions for the interviewee to answer and vote on other questions they would like to see answered. Matt answered quite a few questions on topics from genomics to his career path in bioinformatics. Here are some of the top questions Matt received.
Question: The genomics field seems to be getting a ton of attention these days due to CRISPR and next-gen sequencing tech. I was curious about the data science tools and algorithms used to compare genomes or in other words, to compare patients with various conditions. The idea is that by including a genetic sequencing into diagnostic steps, you can get a better understanding of what protein expressions etc. lead to people being susceptible or immune from certain diseases. In order to do this well, is generic clustering used? Are there specialized algorithms to do this? I guess the basic question is if you wanted to compare the DNA of various people in a study, how is that done from a software or data science approach?
Answer: There are many different tools in use. Patients have their DNA (or RNA) sequenced, and then you use software like DNASTAR to compare that sequence data to a reference genome. DNASTAR software is a bit unique in that most of the setup is done up front in SeqMan NGen. So, the user starts with raw sequence data and a couple of hours later has a set of variations for each patient, along with variant annotation that helps resolve the interesting variants. However, it still takes some scientific knowledge to further identify the truly impactful variants. Other software programs do similar things, but you usually need to run each step in the process separately, which may require scripting or at least a lot more user intervention. So, many users will pay to have a core facility process the raw data and generate VCF output, which can still be difficult to analyze but streamlines the workflow quite a bit.
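To make the idea of comparing sequence data to a reference concrete, here is a minimal Python sketch. It assumes the sample has already been aligned to the reference so that both strings have the same length, and it only reports single-base substitutions. The function name `call_substitutions` and the toy sequences are hypothetical; this is not how SeqMan NGen or any production variant caller works, since real pipelines also align reads, handle insertions and deletions, and weigh base quality and coverage.

```python
def call_substitutions(reference, sample, start=1):
    """Return (position, ref_base, alt_base) for each mismatching base.

    Assumes `sample` is already aligned to `reference` (equal length).
    Gap ('-') and unknown ('N') characters are skipped, not called.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")

    variants = []
    for offset, (ref_base, alt_base) in enumerate(zip(reference, sample)):
        if ref_base == alt_base:
            continue
        if ref_base in "-N" or alt_base in "-N":
            continue  # ignore gaps and ambiguous calls in this toy example
        variants.append((start + offset, ref_base, alt_base))
    return variants


# Toy data, invented for illustration only.
reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
substitutions = call_substitutions(reference, sample)
print(substitutions)
```

Running the same function over each patient's aligned sequence gives one list of differences per patient, which is roughly the kind of per-sample variant table the answer describes.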
Question: I’m an undergraduate student working in a molecular bio research lab. I’m currently doing bioinformatics and learning it on my own. I feel like most of it is just using existing data to produce more data. Also, where can I find accurate resources to mutate residues, predict secondary structure, and dock proteins and simulate? Alpha fold would be nice.
Answer: Bioinformatics can mean mining existing data, and there certainly is a lifetime's worth of data out there waiting for analysis. However, DNASTAR software is also widely used to assemble and align raw sequence data, that is, data generated from a real biological sample.
The DNASTAR Lasergene Protein package provides tools for protein design (introducing mutations), and then for predicting structural changes and protein docking interactions. There's a ton of information on our website about these tools. Also, our SeqBuilder Pro application offers a great UI for modifying PCR primers to introduce mutations for protein design.
Question: I enjoyed your webinars and thanks for the opportunity to ask. I would like to have some help on genomics. I have VCF results from a panel and now I have managed to annotate and see the SNPs on Array star. What are the next steps in order to see if there is a special mutation on my genes of interest? Because each gene has a lot of SNPs, how to choose which I have to go on? Do I have to compare all the samples for each SNPs, or I can select somehow the most important? I need kind of workflow after annotation.
Answer: That is the big challenge with variant analysis. Even with initial filtering, there still may be a lot more SNP variants remaining than is practical to analyze. One strategy is to compare samples and identify variants that are common or unique to the sample groups. Also, for some model organisms, DNASTAR provides Genome Template Packages that can be used at assembly time (or VCF file import time) to supply additional annotation for identifying important variants. For example, GERP scores assign a weight to each variant based on evolutionary conservation and can be used to identify variants that occur at more impactful locations. If you are working with human data, we also offer the Variant Annotation Database (VAD), which provides a huge amount of additional variant annotation from dbNSFP, MasterMind (Genomenon), and 1000 Genomes allele and genotype frequencies. We have a recently updated VAD page that explains these databases in more detail.
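The two strategies mentioned above, comparing sample groups and then prioritizing by a conservation score, can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the function names, the `(chromosome, position, ref, alt)` variant key, the toy data, and the score cutoff of 2.0 are assumptions made for the example, not DNASTAR settings or recommended thresholds.

```python
from collections import Counter


def group_specific_variants(cases, controls, min_case_fraction=1.0):
    """Variants seen in at least `min_case_fraction` of case samples
    and in none of the control samples.

    `cases` and `controls` map a sample name to a set of variant keys.
    """
    control_variants = set()
    for variants in controls.values():
        control_variants |= variants

    counts = Counter(v for variants in cases.values() for v in variants)
    needed = min_case_fraction * len(cases)
    return {v for v, n in counts.items()
            if n >= needed and v not in control_variants}


def filter_by_conservation(variants, scores, min_score=2.0):
    """Keep variants whose conservation score meets the cutoff.

    `scores` maps a variant key to a GERP-like score; variants without
    a score are dropped. The default cutoff is an assumption.
    """
    return sorted(v for v in variants
                  if v in scores and scores[v] >= min_score)


# Toy data, invented for illustration only.
v1 = ("chr1", 1000, "A", "G")
v2 = ("chr1", 2000, "C", "T")
v3 = ("chr2", 3000, "G", "A")
v4 = ("chr2", 4000, "T", "C")

cases = {"case_1": {v1, v2, v3}, "case_2": {v1, v2, v4}}
controls = {"control_1": {v2}}
scores = {v1: 4.5, v2: 5.1, v3: 0.3, v4: 3.2}

shared_in_cases = group_specific_variants(cases, controls)
candidates = filter_by_conservation(shared_in_cases, scores)
print(candidates)
```

Lowering `min_case_fraction` keeps variants found in only some of the cases, which is often more realistic than requiring every case to carry the same variant.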
Question: Sort of a general question from a layman but, do you have any insights on the progress being made confidently predicting epigenetic expression?
Answer: I think real-time single-molecule sequencing (PacBio and Nanopore) provides a means to directly detect nucleotide modifications and is a significant improvement over previous methods for epigenetic analysis. I do not have much personal experience here, but the DNASTAR development team is focused on supporting long-read analysis tools, and this is an area where the latest generation of sequencing provides many advantages.
Question: Hi Matt, How long do you think before the entire process from raw reads to variant interpretation is entirely automated in the biomedical field?
Answer: It is certainly moving in that direction, and small assays or panels can already be fully automated when the specific variants are already known. However, variant interpretation is another thing. A typical human exome assembly will yield 10-15K variants, and even after several layers of filtering, you are still left with dozens if not hundreds of variants that are potentially interesting. I think interpreting even this filtered subset of variants will require knowledgeable human intervention, at least for the next few years.
Learn more about our genomics, protein and molecular biology workflows on our website or head over to Reddit to see the full list of answered questions.