RemNote Community
Community

Large‑Scale and Biological Sequencing Applications

Understand the differences between exome and whole‑genome sequencing, the principles and challenges of RNA‑Seq, and the main methods for protein sequencing and related concepts.
Summary
Read Summary
Flashcards
Save Flashcards
Quiz
Take Quiz

Quick Practice

What specific subset of DNA does exome sequencing target?
1 of 11

Summary

Understanding Large-Scale Sequencing Methods Introduction Determining the sequences of DNA, RNA, and proteins is fundamental to modern biology, but doing so at scale requires different approaches depending on the target molecule and biological question. This section explores the major sequencing methodologies used to study genomes, transcriptomes (the collection of all expressed RNA), and proteins. DNA Sequencing at Scale Exome Versus Whole Genome Sequencing When we sequence DNA, we can take two different approaches based on our research goals. Exome sequencing focuses narrowly on the exome—the subset of DNA that actually encodes proteins. Since only about 1-2% of the human genome codes for proteins, exome sequencing is much faster and less expensive than sequencing everything. This approach is useful when you suspect a disease-causing mutation likely affects protein-coding genes. Whole genome sequencing, by contrast, determines the complete nuclear DNA sequence of an organism. This gives you the full picture, including regulatory regions, introns, and other non-coding sequences. While more comprehensive, it requires more time and resources. The choice between these approaches depends on your research question. If you're hunting for disease mutations that affect proteins, exome sequencing is efficient. If you want to understand regulatory changes or structural variants, whole genome sequencing is necessary. Modern Sequencing Technologies The development of second-generation (next-generation) and third-generation sequencing systems has transformed genomics. Technologies from companies like Illumina, 454, Applied Biosystems Inc, and others have become increasingly cost-effective alternatives to traditional Sanger sequencing (shown in the image above, which demonstrates the classic "ladder" pattern of DNA fragments separated by size). One particularly important technology is pyrosequencing, which can generate several-fold coverage of a bacterial genome in a single run. This means the entire genome gets sequenced multiple times, allowing for error correction and high confidence in the results. <extrainfo> The specific brand names and technical details of various sequencing platforms are less critical to understand than the conceptual advantage they provide: massive parallelization of sequencing reactions in a single run, making large-scale genomic projects feasible. </extrainfo> RNA Sequencing Why Reverse Transcription is Necessary RNA is chemically unstable compared to DNA, and standard sequencing methods are optimized for DNA. Therefore, to sequence RNA, extracted RNA molecules must first be reverse transcribed into complementary DNA (cDNA) fragments. This cDNA can then be sequenced using standard DNA sequencing methods. This reverse transcription step is critical because it converts the fragile RNA molecule into stable cDNA that sequencers can handle. Focusing on Messenger RNA Not all RNA is equally informative. Cells contain many types of RNA: messenger RNA (mRNA), which carries protein-coding instructions; ribosomal RNA (rRNA), which is structural; and small regulatory RNAs, among others. To identify which genes are actively expressed in a cell, researchers can enrich for mRNA by removing rRNA and small RNA in the laboratory. This enrichment process strips away the less informative sequences and leaves behind the mRNA that represents the genes actually being used by the cell. RNA-Seq: Reading Expression Profiles The result of sequencing enriched mRNA populations is called RNA-Seq (when using next-generation sequencing methods). The RNA-Seq profile tells you which genes are actively expressed in cells at a particular time or under particular conditions. This is invaluable for understanding disease mechanisms or how cells respond to stimuli. For example, comparing the RNA-Seq profile of cancer cells to healthy cells reveals which genes are abnormally active in cancer—potential therapeutic targets. A related technique, microRNA sequencing, specifically sequences small regulatory RNAs that control gene expression. The Complexity of Eukaryotic mRNA Mapping Here's a tricky aspect unique to eukaryotic organisms: eukaryotic mRNA sequences are often non-colinear with their DNA templates. This means the mRNA doesn't perfectly match the DNA in the genome, even for the same gene. Why? Eukaryotic genes contain introns—non-coding sequences inserted within the protein-coding sequence. During RNA processing, these introns are removed through a process called splicing, leaving only the exons (coding sequences) in the mature mRNA. When you sequence the mRNA and try to map it back to the genome, the reads don't align directly because the introns are present in the genomic DNA but absent in the mRNA. This makes data analysis more complex: the bioinformatics pipeline must account for exon-intron boundaries and may need to reconstruct which exons were joined together during splicing. Protein Sequencing Direct Versus Indirect Protein Sequencing Proteins can be sequenced directly through several methods: Edman degradation (which removes amino acids one at a time from the N-terminus), peptide mass fingerprinting (identifying peptides by their mass), mass spectrometry (measuring protein fragments), and protease digests (cutting proteins and analyzing the pieces). However, here's a crucial practical point: when the gene encoding a protein is already known, sequencing the DNA and computationally inferring the protein sequence is usually easier than direct protein sequencing. This is because the genetic code—the rules governing how nucleotide sequences translate to amino acid sequences—allows you to predict the protein directly from the DNA sequence. This DNA-based approach is faster, cheaper, and less technically demanding than isolating a protein and sequencing it directly. Partial Protein Sequencing for Gene Identification Sometimes you don't need the entire protein sequence. Determining only part of a protein's amino acid sequence can be sufficient to identify the gene clone that encodes it. By sequencing just the first 20-30 amino acids of an unknown protein, researchers can search databases to find the matching gene, which can then be cloned for further study. <extrainfo> Related Concepts Worth Knowing The Genetic Code: The genetic code describes how nucleotide sequences are translated into amino acid sequences. There are 64 possible three-nucleotide combinations (codons), which encode the 20 standard amino acids plus start and stop signals. Understanding this code is essential for predicting proteins from DNA sequences. Sequence Motif: A sequence motif is a short recurring pattern in DNA or protein sequences. Motifs often indicate functional domains—parts of the protein that perform specific functions. Recognizing motifs helps researchers predict what a newly discovered gene might do. Pathogenomics: Pathogenomics studies the genomes of disease-causing organisms. This field is particularly important for understanding bacterial and viral pathogens and developing treatments. While interesting, this is more of a specific application area than a core methodological concept. </extrainfo>
Flashcards
What specific subset of DNA does exome sequencing target?
All protein-coding genes
What is the primary goal of whole genome sequencing?
To determine the complete nuclear DNA sequence of an organism
Into what form must RNA be converted before it can be sequenced?
Complementary DNA (cDNA) fragments
What process is used to turn extracted RNA into complementary DNA for sequencing?
Reverse transcription
What does an RNA sequencing profile indicate about a cell's state?
Which genes are actively being expressed
Why is eukaryotic messenger RNA often non-colinear with its DNA template?
Because introns are removed during splicing
What is the specific term used for whole transcriptome sequencing using next-generation methods?
RNA-Seq
What can be achieved by determining even a partial amino-acid sequence of a protein?
Identification of the gene clone that encodes it
What process does the genetic code describe?
How nucleotide sequences are translated into amino-acid sequences
What is the primary focus of study in pathogenomics?
The genomes of disease-causing organisms
What is a sequence motif in the context of DNA or protein sequences?
A short recurring pattern

Quiz

Which of the following is a commonly used method for determining protein sequences?
1 of 10
Key Concepts
Sequencing Techniques
Exome sequencing
Whole genome sequencing
Next‑generation sequencing
Pyrosequencing
RNA‑Seq
microRNA sequencing
Protein Analysis
Edman degradation
Mass spectrometry (protein sequencing)
Genetic code
Genomic Studies
Pathogenomics
Sequence motif