Large‑Scale and Biological Sequencing Applications
Understand the differences between exome and whole‑genome sequencing, the principles and challenges of RNA‑Seq, and the main methods for protein sequencing and related concepts.
Summary
Read Summary
Flashcards
Save Flashcards
Quiz
Take Quiz
Quick Practice
What specific subset of DNA does exome sequencing target?
1 of 11
Summary
Understanding Large-Scale Sequencing Methods
Introduction
Determining the sequences of DNA, RNA, and proteins is fundamental to modern biology, but doing so at scale requires different approaches depending on the target molecule and biological question. This section explores the major sequencing methodologies used to study genomes, transcriptomes (the collection of all expressed RNA), and proteins.
DNA Sequencing at Scale
Exome Versus Whole Genome Sequencing
When we sequence DNA, we can take two different approaches based on our research goals.
Exome sequencing focuses narrowly on the exome—the subset of DNA that actually encodes proteins. Since only about 1-2% of the human genome codes for proteins, exome sequencing is much faster and less expensive than sequencing everything. This approach is useful when you suspect a disease-causing mutation likely affects protein-coding genes.
Whole genome sequencing, by contrast, determines the complete nuclear DNA sequence of an organism. This gives you the full picture, including regulatory regions, introns, and other non-coding sequences. While more comprehensive, it requires more time and resources.
The choice between these approaches depends on your research question. If you're hunting for disease mutations that affect proteins, exome sequencing is efficient. If you want to understand regulatory changes or structural variants, whole genome sequencing is necessary.
Modern Sequencing Technologies
The development of second-generation (next-generation) and third-generation sequencing systems has transformed genomics. Technologies from companies like Illumina, 454, Applied Biosystems Inc, and others have become increasingly cost-effective alternatives to traditional Sanger sequencing (shown in the image above, which demonstrates the classic "ladder" pattern of DNA fragments separated by size).
One particularly important technology is pyrosequencing, which can generate several-fold coverage of a bacterial genome in a single run. This means the entire genome gets sequenced multiple times, allowing for error correction and high confidence in the results.
<extrainfo>
The specific brand names and technical details of various sequencing platforms are less critical to understand than the conceptual advantage they provide: massive parallelization of sequencing reactions in a single run, making large-scale genomic projects feasible.
</extrainfo>
RNA Sequencing
Why Reverse Transcription is Necessary
RNA is chemically unstable compared to DNA, and standard sequencing methods are optimized for DNA. Therefore, to sequence RNA, extracted RNA molecules must first be reverse transcribed into complementary DNA (cDNA) fragments. This cDNA can then be sequenced using standard DNA sequencing methods.
This reverse transcription step is critical because it converts the fragile RNA molecule into stable cDNA that sequencers can handle.
Focusing on Messenger RNA
Not all RNA is equally informative. Cells contain many types of RNA: messenger RNA (mRNA), which carries protein-coding instructions; ribosomal RNA (rRNA), which is structural; and small regulatory RNAs, among others.
To identify which genes are actively expressed in a cell, researchers can enrich for mRNA by removing rRNA and small RNA in the laboratory. This enrichment process strips away the less informative sequences and leaves behind the mRNA that represents the genes actually being used by the cell.
RNA-Seq: Reading Expression Profiles
The result of sequencing enriched mRNA populations is called RNA-Seq (when using next-generation sequencing methods). The RNA-Seq profile tells you which genes are actively expressed in cells at a particular time or under particular conditions. This is invaluable for understanding disease mechanisms or how cells respond to stimuli.
For example, comparing the RNA-Seq profile of cancer cells to healthy cells reveals which genes are abnormally active in cancer—potential therapeutic targets.
A related technique, microRNA sequencing, specifically sequences small regulatory RNAs that control gene expression.
The Complexity of Eukaryotic mRNA Mapping
Here's a tricky aspect unique to eukaryotic organisms: eukaryotic mRNA sequences are often non-colinear with their DNA templates. This means the mRNA doesn't perfectly match the DNA in the genome, even for the same gene.
Why? Eukaryotic genes contain introns—non-coding sequences inserted within the protein-coding sequence. During RNA processing, these introns are removed through a process called splicing, leaving only the exons (coding sequences) in the mature mRNA. When you sequence the mRNA and try to map it back to the genome, the reads don't align directly because the introns are present in the genomic DNA but absent in the mRNA.
This makes data analysis more complex: the bioinformatics pipeline must account for exon-intron boundaries and may need to reconstruct which exons were joined together during splicing.
Protein Sequencing
Direct Versus Indirect Protein Sequencing
Proteins can be sequenced directly through several methods: Edman degradation (which removes amino acids one at a time from the N-terminus), peptide mass fingerprinting (identifying peptides by their mass), mass spectrometry (measuring protein fragments), and protease digests (cutting proteins and analyzing the pieces).
However, here's a crucial practical point: when the gene encoding a protein is already known, sequencing the DNA and computationally inferring the protein sequence is usually easier than direct protein sequencing. This is because the genetic code—the rules governing how nucleotide sequences translate to amino acid sequences—allows you to predict the protein directly from the DNA sequence.
This DNA-based approach is faster, cheaper, and less technically demanding than isolating a protein and sequencing it directly.
Partial Protein Sequencing for Gene Identification
Sometimes you don't need the entire protein sequence. Determining only part of a protein's amino acid sequence can be sufficient to identify the gene clone that encodes it. By sequencing just the first 20-30 amino acids of an unknown protein, researchers can search databases to find the matching gene, which can then be cloned for further study.
<extrainfo>
Related Concepts Worth Knowing
The Genetic Code: The genetic code describes how nucleotide sequences are translated into amino acid sequences. There are 64 possible three-nucleotide combinations (codons), which encode the 20 standard amino acids plus start and stop signals. Understanding this code is essential for predicting proteins from DNA sequences.
Sequence Motif: A sequence motif is a short recurring pattern in DNA or protein sequences. Motifs often indicate functional domains—parts of the protein that perform specific functions. Recognizing motifs helps researchers predict what a newly discovered gene might do.
Pathogenomics: Pathogenomics studies the genomes of disease-causing organisms. This field is particularly important for understanding bacterial and viral pathogens and developing treatments. While interesting, this is more of a specific application area than a core methodological concept.
</extrainfo>
Flashcards
What specific subset of DNA does exome sequencing target?
All protein-coding genes
What is the primary goal of whole genome sequencing?
To determine the complete nuclear DNA sequence of an organism
Into what form must RNA be converted before it can be sequenced?
Complementary DNA (cDNA) fragments
What process is used to turn extracted RNA into complementary DNA for sequencing?
Reverse transcription
What does an RNA sequencing profile indicate about a cell's state?
Which genes are actively being expressed
Why is eukaryotic messenger RNA often non-colinear with its DNA template?
Because introns are removed during splicing
What is the specific term used for whole transcriptome sequencing using next-generation methods?
RNA-Seq
What can be achieved by determining even a partial amino-acid sequence of a protein?
Identification of the gene clone that encodes it
What process does the genetic code describe?
How nucleotide sequences are translated into amino-acid sequences
What is the primary focus of study in pathogenomics?
The genomes of disease-causing organisms
What is a sequence motif in the context of DNA or protein sequences?
A short recurring pattern
Quiz
Large‑Scale and Biological Sequencing Applications Quiz Question 1: Which of the following is a commonly used method for determining protein sequences?
- Edman degradation (correct)
- Southern blot analysis
- Polymerase chain reaction (PCR)
- Northern blot analysis
Large‑Scale and Biological Sequencing Applications Quiz Question 2: Why might determining only a portion of a protein’s amino‑acid sequence be useful?
- It can identify the corresponding gene clone (correct)
- It reveals the protein’s full three‑dimensional structure
- It determines the protein’s enzymatic activity
- It predicts post‑translational modifications
Large‑Scale and Biological Sequencing Applications Quiz Question 3: Exome sequencing is designed to capture DNA that encodes which category of genes?
- Protein‑coding genes (correct)
- Ribosomal RNA genes
- MicroRNA genes
- Intronic non‑coding regions
Large‑Scale and Biological Sequencing Applications Quiz Question 4: Whole genome sequencing provides the sequence of which cellular component?
- The complete nuclear DNA (correct)
- Mitochondrial DNA only
- Only protein‑coding regions
- Only ribosomal RNA genes
Large‑Scale and Biological Sequencing Applications Quiz Question 5: RNA‑Seq expression profiles are primarily used to identify what?
- Actively expressed genes (correct)
- DNA methylation patterns
- Chromosome number
- Protein tertiary structure
Large‑Scale and Biological Sequencing Applications Quiz Question 6: Sequencing of all cellular RNAs with next‑generation methods is known as what?
- RNA‑Seq (correct)
- ChIP‑Seq
- ATAC‑Seq
- Ribo‑Seq
Large‑Scale and Biological Sequencing Applications Quiz Question 7: The genetic code specifies the relationship between which two biological sequences?
- Nucleotide triplets and amino acids (correct)
- DNA double helix and histones
- RNA polymerase and ribosome
- Promoter regions and transcription factors
Large‑Scale and Biological Sequencing Applications Quiz Question 8: What term describes the relationship between eukaryotic messenger RNA and its DNA template after introns have been removed?
- Non‑colinear (correct)
- Complementary
- Identical
- Reverse‑oriented
Large‑Scale and Biological Sequencing Applications Quiz Question 9: When the gene encoding a protein is known, why is sequencing the DNA usually preferred over direct protein sequencing?
- DNA sequencing is generally easier and more efficient (correct)
- Protein sequencing provides more accurate amino‑acid identification
- DNA cannot be amplified, making it more reliable
- Protein sequencing yields information about gene regulation
Large‑Scale and Biological Sequencing Applications Quiz Question 10: Pathogenomics primarily studies the genomes of which type of organisms?
- Disease‑causing organisms (correct)
- Beneficial symbionts
- Endangered species
- Model laboratory organisms
Which of the following is a commonly used method for determining protein sequences?
1 of 10
Key Concepts
Sequencing Techniques
Exome sequencing
Whole genome sequencing
Next‑generation sequencing
Pyrosequencing
RNA‑Seq
microRNA sequencing
Protein Analysis
Edman degradation
Mass spectrometry (protein sequencing)
Genetic code
Genomic Studies
Pathogenomics
Sequence motif
Definitions
Exome sequencing
A targeted DNA sequencing method that captures and reads all protein‑coding regions of the genome.
Whole genome sequencing
The comprehensive determination of an organism’s entire nuclear DNA sequence.
Next‑generation sequencing
High‑throughput DNA sequencing technologies, including second‑ and third‑generation platforms, that dramatically reduce cost and time.
Pyrosequencing
A sequencing‑by‑synthesis technique that detects pyrophosphate release to generate short reads, often used for bacterial genomes.
RNA‑Seq
Whole‑transcriptome sequencing that quantifies and characterizes RNA molecules using next‑generation sequencing.
microRNA sequencing
A specialized RNA‑Seq approach that profiles small regulatory RNAs (microRNAs).
Edman degradation
A stepwise chemical method for determining the N‑terminal amino‑acid sequence of a peptide.
Mass spectrometry (protein sequencing)
An analytical technique that measures peptide masses to infer protein sequences and modifications.
Genetic code
The set of rules by which nucleotide triplets (codons) are translated into specific amino acids during protein synthesis.
Pathogenomics
The study of the genomes of pathogenic organisms to understand disease mechanisms and evolution.
Sequence motif
A short, recurring pattern in DNA or protein sequences that often has a biological function.