Introduction to Biological Databases
Understand the major categories of biological databases, key resources within each type, and how they enable modern bioinformatics applications.
Summary
What Are Biological Databases?
A biological database is a structured, computer-accessible collection of information about living organisms. Think of it as a digital library specifically designed for molecular and genetic information. Before databases became widespread, researchers kept experimental results in paper notebooks, which made those results hard to search, share, or compare. Today, biological databases store information in standardized formats that researchers anywhere in the world can access, download, and analyze.
The key advantage of standardized formats is data reuse. Instead of repeating experiments to verify results or answer new questions, researchers can access previously collected data, compare their findings with others' work, and build new knowledge on top of existing discoveries. This democratization of data accelerates scientific progress and prevents wasted effort.
How Biological Databases Are Organized
Biological databases are categorized by the type of data they store. Understanding these categories will help you know which database to use for different research questions:
Sequence databases store the raw genetic information: strings of nucleotides (DNA/RNA) or amino acids (proteins)
Structure databases contain three-dimensional models showing how proteins and other macromolecules are physically shaped
Annotation and functional databases add biological meaning to raw sequences—explaining what genes do and how they work
Expression and variation databases capture quantitative measurements of gene activity and catalog natural genetic differences between individuals
Each category serves a different purpose, and researchers often use multiple databases together to answer complex questions.
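To make "raw strings of nucleotides or amino acids" concrete, here is a minimal sketch of reading sequence records from FASTA text, a plain-text format widely used for exchanging sequences. FASTA is not mentioned in this summary, and the function name and both records below are made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: take the first whitespace-delimited token as the identifier.
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            # Sequence line: a record's sequence may span several lines.
            records[current_id].append(line)
    return {seq_id: "".join(chunks) for seq_id, chunks in records.items()}


# Made-up records for illustration only; these are not real database entries.
EXAMPLE_FASTA = """\
>toy_dna_1 hypothetical nucleotide record
ATGGCCATTG
TAATGGGCCG
>toy_protein_1 hypothetical protein record
MAIVMGR
"""

sequences = parse_fasta(EXAMPLE_FASTA)
print(sequences["toy_dna_1"])
print(len(sequences["toy_protein_1"]))
```

Each record is just an identifier plus a string, which is why sequence data is so easy to search, compare, and pass between tools.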
Sequence Databases: Storing Raw Genetic Information
GenBank and UniProt
The two foundational sequence databases you need to know are:
GenBank is the primary repository for DNA and RNA sequences. Researchers who discover or study new genes submit their sequences to GenBank, where they receive a stable, citable identifier that uniquely identifies that sequence. This means future researchers can refer to the exact sequence by its GenBank accession number, ensuring reproducibility and clarity.
UniProt plays the corresponding role for protein sequences. Most of its entries are derived by translating coding sequences from the nucleotide archives, and a curated subset (Swiss-Prot) is manually reviewed. Like GenBank, UniProt assigns a stable accession to each protein entry.
Why does this matter? Before centralized sequence databases, comparing your newly discovered gene with known genes required tedious manual searches through published papers. Now you can search millions of sequences instantly and download data for further analysis.
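Stable accession numbers also make retrieval scriptable. The sketch below only builds a request URL for NCBI's E-utilities EFetch service and does not send it. The helper name is hypothetical, and U49845 is used here as an example accession; consult NCBI's own documentation for current parameters and usage policies before sending real requests.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession):
    """Return an EFetch URL requesting one nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",     # NCBI nucleotide database
        "id": accession,     # accession number of the record to retrieve
        "rettype": "fasta",  # ask for FASTA rather than the full flat file
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


url = build_efetch_url("U49845")
print(url)
```

Because the accession is the only thing that changes between requests, the same function works for any record a paper cites.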
Structure Databases: Understanding Molecular Shape
The Protein Data Bank (PDB)
While sequence databases tell you the order of building blocks (nucleotides or amino acids), the Protein Data Bank (PDB) shows you the actual three-dimensional structure of macromolecules. Researchers determine these structures using three experimental techniques:
X-ray crystallography diffracts X-rays off crystallized proteins and uses the resulting diffraction pattern to determine atomic positions
Nuclear magnetic resonance (NMR) spectroscopy uses magnetic fields to determine atomic positions in solution
Cryo-electron microscopy freezes proteins at ultra-low temperatures and images them with electron beams
The PDB is crucial because a molecule's shape directly determines its function. A protein's three-dimensional structure explains how it binds to other molecules, catalyzes reactions, or responds to signals. By exploring PDB structures, researchers can understand not just what a protein sequence is, but how that sequence creates a functional machine.
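A structure entry is, at its core, a list of atoms with three-dimensional coordinates. The sketch below reads ATOM records in the legacy fixed-column PDB text format and measures the distance between two atoms. The coordinates are made up for illustration, the function name is hypothetical, and only a few of the format's columns are read.

```python
import math

# Two made-up ATOM records laid out in the fixed-column PDB text format,
# plus one non-ATOM line that the parser should skip.
SAMPLE_LINES = [
    "REMARK made-up coordinates for illustration only",
    "ATOM      1  N   MET A   1       1.000   2.000   3.000",
    "ATOM      2  CA  MET A   1       2.000   4.000   5.000",
]


def parse_atom_line(line):
    """Extract a few fields from one ATOM record using fixed column positions."""
    return {
        "name": line[12:16].strip(),     # atom name, columns 13-16
        "residue": line[17:20].strip(),  # residue name, columns 18-20
        "chain": line[21],               # chain identifier, column 22
        "x": float(line[30:38]),         # coordinates in angstroms, columns 31-54
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


atoms = [parse_atom_line(line) for line in SAMPLE_LINES if line.startswith("ATOM")]

first, second = atoms
distance = math.dist(
    (first["x"], first["y"], first["z"]),
    (second["x"], second["y"], second["z"]),
)
print(f"{first['name']} to {second['name']}: {distance:.1f} angstroms")
```

Distances like this one are the starting point for asking which atoms are close enough to interact, which is how shape gets connected to function.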
Annotation and Functional Databases: Adding Biological Meaning
Knowing a gene's sequence or structure is only the beginning. To truly understand what a gene does, you need annotation databases that add biological context.
Gene Ontology (GO)
The Gene Ontology (GO) provides a standardized vocabulary for describing what genes and their protein products do. Without standardization, different researchers might describe the same function using different words, making comparisons difficult. GO solves this by providing consistent terms across three categories:
Biological process: What cellular or organismal activity does this gene contribute to? (e.g., "glucose metabolic process," "apoptotic process")
Molecular function: What molecular activity does the protein perform? (e.g., "catalytic activity," "DNA binding")
Cellular component: Where in the cell does this gene product work? (e.g., "mitochondrion," "plasma membrane")
This standardization allows researchers to search and compare genes systematically.
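One way to picture GO annotations is as a list of (identifier, term, category) entries attached to a gene product. The sketch below groups such entries by category. The three GO terms are real, but their assignment to a single gene product is hypothetical, and the function name is an assumption for this example.

```python
from collections import defaultdict

# (GO identifier, term name, category) entries for a hypothetical gene product.
annotations = [
    ("GO:0006915", "apoptotic process", "biological_process"),
    ("GO:0003677", "DNA binding", "molecular_function"),
    ("GO:0005739", "mitochondrion", "cellular_component"),
]


def group_by_category(entries):
    """Group GO term names under the category each term belongs to."""
    grouped = defaultdict(list)
    for go_id, term, category in entries:
        grouped[category].append(f"{term} ({go_id})")
    return dict(grouped)


by_category = group_by_category(annotations)
for category, terms in by_category.items():
    print(category, "->", ", ".join(terms))
```

Because every researcher uses the same identifiers, two genes annotated with GO:0003677 can be matched automatically even if their original papers described the function in different words.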
Pathway Databases: KEGG and Reactome
While GO describes individual gene functions, KEGG (Kyoto Encyclopedia of Genes and Genomes) and Reactome map how genes and their protein products work together in metabolic pathways and signaling cascades. Instead of viewing genes in isolation, these databases show you how enzymes and metabolites connect in chains of reactions—for example, the complete pathway of how your cells break down glucose for energy.
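A pathway can be thought of as a chain of reactions in which each product becomes the next substrate. The sketch below walks the first three steps of glucose breakdown (glycolysis) stored as a simple lookup table. This dictionary layout is an assumption for illustration and is not how KEGG or Reactome actually store their data.

```python
# Each entry maps a substrate to (enzyme, product) for one reaction step.
reactions = {
    "glucose": ("hexokinase", "glucose 6-phosphate"),
    "glucose 6-phosphate": ("glucose-6-phosphate isomerase", "fructose 6-phosphate"),
    "fructose 6-phosphate": ("phosphofructokinase", "fructose 1,6-bisphosphate"),
}


def trace_pathway(reaction_table, start):
    """Follow reactions from a starting metabolite until no further step is known."""
    steps = []
    current = start
    visited = set()
    while current in reaction_table and current not in visited:
        visited.add(current)  # guard against cycles in the table
        enzyme, product = reaction_table[current]
        steps.append((current, enzyme, product))
        current = product
    return steps


pathway = trace_pathway(reactions, "glucose")
for substrate, enzyme, product in pathway:
    print(f"{substrate} --[{enzyme}]--> {product}")
```

Real pathways branch and loop, so production tools use full graph structures, but the idea of linking enzymes and metabolites step by step is the same.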
<extrainfo>
STRING is another important annotation resource. It maps protein-protein interactions, showing which proteins physically bind to or functionally interact with each other.
</extrainfo>
Expression and Variation Databases: Capturing Differences
Expression Databases
While sequence databases show what genes exist, expression databases show when and where genes are active. Two major resources capture this quantitative information:
Gene Expression Omnibus (GEO) and ArrayExpress both store measurements of RNA or protein levels across different tissues, developmental stages, disease states, or experimental conditions. These databases are essential for understanding how the same gene can produce different outcomes depending on cellular context.
For example, a gene might be highly active in brain tissue but silent in liver tissue. Expression databases reveal these patterns, helping researchers understand gene regulation and disease mechanisms.
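The brain-versus-liver example can be sketched as a small table of expression values per tissue. All gene names and numbers below are made up, the units are arbitrary, and the threshold of 1.0 is an arbitrary cutoff chosen for this illustration.

```python
# Made-up expression levels (arbitrary units) for two hypothetical genes.
expression = {
    "GENE_A": {"brain": 85.0, "liver": 0.2, "heart": 12.5},
    "GENE_B": {"brain": 3.1, "liver": 140.0, "heart": 2.4},
}


def most_active_tissue(profile):
    """Return the tissue with the highest expression value."""
    return max(profile, key=profile.get)


def active_tissues(profile, threshold=1.0):
    """Return tissues whose expression meets or exceeds the threshold, sorted by name."""
    return sorted(tissue for tissue, level in profile.items() if level >= threshold)


for gene, profile in expression.items():
    print(gene, "peaks in", most_active_tissue(profile),
          "and is active in", active_tissues(profile))
```

Real expression data sets contain thousands of genes and many samples per tissue, but the underlying question (where is this gene on, and where is it off?) is the one shown here.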
Variation Databases
Humans don't all have identical DNA. Genetic variation databases catalog natural differences in DNA sequences found across populations.
The Database of Single Nucleotide Polymorphisms (dbSNP) catalogs the most common type of variation: single nucleotide polymorphisms (SNPs), where a single DNA letter differs between individuals. This database helps researchers understand genetic diversity and identify variants associated with disease risk.
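To see what "a single DNA letter differs" means in practice, the sketch below compares two already-aligned sequences of equal length and reports each position where they differ. Both sequences are made up, and the function name is hypothetical. Real variant calling involves alignment, quality filtering, and population data, none of which is modeled here.

```python
def find_single_base_differences(reference, sample):
    """Return (position, reference_base, sample_base) for each mismatch.

    Positions are 1-based. Both sequences must be aligned and of equal length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Made-up sequences that differ at a single position.
reference = "ATGGCCATTGTA"
sample = "ATGGCTATTGTA"

differences = find_single_base_differences(reference, sample)
print(differences)
```

A variation database stores records of this kind (where the difference is, and which letters are involved) together with information such as how often each version appears in different populations.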
ClinVar (the Clinical Variation database) focuses specifically on genetic variants with medical significance. It records how each variant relates to disease, for example whether it is reported as disease-causing or benign, and so supports the clinical interpretation of genetic test results.
Why Biological Databases Matter: Integration and Applications
The true power of biological databases emerges when they work together. Modern bioinformatics pipelines integrate multiple databases to answer complex questions:
A researcher might search GenBank for a newly discovered gene sequence, then use that sequence to find homologous proteins in UniProt
They might retrieve the three-dimensional structure from PDB to understand molecular interactions
They could annotate the gene using Gene Ontology to identify its biological function
They might check KEGG or Reactome to see how it fits into known cellular pathways
Finally, they could examine expression databases to see where and when the gene is active
This integrated approach enables sophisticated applications like comparative genomics (comparing genes across species), drug discovery (finding proteins to target with therapeutic compounds), and personalized medicine (understanding how genetic variants affect individual patients).
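The workflow above can be sketched as a chain of lookups, where an identifier from one resource becomes the key for the next. Everything below is a toy stand-in: the dictionaries replace real database queries, and every identifier and value is made up.

```python
# Toy stand-ins for each resource, keyed by made-up identifiers.
SEQUENCE_TO_PROTEIN = {"TOY_NUC_001": "TOY_PROT_001"}
PROTEIN_TO_STRUCTURES = {"TOY_PROT_001": ["TOY_STRUCT_1"]}
PROTEIN_TO_GO_TERMS = {"TOY_PROT_001": ["DNA binding"]}
PROTEIN_TO_PATHWAYS = {"TOY_PROT_001": ["toy signaling pathway"]}
PROTEIN_TO_EXPRESSION = {"TOY_PROT_001": {"brain": 85.0, "liver": 0.2}}


def build_gene_report(nucleotide_id):
    """Collect what each toy resource knows about one sequence record."""
    protein_id = SEQUENCE_TO_PROTEIN.get(nucleotide_id)
    if protein_id is None:
        return None  # no linked protein, so nothing further can be looked up
    return {
        "sequence": nucleotide_id,
        "protein": protein_id,
        "structures": PROTEIN_TO_STRUCTURES.get(protein_id, []),
        "go_terms": PROTEIN_TO_GO_TERMS.get(protein_id, []),
        "pathways": PROTEIN_TO_PATHWAYS.get(protein_id, []),
        "expression": PROTEIN_TO_EXPRESSION.get(protein_id, {}),
    }


report = build_gene_report("TOY_NUC_001")
for field, value in report.items():
    print(f"{field}: {value}")
```

The key design point is that each step depends only on a stable identifier from the previous one, which is why shared, stable identifiers matter so much for integration.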
Biological databases have become the foundational infrastructure of modern biological science. Rather than repeating experiments, researchers build on the collective knowledge stored in these standardized, accessible repositories.
Flashcards
What is the general definition of a biological database?
A structured collection of information about living organisms that can be stored, searched, and retrieved using computers.
On what basis are biological databases generally grouped or categorized?
The type of data they store.
What specific types of data are stored in sequence databases?
Raw nucleotide or amino acid strings.
What is the function of the Genetic Sequence Database (GenBank)?
It stores DNA and RNA sequences submitted by researchers.
What is the function of the Universal Protein Resource (UniProt)?
It stores protein amino acid sequences along with curated functional information.
What is provided to each submitted sequence in a database to allow it to be cited in future work?
A stable identifier.
What kind of data is contained within a structure database?
Three-dimensional models of macromolecules.
The Protein Data Bank (PDB) stores structures obtained by which three experimental methods?
X-ray crystallography
Nuclear magnetic resonance (NMR) spectroscopy
Cryogenic electron microscopy (cryo-EM)
What biological relationship can researchers explore using the Protein Data Bank?
How a macromolecule's shape relates to its function.
What is the purpose of annotation and functional databases in relation to raw sequences?
They add biological meaning to the raw sequences.
The Gene Ontology (GO) provides standardized terms for which three roles of a gene product?
Specific biological processes
Molecular functions
Cellular components
What is the primary role of the Kyoto Encyclopedia of Genes and Genomes (KEGG)?
Mapping enzymes and metabolites into metabolic or signaling pathways.
Besides KEGG, which other database maps enzymes and metabolites into metabolic and signaling pathways?
The Reactome database.
What data is captured by expression and variation databases?
Quantitative expression data and catalogs of genetic variants.
What is the purpose of the Database of Single Nucleotide Polymorphisms (dbSNP)?
To catalog genetic variants found across populations.
Which database is used to catalog genetic variants that have clinical significance?
The Clinical Variation (ClinVar) database.
Quiz
Question 1: How are biological databases primarily categorized?
- By the type of data they store. (correct)
- By the geographic location of the hosting server.
- By the year they were created.
- By the alphabetical order of their names.
Question 2: Which database stores DNA and RNA sequences submitted by researchers?
- GenBank (correct)
- UniProt
- Protein Data Bank
- Gene Expression Omnibus
Question 3: Which resource contains experimentally determined three-dimensional structures of proteins and nucleic acids?
- Protein Data Bank (correct)
- GenBank
- Gene Ontology
- Database of Single Nucleotide Polymorphisms
Question 4: What makes modern bioinformatics pipelines possible?
- Integration of many databases. (correct)
- Manual literature review of each gene.
- Use of only one central database.
- Reliance on physical specimen collections.
Key Concepts
Biological Databases
Biological database
GenBank
UniProt
Protein Data Bank (PDB)
Gene Ontology (GO)
Kyoto Encyclopedia of Genes and Genomes (KEGG)
Reactome
Gene Expression Omnibus (GEO)
Database of Single Nucleotide Polymorphisms (dbSNP)
ClinVar (Clinical Variation database)
Definitions
Biological database
A structured, computer‑accessible collection of information about living organisms used for storage, search, and retrieval of biological data.
GenBank
The National Center for Biotechnology Information’s primary repository for publicly submitted DNA and RNA sequences.
UniProt
A comprehensive resource that provides curated protein sequence and functional information.
Protein Data Bank (PDB)
An archive of experimentally determined three‑dimensional structures of proteins and nucleic acids.
Gene Ontology (GO)
A standardized vocabulary describing gene product attributes across species, including biological processes, molecular functions, and cellular components.
Kyoto Encyclopedia of Genes and Genomes (KEGG)
A database that maps genes, enzymes, and metabolites onto metabolic and signaling pathways.
Reactome
An open‑access curated pathway database that details biochemical reactions and their roles in cellular processes.
Gene Expression Omnibus (GEO)
A public repository for high‑throughput gene expression and other functional genomics data sets.
Database of Single Nucleotide Polymorphisms (dbSNP)
A catalog of genetic variation, primarily single‑base changes, observed in human populations.
ClinVar (Clinical Variation database)
A specialized repository that links genetic variants to clinical phenotypes and disease relevance.