File format Study Guide
📖 Core Concepts
File format – How information is encoded for storage; may define low‑level bits and high‑level organization (e.g., markup, tables).
Standardized vs. ad‑hoc – Formats can be open, proprietary, or informal conventions.
Specification – Published document that describes a format’s structure and validation rules; a published spec makes it easier for many programs to support the format.
Filename extension – Suffix (e.g., .html, .gif) that OSes use to guess a file’s format.
Internal metadata – Data inside the file that identifies the format (file header, magic number) and may describe content (size, author).
External metadata – Information kept outside the file’s contents, by the OS or the transfer protocol (POSIX extended attributes, a MIME type like text/html).
File structure types –
Unstructured – Raw memory dump, no built‑in extensibility.
Chunk‑based – Data placed in labeled “chunks” with length or delimiters.
Directory‑based – Internal directory table pointing to data blocks (e.g., zip).
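As a rough illustration of the directory-based idea, the sketch below builds a small zip archive in memory and then reads back its internal directory with Python's standard zipfile module. The helper names (build_sample_zip, list_directory) and the sample entries are made up for this example.

```python
import io
import zipfile


def build_sample_zip() -> bytes:
    """Create a tiny zip archive in memory (sample data, for illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("a.txt", "hello")
        archive.writestr("b.txt", "world!")
    return buffer.getvalue()


def list_directory(data: bytes):
    """Return (name, offset of the entry's local header, uncompressed size) per entry."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return [
            (info.filename, info.header_offset, info.file_size)
            for info in archive.infolist()
        ]


for name, offset, size in list_directory(build_sample_zip()):
    print(f"{name}: data block starts at byte {offset}, {size} bytes uncompressed")
```

The directory lets a reader jump straight to one entry's offset instead of scanning the whole file, which is the main reason to pick this structure.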
---
📌 Must Remember
Magic number = short, fixed byte sequence at the start of a file; a far better indicator of format than the extension, though not proof of a valid file.
File header = larger, possibly human‑readable block that can include magic number + other metadata.
Extensions are not unique – the same suffix may serve multiple formats.
Renaming a file does not convert its format; it only changes how programs interpret it.
Hiding extensions can mask malicious executables (e.g., photo.jpg.exe).
Chunk identifiers are often human‑readable tags; a parser can skip chunks it does not recognize.
Directory‑based archives such as zip can be abused (e.g., zip bombs that expand to enormous sizes) – treat untrusted ones with caution.
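One cautious habit with untrusted archives is to read the sizes declared in the directory before extracting anything. The sketch below does this with the standard zipfile module. The function names and the thresholds (100 MiB total, 100:1 ratio) are arbitrary assumptions for illustration, and declared sizes can themselves be forged, so this is a first filter rather than a complete defense.

```python
import io
import zipfile


def declared_sizes(data: bytes):
    """Return (total compressed, total uncompressed) sizes as declared in the directory."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        infos = archive.infolist()
        compressed = sum(info.compress_size for info in infos)
        uncompressed = sum(info.file_size for info in infos)
    return compressed, uncompressed


def looks_suspicious(data: bytes,
                     max_total: int = 100 * 1024 * 1024,
                     max_ratio: float = 100.0) -> bool:
    """Flag archives whose declared expansion exceeds the (assumed) limits."""
    compressed, uncompressed = declared_sizes(data)
    if uncompressed > max_total:
        return True
    return compressed > 0 and uncompressed / compressed > max_ratio


def make_zip(name: str, payload: bytes, method: int) -> bytes:
    """Build a one-entry archive in memory (test data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=method) as archive:
        archive.writestr(name, payload)
    return buffer.getvalue()
```

For example, a megabyte of zero bytes stored with ZIP_DEFLATED compresses so well that looks_suspicious flags it under the default ratio, while a short stored text file passes.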
---
🔄 Key Processes
Identifying a file’s format
Check filename extension (quick but unreliable).
Read magic number at file start → confirm expected format.
If missing/incorrect, examine file header for readable tags or length fields.
Consult external metadata (MIME type, extended attributes) as a fallback.
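The first two identification steps can be sketched in a few lines of Python. The signature table holds only a handful of well-known magic numbers, and the function names (sniff_format, identify) are hypothetical; a real tool would use a far larger table and also handle formats without a magic number.

```python
from pathlib import Path
from typing import Optional

# A few well-known signatures; deliberately incomplete.
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip",
}


def sniff_format(data: bytes) -> Optional[str]:
    """Return the format whose magic number starts the data, or None if unknown."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    return None


def identify(filename: str, data: bytes) -> dict:
    """Compare the extension's claim with what the leading bytes say."""
    extension = Path(filename).suffix.lower().lstrip(".") or None
    sniffed = sniff_format(data)
    return {
        "extension": extension,
        "sniffed": sniffed,
        "agree": sniffed is not None and sniffed == extension,
    }


print(identify("report.html", b"%PDF-1.7\n"))
```

When the two disagree, the bytes are the better evidence; when nothing is sniffed, the later steps (header inspection, external metadata) take over.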
Reverse‑engineering an undocumented format
Open the file in a hex/text editor.
Locate the magic number or recognizable chunk tags.
Map observed byte patterns to known structures (e.g., length fields).
Iterate by creating test files and observing program behavior.
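When no hex editor is at hand, a small dump routine is enough to start looking for a magic number or chunk tags. The helper below is a minimal sketch (hex_dump is a made-up name) that prints offsets, hex bytes, and printable ASCII side by side.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as rows of: offset, hex values, printable ASCII."""
    hex_width = width * 3 - 1  # two hex digits per byte plus separating spaces
    lines = []
    for offset in range(0, len(data), width):
        row = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in row)
        # Non-printable bytes are shown as dots so tags stand out.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        lines.append(f"{offset:08x}  {hex_part:<{hex_width}}  {text_part}")
    return "\n".join(lines)


print(hex_dump(b"GIF89a\x01\x00\x01\x00"))
```

Readable runs in the right-hand column are often tags or identifiers, and the few bytes next to them are good candidates for length fields.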
Extending a chunk‑based format
Define a new chunk identifier (unique tag).
Include a length field so parsers can skip unknown chunks.
Update the file header if needed, but maintain backward compatibility.
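The following sketch shows why the length field matters. It uses an invented toy layout (a 4-byte magic number DEMO, then chunks made of a 4-byte tag, a 4-byte big-endian length, and the payload); none of this corresponds to a real format. A reader that only knows the TEXT tag can still step over a newer chunk it has never seen.

```python
import struct
from typing import List, Set, Tuple

MAGIC = b"DEMO"  # hypothetical magic number for this toy format


def write_chunk(tag: bytes, payload: bytes) -> bytes:
    """Serialize one chunk as tag (4 bytes) + big-endian length (4 bytes) + payload."""
    if len(tag) != 4:
        raise ValueError("tag must be exactly 4 bytes")
    return tag + struct.pack(">I", len(payload)) + payload


def read_chunks(data: bytes, known: Set[bytes]) -> List[Tuple[bytes, bytes]]:
    """Return (tag, payload) for known chunks; skip unknown ones using their length."""
    if not data.startswith(MAGIC):
        raise ValueError("missing magic number")
    pos = len(MAGIC)
    found = []
    while pos < len(data):
        if pos + 8 > len(data):
            raise ValueError("truncated chunk header")
        tag = data[pos:pos + 4]
        (length,) = struct.unpack(">I", data[pos + 4:pos + 8])
        pos += 8
        if pos + length > len(data):
            raise ValueError("truncated chunk payload")
        if tag in known:
            found.append((tag, data[pos:pos + length]))
        pos += length  # advance past the payload whether or not the tag is known
    return found


new_file = (MAGIC
            + write_chunk(b"TEXT", b"hi")
            + write_chunk(b"NEWX", bytes(10))   # chunk added by a newer writer
            + write_chunk(b"TEXT", b"bye"))
print(read_chunks(new_file, known={b"TEXT"}))
```

Without the length, the old reader would have no way to find where the NEWX payload ends and the next chunk begins.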
---
🔍 Key Comparisons
Extension vs. Magic Number
Extension: easy, OS‑level, can be changed arbitrarily.
Magic Number: embedded in the file, unaffected by renaming, more reliable for format verification (but it can still be forged).
Unstructured vs. Chunk‑Based vs. Directory‑Based
Unstructured: raw dump, no self‑describing structure, low portability.
Chunk‑Based: self‑describing pieces, easy to skip unknown data, moderate extensibility.
Directory‑Based: internal index, high extensibility, more complex parsing, potential security risks.
Internal vs. External Metadata
Internal: stored inside file (header, magic number); travels with file.
External: stored by OS (MIME type, extended attributes); can be lost when file moves across systems.
---
⚠️ Common Misunderstandings
“Changing the extension converts the file.”
It only changes the label; the underlying bytes stay the same.
“A correct magic number guarantees an uncorrupted file.”
It only indicates the file looks like the format; data can still be corrupted.
“All .txt files are plain ASCII.”
Text files can use any character encoding (UTF‑8, UTF‑16, etc.).
“If a format has a specification, it is open.”
Specs can be proprietary; “open” refers to licensing, not merely existence of a spec.
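The point about .txt files and encodings is easy to check directly. In this small sketch (the sample word is arbitrary), the same four characters become three different byte sequences, and reading the UTF‑8 bytes with the wrong encoding yields different text without raising any error.

```python
text = "café"

utf8_bytes = text.encode("utf-8")        # the accented letter takes two bytes
utf16_bytes = text.encode("utf-16-le")   # every character here takes two bytes
latin1_bytes = text.encode("latin-1")    # every character takes one byte

# Decoding with the wrong encoding succeeds but gives garbled text.
misread = utf8_bytes.decode("latin-1")

print(len(utf8_bytes), len(utf16_bytes), len(latin1_bytes))
print(misread)
```

Nothing in the .txt extension says which of these encodings was used, so a reader has to be told or has to guess.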
---
🧠 Mental Models / Intuition
“File as a book” – The cover (filename/extension) gives a first impression, but the title page (magic number) tells you the true identity, and the table of contents (header/metadata) guides you through the chapters (chunks or directories).
“Chunk = Lego brick” – Each chunk has a label (brick type) and size (how many studs); unknown bricks are simply ignored, keeping the structure intact.
---
🚩 Exceptions & Edge Cases
Some formats share extensions (e.g., .txt may be plain text or a script).
Binary headers that are not human‑readable require hex editors to inspect.
MIME types can be ambiguous (application/octet-stream is a generic fallback).
Zip‑like directory‑based files may contain nested directories that exceed OS path length limits.
---
📍 When to Use Which
Quick OS check → look at filename extension.
Programmatic validation → read the magic number (first few bytes).
Detailed inspection / debugging → parse the file header (human‑readable tags, length fields).
Cross‑platform file sharing → rely on standardized specifications and MIME types.
Designing a new format → prefer chunk‑based for easy forward compatibility; choose directory‑based when random access to many parts is needed.
---
👀 Patterns to Recognize
Magic number pattern – Fixed byte sequence at offset 0 (e.g., 0x89 0x50 0x4E 0x47, the first four bytes of PNG’s eight‑byte signature).
Chunk delimiter pattern – <Tag><Length><Data> repeated throughout file.
Header‑first‑metadata – Human‑readable strings like <?xml or GIF89a at the start.
Extension‑MIME mismatch – File shows .html but MIME type is application/pdf → likely mislabeling.
---
🗂️ Exam Traps
“The extension alone determines the format.” – Wrong; extensions are unreliable without internal checks.
Choosing a format based on popularity alone – May ignore necessary specification availability or security considerations.
Assuming all chunk‑based formats are safe – Some may embed malicious data in unknown chunks.
Confusing MIME type “type/subtype” with file extension – They are related but not interchangeable; a MIME type is a label carried by Internet protocols (and stored by some OSes), while an extension is just part of the filename.
Believing a missing magic number means the file is not that format – Some formats use only a header or rely on external metadata; absence isn’t conclusive.