External Sorting in C# / .NET 8
Sort 1 GB of data with 1 MB of RAM. A k-way external merge sort implementation built on a binary min-heap.
Handles datasets larger than available memory by splitting input into sorted chunks on disk and merging them with O(N log K) k-way merge. Generic IExternalSorter<T> — plug in any record type, comparer, and serializer.
Keywords: external sort, external merge sort, k-way merge, out-of-core sorting, large file sorting, limited memory sorting, disk-based sort, C#, .NET 8, binary min-heap
Architecture
ExternalSorting.Core/
├── Core/
│ ├── IExternalSorter<T> — main contract: Stream → sorted Stream
│ ├── ISerializer<T> — binary serialization for any record type
│ ├── SortOptions — memory, merge, parallelism, replacement-selection
│ └── SortMetrics — items, chunks, merge passes, timing
├── Pipeline/
│ └── ExternalSorter<T> — orchestrator with three chunk strategies:
│ • serial (1 thread, 1 recycled buffer)
│ • parallel (N workers, bounded queue)
│ • replacement selection (~2x larger runs)
├── Merge/
│ └── MinHeap<T> — O(log K) binary min-heap with ReplaceMin
└── IO/
├── ChunkWriter/Reader — buffered binary chunk I/O with headers
├── RecordSerializer — SortRecord (ulong + string) binary format
├── TextRecordIO — legacy "number. text" format parser
└── DataGenerator — random test data generation
Algorithm: K-Way External Merge Sort
External merge sort handles datasets that don’t fit in RAM by splitting the work into two phases: chunk creation (fits in memory) and multi-pass merging (disk-based).
Phase 1 — Chunk Creation
Input stream (N items, unsorted)
│
▼
┌─────────────────────────────┐
│ Read M items into memory │ M = MaxMemoryBytes / EstimatedItemSize
│ Sort in-memory (Array.Sort)│ O(M log M) per chunk
│ Write sorted chunk to disk │ Binary format with item count header
└─────────────────────────────┘
│ repeat until input exhausted
▼
Chunk₀ Chunk₁ Chunk₂ ... Chunk_{C-1} (C = ⌈N/M⌉ chunks)
Each chunk is a self-contained binary file: [int32: count][item₀][item₁]...[item_{M-1}].
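The layout can be sketched with plain BinaryWriter/BinaryReader calls. This is an illustrative stand-in for the library's buffered ChunkWriter/ChunkReader, with each item reduced to a bare ulong key instead of a full record; the names here are not the library API.

```csharp
using System;
using System.IO;
using System.Text;

public static class ChunkFormatDemo
{
    // Write a sorted chunk in the [int32: count][item...] layout described above.
    public static void WriteChunk(Stream s, ulong[] sortedItems)
    {
        using var w = new BinaryWriter(s, Encoding.UTF8, leaveOpen: true);
        w.Write(sortedItems.Length);        // int32 header: item count
        foreach (var item in sortedItems)
            w.Write(item);                  // fixed-width 8-byte payload per item
    }

    public static ulong[] ReadChunk(Stream s)
    {
        using var r = new BinaryReader(s, Encoding.UTF8, leaveOpen: true);
        int count = r.ReadInt32();          // header says how many items follow
        var items = new ulong[count];
        for (int i = 0; i < count; i++)
            items[i] = r.ReadUInt64();
        return items;
    }

    public static void Main()
    {
        using var ms = new MemoryStream();
        WriteChunk(ms, new ulong[] { 1, 5, 9 });
        ms.Position = 0;
        Console.WriteLine(string.Join(",", ReadChunk(ms))); // prints 1,5,9
    }
}
```

The count header is what lets the merge phase open a chunk and know exactly how many reads remain without scanning to EOF.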
Phase 2 — K-Way Merge
Merge K sorted chunks at a time using a binary min-heap of size K:
Pass 0: C chunks → ⌈C/K⌉ merged chunks
Pass 1: ⌈C/K⌉ chunks → ⌈C/K²⌉ merged chunks
...
Pass P: 1 final sorted output
Total passes: P = ⌈log_K(C)⌉
Each merge step:
Chunk A: [1, 5, 9, ...] ──┐
Chunk B: [2, 3, 8, ...] ──┤
Chunk C: [4, 6, 7, ...] ──┼──→ MinHeap (size K=3) ──→ Output: [1, 2, 3, 4, 5, 6, ...]
│
│ ExtractMin: O(log K)
│ Insert replacement from same chunk: O(log K)
│ Total: O(N log K) per pass
Why MinHeap? The old implementation used List.Sort() on every extraction — O(K log K) per item, O(NK log K) total. MinHeap gives O(N log K) — roughly a factor-of-K improvement, which adds up quickly at 8-way and 16-way merge widths.
ReplaceMin fast path. When the source that just yielded the min still has more data, the merge loop overwrites the heap root in place via MinHeap.ReplaceMin (one SiftDown) instead of doing ExtractMin + Insert (SiftDown + SiftUp). Measured 30% speedup at K=8 and 34% at K=16 on the merge inner loop — see Benchmarks.
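The inner loop with its ReplaceMin fast path looks roughly like this. This is a toy stand-in for the library's MinHeap<T>, merging in-memory int lists so the heap mechanics stand out; the names are illustrative, not the library API.

```csharp
using System;
using System.Collections.Generic;

public static class KWayMergeSketch
{
    // Heap entries: (current head value, source index, position in that source).
    public static List<int> Merge(List<int>[] sources)
    {
        var heap = new List<(int Head, int Src, int Pos)>();
        for (int s = 0; s < sources.Length; s++)
            if (sources[s].Count > 0) { heap.Add((sources[s][0], s, 0)); SiftUp(heap, heap.Count - 1); }

        var output = new List<int>();
        while (heap.Count > 0)
        {
            var (head, src, pos) = heap[0];
            output.Add(head);
            if (pos + 1 < sources[src].Count)
            {
                // ReplaceMin fast path: overwrite the root and sift down once,
                // instead of ExtractMin (SiftDown) followed by Insert (SiftUp).
                heap[0] = (sources[src][pos + 1], src, pos + 1);
                SiftDown(heap, 0);
            }
            else
            {
                // Source exhausted: classic ExtractMin shrinks the heap.
                heap[0] = heap[^1];
                heap.RemoveAt(heap.Count - 1);
                if (heap.Count > 0) SiftDown(heap, 0);
            }
        }
        return output;
    }

    static void SiftUp(List<(int Head, int Src, int Pos)> h, int i)
    {
        while (i > 0)
        {
            int p = (i - 1) / 2;
            if (h[p].Head <= h[i].Head) break;
            (h[p], h[i]) = (h[i], h[p]);
            i = p;
        }
    }

    static void SiftDown(List<(int Head, int Src, int Pos)> h, int i)
    {
        while (true)
        {
            int l = 2 * i + 1, r = l + 1, min = i;
            if (l < h.Count && h[l].Head < h[min].Head) min = l;
            if (r < h.Count && h[r].Head < h[min].Head) min = r;
            if (min == i) break;
            (h[min], h[i]) = (h[i], h[min]);
            i = min;
        }
    }

    public static void Main()
    {
        var merged = Merge(new[] {
            new List<int> { 1, 5, 9 },
            new List<int> { 2, 3, 8 },
            new List<int> { 4, 6, 7 },
        });
        Console.WriteLine(string.Join(",", merged)); // 1,2,3,4,5,6,7,8,9
    }
}
```

Note the asymmetry: the common case (source still has data) does one SiftDown, and only the K source-exhaustion events pay for a full extract.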
Phase 1 alternatives — Replacement Selection
The simple chunking above produces runs that are exactly M items long. Knuth’s Replacement Selection (TAOCP Vol. 3, §5.4.1) does better: by keeping the heap “live” across the entire input stream and routing items to a “next run” when they would break sorted order, it produces runs that average 2 × M for random input.
heap = first M items from input, all tagged "run 0"
current_run = 0
while heap not empty:
    (run, item) = heap.extractMin()
    if run != current_run:
        close current chunk file, open a new one for the new run
        current_run = run
    write item to current chunk
    if input not exhausted:
        next = read one item from input
        if next >= item:
            heap.insert((current_run, next))      # extends current run
        else:
            heap.insert((current_run + 1, next))  # frozen for next run
Result on a 50K random dataset with 32 KB heap: 74 chunks → 38 chunks (49% fewer), 3 merge passes → 2 (one fewer disk pass), 32% less memory allocated. Best case (already-sorted input) collapses the entire stream into a single run; worst case (reverse-sorted) degenerates to M-sized runs with no improvement and no regression.
Opt in via SortOptions.UseReplacementSelection = true. Inherently single-threaded so it ignores DegreeOfParallelism.
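The run-length effect is easy to demonstrate with .NET's built-in PriorityQueue as a stand-in heap. This is illustrative only — the library uses its own MinHeap<T> and writes real chunk files; here we just count runs on an in-memory array.

```csharp
using System;
using System.Collections.Generic;

public static class ReplacementSelectionDemo
{
    // Count the runs replacement selection produces with a heap of capacity m.
    // Priority = (run number, value): the heap yields everything in the current
    // run, smallest first, before anything frozen for the next run.
    public static int CountRuns(int[] input, int m)
    {
        var heap = new PriorityQueue<int, (int Run, int Val)>();
        int next = 0;
        for (; next < Math.Min(m, input.Length); next++)
            heap.Enqueue(0, (0, input[next]));

        int runs = 0, currentRun = -1;
        while (heap.TryDequeue(out _, out var min))
        {
            if (min.Run != currentRun) { runs++; currentRun = min.Run; } // a new run file opens here
            if (next < input.Length)
            {
                int x = input[next++];
                // x extends the current run iff it won't break sorted order;
                // otherwise it is frozen for the next run.
                heap.Enqueue(0, x >= min.Val ? (currentRun, x) : (currentRun + 1, x));
            }
        }
        return runs;
    }

    public static void Main()
    {
        var rng = new Random(42);
        var data = new int[50_000];
        for (int i = 0; i < data.Length; i++) data[i] = rng.Next();

        int m = 1_000;
        int simple = (data.Length + m - 1) / m;   // ceil(N/M) = 50 fixed-size runs
        int rs = CountRuns(data, m);              // roughly half that on random input (avg run ~ 2M)
        Console.WriteLine($"simple={simple} rs={rs}");
    }
}
```

Feeding it an already-sorted array reproduces the best case: every incoming item extends the current run, so CountRuns returns 1.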
Phase 1 alternatives — Pipelined parallel chunking
When SortOptions.DegreeOfParallelism > 1, chunk creation runs as a producer/consumer pipeline:
Reader (1 thread) ──► [bounded queue, capacity = parallelism × 2] ──► Workers (N threads)
        │                                                                    │
  read input into                                                     buffer.Sort() +
  per-chunk buffer                                                    ChunkWriter.Write
The bounded BlockingCollection caps in-flight buffers so memory growth is bounded by ~(parallelism + 1) × MaxMemoryBytes. A linked CancellationTokenSource propagates worker faults back to the reader so a disk-full error tears the pipeline down cleanly instead of deadlocking. Measured 1.12× speedup at the sweet spot (P=4 on a 4-physical-core box) for in-memory workloads — wider for disk-bound ones because writes overlap with sorting.
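The shape of that pipeline can be sketched with BlockingCollection. This is a simplified stand-in: int buffers instead of record buffers, collected sorted buffers instead of chunk files, and a single shared CancellationTokenSource standing in for the library's linked CTS.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class PipelineSketch
{
    public static List<int[]> SortChunksInParallel(IEnumerable<int[]> buffers, int parallelism)
    {
        // Bounded queue caps in-flight buffers, so memory stays ~ (P + 1) x buffer size.
        using var queue = new BlockingCollection<int[]>(boundedCapacity: parallelism * 2);
        using var cts = new CancellationTokenSource();
        var results = new ConcurrentBag<int[]>();

        var workers = Enumerable.Range(0, parallelism).Select(_ => Task.Run(() =>
        {
            try
            {
                foreach (var buf in queue.GetConsumingEnumerable(cts.Token))
                {
                    Array.Sort(buf);   // in the real pipeline: sort + ChunkWriter.Write
                    results.Add(buf);
                }
            }
            catch (OperationCanceledException) { }
            catch
            {
                cts.Cancel();          // propagate worker fault: unblocks a blocked reader
                throw;
            }
        })).ToArray();

        foreach (var buf in buffers)
            queue.Add(buf, cts.Token); // reader blocks here when the queue is full
        queue.CompleteAdding();
        Task.WaitAll(workers);
        return results.ToList();
    }

    public static void Main()
    {
        var chunks = Enumerable.Range(0, 8).Select(i => new[] { 3 + i, 1, 2 }).ToList();
        var sorted = SortChunksInParallel(chunks, parallelism: 4);
        Console.WriteLine($"{sorted.Count} chunks, all sorted: {sorted.All(c => c[0] <= c[1] && c[1] <= c[2])}");
    }
}
```

The key detail is that Add takes the cancellation token: if a worker faults and cancels, the reader's blocked Add throws instead of waiting forever on a queue nobody drains.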
Concrete Example
Sort 10M records with 64 MB memory, 8-way merge:
Input: 10,000,000 records (158 MB on disk)
Phase 1 — Chunk Creation:
M = 64 MB / 48 bytes ≈ 1,300,000 items per chunk
C = ⌈10M / 1.3M⌉ = 8 chunks
Each chunk: ~20 MB, internally sorted
Chunk₀: [Apple:1, Apple:5, Banana:2, ...] (1.3M items, sorted)
Chunk₁: [Apple:3, Cherry:8, Date:1, ...] (1.3M items, sorted)
...
Chunk₇: [Mango:4, Zucchini:9, ...] (remaining items, sorted)
Phase 2 — 8-Way Merge:
Pass 0: merge all 8 chunks in one pass (K=8 ≥ C=8)
MinHeap seeded with first item from each chunk:
Heap: [(Apple:1, chunk0), (Apple:3, chunk1), ..., (Mango:4, chunk7)]
Loop 10M times:
1. ExtractMin → smallest item across all chunks (log₂ 8 = 3 comparisons)
2. Write to output
3. Read next item from the same chunk, insert into heap (log₂ 8 = 3 comparisons)
Total comparisons: 10M × 2 × log₂(8) = 60M comparisons
Result: single sorted file, 10M items in order
Time: 9.8s (6.1s chunking + 3.3s merging)
Complexity
| Metric | Formula | 10M example |
|---|---|---|
| Chunk count | C = ⌈N/M⌉ | 8 |
| Merge passes | P = ⌈log_K(C)⌉ | 1 |
| Comparisons per pass | O(N log K) | ~60M |
| Total comparisons | O(N log K × P) | ~60M |
| Disk I/O passes | P + 1 (chunk + merge) | 2 |
| Total bytes read/written | O(N × (P + 1)) | ~316 MB read + ~316 MB written |
Key insight: increasing K reduces passes (fewer disk I/O rounds) but increases heap work per item. K=8 to K=16 is the sweet spot for most workloads — one merge pass handles up to K chunks (8-16), and two passes handle up to K² chunks (64-256), enough for hundreds of millions to billions of records depending on chunk size.
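The pass arithmetic is easy to check in a few lines. The 21,845 items/chunk figure below assumes the 48-byte estimated item size against a 1 MB budget; the function names are illustrative.

```csharp
using System;

public static class PassMath
{
    // C = ceil(N/M) chunks, then P = ceil(log_K C) merge passes,
    // computed by repeated ceil-division instead of floating-point logs.
    public static int Passes(long items, long itemsPerChunk, int k)
    {
        long chunks = (items + itemsPerChunk - 1) / itemsPerChunk;
        int passes = 0;
        while (chunks > 1) { chunks = (chunks + k - 1) / k; passes++; }
        return passes;
    }

    public static void Main()
    {
        // 10M items, ~1.3M-item chunks (the worked example): C = 8 -> one 8-way pass
        Console.WriteLine(Passes(10_000_000, 1_300_000, 8));  // 1
        // 60M items with a 1 MB budget (~21,845 items/chunk): 2,747 chunks -> 4 passes
        Console.WriteLine(Passes(60_000_000, 21_845, 8));     // 4
    }
}
```

The second call reproduces the 1 GB / 1 MB row of the performance table: 2,747 chunks need four 8-way passes because 8³ = 512 < 2,747 ≤ 8⁴.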
Installation
dotnet add package ExternalSorting.Core --version 1.0.3
Quick Start
# Build
dotnet build
# Run tests (51 tests)
dotnet test
# Sort 100K records (quick check)
dotnet run --project src/ExternalSorting.Console -- -n 100000 -m 8
# Sort 1M records (release mode, faster)
dotnet run --project src/ExternalSorting.Console -c Release -- -n 1000000 -m 16 -k 4
# Sort 10M records (stress test)
dotnet run --project src/ExternalSorting.Console -c Release -- -n 10000000 -m 64 -k 8
Usage
dotnet run --project src/ExternalSorting.Console -- [options]
Options:
-n, --count <N> Number of records to generate (default: 1M)
-m, --memory <MB> Memory budget in MB (default: 64)
-k, --merge-way <K> K-way merge factor (default: 8)
-i, --input <file> Use existing binary input file (skip generation)
-h, --help Show help
Examples
# Sort 1M records with 16MB memory, 8-way merge
dotnet run --project src/ExternalSorting.Console -c Release -- -n 1000000 -m 16 -k 8
# Sort 10M records with 64MB memory
dotnet run --project src/ExternalSorting.Console -c Release -- -n 10000000 -m 64
Programmatic API
var serializer = new RecordSerializer();
var comparer = Comparer<SortRecord>.Default;
var options = new SortOptions
{
MaxMemoryBytes = 64 * 1024 * 1024, // 64 MB
MergeWayCount = 8,
BufferSize = 64 * 1024, // FileStream buffer
// Phase 3.1 — pipelined parallel chunk creation. One reader thread
// feeds N sort+write workers via a bounded queue. Default = ProcessorCount.
DegreeOfParallelism = Environment.ProcessorCount,
// Phase 3.2 — Replacement Selection. Produces ~2x larger runs on
// random input → fewer chunks → fewer merge passes. Mutually
// exclusive with parallel chunking (single-heap algorithm).
UseReplacementSelection = false,
OnProgress = (phase, pct) => Console.Write($"\r{phase} {pct:F0}%"),
};
var sorter = new ExternalSorter<SortRecord>(serializer, comparer, options);
using var input = File.OpenRead("input.bin");
using var output = File.Create("output.bin");
sorter.Sort(input, output);
Console.WriteLine(sorter.LastMetrics); // Items: 1,000,000, Chunks: 3, ...
Picking a chunk strategy
| Workload | Recommended | Why |
|---|---|---|
| Default / unknown | parallel (default) | speeds up with cores, no algorithm risk |
| Memory-constrained, random input | UseReplacementSelection = true | ~50% fewer chunks → one fewer merge pass → less disk I/O |
| Mostly-sorted input | UseReplacementSelection = true | best case collapses the entire stream into a single run |
| Reverse-sorted input | parallel | RS degenerates to M-sized runs; parallel still wins |
| Single-core or strict memory cap | DegreeOfParallelism = 1 | original serial path, single recycled buffer, lowest GC |
Custom record types
Implement ISerializer<T> for any type:
public record LogEntry(DateTime Timestamp, string Message);
public class LogSerializer : ISerializer<LogEntry>
{
    public int EstimatedItemSize => 8 + 100; // 8-byte ticks + ~100-byte average message

    public void Write(BinaryWriter w, LogEntry item)
    {
        w.Write(item.Timestamp.Ticks);
        w.Write(item.Message); // length-prefixed UTF-8
    }

    public LogEntry Read(BinaryReader r) =>
        new(new DateTime(r.ReadInt64()), r.ReadString());
}
Performance
| Records | Data Size | Memory | Merge | Chunks | Passes | Time | Verified |
|---|---|---|---|---|---|---|---|
| 100K | 1.6 MB | 8 MB | 4-way | 1 | 0 | 0.1s | OK |
| 1M | 16 MB | 16 MB | 8-way | 3 | 1 | 1.3s | OK |
| 10M | 158 MB | 64 MB | 8-way | 8 | 1 | 9.8s | OK |
| 60M | 948 MB | 1 MB | 8-way | 2,747 | 4 | 84s | OK |
The last row demonstrates the core interview problem: sort 1 GB of data with only 1 MB of RAM — a classic system design / algorithms challenge.
Tests
51 tests covering:
- MinHeap: insert, extract, duplicates, replace, 10K random
- Serializer: binary roundtrip, text parse/format, comparison logic
- Chunk I/O: write/read roundtrip, empty, dispose cleanup, 10K items
- ExternalSorter:
- basics: empty, single, sorted, reverse, duplicates, multi-chunk, multi-pass, 10K random, cancellation, temp cleanup, metrics
- Replacement Selection: empty, single, ascending → 1 chunk (best case), descending → M-sized runs (worst case), random input produces noticeably fewer chunks than simple chunking, correctness on 5K random, cancellation
- Parallel chunk creation: byte-identical output across DegreeOfParallelism ∈ {1, 2, 4, 8} (Theory), 25-iteration determinism stress (max contention), chunk-count invariant across parallelism levels, pre-cancelled CT unblocks the pipeline cleanly, RS overrides parallelism when both options are set
- DataGenerator: binary/text generation, deterministic seed
dotnet test
Benchmarks
A separate tests/ExternalSorting.Benchmarks/ project measures the
perf-relevant inner loops with BenchmarkDotNet.
# Run all benchmark suites (~1-2 min total, ShortRun config)
dotnet run -c Release --project tests/ExternalSorting.Benchmarks -- --filter '*'
# Or pick one
dotnet run -c Release --project tests/ExternalSorting.Benchmarks -- --filter '*MergeBenchmarks*'
Three suites:
MergeBenchmarks — MinHeap.ReplaceMin vs ExtractMin + Insert
Isolates the merge inner loop from disk I/O. K pre-sorted in-memory sources, two methods running the same merge with different heap operation patterns.
Intel Core i3-10100T, 4 physical cores, .NET 8.0.25:
| Method | K | Mean | Ratio |
|---|---|---|---|
| Merge_ExtractMin_Insert | 8 | 46.92 ms | 1.00 |
| Merge_ReplaceMin | 8 | 32.88 ms | 0.70 |
| Merge_ExtractMin_Insert | 16 | 66.13 ms | 1.00 |
| Merge_ReplaceMin | 16 | 43.41 ms | 0.66 |
ReplaceMin is 30–34% faster on the inner merge loop. The win
scales slightly with K because deeper heaps = bigger SiftUp savings.
ChunkStrategyBenchmarks — Replacement Selection vs simple chunking
Same dataset (50K random records, 32 KB heap), only the chunk algorithm differs.
| Method | Mean | Allocated | Chunks | Merge passes |
|---|---|---|---|---|
| Sort_Simple_Chunking | 63.31 ms | 22.96 MB | 74 | 3 |
| Sort_Replacement_Selection | 63.82 ms | 15.63 MB | 38 | 2 |
RS halves the chunk count and saves a full merge pass on disk.
Wall-clock time looks identical because the heap operations during
chunking are more expensive than Array.Sort (compensating for the
merge phase saving on this in-memory benchmark) — but memory
allocation drops 32% and on real disk-bound workloads the one
fewer merge pass dominates.
SortBenchmarks — DegreeOfParallelism sweep
End-to-end sort of a 50K-record in-memory dataset with tiny per-chunk memory (chunk phase dominant) so the parallelism comparison is meaningful.
| Parallelism | Mean | Speedup |
|---|---|---|
| 1 (serial) | 61.6 ms | 1.00× |
| 2 | 57.3 ms | 1.08× |
| 4 ⭐ | 55.0 ms | 1.12× |
| 8 | 56.7 ms | 1.09× |
P=4 is the sweet spot because the test box has 4 physical cores; hyperthreading (P=8) adds context-switch overhead that cancels the marginal SMT win for this CPU+memory-bound workload. On real disk I/O the gap widens because parallel writes overlap with subsequent buffer sorting.
Project Structure
external-sorting/
├── ExternalSorting.sln
├── src/
│ ├── ExternalSorting.Core/ — library (algorithm + I/O)
│ └── ExternalSorting.Console/ — CLI application
└── tests/
├── ExternalSorting.Tests/ — xUnit + FluentAssertions (51 tests)
└── ExternalSorting.Benchmarks/ — BenchmarkDotNet perf suites
Key Design Decisions
- Generic T: sort any type, not just strings — plug in your own ISerializer<T> and IComparer<T>
- MinHeap merge: O(N log K) vs the old code's O(NK log K) — roughly a factor-of-K win for wide merges
- ReplaceMin fast path: the merge inner loop overwrites the heap root in place when the source still has data, saving the SiftUp half of an ExtractMin + Insert pair (30–34% measured speedup, see Benchmarks)
- Three chunk strategies: serial (lowest GC), parallel pipeline (default, ~1.12× speedup), Replacement Selection (~50% fewer chunks for random input); a dispatcher picks one based on SortOptions
- Binary format: 3-5x faster I/O than text parsing
- Memory-adaptive chunking: chunk size computed from MaxMemoryBytes / EstimatedItemSize
- Automatic cleanup: temp directory deleted in a finally block; ChunkFile implements IDisposable
- CancellationToken: cooperative cancellation at chunk and merge boundaries; the parallel pipeline uses a linked CTS so a worker fault unblocks the reader instead of deadlocking
- Progress reporting: callback with phase + percentage for UI integration
Requirements
- .NET 8.0 SDK
# Ubuntu/Debian
sudo apt-get install -y dotnet-sdk-8.0
# macOS
brew install dotnet-sdk
# Verify
dotnet --version
References
- External sorting — Wikipedia
- K-way merge algorithm — Wikipedia
- Binary heap — Wikipedia
- Knuth, TAOCP Vol. 3 — Sorting and Searching, Ch. 5.4: External Sorting
- Sort 1GB with 1MB RAM — classic system design problem
- .NET 8 documentation
Related Topics
If you’re studying this problem, you might also be interested in:
- MapReduce — distributed external sorting at scale
- B-tree — disk-optimized data structure using similar I/O principles
- Replacement selection sort — alternative initial run generation (produces longer runs than naive chunking) — implemented here, opt-in via SortOptions.UseReplacementSelection
- Polyphase merge sort — optimized tape merge schedule
- Tournament sort — alternative to binary heap for k-way merge
License
MIT