
External Sorting in C# / .NET 8

NuGet

Sort 1 GB of data with 1 MB of RAM. K-way external merge sort implementation using a binary min-heap.

Source code on GitHub

Handles datasets larger than available memory by splitting input into sorted chunks on disk and merging them with an O(N log K) k-way merge. Generic IExternalSorter&lt;T&gt; — plug in any record type, comparer, and serializer.

Keywords: external sort, external merge sort, k-way merge, out-of-core sorting, large file sorting, limited memory sorting, disk-based sort, C#, .NET 8, binary min-heap

Architecture

ExternalSorting.Core/
├── Core/
│   ├── IExternalSorter<T>    — main contract: Stream → sorted Stream
│   ├── ISerializer<T>        — binary serialization for any record type
│   ├── SortOptions            — memory, merge, parallelism, replacement-selection
│   └── SortMetrics            — items, chunks, merge passes, timing
├── Pipeline/
│   └── ExternalSorter<T>     — orchestrator with three chunk strategies:
│                                  • serial (1 thread, 1 recycled buffer)
│                                  • parallel (N workers, bounded queue)
│                                  • replacement selection (~2x larger runs)
├── Merge/
│   └── MinHeap<T>            — O(log K) binary min-heap with ReplaceMin
└── IO/
    ├── ChunkWriter/Reader     — buffered binary chunk I/O with headers
    ├── RecordSerializer       — SortRecord (ulong + string) binary format
    ├── TextRecordIO           — legacy "number. text" format parser
    └── DataGenerator          — random test data generation

Algorithm: K-Way External Merge Sort

External merge sort handles datasets that don’t fit in RAM by splitting the work into two phases: chunk creation (fits in memory) and multi-pass merging (disk-based).

Phase 1 — Chunk Creation

Input stream (N items, unsorted)
        │
        ▼
┌─────────────────────────────┐
│  Read M items into memory   │  M = MaxMemoryBytes / EstimatedItemSize
│  Sort in-memory (Array.Sort)│  O(M log M) per chunk
│  Write sorted chunk to disk │  Binary format with item count header
└─────────────────────────────┘
        │ repeat until input exhausted
        ▼
Chunk₀  Chunk₁  Chunk₂  ...  Chunk_{C-1}     (C = ⌈N/M⌉ chunks)

Each chunk is a self-contained binary file: [int32: count][item₀][item₁]...[item_{M-1}].
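
Phase 1 is a short loop in essence. Here is a minimal, self-contained sketch of it (hypothetical names, int records for brevity — the library works with any T through ISerializer&lt;T&gt; and ChunkWriter, which this stands in for):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Demo: 6 items with M = 2 → 3 chunk files, each internally sorted,
// each prefixed with an [int32: count] header as described above.
var dir = Directory.CreateTempSubdirectory().FullName;
int n = 0;
var chunks = WriteChunks(new[] { 9, 1, 8, 2, 7, 3 }, m: 2,
                         nextPath: () => Path.Combine(dir, $"chunk{n++}.bin"));
Console.WriteLine(chunks.Count);  // 3

// Phase 1 sketch: buffer up to M items, sort in memory, write
// [int32: count][item...] to a fresh chunk file, repeat until EOF.
static List<string> WriteChunks(IEnumerable<int> input, int m, Func<string> nextPath)
{
    var paths = new List<string>();
    var buffer = new List<int>(m);       // one recycled buffer (serial strategy)
    foreach (var item in input)
    {
        buffer.Add(item);
        if (buffer.Count == m) Flush();
    }
    if (buffer.Count > 0) Flush();       // final, possibly partial chunk
    return paths;

    void Flush()
    {
        buffer.Sort();                   // O(M log M) per chunk
        var path = nextPath();
        using var w = new BinaryWriter(File.Create(path));
        w.Write(buffer.Count);           // count header
        foreach (var v in buffer) w.Write(v);
        paths.Add(path);
        buffer.Clear();
    }
}
```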

Phase 2 — K-Way Merge

Merge K sorted chunks at a time using a binary min-heap of size K:

Pass 0: C chunks → ⌈C/K⌉ merged chunks
Pass 1: ⌈C/K⌉ chunks → ⌈C/K²⌉ merged chunks
...
Pass P: 1 final sorted output

Total passes: P = ⌈log_K(C)⌉

Each merge step:

Chunk A:  [1, 5, 9, ...]     ──┐
Chunk B:  [2, 3, 8, ...]     ──┤
Chunk C:  [4, 6, 7, ...]     ──┼──→  MinHeap (size K=3)  ──→  Output: [1, 2, 3, 4, 5, 6, ...]
                                │
                                │     ExtractMin: O(log K)
                                │     Insert replacement from same chunk: O(log K)
                                │     Total: O(N log K) per pass

Why MinHeap? The old implementation used List.Sort() on every extraction — O(K log K) per item, O(NK log K) total. MinHeap gives O(N log K), roughly a K× speedup, which matters most for large K (8-way, 16-way merge).

ReplaceMin fast path. When the source that just yielded the min still has more data, the merge loop overwrites the heap root in place via MinHeap.ReplaceMin (one SiftDown) instead of doing ExtractMin + Insert (SiftDown + SiftUp). Measured 30% speedup at K=8 and 34% at K=16 on the merge inner loop — see Benchmarks.
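
The fast path can be sketched against .NET 8's built-in PriorityQueue, whose DequeueEnqueue does the same root-overwrite-plus-one-sift that MinHeap.ReplaceMin does (the library uses its own MinHeap&lt;T&gt;; this is an illustrative stand-in, not its code):

```csharp
using System;
using System.Collections.Generic;

// Three sorted chunks, as in the diagram above.
int[][] chunks = { new[] { 1, 5, 9 }, new[] { 2, 3, 8 }, new[] { 4, 6, 7 } };

// Heap entries: (chunk index, position of the NEXT unread item in that
// chunk), prioritized by the current item's value.
var heap = new PriorityQueue<(int Chunk, int Next), int>();
for (int c = 0; c < chunks.Length; c++)
    heap.Enqueue((c, 1), chunks[c][0]);          // seed with each chunk's head

var output = new List<int>();
while (heap.Count > 0)
{
    heap.TryPeek(out var top, out int value);
    output.Add(value);
    var (c, next) = top;
    if (next < chunks[c].Length)
        // Fast path: replace the root in a single sift — ReplaceMin in the
        // library, DequeueEnqueue on .NET 8's PriorityQueue.
        heap.DequeueEnqueue((c, next + 1), chunks[c][next]);
    else
        heap.Dequeue();                          // chunk drained: shrink the heap
}

Console.WriteLine(string.Join(", ", output));    // 1, 2, 3, 4, 5, 6, 7, 8, 9
```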

Phase 1 alternatives — Replacement Selection

The simple chunking above produces runs that are exactly M items long. Knuth’s Replacement Selection (TAOCP Vol. 3, §5.4.1) does better: by keeping the heap “live” across the entire input stream and routing items to a “next run” when they would break sorted order, it produces runs that average 2 × M for random input.

heap = M items from input, all tagged "run 0"
current_run = 0
while heap not empty:
    (run, item) = heap.extractMin()
    if run != current_run:
        close current chunk file, open a new one for the new run
        current_run = run
    write item to current chunk
    next = read one item from input
    if input not exhausted:
        if next >= just-emitted item:
            heap.insert((current_run, next))      # extends current run
        else:
            heap.insert((current_run + 1, next))  # frozen for next run

Result on a 50K random dataset with 32 KB heap: 74 chunks → 38 chunks (49% fewer), 3 merge passes → 2 (one fewer disk pass), 32% less memory allocated. Best case (already-sorted input) collapses the entire stream into a single run; worst case (reverse-sorted) degenerates to M-sized runs with no improvement and no regression.

Opt in via SortOptions.UseReplacementSelection = true. Inherently single-threaded so it ignores DegreeOfParallelism.
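
The same idea in runnable form, using .NET's PriorityQueue keyed on (run, value) in place of the library's tagged heap (a sketch with hypothetical names, not the library implementation):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Already-sorted input collapses into a single run (best case)...
Console.WriteLine(ReplacementSelect(Enumerable.Range(1, 10), heapSize: 3).Count);  // 1
// ...reverse-sorted input degenerates to ⌈N/M⌉ runs (worst case).
Console.WriteLine(ReplacementSelect(Enumerable.Range(1, 10).Reverse(), heapSize: 3).Count);  // 4

static List<List<int>> ReplacementSelect(IEnumerable<int> input, int heapSize)
{
    var runs = new List<List<int>>();
    // Priority = (run, value): items frozen for the next run sort after
    // everything still live in the current run.
    var heap = new PriorityQueue<int, (int Run, int Value)>();
    using var e = input.GetEnumerator();

    while (heap.Count < heapSize && e.MoveNext())  // seed: M items, run 0
        heap.Enqueue(e.Current, (0, e.Current));

    int currentRun = -1;
    while (heap.Count > 0)
    {
        heap.TryDequeue(out int item, out var pri);
        if (pri.Run != currentRun)                 // run boundary: new chunk
        {
            runs.Add(new List<int>());
            currentRun = pri.Run;
        }
        runs[^1].Add(item);
        if (e.MoveNext())                          // refill from input
        {
            // Can the next item still extend this run? If not, freeze it.
            int run = e.Current >= item ? currentRun : currentRun + 1;
            heap.Enqueue(e.Current, (run, e.Current));
        }
    }
    return runs;
}
```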

Phase 1 alternatives — Pipelined parallel chunking

When SortOptions.DegreeOfParallelism > 1, chunk creation runs as a producer/consumer pipeline:

Reader (1 thread) ──► [bounded queue, capacity = parallelism × 2] ──► Workers (N threads)
       │                                                                    │
   read input into                                                      buffer.Sort() +
   per-chunk buffer                                                     ChunkWriter.Write

The bounded BlockingCollection caps in-flight buffers so memory growth is bounded by ~(parallelism + 1) × MaxMemoryBytes. A linked CancellationTokenSource propagates worker faults back to the reader so a disk-full error tears the pipeline down cleanly instead of deadlocking. Measured 1.12× speedup at the sweet spot (P=4 on a 4-physical-core box) for in-memory workloads — wider for disk-bound ones because writes overlap with sorting.
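
In miniature, the pipeline looks like this — BlockingCollection for the bounded queue, with a shared CancellationTokenSource standing in for the library's linked one, and sorted buffers collected in a bag instead of written by ChunkWriter (an illustrative sketch, not the library's code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

int parallelism = 4;
using var queue = new BlockingCollection<int[]>(boundedCapacity: parallelism * 2);
using var cts = new CancellationTokenSource();       // shared fault signal
var sortedChunks = new ConcurrentBag<int[]>();

// Workers: take a buffer off the bounded queue, sort it, "write" it.
var workers = Enumerable.Range(0, parallelism).Select(_ => Task.Run(() =>
{
    try
    {
        foreach (var buffer in queue.GetConsumingEnumerable(cts.Token))
        {
            Array.Sort(buffer);                      // per-chunk in-memory sort
            sortedChunks.Add(buffer);                // stand-in for ChunkWriter
        }
    }
    catch
    {
        cts.Cancel();                                // tear the pipeline down
        throw;
    }
})).ToArray();

// Reader: slice input into fixed-size buffers; Add blocks once the queue
// is full, which is what bounds memory growth.
var input = Enumerable.Range(0, 100).Reverse().ToArray();
for (int i = 0; i < input.Length; i += 10)
    queue.Add(input.Skip(i).Take(10).ToArray(), cts.Token);
queue.CompleteAdding();
Task.WaitAll(workers);

Console.WriteLine(sortedChunks.Count);               // 10 sorted buffers
```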

Concrete Example

Sort 10M records with 64 MB memory, 8-way merge:

Input: 10,000,000 records (158 MB on disk)

Phase 1 — Chunk Creation:
  M = 64 MB / 48 bytes ≈ 1,300,000 items per chunk
  C = ⌈10M / 1.3M⌉ = 8 chunks
  Each chunk: ~20 MB, internally sorted

  Chunk₀: [Apple:1, Apple:5, Banana:2, ...]     (1.3M items, sorted)
  Chunk₁: [Apple:3, Cherry:8, Date:1, ...]      (1.3M items, sorted)
  ...
  Chunk₇: [Mango:4, Zucchini:9, ...]            (remaining items, sorted)

Phase 2 — 8-Way Merge:
  Pass 0: merge all 8 chunks in one pass (K=8 ≥ C=8)
  
  MinHeap seeded with first item from each chunk:
  Heap: [(Apple:1, chunk0), (Apple:3, chunk1), ..., (Mango:4, chunk7)]
  
  Loop 10M times:
    1. ExtractMin → smallest item across all chunks    log₂(8) = 3 comparisons
    2. Write to output
    3. Read next item from same chunk, insert to heap  log₂(8) = 3 comparisons
  
  Total comparisons: 10M × 2 × log₂(8) = 60M comparisons

Result: single sorted file, 10M items in order
Time: 9.8s (6.1s chunking + 3.3s merging)

Complexity

Metric                    Formula                10M example
Chunk count               C = ⌈N/M⌉              8
Merge passes              P = ⌈log_K(C)⌉         1
Comparisons per pass      O(N log K)             ~60M
Total comparisons         O(N log K × P)         ~60M
Disk I/O passes           P + 1 (chunk + merge)  2
Total bytes read/written  O(N × (P + 1))         ~316 MB × 2

Key insight: Increasing K reduces passes (fewer disk I/O rounds) but increases heap work per item. K=8 to K=16 is the sweet spot for most workloads — one merge pass handles up to K = 8–16 chunks, and two passes handle up to K² = 64–256 chunks (billions of records).
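
The table's formulas, checked in a few lines with the values from the 10M worked example:

```csharp
using System;

long N = 10_000_000;                 // items
long M = 1_300_000;                  // items per chunk (≈ 64 MB / 48 B each)
int  K = 8;                          // merge fan-in

long C = (N + M - 1) / M;            // ⌈N/M⌉ chunks
int  P = (int)Math.Ceiling(Math.Log(C) / Math.Log(K));  // ⌈log_K(C)⌉ passes
long comparisons = N * P * 2 * (long)Math.Log2(K);      // extract + insert per item

Console.WriteLine($"C={C}, P={P}, comparisons={comparisons}");
// C=8, P=1, comparisons=60000000
```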

Installation

dotnet add package ExternalSorting.Core --version 1.0.3

Quick Start

# Build
dotnet build

# Run tests (51 tests)
dotnet test

# Sort 100K records (quick check)
dotnet run --project src/ExternalSorting.Console -- -n 100000 -m 8

# Sort 1M records (release mode, faster)
dotnet run --project src/ExternalSorting.Console -c Release -- -n 1000000 -m 16 -k 4

# Sort 10M records (stress test)
dotnet run --project src/ExternalSorting.Console -c Release -- -n 10000000 -m 64 -k 8

Usage

dotnet run --project src/ExternalSorting.Console -- [options]

Options:
  -n, --count <N>      Number of records to generate (default: 1M)
  -m, --memory <MB>    Memory budget in MB (default: 64)
  -k, --merge-way <K>  K-way merge factor (default: 8)
  -i, --input <file>   Use existing binary input file (skip generation)
  -h, --help           Show help

Examples

# Sort 1M records with 16MB memory, 8-way merge
dotnet run --project src/ExternalSorting.Console -c Release -- -n 1000000 -m 16 -k 8

# Sort 10M records with 64MB memory
dotnet run --project src/ExternalSorting.Console -c Release -- -n 10000000 -m 64

Programmatic API

var serializer = new RecordSerializer();
var comparer = Comparer<SortRecord>.Default;
var options = new SortOptions
{
    MaxMemoryBytes = 64 * 1024 * 1024,  // 64 MB
    MergeWayCount = 8,
    BufferSize = 64 * 1024,             // FileStream buffer

    // Phase 3.1 — pipelined parallel chunk creation. One reader thread
    // feeds N sort+write workers via a bounded queue. Default = ProcessorCount.
    DegreeOfParallelism = Environment.ProcessorCount,

    // Phase 3.2 — Replacement Selection. Produces ~2x larger runs on
    // random input → fewer chunks → fewer merge passes. Mutually
    // exclusive with parallel chunking (single-heap algorithm).
    UseReplacementSelection = false,

    OnProgress = (phase, pct) => Console.Write($"\r{phase} {pct:F0}%"),
};

var sorter = new ExternalSorter<SortRecord>(serializer, comparer, options);

using var input = File.OpenRead("input.bin");
using var output = File.Create("output.bin");
sorter.Sort(input, output);

Console.WriteLine(sorter.LastMetrics); // Items: 1,000,000, Chunks: 3, ...

Picking a chunk strategy

Workload                           Recommended                     Why
Default / unknown                  parallel (default)              linear-ish speedup with cores, no algorithm risk
Memory-constrained, random input   UseReplacementSelection = true  ~50% fewer chunks → one fewer merge pass → less disk I/O
Mostly-sorted input                UseReplacementSelection = true  best case collapses entire stream into a single run
Reverse-sorted input               parallel                        RS degenerates to M-sized runs, parallel still wins
Single-core or strict memory cap   DegreeOfParallelism = 1         original serial path, single recycled buffer, lowest GC

Custom record types

Implement ISerializer<T> for any type:

public record LogEntry(DateTime Timestamp, string Message);

public class LogSerializer : ISerializer<LogEntry>
{
    public int EstimatedItemSize => 8 + 100;  // 8-byte ticks + ~100-byte message

    public void Write(BinaryWriter w, LogEntry item)
    { w.Write(item.Timestamp.Ticks); w.Write(item.Message); }

    public LogEntry Read(BinaryReader r) =>
        new(new DateTime(r.ReadInt64()), r.ReadString());
}

Performance

Records  Data Size  Memory  Merge  Chunks  Passes  Time  Verified
100K     1.6 MB     8 MB    4-way  1       0       0.1s  OK
1M       16 MB      16 MB   8-way  3       1       1.3s  OK
10M      158 MB     64 MB   8-way  8       1       9.8s  OK
60M      948 MB     1 MB    8-way  2,747   4       84s   OK

The last row demonstrates the core interview problem: sort 1 GB of data with only 1 MB of RAM — a classic system design / algorithms challenge.

Tests

The suite contains 51 tests (xUnit + FluentAssertions). Run them with:

dotnet test

Benchmarks

A separate tests/ExternalSorting.Benchmarks/ project measures the perf-relevant inner loops with BenchmarkDotNet.

# Run all benchmark suites (~1-2 min total, ShortRun config)
dotnet run -c Release --project tests/ExternalSorting.Benchmarks -- --filter '*'

# Or pick one
dotnet run -c Release --project tests/ExternalSorting.Benchmarks -- --filter '*MergeBenchmarks*'

Three suites:

MergeBenchmarks — MinHeap.ReplaceMin vs ExtractMin + Insert

Isolates the merge inner loop from disk I/O. K pre-sorted in-memory sources, two methods running the same merge with different heap operation patterns.

Intel Core i3-10100T, 4 physical cores, .NET 8.0.25:

Method                   K   Mean      Ratio
Merge_ExtractMin_Insert  8   46.92 ms  1.00
Merge_ReplaceMin         8   32.88 ms  0.70
Merge_ExtractMin_Insert  16  66.13 ms  1.00
Merge_ReplaceMin         16  43.41 ms  0.66

ReplaceMin is 30–34% faster on the inner merge loop. The win scales slightly with K because deeper heaps = bigger SiftUp savings.

ChunkStrategyBenchmarks — Replacement Selection vs simple chunking

Same dataset (50K random records, 32 KB heap), only the chunk algorithm differs.

Method                      Mean      Allocated  Chunks  Merge passes
Sort_Simple_Chunking        63.31 ms  22.96 MB   74      3
Sort_Replacement_Selection  63.82 ms  15.63 MB   38      2

RS halves the chunk count and saves a full merge pass on disk. Wall-clock time looks identical because the heap operations during chunking are more expensive than Array.Sort (compensating for the merge phase saving on this in-memory benchmark) — but memory allocation drops 32% and on real disk-bound workloads the one fewer merge pass dominates.

SortBenchmarks — DegreeOfParallelism sweep

End-to-end sort of a 50K-record in-memory dataset with tiny per-chunk memory (chunk phase dominant) so the parallelism comparison is meaningful.

Parallelism  Mean     Speedup
1 (serial)   61.6 ms  1.00×
2            57.3 ms  1.08×
4 ⭐          55.0 ms  1.12×
8            56.7 ms  1.09×

P=4 is the sweet spot because the test box has 4 physical cores; hyperthreading (P=8) adds context-switch overhead that cancels the marginal SMT win for this CPU+memory-bound workload. On real disk I/O the gap widens because parallel writes overlap with subsequent buffer sorting.

Project Structure

external-sorting/
├── ExternalSorting.sln
├── src/
│   ├── ExternalSorting.Core/         — library (algorithm + I/O)
│   └── ExternalSorting.Console/      — CLI application
└── tests/
    ├── ExternalSorting.Tests/        — xUnit + FluentAssertions (51 tests)
    └── ExternalSorting.Benchmarks/   — BenchmarkDotNet perf suites

Key Design Decisions

Requirements

# Ubuntu/Debian
sudo apt-get install -y dotnet-sdk-8.0

# macOS
brew install dotnet-sdk

# Verify
dotnet --version

References

If you’re studying this problem, Knuth’s The Art of Computer Programming, Vol. 3, §5.4 (external sorting and replacement selection, cited above) is the canonical reference.

License

MIT