External Sorting

CodesCoddler
4 min readMar 6, 2024

External sorting is a term for a class of sorting algorithms that can handle massive amounts of data. These sorting algorithms are particularly useful when the data being sorted do not fit into a computer’s main memory, and instead reside in the slower external memory (usually a hard drive).

External sorting algorithms generally follow a sort-and-merge paradigm. This involves dividing data into chunks that can fit into memory, sorting each chunk independently, and then merging the sorted chunks back together.

Sorting a large file that can’t fit into memory is a common problem that can be tackled using an algorithm called External Sorting. Here’s a simplified version of how you can do it:

  1. Divide and Conquer: Split the large file into smaller chunks that can fit into memory. Read each chunk into memory, sort it (using any efficient in-memory sorting algorithm like quicksort, mergesort, heapsort, etc.) and write it back to disk. Now you have a set of sorted smaller files.
  2. Multi-way Merge: Perform a k-way merge operation on these sorted chunks. This involves opening all the files simultaneously, and merging them together in a sorted manner.

Here is a more detailed step-by-step process:

  1. Read and Sort Chunks: Read a chunk of data that fits into memory from the large file. Sort this chunk in memory using any efficient sorting algorithm and write the sorted chunk to a temporary file. Repeat this process until all data in the large file has been read, sorted, and written to temporary files. You should now have a list of temporary files, each containing a sorted chunk of data.
  2. Merge Sorted Chunks: This is the core of the external sorting algorithm. The goal is to merge the sorted chunks into a single sorted file. Since all chunks are sorted, you can do this by repeatedly picking the smallest element from the chunks. However, you can’t load all chunks into memory at once, so you need to use a priority queue.
  • Load the first few elements from each chunk into a priority queue. The priority queue should be a min-heap that keeps track of the smallest element and its origin chunk.
  • Repeatedly extract the smallest element from the priority queue and write it to the final sorted file. Whenever an element is extracted, load the next element from the same chunk into the priority queue. If a chunk is fully consumed, remove it from consideration.
  • Continue this process until all elements have been written to the final sorted file. The priority queue ensures that you’re always picking the smallest element from the chunks.

3. Cleanup: Delete the temporary files. You should now have a single file that contains all the data from the original file, sorted.

A sample code

import java.io.*;
import java.util.*;

public class ExternalSort {

static int N = 20; // size of the file in disk
static String fileName = "test.txt";
static int M = 5; // max items the memory buffer can hold

public static void externalSort() {
String tfile = "temp-file-";
int[] buffer = new int[M < N ? M : N];

try {
FileReader fr = new FileReader(fileName);
BufferedReader br = new BufferedReader(fr);
int slices = (int) Math.ceil((double) N / M);

int i, j;
i = j = 0;
// Iterate through the elements in the file
for (i = 0; i < slices; i++) {
// Read M-element chunk at a time from the file
for (j = 0; j < (M < N ? M : N); j++) {
String t = br.readLine();
if (t != null)
buffer[j] = Integer.parseInt(t);
else
break;
}
// Sort M elements
Arrays.sort(buffer);

// Write the sorted numbers to temp file
FileWriter fw = new FileWriter(tfile + Integer.toString(i) + ".txt");
PrintWriter pw = new PrintWriter(fw);
for (int k = 0; k < j; k++)
pw.println(buffer[k]);

pw.close();
fw.close();
}

br.close();
fr.close();

// Now open each file and merge them, then write back to disk
int[] topNums = new int[slices];
BufferedReader[] brs = new BufferedReader[slices];

for (i = 0; i < slices; i++) {
brs[i] = new BufferedReader(new FileReader(tfile + Integer.toString(i) + ".txt"));
String t = brs[i].readLine();
if (t != null)
topNums[i] = Integer.parseInt(t);
else
topNums[i] = Integer.MAX_VALUE;
}

FileWriter fw = new FileWriter("output.txt");
PrintWriter pw = new PrintWriter(fw);

for (i = 0; i < N; i++) {
int min = topNums[0];
int minFile = 0;

for (j = 0; j < slices; j++) {
if (min > topNums[j]) {
min = topNums[j];
minFile = j;
}
}

pw.println(min);
String t = brs[minFile].readLine();
if (t != null)
topNums[minFile] = Integer.parseInt(t);
else
topNums[minFile] = Integer.MAX_VALUE;

}
for (i = 0; i < slices; i++)
brs[i].close();

pw.close();
fw.close();
} catch (Exception e) {
e.printStackTrace();
}
}

public static void main(String[] args) {
externalSort();
}
}

--

--