# In-Place Unstable Sorting: A Fast Sequence Calculation To Improve Shell Sort Performance Without Auxiliary Memory or Hard-Coded Increments

Eightomic developed a bitwise calculation for an optimal sequence of unstable Shell Sort gaps, with a library in C99, as a substantial improvement over the Ciura, Knuth, Sedgewick and Tokuda sequences.

## Library

### Source

```
/* Sorts input[0] through input[input_count - 1] in ascending order, in place. */
void eightomic_sort_ascending(unsigned long input_count,
                              unsigned short *input) {
  unsigned short _input;
  /* Starting gap: about 5/32 of input_count, calculated with shifts only. */
  unsigned long gap = (input_count >> 5) + (input_count >> 3) + 1;
  unsigned long i;
  unsigned long j;

  while (gap > 0) {
    /* Gapped insertion sort pass. */
    i = gap;

    while (i < input_count) {
      _input = input[i];
      j = i;

      while (
        j >= gap &&
        input[j - gap] > _input
      ) {
        input[j] = input[j - gap];
        j -= gap;
      }

      input[j] = _input;
      i++;
    }

    /* Next gap: about 9/32 of the current gap. A gap of 2 or 3 steps to 1
       and a gap of 1 steps to 0, which ends the loop. */
    if (
      gap > 3 ||
      gap == 1
    ) {
      gap = (gap >> 5) + (gap >> 2);
    } else {
      gap = 1;
    }
  }
}

/* Sorts input[0] through input[input_count - 1] in descending order, in place. */
void eightomic_sort_descending(unsigned long input_count,
                               unsigned short *input) {
  unsigned short _input;
  unsigned long gap = (input_count >> 5) + (input_count >> 3) + 1;
  unsigned long i;
  unsigned long j;

  while (gap > 0) {
    i = gap;

    while (i < input_count) {
      _input = input[i];
      j = i;

      /* Only the comparison direction differs from the ascending function. */
      while (
        j >= gap &&
        input[j - gap] < _input
      ) {
        input[j] = input[j - gap];
        j -= gap;
      }

      input[j] = _input;
      i++;
    }

    if (
      gap > 3 ||
      gap == 1
    ) {
      gap = (gap >> 5) + (gap >> 2);
    } else {
      gap = 1;
    }
  }
}
```

### Reference

eightomic_sort_ascending() is the sorting function in ascending order that accepts the following 2 arguments.

input_count is the count of elements in the input array.

input is the unsigned short array to sort. Any other integral data type can be substituted by changing the type of input and _input.

The return value data type is void.

eightomic_sort_descending() is the sorting function in descending order that accepts the following 2 arguments.

input_count is the count of elements in the input array.

input is the unsigned short array to sort. Any other integral data type can be substituted by changing the type of input and _input.

The return value data type is void.
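The following is a minimal usage sketch. The sorting function is copied from the Source section so the block compiles on its own, while example_usage() and its sample values are hypothetical and not part of the library.

```c
#include <assert.h>
#include <stdio.h>

/* Copied from the Source section so this sketch is self-contained. */
void eightomic_sort_ascending(unsigned long input_count,
                              unsigned short *input) {
  unsigned short _input;
  unsigned long gap = (input_count >> 5) + (input_count >> 3) + 1;
  unsigned long i;
  unsigned long j;

  while (gap > 0) {
    i = gap;

    while (i < input_count) {
      _input = input[i];
      j = i;

      while (j >= gap && input[j - gap] > _input) {
        input[j] = input[j - gap];
        j -= gap;
      }

      input[j] = _input;
      i++;
    }

    if (gap > 3 || gap == 1) {
      gap = (gap >> 5) + (gap >> 2);
    } else {
      gap = 1;
    }
  }
}

/* Hypothetical helper: sorts a small fixed array, checks it and prints it. */
void example_usage(void) {
  unsigned short values[8] = {42, 7, 65535, 0, 19, 7, 300, 1};
  unsigned long i;

  eightomic_sort_ascending(8, values);

  for (i = 1; i < 8; i++) {
    /* Each element must be at least as large as the one before it. */
    assert(values[i - 1] <= values[i]);
  }

  for (i = 0; i < 8; i++) {
    printf("%u\n", (unsigned int) values[i]);
  }
}
```

Calling eightomic_sort_descending() follows the same pattern with the same 2 arguments.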

### Requirements

C compiler with C99 (ISO/IEC 9899:1999) standard compatibility.

## Explanation

This gap sequence algorithm is designed to both increase the speed and decrease resource usage for all Shell Sort implementations.

It's portable for both 32-bit and 64-bit systems.

It runs iteratively and in place, so it avoids the extra stack calls required by QuickSort partitions, which helps it meet compliance, portability and security requirements across devices.

It doesn't use modulus, multiplication or division arithmetic operations.

Before sorting, it doesn't pre-calculate the upper limit with a loop starting from 0. Instead, the gaps in each sorting instance are calculated dynamically from input_count, so the resulting sequence differs among input_count values.
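To make the dynamic calculation concrete, the following sketch lists the gaps that the library's two expressions produce for a given input_count. The helper names (eightomic_first_gap, eightomic_next_gap, eightomic_list_gaps, print_gaps) are hypothetical and only wrap the expressions from the Source section. Each step keeps roughly 9/32 of the previous gap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: the starting gap expression from the library. */
unsigned long eightomic_first_gap(unsigned long input_count) {
  return (input_count >> 5) + (input_count >> 3) + 1;
}

/* Hypothetical helper: the gap update performed after each pass. */
unsigned long eightomic_next_gap(unsigned long gap) {
  if (gap > 3 || gap == 1) {
    return (gap >> 5) + (gap >> 2);
  }

  return 1;
}

/* Stores up to capacity gaps in gaps[] and returns how many gaps the
   sequence contains for input_count. */
unsigned long eightomic_list_gaps(unsigned long input_count,
                                  unsigned long *gaps,
                                  unsigned long capacity) {
  unsigned long gap = eightomic_first_gap(input_count);
  unsigned long count = 0;

  assert(gaps != NULL || capacity == 0);

  while (gap > 0) {
    if (count < capacity) {
      gaps[count] = gap;
    }

    count++;
    gap = eightomic_next_gap(gap);
  }

  return count;
}

/* Prints the gap sequence for input_count, one gap per line. */
void print_gaps(unsigned long input_count) {
  unsigned long gaps[64];
  unsigned long count = eightomic_list_gaps(input_count, gaps, 64);
  unsigned long i;

  for (i = 0; i < count && i < 64; i++) {
    printf("%lu\n", gaps[i]);
  }
}
```

For example, print_gaps(1000000) starts at 156251 and ends at 1 after 10 gaps.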

The sequence calculation formula was derived from the following attempt to optimize Cocktail Shaker Sort with unstable increments, which runs at about 55% of the speed of Shell Sort.

```
#include <stdbool.h>

void optimized_cocktail_shaker_sort_ascending(unsigned long input_count,
                                              unsigned short *input) {
  unsigned short _input;
  unsigned long gap;
  unsigned long i;
  unsigned long j;
  bool is_sorted;

  if (input_count > 1) {
    /* Starting low and high indices for the first pass. */
    i = 0;
    j = (input_count >> 12) + (input_count >> 7) + 1;
    is_sorted = false;

    while (is_sorted == false) {
      gap = (input_count >> 15) + (input_count >> 14) + 1;
      is_sorted = true;

      while (i < j) {
        /* Forward iteration until the high index reaches the end. */
        while (j != input_count) {
          if (input[i] > input[j]) {
            _input = input[i];
            input[i] = input[j];
            input[j] = _input;
          }

          i++;
          j++;
        }

        if (((input_count - i) >> 12) != 0) {
          gap++;
        }

        /* Narrow the distance between the low and high indices. */
        i += gap;

        if (i >= j) {
          i = j - 1;
        }

        /* Backward iteration until the low index reaches the start. */
        while (i != 0) {
          i--;
          j--;

          if (input[i] > input[j]) {
            _input = input[i];
            input[i] = input[j];
            input[j] = _input;
            is_sorted = false;
          }
        }

        j--;
      }

      /* Reset the indices for another pass if a backward swap occurred. */
      if (is_sorted == false) {
        i = 0;

        if (input_count > 128) {
          j = (input_count >> 15) + (input_count >> 14) + 8;
        } else {
          j = 1;
        }
      }
    }
  }
}
```

This sorting function calculates an optimal high and low starting index based on input_count and swaps elements while iterating with both indices.

When a bound is reached, the gap between indices is decreased based on input_count, then gap is incremented and the iteration direction is reversed.

When gap is decreased to 1, it sorts as a stable Cocktail Shaker Sort and checks if the elements are sorted.

If they're not sorted, the gap value is reset.

This process repeats until all elements are sorted.

After gap reaches 1 for the first time, the starting high index is decreased to reduce iterations in subsequent passes.

Examples with calculated starting indices and gap values for different array sizes are demonstrated in the following table.

```
Input      First Pass    Starting    Subsequent Pass
Count      High Index    Gap         High Index
2          1             1           1
4          1             1           1
8          1             1           1
16         1             1           1
32         1             1           1
64         1             1           1
128        2             1           1
256        3             1           8
512        5             1           8
1024       9             1           8
2048       17            1           8
4096       34            1           8
8192       67            1           8
16384      133           2           9
32768      265           4           11
65536      529           7           14
131072     1057          13          20
262144     2113          25          32
524288     4225          49          56
1048576    8449          97          104
```
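The starting values for the larger array sizes in the table can be reproduced with a short sketch. The three helper functions are hypothetical names that wrap the shift expressions from the Cocktail Shaker Sort function above, including its input_count > 128 branch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: high index used in the first pass. */
unsigned long cocktail_first_high_index(unsigned long input_count) {
  return (input_count >> 12) + (input_count >> 7) + 1;
}

/* Hypothetical helper: gap value assigned at the start of each pass. */
unsigned long cocktail_starting_gap(unsigned long input_count) {
  return (input_count >> 15) + (input_count >> 14) + 1;
}

/* Hypothetical helper: high index used in passes after the first. */
unsigned long cocktail_subsequent_high_index(unsigned long input_count) {
  if (input_count > 128) {
    return (input_count >> 15) + (input_count >> 14) + 8;
  }

  return 1;
}

/* Prints one row per power of two from 256 to 1048576. */
void print_cocktail_starting_values(void) {
  unsigned long input_count = 256;

  while (input_count <= 1048576UL) {
    /* The high index must stay inside the array. */
    assert(cocktail_first_high_index(input_count) < input_count);

    printf(
      "%lu %lu %lu %lu\n",
      input_count,
      cocktail_first_high_index(input_count),
      cocktail_starting_gap(input_count),
      cocktail_subsequent_high_index(input_count)
    );
    input_count <<= 1;
  }
}
```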

Although this failed to exceed the performance of any Shell Sort variant, the optimal starting gap in Shell Sort was discovered as a derivative of the double-shift calculation used for the high and low indices in the aforementioned Cocktail Shaker Sort.

With the discovered sequence calculation, elements are guaranteed to be sorted because the conditional statements after each pass ensure that the final pass is always a regular insertion sort with a gap value of 1.

After each pass, the gap > 3 condition prevents the gap calculation from jumping from either 3 or 2 to 0 instead of 1, while the gap == 1 condition lets the calculation reach 0 and end the loop, based on the following calculation output table.

```
Gap    Calculation Result
0      0
1      0
2      0
3      0
4      1
5      1
6      1
7      1
8      2
9      2
10     2
```
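The effect of the guard can be sketched by comparing the raw shift calculation with the guarded update for small gaps. The names raw_gap_step, guarded_gap_step and print_gap_steps are hypothetical; the expressions are the ones from the library source.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: the shift calculation without any guard. */
unsigned long raw_gap_step(unsigned long gap) {
  return (gap >> 5) + (gap >> 2);
}

/* Hypothetical helper: the update as written in the library, where a
   gap of 2 or 3 is forced to 1 instead of falling to 0. */
unsigned long guarded_gap_step(unsigned long gap) {
  if (gap > 3 || gap == 1) {
    return raw_gap_step(gap);
  }

  return 1;
}

/* Prints gap, raw result and guarded result for gaps 1 through 10. */
void print_gap_steps(void) {
  unsigned long gap;

  for (gap = 1; gap <= 10; gap++) {
    /* Every guarded step shrinks the gap, so the sort loop terminates. */
    assert(guarded_gap_step(gap) < gap);

    printf(
      "%lu %lu %lu\n",
      gap,
      raw_gap_step(gap),
      guarded_gap_step(gap)
    );
  }
}
```

Only the rows for 2 and 3 differ between the two columns, which is where the unguarded calculation would skip the final insertion sort pass.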

The following speed test results were performed with 1 million randomized elements.

Compared to the following optimized Shell Sort sequence, it's 31% faster.

```
void shell_sort_ascending(unsigned long input_count, unsigned short *input) {
  unsigned short _input;
  /* Original Shell Sort sequence: halve the gap after each pass. */
  unsigned long gap = input_count >> 1;
  unsigned long i;
  unsigned long j;

  while (gap > 0) {
    i = gap;

    while (i < input_count) {
      _input = input[i];
      j = i;

      while (
        j >= gap &&
        input[j - gap] > _input
      ) {
        input[j] = input[j - gap];
        j -= gap;
      }

      input[j] = _input;
      i++;
    }

    gap >>= 1;
  }
}
```

Compared to the following optimized Knuth sequence, it's 24% faster without requiring either division operations or pre-calculated gap increments.

Furthermore, it's still faster when the gap increments are hard-coded.

```
void knuth_sort_ascending(unsigned long input_count, unsigned short *input) {
  unsigned short _input;
  unsigned long gap = 1;
  unsigned long i;
  unsigned long j;

  /* Grow the gap through 1, 4, 13, 40, ... until it reaches input_count. */
  while (gap < input_count) {
    gap += (gap << 1) + 1;
  }

  while (gap > 0) {
    i = gap;

    while (i < input_count) {
      _input = input[i];
      j = i;

      while (
        j >= gap &&
        input[j - gap] > _input
      ) {
        input[j] = input[j - gap];
        j -= gap;
      }

      input[j] = _input;
      i++;
    }

    gap = gap / 3;
  }
}
```

Compared to the following optimized Ciura, Sedgewick and Tokuda sequences, it's marginally faster without a hard-coded gap table in auxiliary memory, which imposes an upper limit on the element counts the gaps cover.

```
void ciura_sort_ascending(unsigned long input_count, unsigned short *input) {
  unsigned short _input;
  /* Ciura's gaps up to 1750, extended by floor(2.25 * previous gap). */
  unsigned long gaps[16] = {
    510774, 227011, 100894, 44842, 19930, 8858, 3937, 1750,
    701, 301, 132, 57, 23, 10, 4, 1
  };
  unsigned long gap;
  unsigned char i = 0;
  unsigned long j;
  unsigned long k;

  while (i != 16) {
    gap = gaps[i];
    j = gap;

    /* Gaps that are at least input_count skip this loop entirely. */
    while (j < input_count) {
      _input = input[j];
      k = j;

      while (
        k >= gap &&
        input[k - gap] > _input
      ) {
        input[k] = input[k - gap];
        k -= gap;
      }

      input[k] = _input;
      j++;
    }

    i++;
  }
}

void sedgewick_sort_ascending(unsigned long input_count,
                              unsigned short *input) {
  unsigned short _input;
  unsigned long gaps[16] = {
    260609, 146305, 64769, 36289, 16001, 8929, 3905, 2161,
    929, 505, 209, 109, 41, 19, 5, 1
  };
  unsigned long gap;
  unsigned char i = 0;
  unsigned long j;
  unsigned long k;

  while (i != 16) {
    gap = gaps[i];
    j = gap;

    while (j < input_count) {
      _input = input[j];
      k = j;

      while (
        k >= gap &&
        input[k - gap] > _input
      ) {
        input[k] = input[k - gap];
        k -= gap;
      }

      input[k] = _input;
      j++;
    }

    i++;
  }
}

void tokuda_sort_ascending(unsigned long input_count, unsigned short *input) {
  unsigned short _input;
  unsigned long gaps[16] = {
    345152, 153401, 68178, 30301, 13467, 5985, 2660, 1182,
    525, 233, 103, 46, 20, 9, 4, 1
  };
  unsigned long gap;
  unsigned char i = 0;
  unsigned long j;
  unsigned long k;

  while (i != 16) {
    gap = gaps[i];
    j = gap;

    while (j < input_count) {
      _input = input[j];
      k = j;

      while (
        k >= gap &&
        input[k - gap] > _input
      ) {
        input[k] = input[k - gap];
        k -= gap;
      }

      input[k] = _input;
      j++;
    }

    i++;
  }
}
```

It's as fast or faster across a variety of input_count values and input data types, as shown in the following benchmarking table.

```
10k 2-Byte Inputs
Ciura        0.005s    Tied
Eightomic    0.005s    Tied
Knuth        0.005s    Tied
Sedgewick    0.005s    Tied
Tokuda       0.005s    Tied
Shell        0.006s    ---

100k 2-Byte Inputs
Ciura        0.040s    ---
Eightomic    0.037s    Won
Knuth        0.040s    ---
Sedgewick    0.039s    ---
Tokuda       0.040s    ---
Shell        0.047s    ---

250k 2-Byte Inputs
Ciura        0.103s    ---
Eightomic    0.091s    Won
Knuth        0.110s    ---
Sedgewick    0.101s    ---
Tokuda       0.102s    ---
Shell        0.126s    ---

500k 2-Byte Inputs
Ciura        0.211s    ---
Eightomic    0.191s    Won
Knuth        0.243s    ---
Sedgewick    0.212s    ---
Tokuda       0.209s    ---
Shell        0.291s    ---

1m 1-Byte Inputs
Ciura        0.302s    ---
Eightomic    0.266s    Won
Knuth        0.349s    ---
Sedgewick    0.291s    ---
Tokuda       0.304s    ---
Shell        0.380s    ---

1m 2-Byte Inputs
Ciura        0.461s    ---
Eightomic    0.423s    Won
Knuth        0.570s    ---
Sedgewick    0.453s    ---
Tokuda       0.451s    ---
Shell        0.634s    ---
```
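The benchmarking method isn't described here, so the following is only a sketch of how such timings could be collected, assuming pseudo-random input from rand() and CPU time from clock(). The time_sort_seconds() harness is hypothetical, and its results will vary with hardware, compiler and optimization flags.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Copied from the Source section so this sketch is self-contained. */
void eightomic_sort_ascending(unsigned long input_count,
                              unsigned short *input) {
  unsigned short _input;
  unsigned long gap = (input_count >> 5) + (input_count >> 3) + 1;
  unsigned long i;
  unsigned long j;

  while (gap > 0) {
    i = gap;

    while (i < input_count) {
      _input = input[i];
      j = i;

      while (j >= gap && input[j - gap] > _input) {
        input[j] = input[j - gap];
        j -= gap;
      }

      input[j] = _input;
      i++;
    }

    if (gap > 3 || gap == 1) {
      gap = (gap >> 5) + (gap >> 2);
    } else {
      gap = 1;
    }
  }
}

/* Hypothetical harness: fills an array with pseudo-random values, times
   one call to sort and returns the elapsed CPU time in seconds. Returns
   -1.0 if the array can't be allocated. */
double time_sort_seconds(void (*sort)(unsigned long, unsigned short *),
                         unsigned long input_count,
                         unsigned int seed) {
  unsigned short *input;
  unsigned long i;
  clock_t start;
  clock_t end;

  if (input_count == 0) {
    return 0.0;
  }

  input = malloc(input_count * sizeof(unsigned short));

  if (input == NULL) {
    return -1.0;
  }

  srand(seed);

  for (i = 0; i < input_count; i++) {
    /* rand() may yield only 15 random bits where RAND_MAX is 32767. */
    input[i] = (unsigned short) (rand() & 0xFFFF);
  }

  start = clock();
  sort(input_count, input);
  end = clock();

  for (i = 1; i < input_count; i++) {
    /* Verify the ascending result before trusting the timing. */
    assert(input[i - 1] <= input[i]);
  }

  free(input);

  return (double) (end - start) / CLOCKS_PER_SEC;
}
```

Calling time_sort_seconds(eightomic_sort_ascending, 1000000, 1) times a single run with 1 million elements. The same seed can be passed with each of the other ascending sorting functions to compare them on identical input.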

It's faster than every other sequence as well, including Hibbard, Papernov & Stasevich and Pratt.