template<class BoostedCpp>

A personal blog about C++ programming and Boost libraries

Neoclassical C++: segmented iterators revisited (1)

Introduction to segmented iterators

The legendary STL library has been an inspiration in the development of the C++ language and libraries. The understanding of the concepts and code developed by Stepanov, Lee, Musser, Austern, etc. is a treasure trove for any C++ programmer.

Stepanov’s core claim was that abstraction should have near-zero overhead. He argued that generic programming, done correctly, should not impose a noticeable runtime cost (abstraction penalty) compared to hand-written specialized code. This was not just a hope but a design principle of the STL.

Following this philosophy, in 2000 Matt Austern published the paper Segmented Iterators and Hierarchical Algorithms. The paper’s core argument was that many data structures are naturally segmented (e.g. deque), and generic algorithms that ignore that feature — treating every data structure as a uniform range of elements — are unnecessarily inefficient. A new kind of iterator abstraction, in which segmentation is explicit, could make possible to write hierarchical algorithms that exploit segmentation and reduce the abstraction penalty associated with standard iterators.

The classic motivating example is std::deque, which is internally a sequence of fixed-size arrays. Every ++it on a deque iterator may cross a block boundary, so inplicitly std::find or std::fill has to evaluate, on every step, whether the current pointer has reached the end of the current block, an indirection that makes deque iteration measurably slower and also defeats most compilers’ auto-vectorisers.

A segmented iterator could be a two-level structure, allowing an algorithm to operate on each contiguous segment efficiently (for instance, calling memset or a tight inner loop on each chunk) and only handle the segment-to-segment transitions at the outer level.

This paper remained influential but its ideas were never adopted into the C++ standard. On the other hand, some libraries have internally developed those core ideas. Some examples::

Austern’s paper explained

A traditional iterator presents a flat view: increment, decrement, dereference. A segmented iterator additionally admits being decomposed into two coordinates:

  1. a segment iterator that walks the outer sequence of segments (the blocks of a deque, the chunks of a chunk-list, the buckets of a chained hash table, etc.). This iterator is not dereferenceable.
  2. a local iterator that walks the inner range inside one segment. This iterator is dereferenceable.

A hierarchical algorithm is then a generic algorithm that, when given a segmented iterator, dispatches to a two-level loop: an outer loop over the segments and, inside each segment, a simplified algorithm over a higher performance, less segmented, local_iterator that could be further segmented into new, lower-level, segment iterator/local iterator pairs. Ultimately, this recursive segmentation reaches to a non-segmentable, probably highly efficient, local_iterator.

For a deque<T> the picture is the canonical one-level segmentation case: a segment_iterator is essentially walking the block index, and the corresponding local_iterator is walking inside one block.

For a deeper container such as vector<vector<vector<T>>> the same decomposition recurses: the local_iterator produced at the outer level can be itself decomposed in a new segment_iterator (its segments are the middle vectors) and local_iterator, and the local_iterator of the innermost vector<T> is truly flat.

For transformations between iterators, local_iterators, and segment_iterators Austern proposes a segmented_iterator_traits class template that defines the primitives for segmentation-aware algorithms. Its synopsis, transcribed from the paper, is:

//General definition containing a flag identifying the iterator as non-segmented
template <class Iterator>
struct segmented_iterator_traits {
   typedef /* Type::value == false */  is_segmented_iterator;
};

//The traits class is specialized for every segmented iterator type
template <class Iterator>
struct segmented_iterator_traits<Iterator> {
   // is_segmented_iterator::value == true if Iterator is segmented
   typedef /* unspecified */ is_segmented_iterator;

   // A non-dereferenceable iterator that traverses the sequence of segments
   typedef /* unspecified */ segment_iterator;

   // An efficient dereferenceable iterator, traversing within a single segment
   // This could be further decomposed into a lower-level segment and local iterator
   typedef /* unspecified */ local_iterator;

   // Return the segment that contains the current position of it
   static segment_iterator segment(Iterator it);

   // Returns the position within the current segment of it
   static local_iterator local(Iterator it);

   // Constructs higher-level iterator from segment + local
   static Iterator compose(segment_iterator s, local_iterator l);

   // Returns a local iterator to the first element of the segment
   static local_iterator begin(segment_iterator s);

   // Returns a local iterator one past the last element of the segment
   static local_iterator   end(segment_iterator s);
};Code language: C++ (cpp)

With that trait in place, the paper’s canonical example — fill over a segmented range — is written as a three-piece walk: the tail of the first segment, the whole of every interior segment, and the head of the last segment:

template <class SegIt, class T>
void hierarchical_fill(SegIt first, SegIt last, const T& x)
{
   typedef segmented_iterator_traits<SegIt> traits;
   typename traits::segment_iterator sf = traits::segment(first);
   typename traits::segment_iterator sl = traits::segment(last);

   if (sf == sl) {   //Single segment, fast path
      std::fill(traits::local(first), traits::local(last), x);
   }
   else {            //Multi-segment case
      std::fill(traits::local(first), traits::end(sf), x);  //Partial initial
      for (++sf; sf != sl; ++sf)                            //Middle "full" segments
         std::fill(traits::begin(sf), traits::end(sf), x);
      std::fill(traits::begin(sl), traits::local(last), x); //Partial final
   }
}Code language: C++ (cpp)

The three calls to std::fill are flat loops over local_iterator, which can be as efficient as a T*. Austern’s paper claims that ~20% performance improvement could be achieved when compared to the traditional STL-like std::fill implementation.

However, we’ll see that with modern compilers, an efficient local_iterator allows much higher performance levels.

How Boost.Container is experimenting with segmentation

Boost.Container has started an experimental work on segmented algorithms closely following the original proposal. Time will tell if this experimental effort will lead to another Boost library proposal or it will remain as an implementation detail to speed up segmented data structures like deque, hub, etc.

You can find the ongoing work in the Boost.Container repository, including implementation, preliminary tests and very early benchmarks.

Initial experiments follow the original paper implementing algorithms that:

  • tag-dispatch on segmented_iterator_traits<Iter>::is_segmented_iterator.
  • When the iterator is effectively detected as segmented, for simple algorithms (more complex algorithms need more novel approaches), the algorithm decomposes the input range into first segment, middle segments, and last segment, calls itself recursively on each, and ultimately bottoms out on a flat loop over the local_iterator type.
  • When the passed iterator is not detected as segmented the algorithm collapses to a classic loop and behaves like the canonical standard library version.

Note: The multi-segment first-segment + middle-segment + end-segment approach from Austern’s paper can be simplified to single a while loop, but this can downgrade the performance for fixed size segment data structures (such as deque) as compilers can more easily auto-vectorize address ranges whose length is known at compile-time. I’ve found that single while loop with no fixed boundaries (start and end of a segment) is noticeably slower in many cases since the compiler can’t guarantee vectorization can work in every iteration.

template <class SegIt, class T>
void hierarchical_fill(SegIt first, SegIt last, const T& x) {
    typedef segmented_iterator_traits<SegIt> traits;
    typename traits::segment_iterator sf = traits::segment(first);
    typename traits::segment_iterator sl = traits::segment(last);
    typename traits::local_iterator   lf = traits::local(first);

    while (true) {
        typename traits::local_iterator le =
            (sf == sl) ? traits::local(last) : traits::end(sf);
        std::fill(lf, le, x);
        if (sf == sl) break;
        lf = traits::begin(++sf);
    }
}

The goal of this article is to show how these experimental segmented algorithms perform, starting with the classic boost::container::deque<T> container. As explained in the previous blog post, in this deque implementation the block “index” is implemented as an array of T* that point to fixed-size blocks (arrays) of type T. So for Boost’s deque::iterator:

  • segment_iterator is defined as a T** walking between blocks.
  • local_iterator is defined as a plain T* that can efficiently move inside one contiguous block of memory.

Low hanging fruit: Benchmarking single-range, single-pass algorithms with deque

Fortunately, Austern’s fill example can be fully reused on many other similar algorithms. Unbounded algorithms taking length arguments instead of ranges (such as fill_n, generate_n) need a slightly more complicated segmentation handling, but essentially follow Austern’s proposal

Note: Segmentation handling can become much more complicated when the algorithm takes more than one input iterator range and/or output iterator ranges (e.g. std::merge takes two input ranges and an output iterator, with potentially three different segmented iterator types).

Algorithms like reverse that must iterate the range forward and backwards at the same time are also a challenge.

Finally, in some cases, an implementation taking advantage of a recursively segmented iterator (general case) is more complicated than handling a single-level segmentation.

Benchmark highlights (you can find the general benchmark code here):

  • Every test runs against the same boost::container::deque<T> with a fixed block size of 128 elements per block, with size() == 100'000.
  • The per-call cost is averaged over several thousand invocations.
  • The container is filled with contiguous positive values 0, 1, 2, …, N-1 (so all elements are non-negative, distinct and strictly ascending)
  • Where it makes sense the harness probes both a (hit) and a (miss) variant that refers to the answer the algorithm returns.
  • The selected 27 sub-bencharks (16 algorithms with “hit” + “miss” variants where applicable) are summarised in the table below.
Algorithm Hit case Miss case
all_of Predicate x ≥ 0 — true everywhere; must scan every element Predicate x ≠ N/2 the algorithm short-circuits in the middle.
any_of Predicate x == N/2 short-circuits in the middle. Predicate x < 0 full pass before answering “no”.
count Value = 0 counts a single hit but still scans the whole deque. Value = -1 — never present, scans the whole deque.
count_if Predicate is_odd(x) — true for half the elements; full pass. Predicate x < 0 — never true; full pass.
fill Overwrites all N elements with T(42).
fill_n Overwrites the first N elements with T(42).
find Value = N/2 — present at index N/2; short-circuits in the middle. Value = -1 — never present; full pass.
find_if Predicate x == N/2 — true at index N/2; short-circuits in the middle. Predicate x < 0 — never true; full pass.
find_if_not Predicate x ≠ N/2 — false at index N/2; short-circuits in the middle. Predicate x ≥ 0 — true everywhere; full pass before reporting “no element fails the predicate”.
for_each Applies a summer functor that accumulates an int sum over the whole deque.
generate Overwrites every element with a counter generator producing 0, 1, 2, ….
generate_n Same as generate, invoked through the _n overload
is_partitioned Predicate x < 0 on the unmodified, all-non-negative deque, the algorithm walks the whole deque to confirm. Predicate x < 0 on a copy with -1 injected at index N/2 short-circuits there.
none_of Predicate x < 0 — never true; full pass before answering “yes, none of them”. Predicate x == N/2 — true at index N/2; short-circuits in the middle.
replace The deque is first preprocessed so every odd index holds -1; replace(-1 → -2) then rewrites half the elements (many stores). replace(-1 → -2) on the unmodified deque — -1 is never present, so the pass performs zero stores.
replace_if Predicate is_odd(x) — true for half the elements; full pass with one store every two elements. Predicate x < 0 — never true; full pass with zero stores.

Two element types are used:

  • MyInt — wraps a single int (4 bytes, trivially-comparable).
  • MyFatInt — wraps eight ints (32 bytes), comparing only its first field.

The algorithm is executed in three modes:

  • seg (SEGmented): Calls the Boost implementation and follows the segmented path (segmented_iterator_traits<iterator>::is_segmented_iterator::value == true)
  • nsg (Non SeGmented): Calls the Boost implementation but does not follow the segmented path (segmented_iterator_traits<iterator>::is_segmented_iterator::value == false), it uses a simple, classic STL-like loop.
  • std (STanDard): Calls the std:: algorithm

For every algorithm and every value type the benchmark prints three columns:

Column What it isolates Meaning of ratio > 1.0
nsg/seg Same algorithm, same compiler, same iterator object advertising or not segmentation Boost segmented path is X times faster than the Boost non-segmented path
std/seg How the segmented version compares to the platform’s stock implementation. Boost segmented path is X times faster than the std algorithm
std/nsg How the platform’s stock implementation compares to the same flat loop without the segmentation tag. Boost non-segmented path is X times faster than the std algorithm

Tested compilers / standard libraries:

  • MSVC 2022 — Toolset v14.44 (VS 2022), Microsoft STL — Windows, /O2 /std:c++20.
  • MSVC 2026 — Toolset v14.51 (VS 2026), Microsoft STL — Windows, /O2 /std:c++20
  • Clang-cl 20 — LLVM bundled with VS 2026, Microsoft STL — Windows, /O2 /std:c++20 /EHsc.
  • GCC 16 — WSL Ubuntu 26.04, libstdc++ -O3 -std=c++20.
  • Clang 22 + libstdc++ — WSL Ubuntu 26.04, libstdc++, -O3 -std=c++20.
  • Clang 22 + libc++ — WSL Ubuntu 26.04, libc++ -O3 -std=c++20 -stdlib=libc++.

For each compiler we build the same translation unit twice, first one making sure no loop unroll hint is used, and second one suggesting unrolling (a #pragma unroll(4) or compiler-specific equivalent injected at the top of every per-segment inner loop).

Configuration A: No inner-loop unroll hint

In this configuration the segmented algorithm and local algorithm loops implemented by Boost.Container do not emit any #pragma unroll hint inside the per-segment inner loop. The compiler is free to vectorise or unroll on its own.

Let’s see the geometrical mean over the 27 sub-bencharks (16 algorithms with “hit” + “miss” variants where applicable):

T = MyInt (4 bytes):

Compiler nsg/seg std/seg std/nsg
MSVC 2022 1.71 1.66 0.97
MSVC 2026 5.93 5.96 1.01
clang-cl 2026 3.99 3.70 0.93
GCC 16.0 1.96 1.92 0.98
Clang 22.1 + libstdc++ 4.11 3.35 0.82
Clang 22.1 + libc++ 4.09 3.99 0.98

T = MyFatInt (32 bytes):

Compiler nsg/seg std/seg std/nsg
MSVC 2022 1.40 1.37 0.98
MSVC 2026 3.31 2.83 0.86
clang-cl 2026 1.51 1.43 0.95
GCC 16.0 1.25 1.22 0.98
Clang 22.1 + libstdc++ 1.37 1.18 0.86
Clang 22.1 + libc++ 1.33 1.31 0.99

A per-compiler breakdown is given in the Annex so each compiler can be inspected on its own; for now we focus on the geomeans and on the broad shape of the per-algorithm distribution.

This results shows huge differences between compilers:

  • Generally the Boost non-segmented implementation is close to the std algorithm performance (except for the libstdc++ overload, more about this soon).
  • Segmented algorithms achieve good or great performance boost, higher than the results Austern measured back in time. Probably processor and memory architecture changes have made the gap between flat and segmented iterations bigger since the year 2000.
  • MSVC 2026 with the recent v14.51 toolset the segmented overload runs 5.9× faster than the flat fallback on MyInt and 3.3× faster on MyFatInt.
  • Clang variants are also big winners with the small int type.
  • Looking at the per-algorithm column makes it clear: when there is no longer the block boundary check that is being amortised, compilers with aggresive auto-vectorization and/or interleaving can take advantage of SIMD instructions obtaining huge performance gains.
  • The results clearly show that Clang autovectorization is more aggresive than GCC’s and recent MSVC toolset improvements (VS 2022 vs 2026) can lead to dramatic performance differences for simple types.

The individual algorithms with the most speedup for segmented vs non-segmented on MSVC 2026 (MyInt) are:

Algorithm nsg/seg (MSVC 2026)
fill 17.16
count(miss) 9.31
count(hit) 9.30
find(miss) 7.69
find(hit) 7.68
none_of(miss) 7.22
any_of(hit) 7.21
all_of(miss) 7.21
for_each 7.20

A 17× speedup on fill is a textbook SIMD store: the segmented inner loop is T* writing a constant into a known number of contiguous ints, which can be lowered to wide stores. The non-segmented version cannot prove the loop walks a contiguous buffer and ends up producing scalar assembly code.

Inside the segmented algorithm the final inner loop walks a T* over a contiguous block of 128 elements, with a known trip count and no indirection. Auto-vectorisers turns that into N-elements-per-iteration SIMD. For example, these SIMD enabled algorithms shine with Clang for the small MyInt case:

Clang nsg/seg clang-cl 2026 Clang + libstdc++ Clang + libc++
count(hit) 10.33 11.31 11.44
count(miss) 10.37 11.31 11.45
count_if(hit) 9.80 9.50 8.00
count_if(miss) 9.53 8.53 7.56
fill 10.13 8.80 7.89
generate 9.41 8.96 8.98
for_each 5.42 6.43 6.39

Configuration B: Inner-loop unroll hint

This configuration injects #pragma unroll(4) (or the compiler-specific equivalent) at the top of every per-segment inner loop. The intent is to give the back-end a stronger hint that the inner loop is a hot, count-bounded scan. Same code otherwise, same benchmark, same compilers:

Geomean over all 27 sub-bencharks (16 algorithms with “hit” + “miss” variants where applicable):

T = MyInt (4 bytes):

Compiler nsg/seg std/seg std/nsg
MSVC 2022 1.71 1.66 0.97
MSVC 2026 5.91 5.91 1.00
clang-cl 2026 2.20 2.95 1.34
GCC 16.0 3.13 3.06 0.98
Clang 22.1 + libstdc++ 3.10 3.33 1.08
Clang 22.1 + libc++ 3.11 4.10 1.32

T = MyFatInt (32 bytes):

Compiler nsg/seg std/seg std/nsg
MSVC 2022 1.37 1.38 1.00
MSVC 2026 3.40 2.91 0.86
clang-cl 2026 1.30 1.47 1.13
GCC 16.0 1.38 1.36 0.98
Clang 22.1 + libstdc++ 1.12 1.18 1.05
Clang 22.1 + libc++ 1.12 1.34 1.20

The picture is very different from the no-hint configuration:

  • MSVC 2022 and MSVC 2026 esentially ignore the unroll hint.
  • GCC 16 is the surprise this time: the hint helps, and by a lot. MyInt goes from 1.96× to 3.13× (+ 60 %), MyFatInt from 1.25× to 1.38×.
  • Clang 22 regresses on MyInt, so it seems that hand-unrolled or pragma-based explicit unrolling interferes with the optimizer.
  • The std/seg and std/nsg columns do differ between the two Clang configurations under the unroll hint: with libc++ the flat fallback pulls ~32 % ahead of libc++’s std:: on MyInt and also suggests libstdc++ already hand-unrolls or pragma-unrolls some std algorithms.

In these tests unrolling only helps GCC and regresses Clang, so a portable implementation should avoid using manual and/or pragma-based unrolling on Clang.

Final Conclusion

The whole story in two pictures — segmentation gain (geomean of nsg/seg across all 27 benchmarks) for each compiler in both configurations, shown separately for each value type:

We can summarize these benchmarks on segmented-iterator algorithms using deque in three key findings:

  1. For simple, single-pass algorithms, the segmented algorithm can yield up to big (5.93 faster geomean) speedups, with huge (17x) gains in some corner cases. These speedups vary by compiler and element type, with the largest gains on small trivially-comparable types where the compiler can auto-vectorize. Compilers are increasingly better taking advantage of segmented algorithms when the local_iterator can be simplified to a raw pointer.
  2. The standard library performance is similar to the non-segmented Boost algorithm, is not compiler-dependent. Of course, some standard library implementations like libstdc++ ship hand-unrolled or pragma-unrolled scans for selected algorithms that can beat the Boost escalar loop fallback, while libc++ uses textbook loops that tie it.
  3. Unroll hints or hand-unrolled loops are a mixed bag. Modern MSVC is indifferent, GCC 16 can certainly benefit from it, Clang regresses. The right default is therefore compiler-dependent and will certainly change when compiler-optimizers evolve in the future.

Few things in our field age as gracefully as a paper that names a structural truth and lets later generations catch up to it. Matt Austern’s Segmented Iterators and Hierarchical Algorithms is one of those papers, it identified a mismatch and proposed a technique elegant enough to still feel modern a quarter century later.

That’s exactly the gift a well-aimed abstraction gives you: it ages into the hardware improvements rather than against it. It’s also a reminder that revisiting the classic C++ literature — Stepanov, Austern, Musser, Lee — is rarely an exercise in nostalgia.










Annex: Per-compiler results

The two configurations (with and without unroll hint) are easier to read one compiler at a time. Each compiler section below shows two charts — one for T = MyInt, one for T = MyFatInt — both using the same layout: 27 benchmarks on the X axis, paired bars for the no-unroll-hint and with-unroll-hint configurations of that compiler. The Y axis is not shared between charts — the goal here is to see each compiler’s own internal ranking and the pragma-on-vs-pragma-off contrast.

MSVC 2022 (toolset v14.44)

The flat baseline. Most algorithms sit in a narrow 1.3×–2.2× band on MyInt and 1.1×–1.9× on MyFatInt; only fill stands out (4.5×) because it is a pure store loop the v14.44 vectoriser does manage to recognise. The pragma is essentially a no-op on this toolset.

MSVC 2026 (toolset v14.51)

The big change. On MyInt the v14.51 vectoriser kicks in for the single-pass scans (count, find, for_each, none_of, …), pushing every one of those bars to ≥ 6.5×, and fill to 17.2×. On MyFatInt most algorithms move into the 4×–6× band as well, something no other compiler tested manages on the fat type. Adding the pragma barely changes anything (5.93 → 5.91 geomean on MyInt, 3.31 → 3.40 on MyFatInt); the v14.51 back-end already commits to the SIMD strategy without needing the hint, which is the opposite of what happens on the LLVM-based toolchains where the same hint gives back some of the SIMD gain.

clang-cl 2026 (LLVM 20 from VS 2026)

Without the hint, clang-cl looks very much like Linux Clang on MyInt: the family of vectorisable scans (count*, fill, generate, find*) dominates with ratios around 9–10×. Adding the unroll hint collapses most of those to ≈ 1.4×–3×; the one algorithm that survives mostly unscathed is generate (the LLVM 20.1 vectoriser still recognises it after the unroll). On MyFatInt the no-hint configuration sees a moderate, even distribution; with the hint everything compresses to the 1.0–1.6× band.

GCC 16

GCC 16 is the surprise of the matrix: the unroll hint is a clear win. Without the hint, GCC 16 only auto-vectorises the heavy “plain store / load” algorithms (fill 7.7×, for_each 4.7×, generate* ≈ 4.5×) and leaves the lighter scans (all_of, any_of, find*, none_of, count) in the 1.2×–2.9× band. Adding the hint lifts all the lighter scans into the 2.3×–3.2× band while leaving the already-vectorised algorithms essentially unchanged. The geomean accordingly moves from 1.96× to 3.13× on MyInt (and from 1.25× to 1.38× on MyFatInt) without losing ground on any algorithm.

Clang 22 + libstdc++ (default Ubuntu setup)

Clang 22 still loses some performance when the unroll hint is forced, but much less than older releases. On MyInt the SIMD-heavy algorithms still drop a notch (count(hit) 11.3 → 10.2, fill 8.8 → 7.7, count_if(hit) 9.5 → 7.6) but none of them collapses to the old 2× band; many — generate, find*, find_if* — are unchanged or even slightly improved. On MyFatInt the hint compresses the distribution into the 1.0×–1.4× band, the same shape as clang-cl 2026 with the hint. Clang remains the toolchain where the no-hint default is the safer choice on MyInt, but the choice is no longer dramatic.

Clang 22 + libc++ (-stdlib=libc++)

Same compiler binary as the previous panel, but the standard library is swapped for libc++ 22.1. The shape on MyInt is extremely close to the libstdc++ panel — the SIMD-heavy algorithms (count*, fill, generate, for_each) sit in roughly the same band, and the unroll hint produces the same kind of mild regression on the same algorithms. On MyFatInt the no-hint distribution is also very close to libstdc++ (most algorithms in the 1.0×–1.7× band, geomean 1.33× versus 1.37× for libstdc++); the unroll hint compresses things further into the 1.0×–1.4× band, just like libstdc++. The biggest visible difference between the two stdlibs on the fat type is therefore in the std/seg and std/nsg columns, not in the nsg/seg column — libc++’s textbook <algorithm> headers leave more room for both segmented and non-segmented Boost paths to pull ahead of std::.

Leave a Reply

Discover more from template<class BoostedCpp>

Subscribe now to keep reading and get access to the full archive.

Continue reading