c# - Why isn't it faster to update a struct array than a class array?


In order to prepare an optimization in an existing software framework, I performed a standalone performance test, to assess the potential gains before spending a large amount of time on it.

The situation

There are n different types of components, some of which implement an IUpdatable interface - those are the interesting ones. They are grouped in m objects, each maintaining a list of components. Updating them currently works like this:

foreach (GroupObject obj in objects)
{
    foreach (Component comp in obj.Components)
    {
        IUpdatable updatable = comp as IUpdatable;
        if (updatable != null)
            updatable.Update();
    }
}

The optimization

My goal is to optimize these updates for large amounts of grouping objects and components. First, I make sure to update all components of one kind in a row, by caching them in one array per kind. Essentially, it boils down to this:

foreach (IUpdatable[] compOfType in typeSortedComponents)
{
    foreach (IUpdatable updatable in compOfType)
    {
        updatable.Update();
    }
}

The thought behind this is that the JIT or the CPU might have an easier time operating on the same object type over and over again than it does in the shuffled version.
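
For reference, the per-type arrays could be built up roughly like this (a simplified sketch keyed by runtime type; it is not the exact caching code, which is in the linked solution):

// Simplified sketch of grouping components into one array per type
// (illustrative only; requires System and System.Collections.Generic).
Dictionary<Type, List<IUpdatable>> byType = new Dictionary<Type, List<IUpdatable>>();
foreach (GroupObject obj in objects)
{
    foreach (Component comp in obj.Components)
    {
        IUpdatable updatable = comp as IUpdatable;
        if (updatable == null)
            continue;

        List<IUpdatable> list;
        if (!byType.TryGetValue(comp.GetType(), out list))
            byType.Add(comp.GetType(), list = new List<IUpdatable>());
        list.Add(updatable);
    }
}

IUpdatable[][] typeSortedComponents = new IUpdatable[byType.Count][];
int index = 0;
foreach (List<IUpdatable> perTypeList in byType.Values)
    typeSortedComponents[index++] = perTypeList.ToArray();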

In the next step, I wanted to further improve the situation by making sure that the data for one component type is aligned in memory - by storing it in a struct array, like this:

foreach (ComponentDataStruct[] compDataOfType in typeSortedComponentData)
{
    for (int i = 0; i < compDataOfType.Length; i++)
    {
        compDataOfType[i].Update();
    }
}
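
Here, ComponentDataStruct is essentially a plain struct holding the component's data. A simplified sketch of what I mean (illustrative, not the exact test code; the actual test does comparable primitive math in Update):

// Simplified sketch of the struct variant (illustrative, not the exact test code).
public struct ComponentDataStruct
{
    public float A, B, C, D; // component data, stored inline in the array

    public void Update()
    {
        // Some primitive math so the call can't be optimized away entirely.
        A = A * 0.99f + B * 0.01f;
        B = B * 0.99f + C * 0.01f;
        C = C * 0.99f + D * 0.01f;
        D = D * 0.99f + A * 0.01f;
    }
}

Since it's a struct array, the data of all elements is laid out back-to-back in one contiguous block, which is the memory alignment I'm after.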

The problem

In my standalone performance tests, there is no significant performance gain from either of these changes, and I'm not sure why. No significant performance gain means: with 10,000 components, each batch running 100 update cycles, all the main tests take around 85 milliseconds +/- 2 milliseconds.

(The remaining difference arises from introducing the as cast and the if check, but that's not what I'm testing for.)

  • All tests were performed in Release mode, without an attached debugger.
  • External disturbances were reduced by using this code:

        currentProc.ProcessorAffinity = new IntPtr(2);
        currentProc.PriorityClass = ProcessPriorityClass.High;
        currentThread.Priority = ThreadPriority.Highest;
  • Each test does some primitive math work, so it's not just measuring empty method calls that could potentially be optimized away.

  • Garbage collection is performed explicitly before each test, to rule out that interference as well (a rough sketch of the overall measurement setup follows this list).
  • The full source code (VS solution, build & run) is available here.
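
For context, the per-test measurement looks roughly like this (a simplified sketch, not the exact harness from the linked solution; the runUpdateBatch parameter stands in for one of the three update variants above):

// Simplified sketch of the per-test measurement
// (requires System, System.Diagnostics and System.Threading).
static void MeasureTest(Action runUpdateBatch)
{
    Process currentProc = Process.GetCurrentProcess();
    currentProc.ProcessorAffinity = new IntPtr(2);          // pin to a single core
    currentProc.PriorityClass = ProcessPriorityClass.High;  // reduce scheduler interference
    Thread.CurrentThread.Priority = ThreadPriority.Highest;

    GC.Collect();                      // explicit collection before the test
    GC.WaitForPendingFinalizers();
    GC.Collect();

    Stopwatch watch = Stopwatch.StartNew();
    for (int cycle = 0; cycle < 100; cycle++)  // 100 update cycles per batch
        runUpdateBatch();
    watch.Stop();
    Console.WriteLine("{0} ms", watch.ElapsedMilliseconds);
}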

I would have expected a significant change due to the memory alignment and the repetition in the update patterns. So, my core question is: Why wasn't I able to measure a significant improvement? Am I overlooking something important? Did I miss something in my tests?

The main reason one might traditionally prefer the latter implementation is because of locality of reference. If the contents of the array fit into the CPU cache, your code runs a lot faster. Conversely, if you have a lot of cache misses, your code runs much more slowly.

Your mistake, I suspect, is that the objects in your first test already have good locality of reference. If you allocate a lot of small objects at once, those objects are likely to be contiguous in memory even though they live on the heap. (I'm looking for a better source on that; I've experienced the same thing anecdotally in my own work.) Even if they aren't contiguous, the GC might be moving them around such that they are. Since modern CPUs have large caches, it may be the case that your entire data structure fits in the L2 cache, since there isn't much else around to compete with it. And even if the cache isn't that large, modern CPUs have gotten very good at predicting usage patterns and prefetching.

It may also be the case that your code has to box/unbox the structs. That seems unlikely, however, if the performance really is so similar.
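
To illustrate the distinction: boxing would only come into play if ComponentDataStruct implemented IUpdatable and you accessed the elements through that interface; the direct indexed call in your snippet does not box. A hypothetical example (assuming the struct implemented the interface):

// Hypothetical: assumes ComponentDataStruct implements IUpdatable.
IUpdatable boxed = compDataOfType[i]; // boxes a copy of the struct onto the heap
boxed.Update();                       // updates the boxed copy, not the array element

compDataOfType[i].Update();           // no boxing: operates on the array element in place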

The big thing with low-level stuff in C# is that you need to either a) trust the framework to do its job, or b) profile under realistic conditions after you've identified a low-level performance issue. I appreciate that this may be a toy project, or that you may be playing around with memory optimisation just for giggles, but the a priori optimisation you've done in your OP is quite unlikely to yield appreciable performance improvements at project scale.

I haven't yet gone through your code in detail, but I suspect your problem here is unrealistic conditions. With more memory pressure, and more dynamic allocation of components, you might see the performance differential you expect. Then again, you might not, which is why it's important to profile.

It's worth noting that if you know in advance that strict manual optimisation of memory locality is critical to the proper functioning of your application, you may need to consider whether a managed language is the correct tool for the job.

Edit: yeah, I think your problem is here:

public static void PrepareTest()
{
    data = new Base[Program.ObjCount]; // 10000
    for (int i = 0; i < data.Length; i++)
        data[i] = new Data(); // Data consists of 4 floats
}

Those 10,000 instances of Data are going to be contiguous in memory. Furthermore, they probably all fit in the cache anyway, so I doubt you'd see any performance impact from cache misses in this test.
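
As a rough back-of-the-envelope check (assuming a 64-bit runtime with roughly 16 bytes of object header per instance): 10,000 × (16 bytes of floats + ~16 bytes of header) is on the order of 300-400 KB, plus about 80 KB for the reference array - easily within the multi-megabyte last-level cache of a modern CPU. If you want to expose cache effects, you would have to deliberately break that contiguity, for example by interleaving the allocations with other garbage and shuffling the iteration order. A rough sketch (not tested against your solution; Base, Data and Program.ObjCount are the names from your code, the rest is illustrative):

// Rough sketch: deliberately destroy locality of reference to expose cache misses.
Random rng = new Random(0);
List<byte[]> padding = new List<byte[]>();
data = new Base[Program.ObjCount];
for (int i = 0; i < data.Length; i++)
{
    data[i] = new Data();
    padding.Add(new byte[256]); // interleave allocations so Data instances aren't adjacent
}
// Shuffle so the iteration order no longer matches the allocation order.
for (int i = data.Length - 1; i > 0; i--)
{
    int j = rng.Next(i + 1);
    Base tmp = data[i];
    data[i] = data[j];
    data[j] = tmp;
}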

