C++ - What could cause a mutex to misbehave?


I've been busy for the last couple of months debugging a rare crash caused somewhere within a large proprietary C++ image processing library, compiled with GCC 4.7.2 for an ARM Cortex-A9 Linux target. Since a common symptom was glibc complaining about heap corruption, the first step was to employ a heap corruption checker to catch out-of-bounds (OOB) memory writes. I used the technique described in https://stackoverflow.com/a/17850402/3779334 to divert all calls to free/malloc to my own function, padding every allocated chunk of memory with an amount of known data to catch out-of-bounds writes. This found nothing, even when padding with as much as 1 KB before and after every single allocated block (there are hundreds of thousands of allocated blocks due to intensive use of STL containers, so I can't enlarge the padding further, plus I assume any write more than 1 KB out of bounds would eventually trigger a segfault anyway). This bounds checker has found other problems in the past, so I don't doubt its functionality.
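The padding idea above can be illustrated with a minimal sketch. This is not the asker's checker or the code from the linked answer; the names `guarded_malloc`, `guarded_free`, `guards_intact`, `kPad` and `kFill` are hypothetical, and the sketch does not hook the real `malloc`/`free`. It only shows how known bytes on both sides of a block can reveal an out-of-bounds write.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical names throughout; a sketch, not the checker used in the post.
static const std::size_t kPad = 1024;      // guard bytes on each side (1 KB, as in the text)
static const unsigned char kFill = 0xA5;   // arbitrary known pattern

// Header stored in front of the guards; aligned so the user pointer keeps malloc's alignment.
struct alignas(std::max_align_t) GuardHeader {
    std::size_t user_size;   // size requested by the caller
};

// Layout of the underlying block: [GuardHeader][front pad][user data][back pad]
void* guarded_malloc(std::size_t size) {
    unsigned char* raw = static_cast<unsigned char*>(
        std::malloc(sizeof(GuardHeader) + 2 * kPad + size));
    if (!raw) return nullptr;
    reinterpret_cast<GuardHeader*>(raw)->user_size = size;
    unsigned char* front = raw + sizeof(GuardHeader);
    std::memset(front, kFill, kPad);                 // guard before the user data
    std::memset(front + kPad + size, kFill, kPad);   // guard after the user data
    return front + kPad;
}

// Returns true if both guard regions still hold the fill pattern.
bool guards_intact(const void* user_ptr) {
    assert(user_ptr != nullptr);
    const unsigned char* user = static_cast<const unsigned char*>(user_ptr);
    const unsigned char* front = user - kPad;
    const GuardHeader* header =
        reinterpret_cast<const GuardHeader*>(front - sizeof(GuardHeader));
    const unsigned char* back = user + header->user_size;
    for (std::size_t i = 0; i < kPad; ++i) {
        if (front[i] != kFill || back[i] != kFill) return false;
    }
    return true;
}

// Checks the guards, releases the underlying block, and reports whether the guards were intact.
bool guarded_free(void* user_ptr) {
    if (!user_ptr) return true;
    bool ok = guards_intact(user_ptr);
    std::free(static_cast<unsigned char*>(user_ptr) - kPad - sizeof(GuardHeader));
    return ok;
}
```

A check like this only sees writes that land inside the guard regions, which is why a stray write that jumps further than the padding, or one that comes from outside the process's own code, would go unnoticed.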

(Before anyone says 'Valgrind': yes, I have tried that too, with no results either.)

Now, my memory bounds checker also has a feature where it prepends every allocated block with a data struct. These structs are all linked in one long linked list, to allow me to occasionally go over all allocations and test memory integrity. For some reason, even though all manipulations of this list are mutex protected, the list was getting corrupted. When investigating the issue, it began to seem like the mutex itself was occasionally failing to do its job. Here is the pseudocode:

    pthread_mutex_t alloc_mutex;
    static bool boolmutex; // set to false during init. volatile has no effect.

    void malloc_wrapper() {
      // ...
      pthread_mutex_lock(&alloc_mutex);
      if (boolmutex) {
        printf("mutex misbehaving\n");
        __THROW_ERROR__; // this happens!
      }
      boolmutex = true;
      // manipulate linked list here
      boolmutex = false;
      pthread_mutex_unlock(&alloc_mutex);
      // ...
    }
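The pseudocode does not show whether the return values of the pthread calls are checked. As a hedged debugging aid (not something the post says was tried), a mutex can be created with the POSIX `PTHREAD_MUTEX_ERRORCHECK` type, so that misuse from the locking thread itself is reported as an error code instead of passing silently. The helper name `init_errorcheck_mutex` is hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Hypothetical helper: initialise `m` as an error-checking mutex.
// Returns 0 on success, or the pthread error code of the step that failed.
int init_errorcheck_mutex(pthread_mutex_t* m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0) rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

With such a mutex, relocking from the owning thread returns `EDEADLK` and unlocking a mutex the thread does not hold returns `EPERM`, so asserting that every `pthread_mutex_lock` and `pthread_mutex_unlock` call returns 0 rules out that class of misuse. It would not detect a fault below the library, such as a kernel-level problem.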

The code commented "this happens!" is occasionally reached, even though this seems impossible. My first theory was that the mutex data structure was being overwritten. I placed the mutex within a struct, with large arrays before and after it, but when this problem occurred the arrays were untouched, so nothing seems to be overwritten.
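The "mutex between two large arrays" experiment could look roughly like the sketch below. The struct layout follows the description above, but the names (`GuardedMutex`, `guarded_mutex_init`, `canaries_intact`), the array size and the fill byte are assumptions made for illustration; the post only says the arrays were "large".

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <pthread.h>

static const std::size_t kGuardBytes = 4096;   // assumed size; the post only says "large"
static const unsigned char kCanary = 0x5A;     // arbitrary known pattern

// The mutex sits between two arrays that are filled with a known byte.
struct GuardedMutex {
    unsigned char before[kGuardBytes];
    pthread_mutex_t mutex;
    unsigned char after[kGuardBytes];
};

// Fills both arrays with the canary byte and initialises the mutex with default attributes.
int guarded_mutex_init(GuardedMutex* g) {
    std::memset(g->before, kCanary, sizeof g->before);
    std::memset(g->after, kCanary, sizeof g->after);
    return pthread_mutex_init(&g->mutex, NULL);
}

// Returns true if neither array has been modified since initialisation.
bool canaries_intact(const GuardedMutex* g) {
    for (std::size_t i = 0; i < kGuardBytes; ++i) {
        if (g->before[i] != kCanary || g->after[i] != kCanary) return false;
    }
    return true;
}
```

Calling `canaries_intact` at the point where the "mutex misbehaving" branch fires shows whether a stray write swept across the neighbourhood of the mutex. A write that touches only the mutex bytes themselves would still go undetected by this check.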

So, what kind of corruption could possibly cause this to happen, and how would I find and fix the cause?

A few more notes. The test program uses 3-4 threads for processing. Running with fewer threads seems to make the corruptions less common, but they do not disappear. The test runs for about 20 seconds each time and completes successfully in the vast majority of cases (I can have 10 units repeating the test, with the first failure occurring after anywhere from 5 minutes to several hours). When the problem occurs it is quite late in the test (say, 15 seconds in), so this isn't a bad initialization issue. The memory bounds checker never catches actual out-of-bounds writes, but glibc still occasionally fails with a corrupted heap error (can such an error be caused by something other than an OOB write?). Each failure generates a core dump with plenty of trace information; there is no pattern I can see in these dumps, no particular section of code that shows up more than others. This problem seems very specific to a particular family of algorithms and does not happen in other algorithms, so I'm quite certain this isn't a sporadic hardware or memory error. I have done many more tests to check for OOB heap accesses, which I don't want to list so as to keep this post from getting any longer.

Thanks in advance for any help!

Thanks to all commenters. I had tried nearly all suggestions with no results when I finally decided to write a simple memory allocation stress test: one that would run a thread on each of the CPU cores (my unit is a Freescale i.MX6 quad-core SoC), each allocating and freeing memory in random order at high speed. The test crashed with a glibc memory corruption error within minutes or a few hours at most.

Updating the kernel from 3.0.35 to 3.0.101 solved the problem; both the stress test and the image processing algorithm now run overnight without failing. The problem does not reproduce on Intel machines with the same kernel version, so it is specific either to ARM in general or perhaps to some patch Freescale included with the specific BSP version that included kernel 3.0.35.

For those curious, attached is the stress test source code. Set NUM_THREADS to the number of CPU cores and build with:

    <cross-compiler-prefix>g++ -O3 test_heap.cpp -lpthread -o test_heap

I hope this information helps someone. Cheers :)

    // Multithreaded heap stress test. Itay Chamiel 20151012.

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <assert.h>
    #include <pthread.h>
    #include <sys/time.h>

    #define NUM_THREADS 4 // set to number of CPU cores

    #define ALIVE_INDICATOR NUM_THREADS

    // Each thread constantly allocates and frees memory. In each iteration of the infinite loop, decide at random whether to
    // allocate or free a block of memory. A list of 500-1000 allocated blocks is maintained by each thread. When memory is allocated
    // it is added to this list; when freeing, a random block is selected from this list, freed and removed from the list.
    void* thr(void* arg) {
        int* alive_flag = (int*)arg;
        int thread_id = *alive_flag; // this is a number between 0 and (NUM_THREADS-1) given by main()
        int cnt = 0;
        timeval t_pre, t_post;
        gettimeofday(&t_pre, NULL);

        const int ALLOCATE=1, FREE=0;
        const unsigned int MINSIZE=500, MAXSIZE=1000;
        const int MAX_ALLOC=10000;
        char* membufs[MAXSIZE];
        unsigned int membufs_size = 0;

        int num_allocs = 0, num_frees = 0;

        while(1)
        {
            int action;
            // Decide whether to allocate or free a memory block.
            // If we have fewer than MINSIZE buffers, allocate.
            if (membufs_size < MINSIZE) action = ALLOCATE;
            // If we have MAXSIZE, free.
            else if (membufs_size >= MAXSIZE) action = FREE;
            // Else, decide randomly.
            else {
                action = ((rand() & 0x1)? ALLOCATE : FREE);
            }

            if (action == ALLOCATE) {
                // Choose size to allocate, from 1 to MAX_ALLOC bytes
                size_t size = (rand() % MAX_ALLOC) + 1;
                // Allocate and fill memory
                char* buf = (char*)malloc(size);
                memset(buf, 0x77, size);
                // Add buffer to list
                membufs[membufs_size] = buf;
                membufs_size++;
                assert(membufs_size <= MAXSIZE);
                num_allocs++;
            }
            else { // action == FREE
                // Choose a random buffer to free
                size_t pos = rand() % membufs_size;
                assert (pos < membufs_size);
                // Free and remove from list by replacing entry with last member
                free(membufs[pos]);
                membufs[pos] = membufs[membufs_size-1];
                membufs_size--;
                assert(membufs_size >= 0);
                num_frees++;
            }

            // Once in 10 seconds print a status update
            gettimeofday(&t_post, NULL);
            if (t_post.tv_sec - t_pre.tv_sec >= 10) {
                printf("Thread %d [%d] - %d allocs %d frees. Alloced blocks %u.\n", thread_id, cnt++, num_allocs, num_frees, membufs_size);
                gettimeofday(&t_pre, NULL);
            }

            // Indicate alive to main thread
            *alive_flag = ALIVE_INDICATOR;
        }
        return NULL;
    }

    int main()
    {
        int alive_flag[NUM_THREADS];
        printf("Memory allocation stress test running on %d threads.\n", NUM_THREADS);
        // Start a thread for each core
        for (int i=0; i<NUM_THREADS; i++) {
            alive_flag[i] = i; // tell each thread its ID.
            pthread_t th;
            int ret = pthread_create(&th, NULL, thr, &alive_flag[i]);
            assert(ret == 0);
        }

        while(1) {
            sleep(10);
            // Check that all threads are alive
            bool ok = true;
            for (int i=0; i<NUM_THREADS; i++) {
                if (alive_flag[i] != ALIVE_INDICATOR)
                {
                    printf("Thread %d is not responding\n", i);
                    ok = false;
                }
            }
            assert(ok);
            for (int i=0; i<NUM_THREADS; i++)
                alive_flag[i] = 0;
        }
        return 0;
    }
