Performance issue when using multiple threads with sqlite3 - c

I am writing a program that generates hashes for files in all subdirectories and then puts them in a database or prints them to standard output: https://github.com/cherrry9/dedup
In the latest commit, I added an option for my program to use multiple threads (the THREADS macro).
Here are some benchmarks that I did:
$ test() { /usr/bin/time -p ./dedup / -v 0 -c 2048 -e "/\(proc\|sys\|dev\|run\)"; }
$ make clean all THREADS=1 && test
real 8.03
user 4.34
sys 4.55
$ make clean all THREADS=4 && test
real 3.94
user 7.66
sys 7.42
As you can see, the version compiled with THREADS=4 was about 2 times faster.
Now I will use the second positional argument to specify an sqlite3 database:
$ test() { /usr/bin/time -p ./dedup / test.db -v 0 -c 2048 -e "/\(proc\|sys\|dev\|run\)"; }
$ make clean all THREADS=1 && test
real 20.40
user 7.58
sys 7.29
$ rm test.db
$ make clean all THREADS=4 && test
real 21.86
user 17.17
sys 18.15
The version compiled with THREADS=4 was slower than the version with THREADS=1!
When I pass the second argument, dedup.c executes this code, which inserts hashes into the database:
if (sql != NULL && sql_insert(sql, entry->fpath, hash) != 0) {
// ...
sql_insert uses transactions to prevent sqlite from writing to the database every time I call INSERT:
int
sql_insert(SQL *sql, const char *filename, char unsigned hash[])
{
    int errcode;
    pthread_mutex_lock(&sql->mtx);
    sqlite3_bind_text(sql->stmt, 1, filename, -1, NULL);
    sqlite3_bind_blob(sql->stmt, 2, hash, SHA256_LENGTH, NULL);
    sqlite3_step(sql->stmt);
    SQL_TRY(sqlite3_reset(sql->stmt));
    if (++sql->insertc >= INSERT_LIM) {
        SQL_TRY(sqlite3_exec(sql->database, "COMMIT;BEGIN", NULL, NULL, NULL));
        sql->insertc = 0;
    }
    pthread_mutex_unlock(&sql->mtx);
    return 0;
}
This fragment is executed for every processed file and for some reason it's blocking all threads in my program.
And here's my question: how can I prevent sqlite from blocking threads and degrading the performance of my program?
Here is an explanation of the dedup options, in case you're wondering what the test function is doing:
1st positional argument - directory whose files will be hashed
2nd positional argument - path to the database which will be used by sqlite3
-v level - verbose level (0 means print only errors)
-c nbytes - read nbytes from each file
-e regex - exclude directories that match regex
I'm using serialized mode in sqlite3.
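For reference, serialized mode can be requested per connection with the SQLITE_OPEN_FULLMUTEX flag; this is a simplified sketch of the connection setup, not the exact code from the repo:
/* Open the database with a per-connection ("full") mutex, i.e. serialized mode. */
sqlite3 *db;
if (sqlite3_open_v2("test.db", &db,
                    SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_FULLMUTEX,
                    NULL) != SQLITE_OK) {
    /* handle the error, e.g. print sqlite3_errmsg(db) */
}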

It seems that all your threads use the same database connection and statement objects. Therefore you have a race condition (even in the SERIALIZED threading model), as multiple threads are binding, stepping, and resetting the same statement. Asking "why is it slow" is moot until you fix this problem.
Instead you should wrap your sql_insert with a mutex to guarantee that at most one thread is accessing the database connection:
int
sql_insert(SQL *sql, const char *filename, char unsigned hash[])
{
    pthread_mutex_lock(&sql->mutex);
    // ... actual insert and exec code ...
    pthread_mutex_unlock(&sql->mutex);
    return 0;
}
Then add and initialize that mutex in your SQL structure with pthread_mutex_init.
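The SQL structure isn't shown in full in the question, so the layout below is only illustrative (field names are taken from the snippets above):
typedef struct {
    sqlite3         *database;
    sqlite3_stmt    *stmt;      /* prepared INSERT statement */
    int              insertc;   /* inserts since the last COMMIT */
    pthread_mutex_t  mutex;     /* protects database, stmt and insertc */
} SQL;

/* during setup: */
pthread_mutex_init(&sql->mutex, NULL);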
You'll see the performance boost if your bottleneck is indeed the computation of SHA-256 rather than writing into the database. Otherwise the overhead of this mutex should be negligible and the number of threads will not have a significant effect on the run-time.

Related

Syslog API: How to get subsecond timestamps in C (works in Python3)

I have two test programs that write to syslog, one in C and one in Python3. Here is some sample output (from /var/log/messages):
Dec 9 11:27:55.000 0c6e58933c36 c-logtest[206]: hello
Dec 9 11:27:55.000 0c6e58933c36 c-logtest[206]: world
Dec 9 11:27:59.584 0c6e58933c36 py-logtest[208]: hello
Dec 9 11:27:59.590 0c6e58933c36 py-logtest[208]: world
The milliseconds are always 000 for the c-logtest program, while it evidently works for py-logtest. What am I doing wrong?
c-logtest.c:
#include <syslog.h>
#include <unistd.h> //usleep
int main() {
    openlog("c-logtest", LOG_CONS | LOG_NDELAY, LOG_USER);
    syslog(LOG_INFO, "hello");
    usleep(5000);
    syslog(LOG_INFO, "world");
    closelog();
    return 0;
}
py-logtest.py
#!/usr/bin/env python3
import time
import logging
import logging.handlers
logger = logging.getLogger('')
handler = logging.handlers.SysLogHandler(address = '/dev/log')
handler.setFormatter(logging.Formatter('py-logtest %(message)s'))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("hello")
time.sleep(0.005)
logger.info("world")
I'm using syslog-ng, which I've configured to produce millisecond resolution timestamps, by adding this to syslog-ng.conf:
options{ frac-digits(3); };
Tip: It is possible to reproduce this in an isolated manner with docker run --rm -it fedora bash, and from there, install and configure syslog-ng, run the two programs, and tail -F /var/log/messages.
According to this thread, the glibc implementation of the syslog API does not generate sub-second timestamp precision.
What you can probably do is use the keep-timestamp(no) syslog-ng option. It makes syslog-ng ignore the timestamp sent along with the message and use the time of message reception instead. Whether this is acceptable or not depends on your use case; in most cases, when syslog runs locally, it should not be a problem. There is the following warning in the documentation, though:
To use the S_ macros, the keep-timestamp() option must be enabled (this is the default behavior of syslog-ng PE).
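For example, in syslog-ng.conf this could look something like the following (keeping the frac-digits(3) option from the question):
options { keep-timestamp(no); frac-digits(3); };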
I did my own log implementation, which does this:
(C++)
#include <fstream>
#include <iomanip>    // std::put_time
#include <ctime>      // std::localtime
#include <cstdio>     // snprintf
#include <sys/time.h> // gettimeofday

static void writeTimestamp(std::ofstream& out)
{
    struct timeval now;
    gettimeofday(&now, nullptr);
    out << std::put_time(std::localtime(&now.tv_sec), "%F %T.");
    char usecbuf[6 + 1];
    snprintf(usecbuf, sizeof(usecbuf), "%06ld", (long)now.tv_usec);
    out << usecbuf;
}
For a complete solution, I would need to reimplement the syslog library, but I haven't.
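For the record, a rough sketch of what such a reimplementation might look like: write to /dev/log yourself over a Unix datagram socket and put a millisecond-resolution ISO 8601 timestamp into the message. Whether the fractional part is actually honoured depends on the daemon's parser, so treat this as an experiment rather than a drop-in syslog() replacement:
/* hires-log.c -- send a syslog message with a millisecond timestamp */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/un.h>
#include <time.h>
#include <unistd.h>

static void log_hires(const char *tag, const char *msg)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX, .sun_path = "/dev/log" };
    struct timeval tv;
    struct tm tm;
    char stamp[32], packet[1024];
    int fd, len;

    fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd == -1)
        return;

    gettimeofday(&tv, NULL);
    localtime_r(&tv.tv_sec, &tm);
    strftime(stamp, sizeof(stamp), "%Y-%m-%dT%H:%M:%S", &tm);

    /* <14> = facility LOG_USER, severity LOG_INFO */
    len = snprintf(packet, sizeof(packet), "<14>%s.%03ld %s[%d]: %s",
                   stamp, (long)(tv.tv_usec / 1000), tag, (int)getpid(), msg);
    sendto(fd, packet, (size_t)len, 0, (struct sockaddr *)&addr, sizeof(addr));
    close(fd);
}

int main(void)
{
    log_hires("c-logtest", "hello");
    usleep(5000);
    log_hires("c-logtest", "world");
    return 0;
}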

mlock a program from a wrapper

Just a quick question (I hope). How would you allocate an address space via mlock and then launch an application within that space?
For instance I have a binary that launches from a wrapper program that configures the environment. I only have access to the wrapper code and would like to have the binary launch in a certain address space. Is it possible to do this from the wrapper?
Thanks!
If you have the sources for the program, add a command-line option so that the program calls mlockall(MCL_CURRENT | MCL_FUTURE) at some point. That locks it in memory.
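A minimal sketch of that (the flag name here is made up):
/* hypothetical --lock-memory option: pin current and future pages in RAM */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(int argc, char *argv[])
{
    if (argc > 1 && strcmp(argv[1], "--lock-memory") == 0) {
        if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1) {
            perror("mlockall");
            return 1;
        }
    }
    /* ... rest of the program ... */
    return 0;
}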
If you want to control the address spaces the kernel loads the program into, you need to delve into kernel internals. Most likely, there is no reason to do so; only people with really funky hardware would.
If you don't have the sources, or don't want to recompile the program, then you can create a dynamic library that executes the command, and inject it into the process via LD_PRELOAD.
Save the following as lockall.c:
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>
static void wrerr(const char *p)
{
    if (p) {
        const char *q = p + strlen(p);
        ssize_t n;
        while (p < q) {
            n = write(STDERR_FILENO, p, (size_t)(q - p));
            if (n > 0)
                p += n;
            else if (n != -1 || errno != EINTR)
                return;
        }
    }
}

static void init(void) __attribute__((constructor));
static void init(void)
{
    int saved_errno = errno;
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1) {
        const char *errmsg = strerror(errno);
        wrerr("Cannot lock all memory: ");
        wrerr(errmsg);
        wrerr(".\n");
        exit(127);
    } else
        wrerr("All memory locked.\n");
    errno = saved_errno;
}
Compile it to a dynamic library liblockall.so using
gcc -Wall -O2 -fPIC -shared lockall.c -Wl,-soname,liblockall.so -o liblockall.so
Install the library somewhere typical, for example
sudo install -o 0 -g 0 -m 0664 liblockall.so /usr/lib/
so you can run any binary, and lock it into memory, using
LD_PRELOAD=liblockall.so binary arguments..
If you install the library somewhere else (not listed in /etc/ld.so.conf), you'll need to specify path to the library, like
LD_PRELOAD=/usr/lib/liblockall.so binary arguments..
Typically, you'll see the message Cannot lock all memory: Cannot allocate memory. printed by the interposed library, when running commands as a normal user. (The superuser, or root, typically has no such limit.) This is because for obvious reasons, most Linux distributions limit the amount of memory an unprivileged user can lock into memory; this is the RLIMIT_MEMLOCK resource limit. Run ulimit -l to see the per-process resource limits currently set (for the current user, obviously).
I suggest you set a suitable limit on how much memory the process can lock, by running e.g. the ulimit -l 16384 bash built-in before executing the program (to set the limit to 16384*1024 bytes, or 16 MiB), if running as superuser (root). Then, if the process leaks memory, it will die (from SIGSEGV) when it exceeds the limit, instead of crashing your machine because it locked all available memory. That is, you'd start your process using
ulimit -l 16384
LD_PRELOAD=/usr/lib/liblockall.so binary arguments..
if using Bash or dash shell.
If running as a dedicated user, most distributions use the pam_limits.so PAM module to set the resource limits "automatically". The limits are listed either in the /etc/security/limits.conf file, or in a file in the /etc/security/limits.d/ subdirectory, using this format; the memlock item specifies the amount of memory each process can lock, in units of 1024 bytes. So, if your service runs as user mydev, and you wish to allow the user to lock up to 16 megabytes = 16384*1024 bytes per process, then add the line mydev - memlock 16384 to /etc/security/limits.conf or /etc/security/limits.d/mydev.conf, whichever your Linux distribution prefers/suggests.
Prior to PAM, shadow-utils were used to control the resource limits. The memlock resource limit is specified in units of 1024 bytes; a limit of 16 megabytes would be set using M16384. So, if using shadow-utils instead of PAM, adding the line mydev M16384 (followed by whatever other limits you wish to specify) to /etc/limits should do the trick.

Causing malloc() to return NULL on CentOS

I'm about to teach an introductory computer science course in C and I'd like to demonstrate to students why they should check whether malloc() returned a NULL. My plan was to use ulimit to restrict the amount of available memory such that I could exercise different code paths with different limits. Our prescribed environment is CentOS 6.5.
My first attempts to make this happen failed and the shell showed "Killed". This led me to discover the Linux OOM killer. I have since tried to figure out the magic set of incantations that will produce the results I'm looking for. Apparently I need to mess with:
/etc/sysctl.conf
ulimit -m
ulimit -v
vm.overcommit_memory (which apparently should be set to 2, according to an Oracle article)
Thus far I get either "Killed" or a segmentation fault, neither of which is the expected outcome. The fact that I'm getting "Killed" with vm.overcommit_memory=2 means that I definitely don't understand what's going on.
If anyone can find a way to artificially and reliably create a constrained execution environment on CentOS so that students learn how to handle OOM (and other?) kinds of errors, many course instructors will thank you.
It is possible to [effectively] turn off overcommitting from kernel >= 2.5.30.
Following Linux Kernel Memory:
// save your work here and note your current overcommit_ratio value
# echo 2 > /proc/sys/vm/overcommit_memory
# echo 1 > /proc/sys/vm/overcommit_ratio
This sets vm.overcommit_memory to 2, telling the kernel not to overcommit past the overcommit_ratio, which is set to 1 (i.e. essentially no overcommitting).
Null malloc demo
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    void *page = 0;
    void *pages[256];
    int index;

    index = 0;
    while (1) {
        page = malloc(1073741824); // 1 GB
        if (!page)
            break;
        pages[index] = page;
        ++index;
        if (index >= 256)
            break;
    }

    if (index >= 256) {
        printf("allocated 256 pages\n");
    }
    else {
        printf("memory failed at %d\n", index);
    }

    while (index > 0) {
        --index;
        free(pages[index]);
    }

    return 0;
}
Output
$ cat /proc/sys/vm/overcommit_memory
0
$ cat /proc/sys/vm/overcommit_ratio
50
$ ./code/stackoverflow/test-memory
allocated 256 pages
$ su
# echo 2 > /proc/sys/vm/overcommit_memory
# echo 1 > /proc/sys/vm/overcommit_ratio
# exit
exit
$ cat /proc/sys/vm/overcommit_memory
2
$ cat /proc/sys/vm/overcommit_ratio
1
$ ./code/stackoverflow/test-memory
memory failed at 0
Remember to restore your overcommit_memory to 0 and overcommit_ratio to the value you noted.
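For example, with the default values shown in the output above:
# echo 0 > /proc/sys/vm/overcommit_memory
# echo 50 > /proc/sys/vm/overcommit_ratio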

Dynamic generation of file contents (poor man's proc file)

I'm trying to make a simple userspace program that dynamically generates file contents when a file is read, much like a virtual filesystem. I know there are programs like FUSE, but they seem a bit heavy for what I want to do.
For example, a simple counter implementation would look like:
$ cat specialFile
0
$ cat specialFile
1
$ cat specialFile
2
I was thinking that specialFile could be a named pipe, but I haven't had much luck. I was also thinking select may help here, but I'm not sure how I would use it. Am I missing some fundamental concept?
#include <stdio.h>

int main(void)
{
    char stdoutEmpty;
    char counter = 0;

    while (1) {
        if (stdoutEmpty = feof(stdout)) { // stdout is never EOF (empty)?
            printf("%d\n", counter++);
            fflush(stdout);
        }
    }
    return 0;
}
Then usage would be something like:
shell 1 $ mkfifo testing
shell 1 $ ./main > testing
shell 2 $ cat testing
# should be 0?
shell 2 $ cat testing
# should be 1?
You need to use FUSE. A FIFO will not work, because either your program keeps pushing content to stdout (in which case cat will never stop), or it closes stdout, in which case you obviously can't write to it anymore.
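For illustration, a minimal counter filesystem written against the FUSE 2.x high-level API could look roughly like this; it is only a sketch (names such as counterfs and fs_* are made up), but it shows the general shape: each open() regenerates the file contents and bumps the counter, so every cat sees the next value.
/* counterfs.c -- build with: gcc -Wall counterfs.c `pkg-config fuse --cflags --libs` -o counterfs */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int counter = 0;                 /* value served by the next open() */

static int fs_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
        return 0;
    }
    if (strcmp(path, "/specialFile") == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = 32;               /* upper bound; read() returns the real length */
        return 0;
    }
    return -ENOENT;
}

static int fs_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                      off_t offset, struct fuse_file_info *fi)
{
    (void)offset; (void)fi;
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, "specialFile", NULL, 0);
    return 0;
}

static int fs_open(const char *path, struct fuse_file_info *fi)
{
    char *snapshot;

    if (strcmp(path, "/specialFile") != 0)
        return -ENOENT;
    if ((fi->flags & O_ACCMODE) != O_RDONLY)
        return -EACCES;

    /* Generate the contents once per open() and advance the counter. */
    snapshot = malloc(32);
    if (snapshot == NULL)
        return -ENOMEM;
    snprintf(snapshot, 32, "%d\n", counter++);
    fi->fh = (uint64_t)(uintptr_t)snapshot;
    fi->direct_io = 1;                  /* bypass the page cache so every cat triggers read() */
    return 0;
}

static int fs_read(const char *path, char *buf, size_t size, off_t offset,
                   struct fuse_file_info *fi)
{
    const char *snapshot = (const char *)(uintptr_t)fi->fh;
    size_t len = strlen(snapshot);

    (void)path;
    if ((size_t)offset >= len)
        return 0;
    if (offset + size > len)
        size = len - (size_t)offset;
    memcpy(buf, snapshot + offset, size);
    return (int)size;
}

static int fs_release(const char *path, struct fuse_file_info *fi)
{
    (void)path;
    free((void *)(uintptr_t)fi->fh);
    return 0;
}

static struct fuse_operations fs_ops = {
    .getattr = fs_getattr,
    .readdir = fs_readdir,
    .open    = fs_open,
    .read    = fs_read,
    .release = fs_release,
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &fs_ops, NULL);
}
Mount it on an empty directory (mkdir mnt && ./counterfs mnt), then cat mnt/specialFile a few times; fusermount -u mnt unmounts it.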

How to create a random char string of size 64k w/o using rand() or /dev/random in C?

I am searching for a way to create a somewhat random string of 64k size.
But I want this to be fast as well. I have tried the following ways:
a) read from /dev/random -- This is too slow.
b) call rand() or a similar function of my own -- a solution with a few (<10) calls should be OK.
c) malloc() -- On my Linux, the memory region is always all zeroes, instead of some random data.
d) Get some randomness from stack variable addresses/timestamps etc. to initialize the first few bytes, then copy these values over the remaining array in different variations.
Would like to know if there is a better way to approach this.
/dev/random blocks once its pool of random data has been emptied, until it has gathered new entropy. You should try /dev/urandom instead.
rand() should be fairly fast in your C runtime implementation. If you can relax your "random" requirement a bit (accepting lower-quality random numbers), you can generate a sequence of numbers using a tailored implementation of a linear congruential generator. Be sure to choose your parameters wisely (see the Wikipedia entry) to allow additional optimizations.
To generate such a long set of random numbers faster, you could use SSE/AVX and generate four/eight 32-bit random numbers in parallel.
You say "somewhat random" so I assume you do not need high quality random numbers.
You should probably use a "linear congruential generator" (LCG). See Wikipedia for details:
http://en.wikipedia.org/wiki/Linear_congruential_generator
That will require one addition, one multiplication and one mod function per element.
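For a 64 KiB buffer, a rough sketch might look like this (using the well-known "Numerical Recipes" constants and a single read of /dev/urandom for the seed; here the modulus is the implicit 2^32 wrap-around of uint32_t, so no explicit mod shows up in the loop):
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    static unsigned char buf[64 * 1024];
    uint32_t state = 0;
    size_t i;
    FILE *f = fopen("/dev/urandom", "rb");

    /* seed once from the kernel, then never touch /dev/urandom again */
    if (f == NULL || fread(&state, sizeof(state), 1, f) != 1) {
        perror("seeding from /dev/urandom");
        return EXIT_FAILURE;
    }
    fclose(f);

    /* one multiply and one add per byte */
    for (i = 0; i < sizeof(buf); i++) {
        state = state * 1664525u + 1013904223u;
        buf[i] = (unsigned char)(state >> 24);   /* the high bits are the "most random" */
    }

    fwrite(buf, 1, sizeof(buf), stdout);
    return EXIT_SUCCESS;
}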
Your options:
a) /dev/random is not intended to be called frequently. See "man 4 random" for details.
b) rand etc. are like the LCG above but some use a more sophisticated algorithm that gives better random numbers at a higher computational cost. See "man 3 random" and "man 3 rand" for details.
c) The OS deliberately zeros the memory for security reasons. It stops leakage of data from other processes. Google "demand zero paging" for details.
d) Not a good idea. Use /dev/random or /dev/urandom once, that's what they're for.
Perhaps calling OpenSSL routines, something like the programmatic equivalent of:
openssl rand NUM_BYTES | head -c NUM_BYTES > /dev/null
which should run faster than /dev/random and /dev/urandom.
Here's some test code:
/* randombytes.c */
#include <stdlib.h>
#include <stdio.h>
#include <openssl/rand.h>
/*
compile with:
gcc -Wall -lcrypto randombytes.c -o randombytes
*/
int main (int argc, char **argv)
{
    unsigned char *random_bytes = NULL;
    int length = 0;

    if (argc == 2)
        length = atoi(argv[1]);
    else {
        fprintf(stderr, "usage: randombytes number_of_bytes\n");
        return EXIT_FAILURE;
    }

    random_bytes = malloc((size_t)length + 1);
    if (! random_bytes) {
        fprintf(stderr, "could not allocate space for random_bytes...\n");
        return EXIT_FAILURE;
    }

    if (! RAND_bytes(random_bytes, length)) {
        fprintf(stderr, "could not get random bytes...\n");
        return EXIT_FAILURE;
    }

    *(random_bytes + length) = '\0';
    fprintf(stdout, "bytes: %s\n", random_bytes);

    free(random_bytes);
    return EXIT_SUCCESS;
}
Here's how it performs on a Mac OS X 10.7.3 system (1.7 GHz i5, 4 GB), relative to /dev/urandom and OpenSSL's openssl binary:
$ time ./randombytes 100000000 > /dev/null
real 0m6.902s
user 0m6.842s
sys 0m0.059s
$ time cat /dev/urandom | head -c 100000000 > /dev/null
real 0m9.391s
user 0m0.050s
sys 0m9.326s
$ time openssl rand 100000000 | head -c 100000000 > /dev/null
real 0m7.060s
user 0m7.050s
sys 0m0.118s
The randombytes binary is 27% faster than reading bytes from /dev/urandom and about 2% faster than openssl rand.
You could profile other approaches in a similar fashion.
Don't overthink it. dd if=/dev/urandom bs=64k count=1 > random-bytes.bin.
