I am trying to find a regular expression in a large memory-mapped file
using the regexec() function. I discovered that the program crashes when the file size
is a multiple of the page size.
Is there a regexec() variant that takes the length of the string
as an additional argument?
Or:
How can I search for a regex in a memory-mapped file?
Here is a minimal example that ALWAYS crashes
(if I run fewer than 3 threads, the program doesn't crash):
ls -la ttt.txt
-rwx------ 1 bob bob 409600 Jun 14 18:16 ttt.txt
gcc -Wall mal.c -o mal -lpthread -g && ./mal
[1] 11364 segmentation fault (core dumped) ./mal
And the program is:
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <stdio.h>
#include <assert.h>
#include <pthread.h>
#include <regex.h>
void* f(void*arg) {
int size = 409600;
int fd = open("ttt.txt", O_RDONLY);
char* text = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);
fd = open("/dev/zero", O_RDONLY);
char* end = mmap(text + size, 4096, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);
close(fd);
assert(text+size == end);
regex_t myre;
regcomp(&myre, "XXXXX", REG_EXTENDED);
regexec(&myre, text, 0, NULL, 0);
regfree(&myre);
return NULL;
}
int main(int argc, char* argv[]) {
int n = 10;
int i;
pthread_t t[n];
for (i = 0; i < n; ++i) {
pthread_create(&t[i], NULL, f, NULL);
}
for (i = 0; i < n; ++i) {
pthread_join(t[i], NULL);
}
return 0;
}
P.S.
This is the output from gdb:
gdb ./mal
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/bob/prog/c/mal...done.
(gdb) r
Starting program: /home/srdjan/prog/c/mal
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff77ff700 (LWP 11817)]
[New Thread 0x7ffff6ffe700 (LWP 11818)]
[New Thread 0x7ffff6799700 (LWP 11819)]
[New Thread 0x7fffeffff700 (LWP 11820)]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6799700 (LWP 11819)]
__strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:72
72 ../sysdeps/x86_64/multiarch/../strlen.S: No such file or directory.
(gdb) bt
#0 __strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:72
#1 0x00007ffff78df254 in __regexec (preg=0x7ffff6798e80, string=0x7fffef79b000 'a' <repeats 200 times>..., nmatch=<optimized out>,
pmatch=0x0, eflags=<optimized out>) at regexec.c:245
#2 0x00000000004008e6 in f (arg=0x0) at mal.c:24
#3 0x00007ffff7bc4e9a in start_thread (arg=0x7ffff6799700) at pthread_create.c:308
#4 0x00007ffff78f24bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5 0x0000000000000000 in ?? ()
(gdb)
Celada correctly identifies the problem - the file data does not necessarily include a null terminator.
You could fix the problem by mapping a page of zeroes immediately after the file:
int fd;
char *text;
fd = open("ttt.txt", O_RDONLY);
text = mmap(NULL, 409600, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);
fd = open("/dev/zero", O_RDONLY);
mmap(text + 409600, 4096, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);
close(fd);
(Note that you can close fd immediately after the mmap(), because mmap() adds a reference to the open file description).
You should of course add error-checking to the above. Also, many UNIX systems support a MAP_ANONYMOUS flag which you can use instead of opening /dev/zero (but this is not in POSIX).
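One caveat with MAP_FIXED is that it silently replaces whatever is already mapped at the target address, which matters when several threads are creating mappings at the same time. A minimal sketch of a variant that avoids this, assuming MAP_ANONYMOUS is available: reserve the file length plus one extra page as zero-filled anonymous memory first, then map the file over the front of that reservation. The helper name map_file_nul_terminated and its parameters are made up for illustration and are not from the original post.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: maps `path` read-only so that the file contents are
 * always followed by at least one zero byte. On success returns the mapping,
 * stores the file size in *size_out and the length to pass to munmap() in
 * *map_len_out. Returns NULL on failure (including an empty file). */
char *map_file_nul_terminated(const char *path, size_t *size_out,
                              size_t *map_len_out)
{
    assert(path != NULL && size_out != NULL && map_len_out != NULL);

    long page = sysconf(_SC_PAGESIZE);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (page <= 0 || fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    size_t size = (size_t)st.st_size;
    size_t pg = (size_t)page;
    /* File length rounded up to whole pages, plus one guaranteed zero page. */
    size_t total = (size + pg - 1) / pg * pg + pg;

    /* Step 1: reserve the whole range; anonymous memory reads as zeroes. */
    char *base = mmap(NULL, total, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    /* Step 2: overlay the file. MAP_FIXED only replaces pages we reserved. */
    void *text = mmap(base, size, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);
    close(fd); /* the mapping keeps its own reference to the file */
    if (text == MAP_FAILED) {
        munmap(base, total);
        return NULL;
    }

    *size_out = size;
    *map_len_out = total;
    return base;
}
```

The returned pointer can be handed to regexec() as an ordinary string, and the whole range is released with a single munmap(ptr, map_len).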
The problem is that regexec() is used to match a null-terminated string against the precompiled pattern buffer, but an mmaped file is not necessarily (indeed not usually) null-terminated. Thus, it is looking beyond the end of the file to find a NUL character (0 byte).
You would need a version of regexec() that takes a buffer and a size argument instead of a null-terminated string, but there doesn't appear to be one.
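POSIX itself defines no length-taking variant, but glibc and the BSD C libraries accept a non-standard REG_STARTEND flag for regexec(): the range to search is passed in pmatch[0], and the buffer does not need a terminating NUL. A minimal sketch, assuming your libc's <regex.h> defines REG_STARTEND; the helper name buffer_matches is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` matches somewhere inside
 * buf[0 .. len), 0 if it does not, and -1 on a compile or match error.
 * Relies on the non-POSIX REG_STARTEND extension. */
int buffer_matches(const char *pattern, const char *buf, size_t len)
{
    assert(pattern != NULL && buf != NULL);

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* With REG_STARTEND, pmatch[0] is an input: it delimits the search range. */
    regmatch_t span[1];
    span[0].rm_so = 0;
    span[0].rm_eo = (regoff_t)len;

    int rc = regexec(&re, buf, 1, span, REG_STARTEND);
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

With the mapping from the question this would be called as buffer_matches("XXXXX", text, size). Note that regoff_t may be narrower than size_t (it is an int on glibc), so very large files would have to be searched in pieces.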
Related
I am experimenting with a small program from the post "Distinction between processes and threads in Linux":
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <pthread.h>
void* threadMethod(void* arg)
{
int intArg = (int) *((int*) arg);
int32_t pid = getpid();
uint64_t pti = pthread_self();
printf("[Thread %d] getpid() = %d\n", intArg, pid);
printf("[Thread %d] pthread_self() = %lu\n", intArg, pti);
return NULL;
}
int main()
{
pthread_t threads[2];
int thread1 = 1;
if ((pthread_create(&threads[0], NULL, threadMethod, (void*) &thread1))
!= 0)
{
fprintf(stderr, "pthread_create: error\n");
exit(EXIT_FAILURE);
}
int thread2 = 2;
if ((pthread_create(&threads[1], NULL, threadMethod, (void*) &thread2))
!= 0)
{
fprintf(stderr, "pthread_create: error\n");
exit(EXIT_FAILURE);
}
int32_t pid = getpid();
uint64_t pti = pthread_self();
printf("[Process] getpid() = %d\n", pid);
printf("[Process] pthread_self() = %lu\n", pti);
if ((pthread_join(threads[0], NULL)) != 0)
{
fprintf(stderr, "Could not join thread 1\n");
exit(EXIT_FAILURE);
}
if ((pthread_join(threads[1], NULL)) != 0)
{
fprintf(stderr, "Could not join thread 2\n");
exit(EXIT_FAILURE);
}
return 0;
}
On 64-bit Lubuntu 18.04, I compile it with the same command as in the post:
$ gcc -pthread -o thread_test thread_test.c
I also try to follow what the post says:
By using scheduler locking in gdb, I can keep the program and its threads alive so I can capture what top
but because I am not familiar with gdb, the program runs to completion without pausing (see below). I also tried to set a breakpoint with break 43, but gdb says No line 40 in the current file. What should I do to pause the execution, so that I can use top or ps to examine the threads' pid and tgid?
$ gdb thread_test
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from thread_test...(no debugging symbols found)...done.
(gdb) set scheduler-locking
Requires an argument. Valid arguments are off, on, step, replay.
(gdb) set scheduler-locking on
Target 'exec' cannot support this command.
(gdb) run
Starting program: /tmp/test/pthreads/thread_test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff77c4700 (LWP 4711)]
[New Thread 0x7ffff6fc3700 (LWP 4712)]
[Thread 1] getpid() = 4707
[Thread 1] pthread_self() = 140737345505024
[Process] getpid() = 4707
[Process] pthread_self() = 140737353951040
[Thread 0x7ffff77c4700 (LWP 4711) exited]
[Thread 2] getpid() = 4707
[Thread 2] pthread_self() = 140737337112320
[Thread 0x7ffff6fc3700 (LWP 4712) exited]
[Inferior 1 (process 4707) exited normally]
(gdb)
You have two problems:
you built your program without debugging info (add the -g flag), and
you are trying to set scheduler-locking on before the program has started (that doesn't work).
This should work:
gcc -g -pthread -o thread_test thread_test.c
gdb -q ./thread_test
(gdb) start
(gdb) set scheduler-locking on
However, you must be extra careful with this setting -- simply continuing from this point will get your program to block in pthread_join, as only the main thread will keep running.
The following is an example of using gdb with the posted code to pause everything.
Note: the code was first compiled, to find and fix compile problems, via:
gcc -ggdb -Wall -Wextra -Wconversion -pedantic -std=gnu11 -c untitled.c
It was then compiled and linked via:
gcc -ggdb -Wall -o untitled untitled.c -lpthread
Then I ran it under gdb; the session below shows my inputs and the gdb outputs:
$ gdb untitled
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from untitled...done.
(gdb) br main
Breakpoint 1 at 0x9a5: file untitled.c, line 20.
(gdb) br threadMethod
Breakpoint 2 at 0x946: file untitled.c, line 9.
(gdb) r
Starting program: /home/richard/Documents/forum/untitled
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, main () at untitled.c:20
20 {
(gdb) c
Continuing.
[New Thread 0x7ffff77c4700 (LWP 8645)]
[New Thread 0x7ffff6fc3700 (LWP 8646)]
[Switching to Thread 0x7ffff77c4700 (LWP 8645)]
Thread 2 "untitled" hit Breakpoint 2, threadMethod (arg=0x7fffffffdf4c)
at untitled.c:9
9 int intArg = (int) *((int*) arg);
(gdb)
Then you can (in another terminal window) use ps etc. to display info. However, the thread function will output to stdout the information you might be interested in.
Or you can, in gdb, enter commands like:
(gdb) c
[Process] getpid() = 8641
[Process] pthread_self() = 140737353992000
[Switching to Thread 0x7ffff6fc3700 (LWP 8646)]
Thread 3 "untitled" hit Breakpoint 2, threadMethod (arg=0x7fffffffdf50)
at untitled.c:9
9 int intArg = (int) *((int*) arg);
(gdb) c
....
[Thread 1] getpid() = 8641
[Thread 1] pthread_self() = 140737345505024
....
[Thread 2] getpid() = 8641
[Thread 2] pthread_self() = 140737337112320
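If the goal is only to compare each thread's pid and tgid, a sketch like the following prints the kernel thread ID from inside the program, so the values can be read without pausing anything. It assumes Linux, where syscall(SYS_gettid) is available; current_tid and print_ids are names made up here, not part of the posted code.

```c
#define _GNU_SOURCE /* for syscall() */
#include <assert.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Kernel thread ID of the calling thread: the LWP number gdb prints,
 * and the value ps -L shows in its LWP column. */
long current_tid(void)
{
    return (long)syscall(SYS_gettid);
}

/* Prints the process ID (TGID) next to the caller's thread ID (LWP). */
void print_ids(const char *label)
{
    assert(label != NULL);
    printf("[%s] getpid() = %ld (TGID), gettid = %ld (LWP)\n",
           label, (long)getpid(), current_tid());
}
```

Calling print_ids() from main and from threadMethod shows the difference: in the main thread both numbers are equal, while in the other threads getpid() stays the same and the thread ID matches the LWP number in gdb's [New Thread ... (LWP n)] lines.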
Please see this snippet I wrote that is supposed to simply convert a multibyte string (which it gets from stdin) to a wide string. Having read the mbsrtowcs and mbstate_t documentation from cppreference I thought it was valid:
#include <stdio.h>
#include <wchar.h>
#include <errno.h>
#include <stdlib.h>
#include <error.h>
int main()
{
char *s = NULL; size_t n = 0; errno = 0;
ssize_t sn = getline(&s, &n, stdin);
if(sn == -1 && errno != 0)
error(EXIT_FAILURE, errno, "getline");
if(sn == -1 && errno == 0) // EOF
return EXIT_SUCCESS;
// determine how big should be the allocated buffer
const char* cs = s; mbstate_t st = {0}; // cs to avoid comp. warnings
size_t wn = mbsrtowcs(NULL, &cs, 0, &st);
if(wn == (size_t)-1)
error(EXIT_FAILURE, errno, "first mbsrtowcs");
wchar_t* ws = malloc((wn+1) * sizeof(wchar_t));
if(ws == NULL)
error(EXIT_FAILURE, errno, "malloc");
// finally convert the multibyte string to wide string
st = (mbstate_t){0};
if(mbsrtowcs(ws, &cs, wn+1, &st) == (size_t)-1)
error(EXIT_FAILURE, errno, "second mbsrtowcs");
if(printf("%ls", ws) < 0)
error(EXIT_FAILURE, errno, "printf");
return EXIT_SUCCESS;
}
Yes this works for ASCII strings. BUT the very reason I'm trying to deal with non-ASCII strings is that I would like to support diacritics beyond the ASCII table! And it fails for those. The first call to mbsrtowcs fails with EILSEQ, which would indicate that the multi-byte string is invalid. But oddly enough, inspecting it with gdb, it seems valid! (insofar as gdb displays it correctly). Please see the effects of feeding this snippet a non-ASCII string and gdbing it below:
m#m-X555LJ:~/wtfdir$ gcc -g -o wtf wtf.c
m#m-X555LJ:~/wtfdir$ ./wtf
asa
asa
m#m-X555LJ:~/wtfdir$ ./wtf
ąsa
./wtf: first mbsrtowcs: Invalid or incomplete multibyte or wide character
m#m-X555LJ:~/wtfdir$ gdb ./wtf
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./wtf...done.
(gdb) break 18
Breakpoint 1 at 0x93b: file wtf.c, line 18.
(gdb) r
Starting program: /home/m/wtfdir/wtf
ąsa
Breakpoint 1, main () at wtf.c:18
18 size_t wn = mbsrtowcs(NULL, &cs, 0, &st);
(gdb) p cs
$1 = 0x555555756260 "ąsa\n"
(gdb) c
Continuing.
/home/m/wtfdir/wtf: first mbsrtowcs: Invalid or incomplete multibyte or wide character
[Inferior 1 (process 5612) exited with code 01]
(gdb) quit
If this matters, I'm on Linux, and the locale encoding seems to be UTF8:
m#m-X555LJ:~$ locale charmap
UTF-8
(this is why I expected this to work, trivial programs like printf("ąsa\n"); tend to work for me on Linux but not on Windows)
What am I missing? What am I doing wrong?
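One thing worth checking, offered as a likely cause rather than a confirmed diagnosis: the snippet never calls setlocale(), and a C program starts in the "C" locale regardless of what locale charmap reports in the shell. gdb only shows the raw bytes through the terminal, which is why the string looks fine there. A minimal sketch of the same two-pass conversion with the locale set first; the helper names use_environment_locale and mbs_to_wcs_dup are hypothetical.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Adopt the locale named by the environment (LANG, LC_ALL, ...).
 * Returns 1 on success, 0 if that locale is not available. */
int use_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: converts a NUL-terminated multibyte string to a newly
 * allocated wide string using the current LC_CTYPE locale.
 * Returns NULL on an invalid sequence or allocation failure. */
wchar_t *mbs_to_wcs_dup(const char *s)
{
    assert(s != NULL);

    mbstate_t st;
    const char *src = s;

    /* Pass 1: count the wide characters needed (terminator not included). */
    memset(&st, 0, sizeof st);
    size_t wn = mbsrtowcs(NULL, &src, 0, &st);
    if (wn == (size_t)-1)
        return NULL;

    wchar_t *ws = malloc((wn + 1) * sizeof *ws);
    if (ws == NULL)
        return NULL;

    /* Pass 2: convert for real, starting again from the initial state. */
    src = s;
    memset(&st, 0, sizeof st);
    if (mbsrtowcs(ws, &src, wn + 1, &st) == (size_t)-1) {
        free(ws);
        return NULL;
    }
    return ws;
}
```

In the posted program the equivalent change would be a single setlocale(LC_ALL, "") call at the top of main (with <locale.h> included), before getline() and the first mbsrtowcs().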
I am experiencing a strange problem with the popen and fgets library functions on a Linux system.
Below is a short program demonstrating the problem. It:
Installs a signal handler for SIGUSR1.
Creates a secondary thread to repeatedly send SIGUSR1 to the main thread.
In the main thread, repeatedly executes a very simple shell command via popen(), gets the output via fgets(), and checks to see if the output is of the expected length.
The output is unexpectedly truncated intermittently. Why?
Command-line invocation example:
$ gcc -Wall test.c -lpthread && ./a.out
iteration 0
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
unexpected length: 0
Details of my machine (the program will also compile and run with this online C compiler):
$ cat /etc/redhat-release
CentOS release 6.5 (Final)
$ uname -a
Linux localhost.localdomain 2.6.32-431.17.1.el6.x86_64 #1 SMP Wed May 7 23:32:49 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
# gcc 4.4.7
$ gcc --version
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
# glibc 2.12
$ ldd --version
ldd (GNU libc) 2.12
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
The program:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <pthread.h>
#include <errno.h>
void dummy_signal_handler(int signal);
void* signal_spam_task(void* arg);
void echo_and_verify_output();
char* fgets_with_retry(char *buffer, int size, FILE *stream);
static pthread_t main_thread;
/**
* Prints an error message and exits if the output is truncated, which happens
* about 5% of the time.
*
* Installing the signal handler with the SA_RESTART flag, blocking SIGUSR1
* during the call to fgets(), or sleeping for a few milliseconds after the
* call to popen() will completely prevent truncation.
*/
int main(int argc, char **argv) {
// install signal handler for SIGUSR1
struct sigaction sa, osa;
sa.sa_handler = dummy_signal_handler;
sigemptyset(&sa.sa_mask);
sa.sa_flags = 0;
sigaction(SIGUSR1, &sa, &osa);
// create a secondary thread to repeatedly send SIGUSR1 to main thread
main_thread = pthread_self();
pthread_t spam_thread;
pthread_create(&spam_thread, NULL, signal_spam_task, NULL);
// repeatedly execute simple shell command until output is unexpected
unsigned int i = 0;
for (;;) {
printf("iteration %u\n", i++);
echo_and_verify_output();
}
return 0;
}
void dummy_signal_handler(int signal) {}
void* signal_spam_task(void* arg) {
for (;;)
pthread_kill(main_thread, SIGUSR1);
return NULL;
}
void echo_and_verify_output() {
// run simple command
FILE* stream = popen("echo -n hello", "r");
if (!stream)
exit(1);
// count the number of characters in the output
unsigned int length = 0;
char buffer[BUFSIZ];
while (fgets_with_retry(buffer, BUFSIZ, stream) != NULL)
length += strlen(buffer);
if (ferror(stream) || pclose(stream))
exit(1);
// double-check the output
if (length != strlen("hello")) {
printf("unexpected length: %u\n", length);
exit(2);
}
}
// version of fgets() that retries on EINTR
char* fgets_with_retry(char *buffer, int size, FILE *stream) {
for (;;) {
if (fgets(buffer, size, stream))
return buffer;
if (feof(stream))
return NULL;
if (errno != EINTR)
exit(1);
clearerr(stream);
}
}
If an error occurs on a FILE stream while reading with fgets, the array contents are indeterminate when fgets returns NULL (7.19.7.2 of the C99 spec), so there is no guarantee about whether bytes already read were transferred to the buffer. So if the SIGUSR1 signal arrives during the fgets call and causes an EINTR, it's possible that some characters are lost from the stream.
The upshot is that you can't use stdio functions to read/write FILE objects if the underlying system calls might have recoverable error returns (such as EINTR or EAGAIN), as there's no guarantee the standard library won't lose some data from the buffer when that happens. You can claim that this is a "bug" in the standard library implementation, but it is a bug that the C standard allows.
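One way to sidestep this is to bypass the stdio buffer and read the underlying descriptor directly, retrying the system call yourself when it is interrupted. A minimal sketch, assuming it is acceptable to read from fileno(stream) instead of calling fgets(); the helper name read_until_eof is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reads from fd until end-of-file or until `cap` bytes
 * have been stored, restarting read() whenever a signal interrupts it.
 * Returns the number of bytes stored, or -1 on a real error. */
ssize_t read_until_eof(int fd, char *buf, size_t cap)
{
    assert(buf != NULL);

    size_t total = 0;
    while (total < cap) {
        ssize_t n = read(fd, buf + total, cap - total);
        if (n > 0)
            total += (size_t)n;   /* got data, keep going */
        else if (n == 0)
            break;                /* end of file */
        else if (errno != EINTR)
            return -1;            /* genuine error */
        /* n < 0 with EINTR: nothing was consumed, so just retry */
    }
    return (ssize_t)total;
}
```

With the popen() stream from the question this would be called as read_until_eof(fileno(stream), buffer, sizeof buffer) before any stdio read touches the stream. An interrupted read() either returns the bytes it already has or fails with EINTR without consuming anything, so no data is dropped.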
I want to break when ch is modified. I used watch ch in gdb, but it does not work.
Something like ch=1; will break. Why doesn't read()?
Is it right to use the watch command like this? Or is the read() function special?
Sorry for my English; the code says it all.
file 1.c:
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
const char *const filename = "/etc/passwd";
int main(void)
{
int fd;
int ch;
fd = open(filename, O_RDONLY);
read(fd, &ch, sizeof(int));
printf ("%d\n", ch);
close (fd);
return 0;
}
gcc -g 1.c
debugging:
$ gdb a.out
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/zodiac1111/tmp/a.out...done.
(gdb) b main
Breakpoint 1 at 0x80484b5: file 1.c, line 11.
(gdb) r
Starting program: /home/zodiac1111/tmp/a.out
Breakpoint 1, main () at 1.c:11
11 fd = open(filename, O_RDONLY);
(gdb) watch ch
Hardware watchpoint 2: ch
(gdb) c
Continuing.
1953460082
Watchpoint 2 deleted because the program has left the block in
which its expression is valid.
__libc_start_main (main=0x80484ac <main>, argc=1, ubp_av=0xbffff4c4,
init=0x8048530 <__libc_csu_init>, fini=0x8048520 <__libc_csu_fini>,
rtld_fini=0xb7ff0590, stack_end=0xbffff4bc) at libc-start.c:260
260 libc-start.c: No such dir...
(gdb) c
Continuing.
[Inferior 1 (process 9513) exited normally]
For a normal implementation of read(), the write to the memory will be performed directly by the kernel, not by any userspace code. The debugger does not have the mechanisms to put a breakpoint in the kernel, and even if it did, it wouldn't have permission to do so.
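If the aim is just to make the watchpoint fire, one workaround (a sketch, not something from the original post) is to let read() fill a temporary and then assign to the watched variable from user space, since that store is ordinary user code. The helper name read_int_via_temp is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reads one int from fd into a temporary, then copies it
 * to *dst with a normal user-space store. Returns what read() returned;
 * *dst is only written when a full int was read. */
ssize_t read_int_via_temp(int fd, int *dst)
{
    assert(dst != NULL);

    int tmp = 0;
    ssize_t n = read(fd, &tmp, sizeof tmp); /* the kernel writes into tmp */
    if (n == (ssize_t)sizeof tmp)
        *dst = tmp; /* user-space store: a watchpoint on *dst can see this */
    return n;
}
```

In the posted program, read(fd, &ch, sizeof(int)) would become read_int_via_temp(fd, &ch). Alternatively, gdb's catch syscall read stops at the system call itself, without touching the code.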
I know that mmap is a system call, but there must be some wrapper in glibc that does the system call. Yet when I try to use gdb to step through the mmap function in my program, gdb ignores it as it can't find any source file for it (Note I compile my own glibc from source). I can step through other glibc library functions such as printf and malloc but not mmap. I also use the flag -fno-builtin so that gcc doesn't use built in functions. Any help on this will be greatly appreciated.
I don't know what your problem is. It works perfectly fine for me.
Using system libc.so.6, with debug symbols installed:
// mmap.c
#include <sys/mman.h>
int main()
{
void *p = mmap(0, 4096, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
return 0;
}
gcc -g mmap.c
$ gdb -q a.out
Reading symbols from /tmp/a.out...done.
(gdb) start
Temporary breakpoint 1 at 0x40052c: file mmap.c, line 5.
Temporary breakpoint 1, main () at mmap.c:5
5 void *p = mmap(0, 4096, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
(gdb) step
mmap64 () at ../sysdeps/unix/syscall-template.S:82
82 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb)
mmap64 () at ../sysdeps/unix/syscall-template.S:83
83 in ../sysdeps/unix/syscall-template.S
(gdb)
main () at mmap.c:6
6 return 0;
(gdb) q
Using my own glibc build:
gdb -q a.out
Reading symbols from /tmp/a.out...done.
(gdb) start
Temporary breakpoint 1 at 0x40056c: file mmap.c, line 5.
warning: Could not load shared library symbols for linux-vdso.so.1.
Do you need "set solib-search-path" or "set sysroot"?
Temporary breakpoint 1, main () at mmap.c:5
5 void *p = mmap(0, 4096, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
(gdb) step
mmap64 () at ../sysdeps/unix/syscall-template.S:81
81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb)
mmap64 () at ../sysdeps/unix/syscall-template.S:82
82 ret
(gdb)
main () at mmap.c:6
6 return 0;
(gdb) q