How to retrieve a large file quickly - C

I am working on an application in which I need to compare 10^8 alphanumeric entries. Retrieving the entries from a file (file size is 1.5 GB) and then comparing them must take less than 5 minutes, but retrieval alone is already exceeding 5 minutes. What would be an effective way to do that? I need to work on the file only. Please suggest a way out.
I am working on Windows with 3 GB RAM and a 100 GB hard disk.

Read a part of the file, sort it, write it to a temporary file.
Merge-sort the resulting files.
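A rough sketch of that split/sort/merge idea, assuming fixed-size 16-byte entries; RECORD_LEN, CHUNK_RECORDS, the run_N.tmp names and the sorted.out output name are all illustrative choices, and error handling is pared down:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RECORD_LEN 16            /* assumed fixed entry size */
#define CHUNK_RECORDS (1 << 20)  /* records sorted in memory per run */

static int cmp_record(const void *a, const void *b) {
    return memcmp(a, b, RECORD_LEN);
}

/* Phase 1: split the input into sorted runs, one temp file per run. */
static int make_runs(FILE *in) {
    char *buf = malloc((size_t)CHUNK_RECORDS * RECORD_LEN);
    int nruns = 0;
    size_t got;
    while ((got = fread(buf, RECORD_LEN, CHUNK_RECORDS, in)) > 0) {
        char name[64];
        qsort(buf, got, RECORD_LEN, cmp_record);
        snprintf(name, sizeof name, "run_%d.tmp", nruns);
        FILE *out = fopen(name, "wb");
        fwrite(buf, RECORD_LEN, got, out);
        fclose(out);
        nruns++;
    }
    free(buf);
    return nruns;
}

/* Phase 2: k-way merge of the sorted runs into one output file. */
static void merge_runs(int nruns, FILE *out) {
    FILE **run = malloc(nruns * sizeof *run);
    char (*head)[RECORD_LEN] = malloc((size_t)nruns * RECORD_LEN);
    int *live = calloc(nruns, sizeof *live);
    for (int i = 0; i < nruns; i++) {
        char name[64];
        snprintf(name, sizeof name, "run_%d.tmp", i);
        run[i] = fopen(name, "rb");
        live[i] = fread(head[i], RECORD_LEN, 1, run[i]) == 1;
    }
    for (;;) {
        int best = -1;
        for (int i = 0; i < nruns; i++)        /* pick the smallest current head */
            if (live[i] && (best < 0 || cmp_record(head[i], head[best]) < 0))
                best = i;
        if (best < 0) break;                   /* all runs exhausted */
        fwrite(head[best], RECORD_LEN, 1, out);
        live[best] = fread(head[best], RECORD_LEN, 1, run[best]) == 1;
    }
    for (int i = 0; i < nruns; i++) fclose(run[i]);
    free(run); free(head); free(live);
}

int main(int argc, char **argv) {
    if (argc < 2)
        return 1;
    FILE *in = fopen(argv[1], "rb");
    FILE *out = fopen("sorted.out", "wb");
    if (!in || !out)
        return 1;
    merge_runs(make_runs(in), out);
    fclose(in);
    fclose(out);
    return 0;
}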

Error handling is minimal. You need to provide DataType and cmpfunc; samples are provided. You should be able to deduce the core workings from this snippet:
#define _LARGEFILE64_SOURCE   /* for O_LARGEFILE */
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

typedef char DataType; // is this alphanumeric?

int cmpfunc(const void *left, const void *right)
{
    return *(const DataType *)right - *(const DataType *)left;
}

int main(int argc, char **argv)
{
    int fd = open(argv[1], O_RDWR|O_LARGEFILE);
    if (fd == -1)
        return 1;
    struct stat st;
    if (fstat(fd, &st) != 0)
        return 1;
    /* Map the whole file; the OS pages it in and out as needed. */
    DataType *data = mmap(NULL, st.st_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED)
        return 1;
    /* Sort in place, directly in the mapping. */
    qsort(data, st.st_size / sizeof(*data), sizeof(*data), cmpfunc);
    if (0 != msync(data, st.st_size, MS_SYNC))
        return 1;
    if (-1 == munmap(data, st.st_size))
        return 1;
    if (0 != close(fd))
        return 1;
    return 0;
}
I can't imagine you can get much faster than this. Be sure you have enough virtual address space (1.5 GB is pushing it, but will probably just work on 32-bit Linux; you'll be able to manage this on any 64-bit OS). Note that this code is "limited" to working on a POSIX-compliant system.
In terms of C and efficiency, this approach puts the entire operation in the hands of the OS and the excellent qsort algorithm.

If retrieval time alone is exceeding 5 minutes, it seems you need to look at how you are reading this file. One thing that has caused bad performance for me is that a C implementation sometimes uses thread-safe (locking) I/O operations by default, and you can gain some speed by using the thread-unsafe variants.
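For example, a minimal sketch of that idea using the POSIX getc_unlocked() variant, which skips the per-call stream locking that plain getc() does; the input file name is made up, and this assumes only one thread touches the stream:
#include <stdio.h>

int main(void) {
    FILE *in = fopen("input.txt", "r");      /* hypothetical file name */
    if (!in)
        return 1;
    long count = 0;
    int c;
    while ((c = getc_unlocked(in)) != EOF)   /* no lock taken per character */
        if (c == '\n')
            count++;
    fclose(in);
    printf("%ld lines\n", count);
    return 0;
}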
What kind of computer will this be run on? Many computers nowadays have several gigabytes of memory, so perhaps it will work to just read it all into memory and then sort it there (with, for example, qsort)?

Related

Using MAP_POPULATE for writing to file

My application dumps huge files to disk. For various reasons it is more convenient to use mmap and memory writes than the fwrite interface.
The slow part of writing to a file in this way is the page faults. Using mmap with MAP_POPULATE is supposed to help; from the man page:
MAP_POPULATE (since Linux 2.5.46)
    Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. This will help to reduce blocking on page faults later. MAP_POPULATE is supported for private mappings only since Linux 2.6.23.
(To answer the obvious question: I've tested this on relatively recent kernels, on 4.15 and 5.1).
However, this does not seem to reduce page faults while writing to the mapped file.
Minimal example code: test.c:
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static int exit_with_error(const char *msg) {
    perror(msg);
    exit(EXIT_FAILURE);
}

int main() {
    const size_t len = 1UL << 30;
    const char *fname = "/tmp/foobar-huge.txt";
    int f = open(fname, O_RDWR|O_CREAT|O_EXCL, 0644);
    if (f == -1) {
        exit_with_error("open");
    }
    int ret = ftruncate(f, len);
    if (ret == -1) {
        exit_with_error("ftruncate");
    }
    void *mem = mmap(NULL, len, PROT_WRITE|PROT_READ, MAP_SHARED|MAP_POPULATE, f, 0);
    if (mem == MAP_FAILED) {
        exit_with_error("mmap");
    }
    ret = close(f);
    if (ret == -1) {
        exit_with_error("close");
    }
    memset(mem, 'f', len);
}
When running this under a profiler or with perf stat, it's clearly visible that the memset at the end triggers (many) page faults.
In fact, this program is slower when MAP_POPULATE is passed: on my machine ~1.8 s versus ~1.6 s without MAP_POPULATE. The difference simply seems to be the time it takes to do the populate; the number of page faults that perf stat reports is identical.
A last observation is that this behaves as expected when I read from the file instead of writing -- in that case MAP_POPULATE reduces the number of page faults to almost zero and improves performance drastically.
Is this the expected behavior for MAP_POPULATE? Am I doing something wrong?
Because although the pages have been prefaulted by MAP_POPULATE, they are populated clean, not dirty. When you then write to them, the kernel still takes a write-protect fault on each page so it can mark the page table entry dirty and writable, which is why the fault count does not drop.
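If you want to see this for yourself, here is a small sketch that counts minor faults around the memset using getrusage(2); the /tmp path and the 256 MiB size are arbitrary choices:
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void) {
    const size_t len = 1UL << 28;        /* 256 MiB test mapping */
    int f = open("/tmp/populate-test.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (f == -1 || ftruncate(f, len) == -1)
        return 1;
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_POPULATE, f, 0);
    if (mem == MAP_FAILED)
        return 1;
    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    memset(mem, 'f', len);               /* the writes under test */
    getrusage(RUSAGE_SELF, &after);
    printf("minor faults during memset: %ld\n",
           after.ru_minflt - before.ru_minflt);
    munmap(mem, len);
    close(f);
    return 0;
}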

C: reading file backwards : why this particular method is not considered good?

In the OS textbook "Operating Systems in Depth" by Thomas W. Doeppner, one of the chapter exercises asks us to find fault with the given code for reading a file's contents backwards, and also asks for a better way to do it. I have come across many ways to do that, but I can't really see why the following is not considered a good way of doing it.
I appreciate your time and help, thank you!
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd;
    off_t fptr;
    fd = open("./file.txt", O_RDONLY);
    char buf[3];
    /* go to last char in file */
    fptr = lseek(fd, (off_t)-1, SEEK_END);
    while (fptr != -1) {
        read(fd, buf, 1);
        write(1, buf, 1);
        fptr = lseek(fd, (off_t)-2, SEEK_CUR);
    }
    return 0;
}
The method illustrated in your code is inefficient because you make 3 system calls for each byte in the file. Furthermore, you do not check the return values of the read() and write() function calls, nor that the file was opened successfully.
To improve efficiency, you should buffer the input/output operations.
Using putchar() instead of write() would be both more efficient and more reliable.
Reading a chunk of file contents (from a few kilobytes to several megabytes) at a time would be more efficient too.
As always, benchmark the resulting code to measure actual performance improvements.
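For illustration, a sketch of a buffered backwards reader along those lines, using pread(2) with a 64 KiB block; the block size is arbitrary and error handling is minimal:
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLOCK 65536

int main(void) {
    int fd = open("./file.txt", O_RDONLY);
    struct stat st;
    if (fd == -1 || fstat(fd, &st) != 0)
        return 1;
    char buf[BLOCK];
    off_t pos = st.st_size;
    while (pos > 0) {
        size_t n = (pos > BLOCK) ? BLOCK : (size_t)pos;   /* bytes in this block */
        pos -= n;
        if (pread(fd, buf, n, pos) != (ssize_t)n)
            return 1;
        for (size_t i = n; i > 0; i--)                    /* emit the block reversed */
            putchar(buf[i - 1]);
    }
    close(fd);
    return 0;
}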

using mmap() to search large file (~1TB)

I'm working on a project that is trying to search for specific bytes (e.g. 0xAB) in a filesystem (e.g. ext2). I was able to find what I needed using malloc(), realloc(), and memchr(), but it seemed slow so I was looking into using mmap(). What I am trying to do is find a specific bytes, then copy them into a struct, so I have two questions: (1) is using mmap() the best strategy, and (2) why isn't the following code working (I get EINVAL error)?
UPDATE: The following program compiles and runs but I still have a couple issues:
1) it won't display correct file size on large files (displayed correct size for 1GB flash drive, but not for 32GB)*.
2) it's not searching the mapping correctly**.
*Is THIS a possible solution to getting the correct size using stat64()? If so, is it something I add in my Makefile? I haven't worked with makefiles much so I don't know how to add something like that.
**Is this even the proper way to search?
#define _LARGEFILE64_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <errno.h>

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

int main(int argc, char **argv) {
    int fd = open("/dev/sdb1", O_RDONLY);
    if (fd < 0) {
        printf("Error %s\n", strerror(errno));
        return -1;
    }
    const char *map;
    off64_t size, i;
    off64_t sb_pos[6];
    int j = 0;

    size = lseek64(fd, 0, SEEK_END);
    printf("file size: %llu\n", (unsigned long long)size);
    lseek64(fd, 0, SEEK_SET);

    map = mmap(0, size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { handle_error("mmap error"); }

    printf("Searching for magic numbers...\n");
    /* note: reads map[i + 1] and map[i - 32] without bounds checks */
    for (i = 0; i < size; i++) {
        if (map[i] == 0X53 && map[i + 1] == 0XEF) {
            if ((map[i-32] == 0X00 && map[i-31] == 0X00) ||
                (map[i-32] == 0X01 && map[i-31] == 0X00) ||
                (map[i-32] == 0X02 && map[i-31] == 0X00)) {
                if (j <= 5) {
                    sb_pos[j] = i;
                    printf("superblock %d found\n", j);
                    ++j;
                } else break;
            }
        }
    }
    int q;
    for (q = 0; q < j; q++) {
        printf("SUPERBLOCK[%d]: %lld\n", q + 1, (long long)sb_pos[q]);
    }
    close(fd);
    munmap((void *)map, size);
    return 0;
}
Thanks for your help.
mmap is a very efficient way to handle searching a large file, especially in cases where there's an internal structure you can use (e.g. using mmap on a large file with fixed-size records that are sorted would permit you to do a binary search, and only the pages corresponding to records read would be touched).
In your case you need to compile for 64 bits and enable large file support (and use open(2)).
If your /dev/sdb1 is a device and not a file, I don't think stat(2) will show an actual size. stat returns a size of 0 for these devices on my boxes. I think you'll need to get the size another way.
Regarding address space: x86-64 uses 2^48 bytes of virtual address space, which is 256 TiB. You can't use all of that, but there's easily ~127 TiB of contiguous address space in most processes.
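As a sketch of "getting the size another way" for a block device on Linux: either lseek to the end of the device, or ask the kernel directly with the BLKGETSIZE64 ioctl; /dev/sdb1 mirrors the question, and error handling is minimal:
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <linux/fs.h>      /* BLKGETSIZE64 */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/sdb1", O_RDONLY);
    if (fd == -1)
        return 1;
    unsigned long long bytes = 0;
    if (ioctl(fd, BLKGETSIZE64, &bytes) == 0)      /* device size in bytes */
        printf("device size via ioctl: %llu bytes\n", bytes);
    off_t end = lseek(fd, 0, SEEK_END);            /* also works for block devices */
    printf("device size via lseek:  %lld bytes\n", (long long)end);
    close(fd);
    return 0;
}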
I just noticed that I was using fopen(), should I be using open() instead?
Yes, you should use open() instead of fopen(). And that's the reason why you got the EINVAL error.
fopen("/dev/sdb1", O_RDONLY);
This code is totally incorrect: O_RDONLY is a flag meant to be used with the open() syscall, not with the fopen() libc function.
You should also note that mmapping large files is only possible if you are running on a platform with a large virtual address space. That's obvious: you need enough virtual memory to address your whole file. Speaking about Intel, that means x86_64 only, not x86_32.
I haven't tried to do this with really large files (>4G). Maybe some additional flags need to be passed to the open() syscall.
I'm working on a project that is trying to search for specific bytes (e.g. 0xAB) in a filesystem (e.g. ext2)
To mmap() a large file into memory is the wrong approach in your case. You just need to process your file step by step in chunks of a fixed size (something around 1 MB). You can use mmap() or just read() into your internal buffer -- that doesn't matter. But putting the whole file into memory is overkill if you just want to process it sequentially.
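A minimal sketch of that chunked approach, reading into a fixed 1 MB buffer and keeping one byte of overlap so a 0x53 0xEF pair spanning two chunks is not missed; the device path and magic values mirror the question, and error handling is minimal:
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK (1 << 20)   /* 1 MiB per read, as suggested above */

int main(int argc, char **argv) {
    int fd = open(argc > 1 ? argv[1] : "/dev/sdb1", O_RDONLY);
    if (fd == -1)
        return 1;
    static unsigned char buf[CHUNK + 1];
    size_t carry = 0;                 /* bytes kept from the previous chunk */
    long long base = 0;               /* file offset of buf[0] */
    ssize_t n;
    while ((n = read(fd, buf + carry, CHUNK)) > 0) {
        size_t avail = carry + (size_t)n;
        for (size_t i = 0; i + 1 < avail; i++)
            if (buf[i] == 0x53 && buf[i + 1] == 0xEF)
                printf("candidate magic at offset %lld\n", base + (long long)i);
        buf[0] = buf[avail - 1];      /* keep the last byte for the overlap */
        base += (long long)(avail - 1);
        carry = 1;
    }
    close(fd);
    return 0;
}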

shred and remove files in linux from a C program

I want to shred some temp files produced by my C program before the files are removed.
Currently I am using
system("shred /tmp/datafile");
system("rm /tmp/datafile");
from within my program, but I think calling system() is not the best way (correct me if I am wrong). Is there any other way I can do it? How do I shred the file from within my code itself? A library, or anything? Also, about the deletion part, is this answer good?
Can I ask why you think this is not the best way to achieve this? It looks like a good solution to me, if it is genuinely necessary to destroy the file contents irretrievably.
The advantages of this way of doing it are:
the program already exists (so it's faster to develop); and
the program is already trusted.
The second is an important point. It's possible to overstate the necessity of elaborately scrubbing files (Peter Gutmann, in a remark quoted on the relevant wikipedia page, has described some uses of his method as ‘voodoo’), but that doesn't matter: in any security context, using a pre-existing tool is almost always more defensible than using something home-made.
About the only criticism I'd make of your current approach, using system(3), is that since it looks up the shred program in the PATH, it would be possible in principle for someone to play games with that and get up to mischief. But that's easily dealt with: use fork(2) and execve(2) to invoke a specific binary using its full path.
That said, if this is just a low-impact bit of tidying up, then it might be still more straightforward to simply mmap the file and quickly write zeros into it.
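A rough sketch of the fork(2)/execve(2) suggestion, invoking shred by an absolute path rather than via PATH; /usr/bin/shred and the -u flag (overwrite, then remove) are assumptions to verify on your system:
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shred_file(const char *path) {
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {                                            /* child */
        char *argv[] = { "shred", "-u", (char *)path, NULL };  /* -u also removes the file */
        char *envp[] = { NULL };
        execve("/usr/bin/shred", argv, envp);                  /* assumed binary location */
        _exit(127);                                            /* only reached if exec failed */
    }
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

int main(void) {
    return shred_file("/tmp/datafile") == 0 ? 0 : 1;
}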
You can use the following code:
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#define BUF_SIZE 4096
#define ABS_FILE_PATH "/tmp/aaa"
int main()
{
    //get file size
    struct stat stat_buf;
    if (stat(ABS_FILE_PATH, &stat_buf) == -1)
        return errno;
    off_t fsize = stat_buf.st_size;
    //get file for writing
    int fd = open(ABS_FILE_PATH, O_WRONLY);
    if (fd == -1)
        return errno;
    //fill file with 0s
    void *buf = malloc(BUF_SIZE);
    memset(buf, 0, BUF_SIZE);
    ssize_t ret = 0;
    off_t shift = 0;
    while ((ret = write(fd, buf,
                        ((fsize - shift > BUF_SIZE) ?
                         BUF_SIZE : (fsize - shift)))) > 0)
        shift += ret;
    close(fd);
    free(buf);
    if (ret == -1)
        return errno;
    //remove file
    if (remove(ABS_FILE_PATH) == -1)
        return errno;
    return 0;
}

Why can't my program save a large amount (>2GB) to a file?

I am having trouble trying to figure out why my program cannot save more than 2GB of data to a file. I cannot tell if this is a programming or environment (OS) problem. Here is my source code:
#define _LARGEFILE_SOURCE
#define _LARGEFILE64_SOURCE
#define _FILE_OFFSET_BITS 64
#include <math.h>
#include <time.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/*-------------------------------------*/
//for file mapping in Linux
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/types.h>
/*-------------------------------------*/
#define PERMS 0600
#define NEW(type) (type *) malloc(sizeof(type))
#define FILE_MODE (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)

void write_result(char *filename, char *data, long long length){
    int fd, fq;
    fd = open(filename, O_RDWR|O_CREAT|O_LARGEFILE, 0644);
    if (fd < 0) {
        perror(filename);
        return;
    }
    if (ftruncate(fd, length) < 0)
    {
        printf("[%d]-ftruncate64 error: %s\n", errno, strerror(errno));
        close(fd);
        return;
    }
    fq = write(fd, data, length);   /* note: return value of write() is not checked */
    close(fd);
    return;
}

int main()
{
    long long offset = 3000000000; // 3GB
    char *ttt;
    ttt = (char *)malloc(sizeof(char) * offset);
    printf("length->%lld\n", strlen(ttt)); // length=0
    memset(ttt, 1, offset);
    printf("length->%lld\n", strlen(ttt)); // length=3GB
    write_result("test.big", ttt, offset);
    return 1;
}
According to my test, the program can generate a file larger than 2GB and can allocate that much memory as well.
The weird thing happened when I tried to write data into the file. I checked the file and it is empty, though it is supposed to be filled with 1s.
Can anyone be kind and help me with this?
You need to read a little more about C strings and what malloc and calloc do.
In your original main ttt pointed to whatever garbage was in memory when malloc was called. This means a nul terminator (the end marker of a C String, which is binary 0) could be anywhere in the garbage returned by malloc.
Also, since malloc does not touch every byte of the allocated memory (and you're asking for a lot) you could get sparse memory which means the memory is not actually physically available until it is read or written.
calloc allocates and fills the allocated memory with 0. It is a little more prone to fail because of this (it touches every byte allocated, so if the OS left the allocation sparse it will not be sparse after calloc fills it.)
Here's your code with fixes for the above issues.
You should also always check the return value from write and react accordingly. I'll leave that to you...
int main()
{
    long long offset = 3000000000; // 3GB
    char *ttt;
    //ttt = (char *)malloc(sizeof(char) *offset);
    ttt = (char *)calloc(sizeof(char), offset); // instead of malloc( ... )
    if (!ttt)
    {
        puts("calloc failed, bye bye now!");
        exit(87);
    }
    printf("length->%lld\n", strlen(ttt)); // length=0 (this now works as expected if calloc does not fail)
    memset(ttt, 1, offset);
    ttt[offset - 1] = 0; // now it's nul terminated and the printf below will work
    printf("length->%lld\n", strlen(ttt)); // length=3GB
    write_result("test.big", ttt, offset);
    return 1;
}
Note to Linux gurus... I know sparse may not be the correct term. Please correct me if I'm wrong as it's been a while since I've been buried in Linux minutiae. :)
Looks like you're hitting the internal file system's limitation for the iDevice: ios - Enterprise app with more than resource files of size 2GB
2Gb+ files are simply not possible. If you need to store such amount of data you should consider using some other tools or write the file chunk manager.
I'm going to go out on a limb here and say that your problem may lie in memset().
The best thing to do here, I think, is to check the data after memset()'ing it:
for (unsigned long i = 0; i < 3000000000; i++) {
    if (ttt[i] != 1) { printf("error in data at location %lu\n", i); break; }
}
Once you've validated that the data you're trying to write is correct, then you should look into writing a smaller file such as 1GB and see if you have the same problems. Eliminate each and every possible variable and you will find the answer.
