searching an integer from a text file in c - c

I have a simulation program written in C that needs to generate random numbers and write them to a text file. The program stops only
- when a random number that was already generated is generated again, or
- when 1 billion random numbers have been generated (no repetition).
My problem is that I cannot search the text file for the generated long int random number!
Text file format is:
9875
764
19827
2332
...
Any help is appreciated..
`
FILE * out;
int checkNumber(long int num){
char line[512];
long int number;
int result=0;
if((out = fopen("out.txt","r"))==NULL){
result= 1;
}
char buf[10];
itoa(num, buf, 10);
while(fgets(line, 512, out) != NULL)
{
if((strstr(line,buf)) != NULL){
result = 0;
}
}
if(out) {
fclose(out);
}
return result;
}
int main(){
int seed;
long int nRNs=0;
long int numberGenerated;
out = fopen ("out.txt","w");
nRNs=0;
seed = 12345;
srand (seed);
fprintf(out,"%d\n",numberGenerated);
while( nRNs != 1000000000 )
{
numberGenerated = rand();
nRNs++;
if(checkNumber(numberGenerated)==0){
fclose(out); break; system("pause");
}
else{
fprintf(out,"%d\n",numberGenerated);
}
}
fclose(out);
}`

If the text file contains only randomly generated numbers separated by whitespace, you can split it with strtok() (look up its usage) and insert the numbers into a binary tree, as mentioned by @jacekmigacz. In any case, you will have to read the whole file at least once. After that, call ftell() to record the position you have read up to, and when another number is generated use fseek() to jump back to that position and read only the newest numbers. Remember to read the data line by line with fgets().
Watch the memory requirements and use malloc() judiciously.

Try with tree (data structure).
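A minimal sketch of what the tree suggestion could look like, assuming a plain unbalanced binary search tree kept in memory. The names `struct node` and `insert_or_find` are made up for illustration and do not come from the question.

```c
#include <stdlib.h>

/* Hypothetical node type for illustration; not from the original post. */
struct node {
    long value;
    struct node *left, *right;
};

/* Inserts value into the tree rooted at *root.
 * Returns 1 if the value was already present, 0 if it was inserted,
 * and -1 if memory ran out. */
int insert_or_find(struct node **root, long value)
{
    /* Walk down until we find the value or an empty link. */
    while (*root != NULL) {
        if (value == (*root)->value)
            return 1;
        root = (value < (*root)->value) ? &(*root)->left : &(*root)->right;
    }
    *root = malloc(sizeof **root);
    if (*root == NULL)
        return -1;
    (*root)->value = value;
    (*root)->left = (*root)->right = NULL;
    return 0;
}
```

The generator loop would call `insert_or_find(&root, numberGenerated)` once per number and stop on a return value of 1. Every node costs two pointers on top of the value, so this is only practical while the set of numbers fits in RAM.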

Searching linearly through the text file every time is gonna take forever with so many numbers. You could hold every number generated so far sorted in a data structure so that you can do a binary search for a duplicate. This is going to need a lot of RAM though. For 1 billion integers that's already 4GB on a system with 32-bit integers, and you'll need several more for the data structure overhead. My estimate is around 16GB in the worst case scenario (where you actually get to 1 billion unique integers.)
If you don't have a memory monster machine, you should instead write the data structure to a binary file and do the binary search there. Though that's still gonna be quite slow.

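That may work, or you can approach it like this (slow, but it will work):
#include <stdio.h>
#include <stdlib.h>
/* Returns 1 if new_rand is already in the file, 0 after appending it, -1 on error. */
int check_or_append(const char *path, int new_rand)
{
    static int counter = 0;
    FILE *fptr = fopen(path, "a+");
    int c, j = 0;               /* c must be an int so it can hold EOF */
    char buf[16];
    if (fptr == NULL)
        return -1;
    rewind(fptr);               /* make sure reading starts at the beginning */
    while ((c = getc(fptr)) != EOF) {
        if (c == ' ' || c == '\n') {        /* end of one number */
            buf[j] = '\0';
            if (j > 0 && atoi(buf) == new_rand) {
                fclose(fptr);
                return 1;
            }
            j = 0;
        } else if (j < (int)sizeof buf - 1) {
            buf[j++] = (char)c;
        }
    }
    if (counter < 1000000) {
        /* Write text with a trailing newline so the loop above can parse it. */
        fprintf(fptr, "%d\n", new_rand);
        counter++;
    }
    fclose(fptr);
    return 0;
}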

Don't open and scan your file to checkNumber(). You'll be waiting forever.
Instead, keep your generated numbers in memory using a bit set data structure and refer to that.
Your bit set will need to be large enough to indicate every 32-bit integer, so it'll consume 2^32 / 8 bytes (or 512MiB) of memory. This may seem like a lot but it's much smaller than 32-bit * 1,000,000,000 (4GB). Also, both checking and updating will be done in constant time.
Edit: The wikipedia link doesn't do much to explain how to code one, so here's a rough sample: (There're faster ways of writing this, e.g.: using bit shifts instead of division, but this should be easier to understand.)
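#include <stddef.h>
int checkNumberOrUpdate(unsigned char *bitSet, unsigned long num){
    unsigned char b = (unsigned char)(1u << (num % 8)); /* mask for this number's bit */
    size_t w = num / 8;                                 /* byte that holds the bit; a char index would overflow */
    if (bitSet[w] & b) {
        return 1;   /* bit already set: num was generated before */
    }
    bitSet[w] |= b; /* first occurrence: remember it */
    return 0;
}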
Note, bitSet needs to be calloc()d to the right size from your main function.
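To show the allocation step as well, here is a rough sketch of the whole loop built around a bit set. It assumes the numbers come from rand(), so the set only needs one bit per value in 0..RAND_MAX; `seen_before` and `count_until_repeat` are illustrative names, not part of the question's code.

```c
#include <stdlib.h>

/* Returns 1 if num was already recorded, otherwise records it and returns 0. */
int seen_before(unsigned char *bits, unsigned long num)
{
    unsigned char mask = (unsigned char)(1u << (num % 8));
    size_t byte = num / 8;
    if (bits[byte] & mask)
        return 1;
    bits[byte] |= mask;
    return 0;
}

/* Draws rand() values until one repeats or limit values have been drawn.
 * Returns how many distinct values were drawn, or -1 if allocation fails. */
long count_until_repeat(unsigned int seed, long limit)
{
    /* One bit per possible rand() result, 0 .. RAND_MAX inclusive. */
    size_t nbytes = (size_t)RAND_MAX / 8 + 1;
    unsigned char *bits = calloc(nbytes, 1);   /* calloc zeroes the set */
    long n = 0;

    if (bits == NULL)
        return -1;
    srand(seed);
    while (n < limit && !seen_before(bits, (unsigned long)rand()))
        n++;
    free(bits);
    return n;
}
```

With the question's parameters the call would be `count_until_repeat(12345, 1000000000L)`. Where RAND_MAX is 2^31 - 1 (as on glibc) the set takes 256 MiB; writing each new number to out.txt can stay in the loop, since the file is no longer read back.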

Related

Optimizing disk IO

I have a piece of code that analyzes streams of data from very large (10-100GB) binary files. It works well, so it's time to start optimizing, and currently disk IO is the biggest bottleneck.
There are two types of files in use. The first type of file consists of a stream of 16-bit integers, which must be scaled after I/O to convert to a floating point value which is physically meaningful. I read the file in chunks, and I read in the chunks of data by reading one 16-bit code at a time, performing the required scaling, and then storing the result in an array. Code is below:
int64_t read_current_chimera(FILE *input, double *current,
int64_t position, int64_t length, chimera *daqsetup)
{
int64_t test;
uint16_t iv;
int64_t i;
int64_t read = 0;
if (fseeko64(input, (off64_t)position * sizeof(uint16_t), SEEK_SET))
{
return 0;
}
for (i = 0; i < length; i++)
{
test = fread(&iv, sizeof(uint16_t), 1, input);
if (test == 1)
{
read++;
current[i] = chimera_gain(iv, daqsetup);
}
else
{
perror("End of file reached");
break;
}
}
return read;
}
The chimera_gain function just takes a 16-bit integer, scales it and returns the double for storage.
The second file type contains 64-bit doubles, but it contains two columns, of which I only need the first. To do this I fread pairs of doubles and discard the second one. The double must also be endian-swapped before use. The code I use to do this is below:
int64_t read_current_double(FILE *input, double *current, int64_t position, int64_t length)
{
int64_t test;
double iv[2];
int64_t i;
int64_t read = 0;
if (fseeko64(input, (off64_t)position * 2 * sizeof(double), SEEK_SET))
{
return 0;
}
for (i = 0; i < length; i++)
{
test = fread(iv, sizeof(double), 2, input);
if (test == 2)
{
read++;
swapByteOrder((int64_t *)&iv[0]);
current[i] = iv[0];
}
else
{
perror("End of file reached: ");
break;
}
}
return read;
}
Can anyone suggest a method of reading these file types that would be significantly faster than what I am currently doing?
First off, it would be useful to use a profiler to identify the hot spots in your program. Based on your description of the problem, you have a lot of overhead going on by the sheer number of freads. As the files are large there will be a big benefit to increasing the amount of data you read per io.
Convince yourself of this by putting together 2 small programs that read the stream.
1) read it as you are in the example above, of 2 doubles.
2) read it the same way, but make it 10,000 doubles.
Time both runs a few times, and odds are you will observe that #2 runs much faster.
Best of luck.
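As a rough illustration of reading more data per call, the sketch below pulls the 16-bit samples in blocks and scales them afterwards. `CHUNK` is an assumed block size to tune, and `scale_sample` is only a stand-in for the question's chimera_gain(), whose body is not shown there.

```c
#include <stdint.h>
#include <stdio.h>

#define CHUNK 65536  /* samples per fread call; an assumed value worth tuning */

/* Placeholder for the question's chimera_gain(); the real scaling is not shown. */
static double scale_sample(uint16_t raw)
{
    return (double)raw;
}

/* Reads up to length samples from the current file position into current[],
 * fetching up to CHUNK samples per fread call. Returns the number stored.
 * The static buffer makes this function non-reentrant. */
int64_t read_current_chunked(FILE *input, double *current, int64_t length)
{
    static uint16_t buf[CHUNK];
    int64_t done = 0;

    while (done < length) {
        size_t want = (length - done > CHUNK) ? CHUNK : (size_t)(length - done);
        size_t got = fread(buf, sizeof buf[0], want, input);

        for (size_t i = 0; i < got; i++)
            current[done + i] = scale_sample(buf[i]);
        done += (int64_t)got;
        if (got < want)
            break;  /* end of file or read error */
    }
    return done;
}
```

The same pattern applies to the file of double pairs: read a block of pairs, then byte-swap and copy every first element. Whether the gain is large depends on how much time the per-call overhead actually takes, which is what the timing experiment above is meant to reveal.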

print out a large number of integers rapidly in C

I have to print 1,000,000 four digit numbers. I used printf for this purpose
for(i=0;i<1000000;i++)
{
printf("%d\n", students[i]);
}
and it turns out to be too slow. Is there a faster way to print them?
You could create an array, fill it with output data and then print out that array at once. Or if there is memory problem, just break that array to smaller chunks and print them one by one.
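One possible reading of this suggestion, sketched under the assumption that int is 32 bits wide: format everything into one heap buffer with snprintf and hand it to the stream in a single fwrite. `print_all_at_once` is a made-up name.

```c
#include <stdio.h>
#include <stdlib.h>

/* Formats count numbers, one per line, into a single heap buffer and writes
 * it with one fwrite call. Returns 0 on success, -1 on failure. */
int print_all_at_once(FILE *out, const int *values, size_t count)
{
    /* Assumes 32-bit int: at most 11 characters plus a newline per value,
     * and one extra byte for the terminator snprintf always writes. */
    size_t cap = count * 12 + 1;
    char *buf = malloc(cap);
    size_t used = 0;
    int ok;

    if (buf == NULL)
        return -1;
    for (size_t i = 0; i < count; i++)
        used += (size_t)snprintf(buf + used, cap - used, "%d\n", values[i]);
    ok = fwrite(buf, 1, used, out) == used;
    free(buf);
    return ok ? 0 : -1;
}
```

For the question's data the call would be `print_all_at_once(stdout, students, 1000000)`. If the buffer is too large for the available memory, the same function can be called on smaller slices of the array.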
Here is my attempt replacing printf and stdio stream buffering with straightforward special-case code:
int print_numbers(const char *filename, const unsigned int *input, size_t len) {
enum {
// Maximum digits per number. The input numbers must not be greater
// than this!
# if 1
DIGITS = 4,
# else
// Alternative safe upper bound on the digits per integer
// (log10(2) < 28/93)
DIGITS = (sizeof *input * CHAR_BIT * 28UL + 92) / 93,
# endif
// Maximum lines to be held in the buffer. Tune this to your system,
// though something on the order of 32 kB should be reasonable
LINES = 5000
};
// Write the output in binary to avoid extra processing by the CRT. If necessary
// add the expected "\r\n" line endings or whatever else is required for the
// platform manually.
FILE *file = fopen(filename, "wb");
if(!file)
return EOF;
// Disable automatic file buffering in favor of our own
setbuf(file, NULL);
while(len) {
// Set up a write pointer for a buffer going back-to-front. This
// simplifies the reverse order of digit extraction
char buffer[(DIGITS + 1 /* for the newline */) * LINES];
char *tail = &buffer[sizeof buffer];
char *head = tail;
// Grab the largest set of lines still remaining to be printed which
// will safely fit in our buffer
size_t chunk = len > LINES ? LINES : len;
const unsigned int *input_chunk;
len -= chunk;
input += chunk;
input_chunk = input;
do {
// Convert each number by extracting least-significant digits
// until all have been printed.
unsigned int number = *--input_chunk;
*--head = '\n';
do {
# if 1
char digit = '0' + number % 10;
number /= 10;
# else
// Alternative in case the compiler is unable to merge the
// division/modulo and perform reciprocal multiplication
char digit = '0' + number;
number = number * 0xCCCDUL >> 19;
digit -= number * 10;
# endif
*--head = digit;
} while(number);
} while(--chunk);
// Dump everything written to the present buffer
fwrite(head, tail - head, 1, file);
}
return fclose(file);
}
I fear this won't buy you much more than a fairly small constant factor over your original (by avoiding some printf format parsing, per-character buffering, locale handling, multithreading locks, etc.)
Beyond this you may want to consider processing the input and writing the output on-the-fly instead of reading /processing/writing as separate stages. Of course whether or not this is possible depends entirely on the operation to be performed.
Oh, and don't forget to enable compiler optimizations when building the application. A run through with a profiler couldn't hurt either.

C: sum of integer values by string identifiers

So, I have two files of financial data, say 'symbols', and 'volumes'. In symbols I have strings such as:
FOO
BAR
BAZINGA
...
In volumes, I have integer values such as:
0001387
0000022
0123374
...
The idea is that the stock symbols will repeat in the file and I need to find the total volume of each stock. So, each row where I observe foo I increment total volume of foo by the value observed in volumes. The problem is that these files can be huge: easily 5 - 100 million records. A typical day may have ~1K different symbols in the file.
Doing it using strcmp on symbols each new line will be very inefficient. I was thinking of using an associative array --- hash table library which allows string keys --- such as uthash or Glib's hashtable.
I am reading some pretty good things about Judy arrays? Is the licensing a problem in this case?
Any thoughts on the choice of an efficient hash-table implementation? And also, whether I should use hash tables at all or perhaps something else entirely.
Umm.. apologize for the omission earlier: I need to have a pure C solution.
Thanks.
Definitely hashtable sounds good. You should look at the libiberty implementation.
You can find it on the GCC project Here.
I would use Map of C++ STL. Here's how the pseudo-code looks like:
map< string, long int > Mp;
while(eof is not reached)
{
String stock_name=readline_from_file1();
long int stock_value=readline_from_file2();
Mp[stock_name]+=stock_value;
}
for(each stock_name in Mp)
cout<<stock_name<<" "<<stock_value<<endl;
Based on the amount of data you gave, it may be a bit inefficient, but I'd suggest this because it's much easier to implement.
If the solution is to be implemented strictly in C, then hashing will be the best solution. But if you feel that implementing a hash table and writing the code to handle collisions is complex, I have another idea: use a trie. It may sound weird, but this can also help a bit.
I would suggest you read this one. It has a nice explanation of what a trie is and how to construct it. The implementation in C is also given there. You may wonder where to store the volume for each stock: it can be stored in the node at the end of the stock string and updated easily whenever needed.
But as you say that you are new to C, I advise you to try implementing it with a hash table first and then try this one.
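For the hash table route in plain C, a small fixed-size table with open addressing may be enough for roughly a thousand symbols. The sketch below is only one way to do it: `TABLE_SIZE`, `SYMBOL_LEN`, `add_volume` and `total_volume` are illustrative names, and it assumes symbols are non-empty and shorter than 16 characters.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 4096  /* power of two, well above ~1K symbols (assumed) */
#define SYMBOL_LEN 16    /* assumed upper bound on symbol length, including '\0' */

struct entry {
    char symbol[SYMBOL_LEN];    /* an empty string marks an unused slot */
    unsigned long long volume;
};

static struct entry table[TABLE_SIZE];

/* 32-bit FNV-1a string hash. */
static uint32_t hash_symbol(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Adds volume to the running total for symbol, using linear probing.
 * Returns 0 on success, -1 if the table is full. */
int add_volume(const char *symbol, unsigned long long volume)
{
    uint32_t start = hash_symbol(symbol) & (TABLE_SIZE - 1);
    for (uint32_t n = 0; n < TABLE_SIZE; n++) {
        struct entry *e = &table[(start + n) & (TABLE_SIZE - 1)];
        if (e->symbol[0] == '\0') {             /* free slot: claim it */
            strncpy(e->symbol, symbol, SYMBOL_LEN - 1);
            e->volume = volume;
            return 0;
        }
        if (strcmp(e->symbol, symbol) == 0) {   /* symbol seen before */
            e->volume += volume;
            return 0;
        }
    }
    return -1;
}

/* Returns the total recorded for symbol, or 0 if it was never seen. */
unsigned long long total_volume(const char *symbol)
{
    uint32_t start = hash_symbol(symbol) & (TABLE_SIZE - 1);
    for (uint32_t n = 0; n < TABLE_SIZE; n++) {
        const struct entry *e = &table[(start + n) & (TABLE_SIZE - 1)];
        if (e->symbol[0] == '\0')
            return 0;
        if (strcmp(e->symbol, symbol) == 0)
            return e->volume;
    }
    return 0;
}
```

The main loop would read one line from each file and call `add_volume(symbol, volume)`. When parsing volumes such as 0001387, convert with base 10 (for example `strtoull(line, NULL, 10)`), because base 0 would treat the leading zeros as an octal prefix.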
Why not stick with your associative array idea? I assume that at the end of execution you need a list of unique names with their aggregated values. The code below works as long as you have enough memory to hold all the unique names. Of course, a linear scan like this is not that efficient, but a few tricks are possible depending on the patterns in your data.
#include <string.h>
#define MAX_NAMES 4096   /* upper bound on distinct names; worst case is one per row */
#define NAME_LEN  16
struct customer {
    char name[NAME_LEN];
    long value;
};
static struct customer customers[MAX_NAMES];
static int consolidate_index = 0;
/* Adds value to the entry for name, creating the entry the first time it is seen. */
void consolidate_names(const char *name, long value)
{
    int i;
    for (i = 0; i < consolidate_index; i++) {
        if (strcmp(customers[i].name, name) == 0) {
            customers[i].value += value;
            return;
        }
    }
    if (consolidate_index < MAX_NAMES) {    /* not found: append a new entry */
        strncpy(customers[i].name, name, NAME_LEN - 1);
        customers[i].name[NAME_LEN - 1] = '\0';
        customers[i].value = value;
        consolidate_index++;
    }
}
/* In main: until both files are exhausted, read a chunk of names and a chunk
 * of values into buffers, then call consolidate_names(names[i], values[i])
 * for every row of the chunk. */
My solution:
I did end up using the JudySL array to solve this problem. After some reading, the solution was quite simple to implement using Judy. I am replicating the solution here in full for it to be useful to anyone else.
#include <stdio.h>
#include <Judy.h>
const unsigned int BUFSIZE = 10; /* A symbol is only 8 chars wide. */
int main (int argc, char const **argv) {
FILE *fsymb = fopen(argv[1], "r");
if (fsymb == NULL) return 1;
FILE *fvol = fopen(argv[2], "r");
if (fvol == NULL) return 1;
FILE *fout = fopen(argv[3], "w");
if (fout == NULL) return 1;
unsigned int lnumber = 0;
uint8_t symbol[BUFSIZE];
unsigned long volume;
/* Initialize the associative map as a JudySL array. */
Pvoid_t assmap = (Pvoid_t) NULL;
Word_t *value;
while (1) {
fscanf(fsymb, "%9s", symbol); /* field width keeps the input within BUFSIZE */
if (feof(fsymb)) break;
fscanf(fvol, "%lu", &volume);
if (feof(fvol)) break;
++lnumber;
/* Insert a new symbol or return value if exists. */
JSLI(value, assmap, symbol);
if (value == PJERR) {
fclose(fsymb);
fclose(fvol);
fclose(fout);
return 2;
}
*value += volume;
}
symbol[0] = '\0'; /* Start from the empty string. */
JSLF(value, assmap, symbol); /* Find the next string in the array. */
while (value != NULL) {
fprintf(fout, "%s: %lu\n", symbol, *value); /* Print to output file. */
JSLN(value, assmap, symbol); /* Get next string. */
}
Word_t tmp;
JSLFA(tmp, assmap); /* Free the entire array. */
fclose(fsymb);
fclose(fvol);
fclose(fout);
return 0;
}
I tested the solution on a 'small' sample containing 300K lines. The output is correct and the elapsed time was 0.074 seconds.

Reading N integers in a single access from a file in C

I'm trying to implement external sorting in C.
I have to read N integers (N is fixed by the size of main memory) from a file initially so that I can apply quicksort on them and then continue with the merging process.
I can think of these 2 ways:
1. Read N integers one by one from the file, put them in an array, then sort them.
2. Read a bulk of data into a big char array and then read integers from it using sscanf.
The 1st method is clearly slow and the 2nd method uses a lot of extra memory (but we have limited main memory).
Is there any better way?
Don't try to be more clever than your OS, it probably supports some clever memory management functions, which will make your life easier, and your code faster.
Assuming you are using a POSIX compliant operating system, then you can use mmap(2).
Map your file into memory with mmap
Sort it
Sync it
This way the OS will handle swapping out data when room is tight, and swap it in when you need it.
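The three steps above can be sketched as follows, assuming a POSIX system and a file that stores raw binary ints rather than text. `sort_int_file` and `compare_ints` are illustrative names.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* qsort comparator that avoids the overflow of a plain subtraction. */
static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sorts a file of raw binary ints in place. Returns 0 on success, -1 on error. */
int sort_int_file(const char *path)
{
    struct stat st;
    size_t bytes;
    int *data;
    int rc;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {      /* nothing to sort; mmap rejects length 0 */
        close(fd);
        return 0;
    }
    bytes = (size_t)st.st_size;

    /* Step 1: map the file; MAP_SHARED makes changes reach the file. */
    data = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (data == MAP_FAILED)
        return -1;

    /* Step 2: sort the mapped ints as if they were an ordinary array. */
    qsort(data, bytes / sizeof(int), sizeof(int), compare_ints);

    /* Step 3: flush the changes back to the file, then unmap. */
    rc = msync(data, bytes, MS_SYNC);
    munmap(data, bytes);
    return rc;
}
```

This leaves the paging decisions to the kernel instead of implementing the run-and-merge phases by hand, so it is a replacement for explicit external sorting rather than an implementation of it. A text file of numbers would first have to be converted to binary.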
Since stdio file operations are buffered, you won't really need to worry about the first option, especially if the file isn't huge. Remember you're not operating directly on a file, but a representation of that file in memory.
For example, if you scan in one number at a time, the system will read in a much larger section from the file (on my system it's 4096 bytes, or the entire file if it's shorter).
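You can use the function below to read ints from the file one by one and continue sorting and merging on the go.
The function takes the file name, the zero-based index of the integer, and a pointer that receives the value. It returns 0 on success and -1 on error or end of file, and it assumes the file stores raw binary ints.
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int read_int(const char *file_name, int count, int *out)
{
    ssize_t n;
    int fd = open(file_name, O_RDONLY);
    if (fd < 0)
    {
        perror("error opening file");
        return -1;
    }
    /* pread reads at an absolute offset without moving the file position. */
    n = pread(fd, out, sizeof(int), (off_t)count * sizeof(int));
    close(fd);  /* close on every path so the descriptor is not leaked */
    if (n != (ssize_t)sizeof(int))
    {
        /* 0 means end of file, -1 means a read error */
        return -1;
    }
    return 0;
}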
Sorting while you read is the best way, and saving your data in a linked list instead of an array makes the sorted insertion more efficient.
You can use fscanf() to read the file integer by integer and sort as you go: each time you read an integer, put it into the array at the right place, so that the array is already sorted when you finish reading.
The following example reads the file integer by integer and inserts each value in sorted order as it is read. The integers are stored in an array, not in a linked list.
void sort_insert(int x, int *array, int len)
{
int i=0, j;
for(i=0; i<(len-1); i++)
{
if (x > array[i])
continue;
for (j=(len-1); j>i; j--)
array[j] = array[j-1];
break;
}
array[i] = x;
}
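int main(void) {
    int x, i;
    int len = 0;
    int array[50];
    FILE *fp = fopen("myfile.txt", "r");
    if (fp == NULL) {
        perror("myfile.txt");
        return 1;
    }
    /* Read at most 50 integers, inserting each one at its sorted position. */
    while (len < 50 && fscanf(fp, " %d", &x) == 1)
    {
        len++;
        sort_insert(x, array, len);
    }
    fclose(fp);
    for (i = 0; i < len; i++)
    {
        printf("array[%d] = %d\n", i, array[i]);
    }
    return 0;
}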

storing known sequences in c

I'm working on Project Euler #14 in C and have figured out the basic algorithm; however, it runs insufferably slow for large numbers, e.g. 2,000,000 as wanted; I presume because it has to generate the sequence over and over again, even though there should be a way to store known sequences (e.g., once we get to a 16, we know from previous experience that the next numbers are 8, 4, 2, then 1).
I'm not exactly sure how to do this with C's fixed-length array, but there must be a good way (that's amazingly efficient, I'm sure). Thanks in advance.
Here's what I currently have, if it helps.
#include <stdio.h>
#define UPTO 2000000
int collatzlen(int n);
int main(){
int i, l=-1, li=-1, c=0;
for(i=1; i<=UPTO; i++){
if( (c=collatzlen(i)) > l) l=c, li=i;
}
printf("Greatest length:\t\t%7d\nGreatest starting point:\t%7d\n", l, li);
return 1;
}
/* n != 0 */
int collatzlen(int n){
int len = 0;
while(n>1) n = (n%2==0 ? n/2 : 3*n+1), len+=1;
return len;
}
Your original program needs 3.5 seconds on my machine. Is it insufferably slow for you?
My dirty and ugly version needs 0.3 seconds. It uses a global array to store the values already calculated. And use them in future calculations.
int collatzlen2(unsigned long n);
static unsigned long array[2000000 + 1];//to store those already calculated
int main()
{
int i, l=-1, li=-1, c=0;
int x;
for(x = 0; x < 2000000 + 1; x++) {
array[x] = -1;//use -1 to denote not-calculated yet
}
for(i=1; i<=UPTO; i++){
if( (c=collatzlen2(i)) > l) l=c, li=i;
}
printf("Greatest length:\t\t%7d\nGreatest starting point:\t%7d\n", l, li);
return 1;
}
int collatzlen2(unsigned long n){
unsigned long len = 0;
unsigned long m = n;
while(n > 1){
if(n > 2000000 || array[n] == -1){ // outside range or not-calculated yet
n = (n%2 == 0 ? n/2 : 3*n+1);
len+=1;
}
else{ // if already calculated, use the value
len += array[n];
n = 1; // to get out of the while-loop
}
}
array[m] = len;
return len;
}
Given that this is essentially a throw-away program (i.e. once you've run it and got the answer, you're not going to be supporting it for years :), I would suggest having a global variable to hold the lengths of sequences already calculated:
int lengthfrom[UPTO] = {0};
If your maximum size is a few million, then we're talking megabytes of memory, which should easily fit in RAM at once.
The above will initialise the array to zeros at startup. In your program - for each iteration, check whether the array contains zero. If it does - you'll have to keep going with the computation. If not - then you know that carrying on would go on for that many more iterations, so just add that to the number you've done so far and you're done. And then store the new result in the array, of course.
Don't be tempted to use a local variable for an array of this size: that will try to allocate it on the stack, which won't be big enough and will likely crash.
Also - remember that with this sequence the values go up as well as down, so you'll need to cope with that in your program (probably by having the array longer than UPTO values, and using an assert() to guard against indices greater than the size of the array).
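The caching scheme described above could be sketched like this. It is only an illustration: `LIMIT`, `cache` and `collatz_len` are made-up names, the cache covers only values up to LIMIT (larger intermediate values are simply recomputed), and 64-bit arithmetic is used to give the 3n+1 step ample headroom.

```c
#include <stdint.h>

#define LIMIT 2000000  /* same bound as UPTO in the question */

/* cache[n] holds the number of steps from n down to 1; 0 means "not computed
 * yet". n == 1 really has length 0, which the loop condition handles. */
static uint32_t cache[LIMIT + 1];

/* Returns the number of Collatz steps from start to 1. Requires start >= 1. */
uint32_t collatz_len(uint64_t start)
{
    uint64_t n = start;
    uint32_t steps = 0;

    /* Walk until we reach 1 or a value whose length is already cached. */
    while (n != 1 && !(n <= LIMIT && cache[n] != 0)) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        steps++;
    }
    if (n != 1)
        steps += cache[n];      /* add the remaining, already known length */
    if (start <= LIMIT)
        cache[start] = steps;   /* remember the result for later calls */
    return steps;
}
```

The driver stays the same as in the question: loop i from 1 to LIMIT, call `collatz_len(i)`, and keep the largest result. The array is a file-scope static, so it is zero-initialised and does not live on the stack.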
If I recall correctly, your problem isn't a slow algorithm: the algorithm you have now is fast enough for what PE asks you to do. The problem is overflow: you sometimes end up multiplying your number by 3 so many times that it eventually exceeds the maximum value that can be stored in a signed int. Use unsigned ints, and if that still doesn't work, use 64-bit ints (long long). Some intermediate values in this range grow past 2^32, so 64-bit is the safer choice.
This should run very fast, but if you want to do it even faster, the other answers already addressed that.
