saving hashtable using c so that random access is faster - c

I am writing a C program (call it database generation) that processes an input file and generates a number in the range [1, 10^8], along with a sequence of float values whose length is fixed but unknown, followed by 3 integers. All values are separated by spaces.
Example:
19432 23.45 32.12 45.76 ...(156 such float values) 4 6 106
This will be one line of the database, where the first number is the hash index (1 to 10^8) and the last 3 integers denote the x and y coordinates and the document ID respectively.
Our database is saved in file xyz, which has the following content:
2341 34.67 43.13 ... (234 such float values) 5 8 123
2352 46.92 41.89 ... (51 such float values) 1 9 145
2352 46.92 41.89 ... (98 such float values) 2 7 12
2359 12.71 72.90 ... (141 such float values) 8 12 13
The starting number (hash index value) is always in non-decreasing order as we proceed from one line of the database to the next.
I have another C program (call it retrieval) which takes a hash index value as input and should output all lines starting with that value.
I have 2 questions:
How can I make sure that retrieval jumps directly to the lines containing the requested hash index value, skipping the earlier lines of the database, so that its response is fast?
When I get another input file for the database and its hash index value is 2352, how do I add another line starting with 2352 at its proper position in the database?
I am considering the following approach, which is not ideal, as the database won't be organised in the required non-decreasing order of hash index values. Also, the database is split into 2 components: one contains byte offset entries for each hash index, and the other is the database file presented above.
It involves
(1)byte-offset.txt of the form
2341 byte-pos-1
2352 byte-pos-2
2359 byte-pos-3
2352 byte-pos-4
(2)database.txt of the form
2341 34.67 43.13 ... (234 such float values) 5 8 123
2352 46.92 41.89 ... (51 such float values) 1 9 145
2359 12.71 72.90 ... (141 such float values) 8 12 13
2352 46.92 41.89 ... (98 such float values) 2 7 12
The only good thing about it is that new entries can be appended to the end of each file as the database grows when we get more data.
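The two-file approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names append_record and print_records are hypothetical, both streams are assumed to be open for reading and writing, and a database line is assumed to fit in 8192 bytes.

```c
#include <stdio.h>

/* Hypothetical helper: append one record to the database file and note the
 * byte offset where it starts in the index file.
 * Returns 0 on success, -1 on failure. */
int append_record(FILE *db, FILE *idx, long key, const char *rest_of_line)
{
    long pos;

    if (fseek(db, 0L, SEEK_END) != 0) return -1;
    pos = ftell(db);                       /* offset where the new line starts */
    if (pos < 0) return -1;
    if (fprintf(db, "%ld %s\n", key, rest_of_line) < 0) return -1;

    if (fseek(idx, 0L, SEEK_END) != 0) return -1;
    if (fprintf(idx, "%ld %ld\n", key, pos) < 0) return -1;
    return 0;
}

/* Hypothetical helper: copy every database line whose hash index equals
 * key to out. Returns the number of lines found, or -1 on an I/O error. */
int print_records(FILE *db, FILE *idx, long key, FILE *out)
{
    char line[8192];                       /* assumed upper bound on line length */
    long k, pos;
    int found = 0;

    rewind(idx);
    while (fscanf(idx, "%ld %ld", &k, &pos) == 2) {
        if (k != key) continue;
        if (fseek(db, pos, SEEK_SET) != 0) return -1;   /* jump straight to the line */
        if (fgets(line, sizeof line, db) == NULL) return -1;
        fputs(line, out);
        found++;
    }
    return found;
}
```

The lookup here still reads the whole index file, but each index entry is only two numbers, so it is far smaller than the database itself. Loading the index into memory and sorting it by hash index would be a possible next step.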

Related

Finding key for minimum value and conditions in excel

This is my table (copied from the similar question Finding minimum value in index(match) array [EXCEL])
A B C D
tasmania 10 3 10
queensland 22 8 10
new south wales 10 12 12
northern territory 8 4 15
south australia 12 2 8
western australia 32 4 15
tasmania 72 6 16
I have criteria for B and C, and I want to retrieve the A with the lowest corresponding value D. Values in B, C and D can be duplicates, values in A cannot.
Example:
B >= 8
C >= 4
This should result in "queensland" (lowest matching value of D is 10), but not "tasmania" (its first row has the same cost but does not meet the criterion on C).
I am currently trying this array formula:
{ =MIN(IF(B:B>=8;IF(C:C>=4;D;""));1) }
This returns the correct lowest D, but since I am losing the information about A, I cannot retrieve the value for A.
This as an array formula should work for you:
=INDEX($A$1:$A$7,MATCH(MIN(IF($B$1:$B$7>=8,IF($C$1:$C$7>=4,$D$1:$D$7))),IF($B$1:$B$7>=8,IF($C$1:$C$7>=4,$D$1:$D$7)),0))
It should be noted that if you have Excel 2016 or Office 365, you'll have access to the MINIFS function, which is probably better suited for this task (I don't actually have the newest version, so I am unable to test).

Aligning dataset columns in C

I have a dataset of two groups: Red and Green, and I want to compare the difference between ratios, but they have to be aligned first.
Original File ( First few rows of 200,000 entries)
A B C D
Red Ratio Green Ratio
1 0.35 1 0.21
2 0.45 2 0.235
3 0.45 3 0.154
4 0.235 4 0.156
6 0.156 5 0.146
7 0.668 6 0.154
8 0.44 7 0.148
9 0.446 8 0.148
10 0.354 9 0.199
11 0.154 10 0.143
12 0.49 12 0.148
After using the code, the values are aligned, the "extras" are deleted, and the columns are shifted up.
A B C D
Red Ratio Green Ratio
1 0.35 1 0.21
2 0.45 2 0.235
3 0.45 3 0.154
4 0.235 4 0.156
6 0.156 6 0.154
7 0.668 7 0.148
8 0.44 8 0.148
9 0.446 9 0.199
10 0.354 10 0.143
12 0.49 12 0.148
15 0.146 15 0.87
17 0.113 17 0.113
19 0.44 19 0.448
This is the code I have so far: I am taking difference between A and C to check if they are 0, and adjusting them if they are not.
#include <stdio.h>
int deletemove(char column, int row)
{
    // This script would delete the positions mentioned in the arguments, and shift the other values up.
}
int main(void)
{
    //Opening input file for read/write
    FILE *input;
    input=fopen("/full/path/file.xlsx", "r");
    if (input == NULL) {printf("error opening input file\n");}
    //Store the values from file into an array
    int colA[1024];
    int colC[1024];
    // read contents of columns A and C and store in an array
    int ai;
    for(ai=1; ai<1024; ai++)
    {
        fseek(input,ai,SEEK_SET);
        colA[ai]=fgetc(input);
    }
    int ci;
    for(ci=1; ci<1024; ci++)
    {
        fseek(input,ci,SEEK_SET);
        colC[ci]=fgetc(input);
    }
    //Take difference between value of Column A and C to check if they are identical.
    int j;
    char A,B;
    for (j = 1; j < 1024; j++)
    {
        int check = colA[j] - colC[j]; // check difference between two values in a column
        if (check > 0)
            deletemove(A,j); //delete values from column C and D
        else if (check < 0)
            deletemove(B,j); // delete values from column A and B
    }
    fclose(input); // close files
}
I need help implementing a delete row/column function and reading the values in array.
Also, is storing 200,000 values in an array a feasible method?
Thanks.
is storing 200,000 values in an array a feasible method?
Yes, as long as you don't put those arrays on the stack. Declaring a non-static variable inside a function puts the variable on the stack(1). Contemporary (year 2016) desktops typically limit the stack size to a few megabytes, whereas the main memory is a few gigabytes.
So it's best to put large arrays into main memory. This can be done in a variety of ways:
use a global array, i.e. declare the array outside of any function
use a static array, i.e. declare the array with the static keyword
use a dynamically allocated array, i.e. use malloc to allocate the array
(You could also use a linked list. A linked list has the advantage that it can grow as needed; you don't need to know the space requirements in advance.)
In your case, I would store the ratios in the arrays at the index given by the red/green value. Assuming that ratios are always positive numbers, I would initialize all of the entries in the arrays with -1.0. Then, as you read the file, store the ratios at the proper locations in the two arrays. For example, when you read the line
6 0.156 5 0.146
store 0.156 at index 6 in the red array, and store 0.146 at index 5 in the green array.
When all the values have been read from the file, you can simply scan the two arrays, and print the values where both arrays have a non-negative value.
(1) Ignoring oddball systems (e.g. small embedded systems) that don't have a normal stack.
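The approach above could look roughly like this. It is a minimal sketch, not a finished program: the name align_ratios and the limit MAX_KEY are assumptions, and the input is assumed to be a plain-text export with four whitespace-separated columns per line (an .xlsx file is a zipped archive and cannot be read this way).

```c
#include <stdio.h>

#define MAX_KEY 200000   /* assumed largest value in columns A and C */

/* Static storage keeps these large arrays off the stack. */
static double red[MAX_KEY + 1];
static double green[MAX_KEY + 1];

/* Read "A B C D" rows from in and write the aligned rows to out.
 * Returns the number of rows written. */
long align_ratios(FILE *in, FILE *out)
{
    char line[256];
    long i, written = 0;

    /* -1.0 marks "no ratio seen for this index". */
    for (i = 0; i <= MAX_KEY; i++)
        red[i] = green[i] = -1.0;

    while (fgets(line, sizeof line, in) != NULL) {
        long a, c;
        double b, d;
        /* Header lines do not parse as four numbers and are skipped. */
        if (sscanf(line, "%ld %lf %ld %lf", &a, &b, &c, &d) != 4)
            continue;
        if (a >= 0 && a <= MAX_KEY) red[a] = b;
        if (c >= 0 && c <= MAX_KEY) green[c] = d;
    }

    /* Print only the indices where both arrays hold a ratio. */
    for (i = 0; i <= MAX_KEY; i++) {
        if (red[i] >= 0.0 && green[i] >= 0.0) {
            fprintf(out, "%ld %g %ld %g\n", i, red[i], i, green[i]);
            written++;
        }
    }
    return written;
}
```

With this layout no delete-and-shift step is needed, because unmatched entries are simply never printed.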

Accessing two different rows simultaneously in C

Suppose I have a data set arranged in the following way
19 10 1 1
12 15 1 1
13 12 4 5
10 5 2 3
...
and so on. At a particular iteration of a for loop I have to read only the 1st and the 4th rows, and in the next iteration I have to access some other set of rows, for example:
1st iteration:
1st row: 19 10 1 1
4th row: 10 5 2 3
I will access my data using the fscanf() function. But how will I ensure that I choose only the 1st and 4th rows, or any two rows for that matter, at a given iteration?
(I have not considered reading it into a 2D array since the size of data set is 10^8 )
Thank you.
As you read through your data (say, stored in a standard file), record the byte offset of each row by looking for row delimiters (newline characters). You can then jump to any row by passing its start offset to fseek() on the FILE * and reading up to the next newline. Storing a few byte offsets (an eight-byte long or equivalent, often) is cheap.
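A minimal sketch of this idea, assuming each row holds four integers as in the question. The names index_rows and read_row are hypothetical, and rows are numbered from 0 here.

```c
#include <stdio.h>
#include <stdlib.h>

/* Scan fp once and record the byte offset at which every row starts.
 * Returns a malloc'd array (the caller frees it) and stores the row count
 * in *n_rows, or returns NULL if memory runs out. */
long *index_rows(FILE *fp, size_t *n_rows)
{
    size_t cap = 1024, n = 0;
    long *offsets = malloc(cap * sizeof *offsets);
    int ch, at_line_start = 1;
    long pos;

    if (offsets == NULL) return NULL;
    rewind(fp);
    pos = ftell(fp);
    while ((ch = fgetc(fp)) != EOF) {
        if (at_line_start) {
            if (n == cap) {                /* grow the offset table */
                long *tmp = realloc(offsets, 2 * cap * sizeof *tmp);
                if (tmp == NULL) { free(offsets); return NULL; }
                offsets = tmp;
                cap *= 2;
            }
            offsets[n++] = pos;
        }
        at_line_start = (ch == '\n');
        if (at_line_start) pos = ftell(fp); /* the next row starts here */
    }
    *n_rows = n;
    return offsets;
}

/* Seek to row number row (0-based) and read its four integers into v.
 * Returns 0 on success, -1 on failure. */
int read_row(FILE *fp, const long *offsets, size_t n_rows, size_t row, int v[4])
{
    if (row >= n_rows) return -1;
    if (fseek(fp, offsets[row], SEEK_SET) != 0) return -1;
    if (fscanf(fp, "%d %d %d %d", &v[0], &v[1], &v[2], &v[3]) != 4) return -1;
    return 0;
}
```

In each loop iteration you would call read_row twice, once per row you need. The offset table holds one long per row, so for a very large file it may be worth keeping only every k-th offset and reading forward from there.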

Get average of two consecutive values in a vector depending on logical vector

I am reading data from a file and trying to manipulate the vector containing the data. Basically, I want to check whether the values come from consecutive lines and, if so, average each pair and put the result in an output vector.
Part of the data and lines:
lines=[153 152 153 154 233 233 234 235 280 279 280 281];
Sail=[ 3 4 3 1.5 3 3 1 2 2.5 5 2.5 2 ];
Here is what I am doing:
Sail=S(lines);
Y=diff(lines)==1;
for ii=1:length(Y)
if Y(ii)
output(ceil(ii/2))=(Sail(ii)+Sail(ii+1))/2;
end
end
Is this correct? Also, is there a way to do this without a for loop?
Thanks
My suggestion:
y = find(diff(lines)==1);
output = mean([Sail(y);Sail(y+1)]);
This assumes that when you have, say, [233 234 235], you want one value averaging the values from lines [233 234] and one value averaging those from [234 235]. If you wanted to do something more complex when longer sets of consecutive lines exist in your data, then the problem becomes more complex.
Incidentally it's a bad idea to do something like (ceil(ii/2)) - you can't guarantee a unique index for each matching value of ii. If you did want an output the same size as Sail (will have zeros in non-matching areas) then you can do something like this:
output2 = zeros(size(Sail));
output2(y)=output;

What format does matlab need for n-dimensional data input?

I have a 4-dimensional dictionary I made with a Python script for a data mining project I'm working on, and I want to read the data into Matlab to do some statistical tests on the data.
To read a 2-dimensional matrix is trivial. I figured that since my first dimension is only 4-deep, I could just write each slice of it out to a separate file (4 files total) with each file having many 2-dimensional slices, looking something like this:
2 3 6
4 5 8
6 7 3
1 4 3
6 6 7
8 9 0
This however does not work, and matlab reads it as a single continuous 6 x 3 matrix. I even took a look at dlmread but could not figure out how to get it to do what I wanted. How do I format this so I can put 3 (or preferably more) dimensions in a single file?
A simple solution is to create a file with two lines only: the first line contains the target array size, the second line contains all your data. Then, all you need to do is reshape the data.
Say your file is
3 2 3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
You do the following to read the array into the variable data
fid = fopen('myFile'); %# open the file (don't forget the extension)
arraySize = str2num(fgetl(fid)); %# read the first line, convert to numbers
data = str2num(fgetl(fid)); %# read the second line
data = reshape(data,arraySize); %# reshape the data
fclose(fid); %# close the file
Have a look at data to see how Matlab orders elements in multidimensional arrays.
Matlab stores data column-wise. So for your example (taking it as a 2x3x3 array, i.e. three slices of 2 rows by 3 columns each), Matlab will store the first, second and third columns of the first "slice", followed by the first, second and third columns of the second slice, and so on, like this:
2
4
3
5
6
8
6
1
7
4
3
3
6
8
6
9
7
0
So you can write the data out like this from Python (I don't know how) and then read it into Matlab. Then you can reshape it back into a 2x3x3 array and you'll retain the correct ordering.
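The column-wise ordering can be sketched in C as a loop nest, treating the data from the question as three slices of 2 rows by 3 columns. This is only an illustration: the names flatten_column_major and write_for_matlab are hypothetical, and the output uses the two-line layout (size line, then data line) suggested in the other answer.

```c
#include <stdio.h>

enum { ROWS = 2, COLS = 3, SLICES = 3 };   /* sizes assumed from the example */

/* Copy a[slice][row][col] into out[] in column-major order:
 * rows vary fastest, then columns, then slices. */
void flatten_column_major(int a[SLICES][ROWS][COLS],
                          int out[ROWS * COLS * SLICES])
{
    int n = 0;
    for (int s = 0; s < SLICES; s++)
        for (int c = 0; c < COLS; c++)
            for (int r = 0; r < ROWS; r++)
                out[n++] = a[s][r][c];
}

/* Write the array size on the first line and all values on the second.
 * Returns 0 on success, -1 on a write error. */
int write_for_matlab(FILE *fp, const int flat[], int count)
{
    if (fprintf(fp, "%d %d %d\n", ROWS, COLS, SLICES) < 0) return -1;
    for (int i = 0; i < count; i++) {
        if (i > 0 && fputc(' ', fp) == EOF) return -1;
        if (fprintf(fp, "%d", flat[i]) < 0) return -1;
    }
    return fputc('\n', fp) == EOF ? -1 : 0;
}
```

The same loop order (slice outermost, row innermost) applies in any language that writes the file; only the innermost index changes fastest.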
