Fastest data structure for inserting billions of integers? - c

I want a recommendation for the fastest data structure in C that can hold about 2 billion integers read from input. No value is less than 0 or greater than 2 billion. My goal is to remove duplicate values and sort the elements. If possible, I want insertion to be O(1) or O(log n), or at least as fast as possible. I also want to avoid trees if possible. I would appreciate any feedback or recommendations.
Edit: Using a normal array would take a really long time, so I want to use some other data structure, such as a stack or a queue.

Since you have a given number of values, and the range of those values is the same as the number of values, you can implement the list as an array where each array index represents a value and the value of each array element represents whether or not a given value is in the list.
For example:
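#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX_VALUE 2000000000 // values lie in the range 0..MAX_VALUE
int main(void) {
    size_t size = (size_t)MAX_VALUE + 1; // one byte per possible value, about 2GB
    char *arr = malloc(size);
    int i, value;
    if (arr == NULL) return 1; // allocation failed
    // populate list
    memset(arr, 0, size); // sizeof(arr) would be the size of the pointer, not of the array
    while (scanf("%d", &value) == 1) { // read until the input runs out
        if (value >= 0 && value <= MAX_VALUE)
            arr[value] = 1;
    }
    // print list
    for (i=0; i<=MAX_VALUE; i++) {
        if (arr[i]) {
            printf("%d\n", i);
        }
    }
    free(arr);
    return 0;
}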
Here we initialize the list to contain 0 for all values. Then we read in the values. If we read the value n, then we set arr[n] to 1. This does two things: it inserts the value in the list and eliminates duplicates by always setting the value to 1 as opposed to incrementing the value.
This gives O(1) insertions with duplicate removal, and the list is already sorted.
Note also that since each element of the array only needs to store the value 0 or 1, we can use char as the type to save memory. We can save further memory if we use a single bit to hold the 0 or 1 for a given value. Doing this involves some bit shifting:
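#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX_VALUE 2000000000 // values lie in the range 0..MAX_VALUE
int main(void) {
    size_t size = (size_t)MAX_VALUE / 8 + 1; // one bit per possible value
    unsigned char *arr = malloc(size);
    int i, value;
    if (arr == NULL) return 1; // allocation failed
    // populate list
    memset(arr, 0, size); // sizeof(arr) would be the size of the pointer, not of the array
    while (scanf("%d", &value) == 1) { // read until the input runs out
        if (value >= 0 && value <= MAX_VALUE)
            arr[value/8] |= 1 << (value%8);
    }
    // print list
    for (i=0; i<=MAX_VALUE; i++) {
        if (arr[i/8] & (1 << (i%8))) {
            printf("%d\n", i);
        }
    }
    free(arr);
    return 0;
}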
This cuts the memory requirements down to about 250MB which is still large but manageable.
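The bit operations described above can also be wrapped in a few small helpers, which makes them easier to check on a small range before scaling up to 2 billion. This is only a sketch: the names bitmap_create, bitmap_set, bitmap_test and bitmap_collect are made up for illustration and do not come from the answer.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper names; the answer above inlines these operations. */

/* Allocate a zero-filled bitmap able to hold the values 0..max_value.
   Returns NULL if the allocation fails. */
static unsigned char *bitmap_create(uint32_t max_value)
{
    size_t nbytes = (size_t)max_value / CHAR_BIT + 1;
    return calloc(nbytes, 1);
}

/* Mark value as present; marking it twice has no further effect,
   which is what removes duplicates. */
static void bitmap_set(unsigned char *bits, uint32_t value)
{
    bits[value / CHAR_BIT] |= (unsigned char)(1u << (value % CHAR_BIT));
}

/* Return nonzero if value was marked. */
static int bitmap_test(const unsigned char *bits, uint32_t value)
{
    return (bits[value / CHAR_BIT] >> (value % CHAR_BIT)) & 1u;
}

/* Copy the marked values, in ascending order, into out (capacity cap).
   Returns the number of values written. */
static size_t bitmap_collect(const unsigned char *bits, uint32_t max_value,
                             uint32_t *out, size_t cap)
{
    size_t n = 0;
    for (uint64_t v = 0; v <= max_value && n < cap; v++)
        if (bitmap_test(bits, (uint32_t)v))
            out[n++] = (uint32_t)v;
    return n;
}
```

Marking 42, 7, 42, 100, 0, 7 in a bitmap created with max_value 100 and then collecting yields 0, 7, 42, 100: sorted and without duplicates.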

Related

developing a function that returns number of distinct values that exist in array

I want to create a function that returns the number of distinct values present in a given array. For example, if the array is
array[5] = {1, 3, 4, 1, 3}, the return value should be 3 (3 unique numbers in the array).
I've so far only got this:
int NewFucntion(int values[], int numValues){
for (i=0; i<numValues; i++){
I'm a new coder, new to the C language, and I'm stuck on how to proceed. Any guidance would be much appreciated. Thanks
Add the elements of the array to a std::set<T>. Since a set does not allow duplicate elements, its size is the number of distinct elements.
For example:
#include<set>
int NewFucntion(int values[], int numValues){
std::set<int> set;
for(int i=0; i<numValues; i++){
set.insert(values[i]);
}
return set.size();
}
int distinct(int arr[], int arr_size){
    int count = arr_size;
    int current;
    int i, j;
    for (i = 0; i < arr_size; i++){
        current = arr[i];
        for (j = i+1; j < arr_size; j++) // checks values after the [i]th element
            if (current == arr[j]) {
                --count; // arr[i] occurs again later, so don't count this copy
                break;   // stop at the first match so each element is discounted at most once
            }
    }
    return count;
}
Here's the explanation.
The array with its size is passed as an argument.
current stores the element to compare others with.
count is the number that we need finally.
count is assigned the value of size of the array (i.e. we assume that all elements are unique).
(It can also be the other way round.)
A for loop starts, and the first (0th) element is compared with the elements after it.
If the element reoccurs, i.e. if (current==arr[j]), then count is decremented by 1 (we assumed all elements were unique; since this one is not, the number of unique values is one less than before. Hence --count). The inner loop then stops at once, so an element with several later copies is discounted only once; the later copies are handled when the outer loop reaches them.
The loops go on, and count is decremented down to the number of unique elements.
In case our array is {1,1,1,1}, the first three elements each find a later copy, so the code returns 1.
Hope that helps.
Happy coding. :)
I like wdc's answer, but I am going to give an alternative using only arrays and ints, as you seem to be coding in C and wdc's answer is a C++ answer.
To do this, go through your array as you did and store each new number you come across in a different array, let's call it repArray, in which there won't be any repetition. So every time you are about to add something to this array, you should first check that the number isn't already there.
You need to create it and give it a size, so why not numValues, as it cannot get any longer than that. You also need an integer recording how many of its indexes are valid, in other words how many you have written to; let's call it validIndexes. Every time you add a NEW element to repArray, increment validIndexes.
In the end, validIndexes will be your result.
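The description above could be written roughly as follows. repArray and validIndexes are the names used in the answer; the function name countDistinct and the -1 return on allocation failure are assumptions made for this sketch.

```c
#include <stdlib.h>

/* Counts distinct values by copying each first occurrence into repArray.
   countDistinct is a hypothetical name. Returns -1 if the scratch array
   cannot be allocated. */
int countDistinct(const int values[], int numValues)
{
    if (numValues <= 0)
        return 0;

    /* repArray can never need more than numValues slots. */
    int *repArray = malloc((size_t)numValues * sizeof *repArray);
    if (repArray == NULL)
        return -1;

    int validIndexes = 0; /* how many slots of repArray are in use */
    for (int i = 0; i < numValues; i++) {
        int seen = 0;
        /* Check whether values[i] has already been stored. */
        for (int j = 0; j < validIndexes; j++) {
            if (repArray[j] == values[i]) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            repArray[validIndexes++] = values[i];
    }

    free(repArray);
    return validIndexes;
}
```

For the array {1, 3, 4, 1, 3} from the question this returns 3. Like the nested-loop answer it takes O(n^2) time in the worst case, which is fine for small arrays.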

Lowest n Numbers in an Array

How can I assemble a set of the lowest or greatest numbers in an array? For instance, if I wanted to find the lowest 10 numbers in an array of size 1000.
I'm working in C but I don't need a language specific answer. I'm just trying to figure out a way to deal with this sort of task because it's been coming up a lot lately.
The QuickSelect algorithm separates a predefined number of the lowest or greatest numbers from the rest without fully sorting the array. It uses the same partition procedure as Quicksort, but stops once the pivot lands at the needed position.
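A minimal sketch of that idea, assuming a Lomuto partition with the last element as pivot (the answer does not say which partition scheme to use). The names swap_int, partition and lowest_k are made up for illustration. With this pivot choice the running time is linear on average but quadratic in the worst case.

```c
#include <stddef.h>

static void swap_int(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

/* Lomuto partition of a[lo..hi] around the pivot a[hi].
   Returns the pivot's final index. */
static size_t partition(int *a, size_t lo, size_t hi)
{
    int pivot = a[hi];
    size_t store = lo;
    for (size_t i = lo; i < hi; i++) {
        if (a[i] < pivot) {
            swap_int(&a[i], &a[store]);
            store++;
        }
    }
    swap_int(&a[store], &a[hi]);
    return store;
}

/* Rearranges a[0..n-1] in place so that the k smallest values occupy
   a[0..k-1], in no particular order. Requires 0 < k <= n. */
void lowest_k(int *a, size_t n, size_t k)
{
    size_t lo = 0, hi = n - 1;
    while (lo < hi) {
        size_t p = partition(a, lo, hi);
        if (p == k - 1)
            return;          /* pivot sits exactly at the boundary */
        if (p < k - 1)
            lo = p + 1;      /* boundary lies to the right of the pivot */
        else
            hi = p - 1;      /* p > k - 1 >= 0, so p - 1 cannot underflow */
    }
}
```

After lowest_k(array, 1000, 10) the ten lowest numbers are in array[0] to array[9]. For the greatest numbers, flip the comparison in partition.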
Method 1: Sort the array
You can do something like a quick sort on the array and take the first 10 elements. But this is rather inefficient, because you are only interested in the first 10 elements and sorting the entire array for that is overkill.
Method 2: Do a linear traversal and keep track of 10 elements.
int *lowerTen = malloc(10 * sizeof *lowerTen);
//'array' is your array with 1000 elements
for(int i=0; i<size_of_array; i++){
    if(comesUnderLowerTen(array[i], lowerTen)){
        addToLowerTen(array[i], lowerTen);
    }
}
int comesUnderLowerTen(int num, int *lowerTen){
    //if there are not yet 10 elements in lowerTen, insert.
    //else if 'num' is less than the largest element in lowerTen, insert.
}
void addToLowerTen(int num, int *lowerTen){
    //should make sure that num is inserted at the right place in the array
    //i.e, after inserting 'num' *lowerTen should remain sorted
}
Needless to say, this is not a working example. Also do this only if the 'lowerTen' array needs to maintain a sorted list of a small number of elements. If you need the first 500 elements in a 1000 element array, this would not be the preferred method.
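For reference, one way Method 2 could be filled in is sketched below. It merges the two helpers into one and tracks the number of stored elements explicitly, which the outline above leaves open; the names add_to_lowest and lowest_sorted are assumptions for this sketch.

```c
#include <stddef.h>

/* Inserts num into the ascending buffer low[0..*count-1] of capacity cap.
   When the buffer is full, num replaces the largest element, but only if
   num is smaller than it. */
static void add_to_lowest(int num, int *low, size_t *count, size_t cap)
{
    size_t pos;
    if (*count < cap) {
        pos = (*count)++;      /* room left: open a new slot at the end */
    } else if (cap > 0 && num < low[cap - 1]) {
        pos = cap - 1;         /* full: overwrite the current largest */
    } else {
        return;                /* not among the lowest seen so far */
    }
    /* Shift larger elements right until num's sorted position is found. */
    while (pos > 0 && low[pos - 1] > num) {
        low[pos] = low[pos - 1];
        pos--;
    }
    low[pos] = num;
}

/* Fills low with the min(n, cap) smallest values of array in ascending
   order. Returns how many values were stored. */
size_t lowest_sorted(const int *array, size_t n, int *low, size_t cap)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        add_to_lowest(array[i], low, &count, cap);
    return count;
}
```

Each insertion costs up to cap shifts, so the whole pass is O(n * cap). That is cheap for cap = 10 and is the reason this method is not recommended for something like 500 out of 1000.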
Method 3: Do method 2 when you populate the original array
This works only if your original 1000 element array is populated one by one - in that case instead of doing a linear traversal on the 1000 element array you can maintain the 'lowerTen' array as the original array is being populated.
Method 4: Do not use an array
Tasks like these would be easier if you can maintain a data structure like a binary search tree based on your original array. But again, constructing a BST on your array and then finding first 10 elements would be as good as sorting the array and then doing the same. Only do this if your use case demands a search on a really large array and the data needs to be in-memory.
Implement a priority queue (a max-heap, so the highest number is always at the front).
Loop through all the numbers and add them to that queue.
Once the queue's length reaches 10, start checking whether the current number is lower than the highest one in the queue.
If it is, delete that highest number and add the current one.
At the end you will have a priority queue holding the 10 lowest numbers from your array.
(Time needed is O(n log k) for a queue of size k, which is effectively O(n) for a fixed k = 10, where n is the length of your array.)
If you need any more tips, add a comment :)
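A sketch of the priority-queue approach, assuming the queue is an array-based binary max-heap (the answer does not prescribe an implementation). sift_down, sift_up and lowest_with_heap are hypothetical names.

```c
#include <stddef.h>

/* Restore the max-heap property for the subtree rooted at index i. */
static void sift_down(int *heap, size_t size, size_t i)
{
    for (;;) {
        size_t largest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < size && heap[l] > heap[largest]) largest = l;
        if (r < size && heap[r] > heap[largest]) largest = r;
        if (largest == i)
            return;
        int t = heap[i];
        heap[i] = heap[largest];
        heap[largest] = t;
        i = largest;
    }
}

/* Move heap[i] up until its parent is not smaller than it. */
static void sift_up(int *heap, size_t i)
{
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (heap[parent] >= heap[i])
            return;
        int t = heap[i];
        heap[i] = heap[parent];
        heap[parent] = t;
        i = parent;
    }
}

/* Stores the min(n, k) lowest values of src in heap, in heap order
   (heap[0] is the largest of them). Returns how many were stored. */
size_t lowest_with_heap(const int *src, size_t n, int *heap, size_t k)
{
    size_t size = 0;
    for (size_t i = 0; i < n; i++) {
        if (size < k) {
            heap[size] = src[i];       /* queue not full yet: just add */
            sift_up(heap, size);
            size++;
        } else if (k > 0 && src[i] < heap[0]) {
            heap[0] = src[i];          /* replace the current highest */
            sift_down(heap, size, 0);
        }
    }
    return size;
}
```

The result is not sorted; if order matters, sort the k stored values afterwards.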
The following code compiles cleanly, performs the desired functionality, might not be the most efficient, handles duplicates, and will need to be modified to handle numbers less than 0.
And now the code:
#include <stdlib.h> // size_t
void selectLowest( int *sourceArray, size_t numItemsInSource, int *lowestDest, size_t numItemsInDest )
{
size_t maxIndex = 0;
int maxValue = 0;
// initially populate lowestDest array
for( size_t i=0; i<numItemsInDest; i++ )
{
lowestDest[i] = sourceArray[i];
if( maxValue < sourceArray[i] )
{
maxValue = sourceArray[i];
maxIndex = i;
}
}
// search rest of sourceArray and
// if lower than max in lowestDest,
// then
// replace
// find new max value
for( size_t i=numItemsInDest; i<numItemsInSource; i++ )
{
if( maxValue > sourceArray[i] )
{
lowestDest[maxIndex] = sourceArray[i];
maxIndex = 0;
maxValue = 0;
for( size_t j=0; j<numItemsInDest; j++ )
{
if( maxValue < lowestDest[j] )
{
maxValue = lowestDest[j];
maxIndex = j;
}
}
}
}
} // end function: selectLowest

Maximum sum obtained after picking numbers in magical order from an array

There are N integers in an array. If you select the element at index i, you get array[i] in your packet, and array[i], array[i-1] and array[i+1] all become zero (i.e. you can't take those elements in later selections). What is the maximum sum you can collect in your packet by selecting array elements before all elements become zero?
For each i we can either take that item or ignore it. We do the one which yields better result. Following is the dp approach:
#include <algorithm> // std::max
#include <iostream>
using namespace std;
const int N=10;
int a[]={1,2,3,4,5,6,7,8,9,10};
int dp[N]; // dp[i] = best sum using only items 0..i
int main() {
    dp[0]=a[0];
    dp[1]=max(a[0],a[1]);
    for(int i=2;i<N;i++) {
        dp[i]=max(dp[i-2]+a[i],dp[i-1]);
        // dp[i-2]+a[i] is we include item i, hence we cannot take item i-1
        // dp[i-1] is we don't take item i
    }
    cout<<dp[N-1];
    return 0;
}
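Since dp[i] depends only on dp[i-1] and dp[i-2], the same recurrence can be computed with two running variables instead of an array. The C sketch below assumes all values are non-negative (with negative values it would treat "take nothing" as a sum of 0); max_pick_sum is a made-up name.

```c
/* Maximum sum of pairwise non-adjacent elements of a[0..n-1].
   Assumes non-negative values; returns 0 when n <= 0.
   max_pick_sum is a hypothetical name. */
long max_pick_sum(const int *a, int n)
{
    long skip_prev = 0; /* best sum over a[0..i-2] */
    long best_prev = 0; /* best sum over a[0..i-1] */
    for (int i = 0; i < n; i++) {
        long take = skip_prev + a[i];                    /* pick item i */
        long best = take > best_prev ? take : best_prev; /* or skip it */
        skip_prev = best_prev;
        best_prev = best;
    }
    return best_prev;
}
```

For the array 1 to 10 used above this returns 30 (picking 2, 4, 6, 8 and 10).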

Count the number of initialized elements in an array in C

My array is:
int array[100];
If I initialize the first n elements (n < 100) with integers, possibly including 0, and the rest are left uninitialized, how do I calculate n?
I tried a normal while loop with the following code:
int i = 0;
int count = 0;
while (a[i++])
count++;
However, the problem with this code is that it stops at the first element whose value is 0 (0 is treated as FALSE), so it doesn't count zeros correctly. How do I overcome this problem?
UPDATE: below is the background of this question
I have the following code:
int a[100];
int i;
for (i = 0; i < 100; i++)
scanf("%d", &a[i]);
If I have to input (just an example):
1 0 1 0 1 *
Then the first 5 elements of the array will be: 1 0 1 0 1. The rest will be uninitialized. In this situation, how do I count the number of these initialized elements to get 5?
If you can't simply record how many elements have been initialized, then you need to use a "magic" value like INT_MIN (the most negative int) to mark an element as unused; this only works if that value can never occur as real data. Alternatively, instead of storing ints, store something like this:
struct element {
int value;
int flags; // 0 means not used
};
Oh, one more idea: store the count of initialized elements in the first element. This is sort of how malloc() works sometimes. Then you can make the array have 101 elements and pass (array + 1, array[0]) to functions which expect an array of size 100.
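For the scanf loop in the question, simply recording the count is possible, because scanf and fscanf return the number of items they successfully converted. A sketch, with read_ints as a hypothetical helper name:

```c
#include <stddef.h>
#include <stdio.h>

/* Reads up to cap integers from in. Stops at end of input or at the first
   token that is not an integer (such as '*').
   Returns how many elements of a were actually filled in. */
size_t read_ints(FILE *in, int *a, size_t cap)
{
    size_t n = 0;
    while (n < cap && fscanf(in, "%d", &a[n]) == 1)
        n++;
    return n;
}
```

With int a[100]; and size_t n = read_ints(stdin, a, 100);, the input 1 0 1 0 1 * gives n equal to 5, and zeros are counted like any other value.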

find the largest ten numbers in an array in C

I have an array of int (its length can range from 11 to 500) and I need to extract the ten largest numbers into another array.
So, my starting code could be this:
arrayNumbers[n]; //array in input with numbers, 11<n<500
int arrayMax[10];
for (int i=0; i<n; i++){
if(arrayNumbers[i] ....
//here, i need the code to save current int in arrayMax correctly
}
//at the end of cycle, i want to have in arrayMax, the ten largest numbers (they haven't to be ordered)
What's the most efficient way to do this in C?
Study heaps. To keep the ten largest, maintain a min-heap of size 10: its root is the smallest of the current top ten, so a new element either replaces the root or is ignored. If you face a difficulty, please ask.
EDIT:
If the number of elements is less than 20, find the n-10 smallest elements; the rest of the numbers are the top 10.
Visualize a heap here
EDIT2: Based on a comment from Sleepy head, I searched and found this (I have not tested it). You can find the kth largest element (k = 10 in this case) in O(n) time. Then, in O(n) time, you can find the first 10 elements that are greater than or equal to this kth largest number. The final complexity is linear.
Here is an algorithm that solves this in linear time:
1. Use the selection algorithm, which finds the k-th element in an unsorted array in linear time. You can either use a variant of quick sort or more robust algorithms.
2. Get the top k using the pivot found in step 1.
This is my idea:
Insert the first 10 elements of your arrNum into arrMax.
Sort those 10 elements so that arrMax[0] = min and arrMax[9] = max.
Then check the remaining elements one by one and insert every possible candidate into its right position as follows (draft):
int k, r, p;
for (k = 10; k < n; k++)
{
    r = 0;
    while(1)
    {
        if (r == 10) break; // don't exceed length of arrMax (checked before reading arrMax[r])
        else if (arrMax[r] > arrNum[k]) break; // position to insert newcomer
        else r++; // iteration
    }
    if (r != 0) // no need to insert number smaller than all members
    {
        for (p=0; p<r-1; p++) arrMax[p]=arrMax[p+1]; // shift arrMax to make space for newcomer
        arrMax[r-1] = arrNum[k]; // insert newcomer at its position
    }
} // done!
Sort the array and copy the 10 largest elements into another array.
You can use the "select" algorithm, which finds the i-th largest number (you can put any number you like in place of i), and then iterate over the array and collect the numbers that are at least as big as that number. In your case i = 10, of course.
The following example can help you. It arranges the 10 biggest elements of the original array arrNum into arrMax, assuming all numbers in arrNum are positive. Based on this, you can handle negative numbers as well by initializing all elements of arrMax with the smallest possible number (INT_MIN).
Anyway, using a heap of 10 elements is a better solution than this one.
#include <stdio.h>
int main(void)
{
    int arrNum[500]={1,2,3,21,34,4,5,6,7,87,8,9,10,11,12,13,14,15,16,17,18,19,20};
    int arrMax[10]={0}; // kept in ascending order
    int i,cur,j,nn=23,pos;
    for(cur=0;cur<nn;cur++)
    {
        // find the highest slot whose value is smaller than the current number
        for(pos=9;pos>=0;pos--)
            if(arrMax[pos]<arrNum[cur])
                break;
        // shift the smaller entries down, dropping the smallest one
        for(j=1;j<=pos;j++)
            arrMax[j-1]=arrMax[j];
        if(pos>=0)
            arrMax[pos]=arrNum[cur];
    }
    for(i=0;i<10;i++)
        printf("%d ",arrMax[i]);
    return 0;
}
When improving efficiency of an algorithm, it is often best (and instructive) to start with a naive implementation and improve it. Since in your question you obviously don't even have that, efficiency is perhaps a moot point.
If you start with the simpler question of how to find the largest integer:
Initialise largest_found to INT_MIN
Iterate the array with :
IF value > largest_found THEN largest_found = value
To get the 10 largest, you perform the same algorithm 10 times, but retaining the last_largest and its index from the previous iteration, modify the largest_found test thus:
IF value > largest_found &&
value <= last_largest_found &&
index != last_largest_index
THEN
largest_found = last_largest_found = value
last_largest_index = index
Start with that, then ask yourself (or here) about efficiency.
