Code optimization - cycle - C

I want to run a large block of code until at least one of the elements of array 1 is equal to one of the elements of array 2.
I'm asking the community to share, if possible, the best (fastest to process) ways to do this "while".
Sum up:
while (none of the elements from arr1 is equal to any of arr2)
{
(code)
}
Reason: In my code, depending on some dimensions set by the user, my program may need to make this n^2 complexity comparison a lot of times, so I'm looking for a way to make it as light as I can.
I apologize in advance, and please let me know, if this type of question is not suitable for Stack Overflow.
Edit: My bad for not giving information about the arrays. As I said, their dimensions may vary based on what the user chooses, but each one's size should be between 3 and 1000. Both are arrays of integers.
Their values do change; the bigger the dimensions, the more that can happen.

The comments mention a hash and I agree, it could work, but whether it pays off is very size-dependent: a lot of the time the O(n^2) approach's overhead will be negligible.
Otherwise, just add all the elements of arr1 into a hash set, then go through arr2's elements to see whether they're in there. You'll get O(n) time. Honestly, though, unless you're working with element counts in the hundreds of thousands, even millions, I don't think the payoff will be that tangible, but it's machine-dependent and I haven't tested it myself.
C++ has std::unordered_set in the standard library. If you're using pure C, I'm sure there are implementations available online.
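For pure C, here is a minimal sketch of that hash-set idea, assuming plain int elements and array sizes in the hundreds or low thousands as described; the open-addressing table and the names (intset, arrays_intersect) are illustrative choices, not a library API.

#include <stdlib.h>

/* Open-addressing hash set for ints; capacity is kept a power of two
   and at least twice the element count, so linear probing stays short.
   Error handling is omitted for brevity. */
typedef struct { int *slot; unsigned char *used; size_t cap; } intset;

static size_t hash_int(int v, size_t cap) {
    return ((unsigned)v * 2654435761u) & (cap - 1);   /* multiplicative hash */
}

static void set_init(intset *s, size_t n) {
    s->cap = 16;
    while (s->cap < 2 * n) s->cap <<= 1;
    s->slot = malloc(s->cap * sizeof *s->slot);
    s->used = calloc(s->cap, 1);
}

static void set_add(intset *s, int v) {
    size_t i = hash_int(v, s->cap);
    while (s->used[i] && s->slot[i] != v) i = (i + 1) & (s->cap - 1);
    s->used[i] = 1;
    s->slot[i] = v;
}

static int set_has(const intset *s, int v) {
    size_t i = hash_int(v, s->cap);
    while (s->used[i]) {
        if (s->slot[i] == v) return 1;
        i = (i + 1) & (s->cap - 1);
    }
    return 0;
}

/* Returns 1 if any element of arr1 also occurs in arr2: O(n1 + n2) expected. */
static int arrays_intersect(const int *arr1, size_t n1, const int *arr2, size_t n2) {
    intset s;
    size_t i;
    int found = 0;
    set_init(&s, n1);
    for (i = 0; i < n1; i++) set_add(&s, arr1[i]);
    for (i = 0; i < n2 && !found; i++) found = set_has(&s, arr2[i]);
    free(s.slot);
    free(s.used);
    return found;
}

The loop then becomes while (!arrays_intersect(arr1, n1, arr2, n2)) { (code) }. For sizes between 3 and 1000, though, the plain O(n^2) double loop may well be just as fast in practice, as noted above.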

Related

What does worst-case Big-Omega(n) mean?

If Big-Omega is the lower bound, then what does it mean to have a worst-case time complexity of Big-Omega(n)?
From the book "Data Structures and Algorithms in Python" by Michael T. Goodrich:
Consider a dynamic array that doubles its size when the number of elements reaches its capacity.
This is from the book:
"we fully explored the append method. In the worst case, it requires
Ω(n) time because the underlying array is resized, but it uses O(1) time in the amortized sense"
The parameterized version, pop(k), removes the element that is at index k < n
of a list, shifting all subsequent elements leftward to fill the gap that results from
the removal. The efficiency of this operation is O(n−k), as the amount of shifting
depends upon the choice of index k. Note well that this
implies that pop(0) is the most expensive call, using Ω(n) time.
How does "Ω(n)" describe the most expensive case?
The number inside the parentheses is the number of operations you must do to actually carry out the operation, always expressed as a function of the number of items you are dealing with. You never worry about just how hard those operations are, only the total number of them.
If the array is full and has to be resized you need to copy all the elements into the new array. One operation per item in the array, thus an O(n) runtime. However, most of the time you just do one operation for an O(1) runtime.
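To make that concrete, here is a minimal C sketch of the doubling dynamic array the book describes (names and layout are illustrative; error handling is omitted):

#include <stdlib.h>
#include <string.h>

typedef struct { int *data; size_t len, cap; } dynarray;

/* Usually O(1): write one slot. When len == cap, the realloc copies all
   len existing elements -- the Omega(n) worst case -- but doubling makes
   that rare enough that the amortized cost per append stays O(1). */
void append(dynarray *a, int value) {
    if (a->len == a->cap) {
        a->cap = a->cap ? a->cap * 2 : 1;
        a->data = realloc(a->data, a->cap * sizeof *a->data);
    }
    a->data[a->len++] = value;
}

/* pop(k): removing index k (k < len) shifts the len-k-1 later elements
   left, so the cost is O(n - k); pop(0) is the worst case at Omega(n). */
int pop_at(dynarray *a, size_t k) {
    int removed = a->data[k];
    memmove(&a->data[k], &a->data[k + 1], (a->len - k - 1) * sizeof *a->data);
    a->len--;
    return removed;
}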
Common values are:
O(1): One operation only, such as adding it to the list when the list isn't full.
O(log n): This typically occurs when you have a binary search or the like to find your target. Note that the base of the log isn't specified as the difference is just a constant and you always ignore constants.
O(n): One operation per item in your dataset. For example, unsorted search.
O(n log n): Commonly seen in good sort routines where you have to process every item but can divide and conquer as you go.
O(n^2): Usually encountered when you must consider every interaction of two items in your dataset and have no way to organize it. For example, a routine I wrote long ago to find near-duplicate pictures. (Exact duplicates would be handled by making a dictionary of hashes and testing whether the hash existed, and thus be O(n) -- the two passes are a constant factor and discarded; you wouldn't say O(2n).)
O(n^3): By the time you're getting this high you consider it very carefully. Now you're looking at three-way interactions of items in your dataset.
Higher orders can exist, but you need to consider carefully what it's going to do. I have shipped production code that was O(n^8), but with very heavy pruning of paths, and even then it took 12 hours to run. Had the nature of the data not been conducive to such pruning I wouldn't have written it at all -- the code would still be running.
You will occasionally encounter even nastier stuff which needs careful consideration of whether it's going to be tolerable or not. For large datasets they're impossible:
O(2^n): Real world example: attempting to prune paths so as to retain a minimum spanning tree -- I computed all possible trees and kept the cheapest. Several experiments showed n never going above 10, so I thought I was ok -- until a different seed produced n = 22. I rewrote the routine for a not-always-perfect answer that was O(n^2) instead.
O(n!): I don't know any examples. It blows up horribly fast.

Which of the following methods is more efficient

Problem Statement:- Given an array of integers and an integer k, print all the pairs in the array whose sum is k
Method 1:-
Sort the array and maintain two pointers low and high, start iterating...
Time Complexity - O(nlogn)
Space Complexity - O(1)
Method 2:-
Keep all the elements in the dictionary and do the process
Time Complexity - O(n)
Space Complexity - O(n)
Now, out of the above 2 approaches, which one is the most efficient, and on what basis am I going to compare the efficiency -- time or space in this case, as the two approaches differ in both?
I've left my comment above for reference.
It was hasty. You do allow O(nlogn) time for the Method 1 sort (I now think I understand?) and that's fair (apologies;-).
What happens next? If the input array must be used again, then you need a sorted copy (the sort would not be in-place) which adds an O(n) space requirement.
The "iterating" part of Method 1 also costs ~O(n) time.
But loading up the dictionary in Method 2 is also ~O(n) time (presumably a throw-away data structure?) and dictionary access - although ~O(1) - is slower (than array indexing).
Bottom line: O-notation is helpful if it can identify an "overpowering cost" (rendering others negligible by comparison), but without a hint at use-cases (typical and boundary, details like data quantities and available system resources etc), questions like this (seeking a "generalised ideal" answer) can't benefit from it.
Often some simple proof-of-concept code and performance tests on representative data can make "the right choice obvious" (more easily and often more correctly than speculative theorising).
Finally, in the absence of a clear performance winner, there is always "code readability" to help decide;-)
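In that spirit, here is a minimal proof-of-concept sketch of Method 1 in C (names are illustrative); note that the sort is in place, so if the caller still needs the original order it must sort a copy, which is exactly the extra O(n) space mentioned above:

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Prints pairs of values from a[] whose sum is k.
   Sorting dominates at O(n log n); the two-pointer sweep is O(n).
   Repeated values are only reported once per value pair in this sketch. */
void print_pairs_with_sum(int *a, size_t n, int k) {
    size_t lo = 0, hi;
    if (n < 2) return;
    hi = n - 1;
    qsort(a, n, sizeof *a, cmp_int);
    while (lo < hi) {
        int sum = a[lo] + a[hi];
        if (sum == k)      printf("(%d, %d)\n", a[lo++], a[hi--]);
        else if (sum < k)  lo++;
        else               hi--;
    }
}

Method 2 would replace the sort with a hash table (count each value, then for each element look up k minus it), trading O(n) extra space for the O(n) expected time discussed above.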

AI - what is included in the state

I'm taking my first course in AI, and I have to define some problems in my homework (not yet solve them, just supply a definition).
So I have to give definitions for the Boolean satisfiability problem:
What is a state?
What is the initial state?
What is a final state?
What are the operators?
My question is: Should the formula be a part of the state?
Considerations so far:
The operators don't change it, and it's constant throughout the computation, so it shouldn't be part of the state.
If I do include it, then in theory the search space gets much bigger, since more states are possible, but in reality the formula can't be changed, so I get a big state and a branching factor that doesn't correspond to it.
It varies from one execution to the next, so it should be a part of the state.
You need only really consider the varying parts of the problem to be a state when conducting a search such as this, although I'd say in this case it really comes down to how you define the problem.
The search space for a given run of the algorithm depends upon the input formula, but after that it is fixed, i.e. you are searching the space of n-length bit vectors, where n is the number of variables in the formula. So the formula is not part of the state, because it does not vary.
The counter-claim is that you are searching in a larger space of formula-vector pairs, but as you cannot change the formula as part of the problem, this has not really increased the size of the search space. So I would not make the claim that "if I do include it, in theory the search space gets much bigger". It does not: the reachable states are the same, the branching is the same, and the space that requires exploring to solve the problem is the same.
Given this, my answer would be that the formula is not part of the state, but it defines the nature of the state space. So the answers to your four questions will each be functionally dependent on the formula in some way, but the state itself depends only on the number of variables in the formula.
Hope that makes sense!
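As a minimal sketch of that answer in C (the struct layout and names are purely illustrative, not something from the course), the formula lives in a read-only structure that parameterises the search, while a state is only the assignment vector the operators actually change:

/* The formula is fixed for the whole search: it defines the state space
   but is not copied into each state. */
typedef struct {
    int   num_vars;
    int **clauses;      /* clause i: 0-terminated list of literals, +v or -v */
    int   num_clauses;
} formula;

/* A state is just one truth value per variable: an n-length bit vector. */
typedef struct {
    unsigned char *assignment;   /* assignment[v] is 0 or 1 */
} state;

/* An operator changes the state only; the formula is read-only. */
void apply_flip(state *s, int var) {
    s->assignment[var] ^= 1;
}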
This is just a note for future readers - Not an answer
Vic Smith is right. Another way to look at the fact that in theory there are more states but in practice there are not (my second point above) is to think of each formula as having its own separate search space. For example, for the formula X or Y there is one search space, and for not X and Y there is another, and they have no common nodes in the representation.
So the formula can vary from one execution to another, but within any one execution the "reachable" states and the branching factor stay the same. And each execution has a different starting state.

finding a number appearing again among numbers stored in a file

Say I have 10 billion numbers stored in a file. How would I find a number that has already appeared previously?
Well, I can't just load billions of numbers into an array at a stretch and then use a simple nested loop to check whether a number has appeared previously.
How would you approach this problem?
Thanks in advance :)
I had this as an interview question once.
Here is an algorithm that is O(N):
Use a hash table. Sequentially store the numbers, where the hash key is computed from the number's value. Once you hit a value that is already in the table, you have found your duplicate.
Author Edit:
Below, @Phimuemue makes the excellent point that 4-byte integers have a fixed bound on the number of distinct values; that is 2^32, or approximately 4 billion. When considered alongside the conversation accompanying this answer, the worst-case memory consumption of this algorithm is dramatically reduced.
Furthermore, using the bit array as described below can reduce memory consumption to 1/8th of that, 512 MB. On many machines, this computation is now possible without considering either a persistent hash or the less-performant sort-first strategy.
Now, longer numbers or double-precision numbers are less-effective scenarios for the bit array strategy.
Phimuemue Edit:
Of course one needs a somewhat "special" hash table:
Take a hash table consisting of 2^32 bits. Since the question asks about 4-byte integers, there are at most 2^32 different values, i.e. one bit for each possible number. 2^32 bits = 512 MB.
So now one just has to determine the location of the corresponding bit in the table and set it. If one encounters a bit which is already set, the number occurred in the sequence already.
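A minimal sketch of that bookkeeping in C (the caller is assumed to have calloc'd a zeroed table of 2^29 bytes, i.e. 512 MB; the function name is illustrative):

#include <stdint.h>

/* Tests and records one 32-bit value in a table of 2^32 bits.
   Returns 1 if x was already seen, 0 otherwise. */
int seen_before(uint8_t *bits, uint32_t x) {
    uint32_t byte = x >> 3;          /* which byte holds x's bit   */
    uint8_t  mask = 1u << (x & 7);   /* which bit inside that byte */
    int already = (bits[byte] & mask) != 0;
    bits[byte] |= mask;              /* mark x as seen             */
    return already;
}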
The important question is whether you want to solve this problem efficiently, or whether you want to solve it accurately.
If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in a haystack" type of situation. Intuitively, short of a very grimy and unstable solution, there is no hope of solving this without storing a significant amount of the numbers.
Instead, turn to probabilistic solutions, which have been used in most any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).
A possible solution, which can be made to find exact results: use a sufficiently high-resolution Bloom filter. Either use the filter alone to determine if an element has already been seen, or, if you want perfect accuracy, use the filter to, eh, filter out elements which you can't possibly have seen and, on a second pass, determine with a standard hash table (as kbrimington suggested) the elements you actually see twice.
And if your problem is slightly different---for instance, you know that you have at least 0.001% elements which repeat themselves twice, and you would like to find out how many there are approximately, or you would like to get a random sample of such elements---then a whole score of probabilistic streaming algorithms, in the vein of Flajolet & Martin, Alon et al., exist and are very interesting (not to mention highly efficient).
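As a rough sketch of the Bloom-filter pass in C: the filter size and the two hash mixers below are arbitrary illustrative choices, not tuned values, and the caller is assumed to have calloc'd the bit array (BLOOM_BITS/8 bytes):

#include <stdint.h>

#define BLOOM_BITS (1u << 27)   /* 128 Mbit = 16 MB of filter, pick to taste */

static uint32_t bloom_hash(uint32_t x, uint32_t salt) {
    x ^= salt;
    x *= 2654435761u;           /* cheap multiplicative mixer */
    x ^= x >> 16;
    return x % BLOOM_BITS;
}

/* Returns 1 if x *may* have been seen before (false positives possible),
   0 if it definitely has not; then records x in the filter. Elements
   flagged here are the only ones that need exact checking on pass two. */
int bloom_check_add(uint8_t *bits, uint32_t x) {
    uint32_t h1 = bloom_hash(x, 0x9e3779b9u), h2 = bloom_hash(x, 0x85ebca6bu);
    int maybe = ((bits[h1 >> 3] >> (h1 & 7)) & 1) && ((bits[h2 >> 3] >> (h2 & 7)) & 1);
    bits[h1 >> 3] |= 1u << (h1 & 7);
    bits[h2 >> 3] |= 1u << (h2 & 7);
    return maybe;
}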
Read the file once, creating a hash table that stores the number of times you encounter each item. But wait! Instead of using the item itself as a key, you use a hash of the item itself, for example its least significant bits, let's say 20 bits (about 1M buckets).
After the first pass, all items that have counter > 1 may point to a duplicated item, or be a false positive. Rescan the file, consider only items that may lead to a duplicate (looking up each item in table one), build a new hashtable using real values as keys now and storing the count again.
After the second pass, items with count > 1 in the second table are your duplicates.
This is still O(n), just twice as slow as a single pass.
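A minimal sketch of the first pass in C, using the 20 low bits as the bucket key (sizes and names are illustrative; the second pass would re-read the file and track exact values, but only for numbers whose bucket count exceeds 1):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BUCKETS (1u << 20)   /* ~1M buckets, one saturating byte counter each */

int main(void) {
    uint8_t *count = calloc(BUCKETS, 1);
    uint32_t x;
    if (!count) return 1;
    while (scanf("%u", &x) == 1) {
        uint32_t b = x & (BUCKETS - 1);     /* 20 least significant bits */
        if (count[b] < 255) count[b]++;     /* saturate to avoid overflow */
    }
    /* Buckets with count > 1 mark the only candidates for the second pass. */
    free(count);
    return 0;
}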
How about:
Sort the input using some algorithm which allows only a portion of the input to be in RAM at a time (an external sort); examples are out there.
Seek duplicates in the output of the 1st step -- you'll need space for just 2 elements of the input in RAM at a time to detect repetitions.
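A minimal sketch of step 2 in C, assuming step 1 has already produced a sorted stream of numbers on standard input (so only two values are in RAM at a time):

#include <stdio.h>

int main(void) {
    unsigned prev = 0, cur;
    int have_prev = 0;
    while (scanf("%u", &cur) == 1) {
        if (have_prev && cur == prev) {   /* equal neighbours in sorted order */
            printf("duplicate: %u\n", cur);
            return 0;
        }
        prev = cur;
        have_prev = 1;
    }
    printf("no duplicate\n");
    return 0;
}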
Finding duplicates
Noting that it's a 32-bit integer means that you're going to have a large number of duplicates, since a 32-bit int can only represent about 4.3 billion different numbers and you have "10 billion".
If you were to use a tightly packed set you could represent all the possibilities in 512 MB, which can easily fit into current RAM sizes. As a start, this pretty easily allows you to recognise whether a number is duplicated or not.
Counting Duplicates
If you need to know how many times a number is duplicated, you're getting into having a hash map that contains only the duplicates (using the first 512 MB of RAM as the bit set to tell efficiently IF a number should be in the map or not). In a worst-case scenario with a large spread you're not going to be able to fit that into RAM.
Another approach, if the numbers have a fairly even spread of duplicates, is to use a tightly packed array with 2-8 bits per value, taking about 1-4 GB of RAM and allowing you to count up to 255 occurrences of each number.
It's going to be a hack, but it's doable.
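A minimal sketch of the packed-counter variant in C, at 2 bits per possible 32-bit value (a 1 GB table of 2^30 bytes, calloc'd by the caller; names are illustrative); counters saturate at 3, which is enough to distinguish "seen once" from "seen more than once":

#include <stdint.h>

/* Read the 2-bit counter for value x. */
static unsigned get2(const uint8_t *t, uint32_t x) {
    return (t[x >> 2] >> ((x & 3) * 2)) & 3;
}

/* Increment x's counter, saturating at 3. */
static void bump2(uint8_t *t, uint32_t x) {
    unsigned shift = (x & 3) * 2;
    unsigned c = get2(t, x);
    if (c < 3)
        t[x >> 2] = (uint8_t)((t[x >> 2] & ~(3u << shift)) | ((c + 1) << shift));
}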
You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.
How? Oh, what language are you using?
You have to read each number and store it into a hashmap, so that if a number occurs again, it will automatically get discarded.
If the possible range of numbers in the file is not too large, then you can use a bit array to indicate whether each number in the range has appeared.
If the range of the numbers is small enough, you can use a bit field to store if it is in there - initialize that with a single scan through the file. Takes one bit per possible number.
With a large range (like int) you need to read through the file every time. The file layout may allow for more efficient lookups (e.g. binary search in the case of a sorted file).
If time is not an issue and RAM is, you could read each number and then compare it to each subsequent number by reading from the file without storing it in RAM. It will take an incredible amount of time but you will not run out of memory.
I have to agree with kbrimington and his idea of a hash table, but first of all, I would like to know the range of the numbers that you're looking for. Basically, if you're looking for 32-bit numbers, you would need a single array of 4,294,967,296 bits. You start by setting all bits to 0, and every number in the file will set a specific bit. If the bit is already set then you've found a number that has occurred before. Do you also need to know how often they occur? Still, it would need 536,870,912 bytes at least (512 MB). It's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there would be hundreds of solutions to solve it this way.
I had to do this a long time ago.
What I did... I sorted the numbers as much as I could (I had a time-constraint limit) and arranged them like this while sorting:
1 to 10, 12, 16, 20 to 50, 52 would become:
[1,10], 12, 16, [20,50], 52, ...
Since in my case I had hundreds of numbers that were very "close" ($a-$b=1), from a few million sets I had very low memory usage.
P.S. Another way to store them:
1, -9, 12, 16, 20, -30, 52,
which worked because I had no numbers lower than zero.
After that I applied the various algorithms (described by other posters) here on the reduced data set.
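A minimal sketch of that compression step in C, assuming the numbers are already sorted on standard input; runs of consecutive values are printed as [start,end]:

#include <stdio.h>

int main(void) {
    unsigned start, prev, cur;
    if (scanf("%u", &start) != 1) return 0;
    prev = start;
    while (scanf("%u", &cur) == 1) {
        if (cur == prev + 1) { prev = cur; continue; }   /* run continues */
        if (start == prev) printf("%u, ", start);        /* isolated value */
        else               printf("[%u,%u], ", start, prev);
        start = prev = cur;
    }
    if (start == prev) printf("%u\n", start);
    else               printf("[%u,%u]\n", start, prev);
    return 0;
}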
#include <stdio.h>
#include <stdlib.h>
/* Macro is overly general but I left it 'cos it's convenient */
#define BITOP(a,b,op) \
((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))
int main(void)
{
    unsigned x = 0;
    int r = 0;
    /* One bit per possible unsigned value: 2^32 bits = 512 MB (assumes 32-bit unsigned).
       calloc, not malloc, so every bit starts cleared. */
    size_t *seen = calloc((size_t)1 << (8*sizeof(unsigned) - 3), 1);
    if (!seen) return 1;
    while ((r = scanf("%u", &x)) > 0 && !BITOP(seen, x, &)) BITOP(seen, x, |=);
    if (r > 0) printf("duplicate is %u\n", x);   /* loop stopped on an already-set bit */
    else printf("no duplicate\n");               /* loop stopped at end of input */
    free(seen);
    return 0;
}
This is a simple problem that can be solved very easily (a few lines of code) and very fast (a few minutes of execution) with the right tools.
My personal approach would be to use MapReduce:
MapReduce: Simplified Data Processing on Large Clusters
I'm sorry for not going into more detail, but once you get familiar with the concept of MapReduce it is going to be very clear how to target the solution.
Basically, we are going to implement two simple functions:
Map(key, value)
Reduce(key, values[])
so all in all:
open file and iterate through the data
for each number -> Map(number, line_index)
In the reduce we will get the number as the key and the total occurrences as the number of values (including their positions in the file).
So in Reduce(key, values[]), if the number of values > 1 then it's a duplicate number.
Print the duplicates: number, line_index1, line_index2, ...
Again, this approach can result in a very fast execution depending on how your MapReduce framework is set up; it is highly scalable and very reliable, and there are many different implementations of MapReduce in many languages.
There are several top companies offering already-built cloud computing environments, like Google, Microsoft Azure, Amazon AWS, ...
Or you can build your own and set up a cluster with any provider offering virtual computing environments, paying very low costs by the hour.
good luck :)
Another, simpler approach could be to use Bloom filters.
Implement a BitArray such that the ith byte of the array corresponds to the numbers 8*i to 8*(i+1)-1, i.e. the first bit of the ith byte is 1 if we have already seen 8*i, the second bit is 1 if we have already seen 8*i+1, and so on.
Initialize this bit array with size Integer.Max/8, and whenever you see a number k, set bit k%8 of byte k/8. If that bit is already 1, it means you have seen the number already.

How many possible bugs are there?

A user has to complete ten steps to achieve a desired result. The ten steps can be completed in any order.
If there is a bug, the bug is dependent only on the steps that have been taken, not the order in which they were taken (i.e., the bug is path independent).
For example: If the user performs three steps in the order 10, 1, 2 and produces a bug the exact same bug will be produced if the user performs the same three steps in the order 1, 2, 10.
What is the maximum number of unique bugs this program can have?
You mean what is the number of distinct sets pickable from 10 elements? That's a powerset: 2**10.
Much later: some knowledgeable others have suggested that having no bugs should not be counted as a bug. Accordingly, I revise my count: 2**10 - 1.
hughdbrown's answer is correct, but there is another possible interpretation of the question. Suppose that a sequence of operations can never produce more than one bug (i.e. that it should just be counted as one bug). For example, if the operations (3,6,2) is a bug, then you shouldn't be allowed to count (3,6,2,5) as another bug. In that case, rather than finding the maximum possible number of subsets of {1,2,...,10}, you want to find the maximum number of possible subsets so that no one contains another. The answer to this version of the question is "10 choose 5"=252.
Edit: by the way, the result that says this is maximal is called Sperner's Theorem.
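For reference, the two counts above work out to (just the arithmetic, nothing new):

$2^{10} - 1 = 1023, \qquad \binom{10}{5} = \frac{10!}{5!\,5!} = 252.$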
It depends entirely how many ways there are of doing each step. If you have a process that involves only one step, but there are multiple ways of doing that step, every step could have an associated bug.
There's also the misuse of functions, which you can't prevent against, and which could be considered a bug, i.e.:
If a user was to think that
rm -rf /
was short for
remove media --really fast /
i.e. eject all devices [1]
I would guess that would be a potential bug. It's user error really, but it's still a singular thing that can occur that produces results other than what was wanted.
You could argue the above is a bit over the top, but ultimately, there is no limitation on the ways users can do things wrong.
When users are there, assume anything that can go wrong will.
The only problem with the above reasoning is that you have to preemptively remove powerful things so users don't hurt themselves, which leads to less effective tools for those who know how to use them. A corks-on-forks sort of rationale.
The only way to solve this concern effectively is to give newbs blunt objects to learn with, and then give them an option which takes away all the foam padding once they learn the ropes, so experienced users don't have to keep working with blunt tools and don't have to un-blunt every tool themselves.
( If there are infinite numbers of possibly ways to do 1 step, I don't even want to begin to think of the numbers of ways to do 10 steps wrong )
[1]: If you don't know, this will erase lots of your hard drive and cause much pain. Don't do it.
One, a designer fault? :)
