Is this a good way to speed up array processing? - arrays

So, today I woke up with this single idea.
Just suppose you have a long list of things, an array, and you have to check each one of them to find the one that matches what you're looking for. To do this, you could use a for loop. Now, imagine that the one you're looking for is almost at the end of the list, but you don't know that. In that case, assuming the order in which you check the elements doesn't matter, it would be more convenient to start from the last element rather than the first one, just to save some time and maybe memory. But then, what if your element is almost at the beginning?
That's when I thought: what if I could start checking the elements from both ends of the list at the same time?
So, after several tries, I came up with this rough sample code (written in JS) that, in my opinion, would do what I described above:
var fx = function (list, target) {
    var len = list.length;
    // To save some time, as we were saying, we can check first whether the list is even long enough to bother
    if (len <= 1) {
        // If it isn't, we just check the only element (if any) and we're done
        return (len === 1 && list[0] === target) ? 0 : -1;
    }
    // So, now here's the thing: the number of loop iterations won't be the length of the list but just half of it.
    var last = len - 1;
    for (var i = 0; i <= last; i++, last--) {
        // Inside each iteration we check both the first and the last unvisited elements,
        // and so on until we reach the middle or find the one we're looking for, whichever happens first
        if (list[i] === target) {
            return i;
        }
        if (list[last] === target) {
            return last;
        }
    }
    // Not found
    return -1;
};
Anyway, I'm still not totally sure whether this would really speed up the process, slow it down, or make no difference at all. That's why I need your help, guys.
In your own experience, what do you think? Is this really a good way to make this kind of process faster? If it is or it isn't, why? Is there a way to improve it?
Thanks, guys.

Your proposed algorithm is good if you know that the item is likely to be at the beginning or end but not in the middle, bad if it's likely to be in the middle, and merely overcomplicated if it's equally likely to be anywhere in the list.
In general, if you have an unsorted list of n items then you potentially have to check all of them, and that will always take time which is at least proportional to n (this is roughly what the notation “O(n)” means) — there are no ways around this, other than starting with a sorted or partly-sorted list.
In your scheme, the loop runs for only n/2 iterations, but it does about twice as much work in each iteration as an ordinary linear search (from one end to the other) would, so it's roughly equal in total cost.
If you do have a partly-sorted list (that is, you have some information about where the item is more likely to be), then starting with the most likely locations first is a fine strategy. (Assuming you're not frequently looking for items which aren't in the list at all, in which case nothing helps you.)

If you work from both ends, then you'll get the worst performance when the item you're looking for is near the middle. No matter what you do, sequential searching is O(n).
If you want to speed up searching a list, you need to use a better data structure, such as a sorted list, hash table, or B-tree.
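For what it's worth, here is a minimal C sketch of the sorted-list option: keep the array sorted and use binary search, which drops each lookup from O(n) to O(log n). The function name and sample data are just illustrative.

#include <stdio.h>

/* Binary search over a sorted int array; returns the index of target,
   or -1 if it is not present. O(log n) instead of O(n). */
int binary_search(const int *a, int n, int target) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;  /* written this way to avoid overflow */
        if (a[mid] == target) return mid;
        if (a[mid] < target) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}

int main(void) {
    int a[] = {2, 3, 5, 8, 13, 21, 34};
    printf("%d\n", binary_search(a, 7, 13));  /* prints 4 */
    printf("%d\n", binary_search(a, 7, 4));   /* prints -1 */
    return 0;
}

Of course, keeping the array sorted costs something on insertion, which is why a hash table or a tree can be the better trade-off when the list changes often.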

Related

efficient way to remove duplicate pairs of integers in C

I've found answers to similar problems, but none of them exactly described my problem.
So, at the risk of being downvoted to hell, I was wondering if there is a standard method to solve my problem. Further, there's a chance that I'm asking the wrong question. Maybe the problem can be solved more efficiently another way.
So here's some background:
I'm looping through a list of particles. Each particle has a list of its neighboring particles. Now I need to create a list of unique particle pairs of mutual neighbours.
Each particle can be identified by an integer number.
Should I just build a list of all the pairs, including duplicates, and use some kind of sort & comparator to eliminate duplicates, or should I try to avoid adding duplicates to my list in the first place?
Performance is really important to me. I guess most of the loops may be vectorized and threaded. On average each particle has around 15 neighbours, and I expect that there will be 1e6 particles at most.
I do have some ideas, but I'm not an experienced coder and I don't want to waste a week testing every single method by benchmarking different situations just to find out that there's already a standard method for my problem.
Any suggestions?
BTW: I'm using C.
Some pseudo-code
for i in nparticles
    particle = particles[i]; // just an array containing the "index" of each particle
    // each particle has a neighbor-list
    for k in neighlist[i] // looping through all the neighbors
        // k represents the index of the neighbor of particle "i"
        if the pair (i,k) or (k,i) is not already in the pair-list, add it; otherwise don't
Sorting the elements on each iteration is not a good idea, since a comparison sort is O(n log n).
The next best thing would be to store the items in a search tree, better yet a binary search tree, and better still a self-balancing binary search tree; you can find implementations on GitHub.
An even better solution would give an access time of O(1). You can achieve this in two different ways. One is a simple identity array, where at each slot you store, say, a pointer to the item if there is one at that id, or some flag indicating that the id is empty. This is very fast but wasteful: you'll need O(N) memory.
The best solution in my opinion would be to use a set or a hash-map, which are basically the same thing, since a set can be implemented using a hash-map.
Here is a GitHub project with a C hash-map implementation.
And a Stack Overflow answer to a similar question.
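For what it's worth, here is a minimal C sketch of the sort-and-deduplicate route the question itself mentions: store each pair with the smaller index first so (i,k) and (k,i) become identical, sort once at the end with qsort, and drop adjacent duplicates. The type and function names are made up for illustration.

#include <stdio.h>
#include <stdlib.h>

/* A neighbour pair, stored with the smaller index first so that
   (i, k) and (k, i) canonicalize to the same value. */
typedef struct { int a, b; } pair_t;

static int cmp_pair(const void *x, const void *y) {
    const pair_t *p = x, *q = y;
    if (p->a != q->a) return (p->a > q->a) - (p->a < q->a);
    return (p->b > q->b) - (p->b < q->b);
}

/* Sort the collected pairs once, then keep only the first occurrence of
   each pair. Returns the number of unique pairs left at the front. */
size_t unique_pairs(pair_t *pairs, size_t count) {
    if (count == 0) return 0;
    qsort(pairs, count, sizeof *pairs, cmp_pair);
    size_t out = 1;
    for (size_t i = 1; i < count; i++)
        if (pairs[i].a != pairs[out - 1].a || pairs[i].b != pairs[out - 1].b)
            pairs[out++] = pairs[i];
    return out;
}

int main(void) {
    /* Toy data: pairs gathered from a neighbour loop, duplicates included. */
    pair_t pairs[] = { {2, 1}, {1, 2}, {3, 5}, {5, 3}, {1, 2} };
    for (size_t i = 0; i < 5; i++)  /* canonicalize: smaller index first */
        if (pairs[i].a > pairs[i].b) { int t = pairs[i].a; pairs[i].a = pairs[i].b; pairs[i].b = t; }
    size_t n = unique_pairs(pairs, 5);
    for (size_t i = 0; i < n; i++) printf("(%d, %d)\n", pairs[i].a, pairs[i].b);
    return 0;
}

Sorting once at the end is O(P log P) in the total number of collected pairs, which avoids the per-iteration lookups the answer above warns against; a hash set gets you the same result in expected O(P) at the cost of extra memory.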

O(n) linear search of an array for most common items

I am trying to write pseudocode for an O(n) algorithm that searches a sorted array for the most frequently occurring elements.
I am very new to data structures and algorithms and I haven't coded for about two and a half years. I have done some reading around the subject and I believe I am grasping the concepts, but I am struggling with the above problem.
This is all I have so far. I am struggling to get the desired result without the second "for" loop, which I believe makes the algorithm O(n^2), and I am not too sure how I would deal with more than one frequently occurring element.
Any help or direction as to where I can get help would be greatly appreciated.
A = [ ... ];   // the sorted array, with n elements
Elem = 0;
Count = 0;
for (j = 0; j < n; j++)
    tempElem = A[j];
    tempCount = 0;
    for (p = 0; p < n; p++)
        if (A[p] == tempElem)
            tempCount++;
    if (tempCount > Count)
        Elem = tempElem;
        Count = tempCount;
print("The most frequent element of array A is", Elem, "as it appears", Count, "times")
The inner loop is not your friend. :-)
Your loop body should key on just two bits of logic:
Is this element the same as the previous one?
If so, increment the count for the current item (curr_count) and go to the next element.
Otherwise, check curr_count against the best so far. If it's better, then record the previous element and its count as the new "best" data.
Either way, set the count back to 1 and go to the next element.
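Here is a rough C sketch of that single pass, assuming the array is sorted and ties are ignored (the question notes multiple most-frequent elements are possible; handling that would need a small extension). Names are just illustrative.

#include <stdio.h>

/* One pass over a sorted array: count the current run of equal values
   and remember the longest run seen so far. Assumes n >= 1. */
void most_frequent(const int *A, int n) {
    int best_elem = A[0], best_count = 1, curr_count = 1;
    for (int j = 1; j < n; j++) {
        if (A[j] == A[j - 1])
            curr_count++;            /* same as the previous element: extend the run */
        else
            curr_count = 1;          /* new value: start a fresh run */
        if (curr_count > best_count) {
            best_count = curr_count; /* this run is the best so far */
            best_elem = A[j];
        }
    }
    printf("The most frequent element of array A is %d as it appears %d times\n",
           best_elem, best_count);
}

int main(void) {
    int A[] = {1, 2, 2, 3, 3, 3, 7};
    most_frequent(A, 7);  /* prints 3, which appears 3 times */
    return 0;
}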

fast indexing for slow cpu?

I have a large document that I want to build an index of for word searching. (I hear this type of array is really called a concordance.) Currently it takes about 10 minutes. Is there a faster way to do it? Right now I iterate through each paragraph, and if I find a word I have not encountered before, I add it to my word array, along with the paragraph number in a subsidiary array; any time I encounter that same word again, I add the paragraph number to the index:
associativeArray={chocolate:[10,30,35,200,50001],parsnips:[5,500,100403]}
This takes forever, well, 5 minutes or so. I tried converting this array to a string, but it is so large it won't work to include in a program file, even after removing stop words, and would take a while to convert back to an array anyway.
Is there a faster way to build a text index other than linear brute force? I'm not looking for a product that will do the index for me, just the fastest known algorithm. The index should be accurate, not fuzzy, and there will be no need for partial searches.
I think the best idea is to build a trie, adding one word of your text at a time, and having each leaf hold a list of the locations where you can find that word.
This would not only save you some space, since storing words with similar prefixes requires way less space, but the search would be faster too. Search time is O(M), where M is the maximum string length, and insert time is O(n), where n is the length of the key you are inserting.
Since the obvious alternative is a hash table, here you can find some more comparisons between the two.
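If it helps, here is a compact C sketch of such a trie, holding paragraph numbers at the node where each word ends. It assumes lowercase a-z words and a fixed cap on stored locations, never frees memory, and all names are made up for illustration; a real concordance would use dynamic arrays.

#include <stdio.h>
#include <stdlib.h>

#define ALPHA 26
#define MAX_LOCS 64

struct trie_node {
    struct trie_node *child[ALPHA];
    int locs[MAX_LOCS];  /* paragraph numbers where the word occurs */
    int nlocs;           /* > 0 means a word ends at this node */
};

static struct trie_node *new_node(void) {
    return calloc(1, sizeof(struct trie_node));
}

/* Insert a lowercase word and record the paragraph where it was seen. */
static void trie_add(struct trie_node *root, const char *word, int para) {
    struct trie_node *n = root;
    for (const char *p = word; *p; p++) {
        int c = *p - 'a';
        if (!n->child[c]) n->child[c] = new_node();
        n = n->child[c];
    }
    if (n->nlocs < MAX_LOCS) n->locs[n->nlocs++] = para;
}

/* Look a word up; returns the node holding its locations, or NULL. */
static const struct trie_node *trie_find(const struct trie_node *root, const char *word) {
    const struct trie_node *n = root;
    for (const char *p = word; *p && n; p++) n = n->child[*p - 'a'];
    return (n && n->nlocs > 0) ? n : NULL;
}

int main(void) {
    struct trie_node *root = new_node();
    trie_add(root, "chocolate", 10);
    trie_add(root, "chocolate", 30);
    trie_add(root, "parsnips", 5);
    const struct trie_node *hit = trie_find(root, "chocolate");
    if (hit) {
        printf("chocolate:");
        for (int i = 0; i < hit->nlocs; i++) printf(" %d", hit->locs[i]);
        printf("\n");  /* chocolate: 10 30 */
    }
    return 0;
}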
I would use a HashMap<String, List<Occurrency>>. This way you can check whether a word is already in your index in about O(1).
At the end, when you have collected all the words and want to search them very often, you might try to find a hash function that has no or nearly no collisions. This way you can guarantee O(1) time for the search (or nearly O(1) if you still have some collisions).
Well, apart from going along with MrSmith42's suggestion of using the built in HashMap, I also wonder how much time you are spending tracking the paragraph number?
Would it be faster to change things to track line numbers instead? (Especially if you are reading the input line-by-line).
There are a few things unclear in your question, like what you mean by "I tried converting this array to a string, but it is so large it won't work to include in a program file, even after removing stop words, and would take a while to convert back to an array anyway." What array? Is your input in the form of an array of paragraphs, or do you mean the concordance entries per word?
It is also unclear why your program is so slow; probably there is something inefficient there. I suspect it is your check "if I find a word I have not encountered before": I presume you look up the word in the dictionary and then iterate through the array of occurrences to see if the paragraph number is there? That is a slow linear search; you would be better served using a set there (think hash/dictionary where you care only about the keys), kind of
concord = {
'chocolate': {10:1, 30:1, 35:1, 200:1, 50001:1},
'parsnips': {5:1, 500:1, 100403:1}
}
and your check then becomes if paraNum in concord[word]: ... instead of a loop or binary search.
PS. Actually, assuming you keep the list of occurrences in an array AND scan the text from the first to the last paragraph, the arrays will be built already sorted, so you only need to check the very last element there: if word in concord and paraNum == concord[word][-1]:. (Examples are in pseudocode/Python but you can translate them to your language.)

Mid point in a link list in a single traversal?

I'm trying to find the point of a singly linked list where a loop begins.
What I thought of was taking two pointers, *slow and *fast, one moving at twice the speed of the other.
If the list has a loop then at some point
5-6-7-8
| |
1-2-3-4-7-7
slow=fast
Can there be another elegant solution so that the list is traversed only once?
Your idea of using two walkers, one at twice the speed of the other would work, however the more fundamental question this raises is are you picking an appropriate data structure? You should ask yourself if you really need to find the midpoint, and if so, what other structures might be better suited to achieve this in O(1) (constant) time? An array would certainly provide you with much better performance for the midpoint of a collection, but has other operations which are slower. Without knowing the rest of the context I can't make any other suggestion, but I would suggest reviewing your requirements.
I am assuming this was some kind of interview question.
If your list has a loop, then to do it in a single traversal, you will need to mark the nodes as visited as your fast walker goes through the list. When the fast walker encounters NULL or an already visited node, the iteration can end, and your slow walker is at the midpoint.
There are many ways to mark the node as visited, but an external map or set could be used. If you mark the node directly in the node itself, this would necessitate another traversal to clean up the mark.
Edit: So this is not about finding the midpoint, but about loop detection without revisiting already visited nodes. Marking works for that as well. Just traverse the list and mark the nodes. If you hit NULL, no loop. If you hit a visited node, there is a loop. If the mark includes a counter as well, you even know where the loop starts.
I'm assuming that this singly linked list ends with NULL. In this case, the slow and fast pointers will work. Because the fast pointer moves at double the speed of the slow one, when the fast pointer reaches the end of the list the slow pointer will be at the middle of it.
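Here is a minimal C sketch of that slow/fast-pointer traversal for a NULL-terminated list (no loop handling); the node type and names are just for illustration.

#include <stdio.h>

/* Minimal singly linked list node. */
struct node {
    int value;
    struct node *next;
};

/* The fast pointer advances two nodes per step; when it runs off the end
   of the NULL-terminated list, the slow pointer is at the middle node. */
struct node *middle(struct node *head) {
    struct node *slow = head, *fast = head;
    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;
        fast = fast->next->next;
    }
    return slow;
}

int main(void) {
    /* Build the list 1 -> 2 -> 3 -> 4 -> 5 on the stack. */
    struct node n5 = {5, NULL}, n4 = {4, &n5}, n3 = {3, &n4}, n2 = {2, &n3}, n1 = {1, &n2};
    printf("middle value: %d\n", middle(&n1)->value);  /* prints 3 */
    return 0;
}

For an even-length list this returns the second of the two middle nodes; adjust the loop condition if you want the first.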

How will you delete duplicate odd numbers from a linked list?

Requirements/constraint:
delete only duplicates
keep one copy
list is not initially sorted
How can this be implemented in C?
(An algorithm and/or code would be greatly appreciated!)
If the list is very long, you want reasonable performance, and you are OK with allocating an extra O(log n) of memory, you can sort it in O(n log n) using qsort or merge sort:
http://swiss-knife.blogspot.com/2010/11/sorting.html
Then you can remove the duplicates in O(n) (the total is O(n log n) + O(n)).
If your list is very tiny, you can do as jswolf19 suggests, and you will get n(n-1)/2 comparisons in the worst case.
There are several different ways of detecting/deleting duplicates:
Nested loops
Take the next value in sequence, then scan until the end of the list to see if this value occurs again. This is O(n^2) -- although I believe the bounds can be argued lower? -- but the actual performance may be better as only scanning from i to end (not 0 to end) is done and it may terminate early. This does not require extra data aside from a few variables.
(See Christoph's answer as how this could be done just using a traversal of the linked list and destructive "appending" to a new list -- e.g. the nested loops don't have to "feel" like nested loops.)
Sort and filter
Sort the list (mergesort can be modified to work on linked lists) and then detect duplicate values (they will be side-by-side now). With a good sort this is O(n*lg(n)). The sorting phase usually is/can be destructive (e.g. you have "one copy") but it has been modified ;-)
Scan and maintain a look-up
Scan the list and as the list is scanned add the values to a lookup. If the lookup already contains said values then there is a duplicate! This approach can be O(n) if the lookup access is O(1). Generally a "hash/dictionary" or "set" is used as the lookup, but if only a limited range of integrals are used then an array will work just fine (e.g. the index is the value). This requires extra storage but no "extra copy" -- at least in the literal reading.
For small values of n, big-O is pretty much worthless ;-)
Happy coding.
I'd either
mergesort the list followed by a linear scan to remove duplicates
use an insertion-sort based algorithm which already removes duplicates when re-building the list
The former will be faster, the latter is easier to implement from scratch: Just construct a new list by popping off elements from your old list and inserting them into the new one by scanning it until you hit an element of greater value (in which case you insert the element into the list) or equal value (in which case you discard the element).
Well, you can sort the list first and then check for duplicates, or you could do one of the following:
for i from 0 to list.length-1
    for j from i+1 to list.length-1
        if list[i] == list[j]
            //delete one of them
        fi
    loop
loop
This is probably the most unoptimized piece of crap, but it'll probably work.
Iterate through the list, holding a pointer to the previous object every time you go on to the next one. Inside your iteration loop, iterate through the rest of the list to check for a duplicate. If there is a duplicate, then back in the main iteration loop, get the next object, set the previous object's next pointer to the object you just retrieved, then break out of the loop and restart the whole process until there are no duplicates.
You can do this in linear time using a hash table.
You'd want to scan through the list sequentially. Each time you encounter an odd-numbered element, look it up in your hash table. If that number is already in the hash table, delete it from the list; if not, add it to the hash table and continue.
Basically the idea is that for each element you scan in the list, you are able to check in constant time whether it is a duplicate of a previous element that you've seen. This takes only a single pass through your list and will take at worst a linear amount of memory (worst case is that every element of the list is a unique odd number, thus your hash table is as long as your list).
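For what it's worth, here is a rough C sketch of that single pass, using a tiny fixed-size open-addressing hash set over the odd values already seen. The node type, capacity, and helper names are made up for illustration; a real implementation would grow the table or use a library.

#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

/* Tiny open-addressing hash set of ints, fixed capacity for the sketch. */
#define CAP 1024
static int  slots[CAP];
static char used[CAP];

/* Returns 1 if v was already recorded, otherwise records it and returns 0. */
static int seen_before(int v) {
    unsigned h = (unsigned)v % CAP;
    while (used[h]) {
        if (slots[h] == v) return 1;
        h = (h + 1) % CAP;  /* linear probing */
    }
    used[h] = 1;
    slots[h] = v;
    return 0;
}

/* Single pass: unlink every odd value the hash set has seen before. */
void remove_duplicate_odds(struct node **head) {
    struct node **link = head;
    while (*link) {
        struct node *n = *link;
        if (n->value % 2 != 0 && seen_before(n->value)) {
            *link = n->next;  /* unlink and free the duplicate odd node */
            free(n);
        } else {
            link = &n->next;
        }
    }
}

int main(void) {
    /* Build 3 -> 4 -> 3 -> 5 -> 4 -> 5, then strip duplicate odds. */
    int vals[] = {3, 4, 3, 5, 4, 5};
    struct node *head = NULL, **tail = &head;
    for (int i = 0; i < 6; i++) {
        *tail = malloc(sizeof **tail);
        (*tail)->value = vals[i];
        (*tail)->next = NULL;
        tail = &(*tail)->next;
    }
    remove_duplicate_odds(&head);
    for (struct node *n = head; n; n = n->next) printf("%d ", n->value);
    printf("\n");  /* expected output: 3 4 5 4 */
    return 0;
}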
