How to sum up the values iterated by the mapper - google-app-engine

I am new to using the mapper and have been following the IKAI source to learn.
Assume the mapper is iterating over a datastore entity (say PowerHouse, which has a currentConsumption field that holds the amount of current consumed by each house).
I need the mapper tool to traverse the complete entity and compute the sum of its currentConsumption field.
Following the IKAI demo, I am able to traverse each row of PowerHouseTable, but I am not sure how to sum up the currentConsumption.
Any help is greatly appreciated.

Yes, you need a reducer step to aggregate the currentConsumption.
It is normally straightforward: implement a reduce function to go with the map function you already have, so that the results are aggregated.
Take a look at the WordCount example; it follows almost the same principle you are looking for:
http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v2.0
In the WordCount example, the map function takes the text word by word and assigns each word the value 1, so the map output is a list of (key = word, value = 1) pairs. In your example the key would be the house and the value its currentConsumption.
The mapper's output is then the reducer's input. In WordCount, the reducer sums the 1s for each word to get that word's overall count, producing a list of (key = word, value = count) pairs. The same applies to your case: as a result, the key will be the house and the value will be its summed currentConsumption.
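A rough sketch of what the two functions could look like, written as plain Python in the WordCount style rather than the actual appengine-mapreduce API (power_house.key and power_house.currentConsumption are assumed names for illustration):

def map_fn(power_house):
    # emit one (key, value) pair per PowerHouse entity
    yield (power_house.key, power_house.currentConsumption)

def reduce_fn(house_key, consumption_values):
    # sum all values emitted for the same key
    yield (house_key, sum(consumption_values))

If you want a single grand total over the whole table rather than a per-house sum, have the mapper emit one constant key (e.g. "total") for every entity; the reducer then produces just one summed value.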

Sum requires a reducer step since it is an aggregation operator.

See https://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/demo/main.py#256 for a Python example that uses a reduce phase, and https://code.google.com/p/appengine-mapreduce/source/browse/trunk/java/example/src/com/google/appengine/demos/mapreduce/entitycount/ for a Java example.

Related

Storing and replacing values in array continuously

I'm trying to read the amplitude of a waveform and light a green, yellow or red indicator depending on the amplitude of the signal. I'm fairly new to LabVIEW and couldn't get an idea to work that would have worked in any other programming language I know. What I'm trying to do is take the value of the signal and, every time it updates, store the amplitude in an index of a large array, with each measurement going into index n+1.
After a certain number of data points I want to start over and replace values in the array (I use a Formula Node with the modulus for this). By keeping a finite number of indices to check for the maximum value, I restrict my amplitude check to a certain time period.
However, my problem is that whenever I use Replace Array Subset to insert a new value at index n, all the other indices get erased, which makes it pretty much useless. I was thinking the Initialize Array is causing the problem, but I just can't seem to wrap my head around what to do here.
I tried creating basic arrays on the front panel, but those are either control or indicator arrays and can't seem to be both written to and read from; it's either a control (read but not write) or an indicator (write but not read). Maybe it's just not possible to do what I had in mind in an elegant way in LabVIEW. If it's not possible to do this with arrays in LabVIEW, I will look for a different way to do it.
I'm pretty sure I have most of the rest of the code down except for an unfinished part here and there. It's just the arrays that aren't working as I want them to.
I expected the array to retain its previously entered data at index n-1 when index n is written, and only to be replaced once the index has wrapped around to that specific point.
Instead, it's as if a new array is initialized every time a new index is written.
download link for the VI
What you want to do:
Transport the content of the modified array into the next iteration of the WHILE loop.
What happens:
On each iteration, the content of the array is the same. It is the content of the initial array you created outside.
To solve this, right-click the orange square on the left border of the loop and make it a "shift register". The symbol changes, and a matching symbol appears on the right border. Now wire the modified array to the symbol on the right. Whatever flows out into the symbol on the right comes back in from the left symbol on the next iteration.
Edit:
I have optimized your code a little. There is a Modulo function, and a Case structure can handle ranges: "..3" means "values less than or equal to 3", the next case is "Default", and the next is "7..". Unfortunately, this only works for integers; otherwise one would use nested case structures with the < comparator or similar.
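Since a block diagram can't be shown in text, here is a rough Python analogy of the same pattern, just to illustrate the idea: the list plays the role of the shift register (state carried from one loop iteration to the next), the modulo index overwrites the oldest sample, and the buffer size and thresholds 3 and 7 are only placeholders:

import random

BUFFER_SIZE = 100                          # hypothetical window length
buffer = [0.0] * BUFFER_SIZE               # the array initialized once, outside the loop

for n in range(1000):                      # stands in for the WHILE loop
    amplitude = random.uniform(0, 10)      # stand-in for reading the signal amplitude
    buffer[n % BUFFER_SIZE] = amplitude    # "Replace Array Subset" at index n mod size
    peak = max(buffer)                     # maximum over the recent window
    light = "green" if peak <= 3 else ("yellow" if peak <= 7 else "red")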

How to sort an array and find the two highest peaks after using find_peaks from Scipy

I am struggling with find_peaks a little...
I'm applying a cubic spline to some data, from which I want to extract some peaks. However, the data may have several peaks, while I only want the two largest. I can find the peaks with
peak_data = signal.find_peaks(spline, height=0.3, distance=50)
I can use this to get the x and y values at the index points within peak_data
peak_vals = spline[peak_data[0]]
time_vals = xnew[peak_data[0]] # xnew being the splined x-axis
I thought I could sort peak_vals, keep the first two values (i.e. the highest and second-highest peaks), and then use those to get the corresponding times from xnew. However, I am unable to use .sort, which returns
AttributeError: 'tuple' object has no attribute 'sort'
or sorted() which returns
TypeError: '>' not supported between instances of 'int' and 'dict'
I think this is because I am indexing from a NumPy array (the spline data), which creates another NumPy array that doesn't work with either of the sort commands.
The best I can manage is to iterate through to a new list and then grab the first two values from that:
peak_val1 = []
peak_vals = spline[peak_data[0]]
for i in peak_vals:
    peak_val1.append(i)
peak_val1.sort(reverse=True)
peak_val2 = peak_val1[0:2]
This works but seems a tremendously long winded way to do this given that I still need to index the time values. I'm sure that there must be a faster (simpler) way?
Added note: I realise that find_peaks returns an index list, but it actually seems to contain both the indices and the peak values in an array plus a dictionary (sorry, I'm very new to Python; curly braces mean dictionary to me, but it doesn't look like a simple dict). Anyway, print(peak_data) returns both the index positions and their values.
(array([ 40, 145, 240, 446]), {'peak_heights': array([0.34588031, 0.43761898, 0.45778744, 0.74167977])})
Is there a way to directly access these data perhaps?
You can do this:
peak_indices, peak_dict = signal.find_peaks(spline, height=0.3, distance=50)
This returns the indices of all the peaks in the array, as well as a dictionary of information about the peaks, such as their heights, prominences, etc. To get the heights of the peaks you can access the dictionary like this:
peak_heights = peak_dict['peak_heights']
Then to find the indices of the highest and second-highest peak you can do:
highest_peak_index = peak_indices[np.argmax(peak_heights)]
second_highest_peak_index = peak_indices[np.argpartition(peak_heights,-2)[-2]]
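To also pull out the times of those two peaks, one possible follow-up (assuming xnew and spline are the splined x-axis and y-values from the question):
top_two = peak_indices[np.argsort(peak_heights)[-2:][::-1]]   # tallest peak first
top_two_times = xnew[top_two]
top_two_heights = spline[top_two]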
Hope this helps someone somewhere:))
Just posting this in case anyone else has similar trouble, and to remind myself to read the docs carefully in future!
Assuming there are no additional arguments, find_peaks() returns a tuple containing an array of the indices of the peaks and a dictionary of the actual peak values. Once I realised this, it's pretty simple to use sequence unpacking to get a separate array and dictionary. So if I began with
peak_data = signal.find_peaks(spline, height=0.3, distance=50)
all I needed to do was unpack it into two variables:
peak_index, dict_vals = peak_data
Now I have the indices and the values in the order they were identified.

How to compare two arrays containing points to find out the percentage of similarity?

Suppose we have 10 arrays like the sample below.
Every element of those arrays has two parts: the section number, and the times (in seconds) at which the cursor was in that section (e.g. the cursor was in that section at the 3rd and 10th seconds).
Now, when we are given a new array, we need to compare it with our model and show a similarity percentage in order to score the actions.
I really have no idea whether I should use any clustering or classification methods, and if so, how they would apply to whole arrays (at university we only ever worked with single array elements or with vectors).
I found something in the data mining book by Jiawei Han, Micheline Kamber and Jian Pei. What do you think about cosine similarity?
But then we need to convert each array into another one containing the section numbers and the frequency of references to each section.
converted array
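A minimal sketch of cosine similarity between two such converted arrays, assuming both have already been turned into per-section frequency vectors of the same length (the numbers below are made up for illustration):

import numpy as np

def cosine_similarity(a, b):
    # a, b: per-section frequency vectors of equal length
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

model = [2, 0, 5, 1]   # hypothetical counts of cursor visits per section
new   = [1, 0, 4, 2]
print(f"similarity: {cosine_similarity(model, new):.0%}")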

Complexity of sorting or not sorting an integer array

I have an array of integers storing some userIDs. I basically want to prevent a user from performing an action twice, so the moment he has done it his userID enters this array.
I wonder whether it is a good idea to sort this array or not. If it is sorted, then you have A = {min, ..., max}. Then, if I'm not wrong, checking whether an ID is in the array takes log2(|A|) steps. On the other hand, if the array is not sorted, you need |A|/2 steps on average.
So sorting seems better for checking whether an element exists in the array (log(|A|) vs |A|), but what about adding a new value? The position where the new userID should go can be computed while you are checking, but then you have to shift all the elements from that position onward by 1... or at least that's how I'd do it in C. In truth this is going to be an array in a MongoDB document, so perhaps it is handled in some more efficient way there.
Of course, if the array is unsorted, adding a new value just takes one step (pushing it to the end).
To me, an add operation (with a prior check) will take:
If sorted: log2(|A|) + |A|/2. The log2 part to check and find the place, and the |A|/2 as the average number of displacements needed.
If not sorted: |A|/2 + 1. The |A|/2 to check and the +1 to push the new element.
Given that for adding you'll always check first, the unsorted version appears to take fewer steps, but to be honest I'm not very confident about the +|A|/2 of the sorted version. That's how I would do it in C, but maybe it works another way...
O(log |A|) is definitely better than O(|A|), but this can be done in O(1). The data structure you are looking for is a hash table (a HashMap), if you are going to do this in C. I haven't worked in C in a very long time, so I don't know whether one is natively available now; it surely is available in C++, and in the worst case there are libraries you can use.
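As a small illustration of the trade-off discussed above (written in Python rather than C, purely as a sketch):

import bisect

# Sorted-array approach: O(log |A|) lookup, but O(|A|) insert because elements shift.
sorted_ids = []

def has_acted_sorted(user_id):
    i = bisect.bisect_left(sorted_ids, user_id)
    return i < len(sorted_ids) and sorted_ids[i] == user_id

def mark_acted_sorted(user_id):
    if not has_acted_sorted(user_id):
        bisect.insort(sorted_ids, user_id)

# Hash-based approach: expected O(1) lookup and insert.
acted = set()

def mark_acted(user_id):
    if user_id in acted:
        return False      # user already performed the action
    acted.add(user_id)
    return True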
For MongoDB, my solution may not be the best, but I think you can create another collection containing just the userIDs and index it on userID. That way, when someone tries to perform the action, you can query the user's status very quickly.
Alternatively, in MongoDB you can add another key, say UserDidTheAction, to your Users collection, whose value is true or false. Index the collection on userID and you will probably get performance similar to the other solution, at the cost of modifying your original collection's design (though schemas don't have to be fixed in MongoDB).

Finding k different keys using binary search in an array of n elements

Say I have a sorted array of n elements. I want to find 2 different keys, k1 and k2, in this array using binary search.
A basic solution would be to apply binary search to each key separately, i.e. two calls for the 2 keys, which keeps the time complexity at 2·log(n).
Can we solve this problem using any other approach(es) for k different keys, with k < n?
Each search you complete can be used to subdivide the input to make it more efficient. For example suppose the element corresponding to k1 is at index i1. If k2 > k1 you can restrict the second search to i1..n, otherwise restrict it to 0..i1.
Best case is when your search keys are sorted also, so every new search can begin where the last one was found.
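A small Python sketch of that idea for sorted search keys, where each search is restricted to the part of the array from the previous hit onward (bisect is used as the binary search; the example data is made up):

import bisect

def find_sorted_keys(arr, keys):
    # arr and keys are both sorted, so each search can start where
    # the previous one finished, as described above
    results = {}
    lo = 0
    for k in keys:
        i = bisect.bisect_left(arr, k, lo)
        results[k] = i if i < len(arr) and arr[i] == k else None
        lo = i      # larger keys cannot lie to the left of this point
    return results

print(find_sorted_keys([1, 3, 5, 7, 9, 11], [3, 9]))   # {3: 1, 9: 4}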
You can reduce the real complexity (although it will still be the same big O) by walking the shared search path once. That is, start the binary search until the element you're at is between the two items you are looking for. At that point, spawn a thread to continue the binary search for one element in the range past the pivot element you're at and spawn a thread to continue the binary search for the other element in the range before the pivot element you're at. Return both results. :-)
EDIT:
As Oli Charlesworth mentioned in his comment, you did ask about an arbitrary number of keys. The same logic can be extended to an arbitrary number of search keys, though. Here is an example:
You have an array of search keys like so:
searchKeys = ['findme1', 'findme2', ...]
You have a key-value data structure that maps each search key to the value found:
keyToValue = {'findme1': 'foundme1', 'findme2': 'foundme2', 'findme3': 'NOT_FOUND_VALUE'}
Now, following the same logic as before this edit, you can pass a "pruned" searchKeys array to each spawned thread, splitting the keys where they diverge at the pivot. Each time you find a value for a given key, you update the keyToValue map. When there are no more ranges to search but there are still keys in the searchKeys array, you can assume those keys are not present and update the mapping to signify that in some way (some null-like value, perhaps?). When all threads have been joined (or by use of a counter), you return the mapping. The big win here is that you do not have to repeat the initial search steps that any two keys share.
Second EDIT:
As Mark has added in his answer, sorting the search keys allows you to only have to look at the first item in the key range.
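A sequential (non-threaded) Python sketch of the same pruning idea, under the assumption that the search keys are sorted: search for the middle key first, then recurse with the remaining keys split into those below and above it, each restricted to the matching slice of the array (None plays the role of the NOT_FOUND_VALUE above):

import bisect

def multi_search(arr, keys, lo=0, hi=None, out=None):
    # arr is sorted; keys is a sorted list of search keys
    if hi is None:
        hi = len(arr)
    if out is None:
        out = {}
    if not keys:
        return out
    mid = len(keys) // 2
    key = keys[mid]
    i = bisect.bisect_left(arr, key, lo, hi)
    out[key] = i if i < hi and arr[i] == key else None
    multi_search(arr, keys[:mid], lo, i, out)       # smaller keys, left slice
    multi_search(arr, keys[mid + 1:], i, hi, out)   # larger keys, right slice
    return out

print(multi_search([2, 4, 6, 8, 10, 12], [4, 5, 10]))   # {5: None, 4: 1, 10: 4}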
You can find academic articles calculating the complexity of different schemes for the general case, which is merging two sorted sequences of possibly very different lengths using the minimum number of comparisons. The paper at http://www.math.cmu.edu/~af1p/Texfiles/HL.pdf analyses one of the best known schemes, by Hwang and Lin, and has references to other schemes, and to the original paper by Hwang and Lin.
It looks a lot like a merge that steps through each item of the smaller list, skipping along the larger list with a step size equal to the ratio of the sizes of the two lists. If it finds that it has stepped too far along the large list, it can use binary search to find a match among the values it has stepped over. If it has not stepped far enough, it takes another step.
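A rough Python sketch of that stepping idea (not the exact Hwang-Lin algorithm, just the pattern described above; the step size and example data are illustrative):

import bisect

def merge_search(small, large):
    # For each key of the sorted 'small' list, stride ahead through the sorted
    # 'large' list, then binary-search inside the overshot block.
    step = max(1, len(large) // max(1, len(small)))
    pos, found = 0, {}
    for key in small:
        lo = pos
        while pos < len(large) and large[pos] < key:
            lo = pos
            pos += step                  # not far enough yet: take another step
        hi = min(pos, len(large))
        i = bisect.bisect_left(large, key, lo, hi)   # stepped too far: binary search the gap
        found[key] = i if i < len(large) and large[i] == key else None
        pos = i
    return found

print(merge_search([3, 7], list(range(1, 11))))   # {3: 2, 7: 6}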
