How to iterate through two numpy arrays of different dimensions - arrays

I am working with the MNIST dataset, x_test has dimension of (10000,784) and y_test has a dimension of (10000,10). I need to iterate through each sample of these two numpy arrays at the same time as I need to pass them individually to score.evaluate()
I tried nditer, but it throws an error saying operands could not be broadcast together, since they have different shapes.
score=[]
for x_sample, y_sample in np.nditer ([x_test,y_test]):
    a=x_sample.reshape(784,1)
    a=np.transpose(a)
    b=y_sample.reshape(10,1)
    b=np.transpose(b)
    s=model.evaluate(a,b,verbose=0)
    score.append(s)

Assuming that what you are actually trying to do here is to get the individual loss per sample in your test set, here is a way to do it (in your approach, even if you get past the iteration part, you will have issues with model.evaluate, which was not designed for single sample pairs)...
To make the example reproducible, here I also assume we have first run the Keras MNIST CNN example for only 2 epochs; so, the shape of our data is:
x_test.shape
# (10000, 28, 28, 1)
y_test.shape
# (10000, 10)
Given that, here is a way to get the individual loss per sample:
from keras import backend as K
y_pred = model.predict(x_test)
y_test = y_test.astype('float32') # necessary, as y_pred.dtype is 'float32'
y_test_tensor = K.constant(y_test)
y_pred_tensor = K.constant(y_pred)
g = K.categorical_crossentropy(target=y_test_tensor, output=y_pred_tensor)
ce = K.eval(g) # 'ce' for cross-entropy
ce
# array([1.1563368e-05, 2.0206178e-05, 5.4946734e-04, ..., 1.7662416e-04,
# 2.4232995e-03, 1.8954457e-05], dtype=float32)
ce.shape
# (10000,)
i.e. ce now contains what the score list in your question was supposed to contain.
For confirmation, let's calculate the loss for all test samples using model.evaluate:
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
# Test loss: 0.050856544668227435
and again manually, averaging the values of the ce we have just calculated:
import numpy as np
log_loss = np.sum(ce)/ce.shape[0]
log_loss
# 0.05085654296875
The two values, although not exactly equal (due to the different numeric precision involved in the two ways of calculation), are practically equal:
log_loss == score[0]
# False
np.isclose(log_loss, score[0])
# True
Now, adapting this to your own case, where the shape of x_test is (10000, 784), is arguably straightforward...
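For the iteration part of the question, a minimal NumPy-only sketch may help. It assumes the labels are one-hot encoded and that y_pred holds softmax-like probabilities; the random stand-in data, the helper name per_sample_cross_entropy and the clipping constant eps are illustrative assumptions, not part of the Keras API. zip pairs the two arrays row by row, which is what the nditer attempt was after, and the vectorised function gives the same kind of per-sample loss without any loop.

```python
import numpy as np

def per_sample_cross_entropy(y_true, y_pred, eps=1e-7):
    """Categorical cross-entropy computed separately for each row (sample)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0); eps is an assumed value
    return -np.sum(y_true * np.log(y_pred), axis=1)

# Stand-in data with the shapes from the question (values are made up).
rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 5, 784, 10
x_test = rng.random((n_samples, n_features))
y_test = np.eye(n_classes)[rng.integers(0, n_classes, n_samples)]  # one-hot labels
y_pred = rng.random((n_samples, n_classes))
y_pred /= y_pred.sum(axis=1, keepdims=True)  # each row sums to 1

# One loss value per sample, no Python loop needed.
ce = per_sample_cross_entropy(y_test, y_pred)

# If a loop is really needed, zip walks both arrays along the first axis;
# reshape(1, -1) turns each row into a batch of size one.
pairs = [(x.reshape(1, -1), y.reshape(1, -1)) for x, y in zip(x_test, y_test)]
```

Each element of pairs has shapes (1, 784) and (1, 10), which is the form a single-sample batch would need.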

You are mixing up training features and test labels. The training set has 60,000 samples and the test set has 10,000 samples (that is, your x_test should have dimension (10000,784)). Make sure you have downloaded all the correct data, and don't mix up training data with test data.
http://yann.lecun.com/exdb/mnist/

Related

Post hoc tests for permutation repeated measure anova

I have a two-way experimental design with two species and three treatments (n = 5 or 6 per group), and I measure the leaf area of the same individuals every three months (repeated measures).
Because my data have heterogeneous variances (tested with Levene's test) and the sample size is small, I decided to use a permutation test.
Here is my script. I used aovperm() to build the model and set Error(number/(month)) as the random factor (number is the ID of the individual).
area.aovp<-aovperm(p_area~species*treatment*month + Error(number/(month)), data = growth, type = "permutation", np = 2000, method = "Rd_kheradPajouh_renaud")
For the post hoc tests, I've tried lsmeans() and glht(), but neither of them works with the permutation ANOVA.
Is there another proper way to do such post hoc tests?

Import CSV file into python, then turn it into numpy array, then feed it to sklearn algorithm

Sklearn algorithms require features and a label to learn from.
I have a CSV file which contains some data. The data come from a challenge on the HackerEarth website in which participants need to create a learning algorithm that learns from data on a massive number of individuals in an affiliate network and their ad click performance, and then predicts the future performance of other individuals in the network, which allows the company to optimize its ad performance.
The features in the data include id, date, siteid, offerid, category, merchant, countrycode, type of browser, type of device and the number of clicks their ads have gotten.
https://www.hackerearth.com/practice/algorithms/string-algorithm/string-searching/practice-problems/machine-learning/predict-ad-clicks/
So my plan is to use the first 7 fields as my features and ad clicks as the label. Unfortunately, the countrycode, browser and device information is text (Google Chrome, Desktop) rather than integers that can be turned into an array.
Q1: Is there a way for sklearn to accept not just numpy arrays but also words as features? Am I supposed to use a vectorizer for this? If so, how would I do it? If not, can I just replace the text values with numbers (Google Chrome replaced by 1, Firefox replaced by 2) and still have it work? (I am using the Naive Bayes algorithm.)
Q2: Would the Naive Bayes algorithm be suitable for this task? Since this competition requires participants to create a program that predicts the probability that individuals in the affiliate network have their ads clicked, I assume Naive Bayes would be best suited.
Training data : https://drive.google.com/open?id=1vWdzm0uadoro3WcpWmJ0SVEebeaSsHvr
Testing data : https://drive.google.com/open?id=1M8gR1ZSpNEyVi5W19y0d_qR6EGUeGBQl
My messy coding and horrible attempt at this challenge which I don't think will be much help:
from sklearn.naive_bayes import GaussianNB
import csv
import pandas as pd
import numpy as np
data = []
from numpy import genfromtxt
import pandas as pd
data = genfromtxt('smaller.csv', delimiter=',')
dat = pd.read_csv('smaller.csv', delimiter=',')
print(dat(siteid))
feature = []
label =[]
i = 1
j = 1
while i <17:
    feature.append(data[i][2:8])
    i += 1
while j <17:
    label.append(data[i][9])
    j += 1
clf = GaussianNB()
clf.fit(feature,label)
print(clf.predict([data[18][2:8]]))
print(data[18])
Answer to Question 1: No. Scikit-learn estimators only work with numerical data, so you need to convert your text to numbers.
There are several ways to convert text to numbers. The first, as you said, is simply to assign a number to each value. But you need to take into account whether the text data have a natural order that matches the numbers assigned to them. If they do not, one-hot encoding is most often used. Please see the scikit-learn documentation below:
- http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
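To make the idea concrete, here is a small NumPy-only sketch of one-hot encoding for a single text column. The helper name one_hot and the sample browser values are made up for illustration; in practice scikit-learn's OneHotEncoder from sklearn.preprocessing does the same job and remembers the categories for the test set.

```python
import numpy as np

def one_hot(values):
    """Return the sorted unique categories and one indicator column per category."""
    categories, codes = np.unique(values, return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for category i.
    return categories, np.eye(len(categories), dtype=int)[codes]

# Hypothetical values for the browser column.
browsers = ["Google Chrome", "Firefox", "Google Chrome", "Edge"]
categories, encoded = one_hot(browsers)

print(categories.tolist())
# ['Edge', 'Firefox', 'Google Chrome']
print(encoded.tolist())
# [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Unlike the "Chrome = 1, Firefox = 2" scheme, no browser ends up numerically "larger" than another, so the model cannot read an order into the codes.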
Answer to Question 2: It depends on the data and task at hand.
No single algorithm is capable of handling every type of data optimally.
Most of the time we need to compare multiple algorithms and see which gives the best result for our data. See this example:
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py
Even with a single algorithm we need to try various parameter values and tune them for the maximum score. This is called grid search. See this example:
http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html#sphx-glr-auto-examples-model-selection-plot-grid-search-digits-py
Hope this clears your doubts. Make sure to go through the scikit-learn documentation and examples:
http://scikit-learn.org/stable/user_guide.html
http://scikit-learn.org/stable/auto_examples/index.html
They are among the best out there.

1D Number Array Clustering

So let's say I have an array like this:
[1,1,2,3,10,11,13,67,71]
Is there a convenient way to partition the array into something like this?
[[1,1,2,3],[10,11,13],[67,71]]
I looked through similar questions, yet most people suggested using k-means to cluster points, e.g. with scipy, which is quite confusing to use for a beginner like me. Also, I think that k-means is more suitable for clustering in two or more dimensions, right? Are there any ways to partition an array of N numbers into several partitions/clusters depending on the numbers?
Some people also suggest rigid range partitioning, but it doesn't always render the results as expected
Don't use multidimensional clustering algorithms for a one-dimensional problem. A single dimension is much more special than you naively think, because you can actually sort it, which makes things a lot easier.
In fact, it is usually not even called clustering, but e.g. segmentation or natural breaks optimization.
You might want to look at Jenks Natural Breaks Optimization and similar statistical methods. Kernel Density Estimation is also a good method to look at, with a strong statistical background. Local minima in density are good places to split the data into clusters, with statistical reasons to do so. KDE is maybe the most sound method for clustering 1-dimensional data.
With KDE, it again becomes obvious that 1-dimensional data is much better behaved. In 1D, you have local minima; but in 2D you may have saddle points and such "maybe" splitting points. See this Wikipedia illustration of a saddle point for how such a point may or may not be appropriate for splitting clusters.
See this answer for an example of how to do this in Python (green markers are the cluster modes; red markers are points where the data is cut; the y axis is a log-likelihood of the density):
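The linked example and its plot are not reproduced in this text. As a stand-in, here is a minimal NumPy sketch of the same idea: evaluate a Gaussian kernel density on a grid and cut the sorted data at the local minima. The function name kde_split, the grid size and the bandwidth of 3 are assumptions chosen for the array in the question; a different bandwidth will generally give different cuts.

```python
import numpy as np

def kde_split(values, bandwidth, grid_size=1000):
    """Split 1D data at the local minima of a Gaussian kernel density estimate."""
    data = np.sort(np.asarray(values, dtype=float))
    grid = np.linspace(data[0], data[-1], grid_size)
    # Unnormalised Gaussian KDE evaluated on the grid (one kernel per data point).
    kernels = np.exp(-0.5 * ((grid[:, None] - data[None, :]) / bandwidth) ** 2)
    density = kernels.sum(axis=1)
    # Interior grid points that are lower than both neighbours are local minima.
    is_min = (density[1:-1] < density[:-2]) & (density[1:-1] < density[2:])
    cuts = grid[1:-1][is_min]
    # Each value gets the index of the segment between consecutive cut points.
    labels = np.searchsorted(cuts, data)
    clusters = [data[labels == k].tolist() for k in range(len(cuts) + 1)]
    return clusters, cuts

clusters, cuts = kde_split([1, 1, 2, 3, 10, 11, 13, 67, 71], bandwidth=3.0)
print(clusters)
print(cuts)
```

The bandwidth plays the role of the smoothing scale: values much smaller than the gaps inside a group will split that group, and values larger than the gaps between groups will merge them.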
This simple algorithm works:
points = [0.1, 0.31, 0.32, 0.45, 0.35, 0.40, 0.5 ]
clusters = []
eps = 0.2
points_sorted = sorted(points)
curr_point = points_sorted[0]
curr_cluster = [curr_point]
for point in points_sorted[1:]:
    if point <= curr_point + eps:
        curr_cluster.append(point)
    else:
        clusters.append(curr_cluster)
        curr_cluster = [point]
    curr_point = point
clusters.append(curr_cluster)
print(clusters)
The above example clusters points into groups, such that each element in a group is at most eps away from another element in the group. This is like the clustering algorithm DBSCAN with eps=0.2, min_samples=1. As others noted, 1D data allows you to solve the problem directly, instead of using the bigger guns like DBSCAN.
The above algorithm is 10-100x faster for some small datasets with <1000 elements I tested.
You may look into discretization algorithms. The 1D discretization problem is very similar to what you are asking. They decide cut-off points according to frequency, binning strategy, etc.
Weka uses the following algorithms in its discretization process.
weka.filters.supervised.attribute.Discretize
uses either Fayyad & Irani's MDL method or Kononenko's MDL criterion
weka.filters.unsupervised.attribute.Discretize
uses simple binning
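As a rough illustration of what "simple binning" means, here is an equal-width binning sketch in NumPy (not Weka's implementation; the choice of 3 bins is an assumption). It also shows why rigid range partitioning can disappoint on the array from the question.

```python
import numpy as np

data = np.array([1, 1, 2, 3, 10, 11, 13, 67, 71])
n_bins = 3  # assumed number of bins

# Equal-width bin edges between the minimum and the maximum.
edges = np.linspace(data.min(), data.max(), n_bins + 1)
# Use only the interior edges so that the labels run from 0 to n_bins - 1.
labels = np.digitize(data, edges[1:-1])

print(labels.tolist())
# [0, 0, 0, 0, 0, 0, 0, 2, 2]
```

The groups 1..3 and 10..13 fall into the same bin and the middle bin stays empty, because the edges depend only on the range of the data and not on where the gaps are.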
CKwrap is a fast and straightforward k-means clustering function, though a bit light on documentation.
Example Usage
pip install ckwrap
import ckwrap
import numpy as np
nums = np.array([1,1,2,3,10,11,13,67,71])
km = ckwrap.ckmeans(nums,3)
print(km.labels)
# [0 0 0 0 1 1 1 2 2]
# group the values by their cluster label
buckets = [[],[],[]]
for i in range(len(nums)):
    buckets[km.labels[i]].append(nums[i])
print(buckets)
# [[1, 1, 2, 3], [10, 11, 13], [67, 71]]
I expect the authors intended you to make use of the ndarray functionality rather than create a list of lists.
Other measures:
km.centers
km.k
km.sizes
km.totss
km.betweenss
km.withinss
The underlying algorithm is based on this article.
Late response, just for the record: you can partition a 1D array using Ckmeans.1d.dp.
This method guarantees optimality and it is O(n^2), where n is the number of observations. The implementation is in C++ and there is a wrapper in R.
This is code for Has QUIT--Anony-Mousse's answer to "Clustering values by their proximity in python (machine learning?)":
When you have 1-dimensional data, sort it, and look for the largest gaps
I only added that the gaps need to be relatively large.
import numpy as np
from scipy.signal import argrelextrema
# lst = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]
lst = [1,1,2,3,10,11,13,67,71]
lst.sort()
diff = [lst[i] - lst[i-1] for i in range(1, len(lst))]
rel_diff = [diff[i]/lst[i] for i in range(len(diff))]
arg = argrelextrema(np.array(rel_diff), np.greater)[0]
last = 0
for x in arg:
    print(f'{last}:{x + 1} {lst[last:x + 1]}')
    last = x + 1
print(f'{last}: {lst[last:]}')
output:
0:2 [1, 1]
2:4 [2, 3]
4:7 [10, 11, 13]
7: [67, 71]

Is there a supervised learning algorithm that takes tags as input, and produces a probability as output?

Let's say I want to determine the probability that I will upvote a question on SO, based only on which tags are present or absent.
Let's also imagine that I have plenty of data about past questions that I did or did not upvote.
Is there a machine learning algorithm that could take this historical data, train on it, and then be able to predict my upvote probability for future questions? Note that it must be the probability, not just some arbitrary score.
Let's assume that there will be up to 7 tags associated with any given question, these being drawn from a superset of tens of thousands.
My hope is that it is able to make quite sophisticated connections between tags, rather than each tag simply contributing to the end result in a "linear" way (much as words do in a Bayesian spam filter).
So for example, it might be that the word "java" increases my upvote probability, except when it is present with "database"; however, "database" might increase my upvote probability when present with "ruby".
Oh, and it should be computationally reasonable (training within an hour or two on millions of questions).
What approaches should I research here?
Given that there probably aren't many tags per message, you could just create "n-gram" tags and apply naive Bayes. Regression trees would also produce an empirical probability at the leaf nodes, using +1 for upvote and 0 for no upvote. See http://www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf for some readable lecture notes and http://sites.google.com/site/rtranking/ for an open source implementation.
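One way to read the "n-gram" tags suggestion is to add every pair of co-occurring tags as an extra feature, so that a classifier such as naive Bayes can weight "database+java" differently from "database" and "java" on their own. The sketch below only builds the features; the function name tag_features and the "+" separator are illustrative choices.

```python
from itertools import combinations

def tag_features(tags, max_order=2):
    """Return single tags plus all tag combinations up to max_order as features."""
    tags = sorted(set(tags))  # sorting makes "a+b" and "b+a" the same feature
    features = []
    for order in range(1, max_order + 1):
        for combo in combinations(tags, order):
            features.append("+".join(combo))
    return features

print(tag_features(["java", "database"]))
# ['database', 'java', 'database+java']
```

With at most 7 tags per question this yields at most 7 single tags plus 21 pairs, so the per-question feature count stays small even though the tag vocabulary is huge.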
You can try several methods (linear regression, SVM, neural networks). The input vector should consist of all possible tags, where each tag represents one dimension.
Then each record in the training set has to be transformed into an input vector according to its tags. For example, let's say you have different combinations of 4 tags in your training set (php, ruby, ms, sql) and you define the input vector layout [php, ruby, ms, sql]. Let's say you have the following 3 records, which are transformed into weighted input vectors:
php, sql -> [1, 0, 0, 1]
ruby -> [0, 1, 0, 0]
ms, sql -> [0, 0, 1, 1]
If you use linear regression, you use the following formula
y = k * X
where y represents the answer (upvote/downvote) in your case and X stands for the known values (the weighted input vectors); fitting the model means solving for the weights k.
How to calculate the weights for linear regression you can read here, but the point is to create binary input vectors whose size is equal to the number of all tags (or larger, if you take some other variables into account) and then, for each record, to set the weight of each tag (1 if it is included, 0 otherwise).
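The transformation from tag lists to binary input vectors can be sketched in a few lines of plain Python. The helper name encode is hypothetical; the vocabulary order is the one used in the example above.

```python
def encode(tags, vocabulary):
    """Binary vector: 1 where the vocabulary tag is present in the record, else 0."""
    present = set(tags)
    return [1 if tag in present else 0 for tag in vocabulary]

vocabulary = ["php", "ruby", "ms", "sql"]  # one dimension per known tag
records = [["php", "sql"], ["ruby"], ["ms", "sql"]]

vectors = [encode(tags, vocabulary) for tags in records]
print(vectors)
# [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 1]]
```

With tens of thousands of possible tags and at most 7 present per question, these vectors are almost entirely zeros, so a sparse representation is usually preferable to dense lists at scale.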

Efficient comparison of 1 million vectors containing (float, integer) tuples

I am working in a chemistry/biology project. We are building a web-application for fast matching of the user's experimental data with predicted data in a reference database. The reference database will contain up to a million entries. The data for one entry is a list (vector) of tuples containing a float value between 0.0 and 20.0 and an integer value between 1 and 18. For instance (7.2394 , 2) , (7.4011, 1) , (9.9367, 3) , ... etc.
The user will enter a similar list of tuples and the web-app must then return the - let's say - top 50 best matching database entries.
One thing is crucial: the search algorithm must allow for discrepancies between the query data and the reference data because both can contain small errors in the float values (NOT in the integer values). (The query data can contain errors because it is derived from a real-life experiment and the reference data because it is the result of a prediction.)
Edit - Moved text to answer -
How can we get an efficient ranking of 1 query on 1 million records?
You should add a physicist to the project :-) Comparing functions is a very common problem; e.g. look here:
http://en.wikipedia.org/wiki/Autocorrelation
http://en.wikipedia.org/wiki/Correlation_function
In the first link you can read: "The SEQUEST algorithm for analyzing mass spectra makes use of autocorrelation in conjunction with cross-correlation to score the similarity of an observed spectrum to an idealized spectrum representing a peptide."
An efficient linear scan of 1 million records of that type should take a fraction of a second on a modern machine; a compiled loop should be able to do it at about memory bandwidth, which would transfer that in two or three milliseconds.
But, if you really need to optimise this, you could construct a hash table of the integer values, which would divide the job by the number of integer bins. And, if the data is stored sorted by the floats, that improves the locality of matching by those; you know you can stop once you're out of tolerance. Storing the offsets of each of a number of bins would give you a position to start.
I guess I don't see the need for a fancy algorithm yet... describe the problem a bit more, perhaps (you can assume a fairly high level of chemistry and physics knowledge if you like; I'm a physicist by training)?
Ok, given the extra info, I still see no need for anything better than a direct linear search, if there's only 1 million reference vectors and the algorithm is that simple. I just tried it, and even a pure Python implementation of linear scan took only around three seconds. It took several times longer to make up some random data to test with. This does somewhat depend on the rather lunatic level of optimisation in Python's sorting library, but that's the advantage of high level languages.
import random
# made-up reference data: one million (float, integer) tuples
r = [(random.uniform(0,20), random.randint(1,18)) for i in range(1000000)]
# this is a decorate-sort-undecorate pattern
# look for matches to (7,9)
# obviously, you can use whatever distance expression you want;
# taking abs() per component keeps the two differences from cancelling out
zz=[(abs(7-x)+abs(9-y),x,y) for x,y in r]
zz.sort()
# return the 50 best matches
best = [(x,y) for a,x,y in zz[:50]]
Can't you sort the tuples and perform a binary search on the sorted array?
I assume your database is built once and for all, and the positions of the entries are not important. You can sort this array so that the tuples are in a given order. When a tuple is entered by the user, you just look in the middle of the sorted array. If the query value is larger than the center value, you repeat the work on the upper half, otherwise on the lower one.
The worst case is O(log(n)).
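Because the floats may be slightly off, an exact binary search is not quite enough; searching for the two ends of a tolerance window is a small extension. The sketch below uses the standard bisect module on a sorted list of floats; the function name within_tolerance, the sample values and the tolerance of 0.2 are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right

def within_tolerance(sorted_values, query, tol):
    """Return all values v with query - tol <= v <= query + tol (input must be sorted)."""
    lo = bisect_left(sorted_values, query - tol)   # first index not below the window
    hi = bisect_right(sorted_values, query + tol)  # first index above the window
    return sorted_values[lo:hi]

reference = sorted([7.2394, 7.4011, 9.9367, 3.5, 12.25])
matches = within_tolerance(reference, 7.3, 0.2)
print(matches)
# [7.2394, 7.4011]
```

Each lookup costs two binary searches plus the size of the returned slice, so the work is proportional to the number of candidates inside the tolerance window rather than to the size of the reference list.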
If you can "map" your reference data to x-y coordinates on a plane there is a nifty technique which allows you to select all points under a given distance/tolerance (using Hilbert curves).
Here is a detailed example.
One approach we are trying ourselves, which allows for the discrepancies between query and reference, is binning the float values. We are testing and want to offer the user the choice of different bin sizes. Bin sizes will be 0.1, 0.2, 0.3 or 0.4. So binning leaves us with between 50 and 200 bins, each with a corresponding integer value between 0 and 18, where 0 means there was no value within that bin. The reference data can be pre-binned and stored in the database. We can then take the binned query data and compare it with the reference data. One approach could be, for all bins, to subtract the query integer value from the reference integer value. By summing up all differences we get the similarity score, with the most similar reference entries resulting in the lowest scores.
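A minimal sketch of this binning score, under a few stated assumptions: differences are summed as absolute values (otherwise positive and negative differences would cancel), a later peak overwrites an earlier one that lands in the same bin, and the function names bin_peaks and binned_score as well as the reference entries are made up for illustration.

```python
def bin_peaks(peaks, bin_size, max_value=20.0):
    """Map (float, int) tuples onto fixed-width bins; 0 means the bin is empty."""
    n_bins = int(round(max_value / bin_size))
    bins = [0] * n_bins
    for value, count in peaks:
        index = min(int(value / bin_size), n_bins - 1)  # clamp value == max_value
        bins[index] = count  # assumption: a later peak overwrites one in the same bin
    return bins

def binned_score(query_bins, reference_bins):
    """Sum of absolute per-bin differences; lower means more similar."""
    return sum(abs(q - r) for q, r in zip(query_bins, reference_bins))

query = bin_peaks([(7.2394, 2), (7.4011, 1), (9.9367, 3)], bin_size=0.2)
ref_a = bin_peaks([(7.25, 2), (7.41, 1), (9.95, 3)], bin_size=0.2)    # near-identical entry
ref_b = bin_peaks([(3.05, 2), (7.41, 1), (15.05, 3)], bin_size=0.2)   # mostly different entry

print(binned_score(query, ref_a), binned_score(query, ref_b))
# 0 10
```

One known weakness of fixed bins is that two floats that differ by a tiny amount can still fall on opposite sides of a bin edge, which is one reason to let the user choose among several bin sizes.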
Another (simpler) search option we want to offer is where the user only enters the float values. The integer values in both the query and the reference list can then be set to 1. We then use Hamming distance to compute the difference between the query and the reference binned values. I have previously asked about an efficient algorithm for that search.
This binning is only one way of achieving our goal. I am open to other suggestions. Perhaps we can use Principal Component Analysis (PCA), as described here

Resources