Solr experts, I'd really appreciate some advice on my problem.
I want to build a multi-dimensional space using Solr, let's say with 5 dimensions. In this space, there should be points, e.g.
P1 (0.3, 0.3, 0.3, 0.3, 0.3)
P2 (0.5, 0.5, 0.5, 0.5, 0.1)
P3 (0.5, 0.1, 0.1, 0.1, 0.1)
Now I'd like to find the point that is nearest to a given point, e.g.
Px (0.5, 0.5, 0.5, 0.5, 0.5)
I've tried to find reliable information about multi-dimensional spatial search, but I could not find anything helpful.
The Solr wiki has an article about Spatial Search, but it only covers 2 dimensions.
So my question is: Does Solr provide the functionality for a multi-dimensional spatial search?
You can use either principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE) to reduce your 5-dimensional space to a 2-dimensional representation, and then use Solr to find the nearest neighbors of any point in your dataset.
According to this question, t-SNE seems to be the more suitable option for your problem.
There is a Python t-SNE tutorial here, but I think the following is enough to solve your problem:
import numpy as np
from sklearn.manifold import TSNE

X = np.array([[0.3, 0.3, 0.3, 0.3, 0.3],
              [0.5, 0.5, 0.5, 0.5, 0.1],
              [0.5, 0.1, 0.1, 0.1, 0.1],
              [0.5, 0.5, 0.5, 0.5, 0.5]])

# Note: recent scikit-learn releases require perplexity < n_samples,
# so for a toy dataset this small you may also need e.g. perplexity=2.
reduced_points = TSNE(n_components=2, random_state=0, angle=.99, init='pca').fit_transform(X)

# Scale to integers so the coordinates can be indexed as plain numbers.
reduced_points = [[int(x[0] * 100), int(x[1] * 100)] for x in reduced_points]
And then you'll get your points in two-dimensional space:
>>> reduced_points
[[-21020, 2023], [-12745, -16097], [-2899, 10298], [5375, -7822]]
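From there, one way to wire this into Solr is a standard 2-D spatial field. The sketch below uses pysolr; the core name and the point_2d field are assumptions (point_2d would be a LatLonPointSpatialField), and the t-SNE coordinates must first be scaled into valid latitude/longitude ranges (here divided by 1000). Note the fourth reduced point corresponds to Px, since Px was the fourth row of X:

import pysolr

# Hypothetical core and field names; coordinates are the reduced
# points scaled into valid lat/lon ranges.
solr = pysolr.Solr('http://localhost:8983/solr/points', timeout=10)
solr.add([
    {'id': 'P1', 'point_2d': '-21.02,2.02'},
    {'id': 'P2', 'point_2d': '-12.74,-16.09'},
    {'id': 'P3', 'point_2d': '-2.89,10.29'},
])

# Nearest neighbor to Px: sort all documents by distance from Px's
# reduced coordinates and take the first hit.
nearest = solr.search('*:*', **{
    'sort': 'geodist(point_2d,5.37,-7.82) asc',
    'rows': 1,
})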
Multi-dimensional points aren't supported in Solr, but they are supported in Lucene since version 6.0 (block k-d tree "points"):
https://www.elastic.co/blog/lucene-points-6.0
For example, I want to generate a sample of 100 elements from the array a = [1, 2, 3, 4] with the probabilities p = [0.1, 0.1, 0.3, 0.5] associated with each element in a. In Python I can use np.random.choice(a=[1, 2, 3, 4], size=100, p=[0.1, 0.1, 0.3, 0.5]).
Does DolphinDB have a built-in function for this?
You can use a user-defined function:
def choice(v, n, p){
    // cumulative probabilities shifted to start at 0: [0, p1, p1+p2, ...]
    cump = removeTail!([0.0].join(cumsum(p\p.sum())), 1)
    // draw n uniforms in [0, 1) and map each one to its bucket with asof
    return v[cump.asof(rand(1.0, n))]
}
a=[1, 2, 3, 4]
n=100000
p=[0.1, 0.1, 0.3, 0.5]
r = choice(a, n, p)
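For comparison, here is the same inverse-CDF sampling idea sketched in Python/NumPy (not DolphinDB; the variable names mirror the example above):

import numpy as np

a = [1, 2, 3, 4]
p = [0.1, 0.1, 0.3, 0.5]

cump = np.cumsum(p)                       # [0.1, 0.2, 0.5, 1.0]
u = np.random.rand(100000)                # uniforms in [0, 1)
r = np.take(a, np.searchsorted(cump, u))  # first bucket whose cum-prob >= u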
Starting from version 1.30.19/2.00.7, you can use the built-in function randDiscrete directly:
randDiscrete(1 2 3 4, [0.1, 0.1, 0.3, 0.5], 100)
I have an array of sorted numbers:
arr = [-0.1, 0.0, 0.5, 0.8, 1.2]
I want the difference (dist below) between consecutive numbers for that array to be above a given threshold. For example, if threshold is 0.25:
dist = [0.1, 0.5, 0.3, 0.4] # must be >0.25 for all elements
arr[0] and arr[1] are too close to each other, so one of them must be modified. In this case the desired array would be:
good_array = [-0.25, 0.0, 0.5, 0.8, 1.2] # all elements distance > threshold
In order to obtain good_array, I want to modify the minimum number of elements in arr. So I subtract 0.15 from arr[0] rather than, say, subtracting 0.1 from arr[0] and adding 0.05 to arr[1]:
[-0.2, 0.05, 0.5, 0.8, 1.2]
The previous array is also valid, but there we have modified 2 elements rather than one.
Also, in case it is possible to generate good_array by modifying different elements of arr, by default modify the element closer to the edge of the array. But keep in mind the main goal is to generate good_array by modifying the minimum number of elements in arr.
[-0.1, 0.15, 0.5, 0.8, 1.2]
The previous array is also valid, but we have modified arr[1] rather than the element closer to the edge (arr[0]). In case 2 elements are at equal distance from the edges, modify the one closer to the beginning of the array:
[-0.3, 0.15, 0.2, 0.7] # modify arr[1] rather than arr[2]
So far I have been doing this manually for small arrays, but I would like a general solution for larger arrays.
Here is a brute force Python solution, where we try to fix elements to the right or elements to the left whenever there is a conflict:
def solve(arr, threshold):
    original = list(arr)

    def solve(idx):
        if idx + 1 >= len(arr):
            # Done: return [number of modified elements, resulting array].
            return [sum(1 for x in range(len(arr)) if arr[x] != original[x]), list(arr)]
        if arr[idx + 1] - arr[idx] < threshold:
            copy = list(arr)
            # Option 1: push the conflicting elements to the left.
            leftCost = 0
            while idx - leftCost >= 0 and arr[idx + 1] - arr[idx - leftCost] < threshold * (leftCost + 1):
                arr[idx - leftCost] = arr[idx - leftCost + 1] - threshold
                leftCost += 1
            left = solve(idx + 1)
            for cost in range(leftCost):
                arr[idx - cost] = copy[idx - cost]
            # Option 2: push the conflicting elements to the right.
            rightCost = 0
            while idx + rightCost + 1 < len(arr) and arr[idx + rightCost + 1] - arr[idx] < threshold * (rightCost + 1):
                arr[idx + rightCost + 1] = arr[idx + rightCost] + threshold
                rightCost += 1
            right = solve(idx + 1)
            for cost in range(rightCost):
                arr[idx + cost + 1] = copy[idx + cost + 1]
            # Prefer the cheaper option; on a tie, the one closer to an edge.
            if right[0] < left[0]:
                return right
            elif left[0] < right[0]:
                return left
            else:
                return left if idx - left[0] <= len(arr) - idx - right[0] else right
        else:
            return solve(idx + 1)

    return solve(0)

print(solve([0, 0.26, 0.63, 0.7, 1.2], 0.25))
Edit: I just realized that my original solution was stupid and overcomplicated. Now presenting a simpler and better solution.
First approach
If I understand your problem correctly, your input array can have some regions, where your condition is not met. For instance:
array = [0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.5, 0.75, 1.0] (the first five elements)
or:
array = [0.25, 0.5, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.25, 1.5, 1.75] (the run of 1.0s at arr[3] through arr[7])
To fix that, you have to add (or subtract) some pattern like:
fixup = [0.0, 0.25, 0.0, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0] (for the first case)
or:
fixup = [0.0, 0.0, 0.0, 0.0, 0.25, 0.0, 0.25, 0.0, 0.0, 0.0, 0.0] (for the second example)
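A minimal sketch of this first approach (function name hypothetical): find each maximal run where consecutive gaps are below the threshold and add the alternating 0/threshold pattern to it:

import numpy as np

def flat_region_fixup(arr, threshold):
    # Lift every other element inside each "bad" run by `threshold`.
    out = np.asarray(arr, dtype=float).copy()
    i = 0
    while i < len(out) - 1:
        j = i
        # Extend j to the end of a maximal run of too-small gaps.
        while j < len(out) - 1 and out[j + 1] - out[j] < threshold:
            j += 1
        if j > i:
            # Apply the pattern 0, threshold, 0, threshold, ... to the run.
            out[i + 1:j + 1:2] += threshold
        i = j + 1
    return out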
Second approach
But this solution still has a problem. Consider a bad area with an "elevation":
array = [0.0, 0.25, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.35, 1.6] (broken area is within values: 0.6-1.0)
In that case our correct "solution" will be:
fixup = [0.0, 0.0, 0.0, 0.25+0.1, 0.0, 0.25+0.1, 0.0, 0.25+0.1, 0.0, 0.0, 0.0]
which produces:
good_array = [0.0, 0.25, 0.5, 0.95, 0.7, 1.15, 0.9, 1.35, 1.1, 1.35, 1.6]
So to summarize, you have to apply the "patch":
fixup[i] = threshold + max(difference[i], difference[i-1]) (for i where i - start_index is even)
(please note that it will be -threshold + min(difference[i], difference[i-1]) for negative values)
and:
fixup[i] = 0 (for i where i - start_index is odd)
where start_index is the beginning of the bad region and difference[i] = array[i+1] - array[i].
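Sketched in Python (names hypothetical; start and end delimit the bad region inclusively, and the sketch assumes the region does not touch the array edges, which is covered under Corner cases below):

def second_approach_fixup(arr, start, end, threshold):
    # Consecutive differences: difference[i] = arr[i+1] - arr[i].
    d = [b - a for a, b in zip(arr, arr[1:])]
    out = list(arr)
    for i in range(start, end + 1):
        if (i - start) % 2 == 0:
            # Bump by threshold plus the larger neighbouring gap.
            out[i] += threshold + max(d[i], d[i - 1])
    return out

# e.g. for the "elevation" example above, with start=3 and end=7:
# second_approach_fixup([0.0, 0.25, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.35, 1.6], 3, 7, 0.25)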
Third approach
The previously mentioned formula doesn't work well in some cases (like [0.1, 0.3, 0.4], where it would increase 0.3 up to 0.75 when 0.65 is sufficient).
Let's try to improve that:
good_array[i] = max(threshold + array[i-1], threshold + array[i+1]) (for abs(array[i-1] - array[i+1]) < threshold*2)
and:
good_array[i] = (array[i-1] + array[i+1]) / 2 otherwise.
(you can also choose the formula good_array[i] = min(-threshold + array[i-1], -threshold + array[i+1]) when it would produce a result closer to the original value, if minimizing the difference is also one of your optimization goals)
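As a sketch (function name hypothetical), the third-approach formula for a single misplaced element between two untouched neighbours:

def adjust_middle(prev, nxt, threshold):
    # prev = array[i-1], nxt = array[i+1]; returns the new array[i].
    if abs(prev - nxt) < 2 * threshold:
        # Neighbours too close together: push array[i] above both of them.
        return threshold + max(prev, nxt)
    # Neighbours far enough apart: the midpoint clears both by >= threshold.
    return (prev + nxt) / 2

print(adjust_middle(0.1, 0.4, 0.25))  # 0.65, instead of the 0.75 above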
Fourth approach
Bad regions of even length are also a threat. I can think of a couple of ways to solve it:
A solution based on a pattern like [0.0, 0.25, 0.5, 0.0]
Or one based on a pattern like [0.0, 0.25, -0.25, 0.0] (simply using "the second formula")
Or [0.0, 0.25, 0.0, 0.25] (just including an additional element to make the bad region length odd; I don't recommend this approach, as it requires handling a lot of corner cases)
Corner cases
Please also consider some corner cases (a bad region that starts or ends at an "edge" of the array). Since the array is sorted ascending, the first element has to move down and the last one up:
good_array[0] = -threshold + array[1]
and:
good_array[array_size-1] = threshold + array[array_size-2]
Final hints
I would suggest implementing a lot of unit tests along the way, in order to easily verify the correctness of the derived formulas and to handle combinations of corner cases. Bad areas that consist of only one element can be one of them.
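For instance, every test can assert one simple invariant (note that the question's own good_array contains a gap of exactly 0.25, so a gap equal to the threshold is treated as acceptable here):

def is_good(arr, threshold):
    # All consecutive gaps must be at least the threshold.
    return all(b - a >= threshold for a, b in zip(arr, arr[1:]))

assert is_good([-0.25, 0.0, 0.5, 0.8, 1.2], 0.25)
assert not is_good([-0.1, 0.0, 0.5, 0.8, 1.2], 0.25)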
I want to get arrays of floats from the A, B, C lists.
import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.arso.gov.si/potresi/obvestila%20o%20potresih/aip/")
soup = BeautifulSoup(page.content, 'html.parser')
all_tables = soup.find_all('table')
right_table = soup.find('table', class_='online')

A = []
B = []
C = []
for row in right_table.findAll("tr"):
    cells = row.findAll('td')
    if len(cells) == 6:
        A.append(cells[1].find(text=True))
        B.append(cells[2].find(text=True))
        C.append(cells[3].find(text=True))
For now I have variables like this:
A=[u'45.50',u'46.00',...]
and I want just floats from list:
A=[45.50,46.00,...]
Just convert the element's text to float type:
...
    if len(cells) == 6:
        A.append(float(cells[1].text))
        B.append(float(cells[2].text))
        C.append(float(cells[3].text))

print(A)
print(B)
print(C)
The output:
[45.5, 46.0, 46.07, 45.89, 45.83, 46.1, 46.53, 45.88, 45.84, 45.9, 46.09, 46.39, 45.3, 45.34, 46.7, 45.25, 46.39, 45.5, 46.39]
[14.41, 14.76, 14.22, 14.59, 15.12, 14.42, 14.57, 15.19, 15.18, 14.57, 14.19, 13.39, 14.62, 14.59, 15.23, 14.58, 15.03, 14.4, 15.03]
[1.2, 1.2, 1.0, 0.8, 1.2, 1.0, 1.1, 1.3, 0.8, 0.9, 0.5, 1.0, 1.3, 2.3, 1.4, 1.9, 0.7, 0.8, 0.4]
You could use Python 2.7's map function to convert each list of strings to a list of floats:
A = map(float, A)
B = map(float, B)
C = map(float, C)
print A # [45.5, 46.0, 46.07, 45.89, 45.83, 46.1, 46.53, 45.88, 45.84, 45.9, 46.09, 46.39, 45.3, 45.34, 46.7, 45.25, 46.39, 45.5, 46.39]
print B # [14.41, 14.76, 14.22, 14.59, 15.12, 14.42, 14.57, 15.19, 15.18, 14.57, 14.19, 13.39, 14.62, 14.59, 15.23, 14.58, 15.03, 14.4, 15.03]
print C # [1.2, 1.2, 1.0, 0.8, 1.2, 1.0, 1.1, 1.3, 0.8, 0.9, 0.5, 1.0, 1.3, 2.3, 1.4, 1.9, 0.7, 0.8, 0.4]
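In Python 3, map returns a lazy iterator rather than a list, so wrap each call in list() to get the same output:

A = list(map(float, A))
B = list(map(float, B))
C = list(map(float, C))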
I'm trying to index some matrix, y, and then reindex that result with some boolean statement and set the corresponding elements in y to 0. The dummy code I'm using to test this indexing scheme is shown below.
import numpy as np

x = np.zeros([5, 4]) + 0.1
y = x
print(x)
m = np.array([0, 2, 3])
y[0:4, m][y[0:4, m] < 0.5] = 0
print(y)
I'm not sure why it does not work. The output I want:
[[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]]
[[ 0. 0.1 0. 0. ]
[ 0. 0.1 0. 0. ]
[ 0. 0.1 0. 0. ]
[ 0. 0.1 0. 0. ]
[ 0.1 0.1 0.1 0.1]]
But what I actually get:
[[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]]
[[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]]
I'm sure I'm missing some under-the-hood details that explains why this does not work. Interestingly, if you replace m with :, then the assignment works. For some reason, selecting a subset of the columns does not let me assign the zeros.
If someone could explain what's going on and help me find an alternative solution (hopefully one that does not involve generating a temporary numpy array since my actual y will be really huge), I would really appreciate it! Thank you!
EDIT:
y[0:4, :][y[0:4, :] < 0.5] = 0
y[0:4, 0:3][y[0:4, 0:3] < 0.5] = 0
etc.
all work as expected. It seems the issue is when you index with a list of some kind.
Make an array (this is one of my favorites because the values differ):
In [845]: x=np.arange(12).reshape(3,4)
In [846]: x
Out[846]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [847]: m=np.array([0,2,3])
In [848]: x[:,m]
Out[848]:
array([[ 0, 2, 3],
[ 4, 6, 7],
[ 8, 10, 11]])
In [849]: x[:,m][:2,:]=0
In [850]: x
Out[850]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
No change. But if I do the indexing in one step, it changes.
In [851]: x[:2,m]=0
In [852]: x
Out[852]:
array([[ 0, 1, 0, 0],
[ 0, 5, 0, 0],
[ 8, 9, 10, 11]])
It also works if I reverse the order:
In [853]: x[:2,:][:,m]=10
In [854]: x
Out[854]:
array([[10, 1, 10, 10],
[10, 5, 10, 10],
[ 8, 9, 10, 11]])
x[i,j] is executed as x.__getitem__((i,j)). x[i,j]=v as x.__setitem__((i,j),v).
x[i,j][k,l]=v is x.__getitem__((i,j)).__setitem__((k,l),v).
The set applies to the value produced by the get. If the get returns a view, then the change affects x. But if it produces a copy, the change does not affect x.
With the array m, y[0:4,m] produces a copy (do I need to demonstrate that?). y[0:4,:] produces a view.
So in short: if the first indexing produces a view, the second indexed assignment works. But if it produces a copy, the second has no effect.
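A sketch of a one-step fix for the question's example: let the advanced indexing make its copy, mask the copy, then write the whole block back with a single indexed assignment:

import numpy as np

x = np.zeros([5, 4]) + 0.1
y = x
m = np.array([0, 2, 3])

block = y[0:4, m]        # advanced indexing returns a copy
block[block < 0.5] = 0   # modify the copy in place
y[0:4, m] = block        # one __setitem__ writes it back into y
print(y)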
I am working in Python. I have an angle quantity for which I want a varying step size, instead of the uniform grid created by np.linspace(0, pi, 100) for 100 equal steps. Specifically, I want more 'resolution' (i.e. a smaller step size) for values close to 0 and pi, with larger step sizes closer to pi/2 radians. Is there a simple way to implement this in Python using a technique already provided in numpy or otherwise?
Here's how to use np.r_ to construct an array with closer spacing at the ends and wider spacing in the middle:
In [578]: x=np.r_[0:.09:10j, .1:.9:11j, .91:1:10j]
In [579]: x
Out[579]:
array([ 0. , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08,
0.09, 0.1 , 0.18, 0.26, 0.34, 0.42, 0.5 , 0.58, 0.66,
0.74, 0.82, 0.9 , 0.91, 0.92, 0.93, 0.94, 0.95, 0.96,
0.97, 0.98, 0.99, 1. ])
then scale x with np.pi.
This is the kind of thing np.r_ was created for. Not that it's doing anything special; it's doing the same as:
np.concatenate([np.linspace(0,.09,10),
np.linspace(.1,.9,11),
np.linspace(.91,1,10)])
For a smoother gradation in spacing, I'd try mapping a single linspace with a curve.
In [606]: x=np.arctan(np.linspace(-10,10,10))
In [607]: x -= x[0]
In [608]: x /= x[-1]
In [609]: x
Out[609]:
array([ 0. , 0.00958491, 0.02665448, 0.06518406, 0.21519086,
0.78480914, 0.93481594, 0.97334552, 0.99041509, 1. ])