The documentation clearly states that once a merge is accepted, it is durable and will be indexed shortly.
What's not clear is the atomicity of the indexing. I am assuming that all merged fields for a specific document are updated atomically (all or nothing) but just thought I'd confirm.
So assume we create the index with three fields a, b, and c, and a document with key="k1" is already indexed with the following values: { a:1, b:1, c:1 }
After the merge { b:2, c:2 } is submitted for k1, subsequent queries for k1 will either return:
{ a:1, b:1, c:1 }
or eventually
{ a:1, b:2, c:2 }
but never
{ a:1, b:2, c:1 }
Is that a correct assumption?
Yes, the assumption is correct: for a given merge request, all fields within a single document will be updated atomically.
For completeness, note that updates across different entries in an indexing batch (targeted at different documents or even referencing the same document multiple times) are not guaranteed to be atomic.
Hi, I have a dictionary of this type:
1:[[12.342,34.234],[....,...],....]
2:[[......],[....]]....
Now I'd like to know whether there are functions to delete a specific key and its corresponding value, and a function to re-index the rest. For example, if I delete the value corresponding to key 2, key 3 should become key 2, and so on.
I think you need to use an Array, not a Dictionary
var elms: [[[Double]]] = [
[[0.1],[0.2, 0.3]],
[[0.4], [0.5]],
[[0.6]],
]
elms.remove(at: 1) // remove the second element
print(elms)
[
[[0.10000000000000001], [0.20000000000000001, 0.29999999999999999]],
[[0.59999999999999998]]
]
Yes, the printed values are slightly different from the original literals; that is just how Double represents those numbers.
To delete a key and value, just do dict[key] = nil.
As for re-indexing, dictionary keys are not in any particular order, and shifting all the values over to different keys isn't how a dictionary is designed to work. If this is important for you, maybe you should use something like a pair of arrays instead.
Dictionaries are not ordered. This means that "key 3" becoming "key 2" is not a supported scenario. If keys have been in the same order that you've inserted them, you've been lucky so far, as this is absolutely not guaranteed.
If you want ordering and your list of key/value pairs is small (a hundred or so is small), you should consider using an array of tuples: [(Key, Value)]. This has guaranteed ordering. If you need something bigger than that, or faster key lookup, you should find a way to define an ordering relationship between keys (such that you can say that one key should always come after another), and use a sorted collection.
I have been trying to re-work some AcoustID code, and I am trying to figure out if ElasticSearch has a way to fuzzy-match arrays of ints. Specifically, say I have my search array: {1,2,3} and in ES, I store my reference docs:
1: {3,4,5,6}
2: {1,1,1,2,4,3,4}
3: {6,7,8}
4: {1,1,2,3}
I would want to get 4 back as the best match (it contains 1,2,3 exactly), then 2 (it contains my search but has an extra int in there), then 1 (it has a 3), but NOT 3.
The current AcoustID code does this in Postgres with some custom C code -- if it's helpful for context, it can be found here: https://bitbucket.org/acoustid/pg_acoustid/src/4085807d755cd4776c78ba47f435cfb4b7d6b32c/acoustid_compare.c?fileviewer=file-view-default#acoustid_compare.c-122
I actually intend to have ~100GB of these arrays indexed, with each containing ~100 ints. Can ES handle this kind of work, and provide a reasonable level of performance?
Did you try to use the Terms query of Elasticsearch? This query will give you back the required documents.
However, using it on an integer array will give you a different sorting, as fieldNorms are not stored along with integer fields.
To get your desired sorting, it would be enough to store your integer array as a string array and query it with the terms query:
"query": {
"terms": {
"ids": [
"1",
"2",
"3"
]
}
}
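For example, reusing the field name ids from the query above (your actual field name may of course differ), document 4 from the question would simply be indexed with its fingerprint as an array of strings:

{
  "ids": ["1", "1", "2", "3"]
}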
Recently I came across a question that I had to code but failed to solve effectively. I'll try to explain it as best I can, and it goes like this:
There are different people belonging to different communities. Say, for example, 1 belongs to C1, 2 belongs to C2, and 3 belongs to C3. We can perform two operations, Query and Join. Query returns the total number of people belonging to the queried person's community, and Join combines the communities of exactly two persons into one.
We are taking the number of people and the number of operations to be performed as an input and we need to produce the result onto the standard output.
Example Case: (Q -> Query and J -> Join)
3 // No. of People
6 // No. of Operations
Q 1 // Prints 1
J 1,2 // Joins communities of 1 and 2
Q 1 // Prints 2
J 2,3 // Joins communities of 2 and 3
Q 3 // Prints 3
Q 1 // Prints 3
So essentially, it's like each person starts out in their own individual bubble, and on a join we merge the bubbles of two people into one larger bubble containing both of them.
There are different ways to solve this problem. Using the ArrayList methods of Java, it's pretty easy. I was trying to solve it using arrays.
My approach was to form an array for each person initially and as we join two communities, the respective arrays are added with the people as described :
Arr1 : 1 // Array for Person 1 ; Size 1
Arr2 : 2 // Array for Person 2 ; Size 1
J 1,2 results in,
Arr1 : 1,2 // Size 2
Arr2 : 2,1 // Size 2
But I was told that this is not an effective approach; an effective approach would be to make use of a linked list. I was unable to solve it with a linked list, so I would like some input from you on the approach: how exactly do I make use of a linked list to keep track of Join operations?
Sorry for the long post, and thanks in advance :)
P.S.: I am not sure if the title is appropriate; kindly suggest a proper title in case it's not.
I do not know why somebody felt that linked lists would be better than arrays in this instance, but joining the communities is as simple as adding all of the members of one to the other, and changing all the pointers that point to the now empty community to point to the node containing all of the members. Query should be simple enough that it goes without saying.
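One common way to make both operations cheap, going a step beyond plain arrays or linked lists, is a disjoint-set (union-find) structure: each person points toward a representative of their community, and each representative stores the community size. Here is a minimal Java sketch (the class and method names are mine, not from the question):

// Minimal disjoint-set (union-find) sketch for the community problem.
// Persons are numbered 1..n; each community is identified by one "root" person.
public class Communities {
    private final int[] parent; // parent[i] points toward the representative of i's community
    private final int[] size;   // size[root] = number of people in root's community

    public Communities(int n) {
        parent = new int[n + 1];
        size = new int[n + 1];
        for (int i = 1; i <= n; i++) {
            parent[i] = i; // everyone starts in their own community
            size[i] = 1;
        }
    }

    // Find the representative of p's community (with path compression).
    private int find(int p) {
        while (parent[p] != p) {
            parent[p] = parent[parent[p]]; // shortcut pointers as we walk up
            p = parent[p];
        }
        return p;
    }

    // J a,b : merge the communities of a and b.
    public void join(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;                                     // already the same community
        if (size[ra] < size[rb]) { int t = ra; ra = rb; rb = t; } // attach smaller tree under larger
        parent[rb] = ra;
        size[ra] += size[rb];
    }

    // Q p : size of p's community.
    public int query(int p) {
        return size[find(p)];
    }

    public static void main(String[] args) {
        Communities c = new Communities(3);
        System.out.println(c.query(1)); // Prints 1
        c.join(1, 2);
        System.out.println(c.query(1)); // Prints 2
        c.join(2, 3);
        System.out.println(c.query(3)); // Prints 3
        System.out.println(c.query(1)); // Prints 3
    }
}

Join then only re-points one representative instead of copying every member, so a long sequence of J operations stays cheap.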
I am a complete noob to VB.Net, and have had my share of growing pains. I'm starting to get a handle on what I need to do, though.
The program I am writing needs to take about 500 .csv files, suck the info out of them line by line, store the data in about four different arrays, then export the data into one long index.
Each line in the files starts with a code word and contains between 5 and 20 fields of data. The code word determines how many fields there are and how the data needs to be stored. If it's Code A, it needs to go into Array A. If it's Code B, it needs to go into Array B and set some variables for Arrays A, B, C, and D. Code C means it goes into Array C. And so on.
My problem is that I will not know how many lines of data there will be, so I can't just size a set of standard arrays up front. I've got the code figured out so that each line of data is channeled into the correct sub. But I am unsure how to STORE the data.
I will need to manipulate/sort the data in Array C, but will be able to just dump data into, and pull it back out of, Arrays A, B, and D.
Should I use 2D arrays for all the indexes? Would collections work better? If so, which kind of collection would work better?
//Array A = 4 columns per row, unknown number (500) of rows
//Array B = 18 columns per row, unknown number (10,000+) of rows
//Array C = 3 columns per row, unknown number (2000) of rows, must be able to sort and alter
//Array D = 3 columns per row, unknown number (1000) of rows
Thank you
In a nutshell:
Should I use 2D arrays for all the indexes?
No.
Would collections work better?
Yes, much better.
If so, which kind of collection would work better?
Generic lists (List(Of T)), where you define objects (classes) with fields that match the columns for each type of record in your csv data, and use those classes as the types for your lists.
For ArrayB, beware the Large Object Heap causing problems with OutOfMemoryExceptions. You may need to keep that mostly on disk.
Say that I have a field in my Solr schema that either has the value 1, 2, 3 or 4. I do no arithmetic on this field. The field is a status of the record. It could just as easily be A, B, C or D. Each of the 11,000,000 records has one of these statuses.
In this question an answer says that ints are "more memory-efficient", so that's a start. Are there other factors to consider? Does one match faster than the other?
This field is not going to be sorted. The values are arbitrary, and we'll never do a sort. It's only going to be used in filter queries.
Will you ever query on a range? If your 1...4 is really marking statuses of, say, Bad to Great, would you ever query on records from 1-2? That is about the only case where you may need them to be ints (and, since you only have 4 values, it's not that big of a deal).
My rule in data storage is that if the int will never be used as an int, store it as a string. It may require more space, but you can do more string manipulation, and whether that one field is a string or an int won't matter much to the memory footprint of 11M records (11M is a lot of records, but not a heavy load for Solr/Lucene).
With only 4 distinct values, int and String will perform very similarly for filter queries, sorting and even range queries.
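Either way, the filter query itself looks the same. Assuming the field is called status (the question never names it), the two options would be filtered like:

fq=status:2      (status defined as an int field)
fq=status:"2"    (status defined as a string field)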