I have been reworking some AcoustID code, and I am trying to figure out whether Elasticsearch has a way to fuzzy-match arrays of ints. Specifically, say I have my search array {1,2,3}, and in ES I store my reference docs:
1: {3,4,5,6}
2: {1,1,1,2,4,3,4}
3: {6,7,8}
4: {1,1,2,3}
I would want to get 4 back as the best match (it contains 1,2,3 exactly), then 2 (it contains my search but has an extra int in there), then 1 (it has a 3), but NOT 3.
The current AcoustID code does this in Postgres with some custom C code; if it's helpful for context, it can be found here: https://bitbucket.org/acoustid/pg_acoustid/src/4085807d755cd4776c78ba47f435cfb4b7d6b32c/acoustid_compare.c?fileviewer=file-view-default#acoustid_compare.c-122
I actually intend to have ~100GB of these arrays indexed, each containing ~100 ints. Can ES handle this kind of work and provide a reasonable level of performance?
Did you try the terms query in Elasticsearch? This query will give you back the required documents.
However, using it on an integer array will give you a different sorting, as the field norms are not stored along with integer fields.
To get your desired sorting, it would be enough to store your integer array as a string array and query with the terms query:
"query": {
"terms": {
"ids": [
"1",
"2",
"3"
]
}
}
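If it helps, here is a minimal sketch of that approach using the elasticsearch-py client (7.x-style calls; the fingerprints index name and the text mapping for ids are my assumptions, and scoring details vary across ES versions):

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local node

# Map the fingerprint as a text field so field norms are stored;
# integer fields do not carry norms.
es.indices.create(index="fingerprints", body={
    "mappings": {"properties": {"ids": {"type": "text"}}}
})

# Index the reference docs, converting each int to a string.
docs = {1: [3, 4, 5, 6], 2: [1, 1, 1, 2, 4, 3, 4], 3: [6, 7, 8], 4: [1, 1, 2, 3]}
for doc_id, ints in docs.items():
    es.index(index="fingerprints", id=doc_id, body={"ids": [str(i) for i in ints]})
es.indices.refresh(index="fingerprints")

# Run the terms query and inspect the scores.
res = es.search(index="fingerprints", body={"query": {"terms": {"ids": ["1", "2", "3"]}}})
for hit in res["hits"]["hits"]:
    print(hit["_id"], hit["_score"])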
I have an array of data, and all values are in string format due to some prior JSON operations in the flow. I need to sort the array based on one field whose values are numbers (in string format). I'm using the sort function, and it currently sorts the array, but it treats the values as strings and sorts them 'alphabetically' (e.g. "20", "290", "3", "300", "31" instead of the desired "3", "20", "31", "290", "300").
I've tried using int() and float() to convert the field to an integer. I'm wrapping the function around the mapped item in a Select action, int(item()?['Points_Total']), with no success.
Both functions give the same error:
The template language function ('int' or 'float') was invoked with a
parameter that is not valid. The value cannot be converted to the
target type.
The data in the field are whole numbers, both positive and negative. I'd think that both functions could handle the values I'm working with.
Here is how I'm trying to convert the field Points_Total:
(Screenshot: the conversion error and a view of the data field.)
Any ideas from the community?
Thank you!
I suspect you have a blank value in there somewhere. You'll need to check for that prior to converting to a numeric value. Your expression should be changed to something like this ...
if(isInt(item()?['Points_Total']), int(item()?['Points_Total']), 0)
... my suggestion would be to set it to zero in the cases where it's not a number; however, if that skews your results, you can choose what to do instead.
Also, I've used isInt given that you said the numbers are whole.
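Outside of Power Automate, the same guard-then-convert idea is easy to see in a short Python illustration (this is just the pattern, not flow code):

# Sort rows by a numeric field stored as strings, treating blanks as 0.
def to_int(value, default=0):
    try:
        return int(value)
    except (TypeError, ValueError):
        return default

rows = [{"Points_Total": "20"}, {"Points_Total": "3"},
        {"Points_Total": ""}, {"Points_Total": "-290"}]
rows.sort(key=lambda r: to_int(r["Points_Total"]))
print([r["Points_Total"] for r in rows])  # ['-290', '', '3', '20']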
I'm trying to get the size of an array so that I can use the number it returns in a for loop. I'm using MongoDB Compass. I'm trying to use something like the projection below, where 0 is the index of an object inside path and "here" is an array with 2 items.
$project
{
  "alias": { $size: "$this.is.the.field.path.0.here" }
}
However, this keeps returning an array size of 0. It works fine for field paths that don't contain a number, but returns 0 if the path does contain a number. Is there a way to properly get the correct size of the here[] array, which has a size of 2?
Using dot notation to specify an array index works in the query language, but not in aggregation.
For aggregation, use the $arrayElemAt operator.
Also note from the BSON specification:
Array - The document for an array is a normal BSON document with integer values for the keys, starting with 0 and continuing sequentially. For example, the array ['red', 'blue'] would be encoded as the document {'0': 'red', '1': 'blue'}. The keys must be in ascending numerical order.
If you have a document that has numeric keys starting at 0, it will be treated as an array.
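For example, here is a sketch of the corrected pipeline via pymongo (the connection, database, and collection names are placeholders):

from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycoll"]  # hypothetical names

# "$this.is.the.field.path.here" resolves to an array holding each
# element's "here" value; $arrayElemAt picks the value at index 0,
# and $size then counts its items (2 in the example above).
pipeline = [
    {"$project": {
        "alias": {"$size": {"$arrayElemAt": ["$this.is.the.field.path.here", 0]}}
    }}
]
for doc in coll.aggregate(pipeline):
    print(doc)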
In one of my NDB models, I have one field whose type is ndb.JsonProperty(repeated=True, indexed=False).
This field is an array that in the average case has about 800 elements which look like {ac: "i", e: [0, 3], ls: ["a"], s: [0, 2], sn: 9, ts: "2018-06-25T22:35:04.855Z"}. These represent edits in a text editor.
In some parts of my application just a summary of these elements is needed. For example: how many elements have the ac property equal to 'x' or how many seconds elapsed between the first and the last element of the array.
My problem is that sometimes I need to get several models, each of them with an array containing about 800 of the elements described previously. So, let's say I need to process 40 models * 800 elements = 32,000 elements; this already causes my application to use about 150MB of memory, when GAE allows 128MB to be used.
I tried an optimization: I summarize the data once and keep it stored in a property, because at some point the array doesn't change anymore. This lets me use less memory by not accessing the array. But the summary data may become stale when the array is updated; this can be verified by getting the timestamp of the last element in the array and comparing it to the summary timestamp: if it's more recent, the summary is stale.
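A minimal sketch of that staleness check (elements and summary_ts are placeholders for my actual property names):

def summary_is_stale(entity):
    # The summary is stale if an edit was appended after the summary
    # was last computed; ISO-8601 timestamps compare lexicographically.
    if not entity.elements:
        return False
    return entity.elements[-1]["ts"] > entity.summary_ts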
My question is: in Python 2.7, and more specifically when using NDB, will the whole array be retrieved when I access the last element, as in elements[-1]? Or is there a way to lazily load the list from the last element to the first? Or is there a way to optimize the data so that I don't use so much memory when I need to process it?
Introduction
My collection has more than 1 million documents. Each document's structure is identical and looks like this:
{_id: "LiTC4psuoLWokMPmY", number: "12345", letter: "A", extra: [{eid:"jAHBSzCeK4SS9bShT", value: "Some text"}]}
So, as you can see, my extra field is an array that contains small objects. I'm trying to insert as many of these objects as possible (until I get close to the 16MB document limit). These objects are usually present in the extra array of most documents in the collection, so I usually have hundreds of thousands of the same objects.
I have an index on the eid key in the extra array. I created this index using:
db.collectionName.createIndex({"extra.eid":1})
Problem
I want to count how many documents contain a given extra object. I'm doing it like this:
db.collectionName.find({extra: {eid: "jAHBSzCeK4SS9bShT"}}).count()
In the beginning, the query above is very fast. But whenever the extra array gets a little bit bigger (more than 20 objects), it gets really slow.
With 3-4 objects, it takes less than 100 milliseconds, but when the array gets bigger it takes a lot more time. With 50 objects, it takes 6238 milliseconds.
Questions
Why is this happening?
How can I make this process faster?
Is there any other way that does this process but faster?
I ran into a similar problem. I bet your query isn't hitting your index.
You can do an explain (run db.collectionName.find({extra: {eid: "jAHBSzCeK4SS9bShT"}}).explain() in the Mongo shell) to know for sure.
The reason is that in Mongo db.collectionName.find({extra: {eid: "jAHBSzCeK4SS9bShT"}}) is not the same as db.collectionName.find({"extra.eid": "jAHBSzCeK4SS9bShT"}). The first form asks for an array element that is exactly equal to the subdocument {eid: "jAHBSzCeK4SS9bShT"}, so it can't be served by your index on extra.eid. The second form uses dot notation to match the eid field of any element of the array, and that is the form your multikey index can serve. Note that the two forms also differ semantically: the exact-match form only matches elements that contain nothing but the eid field.
I didn't find any solution except for indexing the entire subdocument.
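Here is a sketch of the difference with pymongo (the connection and database name are placeholders):

from pymongo import MongoClient

coll = MongoClient()["mydb"]["collectionName"]  # hypothetical connection

# Exact-element match: only matches array elements equal to the whole
# subdocument {eid: ...}; it cannot be served by the {"extra.eid": 1} index.
exact = coll.count_documents({"extra": {"eid": "jAHBSzCeK4SS9bShT"}})

# Dot-notation match: matches any element whose eid field has that value,
# and can use the multikey index on extra.eid.
dotted = coll.count_documents({"extra.eid": "jAHBSzCeK4SS9bShT"})

# Compare plans to confirm which query hits the index.
plan = coll.find({"extra.eid": "jAHBSzCeK4SS9bShT"}).explain()
print(plan["queryPlanner"]["winningPlan"])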
I believe that an array as a data structure is an organized set of items, and that by definition in JSON it is an ordered set of key:value pairs. I tried to test this out with a simple example.
{
  "employees": [
    {
      "Srno": 1,
      "EmpID": 123,
      "Name": "John Doe"
    },
    {
      "Srno": 2,
      "Name": "James Mars"
    }
  ]
}
The idea was for every element in the employees array to have three properties, viz. Srno, EmpID and Name.
However, the second element is intentionally left with 2 out of 3 properties, viz. Srno and Name only.
My assumption was that it would not parse. But it did.
Then this statement from JSON.org about arrays is incorrect:
An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
Where am I mistaken in my understanding of arrays in JSON? Can someone please clarify?
JSON defines a syntax for exchange of structured data, but doesn't define much in the way of semantics at all.
{
  "example": [
    {
      "id": 1,
      "a": 123,
      "b": "John Doe"
    },
    {
      "id": 1,
      "a": "ABC",
      "c": "James Mars",
      "d": true
    }
  ]
}
The above snippet is perfectly valid JSON. Notice, in addition to your "concerns" about arrays:
There is no way of specifying that id must be unique.
There is no way of specifying that nodes with the same name have the same datatype.
In summary, not only does JSON not require that each node have an identical number of properties, the properties that exist don't have to have the same names or the same data types.
Conversely, you could duplicate the first node of your example entirely (3 properties with the same names and values) and it would be equally valid. It's purely syntax, no semantics.
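For instance, a quick Python check shows a parser happily accepting the "missing" property (nothing here is specific to one parser):

import json

doc = '''{
  "employees": [
    {"Srno": 1, "EmpID": 123, "Name": "John Doe"},
    {"Srno": 2, "Name": "James Mars"}
  ]
}'''
employees = json.loads(doc)["employees"]  # parses without complaint
for e in employees:
    # EmpID is simply absent on the second element; .get() returns None
    print(e["Srno"], e.get("EmpID"), e["Name"])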
Your assumption is that programming languages should give some sort of parse error given an array whose values are of different types, like in your example. That assumption is VERY wrong.
Sure, you're correct if you're talking about Java, C++ or C# for example, but Perl, Python, PHP, Ruby, R, JavaScript, Smalltalk, ActionScript, Clojure, ColdFusion, Common Lisp (and most other Lisps), Powershell, Dylan, Groovy, Gambas, Matlab, io, VBScript and many many more languages would accept an array with objects of different types.
JSON is just like those languages. Nothing weird going on at all.
PS. I would recommend learning a dynamically typed language (one from the list above, maybe) to get a wider understanding of programming in general, just as I would advise all dynamic-language advocates to learn a static one!