Which model is better for elasticsearch indexing? - arrays

I have 2 JSON models that could represent the data for Elasticsearch indexing. First:
{
    "id" : 1,
    "nama" : "satu",
    "child" : {
        "id" : 2,
        "nama" : "dua",
        "child" : [
            {
                "id" : 3,
                "nama" : "tiga"
            },
            {
                "id" : 4,
                "nama" : "empat"
            }
        ]
    }
}
And second :
[{
"parent1id" : 1,
"parent1nama" : "satu",
"parent2id" : 2,
"parent2nama" : "dua",
"id" : 3,
"nama" : "tiga"
},
{
"parent1id" : 1,
"parent1nama" : "satu",
"parent2id" : 2,
"parent2nama" : "dua",
"id" : 4,
"nama" : "empat"
}]
Both models have the same meaning and were created for Elasticsearch indexing. I think the first model is less redundant and the second more redundant, but the first is stored as 1 Elasticsearch document while the second is stored as 2 documents. This matters when I search, for example for ID = 3: the first model returns the whole document, while the second returns only the document where ID = 3.
So I would like your suggestions: which model is better for Elasticsearch? Thanks.

There's no difference inside Elasticsearch, because it uses Apache Lucene, which stores your fields as key = value pairs. For example, your first model will be saved as child.id = 3, child.nama = tiga.
A good point about your first model is that the child object can be indexed as a nested object, which gives you a lot of possibilities for filters, queries and other features.
Take a look at nested objects; I think this will clarify your needs.
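For illustration, here is a minimal sketch of what that could look like with the official 7.x Node.js client (@elastic/elasticsearch); the "items" index name and the local node URL are assumptions, and note that the inner array has to be mapped explicitly as nested, because Elasticsearch flattens arrays of objects by default:

import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "http://localhost:9200" });

async function setupAndQuery() {
    // Map the inner "child.child" array as nested so each element is indexed
    // as its own hidden document and can be matched independently.
    await client.indices.create({
        index: "items",
        body: {
            mappings: {
                properties: {
                    child: {
                        properties: {
                            child: { type: "nested" }
                        }
                    }
                }
            }
        }
    });

    // Nested query: return only documents whose child.child array contains
    // an element with id = 3.
    const result = await client.search({
        index: "items",
        body: {
            query: {
                nested: {
                    path: "child.child",
                    query: { match: { "child.child.id": 3 } }
                }
            }
        }
    });
    console.log(result.body.hits.hits);
}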
Note: use aggregated (denormalized) data when possible; Elasticsearch is a document-oriented NoSQL store.

I strongly suggest your 2nd model. A key principle of a NoSQL database is that you duplicate data to make it easier to query.
Using Nested or Parent/Child in ES is doable, but it makes all your queries more complicated. We have found that flattening everything is much easier to work with and allows us to use Kibana much more efficiently.
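As a rough sketch of how simple the flattened model is to query (the "items-flat" index name and the same 7.x client setup as above are assumptions), finding ID = 3 is a plain top-level match, and the hit already carries the duplicated parent fields:

import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "http://localhost:9200" });

// With the flattened model there is no nested path or inner-hit handling:
// a plain match on the top-level id returns exactly the record you want.
async function findById(id: number) {
    const result = await client.search({
        index: "items-flat",
        body: {
            query: { match: { id } }
        }
    });
    return result.body.hits.hits;
}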

Related

Performance issue when querying time-based objects

I'm currently working on a MongoDB collection containing documents that look like the following:
{ startTime : Date, endTime: Date, source: String, metaData: {}}
My use case is to retrieve all documents included within a queried time frame, so my query looks like this:
db.myCollection.find(
    {
        $and: [
            { "source": aSource },
            { "startTime": { $lte: timeFrame.end } },
            { "endTime": { $gte: timeFrame.start } }
        ]
    }
).sort({ "startTime": 1 })
With an index defined as the following :
db.myCollection.createIndex( { "source" : 1, "startTime": 1, "endTime": 1 } );
The problem is that queries are very slow (multiple hundreds of ms on a local database) as soon as the number of documents per source increases.
Using Mongo's explain shows me that I'm using this index efficiently (only the matching documents are scanned, otherwise only index access is made), so the slowness seems to come from the index scan itself, as this query needs to go over a large portion of the index.
In addition to that, such an index gets huge pretty quickly and therefore seems inefficient.
Is there anything I'm missing that could help make those queries faster, or is retrieving all the documents belonging to a given source the best way to go? I see that Mongo now provides some time-series features; could they help with my problem?

JSON data Aggregation in flink 1.10 DataStream API

I am trying to aggregate data in Elasticsearch using Kafka messages as a Flink 1.10 DataStream source. The data arrives in JSON format and is dynamic; a sample is given below. I want to combine multiple records into a single document by unique ID. The data arrives in sequence and is time-series data.
The source is Kafka and the destination sink is Elasticsearch 7.6.1.
I have not found any good example that can be applied to the problem statement below.
Record : 1
{
"ID" : "1",
"timestamp" : "2020-05-07 14:34:51.325",
"Data" :
{
"Field1" : "ABC",
"Field2" : "DEF"
}
}
Record : 2
{
"ID" : "1",
"timestamp" : "2020-05-07 14:34:51.725",
"Data" :
{
"Field3" : "GHY"
}
}
Result :
{
"ID" : "1",
"Start_timestamp" : "2020-05-07 14:34:51.325",
"End_timestamp" : "2020-05-07 14:34:51.725",
"Data" :
{
"Field1" : "ABC",
"Field2" : "DEF",
"Field3" : "GHY"
}
}
Below are the version details:
Flink 1.10
Flink-kafka-connector 2.11
Flink-Elasticsearch-connector 7.x
Kafka 2.11
JDK 1.8
What you're asking for could be described as some sort of join, and there are many ways you might accomplish this with Flink. There's an example of stateful enrichment in the Apache Flink Training that shows how to implement a similar join using a RichFlatMapFunction that should help you get started. You'll want to read through the relevant training materials first -- at least the section on Data Pipelines & ETL.
What you'll end up doing with this approach is to partition the stream by ID (via keyBy), and then use key-partitioned state (probably MapState in this case, assuming you have several attribute/value pairs to store for each ID) to store information from records like record 1 until you're ready to emit a result.
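Purely to illustrate the per-key merge that this keyed state would perform (this is not Flink code; in Flink it would be written in Java/Scala inside the function described above), here is a rough TypeScript sketch using the field names from the question:

interface InputRecord {
    ID: string;
    timestamp: string;
    Data: Record<string, string>;
}

interface MergedRecord {
    ID: string;
    Start_timestamp: string;
    End_timestamp: string;
    Data: Record<string, string>;
}

// "state" plays the role of the key-partitioned MapState: one entry per ID.
const state = new Map<string, MergedRecord>();

function merge(record: InputRecord): MergedRecord {
    const existing = state.get(record.ID);
    const merged: MergedRecord = existing
        ? {
              ...existing,
              End_timestamp: record.timestamp,
              Data: { ...existing.Data, ...record.Data }
          }
        : {
              ID: record.ID,
              Start_timestamp: record.timestamp,
              End_timestamp: record.timestamp,
              Data: { ...record.Data }
          };
    state.set(record.ID, merged);
    // In Flink you would emit this (or wait for a trigger) and eventually clear the state.
    return merged;
}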
BTW, if the set of keys is unbounded, you'll need to take care that you don't keep this state forever. Either clear the state when it's no longer needed (as this example does), or use State TTL to arrange for its eventual deletion.
For more information on other kinds of joins in Flink, see the links in this answer.

How to store translations in nosql DB with minimal duplication?

I got this schema in DynamoDB
{
    "timestamp" : "",
    "fruit" : {
        "name" : "orange",
        "translations" : [
            {
                "en-GB" : "orange"
            },
            {
                "sv-SE" : "apelsin"
            },
            ....
        ]
    }
}
I need to store translations for objects in a DynamoDB database, to be able to query them efficiently. E.g. my query has to be something like "give me all objects where translations array contains "
The problem is, is this a really dumb idea? There are 6500 languages out there, and this means I will be forcing all entries to each contain an array with thousands of properties with 99% of them empty string values. What's a better approach?
Thanks,
Unless you're willing to let DynamoDB do a table scan to get your results, I think you're using the wrong tool. Consider streaming your transactions to AWS Elasticsearch via something like Firehose. Firehose will give you a lot of nice-to-haves and can help you rotate transaction indexes. Elasticsearch should be able to store that structure and run your query.
If you don't go that route, then at least consider dropping the language code from your structure if you're not actually using it. Just make an array of the unique spellings of your fruit. This is the kind of query I might try to do with multiple queries instead of a single one: go from the spelling of the fruit name to a fruit UUID, which you can then query against.
I would rather save it as:
{
    "primaryKey" : "orange",
    "SecondaryKey" : "en-GB",
    "timestamp" : "",
    "Metadata" : {
        "name" : "orange"
    }
}
And create a secondary index with SecondaryKey as the partition key and primaryKey as the sort key.
By doing this you can query:
Get me "orange" in en-GB.
What keys exist in en-GB.
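As a sketch of the second kind of query ("what keys exist in en-GB"), assuming a table named Translations, a GSI named SecondaryKey-primaryKey-index, and the AWS SDK for JavaScript v2 (those names are made up for illustration):

import { DynamoDB } from "aws-sdk";

const dynamo = new DynamoDB.DocumentClient();

// Query the GSI whose partition key is SecondaryKey (the language code) to
// list every primaryKey (fruit name) that has an en-GB translation.
async function keysInLanguage(language: string) {
    const result = await dynamo
        .query({
            TableName: "Translations",                  // assumed table name
            IndexName: "SecondaryKey-primaryKey-index", // assumed GSI name
            KeyConditionExpression: "SecondaryKey = :lang",
            ExpressionAttributeValues: { ":lang": language }
        })
        .promise();
    return result.Items;
}

// keysInLanguage("en-GB");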
If you are updating multiple items at once, you can create 1 object like this:
{
    "KeyName" : "orange",
    "SecondaryKey" : "master",
    "timestamp" : "",
    "fruit" : {
        "name" : "orange",
        "translations" : [
            {
                "en-GB" : "orange"
            },
            {
                "sv-SE" : "apelsin"
            },
            ....
        ]
    }
}
And create a Lambda function that denormalises the above object and creates multiple items in DynamoDB. But you will also have to take care of deleting items if some language is no longer present in the new object.

How to query with order by time different

I have a Firebase Realtime Database with time/value entries as follows; there is a 1-minute delay between each object.
"-LhIaB7SP0y-FLb1xFFx" : {
"time" : 1560475623,
"value" : 11.614479842990287
},
"-LhIaJ6PjtbX1VHKlwFM" : {
"time" : 1560475681,
"value" : 11.642968895431837
},
"-LhIaXbX42k8dmApfztL" : {
"time" : 1560475741,
"value" : 11.707783121665505
},
"-LhIaqgYSpUmKbcH1MTN" : {
"time" : 1560475802,
"value" : 11.704004474172576
},
"-LhIb-20G9jnx61vNjS-" : {
"time" : 1560475861,
"value" : 11.69861155382089
},
"-LhIbDdTEdWrhirbjVRa" : {
"time" : 1560475921,
"value" : 11.661539551497276
},
"-LhIbSGKvS2POggUCots" : {
"time" : 1560475981,
"value" : 11.581711077020692
}
I can retrieve the data ordered by "time", but I want to filter it down to every 5 minutes, 1 day, or a week.
this.items = this.db.list(`history/data`, ref => ref.orderByChild("time").limitToLast(1000));
Is there firebase list filtering for that?
The query model of the Firebase Realtime Database works as follows for your code:
It orders the child nodes of the reference by the child you indicate.
It finds the last node in the result, and then returns the 1000 nodes before that.
You can add a condition to your query with startAt(), endAt(), and/or equalTo(), to make the database start/end at a specific set of child nodes within the range. But there is no way within these conditions to skip child nodes in the middle. Once the database has found a child node to start returning, it will return all child nodes from there on until the conditions are no longer met.
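For example, narrowing the query from the question to one contiguous day (a range, not a sampling) could look roughly like this; the path and AngularFire style are taken from the question, and the concrete timestamps are made up:

// Sketch: startAt/endAt restrict the indexed "time" child to a window,
// but they cannot skip nodes inside that window (e.g. one sample every 5 minutes).
const dayStart = 1560470400;            // assumed Unix timestamp in seconds
const dayEnd = dayStart + 24 * 60 * 60; // one day later

this.items = this.db.list(`history/data`, ref =>
    ref.orderByChild("time").startAt(dayStart).endAt(dayEnd)
);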
The simplest way I can think of to implement your requirement is to store the data in the aggregation buckets that you want to query on. So if you want to allow reading the data in a way that gives you the first and last item of every week, and the first and last item of every month, you'd store:
2019w24_first: "-LhIaB7SP0y-FLb1xFFx",
2019w24_last: "-LhIbSGKvS2POggUCots",
2019m06_first: "-LhIaB7SP0y-FLb1xFFx",
2019m06_last: "-LhIbSGKvS2POggUCots"
And then each time you write the data, you update the relevant aggregates too.
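A rough sketch of that fan-out write with the plain Firebase JS SDK (v8-style namespaced API); the "aggregates" node and the bucket-naming helpers are assumptions for illustration, not part of the answer:

import firebase from "firebase/app";
import "firebase/database";

function monthBucket(unixSeconds: number): string {
    const d = new Date(unixSeconds * 1000);
    return `${d.getUTCFullYear()}m${String(d.getUTCMonth() + 1).padStart(2, "0")}`;
}

function weekBucket(unixSeconds: number): string {
    // Naive week-of-year label such as "2019w24"; good enough for a sketch.
    const d = new Date(unixSeconds * 1000);
    const dayOfYear = (d.getTime() - Date.UTC(d.getUTCFullYear(), 0, 1)) / 86400000;
    return `${d.getUTCFullYear()}w${String(Math.ceil((dayOfYear + 1) / 7)).padStart(2, "0")}`;
}

// When writing a new reading, also move the "_last" aggregate pointers in a
// single atomic multi-location update ("_first" would need a conditional write).
function writeReading(key: string, reading: { time: number; value: number }) {
    return firebase.database().ref().update({
        [`history/data/${key}`]: reading,
        [`aggregates/${weekBucket(reading.time)}_last`]: key,
        [`aggregates/${monthBucket(reading.time)}_last`]: key
    });
}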
This sounds incredibly inefficient to folks who come from a background with relational/SQL databases, but it is actually very common in NoSQL databases. By making your write operations do some extra work and storing some duplicate data, your read operations become massively more scalable.
For some more information on these types of data modeling choices, I recommend:
reading NoSQL data modeling
watching Firebase for SQL developers
watching Getting to know Cloud Firestore, which may be for a different Firebase Database, but many of the principles apply equally.

Parse Server, MongoDB - get "liked" state of an object

I am using Parse Server, which runs on MongoDB.
Let's say I have collections User and Comment and a join table of user and comment.
User can like a comment, which creates a new record in a join table.
Specifically in Parse Server, a join table can be defined using a 'relation' field in the collection.
Now when I want to retrieve all comments, I also need to know, whether each of them is liked by the current user. How can I do this, without doing additional queries?
You might say I could create an array field likers in the Comment table and use $elemMatch, but it doesn't seem like a good idea, because potentially there can be thousands of likes on a comment.
My idea, but I hope there could be a better solution:
I could create an array field someLikers, a relation (join table) field allLikers and a number field likesCount in the Comment table. Then put the first 100 likers in both someLikers and allLikers, and additional likers only in allLikers. I would always increment likesCount.
Then, when querying a list of comments, I would use $elemMatch, which would tell me whether the current user is inside someLikers. Once I got the comments, I would check whether any of them have likesCount > 100 AND $elemMatch returned null. If so, I would have to run another query on the join table for those comments, checking whether they are liked by the current user.
Is there a better option?
Thanks!
I'd advise against directly accessing MongoDB unless you absolutely have to; after all, the way collections and relations are built is an implementation detail of Parse and in theory could change in the future, breaking your code.
Even though you want to avoid multiple queries, I suggest doing just that (depending on your platform you might even be able to run the two Parse queries in parallel):
The first one is the query on Comment for getting all comments you want to display; assuming you have some kind of Post for which comments can be written, the query would find all comments referencing the current post.
The second query is again on Comment, but this time
constrained to the comments retrieved in the first query, e.g.: containedIn("objectId", arrayOfCommentIDs)
and constrained to the comments having the current user in their likers relation, e.g.: equalTo("likers", currentUser); see the sketch below.
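A rough sketch of those two queries with the Parse JS SDK; the "post" pointer field is an assumption about your schema, while the "likers" relation name comes from the question:

import Parse from "parse";

async function commentsWithLikedFlag(currentPost: Parse.Object, currentUser: Parse.User) {
    // Query 1: all comments for the post being displayed.
    const commentQuery = new Parse.Query("Comment");
    commentQuery.equalTo("post", currentPost);
    const comments = await commentQuery.find();

    // Query 2: of those comments, the ones whose "likers" relation contains the current user.
    const likedQuery = new Parse.Query("Comment");
    likedQuery.containedIn("objectId", comments.map(c => c.id));
    likedQuery.equalTo("likers", currentUser);
    const likedByMe = new Set((await likedQuery.find()).map(c => c.id));

    return comments.map(c => ({ comment: c, likedByCurrentUser: likedByMe.has(c.id) }));
}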
Well a join collection is not really a noSQL way of thinking ;-)
I don't know ParseServer, so below is just based on pure MongoDB.
What I would do is, in the Comment document, use an array of ObjectIds, one for each user who likes the comment.
Sample document layout
{
"_id" : ObjectId(""),
"name" : "Comment X",
"liked" : [
ObjectId(""),
....
]
}
Then use an aggregation to get the data. I assume you have the _id of the comment and you know the _id of the user.
The following aggregation returns the comment with a like count and a boolean which indicates the user liked the comment.
db.Comment.aggregate(
    [
        {
            $match: {
                _id : ObjectId("your commentId")
            }
        },
        {
            $project: {
                _id : 1,
                name : 1,
                number_of_likes : { $size : "$liked" },
                user_liked : {
                    $gt: [
                        {
                            $size: {
                                $filter: {
                                    input: "$liked",
                                    as: "like",
                                    cond: {
                                        $eq: ["$$like", ObjectId("your userId")]
                                    }
                                }
                            }
                        },
                        0
                    ]
                }
            }
        }
    ]
);
This returns:
{
"_id" : ObjectId(""),
"name" : "Comment X",
"number_of_likes" : NumberInt(7),
"user_liked" : true
}
Hope this is what you're after.
