I have this schema in DynamoDB:
{
  "timestamp" : "",
  "fruit" : {
    "name" : "orange",
    "translations" : [
      { "en-GB" : "orange" },
      { "sv-SE" : "apelsin" },
      ....
    ]
  }
}
I need to store translations for objects in a DynamoDB database and be able to query them efficiently. E.g. my query has to be something like "give me all objects where the translations array contains "
The problem is: is this a really dumb idea? There are about 6,500 languages out there, which means every entry would be forced to carry an array with thousands of properties, 99% of them empty string values. What's a better approach?
Thanks,
Unless you're willing to let DynamoDB do a table scan to get your results, I think you're using the wrong tool. Consider streaming your transactions to AWS Elasticsearch via something like Firehose. Firehose gives you a lot of nice-to-haves and can help you rotate transaction indexes. Elasticsearch should be able to store that structure and run your query.
If you don't go that route, then at least consider dropping the language code from your structure if you're not actually using it. Just make an array of the unique spellings of your fruit. This is the kind of query I might do with multiple queries instead of a single one: go from the spelling of the fruit name to a fruit UUID, which you can then query against.
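A minimal sketch of that two-step lookup with the AWS SDK for Java v2, assuming a hypothetical "FruitNames" table keyed by the spelling (storing a fruitId) and a hypothetical "Fruits" table keyed by fruitId; all table and attribute names here are illustrative, not part of the answer above:

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;

public class FruitLookup {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        // Step 1: spelling -> fruitId, via the lookup table
        Map<String, AttributeValue> nameItem = ddb.getItem(GetItemRequest.builder()
                .tableName("FruitNames")
                .key(Map.of("spelling", AttributeValue.builder().s("apelsin").build()))
                .build()).item();
        String fruitId = nameItem.get("fruitId").s();

        // Step 2: fruitId -> the fruit item itself
        Map<String, AttributeValue> fruit = ddb.getItem(GetItemRequest.builder()
                .tableName("Fruits")
                .key(Map.of("fruitId", AttributeValue.builder().s(fruitId).build()))
                .build()).item();
        System.out.println(fruit);
    }
}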
I would rather save it as:
{
  "primaryKey" : "orange",
  "SecondaryKey" : "en-GB",
  "timestamp" : "",
  "Metadata" : {
    "name" : "orange"
  }
}
And create a secondary index with SecondaryKey as the partition key and primaryKey as the sort key.
By doing this you can query:
"Get me orange in en-GB."
"What keys exist in en-GB?"
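For illustration, a minimal sketch of both of those queries with the AWS SDK for Java v2; the table name "Translations" and the GSI name "SecondaryKey-primaryKey-index" are assumptions for the example, not part of the answer:

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class TranslationQueries {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        // "Get me orange in en-GB" -- query the base table by partition key + sort key
        QueryRequest byFruit = QueryRequest.builder()
                .tableName("Translations")                          // hypothetical table name
                .keyConditionExpression("primaryKey = :pk AND SecondaryKey = :sk")
                .expressionAttributeValues(Map.of(
                        ":pk", AttributeValue.builder().s("orange").build(),
                        ":sk", AttributeValue.builder().s("en-GB").build()))
                .build();

        // "What keys exist in en-GB" -- query the GSI where SecondaryKey is the partition key
        QueryRequest byLanguage = QueryRequest.builder()
                .tableName("Translations")
                .indexName("SecondaryKey-primaryKey-index")         // hypothetical GSI name
                .keyConditionExpression("SecondaryKey = :sk")
                .expressionAttributeValues(Map.of(
                        ":sk", AttributeValue.builder().s("en-GB").build()))
                .build();

        QueryResponse r1 = ddb.query(byFruit);
        QueryResponse r2 = ddb.query(byLanguage);
        System.out.println(r1.items());
        System.out.println(r2.items());
    }
}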
If you are updating multiple items at once, you can create one object like this:
{
  "KeyName" : "orange",
  "SecondaryKey" : "master",
  "timestamp" : "",
  "fruit" : {
    "name" : "orange",
    "translations" : [
      { "en-GB" : "orange" },
      { "sv-SE" : "apelsin" },
      ....
    ]
  }
}
And create a Lambda function that denormalises the above object and creates multiple items in DynamoDB. But you will also have to take care of deleting items if some language is no longer present in the new object.
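A rough sketch of what that fan-out could look like with the AWS SDK for Java v2, assuming the master object has already been parsed into a map of language code to spelling; the table and attribute names follow the layout suggested above and are only illustrative:

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class DenormaliseTranslations {

    private final DynamoDbClient ddb = DynamoDbClient.create();

    // Write one item per language for the given fruit (language code -> spelling).
    public void fanOut(String fruitName, Map<String, String> translations) {
        for (Map.Entry<String, String> t : translations.entrySet()) {
            ddb.putItem(PutItemRequest.builder()
                    .tableName("Translations")
                    .item(Map.of(
                            "primaryKey", AttributeValue.builder().s(fruitName).build(),
                            "SecondaryKey", AttributeValue.builder().s(t.getKey()).build(),
                            "Metadata", AttributeValue.builder().m(Map.of(
                                    "name", AttributeValue.builder().s(t.getValue()).build())).build()))
                    .build());
        }
        // As noted above, you would also have to query the existing items for this fruit
        // and delete any language that no longer appears in the new master object.
    }
}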
I'm currently working on a MongoDB collection containing documents that look like the following:
{ startTime : Date, endTime: Date, source: String, metaData: {}}
My use case is to retrieve all documents included within a queried time frame, so my query looks like this:
db.myCollection.find(
{
$and: [
{"source": aSource},
{"startTime" : {$lte: timeFrame.end}},
{"endTime" : {$gte: timeFrame.start}}
]
}
).sort({ "startTime" : 1 })
With an index defined as follows:
db.myCollection.createIndex( { "source" : 1, "startTime": 1, "endTime": 1 } );
The problem is that queries are very slow (several hundred ms on a local database) as soon as the number of documents per source increases.
Using mongo's explain shows me that I'm using this index efficiently (only matching documents are fetched; otherwise only index access is made), so the slowness seems to come from the index scan itself, as this query needs to go over a large portion of the index.
In addition to that, such an index gets huge pretty quickly and therefore seems inefficient.
Is there anything I'm missing that could help make those queries faster, or is retrieving all the documents belonging to a given source really the best way to go? I see that MongoDB now provides some time-series features; could those help with my problem?
I have a database with quite a lot of entities and I want to preload data from a file on first creation of the database. For that, the Room schema needs to match the schema of the database file. Since converting the JSON schema by hand to SQLite statements is very error-prone (I would need to copy-paste every single statement and exchange the variable names), I am looking for a way to automatically generate a database from the schema, which I then just need to fill with the data.
However, there apparently is no information out on the internet about whether that is possible, or how to do it. It's my first time working with SQLite (normally I use MySQL) and also the first time I've seen a database schema in JSON. (Standard MariaDB export options always just export the CREATE TABLE statements.)
Is there a way? Or does Room provide any way to actually get the CREATE TABLE statements as proper text, not split up into tons of JSON arrays?
I followed the guide in the Android Developer Guidelines to get the JSON schema, so I already have that file. For those who do not know its structure, it looks like this:
{
"formatVersion": 1,
"database": {
"version": 1,
"identityHash": "someAwesomeHash",
"entities": [
{
"tableName": "Articles",
"createSql": "CREATE TABLE IF NOT EXISTS `${TABLE_NAME}` (`id` INTEGER NOT NULL, `germanArticle` TEXT NOT NULL, `frenchArticle` TEXT, PRIMARY KEY(`id`))",
"fields": [
{
"fieldPath": "id",
"columnName": "id",
"affinity": "INTEGER",
"notNull": true
},
{
"fieldPath": "germanArticle",
"columnName": "germanArticle",
"affinity": "TEXT",
"notNull": true
},
{
"fieldPath": "frenchArticle",
"columnName": "frenchArticle",
"affinity": "TEXT",
"notNull": false
}
],
"primaryKey": {
"columnNames": [
"id"
],
"autoGenerate": false
},
"indices": [
{
"name": "index_Articles_germanArticle",
"unique": true,
"columnNames": [
"germanArticle"
],
"createSql": "CREATE UNIQUE INDEX IF NOT EXISTS `index_Articles_germanArticle` ON `${TABLE_NAME}` (`germanArticle`)"
},
{
"name": "index_Articles_frenchArticle",
"unique": true,
"columnNames": [
"frenchArticle"
],
"createSql": "CREATE UNIQUE INDEX IF NOT EXISTS `index_Articles_frenchArticle` ON `${TABLE_NAME}` (`frenchArticle`)"
}
],
"foreignKeys": []
},
...
Note: My question was not how to create the Room DB from the schema. To get the schema, I already had to create all the entities and the database. It was about how to get the structure Room creates, as SQL, to prepopulate my database. That said, I think the answer is a really nice explanation, and in fact I found the SQL statements I was searching for in the generated Java file, which was an awesome hint. ;)
Is there a way? Or does Room provide any way to actually get the CREATE TABLE statements as proper text, not split up into tons of JSON arrays?
You cannot simply provide the CREATE SQL to Room; what you need to do is generate the Java/Kotlin classes (entities) from the JSON and then add those classes to the project.
Native SQLite (i.e. not using Room) would be a different matter, as that could be done at runtime.
The way Room works is that the database is generated from the classes annotated with @Entity (at compile time).
The entity classes have to exist for the compiler to correctly generate the code that it generates.
Furthermore, the entities have to be incorporated/included in a class for the database, annotated with @Database (this class is typically abstract).
Additionally, to access the database tables you have abstract classes or interfaces for the SQL, each annotated with @Dao, and again these require the entity classes, as the SQL is checked at compile time.
e.g. the JSON you provided would equate to something like :-
@Entity(
        indices = {
                @Index(value = "germanArticle", name = "index_Articles_germanArticle", unique = true),
                @Index(value = "frenchArticle", name = "index_Articles_frenchArticle", unique = true)
        },
        primaryKeys = {"id"}
)
public class Articles {
    //@PrimaryKey // Could use this as an alternative
    long id;
    @NonNull
    String germanArticle;
    String frenchArticle;
}
So your process would have to convert the JSON into classes like the one above, which could then be copied into the project.
You would then need a class for the database, which could be, for example :-
@Database(entities = {Articles.class}, version = 1)
abstract class MyDatabase extends RoomDatabase {
}
Note that Dao classes would be added to the body of the above, along the lines of :-
abstract MyDaoClass getDao();
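For completeness, a minimal sketch of what such a Dao could look like; the name MyDaoClass and the queries are only examples, not something Room prescribes:

import androidx.room.Dao;
import androidx.room.Insert;
import androidx.room.Query;
import java.util.List;

@Dao
public interface MyDaoClass {
    @Insert
    void insert(Articles article);          // insert a single Articles row

    @Query("SELECT * FROM Articles")
    List<Articles> getAll();                // read back all rows
}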
Or does Room provide any way to actually get the CREATE TABLE statements as proper text, not split up into tons of JSON arrays?
Yes it does ....
At this stage, if you compile, Room generates Java (MyDatabase_Impl for the above, i.e. the name of the Database class suffixed with _Impl). However, as there are no Dao classes/interfaces, the database would be unusable from a Room perspective (and thus wouldn't even get created).
Part of the generated code would be :-
@Override
public void createAllTables(SupportSQLiteDatabase _db) {
    _db.execSQL("CREATE TABLE IF NOT EXISTS `Articles` (`id` INTEGER NOT NULL, `germanArticle` TEXT NOT NULL, `frenchArticle` TEXT, PRIMARY KEY(`id`))");
    _db.execSQL("CREATE UNIQUE INDEX IF NOT EXISTS `index_Articles_germanArticle` ON `Articles` (`germanArticle`)");
    _db.execSQL("CREATE UNIQUE INDEX IF NOT EXISTS `index_Articles_frenchArticle` ON `Articles` (`frenchArticle`)");
    _db.execSQL("CREATE TABLE IF NOT EXISTS room_master_table (id INTEGER PRIMARY KEY,identity_hash TEXT)");
    _db.execSQL("INSERT OR REPLACE INTO room_master_table (id,identity_hash) VALUES(42, 'f7294cddfc3c1bc56a99e772f0c5b9bb')");
}
As you can see, the Articles table and the two indices are created; the room_master_table is used for validation checking.
I am trying to aggregate data into Elasticsearch from Kafka messages (consumed as a stream source with the Flink 1.10 API). The data arrives in JSON format, which is dynamic; a sample is given below. I want to combine multiple records into a single document by unique ID. The data arrives in sequence and it is time-series data.
The source is Kafka and the destination sink is Elasticsearch 7.6.1.
I have not found any good example that can be applied to the problem statement below.
Record : 1
{
"ID" : "1",
"timestamp" : "2020-05-07 14:34:51.325",
"Data" :
{
"Field1" : "ABC",
"Field2" : "DEF"
}
}
Record : 2
{
"ID" : "1",
"timestamp" : "2020-05-07 14:34:51.725",
"Data" :
{
"Field3" : "GHY"
}
}
Result :
{
"ID" : "1",
"Start_timestamp" : "2020-05-07 14:34:51.325",
"End_timestamp" : "2020-05-07 14:34:51.725",
"Data" :
{
"Field1" : "ABC",
"Field2" : "DEF",
"Field3" : "GHY"
}
}
Below are the version details:
Flink 1.10
Flink-kafka-connector 2.11
Flink-Elasticsearch-connector 7.x
Kafka 2.11
JDK 1.8
What you're asking for could be described as some sort of join, and there are many ways you might accomplish this with Flink. There's an example of stateful enrichment in the Apache Flink Training that shows how to implement a similar join using a RichFlatMapFunction, which should help you get started. You'll want to read through the relevant training materials first -- at least the section on Data Pipelines & ETL.
What you'll end up doing with this approach is to partition the stream by ID (via keyBy), and then use key-partitioned state (probably MapState in this case, assuming you have several attribute/value pairs to store for each ID) to store information from records like record 1 until you're ready to emit a result.
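A minimal sketch of that idea (not the training example itself): key the stream by ID and keep the merged attribute/value pairs in keyed MapState. The Event POJO, the output shape, and the decision of when a document is "complete" are assumptions you would adapt to your real data:

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Hypothetical POJO standing in for one parsed JSON record; adapt it to the real schema.
class Event {
    public String id;
    public String timestamp;
    public Map<String, String> data;
}

public class MergeById extends RichFlatMapFunction<Event, Map<String, String>> {

    // attribute/value pairs collected so far for the current key (the ID)
    private transient MapState<String, String> fields;

    @Override
    public void open(Configuration parameters) {
        fields = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("fields", String.class, String.class));
    }

    @Override
    public void flatMap(Event event, Collector<Map<String, String>> out) throws Exception {
        // remember the first timestamp seen for this ID and always overwrite the last one
        if (!fields.contains("Start_timestamp")) {
            fields.put("Start_timestamp", event.timestamp);
        }
        fields.put("End_timestamp", event.timestamp);

        for (Map.Entry<String, String> e : event.data.entrySet()) {
            fields.put(e.getKey(), e.getValue());
        }

        // This sketch emits the merged view on every record; in practice you decide here
        // when the document is complete (and clear the state, or rely on state TTL).
        Map<String, String> merged = new HashMap<>();
        merged.put("ID", event.id);
        for (Map.Entry<String, String> e : fields.entries()) {
            merged.put(e.getKey(), e.getValue());
        }
        out.collect(merged);
    }
}

// usage sketch: stream.keyBy(e -> e.id).flatMap(new MergeById()).addSink(esSink);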
BTW, if the set of keys is unbounded, you'll need to take care that you don't keep this state forever. Either clear the state when it's no longer needed (as this example does), or use State TTL to arrange for its eventual deletion.
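If you go the State TTL route, a small sketch of how the MapStateDescriptor from the sketch above could be configured; the one-hour retention is just an example value:

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;

public class TtlExample {
    public static MapStateDescriptor<String, String> descriptorWithTtl() {
        StateTtlConfig ttl = StateTtlConfig
                .newBuilder(Time.hours(1))                                  // keep entries for one hour
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)  // refresh the TTL on every write
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();

        MapStateDescriptor<String, String> descriptor =
                new MapStateDescriptor<>("fields", String.class, String.class);
        descriptor.enableTimeToLive(ttl);
        return descriptor;
    }
}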
For more information on other kinds of joins in Flink, see the links in this answer.
I am using Parse Server, which runs on MongoDB.
Let's say I have collections User and Comment and a join table of user and comment.
User can like a comment, which creates a new record in a join table.
Specifically in Parse Server, a join table can be defined using a 'relation' field in the collection.
Now when I want to retrieve all comments, I also need to know whether each of them is liked by the current user. How can I do this without doing additional queries?
You might say I could create an array field likers in the Comment table and use $elemMatch, but that doesn't seem like a good idea, because potentially there can be thousands of likes on a comment.
My idea, but I hope there could be a better solution:
I could create an array field someLikers, a relation (join table) field allLikers and a number field likesCount in the Comment table. Then I would put the first 100 likers in both someLikers and allLikers, and additional likers only in allLikers. I would always increment likesCount.
Then, when querying a list of comments, I would implement the call with $elemMatch, which would tell me whether the current user is inside someLikers. When I got the comments, I would check whether any of them have likesCount > 100 AND $elemMatch returned null. If so, I would have to run another query on the join table, looking for those comments and checking (querying by) whether they are liked by the current user.
Is there a better option?
Thanks!
I'd advise against directly accessing MongoDB unless you absolutely have to; after all, the way collections and relations are built is an implementation detail of Parse and in theory could change in the future, breaking your code.
Even though you want to avoid multiple queries, I suggest doing just that (depending on your platform you might even be able to run two Parse queries in parallel):
The first one is the query on Comment for getting all comments you want to display; assuming you have some kind of Post for which comments can be written, the query would find all comments referencing the current post.
The second query is again on Comment, but this time
constrained to the comments retrieved in the first query, e.g.: containedIn("objectId", arrayOfCommentIDs)
and constrained to the comments having the current user in their likers relation, e.g.: equalTo("likers", currentUser)
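A rough sketch of those two queries with the Parse Android SDK, assuming a "post" pointer column and a "likers" relation on Comment as discussed above; run it off the main thread, or switch to findInBackground:

import com.parse.ParseException;
import com.parse.ParseObject;
import com.parse.ParseQuery;
import com.parse.ParseUser;

import java.util.ArrayList;
import java.util.List;

public class CommentLikes {

    public static List<ParseObject> likedByCurrentUser(ParseObject currentPost) throws ParseException {
        // Query 1: all comments to display for the current post.
        ParseQuery<ParseObject> comments = ParseQuery.getQuery("Comment");
        comments.whereEqualTo("post", currentPost);
        List<ParseObject> commentList = comments.find();

        List<String> commentIds = new ArrayList<>();
        for (ParseObject comment : commentList) {
            commentIds.add(comment.getObjectId());
        }

        // Query 2: of those comments, the ones whose "likers" relation contains the current user.
        ParseQuery<ParseObject> liked = ParseQuery.getQuery("Comment");
        liked.whereContainedIn("objectId", commentIds);
        liked.whereEqualTo("likers", ParseUser.getCurrentUser());
        return liked.find();
    }
}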
Well, a join collection is not really a NoSQL way of thinking ;-)
I don't know ParseServer, so below is just based on pure MongoDB.
What I would do is, in the Comment document, use an array of ObjectIds of the users who like the comment.
Sample document layout
{
"_id" : ObjectId(""),
"name" : "Comment X",
"liked" : [
ObjectId(""),
....
]
}
Then use an aggregation to get the data. I assume you have the _id of the comment and you know the _id of the user.
The following aggregation returns the comment with a like count and a boolean indicating whether the user liked the comment.
db.Comment.aggregate(
[
{
$match: {
_id : ObjectId("your commentId")
}
},
{
$project: {
_id : 1,
name :1,
number_of_likes : {$size : "$liked"},
user_liked: {
$gt: [{
$size: {
$filter: {
input: "$liked",
as: "like",
cond: {
$eq: ["$$like", ObjectId("your userId")]
}
}
}
}, 0]
},
}
},
]
);
This returns:
{
"_id" : ObjectId(""),
"name" : "Comment X",
"number_of_likes" : NumberInt(7),
"user_liked" : true
}
Hope this is what you're after.
I have two JSON models that could represent the document for Elasticsearch indexing. First:
{
  "id" : 1,
  "nama" : "satu",
  "child" : {
    "id" : 2,
    "nama" : "dua",
    "child" : [
      {
        "id" : 3,
        "nama" : "tiga"
      },
      {
        "id" : 4,
        "nama" : "empat"
      }
    ]
  }
}
And second :
[{
"parent1id" : 1,
"parent1nama" : "satu",
"parent2id" : 2,
"parent2nama" : "dua",
"id" : 3,
"nama" : "tiga"
},
{
"parent1id" : 1,
"parent1nama" : "satu",
"parent2id" : 2,
"parent2nama" : "dua",
"id" : 4,
"nama" : "empat"
}]
Both the first and the second have the same meaning and were created for Elasticsearch indexing. I think the first model is less redundant and the second one is more redundant. But the first is represented as one Elasticsearch record, while the second is represented as two records. This matters when I search, for example, for ID = 3: the first model will return the whole record, while the second will return only the record where ID = 3.
So I would like your suggestions: which model is better for Elasticsearch? Thanks...
There's no difference inside Elasticsearch, because it uses Apache Lucene to save your fields as key = value pairs. For example, your first example will be saved as child.id = 3, child.nama = tiga.
But a good point: in your first case the child objects can be indexed as a nested object, which gives you a lot of possibilities for filters, queries and other kinds of things.
Take a look at nested objects; I think this will clarify your needs.
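For example, if the inner child objects are mapped as a nested field, a search for ID = 3 could look roughly like this with the Java high-level REST client for Elasticsearch 7.x; the index name "fruits" and the nested path "child.child" are assumptions based on the first model:

import org.apache.http.HttpHost;
import org.apache.lucene.search.join.ScoreMode;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class NestedChildSearch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            SearchRequest request = new SearchRequest("fruits");
            request.source(new SearchSourceBuilder().query(
                    // match documents whose nested child array contains an entry with id = 3
                    QueryBuilders.nestedQuery(
                            "child.child",
                            QueryBuilders.termQuery("child.child.id", 3),
                            ScoreMode.None)));

            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            System.out.println(response.getHits().getTotalHits());
        }
    }
}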
Note: use aggregated data when possible; Elasticsearch is a NoSQL, document-oriented store.
I strongly suggest your 2nd model. A key principle of a NoSQL database is that you duplicate data to make it easier to query.
Using Nested or Parent/Child in ES is doable, but it makes all your queries more complicated. We have found that flattening everything is much easier to work with and allows us to use Kibana much more efficiently.