Merging conflicts with the reindex API in Elasticsearch - database

I need to merge conflicting documents with the reindex API. But it just completely removes the old dest document and replaces it with what I provide in source, instead of only changing the fields that the source has and leaving the other destination fields untouched.
I have two indexes in elasticsearch named i1 and i2
i2 was reindexed from i1 a long time ago and shares ids with i1
i2 has new values for most of its fields, and the old values in i1 aren't useful for me
But i1 has recently received a lot of new fields with new values, and I want to add them to i2
I used the reindex API on a test index with the source fields limited to only the new fields from i1 (with _source: []). But I saw that reindexing removed every other field from the destination, and only the fields that I specified to be copied remain.
Is there a way to tell the reindex API to only add new fields when the ids match, and not remove the destination fields that don't exist in the source?
I tried setting the version type to external, but it just checks which document is newer before either completely replacing it or doing nothing.
document example in i1:
{
"id": "1",
"f1": "vOld",
"f2": "v"
}
document example in i2:
{
"id": "1",
"f1": "vNew"
}
result I want in i2 after reindexing:
{
"id": "1",
"f1": "vNew",
"f2": "v"
}
my query:
POST _reindex
{
"source": {
"index": "i1",
"_source": ["f2"]
},
"dest": {
"index": "i2"
}
}
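For reference, here is a workaround sketch I'm considering instead of _reindex (assuming the official elasticsearch-py client; the cluster URL is a placeholder): scroll only the new fields out of i1 and apply them to i2 as partial bulk updates, since a partial update merges the supplied fields into the existing document instead of replacing it.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

NEW_FIELDS = ["f2"]  # the fields recently added to i1 that should be copied over

def actions():
    # stream every document from i1 and keep only the new fields
    for hit in helpers.scan(es, index="i1"):
        partial = {f: hit["_source"][f] for f in NEW_FIELDS if f in hit["_source"]}
        if not partial:
            continue
        yield {
            "_op_type": "update",    # partial update: merge, don't replace
            "_index": "i2",
            "_id": hit["_id"],
            "doc": partial,          # only f2 here, so f1 in i2 stays untouched
            "doc_as_upsert": True,   # create the doc in i2 if the id is new there
        }

helpers.bulk(es, actions())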

Related

Azure Data Factory - converting lookup result array

I'm pretty new to Azure Data Factory (ADF) and have stumbled into something I would have solved with a couple of lines of code.
Background
Main flow:
Lookup Activity fetching an array of IDs to process
ForEach Activity looping over the input array and using a Copy Activity to pull data from a REST API and store it in a database
Step #1 would result in an array containing IDs
{
"count": 10000,
"value": [
{
"id": "799128160"
},
{
"id": "817379102"
},
{
"id": "859061172"
},
... many more...
Step #2: When the lookup returns a lot of IDs, individual REST calls take a lot of time. The REST API supports batching IDs using a comma-separated input.
The question
How can I convert the array from the input into a new array with comma-separated fields? This will reduce the number of Activities and reduce the time to run.
Expecting something like this:
{
"count": 1000,
"value": [
{
"ids": "799128160,817379102,859061172,...."
},
{
"ids": "n,n,n,n,n,n,n,n,n,n,n,n,...."
}
... many more...
EDIT 1 - 19th Dec 22
Using the "Until Activity" and keeping track of positions, I managed to do it in plain ADF. It would be nice if this could have been done using some simple array manipulation in a code snippet.
The ideal response might be that we have to do the manipulation with a Dataflow.
First, I took a Dataflow and added a Surrogate key transformation after the source - say the new key field is 'SrcKey'.
Add an Aggregate transformation where you group by mod(SrcKey, 3). This will group rows with the same remainder into the same bucket.
Add a collect column in the same aggregator to collect the ids into an array, with the expression trim(toString(collect(id)),'[]').
Store output in single file in blob storage.
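For comparison, the "couple of lines of code" version of the same batching looks roughly like this (a sketch only; it assumes the Lookup output shape shown in the question and a hypothetical batch size of 10):
# chunk the Lookup result into comma-separated ID batches
lookup_value = [{"id": "799128160"}, {"id": "817379102"}, {"id": "859061172"}]  # ...many more...

BATCH_SIZE = 10  # how many IDs to send per REST call

ids = [row["id"] for row in lookup_value]
batches = [
    {"ids": ",".join(ids[i:i + BATCH_SIZE])}
    for i in range(0, len(ids), BATCH_SIZE)
]
# batches -> [{"ids": "799128160,817379102,..."}, {"ids": "..."}]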

BigQuery | bq load with inline schema with STRING REPEATED

I am trying to load a bq table with the below definition, and one of the columns (ref_list) is of type STRING REPEATED.
[
{
"name": "emp",
"type": "STRING"
},
{
"mode": "REPEATED",
"name": "ref_list",
"type": "STRING"
},
{
"name": "update_date",
"type": "DATE"
}
]
Below is how my input data is:
{"emp":"Adam","ref_list":["Roger","Calvin","Andrew","Kohl"],"update_date":"1999-01-01"}
{"emp":"AntiP27","ref_list":["John","Patrick","Nick","Chris"],"update_date":"2020-01-01"}
I am able to load the table by pointing to the .schema file on my local machine, but the same is failing when I provide the inline schema.
Here is my bq load command with the inline schema option. I am not quite sure how I could specify mode = REPEATED:
bq load --replace --source_format=NEWLINE_DELIMITED_JSON emp_stage.emp_dtl gs://1324-global-delivery/emp_dtl.json emp:STRING,ref_list:STRING,update_date:DATE
According to the documentation, it's not possible to specify a RECORD type or a column's mode (NULLABLE, REPEATED) with an inline schema:
When you specify the schema on the command line, you cannot include a
RECORD (STRUCT) type, you cannot include a column description, and you
cannot specify the column's mode. All modes default to NULLABLE. To
include descriptions, modes, and RECORD types, supply a JSON schema
file instead.
bq_manually_specifying_schemas
If you need to use these parameters, you have to specify them in a JSON schema in a dedicated file, as you did in your example.
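If you are not limited to the command line, the mode can also be set programmatically; a minimal sketch assuming the google-cloud-bigquery Python package and the dataset, table and GCS URI from the question:
from google.cloud import bigquery

client = bigquery.Client()

# REPEATED mode can be expressed directly in the schema objects
schema = [
    bigquery.SchemaField("emp", "STRING"),
    bigquery.SchemaField("ref_list", "STRING", mode="REPEATED"),
    bigquery.SchemaField("update_date", "DATE"),
]

job_config = bigquery.LoadJobConfig(
    schema=schema,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # same effect as --replace
)

load_job = client.load_table_from_uri(
    "gs://1324-global-delivery/emp_dtl.json",
    "emp_stage.emp_dtl",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish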

DynamoDB How to design and query multiple fields

I have an item like this
{
"date": "2019-10-05",
"id": "2",
"serviceId": "1",
"time": {
"endTime": "1300",
"startTime": "1330"
}
}
Right now the way I design this is like so:
primary key --> id
Global secondary index --> primary key : serviceId
--> sort key : date
With the way it is designed as of now,
* I can query by id
* I can query by serviceId and a range of dates
I'd like to be able to query such that I can retrieve all items where
* serviceId = 1 AND
* date = "yyyy-mm-dd" AND
* time = {
"endTime": "1300",
"startTime": "1330"
}
I'd still like to be able to query based on the 2 previous conditions (query by id, and query by serviceId and range of date).
Is there a way to do this? One way I was thinking of is to create a new field and use it as an index, e.g. combine all the data so
combinedField: "1_yyyy-mm-dd_1300_1330"
make that the primary key for a global secondary index, and just query it like that.
I'm just not sure if this is the way to do it, or if there's a better or best-practice way to do this?
Thank you
You could either use FilterExpression or composite sort keys.
FilterExpression
Here you could retrieve the items from the GSI you described by specifying 'serviceId' and 'date' in the key condition, and then specifying time.startTime and time.endTime within the 'FilterExpression'. The sample Python code using boto3 would be as follows:
# assumes: from boto3.dynamodb.conditions import Key, Attr
response = table.query(
KeyConditionExpression=Key('serviceId').eq('1') & Key('date').eq('2019-10-05'),
FilterExpression=Attr('time.endTime').eq('1300') & Attr('time.startTime').eq('1330')
)
The drawback with this method is that all items matching the key condition will be read and only then are the results filtered, so you will be charged for everything matched by the key condition.
e.g. if 1000 items have 'serviceId' as 1 and 'date' as '2019-10-05', but only 10 items have 'time.startTime' as 1330, you will still be charged for reading the 1000 items even though only 10 items will be returned after the FilterExpression is applied.
Composite Sort Key
I believe this is the method you mentioned in the question. Here you will need to make an attribute as
'yyyy-mm-dd_startTime_endTime'
and use this as the sort key in your GSI. Now your items will look like this:
{ "date": "2019-10-05",
"id": "2",
"serviceId": "1",
"time": {
"endTime": "1300",
"startTime": "1330"
}
"date_time":"2019-10-05_1330_1300"
}
Your GSI will have 'serviceId' as partition key and 'date_time' as sort key. Now you will be able to query date range as:
response = table.query(
KeyConditionExpression=Key('serviceId').eq('1') & Key('date_time').between('2019-07-05', '2019-10-05')
)
For the query where date, start and end time are specified, you can query as:
response = table.query(
KeyConditionExpression=Key('serviceId').eq('1') & Key('date_time').eq('2019-10-05_1330_1300')
)
This approach won't work if you need a range of dates and the start and end time together, i.e. you won't be able to make a query for items in a particular date range containing a specific start and end time. In that case you would have to use a FilterExpression.
Yes, the solution you suggested (add a new field which is the combination of the fields and define a GSI on it) is the standard way to achieve that. You need to make sure that the character you use for concatenation is unique, i.e., it cannot appear in any of the individual fields you combine.
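For completeness, here is a small boto3 sketch of building that composite attribute at write time so the GSI always has it (the table name "Bookings" and the "_" separator are just assumptions):
import boto3

table = boto3.resource("dynamodb").Table("Bookings")

item = {
    "id": "2",
    "serviceId": "1",
    "date": "2019-10-05",
    "time": {"startTime": "1330", "endTime": "1300"},
}

# composite sort key for the GSI; "_" must never appear in the parts themselves
item["date_time"] = "{}_{}_{}".format(
    item["date"], item["time"]["startTime"], item["time"]["endTime"]
)

table.put_item(Item=item)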

How to store translations in nosql DB with minimal duplication?

I got this schema in DynamoDB
{
"timestamp" : "",
"fruit" : {
"name" : "orange",
"translations" : [
{
"en-GB" : "orange"
},
{
"sv-SE" : "apelsin"
},
....
]
}
}
I need to store translations for objects in a DynamoDB database, to be able to query them efficiently. E.g. my query has to be something like "give me all objects where translations array contains "
The problem is, is this a really dumb idea? There are 6500 languages out there, and this means I will be forcing all entries to each contain an array with thousands of properties with 99% of them empty string values. What's a better approach?
Thanks,
Unless you're willing to let DynamoDB do a table scan to get your results, I think you're using the wrong tool. Consider streaming your transactions to AWS Elasticsearch via something like Firehose. Firehose will give you a lot of nice-to-haves and can help you rotate transaction indexes. Elasticsearch should be able to store that structure and run your query.
If you don't go that route, then at least consider dropping the language code from your structure if you're not actually using it. Just make an array of the unique spellings of your fruit. This is the kind of lookup I might do with multiple queries instead of a single one: go from the spelling of the fruit name to a fruit UUID, which you can then query against.
I would rather save it as:
{
"primaryKey" : "orange",
"SecondaryKey": "en-GB",
"timestamp" : "",
"Metadata" : {
"name" : "orange"
}
}
And create a secondary-index with SecondaryKey as PK and primaryKey as SK.
By doing this you can query (see the sketch below):
Get me orange in en-GB.
What keys exist in en-GB.
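A minimal boto3 sketch of those two queries (the table name "Translations" and the index name "lang-index" are assumptions):
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Translations")

# "Get me orange in en-GB"
one = table.query(
    IndexName="lang-index",
    KeyConditionExpression=Key("SecondaryKey").eq("en-GB") & Key("primaryKey").eq("orange"),
)

# "What keys exist in en-GB"
all_in_lang = table.query(
    IndexName="lang-index",
    KeyConditionExpression=Key("SecondaryKey").eq("en-GB"),
)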
If you are updating multiple items at once, you can create one object like this:
{
"KeyName" : "orange",
"SecondaryKey": "master",
"timestamp" : "",
"fruit" : {
"name" : "orange",
"translations" : [
{
"en-GB" : "orange"
},
{
"sv-SE" : "apelsin"
},
....
]
}
}
And create a Lambda function that denormalises the above object and creates multiple items in DynamoDB. But you will also have to take care of deleting items if a language is no longer present in the new object.
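A rough sketch of what that Lambda could do (the table name, event shape and per-language item layout are assumptions, and deleting removed languages is left out):
import boto3

table = boto3.resource("dynamodb").Table("Translations")

def denormalise(master):
    # fan the "master" object out into one item per language
    with table.batch_writer() as batch:
        for translation in master["fruit"]["translations"]:
            for lang, text in translation.items():
                batch.put_item(Item={
                    "primaryKey": master["KeyName"],
                    "SecondaryKey": lang,
                    "timestamp": master["timestamp"],
                    "Metadata": {"name": text},
                })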

Optimizing seemingly simple couchbase query for "items whose children satisfy"

I'm developing a system to store our translations using couchbase.
I have about 15,000 entries in my bucket that look like this:
{
"classifications": [
{
"documentPath": "Test Vendor/Test Project/Ordered",
"position": 1
}
],
"id": "message-Test Vendor/Test Project:first",
"key": "first",
"projectId": "project-Test Vendor/Test Project",
"translations": {
"en-US": [
{
"default": {
"owner": "414d6352-c26b-493e-835e-3f0cf37f1f3c",
"text": "first"
}
}
]
},
"type": "message",
"vendorId": "vendor-Test Vendor"
}
And I want, as an example, to find all messages that are classified with a "documentPath" of "Test Vendor/Test Project/Ordered".
I use this query:
SELECT message.*
FROM couchlate message UNNEST message.classifications classification
WHERE classification.documentPath = "Test Vendor/Test Project/Ordered"
AND message.type="message"
ORDER BY classification.position
But I'm very surprised that the query takes 2 seconds to execute!
Looking at the query execution plan, it seems that couchbase is looping over all the messages and then filtering on "documentPath".
I'd like it to first filter on "documentPath" (because there are in reality only 2 documentPaths matching my query) and then find the messages.
I've tried to create an index on "classifications" but it did not change anything.
Is there something wrong with my index setup, or should I structure my data differently to get fast results?
I'm using couchbase 4.5 beta if that matters.
Your query filters on the documentPath field, so an index on classifications doesn't actually help. You need to create an array index on the documentPath field itself using the new array index syntax on Couchbase 4.5:
CREATE INDEX ix_documentPath ON myBucket ( DISTINCT ARRAY c.documentPath FOR c IN classifications END ) ;
Then you can query on documentPath with a query like this:
SELECT * FROM myBucket WHERE ANY c IN classifications SATISFIES c.documentPath = "your path here" END ;
Add EXPLAIN to the start of the query to see the execution plan and confirm that it is indeed using the index ix_documentPath.
More details and examples here: http://developer.couchbase.com/documentation/server/4.5-dp/indexing-arrays.html
