DynamoDB: How to design and query multiple fields

I have an item like this:
{
  "date": "2019-10-05",
  "id": "2",
  "serviceId": "1",
  "time": {
    "endTime": "1300",
    "startTime": "1330"
  }
}
Right now the way I designed this is like so:
primary key --> id
global secondary index --> partition key: serviceId
                       --> sort key: date
With this design, currently:
* I can query by id
* I can query by serviceId and a range of dates
I'd like to be able to query such that I can retrieve all items where
* serviceId = 1 AND
* date = "yyyy-mm-dd" AND
* time = { "endTime": "1300", "startTime": "1330" }
I'd still like to be able to query based on the two previous conditions (query by id, and query by serviceId and range of dates).
Is there a way to do this? One way I was thinking of is to create a new field and use it as an index, e.g. combine all the data:
combinedField: "1_yyyy-mm-dd_1300_1330"
Make that the partition key for a global secondary index, and just query it like that.
I'm just not sure whether this is the way to do it, or if there's a better or best-practice way to do this?
Thank you

You could either use a FilterExpression or a composite sort key.
FilterExpression
Here you retrieve the items from the GSI you described by specifying 'serviceId' and 'date' in the key condition, and then specifying time.startTime and time.endTime in the FilterExpression. Sample Python code using boto3 would be as follows:
from boto3.dynamodb.conditions import Key, Attr

response = table.query(
    IndexName='serviceId-date-index',  # assumed name for the GSI described above
    KeyConditionExpression=Key('serviceId').eq('1') & Key('date').eq('2019-10-05'),
    FilterExpression=Attr('time.endTime').eq('1300') & Attr('time.startTime').eq('1330')
)
The drawback with this method is that all items matching the key condition are read first and only then filtered, so you are charged for everything the key condition matches, not just for what is returned.
E.g. if 1000 items have 'serviceId' as '1' and 'date' as '2019-10-05' but only 10 of them have 'time.startTime' as '1330', you will still be charged for reading the 1000 items even though only 10 items are returned after the FilterExpression is applied.
Composite Sort Key
I believe this is the method you mentioned in the question. Here you will need to create an attribute of the form
'yyyy-mm-dd_startTime_endTime'
and use it as the sort key in your GSI. Now your items will look like this:
{
  "date": "2019-10-05",
  "id": "2",
  "serviceId": "1",
  "time": {
    "endTime": "1300",
    "startTime": "1330"
  },
  "date_time": "2019-10-05_1330_1300"
}
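For example, a minimal sketch of writing an item with the composite attribute (the table name and helper function are illustrative, not from the question):

import boto3

# Hypothetical table name, for illustration only.
table = boto3.resource('dynamodb').Table('Services')

def put_with_composite_key(item):
    # Build the 'yyyy-mm-dd_startTime_endTime' sort key from the existing fields.
    item['date_time'] = '{}_{}_{}'.format(
        item['date'], item['time']['startTime'], item['time']['endTime'])
    table.put_item(Item=item)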
Your GSI will have 'serviceId' as partition key and 'date_time' as sort key. Now you will be able to query a date range as:
response = table.query(
    IndexName='serviceId-date_time-index',  # assumed GSI name
    KeyConditionExpression=Key('serviceId').eq('1') &
        # Upper bound '2019-10-06': composite values such as '2019-10-05_1330_1300'
        # sort after the bare date '2019-10-05', so they would fall outside
        # between('2019-07-05', '2019-10-05')
        Key('date_time').between('2019-07-05', '2019-10-06')
)
For the query where date, start and end time are all specified, you can query as:
response = table.query(
    IndexName='serviceId-date_time-index',  # assumed GSI name
    KeyConditionExpression=Key('serviceId').eq('1') & Key('date_time').eq('2019-10-05_1330_1300')
)
This approach won't work if you need a range of dates and the start and end time together, i.e. you won't be able to express "items in a particular date range with a specific start and end time" as a single key condition. In that case you would have to combine the range query with a FilterExpression, as sketched below.
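A minimal sketch of that combination (same assumed GSI name as above; the read-cost caveat from the FilterExpression section applies, because the time filter runs only after the items are read):

from boto3.dynamodb.conditions import Key, Attr

response = table.query(
    IndexName='serviceId-date_time-index',  # assumed GSI name
    KeyConditionExpression=Key('serviceId').eq('1') &
        Key('date_time').between('2019-07-05', '2019-10-06'),
    FilterExpression=Attr('time.startTime').eq('1330') & Attr('time.endTime').eq('1300')
)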

Yes, the solution you suggested (add a new field which is the combination of the fields, and define a GSI on it) is the standard way to achieve that. You need to make sure that the character you use for concatenation is unique, i.e., it cannot appear in any of the individual fields you combine.
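To see why the separator matters: if '_' can appear inside a field, then 'a_b' + 'c' and 'a' + 'b_c' both combine to 'a_b_c'. A small illustrative guard (the '#' separator and the helper name are assumptions, not from the answer):

SEP = '#'  # assumption: '#' never appears in any of the fields being combined

def composite_key(*parts):
    parts = [str(p) for p in parts]
    # Refuse components containing the separator, so keys stay unambiguous.
    if any(SEP in p for p in parts):
        raise ValueError('separator found inside a key component')
    return SEP.join(parts)

# composite_key('1', '2019-10-05', '1330', '1300') -> '1#2019-10-05#1330#1300'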

Related

Azure Data Factory - converting lookup result array

I'm pretty new to Azure Data Factory (ADF) and have stumbled into something I would have solved with a couple of lines of code.
Background
Main flow:
Lookup Activity fetching an array of IDs to process
ForEach Activity looping over the input array and using a Copy Activity to pull data from a REST API and store it in a database
Step #1 results in an array containing IDs:
{
  "count": 10000,
  "value": [
    { "id": "799128160" },
    { "id": "817379102" },
    { "id": "859061172" },
    ... many more ...
  ]
}
Step #2: When the lookup returns a lot of IDs, the individual REST calls take a lot of time. The REST API supports batching IDs using a comma-separated input.
The question
How can I convert the array from the input into a new array with comma-separated fields? This will reduce the number of activities and the time to run.
Expecting something like this:
{
  "count": 1000,
  "value": [
    { "ids": "799128160,817379102,859061172,...." },
    { "ids": "n,n,n,n,n,n,n,n,n,n,n,n,...." },
    ... many more ...
  ]
}
EDIT 1 - 19th Dec 22
Using an Until Activity and keeping track of positions, I managed to do it in plain ADF. It would have been nice if this could have been done with some simple array manipulation in a code snippet.
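For comparison, the equivalent array manipulation outside ADF really is only a few lines, e.g. in Python (a sketch; the batch size is illustrative and should match what the REST API accepts):

ids = ["799128160", "817379102", "859061172"]  # ... many more ...

BATCH_SIZE = 10  # illustrative

# Chunk the id list and join each chunk into a comma-separated string.
batches = [
    {"ids": ",".join(ids[i:i + BATCH_SIZE])}
    for i in range(0, len(ids), BATCH_SIZE)
]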
This kind of manipulation can be done with a Data Flow, taking the lookup output above as the sample input.
First, in the Data Flow, I added a Surrogate Key transformation after the source; say the new key field is 'SrcKey'.
Add an Aggregate transformation where you group by the remainder mod(SrcKey, 3). This will group rows with the same remainder into the same bucket.
Add a collect column in the same Aggregate to collect the ids into an array, with the expression trim(toString(collect(id)), '[]').
Finally, store the output in a single file in blob storage.

Update a subtotal field every time an item is added to / deleted from a different collection

I need your help and guidance on the best way to update a field every time an item is added to, or deleted from, a table.
In my main collection I have a field called site, which is populated with one of these values => a, b or c.
Because I was not able to do this with GraphQL, the quick and easy fix was to have a subTotal collection that stores the subtotal for each of these 3 values. Now I need to keep this collection updated, either with lifecycle hooks or with a cron job, but I don't know how to do it with either of these 2 methods.
I am looking for something like this:
Add a new item with value a in the field site, then do a +1 on the field "a" in the subTotal collection.
The same applies when deleting an item with value "c" in the field site: do a -1 on the field "c" in the subTotal collection.
POST Old.site.a ==> New.a +1 | DEL Old.site.b ==> New.b -1
This new collection stores the subtotals of each category I have. I did this just because I was not able to retrieve the subtotals with a GraphQL query; please see here ==> GraphQL Subcategory Count Aggregated Query
What I was looking for was a query that would retrieve all the subtotals of the site field from the webs collection in a format like this:
{
  "data": {
    "webs": {
      "meta": {
        "pagination": {
          "a": { "total": 498 },
          "b": { "total": 3198 },
          "c": { "total": 998 }
        }
      }
    }
  }
}
(one subtotal for each value of the site field)
I know it would be a big strain on Strapi to update these fields on every POST or DELETE, so maybe a cron job that runs every 5 minutes or so would suffice and be the better idea.
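The recount such a job would run is simple. As a language-neutral sketch in Python (Strapi cron tasks are actually JavaScript; 'fetch_sites' and 'save_subtotals' are hypothetical helpers standing in for the Strapi queries):

from collections import Counter

def recount(fetch_sites, save_subtotals):
    # fetch_sites() -> iterable of 'site' values, e.g. ['a', 'b', 'a', 'c']
    totals = Counter(fetch_sites())
    # save_subtotals({'a': 498, 'b': 3198, 'c': 998}) persists the subtotals
    save_subtotals(dict(totals))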
Could you please help me?
I would owe you a lot!

Postgres JSONB Query and Index on Nested String Array

I have some trouble wrapping my head around how to formulate queries and provide proper indices for the following situation. I have customer entities represented in JSON like this (only relevant properties are retained):
{
  "id": "50000",
  "address": [
    {
      "line": [
        "2nd Main Street",
        "123 Harris Plaza"
      ],
      "city": "Boston",
      "state": "Massachusetts",
      "country": "US"
    },
    {
      "line": [
        "1st Av."
      ],
      "city": "Jamestown",
      "state": "Massachusetts",
      "country": "US"
    }
  ]
}
The customers are stored in the following customer table:
CREATE TABLE Customer (
  id BIGSERIAL PRIMARY KEY,
  resource JSONB
);
I manage to do simple queries on the resource column, e.g. a projection query like this works (retrieve all lower-case address lines for cities starting with "bo"):
SELECT LOWER(jsonb_array_elements_text(jsonb_array_elements(c.resource #> '{address}') #> '{line}'))
FROM Customer c, jsonb_array_elements(c.resource #> '{address}') a
WHERE LOWER(a->>'city') LIKE 'bo%';
I have trouble doing the following: my goal is to query all customers that have at least one address line beginning with "12". Case insensitivity is a requirement for my use case. The example customer would match my query, as the first address object has an address line starting with "12". Please note that "line" is an Array of JSON Strings, not complex objects. So far the closest thing I could come up with is this:
SELECT c.resource
FROM Customer c, jsonb_array_elements(c.resource #> '{address}') a
WHERE a->'line' ?| array['123 Harris Plaza'];
Obviously this is not a case-insensitive LIKE query. Any help/pointers on how to formulate both query and accompanying GIN index are greatly appreciated. My first query already selects all address lines as text, so maybe this could be used in a GIN index?
I'm using Postgres 9.5, but am happy to upgrade if this can only be achieved in more recent Postgres versions.
While GIN indexes have machinery to support prefix matching, this machinery is only hooked up for tsvectors. array_ops does not have it hooked up, nor do the jsonb operator classes. So unless you want to create a new operator class/family (or normalize your data into separate tables), you will have to shoehorn your data into a tsvector.
Here is a crude way to do that, which doesn't account for the possibility that an address line might contain literal single quotes or perhaps other meaningful characters:
create function addressline_tsvector(jsonb) returns tsvector immutable language SQL as $$
  select string_agg('''' || lower(value) || '''', ' ')::tsvector
  from jsonb_array_elements($1->'address') a(a),
       jsonb_array_elements_text(a->'line')
$$;

create index on customer using gin (addressline_tsvector(resource));

select * from customer where addressline_tsvector(resource) @@ lower('''2nd Main'':*')::tsquery;
Given that your example table only has one row, the index will probably not actually be used unless you set enable_seqscan = off first.

Postgres SQL: add key to all JSON array elements

In my database I have a table with a column called items; this column has jsonb type, and almost every record has data like this one:
{"items": [{"id": "item-id", "sku": "some-sku", "quantity": 1, "master_sku": "some-master-sku"}]}
I need to create a migration which adds a type to every item, so it should look like this:
{"items": [{"id": "item-id", "sku": "some-sku", "quantity": 1, "master_sku": "some-master-sku", "type": "product"}]}
I must add type to every record, and every record looks the same.
The problem is that I have about a million records in my table, and I cannot iterate over every item and add the new type because it could take too long and my deploy script may crash.
How can I do this in the simplest way?
As long as the format/content is always more or less the same, the fastest way to do it would probably be as a string operation, something like this:
UPDATE your_table
SET your_field = REGEXP_REPLACE(your_field::TEXT, '\}(.+?)', ', "type": "product"}\1', 'g')::JSONB
WHERE your_field IS NOT NULL;
Example: https://www.db-fiddle.com/f/fsHoFKz9szpmV5aF4dt7r/1
This just checks for any } character which is followed by something (so we know it's not the final closing brace of the document) and replaces it with the missing key/value pair, plus whatever character was following.
Of course this performs practically no validation, such as whether that key/value already exists. Where you want to be between performance and correctness depends on what you know about your data.
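If it helps, the same pattern can be sanity-checked offline before running the migration, e.g. in Python (a sketch; the sample string is the record format from the question):

import re

row = '{"items": [{"id": "item-id", "sku": "some-sku", "quantity": 1, "master_sku": "some-master-sku"}]}'

# Same idea as the REGEXP_REPLACE above: any '}' followed by another
# character gets the key/value pair spliced in before it.
patched = re.sub(r'\}(.+?)', r', "type": "product"}\1', row)
print(patched)  # ... "master_sku": "some-master-sku", "type": "product"}]}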

Getting the latest (timestamp-wise) value from a Cloudant query

I have a Cloudant DB where each document looks like:
{
  "_id": "2015-11-20_attr_00",
  "key": "attr",
  "value": "00",
  "employeeCount": 12,
  "timestamp": "2015-11-20T18:16:05.366Z",
  "epocTimestampMillis": 1448043365366,
  "docType": "attrCounts"
}
For a given attribute there is an employee count, and as you can see I have a record for the same attribute every day. I am trying to create a view or index that will give me the latest record for each attribute. Meaning: if I inserted a record on 2015-10-30 and another on 2015-11-10, then the one returned to me should be the employee count for the record with timestamp 2015-11-10.
I have tried a view, but I am getting all the entries for each attribute, not just the latest. I did not look at indexes because I thought they do not get pre-calculated. I will be querying this from the client side, so having it pre-calculated (like views are) is important.
Any guidance would be most appreciated. Thank you.
I created a test database you can see here:
https://examples.cloudant.com/latestdocs/_all_docs?include_docs=true
Just make sure that when you insert your JSON documents into Cloudant (or CouchDB), your timestamps are not strings but JavaScript Date objects.
I built a search index like this (name the design doc "summary" and the search index "latest"):
function (doc) {
  if (doc.docType == "totalEmployeeCounts" && doc.key == "div") {
    index("division", doc.value, {"store": true});
    index("timestamp", doc.timestamp, {"store": true});
  }
}
Then here's a query that will return only the latest record for each division. Note that the limit value will apply to each group, so with limit=1, if there are 4 groups you will get 4 documents not 1.
https://examples.cloudant.com/latestdocs/_design/summary/_search/latest?q=*:*&limit=1&group_field=division&include_docs=true&sort_field=-timestamp
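From the client side that query is just an HTTP GET, e.g. with Python's requests (a sketch reusing the URL and parameters above):

import requests

# One newest document per division, exactly as in the URL above.
resp = requests.get(
    'https://examples.cloudant.com/latestdocs/_design/summary/_search/latest',
    params={
        'q': '*:*',
        'limit': 1,
        'group_field': 'division',
        'include_docs': 'true',
        'sort_field': '-timestamp',
    },
)
print(resp.json())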
Indexing a timestamp as a string is not recommended.
Reference:
https://cloudant.com/blog/defensive-coding-in-mapindex-functions/#.VvRVxtIrJaT
I had the same problem. I converted the timestamp value to milliseconds (a number) and then indexed that value in the index function:
var millis = Date.parse(doc.timestamp);
index("millis", millis, {"store": false});
You can use the same query as Raj suggested, but with the 'millis' field instead of the timestamp.
