In my database I have a table with a column called items. The column has type jsonb, and almost every record has data like this one:
{"items": [{"id": "item-id", "sku": "some-sku", "quantity": 1, "master_sku": "some-master-sku"}]}
I need to create a migration which adds a type to every item, so it should look like this:
{"items": [{"id": "item-id", "sku": "some-sku", "quantity": 1, "master_sku": "some-master-sku", "type": "product"}]}
I must add the type to every record, and every record looks the same.
The problem is that I have about a million records in my table, and I cannot iterate over every item and add the new type, because that could take too long and my deploy script might crash.
How can I do that in the simplest way?
As long as the format/content is always more or less the same, the fastest way to do it would probably be as a string operation, something like this:
UPDATE your_table
SET your_field = REGEXP_REPLACE(your_field::TEXT, '\}(.+?)', ', "type": "product"}\1', 'g')::JSONB
WHERE your_field IS NOT NULL;
Example: https://www.db-fiddle.com/f/fsHoFKz9szpmV5aF4dt7r/1
This just looks for any } character that is followed by something (so we know it's not the final one, which closes the outer object containing "items"), and replaces it with the missing key/value, the } itself, and whatever character was following.
Of course this performs practically no validation, such as whether that key/value already exists. Where you want to be between performance and correctness depends on what you know about your data.
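To make the trade-off concrete, here is a minimal Python sketch of the same idea, run against the sample value from the question. It is only an illustration outside the database: the function names are made up for this example, and Python's `re` module stands in for PostgreSQL's REGEXP_REPLACE, which may differ in edge cases. The second function shows what a validating alternative could look like, since it parses the JSON and skips items that already have a type.

```python
import json
import re


def add_type_with_regex(text):
    # Same pattern as the SQL above: a "}" followed by at least one character
    # is rewritten so the new key/value lands just before that "}".
    return re.sub(r'\}(.+?)', r', "type": "product"}\1', text)


def add_type_with_json(text):
    # Slower but safer: parse, add the key only where it is missing, serialize.
    doc = json.loads(text)
    for item in doc.get("items", []):
        item.setdefault("type", "product")
    return json.dumps(doc)


sample = (
    '{"items": [{"id": "item-id", "sku": "some-sku", '
    '"quantity": 1, "master_sku": "some-master-sku"}]}'
)

print(add_type_with_regex(sample))
print(add_type_with_json(sample))
```

Note that applying the regex version twice to the same value inserts the key a second time, which is exactly the kind of missing validation mentioned above.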
I'm pretty new to Azure Data Factory (ADF) and have stumbled into something I would have solved with a couple of lines of code.
Background
Main flow:
Lookup Activity fetching an array of IDs to process
ForEach Activity looping over the input array and using a Copy Activity to pull data from a REST API and store it in a database
Step #1 would result in an array containing IDs
{
"count": 10000,
"value": [
{
"id": "799128160"
},
{
"id": "817379102"
},
{
"id": "859061172"
},
... many more...
Step #2 When the lookup returns a lot of IDs, the individual REST calls take a lot of time. The REST API supports batching IDs using a comma-separated input.
The question
How can I convert the array from the input into a new array with comma-separated fields? This would reduce the number of Activities and the time to run.
Expecting something like this:
{
"count": 1000,
"value": [
{
"ids": "799128160,817379102,859061172,...."
},
{
"ids": "n,n,n,n,n,n,n,n,n,n,n,n,...."
}
... many more...
EDIT 1 - 19th Dec 22
Using an "Until Activity" and keeping track of positions, I managed to do this in plain ADF. It would be nice if this could have been done using some simple array manipulation in a code snippet.
The ideal approach might be to do the manipulation with a Dataflow.
My sample input:
First, I created a Dataflow and added a Surrogate Key transformation after the source. Say the new key field is 'SrcKey'.
Data preview of Surrogate key 1
Add an aggregate where you group by mod(SrcKey, 3). This will group rows with the same remainder into the same bucket.
Add a collect column in the same aggregate to gather the IDs into an array, using the expression trim(toString(collect(id)),'[]').
Data preview of Aggregate 1
Store output in single file in blob storage.
OUTPUT
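For comparison, the grouping described in the steps above can be sketched in a few lines of Python. This is only an illustration of the logic, not something ADF runs; the function names `bucket_ids` and `batch_ids` are hypothetical. The first function mirrors the Dataflow (surrogate key starting at 1, bucket by remainder, join with commas), so it produces a fixed number of buckets. The second one produces batches of a fixed size instead, which may be closer to what the question asks for.

```python
from collections import defaultdict


def bucket_ids(ids, buckets):
    # Mirror of the Dataflow: SrcKey is a 1-based surrogate key, and rows with
    # the same remainder (SrcKey mod buckets) are collected into one string.
    grouped = defaultdict(list)
    for src_key, id_value in enumerate(ids, start=1):
        grouped[src_key % buckets].append(id_value)
    return {remainder: ",".join(values) for remainder, values in grouped.items()}


def batch_ids(ids, batch_size):
    # Alternative: consecutive batches of at most batch_size IDs each.
    return [
        ",".join(ids[start:start + batch_size])
        for start in range(0, len(ids), batch_size)
    ]


ids = ["799128160", "817379102", "859061172"]
print(bucket_ids(ids, 2))
print(batch_ids(ids, 2))
```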
I usually use a jsonb field to store array data.
For example, to store customers' barcode info, I create a table like this:
create table customers(fcustomerid bigint, fcodes jsonb);
Each customer has one row, and all of its barcode info is stored in the fcodes field, like this:
[
{
"barcode":"000000001",
"codeid":1,
"product":"Coca Cola",
"createdate":"2021-01-19",
"lottorry":true,
"lottdate":"2021-01-20",
"bonus":50
},
{
"barcode":"000000002",
"codeid":2,
"product":"Coca Cola",
"createdate":"2021-01-19",
"lottorry":false,
"lottdate":"",
"bonus":0
}
...
{
"barcode":"000500000",
"codeid":500000,
"product":"Pepsi Cola",
"createdate":"2021-01-19",
"lottorry":false,
"lottdate":"",
"bonus":0
}
]
The jsonb array may store millions of barcode objects with the same structure. Perhaps this is not a good idea, but when I have thousands of customers, I can store all the data in one table: each customer has one row, and all of its data is stored in one field, which looks very concise and is easy to manage.
For this kind of application scenario, how can I efficiently insert, modify, or query the data?
I can use jsonb_insert to insert one object, like this:
update customers
set fcodes=jsonb_insert(fcodes,'{-1}','{...}'::jsonb)
where fcustomerid=999;
When I want to modify an object, I find it a little difficult, because I need to know the index of the object first. If I use the incremental key codeid as the array index, things look easy. I can use jsonb_set, like this:
update customers
set fcodes=jsonb_set(fcodes,concat('{',(mycodeid-1)::text,',lottorry}')::text[],'true'::jsonb)
where fcustomerid=999;
But if I want to query the objects in the jsonb array by createdate, bonus, lottorry, or product, I have to use a jsonpath expression, like this:
select jsonb_path_query_array(fcodes,'$[*] ? (@.product == "Pepsi Cola")')
from customers
where fcustomerid=999;
or like:
select jsonb_path_query_array(fcodes,'$[*] ? (@.lottdate.datetime() >= "2021-01-01".datetime() && @.lottdate.datetime() <= "2021-01-31".datetime())')
from customers
where fcustomerid=999;
A jsonb index looks useful, but it seems to help when selecting between different rows, and my operations mostly work within one jsonb field of a single row.
I am very worried about the efficiency. With millions of objects stored in one jsonb field of one row, is this a good idea? And how can I improve the efficiency in this scenario, especially for queries?
You are right to worry. With a huge JSON like that, you will never get good performance.
Your data don't need JSON at all. Create a table that stores a single barcode and has a foreign key reference to customers. Then everything will be simple and efficient.
Using JSON in the database is almost always the wrong choice, judging from the questions in this forum.
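As a rough illustration of the normalized layout suggested above, here is a small runnable sketch. It uses Python's built-in sqlite3 module purely as a stand-in so the example is self-contained; the table name `barcodes`, the column types, and the index are assumptions for this sketch, not a tuned PostgreSQL schema. The column names are taken from the JSON in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE customers (fcustomerid INTEGER PRIMARY KEY);

    -- One row per barcode, linked back to its customer.
    CREATE TABLE barcodes (
        fcustomerid INTEGER NOT NULL REFERENCES customers (fcustomerid),
        codeid      INTEGER NOT NULL,
        barcode     TEXT    NOT NULL,
        product     TEXT    NOT NULL,
        createdate  TEXT    NOT NULL,
        lottorry    INTEGER NOT NULL DEFAULT 0,
        lottdate    TEXT,
        bonus       INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (fcustomerid, codeid)
    );

    -- An ordinary index now serves the per-customer product lookup.
    CREATE INDEX barcodes_customer_product ON barcodes (fcustomerid, product);
    """
)

conn.execute("INSERT INTO customers (fcustomerid) VALUES (999)")
conn.executemany(
    "INSERT INTO barcodes VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (999, 1, "000000001", "Coca Cola", "2021-01-19", 1, "2021-01-20", 50),
        (999, 2, "000000002", "Coca Cola", "2021-01-19", 0, None, 0),
        (999, 500000, "000500000", "Pepsi Cola", "2021-01-19", 0, None, 0),
    ],
)

# Modifying one barcode is a plain UPDATE; no array index is needed.
conn.execute(
    "UPDATE barcodes SET lottorry = 1 WHERE fcustomerid = ? AND codeid = ?",
    (999, 2),
)

# Querying by product is a plain WHERE clause.
pepsi = conn.execute(
    "SELECT barcode FROM barcodes WHERE fcustomerid = ? AND product = ?",
    (999, "Pepsi Cola"),
).fetchall()
print(pepsi)
```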
I have a JSON file which I would like to convert (say, to CSV) by expanding one of the fields into columns.
I used explode() for this, but it gives an error if even one of the many records does not have the exact schema.
Input File:
{"place": "KA",
"id": "200",
"swversion": "v.002",
"events":[ {"time": "2020-05-23T22:34:32.770Z", "eid": 24, "app": "testing", "state": 0} ]}
{"place": "AP",
"id": "100",
"swversion": "v.001",
"events":[[]]}
In the above, I want to expand the fields of "events" so that they become columns.
In the ideal case, "events" is an Array of StructType.
Expected Output File Columns:
*place, id, swversion, time, eid, app, state*
For this, I have used the explode() available in pyspark.sql, but because the second record in my input file does not follow the schema where "events" is an Array of StructType, explode() fails with an error.
The code I have used to explode:
from pyspark.sql.functions import explode

df = spark.read.json("InputFile")
ndf = df.withColumn("event", explode("events")).drop("events")
ndf.select("place", "id", "swversion", "event.*")
The last line fails because of the second record in my input file.
It should not be too difficult for explode() to handle this, I believe.
Can you suggest how to avoid the "Cannot expand star types" error?
If I change "events":[[]] to "events":[{}], explode() works fine, since it is again an Array of StructType, but since I have no control over the input data, I need to handle this.
I have some trouble wrapping my head around how to formulate queries and provide proper indices for the following situation. I have customer entities represented in JSON like this (only relevant properties are retained):
{
"id": "50000",
"address": [
{
"line": [
"2nd Main Street",
"123 Harris Plaza"
],
"city": "Boston",
"state": "Massachusetts",
"country": "US"
},
{
"line": [
"1st Av."
],
"city": "Jamestown",
"state": "Massachusetts",
"country": "US"
}
]
}
The customers are stored in the following customer table:
CREATE TABLE Customer (
id BIGSERIAL PRIMARY KEY,
resource JSONB
);
I manage to do simple queries on the resource column, e.g. a projection query like this works (retrieve all lower-case address lines for cities starting with "bo"):
SELECT LOWER(jsonb_array_elements_text(jsonb_array_elements(c.resource#>'{address}') #> '{line}')) FROM Customer c, jsonb_array_elements(c.resource #> '{address}') a WHERE LOWER(a->>'city') LIKE 'bo%';
I have trouble doing the following: my goal is to query all customers that have at least one address line beginning with "12". Case insensitivity is a requirement for my use case. The example customer would match my query, as the first address object has an address line starting with "12". Please note that "line" is an Array of JSON Strings, not complex objects. So far the closest thing I could come up with is this:
SELECT c.resource FROM Customer c, jsonb_array_elements(c.resource #> '{address}') a WHERE a->'line' ?| array['123 Harris Plaza'];
Obviously this is not a case-insensitive LIKE query. Any help/pointers on how to formulate both query and accompanying GIN index are greatly appreciated. My first query already selects all address lines as text, so maybe this could be used in a GIN index?
I'm using Postgres 9.5, but am happy to upgrade if this can only be achieved in more recent Postgres versions.
While GIN indexes have machinery to support prefix matching, this machinery is only hooked up for tsvectors. array_ops does not have it hooked up, nor does jsonb_ops or jsonb_path_ops. So unless you want to create new operator classes/families (or normalize your data into separate tables), you will have to shoe-horn your data into a tsvector.
Here is a crude way to do that, which doesn't account for the possibility that an address line might contain literal single quotes or perhaps other meaningful characters:
create function addressline_tsvector(jsonb) returns tsvector immutable language SQL as $$
select string_agg('''' || lower(value) || '''', ' ')::tsvector
from jsonb_array_elements($1->'address') a(a),
jsonb_array_elements_text(a->'line')
$$;
create index on customer using gin (addressline_tsvector(resource));
select * from customer where addressline_tsvector(resource) @@ lower('''2nd Main'':*')::tsquery;
Given that your example table only has one row, the index will probably not actually be used unless you set enable_seqscan = off first.
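To show what the crude function above glosses over, here is a small Python sketch that builds the same kind of quoted-lexeme string on the client side. It is an illustration only: the function name is hypothetical, and the escaping rule (doubling embedded single quotes and backslashes inside a quoted lexeme) is my reading of the tsvector text format, so verify it against your Postgres version before relying on it.

```python
def addressline_tsvector_literal(resource):
    # Collect every address line, lower-case it, and wrap it in single quotes
    # so each whole line becomes one lexeme, like the SQL function does.
    lexemes = []
    for address in resource.get("address", []):
        for line in address.get("line", []):
            # Assumed escaping: double backslashes and single quotes.
            escaped = line.lower().replace("\\", "\\\\").replace("'", "''")
            lexemes.append("'" + escaped + "'")
    return " ".join(lexemes)


customer = {
    "id": "50000",
    "address": [
        {"line": ["2nd Main Street", "123 Harris Plaza"], "city": "Boston"},
        {"line": ["1st Av."], "city": "Jamestown"},
    ],
}

print(addressline_tsvector_literal(customer))
```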
I have an item like this
{
"date": "2019-10-05",
"id": "2",
"serviceId": "1",
"time": {
"endTime": "1300",
"startTime": "1330"
}
}
Right now the way I design this is like so:
primary key --> id
Global secondary index --> primary key : serviceId
--> sort key : date
With the way I designed as of now,
* I can query the id
* I can query serviceId and range of date
I'd like to be able to query such that I can retrieve all items where
* serviceId = 1 AND
* date = "yyyy-mm-dd" AND
* time = {
"endTime": "1300",
"startTime": "1330"
}
I'd still like to be able to query based on the two previous conditions (query by id, and query by serviceId and a range of dates).
Is there a way to do this? One way I was thinking of is to create a new field and use it as an index, e.g. combine all the data like so:
combinedField: "1_yyyy-mm-dd_1300_1330"
make that the primary key of a global secondary index, and just query it like that.
I'm just not sure whether this is the way to do it or whether there's a better, best-practice way.
Thank you
You could either use FilterExpression or composite sort keys.
FilterExpression
Here you could retrieve the items from the GSI you described by specifying 'serviceId' and 'date' in the key condition, and then filtering on time.startTime and time.endTime in the 'FilterExpression'. The sample Python code using boto3 would be as follows:
from boto3.dynamodb.conditions import Key, Attr

# serviceId is stored as a string in the sample item, so compare against '1'.
# When querying the GSI, also pass its name via IndexName.
response = table.query(
    KeyConditionExpression=Key('serviceId').eq('1') & Key('date').eq("2019-10-05"),
    FilterExpression=Attr('time.endTime').eq('1300') & Attr('time.startTime').eq('1330')
)
The drawback of this method is that all items matching the key condition are read first, and only then are the results filtered. So you are charged for everything the key condition matches, not for what is returned.
E.g. if 1000 items have 'serviceId' 1 and 'date' '2019-10-05' but only 10 of them have 'time.startTime' 1330, you will still be charged for reading the 1000 items even though only 10 items are returned after the FilterExpression is applied.
Composite Sort Key
I believe this is the method you mentioned in the question. Here you will need to create an attribute of the form
'yyyy-mm-dd_startTime_endTime'
and use this as the sort key in your GSI. Now your items will look like this:
{ "date": "2019-10-05",
"id": "2",
"serviceId": "1",
"time": {
"endTime": "1300",
"startTime": "1330"
},
"date_time":"2019-10-05_1330_1300"
}
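Your GSI will have 'serviceId' as partition key and 'date_time' as sort key. Now you will be able to query a date range as:
# The bounds are compared as strings: '2019-10-05_1330_1300' sorts after
# '2019-10-05', so extend the upper bound if that day should be included.
response = table.query(
    KeyConditionExpression=Key('serviceId').eq('1') & Key('date_time').between('2019-07-05','2019-10-05')
)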
For the query where date, start and end time are specified, you can query as:
response = table.query(
    KeyConditionExpression=Key('serviceId').eq('1') & Key('date_time').eq('2019-10-05_1330_1300')
)
This approach won't work if you need range of dates and start and end time together ie. you won't be able to make a query for items in a particular date range containing a specific start and end time. In that case you would have to use FilterExpression.
Yes, the solution you suggested (add a new field which is the combination of the fields and define a GSI on it) is the standard way to achieve that. You need to make sure that the character you use for concatenation is unique, i.e., it cannot appear in any of the individual fields you combine.
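A minimal sketch of building such a combined attribute, with a check that the separator does not occur in any part, might look like this. The helper name `make_date_time_key` is hypothetical; the attribute name 'date_time' and the startTime-before-endTime order follow the composite sort key example earlier in this thread.

```python
SEPARATOR = "_"


def make_date_time_key(date, start_time, end_time, separator=SEPARATOR):
    # Refuse parts that contain the separator, otherwise two different
    # combinations of values could produce the same key.
    parts = (date, start_time, end_time)
    for part in parts:
        if separator in part:
            raise ValueError(f"separator {separator!r} found in {part!r}")
    return separator.join(parts)


item = {
    "date": "2019-10-05",
    "id": "2",
    "serviceId": "1",
    "time": {"endTime": "1300", "startTime": "1330"},
}

# Store the combined value on the item before writing it to the table.
item["date_time"] = make_date_time_key(
    item["date"], item["time"]["startTime"], item["time"]["endTime"]
)
print(item["date_time"])
```

Because the date comes first and is zero-padded, keys built this way still sort by date, which is what keeps the date-range query on the GSI possible.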