I need to convert a JSON payload into SQL tables while preserving the relationships established in the payload, so that I can later query the tables and recreate the original JSON structure.
For example:
{
"batchId": "batch1",
"payees" : [
{
"payeeId": "payee1",
"payments": [
{
"paymentId": "paymentId1",
"amount": 200,
"currency": "USD"
},
{
"paymentId": "paymentId2",
"amount": 200,
"currency": "YEN"
},
{
"paymentId": "paymentId3",
"amount": 200,
"currency": "EURO"
}
]
}
]
}
In the payload above, I have a batch whose payments are grouped by payee. At its core, it all boils down to a batch and its payments, but the payments can be grouped, as they are here by payee.
Note that the payload will not always follow this structure. Instead of being grouped by payee, the payments could be grouped by something else, such as currency. There could even be no grouping at all, just a root-level batch and an array of payments.
Are there conventions or rules I can follow to represent such data in relational tables? Thanks.
edit:
I am primarily looking to use Postgres and have looked into the jsonb feature that it provides for storing json data. However, I'm still struggling to figure out how/where (in terms of which table) to best store the grouping info.
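One possible starting point, offered as a sketch rather than an established convention: keep the batch and the payments as the core rows and record the grouping as data on each payment instead of as schema. The column names (`group_type`, `group_key`, `position`) and the `GROUP_ID_FIELDS` mapping below are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical mapping from a grouping array name to the id field of each
# group; extend it for other groupings (for example, by currency).
GROUP_ID_FIELDS = {"payees": "payeeId"}


def flatten_batch(payload):
    """Return (batch_row, payment_rows) for one batch payload.

    Each payment row carries group_type/group_key so the nesting can be
    rebuilt later; both are None when payments sit directly under the batch.
    """
    batch_row = {"batch_id": payload["batchId"]}
    payment_rows = []

    def add_payments(payments, group_type, group_key):
        for position, payment in enumerate(payments):
            payment_rows.append({
                "batch_id": payload["batchId"],
                "group_type": group_type,
                "group_key": group_key,
                "position": position,
                "payment_id": payment["paymentId"],
                "amount": payment["amount"],
                "currency": payment["currency"],
            })

    # Ungrouped payments at the root of the batch, if any.
    add_payments(payload.get("payments", []), None, None)
    # Grouped payments, one pass per known grouping.
    for group_type, id_field in GROUP_ID_FIELDS.items():
        for group in payload.get(group_type, []):
            add_payments(group.get("payments", []), group_type, group[id_field])
    return batch_row, payment_rows


payload = json.loads("""
{"batchId": "batch1",
 "payees": [{"payeeId": "payee1",
             "payments": [
               {"paymentId": "paymentId1", "amount": 200, "currency": "USD"},
               {"paymentId": "paymentId2", "amount": 200, "currency": "YEN"}]}]}
""")
batch_row, payment_rows = flatten_batch(payload)
```

With rows shaped like this, rebuilding the JSON would amount to grouping the payment rows by `group_type` and `group_key` and ordering them by `position`.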
I'm pretty new to Azure Data Factory (ADF) and have stumbled onto something I would normally have solved with a couple of lines of code.
Background
Main flow:
A Lookup Activity fetches an array of IDs to process
A ForEach Activity loops over the input array and uses a Copy Activity to pull data from a REST API and store it in a database
Step #1 results in an array containing IDs
{
"count": 10000,
"value": [
{
"id": "799128160"
},
{
"id": "817379102"
},
{
"id": "859061172"
},
... many more...
Step #2: When the lookup returns a lot of IDs, the individual REST calls take a lot of time. The REST API supports batching IDs using a comma-separated input.
The question
How can I convert the input array into a new array whose elements are comma-separated batches of IDs? This would reduce the number of activities and the run time.
I'm expecting something like this:
{
"count": 1000,
"value": [
{
"ids": "799128160,817379102,859061172,...."
},
{
"ids": "n,n,n,n,n,n,n,n,n,n,n,n,...."
}
... many more...
EDIT 1 - 19th Dec 22
Using an "Until Activity" and keeping track of positions, I managed to do this in plain ADF. It would be nice if it could have been done with some simple array manipulation in a code snippet.
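For reference, the array manipulation in question can be sketched in a few lines of Python. This is not ADF code; it only illustrates the target transformation, and the function name `batch_ids` and the `batch_size` parameter are hypothetical.

```python
def batch_ids(lookup_output, batch_size):
    """Group the ids from a Lookup-style result into comma-separated batches."""
    ids = [item["id"] for item in lookup_output["value"]]
    batches = [
        # Slice the id list into consecutive chunks of at most batch_size.
        {"ids": ",".join(ids[start:start + batch_size])}
        for start in range(0, len(ids), batch_size)
    ]
    return {"count": len(batches), "value": batches}


lookup_output = {
    "count": 3,
    "value": [{"id": "799128160"}, {"id": "817379102"}, {"id": "859061172"}],
}
result = batch_ids(lookup_output, batch_size=2)
```

The last batch is simply shorter when the number of IDs is not a multiple of `batch_size`.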
One way to do this manipulation is with a Dataflow:
My sample input:
First, I created a Dataflow and added a Surrogate Key transformation after the source to generate a key. Say the new key field is 'SrcKey'.
Data preview of Surrogate key 1
Add an aggregate where you group by mod(SrcKey, 3). This puts rows whose keys have the same remainder into the same bucket.
In the same aggregate, add a column that collects the ids into an array and flattens it to a string, using the expression trim(toString(collect(id)),'[]').
Data preview of Aggregate 1
Store output in single file in blob storage.
OUTPUT
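As a rough illustration of what the aggregate step computes, here is the same grouping in plain Python. It is only a sketch: it assumes the surrogate key starts at 1, and the last two IDs in the sample are made up to show more than one value per bucket.

```python
from collections import defaultdict


def group_by_remainder(ids, buckets=3):
    """Mimic: surrogate key, group by key mod `buckets`, collect ids."""
    grouped = defaultdict(list)
    # Assumption: the surrogate key counts up from 1.
    for src_key, id_value in enumerate(ids, start=1):
        grouped[src_key % buckets].append(id_value)
    # Join each bucket into one comma-separated string.
    return {remainder: ",".join(values) for remainder, values in grouped.items()}


# The last two ids are invented for the example.
ids = ["799128160", "817379102", "859061172", "100000001", "100000002"]
grouped = group_by_remainder(ids)
```

Note that grouping by remainder fixes the number of buckets (three here) rather than the batch size, so each bucket grows with the input. If a fixed batch size is needed, grouping by the integer quotient of the key and the batch size would be the analogous approach.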
As the title suggests I'm wondering how to create an effective index for GROUP BY queries in CosmosDB.
Say the documents look something like:
{
"pk": "12345",
"speed": 500
},
{
"pk": "6789",
"speed": 100
}
Doing a query to find out the SUM of the speed grouped by the partition key would look something like:
SELECT c.pk, SUM(c.speed) FROM c WHERE c.pk IN ('12345','6789') GROUP BY c.pk
With about 1.6 million documents, this query costs 1489.51 RUs. However, splitting it into two queries such as:
SELECT SUM(c.speed) FROM c WHERE c.pk = '12345'
SELECT SUM(c.speed) FROM c WHERE c.pk = '6789'
costs only ~2.8 RUs per query. Obviously the results need some post-processing to match the output of the GROUP BY query, but a total of 5.6 RUs compared to 1489 RUs makes it worth it.
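The post-processing could look roughly like the sketch below. To stay independent of any particular SDK, `run_query` is a hypothetical callable that takes a query string and a list of parameters and returns the result rows; `fake_run_query` is a stand-in over the two sample documents so the example runs on its own. The per-key query uses SELECT VALUE so that each result is a bare number.

```python
def sum_speed_by_pk(run_query, partition_keys):
    """Run one SUM query per partition key and merge the results
    into the shape a GROUP BY c.pk query would return."""
    results = []
    for pk in partition_keys:
        rows = run_query(
            "SELECT VALUE SUM(c.speed) FROM c WHERE c.pk = @pk",
            [{"name": "@pk", "value": pk}],
        )
        # One row per query; fall back to 0 if nothing comes back.
        total = rows[0] if rows else 0
        results.append({"pk": pk, "speed_sum": total})
    return results


# Stand-in for a real query call, backed by the sample documents.
FAKE_DOCS = [{"pk": "12345", "speed": 500}, {"pk": "6789", "speed": 100}]


def fake_run_query(query, parameters):
    pk = parameters[0]["value"]
    return [sum(doc["speed"] for doc in FAKE_DOCS if doc["pk"] == pk)]


totals = sum_speed_by_pk(fake_run_query, ["12345", "6789"])
```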
The indexing on the collection is as follows:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*"
}
],
"excludedPaths": [
{
"path": "/\"_etag\"/?"
}
],
"compositeIndexes": [
[
{
"path": "/pk",
"order": "ascending"
},
{
"path": "/speed",
"order": "ascending"
}
]
]
}
Am I missing something, or is GROUP BY really this much more expensive? Is there any indexing I can do to bring the cost down?
Thanks in advance!
Currently, GROUP BY does not yet use the index.
This is being worked on. I would revisit it towards the end of the year to verify whether it is supported.
This feature is now supported: the query engine in Azure Cosmos DB Core (SQL) API has a new system function and optimizations for a set of query operations, so that they make better use of the index.
In my application, I chose the document model, but I still have some questions.
Here is my example document:
{
"catalogs": {
"cat-id1": {
"name": "catalog-1",
"createdAt": 123,
"products": {
"pro-id1": {
"name": "product-1",
"createdAt": 321,
"ingredients": {}
},
"pro-id2": {
"name": "product-2",
"createdAt": 654,
"ingredients": {}
}
}
},
"cat-id2": {
"name": "catalog-2",
"createdAt": 456,
"products": {
"pro-id3": {
"name": "product-3",
"createdAt": 322,
"ingredients": {}
},
"pro-id4": {
"name": "product-4",
"createdAt": 655,
"ingredients": {}
}
}
}
}
}
But the ingredients in each product refer to another document:
{
"ingredients": {
"ing-id1": {},
"ing-id2": {}
}
}
Document Model has several benefits:
The schema is easy to change, e.g. if (!user.first_name) user.first_name = user.name.split(' ')[0]
No joins are needed; all the data can be fetched at once.
Also I know that:
On updates to a document, the entire document usually needs to be rewritten.
For these reasons, it is generally recommended that you keep documents fairly small and avoid writes that increase the size of a document.
The main idea is: which data model leads to simpler application code?
My questions are:
How large should I let a document become?
My application already has a relational DB. Should I combine the document model with the relational DB to reduce complexity?
Since you already have a relational database in use, I don't see a real benefit to using a document based DB as well.
Your database schema seems simple enough for a relational DB. If catalog entries differed greatly from one another, you might consider a document-based model, but this does not seem to be the case.
Therefore, my advice is to stick with a relational model.
I would design the model like this:
A table for each entity (catalog, product, ingredient), where each entry has a unique ID
A relation table for each n:m relationship (catalogProduct, productIngredient) that contains only the IDs of the entities in the relationship.
An example:
The ingredients ing1, ing2 and ing3 are stored in the table ingredient.
The products prod1 and prod2 are stored in product.
ing1 and ing2 are needed for prod1
ing2 and ing3 for prod2
Each entry in productIngredient stores the ID of an ingredient and the ID of the product it is used in:
prod1 : ing1
prod1 : ing2
prod2 : ing2
prod2 : ing3
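The example above can be sketched with an in-memory SQLite database. This is only an illustration of the relation-table idea: the column names (`id`, `productId`, `ingredientId`) are assumptions, and the catalog tables are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id TEXT PRIMARY KEY);
CREATE TABLE ingredient (id TEXT PRIMARY KEY);
-- Relation table: one row per (product, ingredient) pair.
CREATE TABLE productIngredient (
    productId TEXT NOT NULL REFERENCES product(id),
    ingredientId TEXT NOT NULL REFERENCES ingredient(id),
    PRIMARY KEY (productId, ingredientId)
);
""")
conn.executemany("INSERT INTO product VALUES (?)", [("prod1",), ("prod2",)])
conn.executemany(
    "INSERT INTO ingredient VALUES (?)", [("ing1",), ("ing2",), ("ing3",)]
)
conn.executemany(
    "INSERT INTO productIngredient VALUES (?, ?)",
    [("prod1", "ing1"), ("prod1", "ing2"), ("prod2", "ing2"), ("prod2", "ing3")],
)

# Look up the ingredients of prod2 through the relation table.
rows = conn.execute(
    "SELECT ingredientId FROM productIngredient "
    "WHERE productId = ? ORDER BY ingredientId",
    ("prod2",),
).fetchall()
prod2_ingredients = [row[0] for row in rows]
```

Because ing2 appears in two rows of productIngredient, it is stored once in ingredient and shared by both products.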
Browser type and version: GoogleChrome 67.0.3396.99
We are trying to train our model to identify values from multiple types of tables that contain different numbers of rows and columns. We extracted a text row to begin the training: first we configured our system types, and then we marked the entities and the relation "AllInOne". We are able to train 10 relations in a training set, but when the model is tested we see only 8 relations, even after creating other document sets for training and testing the model multiple times. Is there another way to associate the column value with the row values in a single relation, considering that there is no standard for the types of tables we are analyzing with the Discovery service?
We are expecting the discovery service response as the following:
"relations": [
{
"type": "AllInOne",
"sentence": "…",
"arguments": [
{
"entities": [
{
"text": "",
"type": "entity1"
}
]
},
{
"entities": [
{
"text": "",
"type": "entity2"
}
]
},
{
"entities": [
{
"text": "",
"type": "entity..n"
}
]
},
{ "..." }
]
}
The machine learning model trained in Watson Knowledge Studio targets unstructured natural language text. It may not be suitable for (semi-)structured formats such as tables, especially for relations.
I'm developing a system to store our translations using Couchbase.
I have about 15,000 entries in my bucket that look like this:
{
"classifications": [
{
"documentPath": "Test Vendor/Test Project/Ordered",
"position": 1
}
],
"id": "message-Test Vendor/Test Project:first",
"key": "first",
"projectId": "project-Test Vendor/Test Project",
"translations": {
"en-US": [
{
"default": {
"owner": "414d6352-c26b-493e-835e-3f0cf37f1f3c",
"text": "first"
}
}
]
},
"type": "message",
"vendorId": "vendor-Test Vendor"
},
And I want, as an example, to find all messages that are classified with a "documentPath" of "Test Vendor/Test Project/Ordered".
I use this query:
SELECT message.*
FROM couchlate message UNNEST message.classifications classification
WHERE classification.documentPath = "Test Vendor/Test Project/Ordered"
AND message.type="message"
ORDER BY classification.position
But I'm very surprised that the query takes 2 seconds to execute!
Looking at the query execution plan, it seems that Couchbase loops over all the messages and only then filters on "documentPath".
I'd like it to first filter on "documentPath" (because there are in reality only 2 documentPaths matching my query) and then find the messages.
I've tried to create an index on "classifications" but it did not change anything.
Is there something wrong with my index setup, or should I structure my data differently to get fast results?
I'm using Couchbase 4.5 beta, if that matters.
Your query filters on the documentPath field, so an index on classifications doesn't actually help. You need to create an array index on the documentPath field itself using the new array index syntax on Couchbase 4.5:
CREATE INDEX ix_documentPath ON myBucket ( DISTINCT ARRAY c.documentPath FOR c IN classifications END ) ;
Then you can query on documentPath with a query like this:
SELECT * FROM myBucket WHERE ANY c IN classifications SATISFIES c.documentPath = "your path here" END ;
Add EXPLAIN to the start of the query to see the execution plan and confirm that it is indeed using the index ix_documentPath.
More details and examples here: http://developer.couchbase.com/documentation/server/4.5-dp/indexing-arrays.html