Optimizing seemingly simple couchbase query for "items whose children satisfy"

I'm developing a system to store our translations using couchbase.
I have about 15,000 entries in my bucket that look like this:
{
"classifications": [
{
"documentPath": "Test Vendor/Test Project/Ordered",
"position": 1
}
],
"id": "message-Test Vendor/Test Project:first",
"key": "first",
"projectId": "project-Test Vendor/Test Project",
"translations": {
"en-US": [
{
"default": {
"owner": "414d6352-c26b-493e-835e-3f0cf37f1f3c",
"text": "first"
}
}
]
},
"type": "message",
"vendorId": "vendor-Test Vendor"
},
And I want, as an example, to find all messages that are classified with a "documentPath" of "Test Vendor/Test Project/Ordered".
I use this query:
SELECT message.*
FROM couchlate message UNNEST message.classifications classification
WHERE classification.documentPath = "Test Vendor/Test Project/Ordered"
AND message.type="message"
ORDER BY classification.position
But I'm very surprised that the query takes 2 seconds to execute!
Looking at the query execution plan, it seems that couchbase is looping over all the messages and then filtering on "documentPath".
I'd like it to first filter on "documentPath" (because there are in reality only 2 documentPaths matching my query) and then find the messages.
I've tried to create an index on "classifications" but it did not change anything.
Is there something wrong with my index setup, or should I structure my data differently to get fast results?
I'm using couchbase 4.5 beta if that matters.

Your query filters on the documentPath field, so an index on the classifications array itself doesn't help. You need to create an array index on the documentPath field, using the new array indexing syntax in Couchbase 4.5:
CREATE INDEX ix_documentPath ON myBucket ( DISTINCT ARRAY c.documentPath FOR c IN classifications END ) ;
Then you can query on documentPath with a query like this:
SELECT * FROM myBucket WHERE ANY c IN classifications SATISFIES c.documentPath = "your path here" END ;
Add EXPLAIN to the start of the query to see the execution plan and confirm that it is indeed using the index ix_documentPath.
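As a hedged sketch (not part of the original answer), the question's query could be rewritten in the same style, keeping the couchlate bucket name and the type filter; note that ordering by the matched classification's position would still require the UNNEST form:
SELECT message.*
FROM couchlate message
WHERE ANY c IN message.classifications SATISFIES c.documentPath = "Test Vendor/Test Project/Ordered" END
AND message.type = "message" ;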
More details and examples here: http://developer.couchbase.com/documentation/server/4.5-dp/indexing-arrays.html

Related

Indexing for GROUP BY in CosmosDB

As the title suggests I'm wondering how to create an effective index for GROUP BY queries in CosmosDB.
Say the documents look something like:
{
"pk": "12345",
"speed": 500
},
{
"pk": "6789",
"speed": 100
}
Doing a query to find out the SUM of the speed grouped by the partition key would look something like:
SELECT c.pk, SUM(c.speed) FROM c WHERE c.pk IN ('12345','6789') GROUP BY c.pk
With about ~1.6 million documents this query costs 1489.51 RUs. However, splitting this up into two queries such as:
SELECT SUM(c.speed) FROM c WHERE c.pk = '12345'
SELECT SUM(c.speed) FROM c WHERE c.pk = '6789'
they cost only ~2.8 RUs each. Obviously the results would need some post-processing to match the GROUP BY output, but a total of ~5.6 RUs compared to 1489 RUs makes it worth it.
The indexing on the collection is as follows:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*"
}
],
"excludedPaths": [
{
"path": "/\"_etag\"/?"
}
],
"compositeIndexes": [
[
{
"path": "/pk",
"order": "ascending"
},
{
"path": "/speed",
"order": "ascending"
}
]
]
}
Am I completely missing something, or how can the GROUP BY be so much more expensive? Is there any indexing I can do to bring it down?
Thanks in advance!
Currently, GROUP BY does not yet use the index.
This is currently being worked on. I would revisit sometime towards the end of the year to verify it is supported.
Update: this feature is now supported. The query engine in the Azure Cosmos DB Core (SQL) API now has new system functions and optimizations for a set of query operations that make better use of the index.

Representing JSON data in relational database table

I have a problem where I need to convert a JSON payload into SQL tables while maintaining the relationships established in the payload, so that I can later query the tables and recreate the JSON payload structure.
For example:
{
"batchId": "batch1",
"payees" : [
{
"payeeId": "payee1",
"payments": [
{
"paymentId": "paymentId1",
"amount": 200,
"currency": "USD"
},
{
"paymentId": "paymentId2",
"amount": 200,
"currency": "YEN"
},
{
"paymentId": "paymentId2",
"amount": 200,
"currency": "EURO"
}
]
}
]
}
For the above payload, I have a batch with payments grouped by payees. At its core it all boils down to a batch and its payments, but within that there can be groupings; in the example above, the payments are grouped by payee.
One thing to note is that the payload may not necessarily always follow the above structure. Instead of grouping by payees, it could be by something else like currency for example. Or even no grouping at all, just a root level batch and an array of payments.
I want to know whether there are conventions/rules I can follow for representing such data in relational tables. Thanks.
edit:
I am primarily looking to use Postgres and have looked into the jsonb feature that it provides for storing json data. However, I'm still struggling to figure out how/where (in terms of which table) to best store the grouping info.
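A minimal sketch of one possible normalization, under the assumption that batch and payment are the core entities and the grouping attribute (payee, currency, ...) is kept as ordinary columns on the payment rows; table and column names here are illustrative, not a prescribed design:
CREATE TABLE batch (
    batch_id TEXT PRIMARY KEY
);
CREATE TABLE payment (
    payment_id TEXT PRIMARY KEY,
    batch_id   TEXT NOT NULL REFERENCES batch (batch_id),
    payee_id   TEXT,            -- nullable: only populated when the payload groups by payee
    amount     NUMERIC NOT NULL,
    currency   TEXT NOT NULL
);
With a layout like this, the original grouping can be rebuilt at query time, e.g. with GROUP BY payee_id and jsonb_agg over the payment rows.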

Postgres JSONB Query and Index on Nested String Array

I have some trouble wrapping my head around how to formulate queries and provide proper indices for the following situation. I have customer entities represented in JSON like this (only relevant properties are retained):
{
"id": "50000",
"address": [
{
"line": [
"2nd Main Street",
"123 Harris Plaza"
],
"city": "Boston",
"state": "Massachusetts",
"country": "US",
},
{
"line": [
"1st Av."
],
"city": "Jamestown",
"state": "Massachusetts",
"country": "US",
}
]
}
The customers are stored in the following customer table:
CREATE TABLE Customer (
id BIGSERIAL PRIMARY KEY,
resource JSONB
);
I manage to do simple queries on the resource column, e.g. a projection query like this works (retrieve all lower-case address lines for cities starting with "bo"):
SELECT LOWER(jsonb_array_elements_text(jsonb_array_elements(c.resource#>'{address}') #> '{line}')) FROM Customer c, jsonb_array_elements(c.resource #> '{address}') a WHERE LOWER(a->>'city') LIKE 'bo%';
I have trouble doing the following: my goal is to query all customers that have at least one address line beginning with "12". Case insensitivity is a requirement for my use case. The example customer would match my query, as the first address object has an address line starting with "12". Please note that "line" is an Array of JSON Strings, not complex objects. So far the closest thing I could come up with is this:
SELECT c.resource FROM Customer c, jsonb_array_elements(c.resource #> '{address}') a WHERE a->'line' ?| array['123 Harris Plaza'];
Obviously this is not a case-insensitive LIKE query. Any help/pointers on how to formulate both query and accompanying GIN index are greatly appreciated. My first query already selects all address lines as text, so maybe this could be used in a GIN index?
I'm using Postgres 9.5, but am happy to upgrade if this can only be achieved in more recent Postgres versions.
While GIN indexes have machinery to support prefix matching, this machinery is only hooked up for tsvectors. array_ops does not have it hooked up, nor do jsonb_ops or jsonb_path_ops. So unless you want to create new operator classes/families (or normalize your data into separate tables), you will have to shoe-horn your data into a tsvector.
Here is a crude way to do that, which doesn't account for the possibility that an address line might contain literal single quotes or perhaps other meaningful characters:
create function addressline_tsvector(jsonb) returns tsvector immutable language SQL as $$
select string_agg('''' || lower(value) || '''', ' ')::tsvector
from jsonb_array_elements($1->'address') a(a),
jsonb_array_elements_text(a->'line')
$$;
create index on customer using gin (addressline_tsvector(resource));
select * from customer where addressline_tsvector(resource) @@ lower('''2nd Main'':*')::tsquery;
Given that your example table only has one row, the index will probably not actually be used unless you set enable_seqscan = off first.
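As a hedged usage sketch for the original requirement (address lines beginning with "12", case-insensitively), reusing the function and index above; prefix matching with :* should apply here because each whole lowercased line is stored as a single quoted lexeme:
select * from customer where addressline_tsvector(resource) @@ lower('''12'':*')::tsquery;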

How to store translations in nosql DB with minimal duplication?

I got this schema in DynamoDB
{
"timestamp" : "",
"fruit" : {
"name" : "orange",
"translations" : [
{
"en-GB" : "orange"
},
{
"sv-SE" : "apelsin"
},
....
]
}
}
I need to store translations for objects in a DynamoDB database, to be able to query them efficiently. E.g. my query has to be something like "give me all objects where translations array contains "
The problem is, is this a really dumb idea? There are 6500 languages out there, and this means I will be forcing all entries to each contain an array with thousands of properties with 99% of them empty string values. What's a better approach?
Thanks,
Unless you're willing to let DynamoDB do a table scan to get your results, I think you're using the wrong tool. Consider streaming your transactions to AWS Elasticsearch via something like Firehose. Firehose will give you a lot of nice-to-haves and can help you rotate transaction indexes. Elasticsearch should be able to store that structure and run your query.
If you don't go that route, then at least consider dropping the language code from your structure if you're not actually using it. Just make an array of the unique spellings of your fruit. This is the kind of query I might do with multiple queries instead of a single one: go from the spelling of the fruit name to a fruit UUID, which you can then query against.
I would rather save it as:
{
"primaryKey" : "orange",
"SecondaryKey": "en-GB",
"timestamp" : "",
"Metadata" : {
"name" : "orange"
}
}
Then create a secondary index with SecondaryKey as the partition key and primaryKey as the sort key.
By doing this you can query:
Get me "orange" in en-GB.
What keys exist in en-GB.
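As a hedged sketch, the second lookup could be written with PartiQL for DynamoDB; the table name and index name below are illustrative assumptions:
SELECT * FROM "Translations"."SecondaryKey-index" WHERE "SecondaryKey" = 'en-GB';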
If you are updating multiple items at once, you can create one object like this:
{
"KeyName" : "orange",
"SecondaryKey": "master"
"timestamp" : "",
"fruit" : {
"name" : "orange",
"translations" : [
{
"en-GB" : "orange"
},
{
"sv-SE" : "apelsin"
},
....
]
}
}
Then create a Lambda function that denormalizes the above object and creates the multiple items in DynamoDB. You will also have to take care of deleting elements when a language is no longer present in the new object.

WKS - Training model to identify entities on tables

Browser type and version: GoogleChrome 67.0.3396.99
We are trying to train our model to identify values from multiple types of tables, which contain different numbers of rows and columns. A text row was extracted to begin the training: first we configured our system types, then marked the entities and also the relation "AllInOne". We are able to train 10 relations in a training set, but when the model is tested we only see 8 relations, even after creating other document sets for training and testing the model multiple times. Is there another way to associate the column value with the row values in a single relation, considering there isn't a standard for the types of tables we are analyzing with the Discovery service?
We are expecting the discovery service response as the following:
"relations": [
{
"type": "AllInOne",
"sentence": "…",
"arguments": [
{
"entities": [
{
"“text": "””",
"type": "entity1"
}
]
},
{
"entities": [
{
"“text": "””",
"type": "entity2"
}
]
},
{
"entities": [
{
"“text": "””",
"type": "\"entity..n”,"
}
]
},
{ "..." }
]
}
The machine learning model that is trained in Watson Knowledge Studio targets unstructured natural language text. It may not be suitable for (semi-)structured formats such as tables, especially for relations.
