Indexing for GROUP BY in CosmosDB - database

As the title suggests I'm wondering how to create an effective index for GROUP BY queries in CosmosDB.
Say the documents look something like:
{
"pk": "12345",
"speed": 500
},
{
"pk": "6789",
"speed": 100
}
Doing a query to find out the SUM of the speed grouped by the partition key would look something like:
SELECT c.pk, SUM(c.speed) FROM c WHERE c.pk IN ('12345','6789') GROUP BY c.pk
With about ~1.6 million documents this query costs 1489.51 RUs. However, splitting this up into two queries such as:
SELECT SUM(c.speed) FROM c WHERE c.pk = '12345'
SELECT SUM(c.speed) FROM c WHERE c.pk = '6789'
each of them cost only ~2.8 RUs each. Obviously the results would need some post-processing compared to the GROUP BY query to match. But a total of 5.6 RUs compared to 1489 RUs makes it worth it.
The indexing on the collection is as follows:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*"
}
],
"excludedPaths": [
{
"path": "/\"_etag\"/?"
}
],
"compositeIndexes": [
[
{
"path": "/pk",
"order": "ascending"
},
{
"path": "/speed",
"order": "ascending"
}
]
]
}
Am I completely missing something or how can the GROUP BY be so much more expensive? Is there any indexing I can do to bring it down?
Thanks in advance!

Currently GROUP BY does not not yet use the index.
This is currently being worked on. I would revisit sometime towards the end of the year to verify it is supported.

This feature is supported now , The query engine in Azure Cosmos DB Core (SQL) API now has a new system function and optimizations for a set of query operations to better use the index.

Related

Representing JSON data in relational database table

I have a problem where I need to convert a JSON payload into SQL tables, while maintaining the relationships established in the payload. This is so that later I have the ability to query the tables and recreate the JSON payload structure in the future.
For example:
{
"batchId": "batch1",
"payees" : [
   {
 "payeeId": "payee1",
"payments": [
{
"paymentId": "paymentId1",
"amount": 200,
"currency": "USD"
},
{
"paymentId": "paymentId2",
"amount": 200,
"currency": "YEN"
},
{
"paymentId": "paymentId2",
"amount": 200,
"currency": "EURO"
}
]
}
]
}
For the above payload, I have a batch with payments grouped by payees. At its core it all boils down to a batch and its payments. But in that you can have groupings, for example above, it's grouped by payees.
One thing to note is that the payload may not necessarily always follow the above structure. Instead of grouping by payees, it could be by something else like currency for example. Or even no grouping at all, just a root level batch and an array of payments.
I want to know if there are conventions/rules I can follow to approach represent such data into relational tables? Thanks.
edit:
I am primarily looking to use Postgres and have looked into the jsonb feature that it provides for storing json data. However, I'm still struggling to figure out how/where (in terms of which table) to best store the grouping info.

Ludwig preprocessing

I'm running a model with Ludwig.
Dataset is Adult Census:
Features
workclass has almost 70% instances of Private, the Unknown (?) can be imputed with this value.
native_country, 90% of the instances are United States which can be used to impute for the Unknown (?) values. Same cannot be said about occupation column as the values are more distributed.
capital_gain has 72% instances with zero values for less than 50K and 19% instances with zero values for >50K.
capital_loss has 73% instances with zero values for less than 50K and 21% instances with zero values for >50K.
When I define the model what is the best way to do it for the above cases?
{
"name": "workclass",
"type": "category"
"preprocessing": {
"missing_value_strategy": "fill_with_mean"
}
},
{
"name": "native_country",
"type": "category"
"preprocessing": {
"missing_value_strategy": "fill_with_mean"
}
},
{
"name": "capital_gain",
"type": "numerical"
"preprocessing": {
"missing_value_strategy": "fill_with_mean",
}
},
{
"name": "capital_loss",
"type": "numerical"
"preprocessing": {
"missing_value_strategy": "fill_with_mean"
}
},
Questions:
1) For category features how to define: If you find ?, replace it with X.
2) For numerical features how to define: If you find 0, replace it with mean?
Ludwig currently considers missing values in the CSV file, like with two consecutive commas for it's replacement strategies. In your case I would suggest to do some minimal preprocessing to your dataset by replacing the zeros and ? with missing values or depending on the type of feature. You can easily do it in pandas with something like:
df[df.my_column == <value>].my_column = <new_value>.
The alternative is to perform the replacement already in your code (for instance replacing 0s with averages) so that Ludwig doesn't have to do it and you have full control of the replacement strategy.

WKS - Training model to identify entities on tables

Browser type and version: GoogleChrome 67.0.3396.99
We are trying to train our model to identify values from multiple types of tables whom contain different number of rows and columns. A text row was extracted to begin the training, first we configure our system types and then, marked the entities and also the relation “AllInOne”. We are able to train 10 relations in a training set, but when the model is tested, we are only able to see 8 relations even creating other document sets for training and test the model multiple times. Is there another way to associate the column value with the row values in a single relation considering there isn’t a standard for the types of tables we are analyzing with the Discovery service?
We are expecting the discovery service response as the following:
"relations": [
{
"type": "AllInOne",
"sentence": "…",
"arguments": [
{
"entities": [
{
"“text": "””",
"type": "entity1"
}
]
},
{
"entities": [
{
"“text": "””",
"type": "entity2"
}
]
},
{
"entities": [
{
"“text": "””",
"type": "\"entity..n”,"
}
]
},
{ "..." }
]
}
The machine learning model that is trained in Watson Knowledge Studio targets unstructured natural language text. It may not be suitable for (semi-) structured format like table, especially for relations.

How to make aggregations fast on Vespa?

We have 60M documents in an index. hosted on 4 nodes cluster.
I want to make sure the configuration is optimised for aggregations on the documents.
This is the sample query:
select * from sources * where (sddocname contains ([{"implicitTransforms": false}]"tweet")) | all(group(n_tA_c) each(output(count() as(count))));
The field n_tA_c contains array of strings. This is the sample document:
{
"fields": {
"add_gsOrd": 63829,
"documentid": "id:firehose:tweet::815347045032742912",
"foC": 467,
"frC": 315,
"g": 0,
"ln": "en",
"m": "ya just wants some fried rice",
"mTp": 2,
"n_c_p": [],
"n_tA_c": [
"fried",
"rice"
],
"n_tA_s": [],
"n_tA_tC": [],
"sN": "long_delaney1",
"sT_dlC": 0,
"sT_fC": 0,
"sT_lAT": 0,
"sT_qC": 0,
"sT_r": 0.0,
"sT_rC": 467,
"sT_rpC": 0,
"sT_rtC": 0,
"sT_vC": 0,
"sddocname": "tweet",
"t": 1483228858608,
"u": 377606303,
"v": "false"
},
"id": "id:firehose:tweet::815347045032742912",
"relevance": 0.0,
"source": "content-root-cluster"
}
The n_tA_c is attribute with mode fast-search
field n_tA_c type array<string> {
indexing: summary | attribute
attribute: fast-search
}
The simple term aggregation query does not come back in 20s. And times-out. What are additional check-list we need to ensure to reduce this latency?
$ curl 'http://localhost:8080/search/?yql=select%20*%20from%20sources%20*%20where%20(sddocname%20contains%20(%5B%7B%22implicitTransforms%22%3A%20false%7D%5D%22tweet%22))%20%7C%20all(group(n_tA_c)%20each(output(count()%20as(count))))%3B' | python -m json.tool
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 270 100 270 0 0 13 0 0:00:20 0:00:20 --:--:-- 67
{
"root": {
"children": [
{
"continuation": {
"this": ""
},
"id": "group:root:0",
"relevance": 1.0
}
],
"errors": [
{
"code": 12,
"message": "Timeout while waiting for sc0.num0",
"source": "content-root-cluster",
"summary": "Timed out"
}
],
"fields": {
"totalCount": 0
},
"id": "toplevel",
"relevance": 1.0
}
}
These nodes are aws i3.4x large boxes.(16 cores, 120 GB)
I might me missing something silly.
You are asking for every unique value and their count() as your grouping expression does not contain any max(x) limitation, this is a very cpu and network intensive task to compute and limiting number of groups is much faster by e.g
all(group(n_tA_c) max(10) each(output(count() as(count))));
General comments:
With vespa like any other serving engine it's important to have enough memory and e.g swap disabled so you can index and search data without getting into high memory pressure.
How much memory you'll use per document type is dependent on several factors but how many fields defined with attribute and number of documents per node is important. Redundancy and number of searchable copies also plays a major role.
Grouping over the entire corpus is memory intensive (memory bandwidth reading attribute values), cpu intensive and also network intensive when there is a high fan-out (See more on the precision here http://docs.vespa.ai/documentation/grouping.html which can limit number of groups returned per node).
Summarising the checkpoints to take care while making aggregations from the conversation in other answer and more documentation help.
Always add max(x) in the group for size of buckets needed. When data is distributed across multiple content nodes this result can be inaccurate. To increase accuracy we need to use precision(x) as well to tune accuracy as we need.
If you only need aggregation buckets and no hits - pass limit 0 in the yql; this will save the step to load summary to be returned for container.
The attribute fields we are filtering/aggregating to be on mode fast-search; otherwise it is not B-tree like index - and has to be traversed.
Ensure constant score for docs with &ranking=unranked in the query.
Enable groupingSessionCache: http://docs.vespa.ai/documentation/reference/search-api-reference.html#groupingSessionCache
Sizing the content node for tradeoffs of latency vs no. of docs. by max-hits as described: http://docs.vespa.ai/documentation/performance/sizing-search.html
If memory is the bottleneck one can look at attribute flush strategy configuration. http://docs.vespa.ai/documentation/proton.html#proton-maintenance-jobs
If CPU is the bottleneck; increase parallelism. Ensure all cores are used in Searcher. http://docs.vespa.ai/documentation/content/setup-proton-tuning.html#requestthreads-persearch. Changes for that in service.xml:
<persearch>16</persearch>
Threads persearch is by default 1.
Above changes, ensured that query is returned with result before timeout. But learned that Vespa is not made for aggregations with primary goal. The latency for write and search are much less than ES with same scale on identical hardware. But aggregation (specially with multi-valued string fields) is more CPU intensive and more latency compare to ES for the same aggregation query.

Optimizing seemingly simple couchbase query for "items whose children satisfy"

I'm developing a system to store our translations using couchbase.
I have about 15,000 entries in my bucket that look like this:
{
"classifications": [
{
"documentPath": "Test Vendor/Test Project/Ordered",
"position": 1
}
],
"id": "message-Test Vendor/Test Project:first",
"key": "first",
"projectId": "project-Test Vendor/Test Project",
"translations": {
"en-US": [
{
"default": {
"owner": "414d6352-c26b-493e-835e-3f0cf37f1f3c",
"text": "first"
}
}
]
},
"type": "message",
"vendorId": "vendor-Test Vendor"
},
And I want, as an example, to find all messages that are classified with a "documentPath" of "Test Vendor/Test Project/Ordered".
I use this query:
SELECT message.*
FROM couchlate message UNNEST message.classifications classification
WHERE classification.documentPath = "Test Vendor/Test Project/Ordered"
AND message.type="message"
ORDER BY classification.position
But I'm very surprised that the query takes 2 seconds to execute!
Looking at the query execution plan, it seems that couchbase is looping over all the messages and then filtering on "documentPath".
I'd like it to first filter on "documentPath" (because there are in reality only 2 documentPaths matching my query) and then find the messages.
I've tried to create an index on "classifications" but it did not change anything.
Is there something wrong with my index setup, or should I structure my data differently to get fast results?
I'm using couchbase 4.5 beta if that matters.
Your query filters on the documentPath field, so an index on classifications doesn't actually help. You need to create an array index on the documentPath field itself using the new array index syntax on Couchbase 4.5:
CREATE INDEX ix_documentPath ON myBucket ( DISTINCT ARRAY c.documentPath FOR c IN classifications END ) ;
Then you can query on documentPath with a query like this:
SELECT * FROM myBucket WHERE ANY c IN classifications SATISFIES c.documentPath = "your path here" END ;
Add EXPLAIN to the start of the query to see the execution plan and confirm that it is indeed using the index ix_documentPath.
More details and examples here: http://developer.couchbase.com/documentation/server/4.5-dp/indexing-arrays.html

Resources