How to make aggregations fast on Vespa?

We have 60M documents in an index, hosted on a 4-node cluster.
I want to make sure the configuration is optimised for aggregations on the documents.
This is the sample query:
select * from sources * where (sddocname contains ([{"implicitTransforms": false}]"tweet")) | all(group(n_tA_c) each(output(count() as(count))));
The field n_tA_c contains an array of strings. This is a sample document:
{
"fields": {
"add_gsOrd": 63829,
"documentid": "id:firehose:tweet::815347045032742912",
"foC": 467,
"frC": 315,
"g": 0,
"ln": "en",
"m": "ya just wants some fried rice",
"mTp": 2,
"n_c_p": [],
"n_tA_c": [
"fried",
"rice"
],
"n_tA_s": [],
"n_tA_tC": [],
"sN": "long_delaney1",
"sT_dlC": 0,
"sT_fC": 0,
"sT_lAT": 0,
"sT_qC": 0,
"sT_r": 0.0,
"sT_rC": 467,
"sT_rpC": 0,
"sT_rtC": 0,
"sT_vC": 0,
"sddocname": "tweet",
"t": 1483228858608,
"u": 377606303,
"v": "false"
},
"id": "id:firehose:tweet::815347045032742912",
"relevance": 0.0,
"source": "content-root-cluster"
}
The n_tA_c field is an attribute with fast-search enabled:
field n_tA_c type array<string> {
indexing: summary | attribute
attribute: fast-search
}
This simple term aggregation query does not come back within 20s and times out. What additional checks should we go through to reduce this latency?
$ curl 'http://localhost:8080/search/?yql=select%20*%20from%20sources%20*%20where%20(sddocname%20contains%20(%5B%7B%22implicitTransforms%22%3A%20false%7D%5D%22tweet%22))%20%7C%20all(group(n_tA_c)%20each(output(count()%20as(count))))%3B' | python -m json.tool
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 270 100 270 0 0 13 0 0:00:20 0:00:20 --:--:-- 67
{
"root": {
"children": [
{
"continuation": {
"this": ""
},
"id": "group:root:0",
"relevance": 1.0
}
],
"errors": [
{
"code": 12,
"message": "Timeout while waiting for sc0.num0",
"source": "content-root-cluster",
"summary": "Timed out"
}
],
"fields": {
"totalCount": 0
},
"id": "toplevel",
"relevance": 1.0
}
}
These nodes are AWS i3.4xlarge boxes (16 cores, 120 GB RAM).
I might be missing something silly.

You are asking for every unique value and its count(), since your grouping expression does not contain any max(x) limitation. This is a very CPU- and network-intensive task to compute, and limiting the number of groups is much faster, e.g.
all(group(n_tA_c) max(10) each(output(count() as(count))));
General comments:
With Vespa, as with any other serving engine, it's important to have enough memory and, for example, to have swap disabled, so that you can index and search data without getting into high memory pressure.
How much memory you'll use per document type depends on several factors, but the number of fields defined as attributes and the number of documents per node are important. Redundancy and the number of searchable copies also play a major role.
Grouping over the entire corpus is memory intensive (memory bandwidth for reading attribute values), CPU intensive, and also network intensive when there is a high fan-out (see more on precision, which can limit the number of groups returned per node, at http://docs.vespa.ai/documentation/grouping.html).

Summarising the checkpoints to go through when running aggregations, based on the conversation in the other answer and further documentation (a combined query example follows this list):
Always add max(x) to the grouping expression to cap the number of buckets returned. When data is distributed across multiple content nodes the result can be inaccurate; use precision(x) as well to tune the accuracy you need.
If you only need aggregation buckets and no hits, pass limit 0 in the YQL; this saves the step of loading summaries to return to the container.
The attribute fields you filter or aggregate on should have fast-search enabled; otherwise there is no B-tree-like index and the attribute values have to be traversed.
Ensure a constant score for documents with &ranking=unranked in the query.
Enable groupingSessionCache: http://docs.vespa.ai/documentation/reference/search-api-reference.html#groupingSessionCache
Size the content nodes for the latency vs. number-of-documents trade-off using max-hits, as described in http://docs.vespa.ai/documentation/performance/sizing-search.html
If memory is the bottleneck, look at the attribute flush strategy configuration: http://docs.vespa.ai/documentation/proton.html#proton-maintenance-jobs
If CPU is the bottleneck, increase parallelism and ensure all cores are used per search: http://docs.vespa.ai/documentation/content/setup-proton-tuning.html#requestthreads-persearch. The change goes in services.xml:
<persearch>16</persearch>
Threads per search is 1 by default.
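Putting these together, here is a minimal sketch of the tuned request in Python (the max/precision values and the timeout are illustrative only, and the implicitTransforms annotation from the original query is omitted for brevity):

import requests

# Grouping with max() and precision(), and limit 0 so no hit summaries are fetched.
yql = ('select * from sources * where sddocname contains "tweet" limit 0 '
       '| all(group(n_tA_c) max(10) precision(1000) '
       'each(output(count() as(count))));')

resp = requests.get("http://localhost:8080/search/", params={
    "yql": yql,
    "ranking": "unranked",           # constant score, skip ranking work
    "groupingSessionCache": "true",  # see the search API reference linked above
    "timeout": "10s",                # illustrative request timeout
})
print(resp.json()["root"].get("children"))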
The above changes ensured that the query returned results before the timeout. But we learned that Vespa is not built with aggregations as a primary goal. Write and search latencies are much lower than Elasticsearch at the same scale on identical hardware, but aggregation (especially over multi-valued string fields) is more CPU intensive and has higher latency than Elasticsearch for the same aggregation query.

Related

Performance issue when querying time-based objects

I'm currently working on a MongoDB collection containing documents that look like the following:
{ startTime : Date, endTime: Date, source: String, metaData: {}}
My use case is to retrieve all documents included within a queried time frame, so my query looks like this:
db.myCollection.find(
{
$and: [
{"source": aSource},
{"startTime" : {$lte: timeFrame.end}},
{"endTime" : {$gte: timeFrame.start}}
]
}
).sort({ "startTime" : 1 })
With an index defined as follows:
db.myCollection.createIndex( { "source" : 1, "startTime": 1, "endTime": 1 } );
The problem is that queries are very slow (multiple hundreds of ms on a local database) as soon as the number of documents per source increases.
Using mongo explain shows me that I'm using this index efficiently (only matching documents are scanned; otherwise only index access is made), so the slowness seems to come from the index scan itself, as this query needs to go over a large portion of the index.
In addition, such an index gets huge pretty quickly and therefore seems inefficient.
Is there anything I'm missing that could help make those queries faster, or is retrieving all the documents belonging to a given source the best way to go? I see that Mongo now provides some time-series features; could those help with my problem?
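For reference, a hedged pymongo sketch of the same index and query, with an explain() call to inspect the winning plan (the database name and dates are placeholders):

from datetime import datetime
from pymongo import ASCENDING, MongoClient

coll = MongoClient()["mydb"]["myCollection"]
coll.create_index([("source", ASCENDING), ("startTime", ASCENDING), ("endTime", ASCENDING)])

cursor = coll.find({
    "source": "aSource",
    "startTime": {"$lte": datetime(2021, 1, 31)},
    "endTime": {"$gte": datetime(2021, 1, 1)},
}).sort("startTime", ASCENDING)

# queryPlanner.winningPlan should show an IXSCAN over the compound index.
print(cursor.explain()["queryPlanner"]["winningPlan"])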

Indexing for GROUP BY in CosmosDB

As the title suggests I'm wondering how to create an effective index for GROUP BY queries in CosmosDB.
Say the documents look something like:
{
"pk": "12345",
"speed": 500
},
{
"pk": "6789",
"speed": 100
}
Doing a query to find out the SUM of the speed grouped by the partition key would look something like:
SELECT c.pk, SUM(c.speed) FROM c WHERE c.pk IN ('12345','6789') GROUP BY c.pk
With ~1.6 million documents, this query costs 1489.51 RUs. However, splitting it into two queries such as:
SELECT SUM(c.speed) FROM c WHERE c.pk = '12345'
SELECT SUM(c.speed) FROM c WHERE c.pk = '6789'
each of them costs only ~2.8 RUs. Obviously the results need some post-processing to match the GROUP BY output, but a total of 5.6 RUs compared to 1489 RUs makes it worth it.
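For illustration, a hedged sketch of that two-query workaround with the client-side post-processing, assuming the azure-cosmos v4 Python SDK (the account endpoint, key, and database/container names are placeholders):

from azure.cosmos import CosmosClient

client = CosmosClient("https://YOUR-ACCOUNT.documents.azure.com", credential="YOUR-KEY")
container = client.get_database_client("YOUR-DB").get_container_client("YOUR-CONTAINER")

def sum_speed(pk):
    # Single-partition aggregate; roughly the ~2.8 RU query from above.
    results = container.query_items(
        query="SELECT VALUE SUM(c.speed) FROM c WHERE c.pk = @pk",
        parameters=[{"name": "@pk", "value": pk}],
        partition_key=pk,
    )
    return next(iter(results))

# Client-side post-processing instead of GROUP BY.
totals = {pk: sum_speed(pk) for pk in ["12345", "6789"]}
print(totals)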
The indexing on the collection is as follows:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*"
}
],
"excludedPaths": [
{
"path": "/\"_etag\"/?"
}
],
"compositeIndexes": [
[
{
"path": "/pk",
"order": "ascending"
},
{
"path": "/speed",
"order": "ascending"
}
]
]
}
Am I completely missing something or how can the GROUP BY be so much more expensive? Is there any indexing I can do to bring it down?
Thanks in advance!
Currently, GROUP BY does not yet use the index.
This is currently being worked on. I would revisit sometime towards the end of the year to verify it is supported.
This feature is supported now. The query engine in the Azure Cosmos DB Core (SQL) API now has a new system function and optimizations for a set of query operations to better use the index.

Read Flink latency tracking metric in Datadog

I'm following this doc https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/metrics/#end-to-end-latency-tracking and enabled metrics.latency.interval in flink-conf.yaml as shown below:
metrics.latency.interval: 60000
metrics.latency.granularity: operator
Now, I have the following questions:
How can I find out which metrics (a list of metric names) are enabled? I didn't find any in the metrics UI.
Datadog is my reporter. Will the latency metrics be sent to Datadog just like the other system metrics listed here: https://docs.datadoghq.com/integrations/flink/#data-collected? If yes, what are their names? If not, is there anything I need to do to get them into Datadog?
I'm new to both Flink and Datadog. Many thanks!
You can access these metrics via the REST API:
http://{job_manager_address}:8081/jobs/{job_id}/metrics
which will return:
[
{
"id": "latency.source_id.3d28eee20f19966ad0843c8183e96045.operator_id.9c9bbdbebfd61a4aaac39e2c417a4f21.operator_subtask_index.7.latency_min"
},
{
"id": "latency.source_id.bca0e5ddee87a6f64a26077804c63e69.operator_id.197249262ed30764bb323b65405e10b4.operator_subtask_index.14.latency_p75"
},
{
"id": "latency.source_id.bca0e5ddee87a6f64a26077804c63e69.operator_id.b9d4ed4c91fec482427d3584100b1c90.operator_subtask_index.12.latency_median"
}
]
The first entry, for example, represents the latency from source_id 3d28eee20... to operator_id 9c9bbdbe for subtask index 7.
However, I don't know the exact meaning of latency_p75 or latency_min. Maybe someone else can help us both.
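A hedged sketch of pulling those metric values through the same REST endpoint with the ?get= parameter (host, port, and job id are placeholders):

import requests

BASE = "http://localhost:8081"   # job manager address
JOB_ID = "your-job-id"

# 1) List the metric names registered for the job.
names = [m["id"] for m in requests.get(f"{BASE}/jobs/{JOB_ID}/metrics").json()]
latency_names = [n for n in names if n.startswith("latency.")]

# 2) Fetch their current values.
resp = requests.get(f"{BASE}/jobs/{JOB_ID}/metrics",
                    params={"get": ",".join(latency_names)})
for metric in resp.json():
    print(metric["id"], metric.get("value"))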
#monstero has explained where to find the latency metrics -- they are job metrics.
The latency metrics are histogram metrics. latency_p75, for example, is the 75th percentile latency, meaning that 75% of the time the latency was less than the reported value.
In all, you can access the min, max, mean, median, stddev, p75, p90, p95, p98, p99, and p999.

Database schema design for stock market financial data

I'm figuring out the optimal structure to store financial data with daily inserts.
There are 3 use cases for querying the data:
Querying specific symbols for current data
Finding symbols current by values (e.g. where price < 10 and dividend.amountPaid > 3)
Charting historical values per symbol (e.g. query all dividend.yield between 2010 and 2020)
I am considering MongoDB, but I don't know which structure would be optimal. Embedding all the data per symbol for a duration of 10 years is too much, so I was thinking of embedding the current data per symbol, and creating references to historical documents.
How should I store this data? Is MongoDB not a good solution?
Here's a small example of the data for one symbol.
{
"symbol": "AAPL",
"info": {
"company_name": "Apple Inc.",
"description": "some long text",
"website": "http://apple.com",
"logo_url": "http://apple.com"
},
"quotes": {
"open": 111,
"close": 321,
"high": 111,
"low": 100
},
"dividends": {
"amountPaid": 0.5,
"exDate": "2020-01-01",
"yieldOnCost": 10,
"growth": { value: 111, pct_chg: 10 } /* some fields could be more attributes than just k/v */
"yield": 123
},
"fundamentals": {
"num_employees": 123213213,
"shares": 123123123123,
....
}
}
What approach would you take for storing this data?
Based on the info (the sample data and the use cases) you posted, I think storing the historical data as a separate collection sounds fine.
Some of the important factors that affect the database design (or data model) are the amount of data and the kinds of queries - the most important queries you plan to perform on the data. Assuming the JSON data you posted (for a stock symbol) can serve the first two queries, you can start with the idea of storing the historical data as a separate collection. The historical-data document for a symbol can cover a year or a range of years - that depends upon the queries, the data size, and the type of information.
MongoDB's document-based model allows a flexible schema, which can be useful for implementing future changes and requirements easily. Note that a MongoDB document can store at most 16 MB of data.
For reference, see MongoDB Data Model Design.
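As a rough illustration of that split, a minimal pymongo sketch (the collection names symbols and history and the year field are just assumptions for this example):

from pymongo import MongoClient

db = MongoClient()["market"]

# Use case 1: current data for specific symbols (one document per symbol).
current = list(db.symbols.find({"symbol": {"$in": ["AAPL", "MSFT"]}}))

# Use case 2: screen symbols by current values.
screened = list(db.symbols.find({"quotes.close": {"$lt": 10},
                                 "dividends.amountPaid": {"$gt": 3}}))

# Use case 3: chart historical values, stored in a separate collection
# with one document per symbol per year (or per range of years).
history = list(db.history.find(
    {"symbol": "AAPL", "year": {"$gte": 2010, "$lte": 2020}},
    {"dividends.yield": 1, "year": 1},
).sort("year", 1))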
Stock market data by itself is huge. Keep it all in one place per company, otherwise you'll end up with a mess sooner or later.
On your example above: logos are '.png' files etc., not .html pages.
Your "quotes" section will be way too big; keep quotes as top-level documents - that's the nice thing with Mongo. Each quotes document should have a date associated with it, and use a date format Mongo supports natively, not a string.

Ludwig preprocessing

I'm running a model with Ludwig.
Dataset is Adult Census:
Features
workclass has almost 70% instances of Private; the unknown (?) values can be imputed with this value.
For native_country, 90% of the instances are United States, which can be used to impute the unknown (?) values. The same cannot be said about the occupation column, as its values are more spread out.
capital_gain has 72% instances with zero values for less than 50K and 19% instances with zero values for >50K.
capital_loss has 73% instances with zero values for less than 50K and 21% instances with zero values for >50K.
When I define the model what is the best way to do it for the above cases?
{
"name": "workclass",
"type": "category",
"preprocessing": {
"missing_value_strategy": "fill_with_mean"
}
},
{
"name": "native_country",
"type": "category",
"preprocessing": {
"missing_value_strategy": "fill_with_mean"
}
},
{
"name": "capital_gain",
"type": "numerical",
"preprocessing": {
"missing_value_strategy": "fill_with_mean"
}
},
{
"name": "capital_loss",
"type": "numerical",
"preprocessing": {
"missing_value_strategy": "fill_with_mean"
}
},
Questions:
1) For category features, how do I define: if you find ?, replace it with X?
2) For numerical features, how do I define: if you find 0, replace it with the mean?
Ludwig currently applies its replacement strategies only to values that are actually missing in the CSV file (e.g. two consecutive commas). In your case I would suggest doing some minimal preprocessing of your dataset, replacing the zeros and ? with missing values or with other values, depending on the type of feature. You can easily do it in pandas with something like:
df.loc[df.my_column == <value>, "my_column"] = <new_value>
The alternative is to perform the replacement already in your code (for instance replacing 0s with averages) so that Ludwig doesn't have to do it and you have full control of the replacement strategy.
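As a concrete example of that minimal preprocessing, here is a hedged pandas sketch for the columns mentioned above (file names are placeholders, and whether zeros should really be treated as missing is your modelling call):

import numpy as np
import pandas as pd

df = pd.read_csv("adult.csv")

# 1) Category features: turn '?' into real missing values (or impute directly).
for col in ["workclass", "native_country"]:
    df[col] = df[col].replace("?", np.nan)
    # df[col] = df[col].fillna(df[col].mode()[0])  # e.g. fill with most frequent value

# 2) Numerical features: treat 0 as missing so a mean-based strategy can apply,
#    or replace with the mean yourself.
for col in ["capital_gain", "capital_loss"]:
    df[col] = df[col].replace(0, np.nan)
    # df[col] = df[col].fillna(df[col].mean())

df.to_csv("adult_clean.csv", index=False)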
