We've got time-series network data which we store in a Vertica table. The UI needs to show the data in descending order of the timestamp. I tried passing the query to the Database Designer, but it doesn't suggest any projection for the descending order; it already has a projection that orders by timestamp in ascending order. I also tried creating a projection with the timestamp ordered descending, but Vertica throws an error - "Projections can only be sorted in ascending order".
Since the UI needs to show events in descending order of timestamp, the SORT cost of the query is very high - can we optimize it in any way?
The following query is very slow (the SORT takes a lot of time even if I supply an event_timestamp filter to consider only one day's worth of events):
select * from public.f_network_events order by event_timestamp desc limit 1000;
You can't sort a projection by ts in descending order in Vertica, I'm afraid.
The trick I use in this situation is to add a column:
tssort INTEGER DEFAULT TIMESTAMPDIFF(microsecond, ts, '2100-01-01'::TIMESTAMP)
.. then sort the projection by that column, calculate that same TIMESTAMPDIFF() in the query, and use it for the WHERE condition.
Not of breathtaking beauty, I agree, but worth the trouble in Big Data scenarios ...
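A minimal sketch of what that could look like with the table and column from the question (the projection name and the far-future cutoff date are illustrative; check the exact syntax against your Vertica version):

-- add a pre-computed reverse-order sort key (2100-01-01 is an arbitrary far-future anchor)
ALTER TABLE public.f_network_events
    ADD COLUMN tssort INTEGER
    DEFAULT TIMESTAMPDIFF(microsecond, event_timestamp, '2100-01-01'::TIMESTAMP);

-- projection sorted ascending on tssort, which is descending on event_timestamp
CREATE PROJECTION public.f_network_events_desc AS
    SELECT * FROM public.f_network_events
    ORDER BY tssort;

-- the UI query sorts by the pre-computed column instead of the timestamp
SELECT *
FROM public.f_network_events
WHERE event_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 day'
ORDER BY tssort
LIMIT 1000;

Because newer timestamps produce a smaller difference to the far-future date, reading the projection in its natural ascending tssort order returns the newest events first, so the LIMIT can be satisfied without a big sort. Remember to refresh the new projection so existing rows are included.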
I'm starting to learn the Neo4j graph database and I see that you can order your returned data by different (unlimited) parameters in different directions. In what cases would this kind of ordering be useful? If you order by different parameters in different directions, you will lose the correlation between parameters in the rows, as, say, m.year is ordered in descending order while m.title can be ordered in ascending order. I can think of analytics queries, where you want to, let's say, see in one row the latest day that an order has been placed, in another the maximum amount for an order, and in another one the max number of items in an order. Many thanks.
I think you got the wrong idea of how sorting by multiple parameters works in Cypher. When you specify multiple sort parameters, like this:
MATCH (n)
RETURN n.name, n.age
ORDER BY n.age ASC, n.name DESC
The sorting first happens on the basis of age in ascending order, and if there are two result objects having the same age, then they are sorted by name in descending order. So the correlation between the parameters in a row remains intact.
Here's the documentation link.
I have a data model that consists of 8 entity types and I need to design a DynamoDB NoSQL model around some access patterns. The access patterns are not entity specific so I am in the process of translating them, but most of the access patterns rely on getting items by a date range. From previous related questions, people usually assume that getting the item by both an itemID (Partition Key) and date range (Sort Key) is the norm, but in my case, I need to get all entities by a date range.
This would mean the partition key is the entity type and the sort key is the date range. Am I correct with this statement?
Given the large size of the data (>100GB), I am not sure if this is true.
Update: List of access patterns and data example
The access patterns so far look like this:
Get all transactions during a given date range
Get all transactions during a given date range for a given locationId
Get all transactions during a given date range for a given departmentId
Get all transactions during a given date range for a given categoryId
Get all transactions during a given date range for a given productId
Get all transactions during a given date range for a given transactionItemId
Get all transactions during a given date range for a given supplierId
Get all products on transactions during a given date range
And a transaction entity has the following attributes (I have only included a snippet but there are 52 attributes altogether):
identifier
confirmationNumber **(contains date information)**
priceCurrency
discount
discountInvoiced
price
shippingAmount
subtotal
subtotalInclTax
.
.
.
I don't think DynamoDB will make you very happy for this use case; you have a bunch of different filter categories, and that's typically not what DynamoDB excels at.
An implementation would require lots of data duplication through global secondary indexes, as well as trouble with hot partitions. The general approach could be to have a base table with the date as the PK and the timestamp as the SK. You would then create global secondary indexes based on locationId, departmentId and the other categories you filter by. This results in data duplication and, depending on your filter categories, hot partitions.
I'd probably use a relational database with indexes on the filter fields and partition that by the transaction time.
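As an illustration only, a sketch of that relational alternative (PostgreSQL syntax assumed; table, column and partition names are made up, not taken from your model):

CREATE TABLE transactions (
    identifier        bigint       NOT NULL,
    transaction_time  timestamptz  NOT NULL,
    location_id       bigint,
    department_id     bigint,
    category_id       bigint,
    product_id        bigint,
    supplier_id       bigint,
    price             numeric(12,2)
    -- ... remaining attributes
) PARTITION BY RANGE (transaction_time);

-- e.g. one partition per month
CREATE TABLE transactions_2024_01 PARTITION OF transactions
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- one index per filter field, ending with the time column
CREATE INDEX ON transactions (location_id, transaction_time);
CREATE INDEX ON transactions (department_id, transaction_time);
CREATE INDEX ON transactions (category_id, transaction_time);

-- "all transactions for a location in a date range"
SELECT *
FROM transactions
WHERE location_id = 42
  AND transaction_time >= '2024-01-01'
  AND transaction_time <  '2024-02-01';

Each of the access patterns above then becomes an index range scan within the relevant time partitions, without duplicating the data per filter category.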
Say I have blog post comments. On insert they get the current utc date time as their creation time (via sysutcdatetime default value) and they get an ID (via integer identity column as PK).
Now I want to sort the comments descending by their age. Is it safe to just do a ORDER BY ID or is it required to use the creation time? I'm thinking about "concurrent" commits and rollbacks of inserts and an isolation level of read committed. Is it possible that the IDs sometimes do not represent the insert order?
I'm asking this because if sorting by IDs is safe then I could have the following benefits:
I don't need an index for the creation time.
Sorting by IDs is probably faster.
I don't need a high precision on the datetime2 column because that would only be required for sorting anyway (in order to not have two rows with the same creation time).
This answer says it is possible when you don't have the creation time but is it always safe?
This answer says it is not safe with an identity column. But when it's also the PK, the answer gives an example of sorting by ID without mentioning whether this is safe.
Edit:
This answer suggests sorting by date and then by ID.
Yes, the IDs can be jumbled because ID generation is not part of the insert transaction. This is in order to not serialize all insert transactions on the table.
The most correct way to sort would be ORDER BY DateTime DESC, ID DESC with the ID being added as a tie breaker in case the same date was generated multiple times. Tie breakers in sorts are important to achieve deterministic results. You don't want different data to be shown for multiple refreshes of the page for example.
You can define a covering index on DateTime DESC, ID DESC and achieve the same performance as if you had ordered by the clustered index key (here: ID). There's no relevant physical difference between the clustered index and nonclustered indexes.
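A minimal sketch of that index and query (T-SQL; the table and column names here are assumptions, not from the question):

-- covering index in the display order, with the ID as the tie breaker
CREATE NONCLUSTERED INDEX IX_Comments_CreatedAt_Id
    ON dbo.Comments (CreatedAtUtc DESC, Id DESC)
    INCLUDE (Author, Body);  -- cover the columns the page actually displays

-- deterministic "newest first" ordering
SELECT TOP (50) Id, Author, Body, CreatedAtUtc
FROM dbo.Comments
ORDER BY CreatedAtUtc DESC, Id DESC;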
Since you mention the PK somewhere, I want to point out that the choice of the PK does not affect any of this. Only indexes do. The query processor cares about the indexes that exist, not about which of them happens to be declared the PK or a unique key.
I would order by ID.
Technically you may get different results when sorting by ID vs sorting by time.
The sysutcdatetime default captures the time at one point during the insert, while the ID may be generated at a different point, so across concurrent inserts the two orders can disagree. Also, the clock on any computer always drifts. When the computer clock is synchronized with the time source, it may jump forward or backwards. If you sync often, the jumps will be small, but they will happen.
From the practical point of view, if two comments were posted within, say, one second of each other, does it really matter which of these comments is shown first?
What I think does matter is the consistency of the display results. If the system somehow decides that comment A should go before comment B, then this order should be preserved everywhere across the system.
So, even with the highest precision datetime2(7) column it is possible to have two comments with exactly the same timestamp and if you order just by this timestamp it is possible that sometimes they will appear as A, B and sometimes as B, A.
If you order by ID (primary key), you are guaranteed that it is unique, so the order will be always well defined.
I would order by ID.
On second thought, I would order by time and ID.
If you show the time of the comment to the user, it is important to show comments according to this time. To guarantee consistency, sort by both time and ID in case two comments have the same timestamp.
If you sort on ID in descending order and you are filtering by user, your blog will automatically show the latest posts at the top, and that will do the job for you. So you don't need to use the date for sorting.
I want to index orders and corresponding order entries in Solr to display it in our e-commerce site.
I am planning to adopt a de-normalized approach by repeating order details with every order entry to reduce request latency. But at the same time I need to group records by orderid to find the order total for a specified duration.
Is it possible to achieve this without going for a separate index for orders alone?
Yes, this is possible: you can use Result Grouping / Field Collapsing for your query. In your case the group field should be orderid, and you should add group=true&group.field=orderid to your request to Solr to enable this feature.
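For example, a grouped request could look like this (the host, collection name and group.limit value are illustrative):

http://localhost:8983/solr/orders/select?q=*:*&group=true&group.field=orderid&group.limit=100

Each group in the response then contains the order entries that share one orderid, so the order total for a given duration can be computed from the grouped documents; add a filter query on your date field (its name depends on your schema) to restrict the range.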
I'm trying to figure out how I can create a calculated measure that produces a count of only unique facts in my fact table. My fact table basically stores events from a historical perspective. But I need the measure to filter out redundant events.
Using sales as an example (since all material around OLAP always uses sales in examples):
The fact table stores sales EVENTS. When a sale is first made it has a unique sales reference, which is a column in the fact table. A unique sale, however, can be amended (items added or returned) or completely cancelled. The fact table stores these changes to a sale as different rows.
If I create a count measure using SSAS, I get a count of all sales events, which means a unique sale will be counted multiple times for every change made to it (which in some reports is desirable). However, I also want a measure that produces a count of unique sales rather than events, but not just based on counting unique sales references. If the user filters by date, then they should see the unique sales that still exist on that date (if a sale was cancelled by that date it should not be represented in the count at all).
How would I do this in MDX/SSAS? It seems like I need to have a count query work on a subset produced by a query that finds the latest change to a sale based on the time dimension.
In SQL it would be something like:
SELECT COUNT(*) FROM SalesFacts FACT1 WHERE Event <> 'Cancelled' AND
Timestamp = (SELECT MAX(Timestamp) FROM SalesFacts FACT2 WHERE FACT1.SalesRef = FACT2.SalesRef)
Is it possible, or even performant, to have subqueries in MDX?
In SSAS, create a measure that is based on the unique transaction ID (the sales number, or order number), then set that measure's aggregation function to 'DistinctCount' in the properties window.
Now it should count distinct order numbers under whichever dimension slice it finds itself.
The posted query could probably be rewritten like this:
SELECT COUNT(DISTINCT SalesRef)
FROM SalesFacts
WHERE Event <> 'Cancelled'
A simple answer would be just to have a 'sales count' column in your fact view / DSV query that supplies a 1 for an 'initial' event, a zero for all subsequent revisions to the event, and a -1 if the event is cancelled. This 'journalling' approach plays nicely with incremental fact table loads. A sketch of such a view is below.
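What that fact-view column could look like in SQL (the event name 'Created' is an assumption; adjust the CASE branches to whatever your event column actually contains):

SELECT SalesRef,
       Timestamp,
       Event,
       CASE
           WHEN Event = 'Created'   THEN  1   -- initial event: the sale starts to exist
           WHEN Event = 'Cancelled' THEN -1   -- cancellation removes it from the count
           ELSE 0                             -- amendments leave the count unchanged
       END AS SalesCount
FROM SalesFacts;

A plain SUM over SalesCount then gives the number of sales that still exist under whatever slice the user applies, and each row's contribution is fixed at load time, which is what makes it friendly to incremental loads.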
Another approach, probably more useful in the long run, would be to have an Events dimension: you could then expose a calculated measure that was the count of the members in that dimension non-empty over a given measure in your fact table. However for sales this is essentially a degenerate dimension (a dimension based on a fact table) and might get very large. This may be inappropriate.
Sometimes the requirements may be more complicated. If you slice by time, do you need to know all the distinct events that existed then, even if they were later cancelled? That starts to get tricky: there's a recent post on Chris Webb's blog where he talks about one (slightly hairy) solution:
http://cwebbbi.wordpress.com/2011/01/22/solving-the-events-in-progress-problem-in-mdx-part-2role-playing-measure-groups/