Ordering your query result by various parameters ordered differently - database

I'm starting to learn the Neo4j graph database and I see that you can order your returned data by different (unlimited) parameters, each in a different direction. In what cases would this kind of ordering be useful? If you order by different parameters in different orders you will lose the correlation between parameters in the rows, as, say, m.year is ordered descending while m.title is ordered ascending. I can think of analytics queries, where you want to see in one row, say, the latest day an order was placed, in another the maximum amount of an order, and in another the maximum number of items in an order. Many thanks.

I think you got the wrong idea of how sorting by multiple parameters works in Cypher. When you specify multiple sort parameters, like this:
MATCH (n)
RETURN n.name, n.age
ORDER BY n.age ASC, n.name DESC
The sorting first happens on the basis of age in ascending order, and if there are two result objects having the same age, then they are sorted by name in descending order. So the correlation between the parameters in a row remains intact.
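For example (made-up data, not from the question): given three people (Anna, 30), (Bob, 25) and (Carla, 30), the query above returns Bob/25 first, then Carla/30, then Anna/30. The two people aged 30 are ordered by name descending, and each name still appears in the same row as its own age.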
Here's the documentation link.

Related

Vertica - projections for order by descending order

We've got time-series network data which we store in a Vertica table. The UI needs to show the data in descending order of the timestamp. I tried passing the query to the Database Designer, but it doesn't suggest any projection for the descending order; it already has a projection that orders by the timestamp ascending. I also tried creating a projection with the timestamp ordered descending, but Vertica throws an error - "Projections can only be sorted in ascending order".
Since UI needs to show events in descending order of timestamp the SORT cost of the query is very high - can we optimize it in any way?
The following query is very slow (the SORT takes a lot of time even if I supply an event_timestamp filter to consider only one day's worth of events):
select * from public.f_network_events order by event_timestamp desc limit 1000;
You can't ORDER BY ts DESCENDING in a projection in Vertica, I'm afraid.
The trick I use for this necessity is to add a column:
tssort INTEGER DEFAULT TIMESTAMPDIFF(microsecond, ts, '2100-01-01'::TIMESTAMP)
... then sort the projection by that column, calculate the same TIMESTAMPDIFF() in the query, and use it in the WHERE condition.
Not of breathtaking beauty, I agree, but worth the trouble in Big Data scenarios ...
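A rough sketch of that trick, assuming the table from the question (public.f_network_events with an event_timestamp column); the added column, the projection name and the cutoff date are illustrative, and details vary by Vertica version:

ALTER TABLE public.f_network_events
    ADD COLUMN tssort INTEGER
    DEFAULT TIMESTAMPDIFF(microsecond, event_timestamp, '2100-01-01'::TIMESTAMP);

-- Ascending tssort corresponds to descending event_timestamp,
-- so the projection can stay ascending.
CREATE PROJECTION public.f_network_events_desc AS
    SELECT * FROM public.f_network_events
    ORDER BY tssort;

-- "Newest 1000 events since an example cutoff date" without a heavy SORT:
SELECT *
FROM public.f_network_events
WHERE tssort <= TIMESTAMPDIFF(microsecond, '2024-06-01'::TIMESTAMP, '2100-01-01'::TIMESTAMP)
ORDER BY tssort
LIMIT 1000;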

Sorting on ID (int PK) vs creation time

Say I have blog post comments. On insert they get the current utc date time as their creation time (via sysutcdatetime default value) and they get an ID (via integer identity column as PK).
Now I want to sort the comments descending by their age. Is it safe to just do a ORDER BY ID or is it required to use the creation time? I'm thinking about "concurrent" commits and rollbacks of inserts and an isolation level of read committed. Is it possible that the IDs sometimes do not represent the insert order?
I'm asking this because if sorting by IDs is safe then I could have the following benefits:
I don't need an index for the creation time.
Sorting by IDs is probably faster.
I don't need a high precision on the datetime2 column because that would only be required for sorting anyway (in order to not have two rows with the same creation time).
This answer says it is possible when you don't have the creation time but is it always safe?
This answer says it is not safe with an identity column, but when the identity column is also the PK, that answer gives an example of sorting by ID without mentioning whether this is safe.
Edit:
This answer suggests sorting by date and then by ID.
Yes, the IDs can be jumbled because ID generation is not part of the insert transaction. This is in order to not serialize all insert transactions on the table.
The most correct way to sort would be ORDER BY DateTime DESC, ID DESC with the ID being added as a tie breaker in case the same date was generated multiple times. Tie breakers in sorts are important to achieve deterministic results. You don't want different data to be shown for multiple refreshes of the page for example.
You can define a covering index on DateTime DESC, ID DESC and achieve the same performance as if you had ordered by the clustered index key (here: ID). There's no relevant physical difference between the clustered index and nonclustered indexes.
Since you mention the PK somewhere I want to point out that the choice of the PK does not affect any of this. Only indexes do. The query processor does not ever care about PKs and unique keys.
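A minimal T-SQL sketch of that covering index and query; the table and column names (dbo.Comments, CreatedUtc, Body) are assumptions, not the poster's actual schema:

-- Covering index that matches the desired sort order.
CREATE NONCLUSTERED INDEX IX_Comments_CreatedUtc_Id
    ON dbo.Comments (CreatedUtc DESC, Id DESC)
    INCLUDE (Body);

-- Newest comments first, with ID as the deterministic tie breaker;
-- the index above lets this run without an extra sort step.
SELECT TOP (50) Id, CreatedUtc, Body
FROM dbo.Comments
ORDER BY CreatedUtc DESC, Id DESC;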
I would order by ID.
Technically you may get different results when sorting by ID vs sorting by time.
SYSUTCDATETIME() is evaluated at some point during the insert, while the ID could be generated at a different point during the transaction. Also, the clock on any computer always drifts. When the computer clock is synchronized with the time source, the clock may jump forward or backwards. If you sync often, the jump will be small, but it will happen.
From the practical point of view, if two comments were posted within, say, one second of each other, does it really matter which of these comments is shown first?
What I think does matter is the consistency of the display results. If the system somehow decides that comment A should go before comment B, then this order should be preserved everywhere across the system.
So, even with the highest precision datetime2(7) column it is possible to have two comments with exactly the same timestamp and if you order just by this timestamp it is possible that sometimes they will appear as A, B and sometimes as B, A.
If you order by ID (primary key), you are guaranteed that it is unique, so the order will be always well defined.
I would order by ID.
On a second thought, I would order by time and ID.
If you show the time of the comment to the user it is important to show comments according to this time. To guarantee consistency sort by both time and ID in case two comments have the same timestamp.
If you sort by ID in descending order and you are filtering by user, your blog will automatically show the latest posts at the top, which does the job for you, so don't use the date for sorting.

Are attributes of a dimension in hierarchical order?

Do the different 'attributes' of a dimension of an OLAP cube have to have a hierarchical order? If not, would the corresponding cube store the results for each possible combination of the dimension attributes?
Let us assume a cube with only two dimensions: time and product.
Time (year, quarter, month, day)
Product (product channel [direct vs. indirect], product group)
While the attributes (what are these called technically?) of the time dimension are clearly strictly hierarchical, the two attributes of the product dimension are not. We may group either by channel then product group, or by product group then channel (depending on which one comes first).
Is such a (non-hierarchical) dimension even possible? If so, which aggregations would the cube store - each combination (an aggregation grouped first by channel then by product group, and the other way around)?
I think Attributes is a perfectly fine name for them - I knew exactly what you meant.
Dimensions don't have to be hierarchical, and very often aren't.
As to which aggregations it will store, there is no simple answer. It will depend on what DBMS you are using, and what you tell it to do. For example with SQL Server (SSAS) you can tell it to precalculate a given percentage of results, from 0 to 100. However within that you can't tell it which ones: it'll do that itself; you can only tell it e.g. 50%. I usually specify 100%.
Other DBMS's will have different facilities.

How to change sort order of a large amount of database records

Say you have 1,000,000 db records.
Say each item has an order, 1->1,000,000.
Say I want to move item 2 to 823,423.
This means each item between 2 and 823,423 needs to be decremented by 1, to maintain order and stay unique.
This seems a pretty intensive task.
What solutions are there to this? A linked list in the database? A non-unique priority field?
Databases do not store records in sorted order. A database uses an index to maintain sort order.
See http://en.wikipedia.org/wiki/Database_index
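For what it's worth, the shift the question describes can at least be done with a few set-based statements rather than 823,421 single-row updates. A rough sketch in generic SQL, assuming a table items(id, sort_order); the names and transaction syntax are illustrative:

-- Move the row at position 2 to position 823423 in one transaction.
BEGIN;

UPDATE items SET sort_order = 0                  -- park the moving row outside the live range
WHERE sort_order = 2;

UPDATE items SET sort_order = sort_order - 1     -- close the gap in a single set-based update
WHERE sort_order BETWEEN 3 AND 823423;

UPDATE items SET sort_order = 823423             -- drop the parked row into its new slot
WHERE sort_order = 0;

COMMIT;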

Calculated Measure aggregating on certain cells only

I'm trying to figure out how I can create a calculated measure that produces a count of only unique facts in my fact table. My fact table basically stores events from a historical perspective. But I need the measure to filter out redundant events.
Using sales as an example(Since all material around OLAP always uses sales in examples):
The fact table stores sales EVENTS. When a sale is first made it has a unique sales reference which is a column in the fact table. A unique sale however can be amended(Items added or returned) or completely canceled. The fact table stores these changes to a sale as different rows.
If I create a count measure using SSAS, I get a count of all sales events, which means a unique sale will be counted multiple times, once for every change made to it (which in some reports is desirable). However, I also want a measure that produces a count of unique sales rather than events, and not just by counting unique sales references. If the user filters by date then they should see the unique sales that still exist on that date (if a sale was cancelled by that date it should not be represented in the count at all).
How would I do this in MDX/SSAS? It seems like I need to have a count query work on a subset produced by a query that finds the latest change to a sale based on the time dimension.
In SQL it would be something like:
SELECT COUNT(*) FROM SalesFacts FACT1 WHERE Event <> 'Cancelled' AND
Timestamp = (SELECT MAX(Timestamp) FROM SalesFacts FACT2 WHERE FACT1.SalesRef=FACT2.SalesRef)
Is it possible or even performant to have subqueries in MDX?
In SSAS, create a measure that is based on the unique transaction ID (the sales number, or order number), then set that measure's aggregate function to 'DistinctCount' in the properties window.
Now it should count distinct order numbers within whichever dimension slice it finds itself under.
The posted query could then be rewritten like this:
SELECT COUNT(DISTINCT SalesRef)
FROM SalesFacts
WHERE Event <> 'Cancelled'
A simple answer would be just to have a 'sales count' column in your fact view / DSV query that supplies a 1 for an 'initial' event, a zero for all subsequent revisions to the event, and a -1 if the event is cancelled. This 'journalling' approach plays nicely with incremental fact table loads.
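A rough illustration of that journalling column, assuming the SalesFacts table and columns from the question; the view name and the CASE logic are a sketch only, not the actual DSV:

CREATE VIEW dbo.vw_SalesFacts AS
SELECT
    f.SalesRef,
    f.Event,
    f.Timestamp,
    CASE
        WHEN f.Event = 'Cancelled' THEN -1   -- a cancellation removes the sale from the count
        WHEN f.Timestamp = (SELECT MIN(f2.Timestamp)
                            FROM SalesFacts f2
                            WHERE f2.SalesRef = f.SalesRef) THEN 1   -- the initial event
        ELSE 0                               -- later amendments leave the count unchanged
    END AS SalesCount
FROM SalesFacts f;

-- SUM(SalesCount) as the cube measure then yields the number of sales still live on the slice.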
Another approach, probably more useful in the long run, would be to have an Events dimension: you could then expose a calculated measure that was the count of the members in that dimension non-empty over a given measure in your fact table. However for sales this is essentially a degenerate dimension (a dimension based on a fact table) and might get very large. This may be inappropriate.
Sometimes the requirements may be more complicated. If you slice by time, do you need to know all the distinct events that existed then, even if they were later cancelled? That starts to get tricky: there's a recent post on Chris Webb's blog where he talks about one (slightly hairy) solution:
http://cwebbbi.wordpress.com/2011/01/22/solving-the-events-in-progress-problem-in-mdx-part-2role-playing-measure-groups/

Resources