Storing occurrences for reporting - database

What is the best way to store occurrences of an event in a database so you can quickly pull reports on it? e.g. total number of occurrences, or number of occurrences within a date range.
Right now I have two database tables: one which holds every individual timestamp of the event, so I can query on a date range, and one which holds a total count, so I can quickly pull that number for a tally.
Table 1:
Event | Total_Count
------+------------
bar | 1
foo | 3
Table 2:
Event | Timestamp
------+----------
bar | 1/1/2010
foo | 1/1/2010
foo | 1/2/2010
foo | 1/2/2010
Is there a better approach to this problem? I'm thinking of converting Table 2 to hold date tallies; it should be more efficient, since my date range queries are only done on whole dates, not timestamps (1/1/2010 vs 1/1/2010 00:01:12).
i.e.:
Updated Table 2
Event | Date | Total_Count
------+----------+------------
bar | 1/1/2010 | 1
foo | 1/1/2010 | 1
foo | 1/2/2010 | 2
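A date-range tally against that table would then just be a sum over the daily rows, something like this sketch (table and column names are illustrative, and the Date column is renamed event_date here to avoid the reserved word):
SELECT SUM(Total_Count) AS occurrences
FROM   event_daily_counts      -- the proposed updated Table 2
WHERE  Event = 'foo'
AND    event_date BETWEEN '2010-01-01' AND '2010-01-02';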
Perhaps there's an even smarter way to tackle this problem? Any ideas?

Your approach seems good. I see table 2 more as a detail table, and table 1 as a summary table. For the most part, you would be doing inserts only to table 2, and inserts and updates on table 1.
The updated table 2 may not give you much additional benefit. However, you should consider it if aggregation by day is what matters most to you.
You may consider adding more attributes (columns) to the tables. For example, you could add a first_date and a last_date to table 1.
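For instance, a rough sketch of how the summary row might be maintained on each new occurrence, assuming the row already exists (table and column names are illustrative, and <event> / <occurred_date> stand for the incoming values):
UPDATE event_summary     -- table 1 with the extra columns
SET    Total_Count = Total_Count + 1,
       first_date  = CASE WHEN <occurred_date> < first_date THEN <occurred_date> ELSE first_date END,
       last_date   = CASE WHEN <occurred_date> > last_date  THEN <occurred_date> ELSE last_date  END
WHERE  Event = <event>;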

I would just have the one table with the timestamp of your event(s). Then your reporting is simply a matter of setting up your WHERE clause correctly...
Or am I missing something in your question?
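For example, a minimal sketch against a single hypothetical events(Event, event_timestamp) table, not tied to any particular DBMS:
-- total number of occurrences
SELECT COUNT(*) FROM events WHERE Event = 'foo';

-- occurrences within a date range; a half-open range still works once full timestamps are stored
SELECT COUNT(*)
FROM   events
WHERE  Event = 'foo'
AND    event_timestamp >= '2010-01-01'
AND    event_timestamp <  '2010-01-03';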

Seems like you don't really have any requirements:
Changing from timestamp to just the date portion is a big deal.
You don't ever want to do a time-of-day analysis?
Like, what's the best time of day to do maintenance if that would stop "foo" from happening?
And you're not worried about size? You say you have millions of records (like that's a lot) and then you extend every single row by an extra column. One column isn't a lot until the row count skyrockets and then you really have to think about each column.
So to get the sum of events for the last 3 days you'd rather do this:
SELECT SUM(totcnt) FROM (
    SELECT MAX(Total_count) AS totcnt FROM table WHERE date = today AND event = 'Foo'
    UNION ALL
    SELECT MAX(Total_count) FROM table WHERE date = today - 1 AND event = 'Foo'
    UNION ALL
    SELECT MAX(Total_count) FROM table WHERE date = today - 2 AND event = 'Foo'
) AS t
Yeah, that looks much easier than:
SELECT COUNT(*) FROM table WHERE date BETWEEN today - 2 AND today AND event = 'foo'
And think about the trigger it would take to add a row... get the max for that day and event and add one... every time you insert?
Not sure what kind of server you have, but I summed 1 million rows in 285 ms. So... how many millions will you have, how many times do you need to sum them, and is each time for the same date range or completely random?

Related

Can I create a Running Totals Calculated Column in AG Grid

I want to create a new Custom Column in AG Grid which will display the calculated value of another column together with the value of the column in the previous row.
We have created lots of calculated columns in AdapTable but I cannot work out how to do this.
In our example we have a Price and a Date Column and a Running Price Calculated Column.
For the row where Date is Today, I want the value in the Running Price column to be the 'Price' in this row plus whatever the Running Price value is in the row where Date is Yesterday.
And for yesterday's row I want Running Price to include the value for 2 days ago. And so on.
Perhaps this example will help explain:
Price | Date | Running Price
5 | 2 Days Ago | 10
7 | Yesterday | 17
9 | Today | 26
If I can do this without needing to sort AG Grid on the Date column then even better, as my users like to do their own sorts and I don't want that to break the running total.
Yes, this can be done fairly easily in AdapTable.
You need to use what it calls an AggregatedScalarQuery.
Assuming that the columns in your grid are called 'Price', and 'MyDate' then the Expression for the 'RunningPrice' Calculated Column will be something like:
CUMUL(SUM([Price]), OVER([MyDate]))
See more at: https://docs.adaptabletools.com/guide/adaptable-ql-expression-aggregation-scalar#cumulative-aggregation
Edit: I should add that you don't need to sort the 'MyDate' column as per your initial message, since OVER will run over the dates in natural sort order. So your users can continue to sort AG Grid however they like without it affecting your Calculated Column.

Is there a better way to reconstruct a data from hundreds of millions of entries spread over a long period of time?

(first of all - apologies for the title but I couldn't come up with a better one)
Here is my problem - I have a table with 4 columns - entity::INT, entry::TEXT, state::INT and day::INT.
There could be anywhere from 50 to 1,000 entities. Each entity can have over 100 million entries. Each entry can have one or more states, which change when the data stored in the entry changes, but only one state can be written for any particular day. The day starts at one and is incremented each day.
Example:
entity | entry | state | day
-------------------------------------
1 | ABC123 | 1 | 1
1 | ABC124 | 2 | 1
1 | ABC125 | 3 | 1
...
1 | ABC999 | 999 | 1
2 | BCD123 | 1000 | 1
...
1 | ABC123 | 1001 | 2
2 | BCD123 | 1002 | 3
The index is set to (entity, day, state).
What I want to achieve is to efficiently select the most current state of each entry on day N.
Currently, every week I write all the entries with their latest state to the table to minimize the number of days that need to be scanned. However, given the total number of entries (worst case scenario: 1,000 entities times 100,000,000 entries is a lot of rows to write each week), the table slowly but surely bloats and everything becomes really slow.
I need to be able to stop writing this "full" version weekly and instead have a setup that will still be fast enough. I considered using DISTINCT ON with a different index set to (entity, entry, day DESC, state) so that I could:
SELECT DISTINCT ON (entity, entry) entry, state
FROM table
WHERE entity = <entity> AND day <= <day>
ORDER BY entity, entry, day DESC, state;
Would that be the most efficient way to do it, or are there better ways? Or does entry possibly having hundreds of millions of unique values make it a poor choice for the second column in the index, so that performance will eventually grind to a halt?
You want to rank the entries by time and take the latest one. That's the same as ranking them in reverse time order and taking the first one. And ROW_NUMBER() is one way to do that.
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY entity, entry
            ORDER BY day DESC
        ) AS entity_entry_rank
    FROM yourTable
)
SELECT *
FROM ranked
WHERE entity_entry_rank = 1
The day column can then become a timestamp, and you don't need to store a new copy every day.
The appropriate index would be (entity, entry, timestamp)
Also, it's common to have two tables: one with the history, one with the latest value. That makes using the current value quicker, at a minor cost in disk space.
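A sketch of that pattern, assuming PostgreSQL (which the DISTINCT ON in your question suggests) and illustrative names:
-- hypothetical "latest state" table alongside the history table
CREATE TABLE entry_latest (
    entity int  NOT NULL,
    entry  text NOT NULL,
    state  int  NOT NULL,
    day    int  NOT NULL,
    PRIMARY KEY (entity, entry)
);

-- on each new state: insert into the history table as before, then upsert here
INSERT INTO entry_latest (entity, entry, state, day)
VALUES (<entity>, <entry>, <state>, <day>)
ON CONFLICT (entity, entry) DO UPDATE
SET state = EXCLUDED.state, day = EXCLUDED.day
WHERE EXCLUDED.day >= entry_latest.day;   -- only keep the newer state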
(Apologies for errors or formatting, I'm on my phone.)
DISTINCT ON is simple, and performance is great - for few rows per entry. See:
Select first row in each GROUP BY group?
Not for many rows per entry, though.
Each entity can have over 100 million entries
See:
Optimize GROUP BY query to retrieve latest row per user
Assuming an entry table that holds one row for each existing entry (each relevant distinct combination of (entity, entry)), this query is very efficient to get the latest state for a given day:
SELECT e.entity, e.entry, t.day, t.state
FROM   entry e
LEFT   JOIN LATERAL (
   SELECT day, state
   FROM   tbl
   WHERE  (entity, entry) = (e.entity, e.entry)
   AND    day <= <day>   -- given day
   ORDER  BY day DESC
   LIMIT  1
   ) t ON true
ORDER  BY e.entity, e.entry;  -- optional
Use CROSS JOIN LATERAL instead of the LEFT JOIN if you only want entries that have at least one row in tbl.
The perfect index for this is on (entity, entry, day) INCLUDE (state).
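For instance (the index name is arbitrary):
CREATE INDEX tbl_entity_entry_day_idx ON tbl (entity, entry, day) INCLUDE (state);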
If you have no entry table, consider creating one. (Typically, there should be one.) The rCTE techniques outlined in the linked answer above can also be used to create such a table.
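A simple (if slower than the rCTE) way to create and key it once, with illustrative names:
CREATE TABLE entry AS
SELECT DISTINCT entity, entry
FROM   tbl;

ALTER TABLE entry ADD PRIMARY KEY (entity, entry);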

What is the best approach to insert a record between two sequential rows?

I have a simple sequential table such as this:
ID | Name | Rank
=======+======+=====
327 | Ali | 1
-------+------+-----
846 | Sara | 2
-------+------+-----
657 | Dani | 3
-------+------+-----
...
The ID is the primary key and is indexed; the Rank is indexed too. I have a lot of these records, and what I want is to insert a record between existing rows of this table in SQL Server while keeping the sequence intact, without breaking the ranking.
For example, if I insert Sahar into the above table with rank 2, it causes the greater ranks to shift, so:
ID | Name | Rank
=======+======+=====
327 | Ali | 1
-------+------+-----
196 | Sahar| 2 ----> Inserted
-------+------+-----
846 | Sara | 3
-------+------+-----
657 | Dani | 4
-------+------+-----
...
I have searched and found some solutions, for instance:
UPDATE table SET Rank += 1 WHERE Rank >= 2;
INSERT INTO table (Rank, ...) VALUES (2, ...);
in this answer; another approach may be this answer. I found some other answers as well, but all of them have a heavy operational cost.
I may also need to change some ranks or swap two ranks.
On the other hand, I may have to do this in UPDATE, DELETE and INSERT triggers, or wherever else you recommend.
Is there a mechanism such as identity(1,1) or another built-in service in SQL Server which aims to solve this issue?
If not, what is the best approach to this operation in terms of performance? (The answer should include a good explanation of where to implement it (a trigger or elsewhere), any indexing issues, and whether I may need to change my table definition.)
Thanks.
If your table has N rows with Rank from 1 to N, and you insert a new row with Rank=2, then you'll have to UPDATE (i.e. change) values in N-1 rows. You'll have to write a lot of changes to the table. I'm afraid there is no magic way to speed it up.
If you really have to update the Rank, that is.
But, maybe, you don't really need to have the Rank as an integer without gaps.
The real purpose of the Rank is to define a certain order of rows. To define an order you need to know which row comes after which row. So, when a user says that he wants to add Sahar with ranking 2, it really means that Sahar should go after Ali but before Sara, so the rank of the new row can be set to, say, (1+2)/2 = 1.5.
So, if you make Rank a float you would be able to insert new rows in the middle of the table without changing values of the Rank of all other rows.
If you want to present Rank to the user as a sequence of integer numbers without gaps use something like:
ROW_NUMBER() OVER(ORDER BY FloatRank) AS IntegerRankWithoutGaps
Besides, if you delete a row, you don't need to update all rows of a table as well. The persisted FloatRank value would have a gap, but it will disappear when ROW_NUMBER is applied.
Technically, you can't keep dividing an 8-byte float interval in half indefinitely, so once in a while you should run a maintenance procedure that "normalizes" your float ranks and updates all rows in a table. But, at least this costly procedure could be run not often and when the load on the system is minimal.
Also, you don't have to start with float values that are 1 apart; you can space them further apart. For example, instead of 1, 2, 3, 4... start with 1000, 2000, 3000, 4000, ....
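A sketch of the whole idea, with illustrative table/column names (the table is called People and the new column FloatRank here):
-- insert Sahar between Ali (FloatRank 1000) and Sara (FloatRank 2000)
INSERT INTO People (ID, Name, FloatRank)
VALUES (196, 'Sahar', (1000.0 + 2000.0) / 2);   -- 1500, no other rows touched

-- present gapless integer ranks to the user
SELECT ID, Name,
       ROW_NUMBER() OVER (ORDER BY FloatRank) AS IntegerRankWithoutGaps
FROM   People;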
The simplest and most transparent way is to add a column that defines the sequence of rows in your table and then create a clustered index on that column. This index will keep the data in the order you want.

Retrieving data from 2 tables that have a 1 to many relationship - more efficient with 1 query or 2?

I need to selectively retrieve data from two tables that have a 1 to many relationship. A simplified example follows.
Table A is a list of events:
Id | TimeStamp | EventTypeId
--------------------------------
1 | 10:26... | 12
2 | 11:31... | 13
3 | 14:56... | 12
Table B is a list of properties for the events. Different event types have different numbers of properties. Some event types have no properties at all:
EventId | Property | Value
------------------------------
1 | 1 | dog
1 | 2 | cat
3 | 1 | mazda
3 | 2 | honda
3 | 3 | toyota
There are a number of conditions that I will apply when I retrieve the data, however they all revolve around table A. For instance, I may want only events on a certain day, or only events of a certain type.
I believe I have two options for retrieving the data:
Option 1
Perform two queries: first query table A (with a WHERE clause) and store the data somewhere, then query table B (joining on table A in order to use the same WHERE clause) and "fill in the blanks" in the data that I retrieved from table A.
This option requires SQL Server to perform 2 searches through table A, however the resulting 2 data sets contain no duplicate data.
Option 2
Perform a single query, joining table A to table B with a LEFT JOIN.
This option only requires one search of table A but the resulting data set will contain many duplicated values.
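For concreteness, the two options might look roughly like this (the EventTypeId filter is just an example):
-- Option 1: two queries, two searches of table A, no duplicated event data
SELECT Id, TimeStamp, EventTypeId
FROM   TableA
WHERE  EventTypeId = 12;

SELECT b.EventId, b.Property, b.Value
FROM   TableB b
JOIN   TableA a ON a.Id = b.EventId
WHERE  a.EventTypeId = 12;

-- Option 2: one query, event columns repeated for every property row
SELECT a.Id, a.TimeStamp, a.EventTypeId, b.Property, b.Value
FROM   TableA a
LEFT JOIN TableB b ON b.EventId = a.Id
WHERE  a.EventTypeId = 12;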
Conclusion
Is there a "correct" way to do this or do I need to try both ways and see which one is quicker?
For example, a join -
SELECT E.Id, E.Name FROM Employee E JOIN Dept D ON E.DeptId = D.Id
and a subquery something like this -
SELECT E.Id, E.Name FROM Employee E WHERE E.DeptId IN (SELECT Id FROM Dept)
When I consider performance, which of the two queries would be faster, and why?
I would EXPECT the first query to be quicker, mainly because you have an equivalence and an explicit JOIN. In my experience IN is a very slow operator, since SQL normally evaluates it as a series of WHERE clauses separated by "OR" (WHERE x=Y OR x=Z OR ...).
As with ALL THINGS SQL though, your mileage may vary. The speed will depend a lot on indexes (do you have indexes on both Id columns? That will help a lot...) among other things.
The only REAL way to tell with 100% certainty which is faster is to turn on performance tracking (IO statistics is especially useful) and run them both. Make sure to clear your cache between runs!

How to insert a row between two rows and give it priority in a database?

I have a stack of messages in a database table.
I want to send these messages by their priority, so I added a "Priority" column to the "messages" table.
But what if I want to insert a "cram" message between two messages and give this new message the priority of the previous one?
Should I update the priority of every message below this one?
So please give me the perfect design for my database table to support priority updates.
Use a float column for Priority rather than an int.
Then, to insert a message between two others, assign the average of the two messages' Priority values as the new message's Priority. (E.g., to insert a cram message between a messages with Priority 2 and 3, assign it a Priority of 2.5).
By doing this, you don't have to update any other messages' priority, and you can continue to average/insert between those, etc. until you bump up against the decimal accuracy limits of a float (which will take a while, especially if the raw Priority values tend to be small).
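A sketch with illustrative names (the Body column and the Id values are made up), cramming a new message between two existing ones by averaging their priorities:
INSERT INTO messages (Body, Priority)
SELECT 'cram message', (a.Priority + b.Priority) / 2.0
FROM   messages a, messages b
WHERE  a.Id = 17    -- the message it should come after
AND    b.Id = 42;   -- the message it should come before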
Or, add another column after Priority in the ORDER BY. In the simplest case, use a bit column called "ShowAfter" with a default value of 0. When you insert a cram message, give it the same Priority as the message you want it to appear after, but a ShowAfter value of 1.
Just a wild idea, and I haven't tested this for performance, but a linked-list kind of structure should get you what you want here. At most you will only need to change 3 records:
Find out where you want to put your new record,
and note which record comes before it and which record comes after it.
Insert the new record, setting its previous record and next record.
Relink the previous and next records accordingly so they point at the new record.
You do this by adding 2 fields (next and previous) to the schema.
Just include a timestamp column with the default getdate() value. This way, when sending messages, order by priority asc, createtime desc.
If you don't always want to do Last-In-First-Out (LIFO), you can do order by priority, senddate and then set senddate to 1/1/1900 for anything you want pushed out first.
If you want to do it by some method of ranking them, you'd have to update every single row below a given priority if you wanted to "cram" a message in. With a getdate() default column, you just don't have to worry about that.
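For example (column names as used above; a sketch only):
-- normal send order
SELECT * FROM messages ORDER BY priority ASC, createtime DESC;

-- with the priority/senddate variant, push a given message out first
UPDATE messages SET senddate = '19000101' WHERE Id = 42;   -- 42 is just an example id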
Interesting, maybe you could use an identity column as the primary key but use an increment that skips a few values?
This way, you would reserve space should you need to insert/update a message's priority to be between an existing boundary.
Make sense?
Similar to #Jimmy Chandra's idea, but use a single link column.
So, you might have these columns:
ID | SortAfterID | OtherColumn1 | OtherColumn2
Now, let's say you have five records with IDs 1 through 5, and you want record 5 to be sorted between 2 and 3. Your table would look something like this:
ID | SortAfterID | OtherColumn1 | OtherColumn2
1 | NULL | ... | ...
2 | 1 | ... | ...
3 | 5 | ... | ...
4 | 3 | ... | ...
5 | 2 | ... | ...
I would set a constraint so that SortAfterID references ID.
If you now wanted to insert a new record (ID = 6) that goes between 1 and 2, you would:
Insert a new record with ID = 6 and SortAfterID = 1.
Update record with ID = 2 so that SortAfterID = 6.
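In SQL, those two steps might be (assuming the table is called Messages; only the linking columns are shown):
INSERT INTO Messages (ID, SortAfterID) VALUES (6, 1);        -- step 1: 6 sorts after 1

UPDATE Messages SET SortAfterID = 6 WHERE ID = 2;            -- step 2: 2 now sorts after 6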
I think this should be pretty low maintenance, and it's guaranteed to work no matter how many times you cram. The floating point number idea by #richardtallent should work as well, but as he mentioned, you could run into a limit.
EDIT
I just noticed the paragraph at the end of #richardtallent's answer that mentions this same idea, but since I typed this all out, I figure I'll keep it here since it provides a little extra detail.
