We have 2 tables. One holds measurements, the other one holds timestamps (one for every minute)
every measurement holds a FK to a timestamp.
We have 8M (million) measurements, and 2M timestamps.
We are creating a report database via replication, and my first solution was this: when a new measurement was received via the replication process, lookup the right timestamp and add it to the measurement table.
Yes, it's duplication of data, but it is for reporting and since we have measurements every 5 minutes and users can query for yearly data (105.000 measurements) we have to optimize for speed.
But a co-developer said: you don't have to do that, we'll just query with a join (on the two tables), SqlServer is so fast, you don't see the difference.
My first reaction was: a join on two tables with 8M and 2M records can't make 'no difference'.
What is your first feeling on this?
EDIT:
new measurements: 400 records per 5 minutes
EDIT 2:
maybe the question is not so clear:
the first solution is to get the data from the timestamp table and copy it to the measurement table when the measurement record is inserted.
In that case we have an action when the record is inserted AND an extra (duplicated) timestamp value. In this case we lonly query ONE table because it holds all the data.
The second solution is to join the two tables in a query.
With the proper index the join will make no difference*. My initial thought is that if the report is querying over the entire dataset, the joins might actually be faster because there is literally 6 million fewer timestamps that it has to read from the disk.
*This is just a guess based on my experience with tables with millions of records. You results will vary based on your queries.
I'd create an Indexed View (similar to a Materialized view in Oracle) which joins the tables using appropriate indexes.
If the query just retrieves the data for the given date ranges, there will be a merge join - that is, a range scan for each of tow tables. Since the timestamp table presumably contains only timestamp, this shouldn't be expensive.
On the other hand, if you have only one table and index on the date column, the index itself becomes larger and more expensive to scan.
So, with properly constructed indexes and queries I won't expect a significant difference in performance.
I'd suggest you to keep properly normalized design until you start having performance problems that force you to change it. And then you need to carefully analyze query plans and measure performance with different options - there're lots of thing that could matter in your particular case.
Frankly in this case your best bet is try both solutions and see which one is better. Performance tuning is an art when you start talking about large data sets and is highly dependant onthe not only the database design you have but the hardware and the whther you are using partioning, etc. Be sure to test both getting the data out and putting the data in. Since you have so many inserts, insert speed is critical and tthe index you would need on on the datetime field is critical to select performance, so you really need to thouroughly test this. Don't forget about dumping the cache when you test. And test multiple times and if possible test under a typical query load.
Related
I have a table that consists of 56millions rows.
This table is handling high load of UPSERTS every 5 minutes as it's loading streaming data from KAFKA. Approx 200-500k updates every load.
When I run a SELECT with an ORDER BY against one of the timestamp columns, it takes over 5-7 minutes to return a result.
I tried Cluster Key for that column but since there is a high DML operation on that table and high cardinality on the column itself, the clustering was ineffective and costly.
So far, the only think that has significantly reduced query time to about 15 seconds is increasing the warehouse size to an X-Large from a Small.
I am not convinced that the only solution is to increase the warehouse size. Any advice here would be great!
Clustering on date(timestamp) (or something that's lower cardinality) would be more effective, although because of the volume of updates it will still be expensive.
At a happy hour event, I heard a Snowflake user that achieved acceptable results on a similar (ish) scenario by clustering on late arriving facts (e.g. iff(event_date<current_date, true, false))) (although I think they were INSERTing not UPSERTing and in the later case the micropartitions have to be re-written anyway so it might not help much.)
There are other things to consider too.
Inspect the query plan to confirm that ordering is the problem (e.g is a lot of time spent on ordering.) Without seeing your actual query, I wonder if a majority of the time is spent on the table scan (when it is grabbing the data from remote storage.) If a larger warehouse improves performance, this is likely the case since every added node in the cluster means more micro-partitions can be read concurrently.
Are you running against:
A true timestamp column?
A JSON column cast as time stamp but no
additional function?
How many fields in the JSON
What is the relative ratio of UPDATEs to INSERTs?
Have you looked at the cluster statistics?
I am fetching data from a table having 20K row from a third-party source where the way of filling the table can't be changed table.
On the third party, the table is filled as following
New data is coming at every 15 seconds approx 7K rows.
At any given time only the last three timestamps will be available rest data will be deleted.
No index on the table is there. Neither it can be requested due to unavoidable reasons and might be slowness in the insert.
I am aware of the following
Row locks and up the hierarchy other locks are being taken while data insert.
The problem persists with select with NO LOCK.
There is no Join with any other table while fetching as we are joining the tables when data is at the local with us in the temp table.
When the data insertion is stopped at the third party the data comes in 100ms to 122ms.
When service is on it takes 3 to 5 seconds.
Any help/suggestion/approach is appreciated in advance.
The following is a fairly high-end solution. Based on what you have said I believe it would work, but there'd be a lot of detail to work out.
Briefly: table partitions.
Set up a partition scheme on this table
Based on an article I read recently, this CAN be done with unindexed heaps
Data is loaded every 15 seconds? Then the partitions need to be based on those 15 second intervals
For a given dataload (i.e. once per 15 seconds):
Create the "next" partition
Load the data
SWITCH the new partition (new data) into the main table
SWITCH the oldest partition out (data for only three time periods present at a time, right?
Drop that "retired" partition
While potentially efficient and effective, this would be very messy. The big problem I see is, if they can't add a simple index, I don't see how they could possibly set up table partitioning.
Another similar trick is to set up partitioned views, which essentially is "roll your own partitioning". This would go something like:
Have a set of identically structured tables
Create a view UNION ALLing the tables
On dataload, create a new table, load data into that table, then ALTER VIEW to include that newest table and remove the oldest table.
This could have worse locking/blocking issues than the partitioning solution, though much depends on how heavy your read activity is. And, of course, it is much messier than just adding an index.
This is the re-submission of my previous question:
I have a collection of ordered time-series data(stock minute price information). My current database structure using PostgreSQL is below:
symbol_table - where I keep the list of the symbols with the symbol_id as a primary key(serial).
time_table, date_table - time/date values are stored there. time_id/date_id are primary keys(serial/serial).
My main minute_table contains the minute pricing information where
date_id|time_id|symbol_id are primary keys(also foreign keys from the corresponding tables)
Using this main minute_table I'm performing different statistical analyses and keep the results in a separate tables, like one_minute_std - where one minute standard deviation measures are kept.
Every night I'm updating the tables with the current price information from the last day's closing prices.
With the current implementation my tables contain all the symbols with around 50m records each.
Primary keys are indexed.
If I want to query for all the symbols where closing price > x and one_minute_std >2 and one_minute_std < 4 for the specific date it takes about 3-4 minutes for the search.
To speed up the process I was thinking of separating each symbol to its own table but not 100% sure if this is a 'proper' way of doing it.
Could you advise me on how I can speed up the query process?
It sounds like you want a combination of approaches.
First, you should look into table partitioning. This stores a single table across multiple storage units ("files"), but still gives you the flexibility of a single table. (Here is postgres documentation http://www.postgresql.org/docs/current/interactive/ddl-partitioning.html).
You would want to partition either by day or by ticker symbol. My first reaction would be by time (day/week/month), since that is the unit of updates. However, if you analyses are only by a single ticker and often span multiple days, then there is an argument for using that instead.
After partitioning, you may want to consider indexes. However, I suspect that partitioning will solve your performance problems.
Since your updates are at night, you should be folding in your summarization process in with the updates. For instance, one_minute_std should be calculated during this process. You might find it best to load the nightly data into a temporary table, do the calculation for summaries such as one_minute_std, and then load the data into the final partitioned table scheme.
With so many rows that have so few columns, you are probably better off with a good partitioning scheme than an indexing scheme. In particular, indexes have a space overhead, and the smaller the record in each row, the more that using the index incurs an overhead comparable to scanning the entire table.
Which option is better:
Writing a very complex query having large number of joins, or
Writing 2 queries one after the other, applying the obtained result set of the processed query on other.
Generally, one query is better than two, because the optimizer has more information to work with and may be able to produce a more efficient query plan than either separately. Additionally, using two (or more) queries typically means you'll be running the second query multiple times, and the DBMS might have to generate the query plan for the query repeatedly (but not if you prepare the statement and pass the parameters as placeholders when the query is (re)executed). This means fewer back and forth exchanges between the program and the DBMS. If your DBMS is on a server on the other side of the world (or country), this can be a big factor.
Arguing against combining the two queries, you might end up shipping a lot of repetitive data between the DBMS and the application. If each of 10,000 rows in table T1 is joined with an average of 30 rows from table T2 (so there are 300,000 rows returned in total), then you might be shipping a lot of data repeatedly back to the client. If the row size of (the relevant projection of) T1 is relatively small and the data from T2 is relatively large, then this doesn't matter. If the data from T1 is large and the data from T2 is small, then this may matter; measure before deciding.
When I was a junior DB person I once worked for a year in a marketing dept where I had so much free time I did each task 2 or 3 different ways. I made a habit of writing one mega-select that grabbed everything in one go and comparing it to a script that built interim tables of selected primary keys and then once I had the correct keys went and got the data values.
In almost every case the second method was faster. the cases where it wasn't were when dealing with a small number of small tables. Where it was most noticeably faster was of course large tables and multiple joins.
I got into the habit of select the required primary keys from tableA, select the required primary keys from tableB, etc. Join them and select the final set of primary keys. Use the selected primary keys to go back to the tables and get the data values.
As a DBA I now understand that this method resulted in less purging of the data cache and played nicer with others using the DB (as mentioned by Amir Raminfar).
It does however require the use of temporary tables which some places / DBA don't like (unfairly in my mind)
Depends a lot on the actual query and the actual database i.e. SQL, Oracle mySQL.
At large companies, they prefer option 2 because option 1 will hog the database cpu. This results in all other connections being slow and everything being a bottle neck. That being said, it all depends on your data and the ammount you are joining. If you are joining on 10000 to 1000 then you are going to get back 10000 x 1000 records. (Assuming an inner join)
Possible duplicate MySQL JOIN Abuse? How bad can it get?
Assuming "better" means "faster", you can easily test these scenarios in a junit test. Note that a determining factor that you may not be able to get from a unit test is network latency. If the database sits right next to your machine where you run the unit test, you may see no difference in performance that is attributed to the network. If your production servers are in another town, country, or continent from the database, network traffic becomes more of a bottleneck. You do not want to go back and forth across the wire- you more likely want to make one round trip and get everything at once.
Again, it all depends :)
It could depend on many things: ,
the indexes you have set up
how many tables,
what the actual query is,
how big the data set is,
what the underlying DB is,
what table engine you are using
The best thing to do would probably test both methods on a variety of test data and see which one bottle necks.
If you are using MySQL, ( and Oracle maybe? ) you can use
EXPLAIN SELECT .....
and it will give you a lot of info on how it will execute the query, and therefor how you can improve it etc.
I have to design a database to store log data but I don't have experience before. My table contains about 19 columns (about 500 bytes each row) and daily grows up to 30.000 new rows. My app must be able to query effectively again this table.
I'm using SQL Server 2005.
How can I design this database?
EDIT: data I want to store contains a lot of type: datetime, string, short and int. NULL cells are about 25% in total :)
However else you'll do lookups, a logging table will almost certainly have a timestamp column. You'll want to cluster on that timestamp first to keep inserts efficient. That may mean also always constraining your queries to specific date ranges, so that the selectivity on your clustered index is good.
You'll also want indexes for the fields you'll query on most often, but don't jump the gun here. You can add the indexes later. Profile first so you know which indexes you'll really need. On a table with a lot of inserts, unwanted indexes can hurt your performance.
Well, given the description you've provided all you can really do is ensure that your data is normalized and that your 19 columns don't lead you to a "sparse" table (meaning that a great number of those columns are null).
If you'd like to add some more data (your existing schema and some sample data, perhaps) then I can offer more specific advice.
Throw an index on every column you'll be querying against.
Huge amounts of test data, and execution plans (with query analyzer) are your friend here.
In addition to the comment on sparse tables, you should index the table on the columns you wish to query.
Alternatively, you could test it using the profiler and see what the profiler suggests in terms of indexing based on actual usage.
Some optimisations you could make:
Cluster your data based on the most likely look-up criteria (e.g. clustered primary key on each row's creation date-time will make look-ups of this nature very fast).
Assuming that rows are written one at a time (not in batch) and that each row is inserted but never updated, you could code all select statements to use the "with (NOLOCK)" option. This will offer a massive performance improvement if you have many readers as you're completely bypassing the lock system. The risk of reading invalid data is greatly reduced given the structure of the table.
If you're able to post your table definition I may be able to offer more advice.