I have a database consisting of several thousand tables (dating back to 2005), where about 20-30% of them grow by roughly 200k rows per year.
The requirement is to visualize statistics for these tables based on the column lastAccessedDate. The scheme is to classify rows into partition groups (10y, 5y, 3y, 1y, 6m, 3m, 1m, 2w, currentDate). Say the partitions are named p1, p2, ..., p10.
I understand that multiple partitioning groups can be defined on a table (link).
The job runs every week, so the partitioning scheme shifts with the current date; i.e. after a week, p10 becomes p9; after two weeks, p9 becomes p8; and after a month, p8 becomes p7; I hope you get the idea.
Is a partitioning scheme based on the current date feasible?
If it is, is it worthwhile to horizontally partition the tables and query the partitions instead of running the query against the entire table? The SQL Server reports suggest the total space usage is around 31,556 MB.
I am running this on a SQL Server 2008 instance.
You mention data storage and statistics requirements. These two should not be confused: storage solutions are made for quick access to data, while statistics are in the domain of reporting and visualisation.
With respect to storage you should use partitioning, but not in the proposed way. SQL Server supports thousands of partitions per table, so you could use one day or one week as a useful partition key. Once created, you can switch partitions in and out of the partitioned table, but you should not move data from one existing partition to another, because that implies recalculating all constraints and indexes. In views you can derive additional values for later GROUP BY operations; those groups can represent the required '1d, 1w, 1m, 1q, 1y' reporting buckets.
So yes, it's feasible and recommended to use partitions in your case, but in a different way than proposed.
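As a rough sketch of that approach (the object names, boundary dates and grouping logic below are assumptions for illustration, not taken from the question), a weekly partition function plus a view that derives the age buckets could look like this:

    -- Hypothetical weekly RANGE RIGHT partition function; boundaries would be extended
    -- (split) each week by a maintenance job instead of moving rows between partitions.
    CREATE PARTITION FUNCTION pfWeekly (date)
        AS RANGE RIGHT FOR VALUES ('2013-01-07', '2013-01-14', '2013-01-21', '2013-01-28');
    GO
    CREATE PARTITION SCHEME psWeekly
        AS PARTITION pfWeekly ALL TO ([PRIMARY]);
    GO
    -- Example table; the partition column must be part of the clustered index key.
    CREATE TABLE dbo.AccessLog
    (
        Id               bigint IDENTITY(1,1) NOT NULL,
        lastAccessedDate date   NOT NULL,
        CONSTRAINT PK_AccessLog PRIMARY KEY CLUSTERED (lastAccessedDate, Id)
    ) ON psWeekly (lastAccessedDate);
    GO
    -- The reporting buckets (2w, 1m, 3m, ... 10y) are derived in a view for GROUP BY,
    -- so no data ever has to move between partitions as the current date advances.
    CREATE VIEW dbo.vAccessLogAgeGroups
    AS
    SELECT Id,
           lastAccessedDate,
           CASE
               WHEN lastAccessedDate >= DATEADD(WEEK,  -2, GETDATE()) THEN '2w'
               WHEN lastAccessedDate >= DATEADD(MONTH, -1, GETDATE()) THEN '1m'
               WHEN lastAccessedDate >= DATEADD(MONTH, -3, GETDATE()) THEN '3m'
               WHEN lastAccessedDate >= DATEADD(MONTH, -6, GETDATE()) THEN '6m'
               WHEN lastAccessedDate >= DATEADD(YEAR,  -1, GETDATE()) THEN '1y'
               WHEN lastAccessedDate >= DATEADD(YEAR,  -3, GETDATE()) THEN '3y'
               WHEN lastAccessedDate >= DATEADD(YEAR,  -5, GETDATE()) THEN '5y'
               ELSE '10y'
           END AS AgeGroup
    FROM dbo.AccessLog;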
Related
We have a 4-5 TB SQL Server database. The largest table is around 800 GB and contains 100 million rows; 4-5 other comparable tables are 1/3 to 2/3 of that size. We went through a process of creating new indexes to optimize performance. While performance certainly improved, we saw that the newly inserted data was the slowest to query.
It's a financial reporting application with a BI tool working on top of the database. The data is loaded overnight, continuing into the late morning, though the majority of it is loaded by 7am. Users start to query data around 8am through the BI tool and are most concerned with the latest (daily) data.
I wanted to know whether newly inserted data causes indexes to go out of order. Is there anything we can do to get better performance on the newly inserted data than on the old data? I hope I have explained the issue well here. Let me know if any information is missing. Thanks.
Edit 1
Let me describe the architecture a bit.
I have a base table (let's call it Base) with (Date, Id) as the clustered index.
It has around 50 columns.
Then we have 5 derived tables (Derived1, Derived2, ...), split by metric type, which also have (Date, Id) as the clustered index and a foreign key constraint on the Base table.
Tables Derived1 and Derived2 have 350+ columns; Derived3, 4 and 5 have around 100-200 columns. There is one large view created to join all the data tables due to limitations of the BI tool. (Date, Id) are the joining columns for all the tables that form the view (hence I created the clustered index on those columns). The main concern is the BI tool's performance; the BI tool always uses the view and generally sends similar queries to the server.
There are other indexes as well on other filtering columns.
The main question remains - how to prevent performance from deteriorating.
In addition, I would like to know:
Whether a nonclustered index (NCI) on (Date, Id) on all tables would be a better bet in addition to the clustered index on (Date, Id).
Whether it makes sense to have 150 included columns in the NCI for the derived tables.
You have about 100 million rows, increasing every day with new batches, and those new batches are usually what gets selected. With those numbers I would use partitioned indexes rather than regular indexes.
Your solution within SQL Server would be partitioning. Take a look at SQL Server table partitioning and see if you can adopt it. Partitioning is a form of clustering in which groups of data share a physical block. If you use year and month, for example, all 2018-09 records share the same physical space and are easy to find, so if you select records with those filters (plus more) it is as if the table were only the size of the 2018-09 records. That is not exactly accurate, but it is quite close. Be careful with the data values used for partitioning: unlike a standard PK cluster, where each value is unique, the partitioning column(s) should produce a manageable set of distinct combinations, i.e. partitions.
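For example (a minimal sketch; the table, column and object names below are invented, not taken from the question), month-based partitioning could look roughly like this:

    -- Hypothetical monthly partitioning of a large fact table on its date column.
    CREATE PARTITION FUNCTION pfMonthly (date)
        AS RANGE RIGHT FOR VALUES ('2018-08-01', '2018-09-01', '2018-10-01');
    GO
    CREATE PARTITION SCHEME psMonthly
        AS PARTITION pfMonthly ALL TO ([PRIMARY]);
    GO
    -- The partition column has to be part of the clustered index key.
    CREATE TABLE dbo.BigFact
    (
        [Date]  date          NOT NULL,
        Id      bigint        NOT NULL,
        Amount  decimal(18,2) NULL,
        CONSTRAINT PK_BigFact PRIMARY KEY CLUSTERED ([Date], Id)
    ) ON psMonthly ([Date]);
    GO
    -- A query filtering on the partition column only touches the 2018-09 partition:
    SELECT COUNT(*)
    FROM dbo.BigFact
    WHERE [Date] >= '2018-09-01' AND [Date] < '2018-10-01';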
If you cannot use partitioning you have to create 'partitions' yourself using regular indexes. This will require some experimentation. The basic idea is a column (a number, say) indicating a wave, or set of waves, of imported data. For example, data imported today and over the next 10 days gets wave '1', the next 10 days get wave '2', and so on. By filtering on the latest 10 waves you work on the latest 100 days of imports and effectively skip all the rest of the data. Roughly, if you divided your existing 100 million rows into 100 waves, started at wave 101, and searched for waves 90 or greater, you would have about 10 million rows to search, provided the query is written so that SQL Server uses the new index first (which it will eventually).
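A rough sketch of that idea, continuing the hypothetical dbo.BigFact table from the previous example (the Wave column, index name and wave numbers are all assumptions):

    -- 'Wave' column marking each batch (e.g. ~10 days) of imported data.
    ALTER TABLE dbo.BigFact ADD Wave int NOT NULL CONSTRAINT DF_BigFact_Wave DEFAULT (0);
    GO
    -- Index leading on Wave so queries for recent waves can seek directly to them.
    CREATE NONCLUSTERED INDEX IX_BigFact_Wave
        ON dbo.BigFact (Wave)
        INCLUDE (Amount);
    GO
    -- Restricting to the latest waves lets SQL Server skip the bulk of the old rows:
    SELECT COUNT(*)
    FROM dbo.BigFact
    WHERE Wave >= 90;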
This is a broad question, especially without knowing your system, but one thing I would try is manually updating your statistics on the indexes/tables once you are done loading data. With tables that big, it is unlikely that you will modify enough rows to trigger an auto-update, and without fresh statistics SQL Server won't have an accurate histogram of your data.
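For example (dbo.Base here is just a placeholder name), a post-load step could refresh the statistics explicitly:

    -- Refresh statistics right after the nightly load rather than waiting for auto-update.
    UPDATE STATISTICS dbo.Base WITH FULLSCAN;  -- or WITH SAMPLE 25 PERCENT if FULLSCAN takes too long

    -- Alternatively, refresh every table in the database (heavier; usually a maintenance job):
    EXEC sp_updatestats;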
Next, dive into your execution plans and see what operators are the most expensive.
Here is my scenario with a SQL Server 2008 R2 database table
(Update: Migration to SQL Server 2014 SP1 is in progress, so SQL Server 2014 can be used here).
A. Maintain daily history in the table (which is a fact table)
B. Create tableau graphs using the fact and dimension tables
A few steps to follow to create the table
A copy of the table from the source database will be pushed to my SQL Server DAILY; it contains 120,000 to 130,000 rows with approximately 20 columns.
a. 1st day, we get 120,000 records. (Sample source-system data, with modified or new records highlighted, was shown as a screenshot.)
b. 2nd day, we get, say, 122,000 records (2,000 newly inserted, 1,000 modified/updated versions of the previous day's data, and 119,000 unchanged from the previous day).
c. 3rd day, we get, say, 123,000 records (1,000 newly inserted, 1,000 modified/updated versions of the 2nd day's data, and 121,000 unchanged from the 2nd day).
Since the daily history has to be maintained in the fact table, within a week the table will have about 1 million rows;
for 2 weeks - 2 million rows;
for 1 month - 5 million rows;
for 1 year - around 65-70 million rows;
for 12 years - around 1 billion rows (1,000 million).
12 years of history has to be maintained.
What would be the right strategy for storing data in the table to handle this scenario, while also providing sufficient performance when generating reports?
Fact Table Approaches:
1. Partition the table by month (each month adds approximately 5 million rows)?
2. Copy only the differential data (new and modified rows) into the table daily; however, it is not possible to create the Tableau reports with this approach.
Tableau graphs have to be created using the fact and dimension tables for scenarios like:
Weekly Bar graph for Sample Count
Weekly (week no. on X-axis) plotter graph for average Sample values (on Y-axis)
Weekly (week no. on x-axis) average sample values (on Y-axis) by quality
How should this scenario be handled?
Please provide references on the approach to follow.
Should we create any indexes on the fact table?
A data warehouse can handle millions of rows these days without much difficulty; many have tens of billions of rows, and only then do things get a little difficult. You should look at both table partitioning over time and at columnstore compression and page compression to see what is out there; large warehouses often use both. 2008 R2 is quite old at this point, and huge progress has been made in this area in current versions of SQL Server.
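As a small, hedged illustration of those two options (dbo.FactSample is a placeholder name; page compression needs Enterprise edition on older versions, and clustered columnstore indexes need SQL Server 2014 or later):

    -- Page compression on an existing rowstore fact table:
    ALTER TABLE dbo.FactSample REBUILD WITH (DATA_COMPRESSION = PAGE);

    -- On SQL Server 2014+, a clustered columnstore index is an alternative that compresses
    -- heavily and speeds up scan-style reporting queries (drop the existing clustered
    -- rowstore index first):
    -- CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSample ON dbo.FactSample;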
Use a standard fact-dimensional design, and try to avoid tweaking the actual schema with workarounds just to conserve space - that generally will bite you in the long run.
For proven, time tested designs in warehousing I like the Kimball group's patterns, e.g. The Data Warehouse Lifecycle Toolkit book.
There are a few different requirements in your case. Because of that, I suggest splitting the requirements according to the standard data warehouse three-tier model.
DWH model (delta-driven, historized, high performance)
Presentation model (Again, high performance, should fit Tableau)
Front end
DWH model
Basically, you have three different approaches here, all with their pros and cons.
3NF
Can become cumbersome down the road. Is highly flexible if used right. Time-to-market is long (depending on complexity). Historization can become complicated.
Star Schema (for DWH storage!)
Has a very, very fast time-to-market. Will become extremely complicated to maintain when business rules or the business structure change. Helpful for a very small business, but not for businesses that want to expand their Business Intelligence infrastructure. Historization can become a mess if the star schema is the main DWH model.
Data Vault
Has a medium time-to-market. Is easier to understand than 3NF but can be puzzling for people used to a star schema. Automatically historized, parallelizable and very flexible for changing business needs, because the business rules are implemented downstream. Scales quickly.
Anchor Modelling
Another highly flexible approach which I haven't used yet. It is in some ways similar to Data Vault, but with some differences.
Presentation model
Now, to represent the never-touched-again data in the DWH layer, nothing fits better than Star Schema. Also, while creating the star schema, you can implement business logic.
Front end
Shouldn't matter, take the tool you like.
In your case, it would be smart to implement a DWH (using one of those models) and put the presentation model on top of it. If problems arise in the star schema, you can always regenerate it with the new changes.
NOTE: If you would use a star schema as a DWH model, you cannot re-create the star schema in the presentation layer without using some complex transformation logic to begin with.
NOTE: Also, sometimes the star schema is seen as a DWH. I don't think that this is a good use for it for any requirement which could become more complex.
EDIT
To clarify my last note, see this blog post: http://www.tobiasmaasland.de/2016/08/24/why-your-data-warehouse-is-not-a-data-warehouse/
We have a few tables that are periodically recomputed within SQL Server. The computation takes a few seconds to a few minutes and we do the following:
Dump the results in computed_table_tmp
Drop computed_table
Rename computed_table_tmp to computed_table (and all of its indexes).
However, we still run into concurrency issues when our application requests a view that uses this computed table at the precise moment when it no longer exists.
What would be the best technique to avoid this type of problem while ensuring high availability?
If this table is part of your high-availability requirement, then you can't do this the way you've been doing it. Dropping a table in a production SQL environment breaks the concept of high availability.
You might be able to accomplish what you're trying to achieve by creating one or more partitions on this table. A partitioned table is divided into subgroups of rows that can be spread across more than one filegroup in your database. For querying purposes, however, the table is still a single logical entity. The advantage of using a table partition is that you can move around subsets of your data without breaking the integrity of the database, i.e., high-availability is still in place.
In your scenario, you'd have to modify your process so that all activity takes place in the production version of the table. The new rows are dumped into a separate partition, based on the value of your partition function. Then you'll need to switch the partitions.
One of the things you'll need to do is identify a column in your table that may be used as the partition column, which determines which partition a row will be allocated to. This might be, for example, a datetime column indicating when the row was generated. You can even use a computed column for this purpose, provided it is a PERSISTED column.
One caveat: Table partitioning is not available in all editions of SQL Server... I don't believe Standard has it.
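A minimal sketch of that switch pattern, under the assumption that computed_table is partitioned on a BatchId column by a partition function pfBatch and that the freshly computed results land in a staging table with an identical structure on the same filegroup (all object names and the batch value here are invented):

    -- The check constraint proves to SQL Server that the staging rows fit the target partition.
    ALTER TABLE dbo.computed_table_staging
        WITH CHECK ADD CONSTRAINT CK_staging_batch CHECK (BatchId = 42 AND BatchId IS NOT NULL);

    BEGIN TRANSACTION;
        -- Move the old contents of that partition out to an empty table with the same
        -- structure on the same filegroup...
        ALTER TABLE dbo.computed_table
            SWITCH PARTITION $PARTITION.pfBatch(42) TO dbo.computed_table_old;
        -- ...and switch the new results in. Both operations are metadata-only, so the
        -- view over computed_table never sees the table disappear.
        ALTER TABLE dbo.computed_table_staging
            SWITCH TO dbo.computed_table PARTITION $PARTITION.pfBatch(42);
    COMMIT TRANSACTION;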
I'm looking to manage a large dataset of log files, with an average of 1.5 million new events per month that I'm trying to keep. I've used Access in the past, though it's clearly not meant for this, and managing the dataset is a nightmare because I'm having to split it into months.
For the most part, I just need to filter event types and count the number. But before I do a bunch of work on the data import side of things, I wanted to see if anyone can verify that SQL Server is a good choice for this. Is there an entry limit I should stay under, and should I archive entries? Is there a way of archiving entries?
The other part is that I'm collecting logs from multiple sources. With this number of entries, is it wise to put them all into the same table, or should each source have its own table to make queries faster?
edit...
There would be no joins, and about 10 columns. Data would be filtered through a view, and I'm interested to see whether a SELECT query that filters on one or more columns would have a reasonable response time. Does creating a set of views speed things up for frequent queries?
In my experience, SQL Server is a fine choice for this, and you can definitely expect better performance from SQL Server than MS-Access, with generally more optimization methods at your disposal.
I would probably go ahead and put this stuff into SQL Server Express as you've said, hopefully installed on the best machine you can use (though you did mention only 2GB of RAM). Use one table so long as it only represents one thing (I would think a pilot's flight log and a software error log wouldn't be in the same "log" table, as an absurdly contrived example). Check your performance. If it's an issue, move forward with any number of optimization techniques available to your edition of SQL Server.
Here's how I would probably do it initially:
Create your table with a nonclustered primary key, if you use a PK on your log table; I normally use an identity column to give a guaranteed order of events (unlike duplicate datetimes) and to show possible log insert failures (missing identities). Set a clustered index on the main datetime column (you mentioned that you're already splitting into separate tables by month, so I assume you'll query this way, too). If you have a few queries that you run on this table routinely, by all means make views of them, but don't expect a speedup simply by doing so. You'll more than likely want to index the table based on the WHERE clauses of those queries; this is where you give SQL Server the information it needs to run them efficiently.
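Roughly what that initial layout might look like (the table and column names are just an example, not from the question):

    -- Log table: identity PK kept nonclustered; clustered index on the datetime column.
    CREATE TABLE dbo.EventLog
    (
        EventLogId bigint        IDENTITY(1,1) NOT NULL,
        EventTime  datetime2(0)  NOT NULL,
        EventType  varchar(50)   NOT NULL,
        SourceName varchar(50)   NOT NULL,
        Message    varchar(4000) NULL,
        CONSTRAINT PK_EventLog PRIMARY KEY NONCLUSTERED (EventLogId)
    );
    GO
    CREATE CLUSTERED INDEX CIX_EventLog_EventTime ON dbo.EventLog (EventTime);
    GO
    -- Supporting index for the typical "filter by event type and count" queries:
    CREATE NONCLUSTERED INDEX IX_EventLog_EventType ON dbo.EventLog (EventType, EventTime);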
If you're unable to get your desired performance through optimizing your queries, indexes, using the smallest possible datatypes (especially on your indexed columns) and running on decent hardware, it may be time to try partitioned views (which require some form of ongoing maintenance) or partitioning your table. Unfortunately, SQL Server Express may limit you on what you can do with partitioning, and you'll have to decide if you need to move to a more feature-filled edition of SQL Server. You could always test partitioning with the Enterprise evaluation or Developer editions.
Update:
For the most part, I just need to filter event types and count the number.
Since past logs don't change (much like past sales data), storing pre-aggregated numbers is an often-used strategy in this scenario. You can create a table that simply stores your counts for each month and insert new counts once a month (or week, day, etc.) with a scheduled job of some sort. Using the clustered index on your datetime column, SQL Server can then aggregate the current month's numbers from the live table much more easily and add them to the stored aggregates to display up-to-date total counts and the like.
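A sketch of that aggregate table and the monthly load, reusing the hypothetical dbo.EventLog table from the earlier example (the month shown is arbitrary):

    -- Pre-aggregated counts per month and event type, filled by a scheduled job.
    CREATE TABLE dbo.EventLogMonthlyCounts
    (
        MonthStart date        NOT NULL,
        EventType  varchar(50) NOT NULL,
        EventCount int         NOT NULL,
        CONSTRAINT PK_EventLogMonthlyCounts PRIMARY KEY (MonthStart, EventType)
    );
    GO
    -- Run once per month for the month that has just closed:
    INSERT INTO dbo.EventLogMonthlyCounts (MonthStart, EventType, EventCount)
    SELECT '2012-08-01', EventType, COUNT(*)
    FROM dbo.EventLog
    WHERE EventTime >= '2012-08-01' AND EventTime < '2012-09-01'
    GROUP BY EventType;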
Sounds like one table to me, which would need indexes on exactly the sets of columns you will filter on. Restricting access through views is generally a good idea and helps ensure your indexes actually get used.
Putting each source into its own table will require UNIONs in your queries later, and SQL Server is not very good at optimizing UNION queries.
"Archiving" entries can of course be done manually, by moving entries in a date range to another table (which can live on another disk or in another database), or by using partitioning, which lets you place parts of a table (e.g. defined by date ranges) on different disks; a rough sketch of the manual variant follows at the end of this answer. You have to plan for the partitions when you plan your SQL Server installation.
Be aware that Express edition is limited to 4 GB per database, so at 1.5 million rows per month this could become a problem.
I have a table like yours with 20M rows and have few problems querying or even joining, as long as the indexes are used.
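As for the manual archiving mentioned above, here is a hedged sketch using the hypothetical dbo.EventLog table from the earlier example (archive.EventLog is assumed to be a pre-created table with the same columns but without the IDENTITY property; the cut-off date is arbitrary):

    -- Move everything older than a cut-off date into the archive table, then delete it.
    BEGIN TRANSACTION;
        INSERT INTO archive.EventLog (EventLogId, EventTime, EventType, SourceName, Message)
        SELECT EventLogId, EventTime, EventType, SourceName, Message
        FROM dbo.EventLog
        WHERE EventTime < '2012-01-01';

        DELETE FROM dbo.EventLog
        WHERE EventTime < '2012-01-01';
    COMMIT TRANSACTION;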
I'm considering various ways to partition my data in SQL Server. One approach I'm looking at is to partition a particular huge table into 8 partitions and then, within each of these partitions, to partition on a different partition column. Is this even possible in SQL Server, or am I limited to defining one partition column + function + scheme per table?
I'm interested in the more general answer, but this strategy is the one I'm considering for a Distributed Partitioned View, where I'd partition the data under the first scheme using the DPV to distribute the huge amount of data over 8 machines, and then, on each machine, partition that portion of the full table on another partition key so that I can drop (for example) sub-partitions as required.
You are incorrect that the partitioning key cannot be computed. Use a computed, persisted column for the key:
    ALTER TABLE MYTABLE ADD PartitionID AS ISNULL(Column1 * Column2, 0) PERSISTED;
I do it all the time, very simple.
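Once the persisted column exists, it can be used as an ordinary partition column. A hedged sketch (the function/scheme names and boundary values are made up, and this assumes MYTABLE is currently a heap or that its existing clustered index gets rebuilt onto the scheme):

    -- Partition function/scheme over the computed, persisted PartitionID values.
    CREATE PARTITION FUNCTION pfPartitionID (int)
        AS RANGE RIGHT FOR VALUES (100, 200, 300);
    GO
    CREATE PARTITION SCHEME psPartitionID
        AS PARTITION pfPartitionID ALL TO ([PRIMARY]);
    GO
    -- Place the table on the scheme by building its clustered index there,
    -- partitioned by the computed column.
    CREATE CLUSTERED INDEX CIX_MYTABLE_PartitionID
        ON dbo.MYTABLE (PartitionID, Column1)
        ON psPartitionID (PartitionID);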
A DPV across a set of partitioned tables is your only clean option to achieve this: something like a DPV across tblSales2007, tblSales2008 and tblSales2009, where each of the respective sales tables is partitioned again, but each can be partitioned by a different key. There are some very good benefits to doing this in terms of operational resilience (one partitioned table going offline does not take the DPV down; it can still satisfy queries for the other timelines).
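A sketch of what such a DPV could look like (the server and database names are assumptions; each member table would carry a CHECK constraint on its year column so the optimizer can eliminate members, and each member can be locally partitioned on a different key):

    -- Distributed partitioned view over yearly sales tables spread across linked servers.
    -- In practice the columns would be listed explicitly rather than using SELECT *.
    CREATE VIEW dbo.vSalesAll
    AS
    SELECT * FROM Server1.SalesDB.dbo.tblSales2007
    UNION ALL
    SELECT * FROM Server2.SalesDB.dbo.tblSales2008
    UNION ALL
    SELECT * FROM Server3.SalesDB.dbo.tblSales2009;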
The hack option is to create an arbitrary hash of two columns and store it per record, then partition by it. You would have to generate this hash for every query/insertion, since the partition key cannot be computed on the fly; it must be a stored value. It's a hack, and I suspect it would lose more performance than you would gain.
You do have to think about specific management issues and DR with these data quantities, though. If the data volumes are very large and access is primarily read-oriented, you should look into SQL 'Madison', which scales enormously in both the number of rows and the overall size of the data. But it really only suits a 99.9%-read data warehouse; it is not suitable for OLTP.
I have production data sets sitting in the 'billions' bracket; they reside on partitioned table systems and provide very good performance, although much of that is down to the hardware underlying the system, not the database itself. Scaling up to this level is not an issue, and I know of others who have gone well beyond those quantities.
The maximum number of partitions per table remains at 1,000; from what I remember of a conversation about this, that figure was set based on the testing performed, not because of a technical limitation.