We write about 1 million records a day into a SQL Server table.
Each record has insertdate and status fields, among others of course.
I need to delete records from time to time to free space on the volume, but keep the last 4 days of records.
The problem is the deletion takes hours and lots of resources.
I have thought about partitioned tables, with insertdate as the partition column, but I have never used that kind of table.
How can I achieve this goal using the least CPU/disk resources, with a solution that has the fewest drawbacks possible? (I assume any solution has its own drawbacks, but please explain them if you know.)
There are two approaches you can take to speed up the deletion. One is to delete 10,000 rows at a time so the transaction log does not grow to an enormous size. Based on some logic, you keep deleting the top 10,000 rows until all the rows that fulfill the condition are deleted. This can, depending on your system, speed up the deletes by a factor of 100.
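The batched delete could be sketched as follows in T-SQL; the table and column names (dbo.Records, insertdate) are assumptions based on the question:

```sql
-- Hedged sketch: delete in 10,000-row batches, keeping the last 4 days.
DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (10000) FROM dbo.Records
    WHERE insertdate < DATEADD(day, -4, CAST(GETDATE() AS date));
    SET @rows = @@ROWCOUNT;
    -- With SIMPLE or BULK_LOGGED recovery, a CHECKPOINT here (or a log
    -- backup under FULL recovery) helps keep the log from growing.
END
```

Each small transaction commits and releases its log space, which is what keeps the log from ballooning the way a single multi-million-row DELETE would.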
The other approach is to partition the table. You have to create a partition scheme and function, and if all the rows you are deleting are in one partition, say a day's worth of sales, then removing the partition drops all its rows in a metadata operation that takes only a few seconds. Partitioning is not hard, but you have to spend some time to set up a rolling window properly. It's more than an hour of work but less than a week.
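A minimal sketch of daily partitioning, assuming insertdate is a date column; all names and boundary values here are illustrative, not from the question:

```sql
-- Hedged sketch: daily RANGE RIGHT partitioning on insertdate.
CREATE PARTITION FUNCTION pfDaily (date)
    AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-01-02', '2024-01-03');

CREATE PARTITION SCHEME psDaily
    AS PARTITION pfDaily ALL TO ([PRIMARY]);

-- The table (or its clustered index) is then created on the scheme:
-- CREATE TABLE dbo.Records (...) ON psDaily(insertdate);

-- Retiring a day is a metadata operation: switch it to an empty
-- staging table with the same structure, truncate, merge the boundary.
-- ALTER TABLE dbo.Records SWITCH PARTITION 2 TO dbo.RecordsStaging;
-- TRUNCATE TABLE dbo.RecordsStaging;
-- ALTER PARTITION FUNCTION pfDaily() MERGE RANGE ('2024-01-01');
```

On SQL Server 2016+ you can also TRUNCATE TABLE ... WITH (PARTITIONS (...)) directly instead of switching to a staging table.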
Related
I am fetching data from a table with about 20K rows at a third-party source, where the way the table is filled cannot be changed.
On the third-party side, the table is filled as follows:
New data arrives roughly every 15 seconds, approximately 7K rows at a time.
At any given time only the last three timestamps are available; the rest of the data is deleted.
There is no index on the table. One cannot be requested either, for unavoidable reasons, and it might slow down the inserts.
I am aware of the following
Row locks, and other locks up the hierarchy, are taken while data is inserted.
The problem persists even with SELECT ... WITH (NOLOCK).
There is no join with any other table while fetching; we join the tables locally, once the data is with us in a temp table.
When data insertion is stopped at the third party, the data comes back in 100 ms to 122 ms.
When the insert service is on, it takes 3 to 5 seconds.
Any help/suggestion/approach is appreciated in advance.
The following is a fairly high-end solution. Based on what you have said I believe it would work, but there'd be a lot of detail to work out.
Briefly: table partitions.
Set up a partition scheme on this table
Based on an article I read recently, this CAN be done with unindexed heaps
Data is loaded every 15 seconds? Then the partitions need to be based on those 15 second intervals
For a given dataload (i.e. once per 15 seconds):
Create the "next" partition
Load the data
SWITCH the new partition (new data) into the main table
SWITCH the oldest partition out (only three time periods' worth of data is present at a time, right?)
Drop that "retired" partition
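The per-load steps above might look something like this; the function, scheme, table names, and boundary values are all illustrative assumptions:

```sql
-- Hedged sketch of the per-15-second rolling window.
-- 1. Prepare and create the "next" partition boundary:
ALTER PARTITION SCHEME ps15s NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf15s() SPLIT RANGE ('2024-01-01T00:00:15');

-- 2. Bulk load into a staging table with identical structure on the
--    same filegroup, then switch it into the main table:
ALTER TABLE dbo.Staging SWITCH TO dbo.MainTable
    PARTITION $PARTITION.pf15s('2024-01-01T00:00:15');

-- 3. Switch the oldest interval out, discard it, merge its boundary:
ALTER TABLE dbo.MainTable SWITCH PARTITION 1 TO dbo.Retired;
TRUNCATE TABLE dbo.Retired;
ALTER PARTITION FUNCTION pf15s() MERGE RANGE ('2024-01-01T00:00:00');
```

In practice the boundary values change every load, so this would be driven by dynamic SQL; the sketch only shows the shape of one cycle.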
While potentially efficient and effective, this would be very messy. The big problem I see is, if they can't add a simple index, I don't see how they could possibly set up table partitioning.
Another similar trick is to set up partitioned views, which essentially is "roll your own partitioning". This would go something like:
Have a set of identically structured tables
Create a view UNION ALLing the tables
On dataload, create a new table, load data into that table, then ALTER VIEW to include that newest table and remove the oldest table.
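The view rotation could be sketched like this; the table names, columns, and the three-table window are illustrative assumptions:

```sql
-- Hedged sketch of "roll your own partitioning" with a view.
-- Create and load the newest table:
CREATE TABLE dbo.Load_3 (ts datetime NOT NULL, payload varchar(200));
-- ...bulk load dbo.Load_3 here...

-- Redefine the view to cover only the three newest tables:
ALTER VIEW dbo.CurrentData AS
    SELECT ts, payload FROM dbo.Load_1
    UNION ALL SELECT ts, payload FROM dbo.Load_2
    UNION ALL SELECT ts, payload FROM dbo.Load_3;

-- Retire the table that dropped out of the window:
DROP TABLE dbo.Load_0;
```

Readers always query dbo.CurrentData, so the swap is a single ALTER VIEW rather than a mass delete.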
This could have worse locking/blocking issues than the partitioning solution, though much depends on how heavy your read activity is. And, of course, it is much messier than just adding an index.
I am modelling for the Database CrateDB.
I have an average of 400 customers, and they produce different amounts of time-series data every day (between 5K and 500K rows; avg. ~15K).
Later I should be able to query per customer_year_month and per customer_year_calendar_week.
That means that I will only query for the intervals:
week
and month
Now I am asking myself how to partition this table.
I would partition per customer and year.
Does this make sense?
Or would it be better to partition by customer, year, and month?
So, the question of partitioning a table is quite complex and should take a lot of things into account, among others:
What queries should be run?
The way the data is inserted
Available hardware resources
Cluster size
Essentially, each partition also creates overhead by multiplying the shard count (a partition can be considered a "sub-table" based on a column value), which - if chosen improperly - can hinder performance a lot.
So in your case, 15K inserts a day is not too much; however, the distribution of inserts might cause problems: a customer partition that grows by 500K inserts a day will run into performance problems earlier than the 5K one. As a consequence, I would use weekly partitioning only.
create table "customer-logging" (
  customer_id long,
  log string,
  ts timestamp,
  week as date_trunc('week', ts)
) partitioned by (week) clustered into 8 shards
Please only use 8 shards if you have an appropriate amount of CPU cores ;)
Docs: date_trunc(), partitioned tables
Ideally you try out a few different combinations and find what works best for you. Insights into shard sizes and locations are provided by our sys tables, so you can see if there's a particularly fat shard that overloads a node ;)
Cheers, Claus
There is a table with 5 columns and no more. The size of each row is less than 200 bytes, but the number of rows may grow to several tens of billions over time.
The application will be storing data at a rate of 100 rows per second or more. Once stored, the data will never be updated, but it will be removed after 1 year. It will not be read often, but it may be queried by time range, e.g. selecting the rows for a given hour of a given day.
Questions
Which type of Nosql database is suited for this?
Which of these databases would be best suited? (Doesn't have to be listed)
If your Oracle license includes the partitioning option, partition by month or year, and if most/all of your queries include the date column you partitioned on, that will help dramatically. It also makes dropping a year's worth of data take a few seconds.
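For example, monthly interval partitioning might look like the following sketch; the table and column names are assumptions, and interval partitioning requires Oracle 11g or later:

```sql
-- Hedged sketch: Oracle monthly interval partitioning on the date column.
CREATE TABLE readings (
  created_at  DATE NOT NULL,
  payload     VARCHAR2(200)
)
PARTITION BY RANGE (created_at)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(PARTITION p0 VALUES LESS THAN (DATE '2024-01-01'));

-- Dropping old data then becomes quick DDL rather than a long DELETE:
-- ALTER TABLE readings DROP PARTITION FOR (DATE '2023-06-15') UPDATE INDEXES;
```

Queries that filter on created_at can then prune to the relevant partitions instead of scanning the whole table.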
As others noted in the comments, it depends on how much data is being returned by a query. If a query returns millions of rows, then yes, it may take 15 minutes. Oracle can handle queries against billion-row tables in a few seconds if the criteria are selective enough, appropriate indexes are present, and statistics have been gathered appropriately.
So how many rows are returned by your 15 minute query?
We have a SQL Server 2000 instance with a table of about 6,000,000 records, and one column has a PDF stored in it.
Every month, on a specific day, we delete about 250,000 records and insert about 250,000. After that there are no updates, only reads.
The question is: is it optimal to delete 500 records and insert 500 records, then delete, then insert, alternating like that?
Or to delete the 250,000 at a time and insert the 250,000 in 500 batches?
Which option is optimal, and which has the best memory management?
+1 for anyone who points me to an MSDN article or something.
http://msdn.microsoft.com/en-us/library/aa178096%28v=sql.80%29.aspx
As you don't mention it, it's worth underlining the standard practice for inserting/deleting/updating any very large volume of data on an RDBMS: drop all indexes before applying the changes, and reapply them afterwards.
Two reasons for this.
It's faster for the system to rebuild the indexes in one go rather than record by record (less head movement on the disk).
If you rebuild an index from scratch, subsequent accesses using it are likely to be faster, as the index tree is more likely to be well balanced.
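In outline (SQL Server 2000 syntax; the table, column, and index names are assumptions for illustration):

```sql
-- Hedged sketch: drop the index, apply the bulk change, rebuild.
DROP INDEX Documents.IX_Documents_LoadDate;  -- 2000-era table.index syntax

-- ...perform the bulk DELETE of ~250,000 rows and the bulk INSERT here...

CREATE INDEX IX_Documents_LoadDate ON Documents (LoadDate);
```

Note that this applies to nonclustered indexes; dropping a clustered index on a 6M-row table forces a full rebuild of the table itself, so you would normally leave that one in place.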
You might want to consider partitioning. If you organize it so you can simply drop a partition at a time, deleting will take milliseconds. See http://msdn.microsoft.com/en-us/library/aa902650(v=sql.80).aspx
I am developing a job application which executes multiple parallel jobs. Every job pulls data from a third-party source and processes it; the minimum is 100,000 records. So I am creating a new table for each job (like Job123, where 123 is the job ID) and processing it. When a job starts, it clears the old records, gets the new records, and processes them. Now the problem is that I have 1,000 jobs, so the DB has 1,000 tables, and the DB size has increased drastically due to all the tables.
My question is whether it is OK to create a new table for each job, or whether to have only one table called Job with a jobId column, and enter and process the data there. The only concern is that every job will have 100,000+ records. If we have only one table, will DB performance be affected?
Please let me know which approach is better.
Don't create all those tables! Even though it might work, there's a huge performance hit.
Having a big table is fine; that's what databases are for. But... I suspect that you don't need 100 million persistent records, do you? It looks like you only process one job at a time, but it's unclear.
Edit
The database will grow to the largest size needed, but the space from deleted records is reused. If you add 100k records and delete them, over and over, the database won't keep growing. But even after the delete it will take up as much space as 100k records.
I recommend a single large table for all jobs. There should be one table for each kind of thing, not one table for each thing.
If you make the Job ID the first field in the clustered index, SQL Server will use a b-tree index to determine the physical order of data in the table. In principle, the data will automatically be physically grouped by Job ID due to the physical sort order. This may not stay strictly true forever due to fragmentation, but that would affect a multiple table design as well.
The performance impact of making the Job ID the first key field of a large table should be negligible for single-job operations as opposed to having a separate table for each job.
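The single-table design described above could be sketched like this; all names and column types are illustrative assumptions:

```sql
-- Hedged sketch: one table for all jobs, clustered on (JobId, RecordId).
CREATE TABLE dbo.JobData (
    JobId    int           NOT NULL,
    RecordId bigint        NOT NULL IDENTITY,
    Payload  varchar(4000) NULL,
    CONSTRAINT PK_JobData PRIMARY KEY CLUSTERED (JobId, RecordId)
);

-- Per-job operations then touch one contiguous range of the clustered key:
-- DELETE FROM dbo.JobData WHERE JobId = 123;
-- INSERT INTO dbo.JobData (JobId, Payload) VALUES (123, '...');
-- SELECT Payload FROM dbo.JobData WHERE JobId = 123;
```

Because JobId leads the clustered key, each job's 100,000+ rows sit together on disk, so clearing and reloading one job does not scan the other jobs' data.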
Also, a single large table will generally be more space efficient than multiple tables for the same amount of total data. This will improve performance by reducing pressure on the cache.