I'm doing a rather large import to a SQL Database, 10^8+ items and I am doing this with a bulk insert. I'm curious to know if the speed at which the bulk insert runs can be improved by importing multiple rows of data as a single row and splitting them once imported?
If the time to import data is defined by the sheer volume of data itself (i.e. 10GB), then I'd expect that importing 10^6 rows vs 10^2 rows with the data consolidated would take about the same amount of time.
If, however, the time to import is limited more by row operations and logging each line than by the data itself, then I'd expect that consolidating data would have a performance benefit. I'm not sure, however, how this would carry over if one then had to break up the data in the DB later on.
Does anyone have experience with this and can shed some light on what specifically can be done to reduce bulk insert time without simply adding that time later to split the data in DB?
Given a 10GB import, is it better to import data on separate rows or consolidate and separate the rows in the DB?
[EDIT] I'm testing this on a quad-core 2.5GHz machine with 8GB of RAM and 300MB/sec of read/write to disk (striped array). The files are hosted on the same array, and the average row size varies, with some rows containing large amounts of data (> 100 KB) and many under 100 B.
I've chunked my data into 100 MB files and it takes about 40 seconds to import the file. Each file has 10^6 rows in it.
Where is the data that you are importing? If it is on another server, then the network might be the bottleneck. That then depends on the number of NICs and frame sizes.
If it is on the same server, things to play with are the batch size and the recovery model, which affect the log file. In the full recovery model, everything is written to the log file. The bulk-logged recovery model has a little less logging overhead.
Since this is staging data, maybe taking a full backup before the process, changing the model to simple, and then importing might reduce the time. Of course, change the model back to full and take another backup afterwards.
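A minimal T-SQL sketch of that idea, assuming a hypothetical StagingDB database, a dbo.ImportTarget table and a semicolon-delimited file at C:\import\chunk01.csv; adjust names, terminators and batch size to your environment:
-- Take a full backup first, then reduce logging for the load (offline window assumed)
ALTER DATABASE StagingDB SET RECOVERY BULK_LOGGED;  -- or SIMPLE

BULK INSERT dbo.ImportTarget
FROM 'C:\import\chunk01.csv'
WITH (
    FIELDTERMINATOR = ';',
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 100000,   -- commit every 100k rows; tune to your log throughput
    TABLOCK                     -- table lock enables minimally logged inserts
);

-- Restore the original model and re-establish the backup chain
ALTER DATABASE StagingDB SET RECOVERY FULL;
BACKUP DATABASE StagingDB TO DISK = 'C:\backup\StagingDB_post_import.bak';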
As for importing non-normalized data, multiple rows at a time, I usually stay away from the extra coding.
Most of the time, I use SSIS packages. More packages and threads mean a fuller NIC pipe. I usually have at least a 4 GB backbone that is seldom full.
Other things that come into play are your disks. Do you have multiple files (pathways) to the RAID 5 array? If not, you might want to think about it.
In short, it really depends on your environment.
Use a DMAIC process.
1 - Define what you want to do
2 - Measure the current implementation
3 - Analyze ways to improve.
4 - Implement the change.
5 - Control the environment by remeasuring.
Did the change go in the positive direction?
If not, rollback the change and try another one.
Repeat the process until the desired result (timing) is achieved.
Good luck, J
If this is a one-time thing and done in an offline change window, you may want to consider putting the database in the simple recovery model prior to inserting the data.
Keep in mind though this would break the log chain....
EDIT
This question has been closed on SO and reposted on ServerFault
https://serverfault.com/questions/333168/how-can-i-make-my-ssis-process-consume-more-resources-and-run-faster
I have a daily ETL process in SSIS that builds my warehouse so we can provide day-over-day reports.
I have two servers - one for SSIS and the other for the SQL Server database. The SSIS server (SSIS-Server01) is an 8-CPU, 32GB RAM box. The SQL Server database server (DB-Server) is another 8-CPU, 32GB RAM box. Both are VMware virtual machines.
In its oversimplified form, the SSIS reads 17 Million rows (about 9GB) from a single table on the DB-Server, unpivots them to 408M rows, does a few lookups and a ton of calculations, and then aggregates it back to about 8M rows that are written to a brand new table on the same DB-Server every time (this table will then be moved into a partition to provide day-over-day reports).
I have a loop that processes 18 months worth of data at a time - a grand total of 10 years of data. I chose 18 months based on my observation of RAM Usage on SSIS-Server - at 18 months it consumes 27GB of RAM. Any higher than that, and SSIS starts buffering to disk and the performance nosedives.
I am using Microsoft's Balanced Data Distributor to send data down 8 parallel paths to maximize resource usage. I do a union before starting work on my aggregations.
Here is the task manager graph from the SSIS server
Here is another graph showing the 8 individual CPUs
As you can see from these images, the memory usage slowly increases to about 27GB as more and more rows are read and processed. However, the CPU usage stays roughly constant at around 40%.
The second graph shows that we are only using 4 (sometimes 5) CPUs out of 8.
I am trying to make the process run faster (it is only using 40% of the available CPU).
How do I go about making this process run more efficiently (least time, most resources)?
At the end of the day, all processing is bound by one of four factors
Memory
CPU
Disk
Network
The first step is to identify what the limiting factor is and then determine whether you can influence it (acquire more of or reduce usage of)
Component choices
The reason your server's memory runs out when you do more than 18 months is related to why it takes so long for it to process. The Pivot and Aggregate transformations are asynchronous components. Every row coming in from the source component has N bytes of memory allocated to it. That same bucket of data visits all the transformations, has their operations applied and is emptied at the destination. That memory bucket is reused over and over again.
When an async component enters the arena, the pipeline is split. The bucket that was transporting that row of data must now be emptied into a new bucket to complete the pipeline. That copying of data between execution trees is an expensive operation in terms of execution time and memory (could double it). This also reduces the opportunity for the engine to parallelize some of the execution opportunities as it's waiting on the async operations to complete. A further slow down to operations is encountered from the nature of the transformations. The Aggregate is a fully blocking component so all the data has to arrive and be processed before the transformation will release a single row to the downstream transformations.
If it's possible, can you push the pivot and/or the aggregate onto the server? That should decrease the time spent in the data flow as well as the resources consumed.
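Purely as an illustration, here is a minimal T-SQL sketch of pushing the unpivot and aggregation into the source query rather than the data flow; dbo.SourceTable, SomeKey and the [M01]-[M03] month columns are hypothetical stand-ins for your real schema:
SELECT  u.SomeKey,
        u.MonthName,
        SUM(u.MonthValue) AS TotalValue
FROM    dbo.SourceTable
UNPIVOT (MonthValue FOR MonthName IN ([M01], [M02], [M03])) AS u
GROUP BY u.SomeKey, u.MonthName;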
You can try increasing the number of parallel operations the engine can choose; see Jamie's article and SQL CAT's article.
If you really want to know where your time is being spent in the data flow, log the OnPipelineRowsSent event for an execution. Then you can use this query to rip it apart (after substituting sysssislog for sysdtslog90).
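A minimal sketch of that kind of analysis against the SSIS log table (assuming you log to dbo.sysssislog on SQL Server 2008; the actual rows-sent counts are embedded in the message column, which the full query would parse out):
SELECT  source,                  -- data flow component/path name
        sourceid,
        COUNT(*)       AS events,
        MIN(starttime) AS first_event,
        MAX(endtime)   AS last_event
FROM    dbo.sysssislog
WHERE   event = 'OnPipelineRowsSent'
GROUP BY source, sourceid
ORDER BY MAX(endtime) DESC;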
Network transfer
Based on your graphs, it doesn't appear the CPU or Memory is taxed on either box. I believe you have indicated the source and destination server are on a single box but the SSIS package is hosted and processed on another box. You're paying a not-insignificant cost to transfer that data over the wire and back again. Is it possible to process the data on the source server? You'd need to allocate more resources to that box and I'm crossing my fingers that's a big beefy VM and that's not a problem.
If that's not an option, try setting the Packet Size property of the connection manager to 32767 and talk to network ops about whether jumbo frames are right for you. Both of those tips are in the Tune your Network section.
I suck at disk counters but you should be able to see if the wait types are disk related.
Have you previously tried breaking the 18-month processing further into 2 or 3 smaller batches? Unless, of course, your partitioning scheme requires all 18 months together in that partition - but then it would become a curious matter to see how and why you're partitioning the data with that scheme. And it would still be okay to break it into batches if you have validations in place when you recreate your indexes/constraints.
In my experience, I once had to create a job that would process between 50 and 60 million records and although the source was from data files and the destination was into a table in the server, breaking them into batches proved to be a faster method than going all out at once.
Are you worried about the nosedive performance because this is a highly-transactional database? If so, do you happen to have any data redundancy in place at your disposal?
[edit#01]
Re: Comment #01: Sorry if I'm being confusing; I meant that on the scheduled day for processing the records, it would be good to have a scheduled job run your SSIS package at certain intervals (so test how long 1 month takes to process, take the average, and give it a time buffer), handling a month or two at a time (if possible), and then just add a task at the top to compute/determine which month is to be processed.
Just an example:
< only assuming that two months take less than an hour to finish >
[scheduled run] : 01:00
[ssis task 01] get hour value of current time. if hour = 1 then set monthtoprocessstart = 1 and monthtoprocessend = 2
[ssis task 02 and so on] : work with data whose months are in the range (monthtoprocessstart and end for the year you are processing)
If this is more confusing just let me know so I can remove the edit.. Thanks..
We have flat files (CSV) with >200,000,000 rows, which we import into a star schema with 23 dimension tables. The biggest dimension table has 3 million rows. At the moment we run the importing process on a single computer and it takes around 15 hours. As this is too long, we want to utilize something like 40 computers to do the importing.
My question
How can we efficiently utilize the 40 computers to do the importing? The main worry is that there will be a lot of time spent replicating the dimension tables across all the nodes, as they need to be identical on all nodes. This could mean that if we utilized 1000 servers to do the importing in the future, it might actually be slower than utilizing a single one, due to the extensive network communication and coordination between the servers.
Does anyone have a suggestion?
EDIT:
The following is a simplification of the CSV files:
"avalue";"anothervalue"
"bvalue";"evenanothervalue"
"avalue";"evenanothervalue"
"avalue";"evenanothervalue"
"bvalue";"evenanothervalue"
"avalue";"anothervalue"
After importing, the tables look like this:
dimension_table1
id name
1 "avalue"
2 "bvalue"
dimension_table2
id name
1 "anothervalue"
2 "evenanothervalue"
Fact table
dimension_table1_ID dimension_table2_ID
1 1
2 2
1 2
1 2
2 2
1 1
You could consider using a 64bit hash function to produce a bigint ID for each string, instead of using sequential IDs.
With 64-bit hash codes, you can store 2^(32 - 7) or over 30 million items in your hash table before there is a 0.0031% chance of a collision.
This would allow you to have identical IDs on all nodes, with no communication whatsoever between servers between the 'dispatch' and the 'merge' phases.
You could even increase the number of bits to further lower the chance of collision; only, you would not be able to make the resultant hash fit in a 64bit integer database field.
See:
http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
http://code.google.com/p/smhasher/wiki/MurmurHash
http://www.partow.net/programming/hashfunctions/index.html
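If the dimension build happens inside SQL Server, one possible sketch is to derive the bigint from the value itself with HASHBYTES, taking the first 8 bytes of an MD5 hash; any stable 64-bit hash (FNV, MurmurHash) computed in your loader would serve the same purpose:
-- Deterministic bigint ID derived from the dimension value itself,
-- so every node computes the same ID without coordination.
SELECT  name,
        CAST(SUBSTRING(HASHBYTES('MD5', name), 1, 8) AS BIGINT) AS id
FROM    (VALUES ('avalue'), ('bvalue'), ('anothervalue'), ('evenanothervalue')) AS v(name);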
Loading CSV data into a database is slow because it needs to read, split and validate the data.
So what you should try is this:
Setup a local database on each computer. This will get rid of the network latency.
Load a different part of the data on each computer. Try to give each computer the same chunk. If that isn't easy for some reason, give each computer, say, 10'000 rows. When they are done, give them the next chunk.
Dump the data with the DB tools
Load all dumps into a single DB
Make sure that your loader tool can import data into a table which already contains data. If you can't do this, check your DB documentation for "remote tables". A lot of databases allow you to make a table from another DB server visible locally.
That allows you to run commands like insert into TABLE (....) select .... from REMOTE_SERVER.TABLE
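With SQL Server linked servers, for instance, that might look roughly like this (WORKER01 is a hypothetical linked server name and dbo.fact_table a hypothetical staging copy of the fact table):
-- Pull one worker's rows into the central fact table over a linked server
INSERT INTO dbo.fact_table (dimension_table1_ID, dimension_table2_ID)
SELECT dimension_table1_ID, dimension_table2_ID
FROM   WORKER01.StagingDB.dbo.fact_table;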
If you need primary keys (and you should), you will also have the problem of assigning PKs during the import into the local DBs. I suggest adding the PKs to the CSV file.
[EDIT] After checking your edits, here is what you should try:
Write a small program which extracts the unique values in the first and second column of the CSV file. That could be a simple script like:
cut -d";" -f1 | sort -u | nawk ' { print FNR";"$0 }'
This is a pretty cheap process (a couple of minutes even for huge files). It gives you ID-value files.
Write a program which reads the new ID-value files, caches them in memory and then reads the huge CSV files and replaces the values with the IDs.
If the ID-value files are too big, just do this step for the small files and load the huge ones into all 40 per-machine DBs.
Split the huge file into 40 chunks and load each of them on each machine.
If you had huge ID-value files, you can use the tables created on each machine to replace all the values that remained.
Use backup/restore or remote tables to merge the results.
Or, even better, keep the data on the 40 machines and use algorithms from parallel computing to split the work and merge the results. That's how Google can create search results from billions of web pages in a few milliseconds.
See here for an introduction.
This is a very generic question and does not take the database backend into account. Firing 40 or 1000 machines at a database backend that cannot handle the load will get you nothing. Such a problem is truly too broad to answer in a specific way. You should get in touch with people inside your organization with enough skill at the DB level first, and then come back with a more specific question.
Assuming N computers, X files at about 50GB each, and a goal of having 1 database containing everything at the end.
Question: It takes 15 hours now. Do you know which part of the process is taking the longest? (Reading data, cleansing data, saving read data in tables, indexing… you are inserting data into unindexed tables and indexing after, right?)
To split this job up amongst the N computers, I’d do something like (and this is a back-of-the-envelope design):
Have a “central” or master database. Use this to manage the overall process and to hold the final complete warehouse.
It contains lists of all X files and all N-1 (not counting itself) “worker” databases.
Each worker database is somehow linked to the master database (just how depends on the RDBMS, which you have not specified).
When up and running, a "ready" worker database polls the master database for a file to process. The master database doles out files to worker systems, ensuring that no file gets processed by more than one at a time. (You have to track success/failure of loading a given file; watch for timeouts (worker failed), manage retries.)
Each worker database has a local instance of the star schema. When assigned a file, it empties the schema and loads the data from that one file. (For scalability, it might be worth loading a few files at a time?) “First stage” data cleansing is done here for the data contained within that file(s).
When loaded, the master database is updated with a “ready flag” for that worker, and the worker goes into waiting mode.
The master database has its own to-do list of worker databases that have finished loading data. It processes each waiting worker set in turn; when a worker set has been processed, the worker is set back to “check if there’s another file to process” mode.
At start of process, the star schema in the master database is cleared. The first set loaded can probably just be copied over verbatim.
For the second set and up, you have to read and “merge” the data – toss out redundant entries, merge data via conformed dimensions, etc. Business rules that apply to all the data, not just one set at a time, must be done now as well. This would be “second stage” data cleansing.
Again, repeat the above step for each worker database, until all files have been uploaded.
Advantages:
Reading/converting data from files into databases and doing “first stage” cleansing gets scaled out across N computers.
Ideally, little work (“second stage”, merging datasets) is left for the master database
Limitations:
Lots of data is first read into worker database, and then read again (albeit in DBMS-native format) across the network
Master database is a possible chokepoint. Everything has to go through here.
Shortcuts:
It seems likely that when a workstation “checks in” for a new file, it can refresh a local store of data already loaded in the master and add data cleansing considerations based on this to its “first stage” work (i.e. it knows code 5484J has already been loaded, so it can filter it out and not pass it back to the master database).
SQL Server table partitioning or similar physical implementation tricks of other RDBMSs could probably be used to good effect.
Other shortcuts are likely, but it totally depends upon the business rules being implemented.
Unfortunately, without further information or understanding of the system and data involved, one can’t tell if this process would end up being faster or slower than the “do it all on one box” solution. At the end of the day it depends a lot on your data: does it submit to “divide and conquer” techniques, or must it all be run through a single processing instance?
The simplest thing is to make one computer responsible for handing out new dimension item id's. You can have one for each dimension. If the dimension handling computers are on the same network, you can have them broadcast the id's. That should be fast enough.
What database did you plan on using with a 23-dimension star schema? Importing might not be the only performance bottleneck. You might want to do this in a distributed main-memory system. That avoids a lot of the materialization issues.
You should investigate whether there are highly correlated dimensions.
In general, with a 23-dimension star schema with large dimensions, a standard relational database (SQL Server, PostgreSQL, MySQL) is going to perform extremely badly on data warehouse queries. In order to avoid having to do a full table scan, relational databases use materialized views. With 23 dimensions you cannot afford enough of them. A distributed main-memory database might be able to do full table scans fast enough (in 2004 I did about 8 million rows/sec/thread on a Pentium 4 3 GHz in Delphi). Vertica might be another option.
Another question: how large is the file when you zip it? That provides a good first order estimate of the amount of normalization you can do.
[edit] I've taken a look at your other questions. This does not look like a good match for PostgreSQL (or MySQL or SQL Server). How long are you willing to wait for query results?
Rohita,
I'd suggest you eliminate a lot of the work from the load by summarising the data FIRST, outside of the database. I work in a Solaris unix environment. I'd be leaning towards a korn-shell script, which cuts the file up into more manageable chunks, then farms those chunks out equally to my two OTHER servers. I'd process the chunks using a nawk script (nawk has an efficient hashtable, which they call "associative arrays") to calculate the distinct values (the dimension tables) and the fact table. Just associate each new name seen with an incrementing counter for that dimension, then write the fact.
If you do this through named pipes you can push, process remotely, and read back the data 'on the fly' while the "host" computer sits there loading it straight into tables.
Remember, No matter WHAT you do with 200,000,000 rows of data (How many Gig is it?), it's going to take some time. Sounds like you're in for some fun. It's interesting to read how other people propose to tackle this problem... The old adage "there's more than one way to do it!" has never been so true. Good luck!
Cheers. Keith.
On another note, you could utilize the Windows Hyper-V Cloud Computing add-on for Windows Server: http://www.microsoft.com/virtualization/en/us/private-cloud.aspx
It seems that your implementation is very inefficient as it's loading at the speed of less than 1 MB/sec (50GB/15hrs).
A proper implementation on a modern single server (2x Xeon 5690 CPUs + enough RAM for ALL dimensions loaded into hash tables + 8GB) should give you at least 10 times better speed, i.e. at least 10MB/sec.
It seems to me this question won't have a precise answer, since it requires too complex an analysis and a deep dive into the details of our system.
We have a distributed network of sensors. Information is gathered in one database and further processed.
The current DB design is to have one huge table partitioned per month. We try to keep it at 1 billion rows (usually 600-800 million records), so the fill rate is 20-50 million records per day.
The DB server is currently MS SQL 2008 R2, but we started from 2005 and upgraded during project development.
The table itself contains SensorId, MessageTypeId, ReceiveDate and Data fields. The current solution is to preserve sensor data in the Data field (binary, 16 byte fixed length), partially decoding its type and storing it in MessageTypeId.
We have different kinds of message types sent by sensors (currently approx. 200) and the number can be further increased on demand.
Main processing is done on an application server which fetches records on demand (by type, sensorId and date range), decodes them and carries out the required processing. The current speed is enough for such an amount of data.
We have a request to increase the capacity of our system by 10-20 times and we worry whether our current solution is capable of that.
We also have 2 ideas to "optimise" the structure, which I want to discuss.
1 Sensor data can be split into types; I'll use 2 primary ones for simplicity: (value) level data (analog data with a range of values) and state data (a fixed set of values).
So we can redesign our table into a bunch of small ones using the following rules (see the sketch after this list):
for each fixed type value (state type), create its own table with SensorId and ReceiveDate (so we avoid storing the type and the binary blob); all dependent (extended) states will be stored in their own tables, similar to a foreign key, so if we have a State with values A and B, and dependent (or additional) states 1 and 2 for it, we end up with tables StateA_1, StateA_2, StateB_1, StateB_2. So the table name consists of the fixed states it represents.
for each analog data type we create a separate table; it will be similar to the first type but contains an additional field with the sensor value;
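A sketch of what two such narrow tables might look like (table and column names are hypothetical, following the naming rule above):
-- State-type table: the table name itself encodes the fixed state combination
CREATE TABLE dbo.StateA_1 (
    sensorId    INT      NOT NULL,
    receiveDate DATETIME NOT NULL
);

-- Analog (level) table: same key columns plus the measured value
CREATE TABLE dbo.Level_Temperature (
    sensorId    INT           NOT NULL,
    receiveDate DATETIME      NOT NULL,
    value       DECIMAL(18,6) NOT NULL
);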
Pros:
Store only the required amount of data (currently our binary blob Data reserves space for the longest value), reducing DB size;
To get data of a particular type, we access the right table instead of filtering by type;
Cons:
AFAIK, it violates recommended practices;
Requires framework development to automate table management, since it would be a DBA's hell to maintain manually;
The number of tables can be considerably large, since it requires full coverage of the possible values;
The DB schema changes when new sensor data, or even a new state value for already defined states, is introduced, and thus can require complex changes;
Complex management is error prone;
It may be DB-engine hell to insert values into such a table organisation;
The DB structure is not fixed (it changes constantly);
Probably all the cons outweigh the few pros, but if we get significant performance gains and/or (less preferred but still valuable) storage savings, maybe we'll follow that path.
2 Maybe just split the table per sensor (that would be about 100,000 tables), or better by sensor range, and/or move to different databases with dedicated servers, but we want to avoid spanning hardware if possible.
3 Leave as it is.
4 Switch to a different kind of DBMS, e.g. a column-oriented DBMS (HBase and similar).
What do you think? Maybe you can suggest resources for further reading?
Update:
The nature of the system is that some data from sensors can arrive with up to a month's delay (usually a 1-2 week delay), some sensors are always online, and some kinds of sensors have on-board memory and eventually come online. Each sensor message has an associated event-raised date and a server-received date, so we can distinguish recent data from data gathered some time ago. The processing includes some statistical calculations, parameter deviation detection, etc. We build aggregated reports for quick viewing, but when data arrives from a sensor that updates old (already processed) data, we have to rebuild some reports from scratch, since they depend on all available data and the aggregated values can't be reused. So we usually keep 3 months of data for quick access and archive the rest. We have tried hard to reduce the amount of data we need to store, but decided that we need it all to keep the results accurate.
Update2:
Here is the table with the primary data. As I mentioned in the comments, we removed all dependencies and constraints from it during the "need for speed" push, so it is used for storage only.
CREATE TABLE [Messages](
[id] [bigint] IDENTITY(1,1) NOT NULL,
[sourceId] [int] NOT NULL,
[messageDate] [datetime] NOT NULL,
[serverDate] [datetime] NOT NULL,
[messageTypeId] [smallint] NOT NULL,
[data] [binary](16) NOT NULL
)
Sample data from one of servers:
id sourceId messageDate serverDate messageTypeId data
1591363304 54 2010-11-20 04:45:36.813 2010-11-20 04:45:39.813 257 0x00000000000000D2ED6F42DDA2F24100
1588602646 195 2010-11-19 10:07:21.247 2010-11-19 10:08:05.993 258 0x02C4ADFB080000CFD6AC00FBFBFBFB4D
1588607651 195 2010-11-19 10:09:43.150 2010-11-19 10:09:43.150 258 0x02E4AD1B280000CCD2A9001B1B1B1B77
Just going to throw some ideas out there, hope they are useful - they're some of the things I'd be considering/thinking about/researching into.
Partitioning - you mention the table is partitioned by month. Is that partitioned manually yourself, or are you making use of the partitioning functionality available in Enterprise Edition? If manual, consider using the built-in partitioning functionality to partition your data out more, which should give you increased scalability/performance. This "Partitioned Tables and Indexes" article on MSDN by Kimberly Tripp is great - a lot of great info in there, I won't do it an injustice by paraphrasing! It's worth considering this over manually creating 1 table per sensor, which could be more difficult to maintain/implement and therefore adds complexity (simple = good). Of course, only if you have Enterprise Edition.
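A minimal sketch of built-in monthly partitioning for the Messages table shown in the question (boundary dates and filegroup mapping are placeholders; Enterprise Edition required):
CREATE PARTITION FUNCTION pfMessagesMonthly (DATETIME)
AS RANGE RIGHT FOR VALUES ('2010-10-01', '2010-11-01', '2010-12-01');

CREATE PARTITION SCHEME psMessagesMonthly
AS PARTITION pfMessagesMonthly ALL TO ([PRIMARY]);

-- New partitioned heap with the same shape as the existing Messages table
CREATE TABLE dbo.MessagesPartitioned (
    id            BIGINT IDENTITY(1,1) NOT NULL,
    sourceId      INT        NOT NULL,
    messageDate   DATETIME   NOT NULL,
    serverDate    DATETIME   NOT NULL,
    messageTypeId SMALLINT   NOT NULL,
    data          BINARY(16) NOT NULL
) ON psMessagesMonthly (serverDate);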
Filtered Indexes - check out this MSDN article
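For example, a filtered index covering only one frequently queried message type (258 is taken from the sample data; substitute whatever types dominate your reads):
CREATE NONCLUSTERED INDEX IX_Messages_Type258
ON dbo.Messages (sourceId, serverDate)
WHERE messageTypeId = 258;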
There is of course the hardware element - goes without saying that a meaty server with oodles of RAM/fast disks etc will play a part.
One technique, not so much related to databases, is to switch to recording only a change in values - while keeping a minimum of n records per minute or so. So, for example, if sensor no. 1 is sending something like:
Id Date Value
-----------------------------
1 2010-10-12 11:15:00 100
1 2010-10-12 11:15:02 100
1 2010-10-12 11:15:03 100
1 2010-10-12 11:15:04 105
then only the first and last records would end up in the DB. To make sure that the sensor is "live", a minimum of 3 records per minute would be entered. This way the volume of data would be reduced.
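A rough T-SQL sketch of that insert-on-change idea; SensorReadings and the variables are hypothetical, and the minimum-records-per-minute heartbeat would need an additional time check:
DECLARE @SensorId INT = 1,
        @ReadDate DATETIME = GETDATE(),
        @Value INT = 100;

-- Write the reading only if it differs from the most recent stored value for this sensor
INSERT INTO dbo.SensorReadings (SensorId, ReadDate, Value)
SELECT @SensorId, @ReadDate, @Value
WHERE NOT EXISTS (
    SELECT 1
    FROM   dbo.SensorReadings r
    WHERE  r.SensorId = @SensorId
      AND  r.Value    = @Value
      AND  r.ReadDate = (SELECT MAX(ReadDate)
                         FROM dbo.SensorReadings
                         WHERE SensorId = @SensorId)
);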
Not sure if this helps, or if it would be feasible in your application -- just an idea.
EDIT
Is it possible to archive data based on the probability of access? Would it be correct to say that old data is less likely to be accessed than new data? If so, you may want to take a look at Bill Inmon's DW 2.0 Architecture for The Next Generation of Data Warehousing, where he discusses a model for moving data through different DW zones (Interactive, Integrated, Near-Line, Archival) based on the probability of access. Access times vary from very fast (Interactive zone) to very slow (Archival). Each zone has different hardware requirements. The objective is to prevent large amounts of data clogging the DW.
Storage-wise you are probably going to be fine. SQL Server will handle it.
What worries me is the load your server is going to take. If you are receiving transactions constantly, you would have some ~400 transactions per second today. Increase this by a factor of 20 and you are looking at ~8,000 transactions per second. That's not a small number considering you are doing reporting on the same data...
Btw, do I understand you correctly in that you are discarding the sensor data when you have processed it? So your total data set will be a "rolling" 1 billion rows? Or do you just append the data?
You could store the datetime stamps as integers. I believe datetime stamps use 8 bytes and integers only use 4 within SQL. You'd have to leave off the year, but since you are partitioning by month it might not be a problem.
So '12/25/2010 23:22:59' would get stored as 1225232259 -MMDDHHMMSS
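A quick sketch of deriving that integer in T-SQL with DATEPART arithmetic (the MMDDHHMMSS value always fits a 4-byte int):
DECLARE @dt DATETIME = '2010-12-25T23:22:59';

SELECT DATEPART(MONTH,  @dt) * 100000000
     + DATEPART(DAY,    @dt) * 1000000
     + DATEPART(HOUR,   @dt) * 10000
     + DATEPART(MINUTE, @dt) * 100
     + DATEPART(SECOND, @dt) AS MMDDHHMMSS;   -- 1225232259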
Just a thought...
I need to create a database that saves sensor data that will be queried to generate reports later on (Display a graph and AVG/MAX/MIN values for a given timeframe).
The data points look like this:
CREATE TABLE [dbo].[Table_1](
[time] [datetime] NOT NULL,
[sensor] [int] NOT NULL,
[value] [decimal](18, 0) NULL
)
Data can be added in intervals ranging from seconds to minutes (depending on the sensor).
Should I worry about my Database growing too big when several years of data accumulate (The DB will run on a MS SQL Server 2008 workgroup edition)?
There are specialized historian databases, such as OSISoft's PI Historian that handle this type of data a lot better than a relational database. With PI you can configure a compression deviation for each data point, such that the data will not be archived unless it changes by at least that compression deviation. When you query for the historical data for a given point, you can ask PI to do interpolation of what the value would have been at the specified time even though your time period is between the archived values.
It's capable of a whole lot more, but you will have to explore that on your own because I don't intend on becoming an OSISoft salesman. However, this is definitely the way you want to go for storing large quantities of sensor data.
It all depends what resources and effort you want to expend on it. At 1 row per second that table would still be less than 0.5GB per sensor per year, which is very small. If you have thousands of sensors then you might want to consider whether to create summary tables to help with the reporting and analysis of the data.
Sensor data like this is often very repetitive. There are more convenient ways to store repeated values - for example by storing one row with a range of times rather than multiple rows with different times.
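For instance, a run of identical readings could collapse into a single row with a validity range; this is a hypothetical shape, not a drop-in replacement for Table_1:
CREATE TABLE dbo.SensorValueRanges (
    sensor     INT           NOT NULL,
    start_time DATETIME      NOT NULL,
    end_time   DATETIME      NOT NULL,   -- last time the value was observed unchanged
    value      DECIMAL(18,0) NULL
);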
There are many software packages that can help with storing and managing this kind of time series data. There is also a significant body of research and literature on the subject, which might help you. If you aren't already familiar with it then Google for terms like "Process Historian", "Complex Event Processing" and "SCADA".
It depends on how you're going to use the data, what indexes you add in addition, how many sensors, etc.
That table, as shown, could store 150 million rows (~ 1 sensor x 1 recording per second x 5 years) in ~6GB of space (assuming a heap). The file size limit is 16 terabytes, and I'm not aware of any restrictions on this for Workgroup edition.
If you are worried about the database growing too big, then I would suggest having an Archive_Table with the same structure and archiving data at an interval such as once a month or every 6 months (depending entirely on the volume of data).
This would allow you to keep a check on the number of records in your table, and of course the archive tables would still be available for report generation when you need them.
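A minimal sketch of such an archiving job, assuming a hypothetical dbo.Table_1_Archive with the same structure and a 6-month cutoff:
BEGIN TRANSACTION;

-- Copy old rows to the archive table, then remove them from the live table
INSERT INTO dbo.Table_1_Archive ([time], [sensor], [value])
SELECT [time], [sensor], [value]
FROM   dbo.Table_1
WHERE  [time] < DATEADD(MONTH, -6, GETDATE());

DELETE FROM dbo.Table_1
WHERE  [time] < DATEADD(MONTH, -6, GETDATE());

COMMIT TRANSACTION;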