How to archive records in multiple tables within one Access database

I have a database that contains data from the last 4-5 years. The database has grown quite large, and the application that uses it has been running really slowly. I am looking to archive some data from all the tables. This database has 16-17 tables with relationships among them, and I am looking for a way to perform the archive operation so that I can archive/remove a couple of years' worth of data. I tried reading about APPEND and DELETE queries, but I am not sure how to apply them to multiple tables.
Another problem is that this application was created by somebody else, and I don't have enough knowledge about the database and the way the tables are structured. Any help/suggestions are much appreciated.

The first thing you need to do is gain an understanding of your dataset. If you truly "don't have enough knowledge about the database and the way tables are structured", you're setting yourself up for one huge failure.
The most important piece is to determine how the tables are inter-related. If the original designer was competent, he should have set up relationships within the database and enforced referential integrity. You will need to look at those relationships (go to the Database Tools tab and choose Relationships), or determine what relationships should exist between your tables.
Once you've determined how your data is related, you will need to set up new archive tables to mirror all of the tables you wish to archive. The easiest way to do this is to copy each table (right-click on it and choose "Copy"), then right-click elsewhere in the Navigation Pane and choose "Paste". You will get a "Paste Table As" dialog box.
Choose "Structure Only", since you just want to set the table up. I would give them the same name as the original tables, with "_Archive" tacked onto the end. This way, it will be easy to determine which tables you're working with.
Next up, determine which are your "parent" tables and which are your "children" tables. You do this by determining which fields contain relationships to each other, and how they're related. Any tables with a One-to-Many relationship can be considered "Parents" on the "One" side, and "Children" on the "Many" side.
After this, you will need to determine how you wish to archive. Usually there is some date field within your table that you can use as a guide. Say, for instance, you have a field called "Order Date". You can choose to archive anything with an Order Date before 1/1/2010. To do so, you will write an Append query that appends everything with an Order Date <= 12/31/2009 to your new archive table. You do this first for the Children tables, and then for the Parent tables.
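For example, a minimal Access append query sketch (the Orders table and [Order Date] field names are hypothetical; Orders_Archive is the structure-only copy described above):
INSERT INTO Orders_Archive
SELECT Orders.*
FROM Orders
WHERE Orders.[Order Date] <= #12/31/2009#;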
After this, you will write a Delete query. It's essentially the same process as above, but you're deleting from your original tables, since the data has already been written to your archive tables. You MUST delete from the Children tables first, then do the same for the Parent tables.
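The matching delete queries might look like this, again with hypothetical Orders/OrderDetails names, children first:
DELETE FROM OrderDetails
WHERE OrderID IN (SELECT OrderID FROM Orders WHERE [Order Date] <= #12/31/2009#);

DELETE FROM Orders
WHERE [Order Date] <= #12/31/2009#;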
You can now move all the archive tables into a new database and zip it up to minimize space. Once that's complete, you can Compact & Repair your database, and the size should be much smaller.
Always remember to make a copy first! If you make any mistakes, you can't undo them. Creating a copy allows you to go back and retry without losing any data.


UPDATE millions of rows, or DELETE/INSERT?

Sorry for the longish description... but here we go...
We have a fact table somewhat flattened with a few properties that you might have put in a dimension in a more "classic" data warehouse.
I expect to have billions of rows in that table.
We want to enrich these properties with some cleansing/grouping that would not change often, but would still do from time to time.
We are thinking of keeping this initial fact table as the "master" that we never update or delete from, and making an "extended fact" table copy of it where we just add the new derived properties.
The process of generating these extended property values requires mapping to some sort of lookup table, from which we get several possibilities for each row, and then select the best one (one per initial row).
This is likely to be processor intensive.
QUESTION (at last!):
Imagine my lookup table is modified and I want to re-assess the extended properties for only a subset of my initial fact table.
I would end up with a few million rows I want to MODIFY in the target extended fact table.
What would be the best way to achieve this update? (updating a couple of million rows within a couple of billion rows table)
Should I write an UPDATE statement with a join?
Would it be better to DELETE this million rows and INSERT the new ones?
Any other way, like creating a new extended fact table with only the appropriate INSERTs?
Thanks
Eric
PS: I come from a SQL Server background where DELETE can be slow
PPS: I still love SQL Server too! :-)
Write performance for Snowflake behaves quite differently from a traditional RDBMS. All your tables persist in S3, and S3 does not let you rewrite only select bytes of an existing object; the entire file object must be uploaded and replaced. So, while in, say, SQL Server, data and indexes are modified in place, creating new pages as necessary, an UPDATE/DELETE in Snowflake is a full sequential scan of the table file, creating an immutable copy of the original with the applicable rows filtered out (delete) or modified (update), which then replaces the file just scanned.
So, whether updating 1 row, or 1M rows, at minimum the entirety of the micro-partitions that the modified data exists in will have to be rewritten.
I would take a look at the MERGE command, which allows you to insert, update, and delete all in one command (effectively applying the differential from table A into table B). Among other things, it should keep your Time Travel costs down versus constantly wiping and rewriting tables. Another consideration is that since Snowflake is column-oriented, a column update in theory should only require operations on the S3 files for that column, whereas an insert/delete would have to rewrite the S3 files for all columns, which would hurt performance.
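A rough sketch of what that MERGE could look like (the extended_fact and reprocessed_rows table names and the columns are hypothetical):
-- apply the re-derived properties only for the affected subset of rows
MERGE INTO extended_fact t
USING reprocessed_rows s
    ON t.fact_id = s.fact_id
WHEN MATCHED THEN UPDATE SET t.derived_group = s.derived_group
WHEN NOT MATCHED THEN INSERT (fact_id, derived_group) VALUES (s.fact_id, s.derived_group);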

What effects does changing the column order of a table in a database have on server memory?

I just started my new job, and after looking at the DBM I was shocked. It's a huge mess.
Now, the first thing I wanted to do is get some consistency in the order of table columns. We publish new database versions via a .dacpac. My co-worker told me that changing the order of a column would force MSSQL to create a temporary table which stores all the data. MSSQL then creates a new table and inserts all the data into that table.
So let's say my server has only 2 GB of RAM and 500 MB of storage left on the hard drive, and the whole database weighs 20 GB. Is it possible that changing the order of columns will cause (memory-related) trouble? Is my co-worker's statement correct?
I couldn't find any good source for my question.
Thanks!
You should NOT "go one table by one".
You should leave your tables as they are; if you don't like the column order of some table, just create a view that returns the columns in the order you want.
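For example, a minimal sketch with a made-up table and column order:
CREATE VIEW dbo.MyTable_Ordered
AS
SELECT CustomerId, LastName, FirstName, CreatedDate  -- columns listed in the order you want to see them
FROM dbo.MyTable;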
Not only will changing the order of columns cause your tables to be recreated; all the indexes will be recreated as well, and you'll get problems with FK constraints.
And after all that, you'll gain absolutely nothing and only do damage. You'll waste server resources and make your tables temporarily inaccessible, and the columns will not be stored as you defined them anyway: internally they are stored in "var-fix" format (divided into fixed-length and variable-length parts).

Safe/reliable/standard process for making major changes to a database with existing data?

I would like to take one table that is heavy with flags and fields, and break it into smaller tables. The parent table to be revised/broken down already contains live data that must be handled with care.
Here is my plan of attack, which I'm hoping to execute this weekend while no one is using the system:
1. Create the new tables that we will need
2. Rename the existing parent table, ParentTable, to ParentTableOLD
3. Create a new table called ParentTable with the unneeded fields gone and new fields added
4. Run a procedure to copy the entries in ParentTableOLD to the new tables, mapping old data to new tables/fields where applicable
5. Delete the ParentTableOLD table
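For steps 2 and 3, I'm picturing something like this rough T-SQL sketch (the new column layout below is just a placeholder):
EXEC sp_rename 'dbo.ParentTable', 'ParentTableOLD';
GO
CREATE TABLE dbo.ParentTable (
    ParentId  INT IDENTITY(1,1) PRIMARY KEY,
    Name      NVARCHAR(100) NOT NULL,
    StatusId  INT NULL  -- replaces several of the old flag columns
);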
The above seems pretty reasonable and simple to me, and I'm fairly certain it will work. I'm interested in other techniques for achieving this (the above is the only thing I can think of), as well as any kind of tools to help stay organized. Right now I'm running on pen and paper.
Reason I ask is that several times now, I've been re-inventing the wheel just because I didn't know any better, and someone more experienced came along and saw what I was doing and said, "oh there's a built-in way to help do this," or, "there's a simpler way to do this." I did coding for months and months with Visual Studio before someone stopped by and said "you know about breakpoints to step through the code, yeah?" --- life changing, hah.
I have SQL Server 2008 R2 with SSMS.
A good trick to assist you in creating your '_old' tables is:
SELECT *
INTO mytable_old
FROM mytable
SELECT INTO will copy all of the data and create your table for you in one step.
This said, I would actually retain the current table names and instead copy everything into another schema. This will make adapting queries and reports to run over the old schema (where needed) a lot easier than having to add '_old' to all the names, since you can just find/replace the schema names instead.
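For example, a minimal sketch using a hypothetical 'legacy' schema:
CREATE SCHEMA legacy;
GO
-- copies data and structure into the legacy schema in one step
SELECT *
INTO legacy.ParentTable
FROM dbo.ParentTable;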
If at all possible, I'd be doing this in some sort of test environment first and foremost. If you have external applications that rely on the database, make sure they all run against your modified structure without any hiccups.
Also do a search on your database objects that might reference the table you are going to rename. For example:
SELECT Name
FROM sys.procedures
WHERE OBJECT_DEFINITION(OBJECT_ID) LIKE '%MyTable%'
Try to ensure some sort of functional equivalence between your new and old schemas. Have a query or queries that can be run against your renamed table, and equivalents that reference your new table structure. This way you can make sure the data returned is the same for both structures. If possible, prepare these ahead of time so that it is simply a series of checks you can run once you've made your modifications; if there are differences, this can help you decide whether to proceed with the change or back it out.
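For instance, a quick spot check with EXCEPT (column names are hypothetical, comparing only columns that exist in both structures):
-- rows in the old table that don't come back identically from the new structure
SELECT ParentId, Name
FROM dbo.ParentTableOLD
EXCEPT
SELECT ParentId, Name
FROM dbo.ParentTable;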
Lastly, have a plan for how you could revert to the old schema if something catastrophic were to occur. If you'd been working with your new table structure for a period of time and then discovered a major issue, could you revert to the old table and successfully get the data from your modified table structure back into the old table? Basically, follow the Boy Scout rule and be prepared.
This isn't really an answer to your overall problem, but a couple of tools you might find useful for your Step 4 are RedGate's SQL Compare and Data Compare. SQL Compare will perform schema migrations, and Data Compare will help migrate data. You can move data to new columns and new tables, populate default values, and sync from dev to production, among other things.
You can make your changes in a dev environment with production data, and when you're satisfied with the process, do the actual migration in production.
Make a backup of the database (for reference: http://msdn.microsoft.com/en-us/library/ms187510.aspx) and then carry out the required steps. If everything goes fine, go ahead; otherwise, restore the old database (for reference: http://msdn.microsoft.com/en-us/library/ms177429.aspx).
You can even automate this backup process to run, say, every week.
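A minimal sketch of the backup and the fallback restore (the database name and backup path are hypothetical):
-- take a full backup before making changes
BACKUP DATABASE MyAppDb
    TO DISK = N'D:\Backups\MyAppDb_pre_change.bak'
    WITH INIT, CHECKSUM;

-- if the change goes badly, restore from that backup
RESTORE DATABASE MyAppDb
    FROM DISK = N'D:\Backups\MyAppDb_pre_change.bak'
    WITH REPLACE;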

What is a good approach to preloading data?

Are there best practices out there for loading data into a database, to be used with a new installation of an application? For example, for application foo to run, it needs some basic data before it can even be started. I've used a couple of options in the past:
TSQL for every row that needs to be preloaded:
IF NOT EXISTS (SELECT * FROM Master.Site WHERE Name = @SiteName)
    INSERT INTO [Master].[Site] ([EnterpriseID], [Name], [LastModifiedTime], [LastModifiedUser])
    VALUES (@EnterpriseId, @SiteName, GETDATE(), @LastModifiedUser)
Another option is a spreadsheet. Each tab represents a table, and data is entered into the spreadsheet as we realize we need it. Then, a program can read this spreadsheet and populate the DB.
There are complicating factors, including the relationships between tables, so it's not as simple as loading each table by itself. For example, if we create Security.Member rows and then want to add those members to Security.Role, we need a way of maintaining that relationship.
Another factor is that not all databases will be missing this data. Some locations will already have most of the data, and others (that may be new locations around the world), will start from scratch.
Any ideas are appreciated.
If it's not a lot of data (the bare initialization of configuration data), we typically script it along with any database creation/modification.
With scripts you have a lot of control, so you can insert only missing rows, remove rows which are known to be obsolete, not override certain columns which have been customized, etc.
If it's a lot of data, then you probably want external file(s). I would avoid a spreadsheet and use plain text file(s) instead (BULK INSERT). You could load these into a staging area and still use the same techniques you might use in a script to ensure you don't clobber any special customization in the destination. And because it's under script control, you control the order of operations to ensure referential integrity.
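A minimal BULK INSERT sketch (the staging table and file path are hypothetical):
-- load the plain-text seed file into a staging table first
BULK INSERT Staging.Site
FROM 'C:\seed\Site.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);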
I'd recommend a combination of the 2 approaches indicated by Cade's answer.
Step 1. Load all the needed data into temp tables (on Sybase, for example, load data for table "db1..table1" into "temp..db1_table1"). In order to handle large datasets, use a bulk copy mechanism (whichever one your DB server supports) without writing to the transaction log.
Step 2. Run a script whose main step iterates over each table to be loaded: if needed, create indexes on the newly created temp table, compare the data in the temp table to the main table, and insert/update/delete the differences. Then, as needed, the script can do auxiliary tasks like the security role setup you mentioned.
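On SQL Server, that compare-and-apply step could be a MERGE; a sketch reusing the Master.Site columns from the question (the Staging.Site source table is hypothetical):
MERGE Master.Site AS target
USING Staging.Site AS source
    ON target.Name = source.Name
WHEN MATCHED AND target.EnterpriseID <> source.EnterpriseID THEN
    UPDATE SET EnterpriseID = source.EnterpriseID,
               LastModifiedTime = GETDATE(),
               LastModifiedUser = source.LastModifiedUser
WHEN NOT MATCHED BY TARGET THEN
    INSERT (EnterpriseID, Name, LastModifiedTime, LastModifiedUser)
    VALUES (source.EnterpriseID, source.Name, GETDATE(), source.LastModifiedUser);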

Can SQL Server Replication include the source dbid in the replicated data?

Let's say I have DatabaseA with TableA, which has these fields: Id, Name.
In another database, DatabaseB, I have TableA which has these fields: DatabaseId, Id, Name.
Is it possible to setup a replication publication that will send:
DatabaseA.dbid, DatabaseA.TableA.Id, DatabaseA.TableA.Name
to DatabaseB.TableA?
Edit:
The reason I'm asking is that I need to combine multiple databases (with identical schemas) into a single database, with as little latency as possible. Replication seemed like a good place to start (need to replicate data from one place to another), but I'm just in the brainstorming phase. I would definitely be open to suggestions on how to accomplish this without using replication.
There might be an easier way to do it, but the first thing I thought of is wrapping TableA in an indexed view on the source database and then replicating the view as a table (i.e., type = "indexed view logbased"). I don't think this would work with merge replication, though.
So, that would roughly be like:
CREATE VIEW dbo.TableA_with_dbid WITH SCHEMABINDING AS
    SELECT DB_ID() AS dbid, Id, Name FROM dbo.TableA  -- DB_ID() returns the id of the current (source) database
CREATE UNIQUE CLUSTERED INDEX IX_TableA_with_dbid ON dbo.TableA_with_dbid (Id) -- or whatever your PK is
EXEC sp_addarticle ...,
    @source_object = 'TableA_with_dbid',
    @destination_table = 'TableA',
    @type = 'indexed view logbased',
    ...
Big caveat: indexed views have a lot of requirements that may not be appropriate for your application. For example, certain options have to be set any time you update the base table.
(In response to the edit in your question...) This won't work for combining multiple sources into one table. AFAIK, an object in a subscribing database can only come from one published article. And you can't do an indexed view on the subscribing side since UNION is not allowed in an indexed view. (The docs don't explicitly state UNION ALL is disallowed, but it wouldn't surprise me. You might try it just in case.) But it still does answer your explicit question: the dbid would be in the replicated table.
Are you aggregating these events in one place from multiple sources? Replication only comes from one source - it's one-to-one, so the source ID doesn't seem like it would make much sense.
If you're aggregating data from multiple sources, maybe linked servers and triggers is a better choice, and if that's the case, then you could absolutely include any information about the source that you want.
If you can clarify your question to describe the purpose, it would help us find the best solution.
UPDATED FROM NEW DETAIL IN QUESTION:
Does this solution sound like it might be what you need?
Set up AFTER triggers on the source databases that send any changed rows to the central repository database, in some kind of holding table. These rows can include additional columns, like "Source", "Change type" (for insert, delete, etc).
Some central process watches the table and processes new rows (or runs periodically - once/minute, maybe), incorporating them into the central database
You could adjust how frequently the check/merge process runs on the server based on your needs (even running it constantly to handle new rows as they appear, perhaps even with an AFTER trigger on that table as well).
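A rough sketch of such an AFTER trigger (the linked server, holding table, and columns are all hypothetical):
CREATE TRIGGER trg_TableA_ToCentral ON dbo.TableA
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- send inserted/updated rows to the central holding table
    INSERT INTO CENTRAL.Repository.dbo.TableA_Changes (SourceDb, ChangeType, Id, Name)
    SELECT DB_NAME(), 'IU', i.Id, i.Name
    FROM inserted AS i;

    -- send deleted rows (present in 'deleted' with no counterpart in 'inserted')
    INSERT INTO CENTRAL.Repository.dbo.TableA_Changes (SourceDb, ChangeType, Id, Name)
    SELECT DB_NAME(), 'D', d.Id, d.Name
    FROM deleted AS d
    WHERE NOT EXISTS (SELECT 1 FROM inserted AS i WHERE i.Id = d.Id);
END;
Note that writing across a linked server from inside a trigger will typically promote the statement to a distributed transaction, so it's worth testing the performance impact before committing to this design.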
