One of our sites has around 10,000 nodes. Each node has a simple CCK text/integer field. This integer changes daily, so the values need to be updated every day. The integer ranges from 1 to 20000000. The CCK field is shared across all content types, so it has its own table in the database. We don't use revisions. I chose to have it read a CSV file because this table is very simple, with 3 fields, all integers; I didn't need all the flexibility of doing a PHP array type import.
I created a cron job to execute a PHP script every day which contains something similar to:
LOAD DATA LOCAL INFILE 'file.csv'
REPLACE INTO TABLE content_field_mycckfield
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(vid, nid, field_mycckfield_value);
At the end of the script, it counts how many records were imported and reports successes and errors.
The file is below public, and all that jazz.
Are there any other steps I am missing? Anything I should be aware of or be cautious of?
Should I have it optimize or defragment this table after every run? Or every (x) runs?
Should I have it first imported into a temp_ table to normalize the data, then have it copied/moved into TABLE content_field_mycckfield?
10,000 records is big but not massive in MySQL terms and the table is simple enough that I don't think you need any optimisation. If the data in the table is reliable and your .csv is always well formed then there's not a lot to go wrong.
A separate issue is whether your import process throws errors. If there is even the remotest chance that the .csv could contain incorrect column references, missing commas, etc., then your idea to test everything in a temp table is certainly a good one.
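If you do go the temp-table route, a rough sketch of what that could look like in MySQL (the staging table name here is made up, and the sanity check is just an example):

-- Hypothetical staging table mirroring the real one.
CREATE TABLE IF NOT EXISTS temp_mycckfield_import LIKE content_field_mycckfield;
TRUNCATE TABLE temp_mycckfield_import;

-- Load the raw file into the staging table instead of the live table.
LOAD DATA LOCAL INFILE 'file.csv'
INTO TABLE temp_mycckfield_import
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(vid, nid, field_mycckfield_value);

-- Example sanity check: count rows whose value is outside the expected range.
SELECT COUNT(*) FROM temp_mycckfield_import
WHERE field_mycckfield_value NOT BETWEEN 1 AND 20000000;

-- If the checks pass, push everything into the live table in one go.
REPLACE INTO content_field_mycckfield (vid, nid, field_mycckfield_value)
SELECT vid, nid, field_mycckfield_value
FROM temp_mycckfield_import;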
The only other things I can think of are (in order of neuroticism):
Perform this operation overnight or whenever your site is unused
Have the PHP script catch errors and email you the results of each run
Have the script back up the table, run the .csv import, check for errors, and if there are errors, email you and restore the backup (a rough sketch of the backup step follows this list)
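For that last one, a minimal sketch of the backup/restore step in MySQL (the backup table name is made up):

-- Keep a copy of the current data before the import runs.
DROP TABLE IF EXISTS content_field_mycckfield_backup;
CREATE TABLE content_field_mycckfield_backup LIKE content_field_mycckfield;
INSERT INTO content_field_mycckfield_backup
SELECT * FROM content_field_mycckfield;

-- ...run the LOAD DATA import and error checks here...

-- If the import reported errors, restore the previous data:
-- TRUNCATE TABLE content_field_mycckfield;
-- INSERT INTO content_field_mycckfield
-- SELECT * FROM content_field_mycckfield_backup;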
Hope any of that helps!
Related
I have data to load where I only need to pull records added since the last time I pulled this data. There are no date fields in my destination table to save this information, so I have to keep track of the maximum date that I last pulled. The problem is that I can't see how to save this value in SSIS for the next time the project runs.
I saw this:
Persist a variable value in SSIS package
but it doesn't work for me because there is another process that purges and reloads the data separate from my process. This means that I have to do more than just know the last time my process ran.
The only solution I can think of is to create a table but it seems a bit much to create a table to hold one field.
This is a very common thing to do. You create an execution table that stores the package name, the start time, the end time, and whether the package failed or succeeded. You can then pull the max start time of the last successful execution.
You can't persist anything in a package between executions.
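A minimal sketch of such an execution table and the watermark query (the table and package names here are just illustrative):

CREATE TABLE dbo.PackageExecution (
    ExecutionId  int IDENTITY(1,1) PRIMARY KEY,
    PackageName  sysname  NOT NULL,
    StartTime    datetime NOT NULL,
    EndTime      datetime NULL,
    Succeeded    bit      NOT NULL DEFAULT (0)
);

-- At the start of the next run, pull the watermark:
SELECT MAX(StartTime)
FROM dbo.PackageExecution
WHERE PackageName = 'MyLoadPackage'  -- hypothetical package name
  AND Succeeded = 1;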
What you're talking about is a form of differential replication and this has been done many many times.
For differential replication it is normal to store some kind of state in the subscriber (the system reading the data) or the publisher (the system providing the data) that remembers what state you're up to.
So I suggest you:
Read up on differential replication design patterns
Absolutely put your mind at rest about writing data to a table
If you end up having more than one source system or more than one source table, your storage table is not going to have just one record. Have a think about that. I answered a question like this the other day - you'll find over time that you're going to add handy things like the last time the replication ran, how long it took, how many records were transferred, etc.:
Is it viable to have a SQL table with only one row and one column?
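As an illustration, the kind of state table that ends up accumulating those handy columns might look something like this (every name here is invented):

CREATE TABLE dbo.ReplicationState (
    SourceSystem    varchar(50)  NOT NULL,
    SourceTable     varchar(128) NOT NULL,
    LastExtractedAt datetime     NOT NULL,
    LastRunSeconds  int          NULL,
    RowsTransferred int          NULL,
    CONSTRAINT PK_ReplicationState PRIMARY KEY (SourceSystem, SourceTable)
);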
TTeeple and Nick.McDermaid are absolutely correct, and you should follow their advice if humanly possible.
But if for some reason you don't have access to write to an execution table, you can always use a script task to read/write the last loaded date to a text file on whatever local file system you're running SSIS on.
Here is my situation: my client wants to bulk insert 100,000+ rows into the database from a CSV file, which is simple enough, but the values need to be checked against data that is already in the database (does this product type exist? is this product still sold? etc.). To make things worse, these files will also be uploaded into the live system during the day, so I need to make sure I'm not locking any tables for long. The data that is inserted will also be spread across multiple tables.
I've been loading the data into a staging table, which takes seconds. I then tried creating a WebService to start processing the table using LINQ and marking any erroneous rows with an invalid flag (this can take some time). Once the validation is done I need to take the valid rows and update/add them to the appropriate tables.
Is there a process for this that I am unfamiliar with?
For a smaller dataset I would suggest:
IF EXISTS (SELECT blah FROM blah WHERE ...)
    UPDATE blah ...
ELSE
    INSERT INTO blah ...
You could do this in chunks to avoid server load, but it is by no means a quick solution, so SSIS would be preferable.
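For the volumes you're describing, a hedged sketch of the chunked variant, working from a staging table (all the table and column names - ProductStaging, Product, ProductCode, Price, Processed, StagingId - are placeholders):

-- Process the staging rows in small batches so no lock is held for long.
DECLARE @BatchSize int;
SET @BatchSize = 5000;

WHILE EXISTS (SELECT 1 FROM dbo.ProductStaging WHERE Processed = 0)
BEGIN
    -- Pick one batch of unprocessed staging rows.
    SELECT TOP (@BatchSize) StagingId, ProductCode, Price
    INTO #Batch
    FROM dbo.ProductStaging
    WHERE Processed = 0
    ORDER BY StagingId;

    BEGIN TRAN;

    -- Update products that already exist.
    UPDATE p
    SET p.Price = b.Price
    FROM dbo.Product AS p
    JOIN #Batch AS b ON b.ProductCode = p.ProductCode;

    -- Insert the ones that don't.
    INSERT INTO dbo.Product (ProductCode, Price)
    SELECT b.ProductCode, b.Price
    FROM #Batch AS b
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Product AS p WHERE p.ProductCode = b.ProductCode);

    -- Mark the batch as done so the loop terminates.
    UPDATE s
    SET s.Processed = 1
    FROM dbo.ProductStaging AS s
    JOIN #Batch AS b ON b.StagingId = s.StagingId;

    COMMIT TRAN;
    DROP TABLE #Batch;
END;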
I need to move data that is a month old from a logging table to a logging-archive table, and remove data older than a year from the latter.
There is a lot of data (600k inserts in 2 months).
I was considering simply calling (batching) a stored proc every day/week.
I first thought about doing a couple of stored procs:
Deleting from the archive what is older than 365 days
Moving data older than 30 days from logging to archive (I suppose there's a way to do that with one SQL query)
Removing from logging what is older than 30 days.
However, this solution seems quite inefficient and will probably lock the DB for a few minutes, which I do not want.
So, do I have any alternatives, and what are they?
None of this should lock the tables that you actually use. You are writing only to the logging table currently, and only to new records.
You are selecting from the logging table only OLD records, and writing to a table that you don't write to except for the archive process.
The steps you are taking sound fine. I would go one step further, and instead of deleting based on date, just do an INNER JOIN to your archive table on your id field - then you only delete the specific records you have archived.
As a side note, 600k records is not very big at all. We have production DBs with tables over 2 billion rows, and I know some other folks here have DBs with millions of inserts a minute into transactional tables.
Edit:
I forgot to mention originally that another benefit of your planned method is that each step is isolated. If you need to stop for any reason, none of your steps is destructive or depends on the next step executing immediately. You could potentially archive a lot of records, then run the deletes the next day or overnight without creating any issues.
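A rough sketch of those steps (T-SQL syntax assumed; the LogId/LogDate/Message columns are made up, so adjust to your schema):

-- 1. Purge archive rows older than a year.
DELETE FROM dbo.LoggingArchive
WHERE LogDate < DATEADD(day, -365, GETDATE());

-- 2. Copy month-old rows into the archive (skipping any already archived).
INSERT INTO dbo.LoggingArchive (LogId, LogDate, Message)
SELECT l.LogId, l.LogDate, l.Message
FROM dbo.Logging AS l
WHERE l.LogDate < DATEADD(day, -30, GETDATE())
  AND NOT EXISTS (SELECT 1 FROM dbo.LoggingArchive AS a WHERE a.LogId = l.LogId);

-- 3. Delete from logging only what was actually archived, joining on the id.
DELETE l
FROM dbo.Logging AS l
INNER JOIN dbo.LoggingArchive AS a ON a.LogId = l.LogId;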
What if you archived to a secondary database?
I.e.:
Primary database has the logging table.
Secondary database has the archive table.
That way, if you're worried about locking your archive table so you can do a batch on it, it won't take your primary database down.
But in any case, I'm not sure you have to worry about locking -- I guess it just depends on how you implement it.
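A tiny sketch of the cross-database copy (T-SQL syntax assumed; the database, table and column names are all made up):

-- The archive table lives in a separate database, so heavy work there
-- doesn't compete with the primary database.
INSERT INTO ArchiveDB.dbo.LoggingArchive (LogId, LogDate, Message)
SELECT LogId, LogDate, Message
FROM PrimaryDB.dbo.Logging
WHERE LogDate < DATEADD(day, -30, GETDATE());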
Are there best practices out there for loading data into a database, to be used with a new installation of an application? For example, for application foo to run, it needs some basic data before it can even be started. I've used a couple options in the past:
TSQL for every row that needs to be preloaded:
IF NOT EXISTS (SELECT * FROM Master.Site WHERE Name = @SiteName)
    INSERT INTO [Master].[Site] ([EnterpriseID], [Name], [LastModifiedTime], [LastModifiedUser])
    VALUES (@EnterpriseId, @SiteName, GETDATE(), @LastModifiedUser)
Another option is a spreadsheet. Each tab represents a table, and data is entered into the spreadsheet as we realize we need it. Then, a program can read this spreadsheet and populate the DB.
There are complicating factors, including the relationships between tables. So, it's not as simple as loading tables by themselves. For example, if we create Security.Member rows and then want to add those members to Security.Role, we need a way of maintaining that relationship.
Another factor is that not all databases will be missing this data. Some locations will already have most of the data, and others (that may be new locations around the world), will start from scratch.
Any ideas are appreciated.
If it's not a lot of data - just the bare initialization of configuration data - we typically script it along with any database creation/modification.
With scripts you have a lot of control, so you can insert only missing rows, remove rows which are known to be obsolete, not override certain columns which have been customized, etc.
If it's a lot of data, then you probably want to have an external file (or files) - I would avoid a spreadsheet and use plain text file(s) instead (BULK INSERT). You could load this into a staging area and still use the same techniques you might use in a script to ensure you don't clobber any special customization in the destination. And because it's under script control, you've got control of the order of operations to ensure referential integrity.
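A hedged sketch of that file-based route, reusing the Master.Site table from the question (the staging table name and file path are made up):

-- Load the raw file into a staging table first.
BULK INSERT Staging.Site
FROM 'C:\seed\site.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

-- Insert only the rows the destination is missing, so rows that
-- already exist (possibly customized) are left untouched.
INSERT INTO [Master].[Site] ([EnterpriseID], [Name], [LastModifiedTime], [LastModifiedUser])
SELECT s.[EnterpriseID], s.[Name], GETDATE(), s.[LastModifiedUser]
FROM Staging.Site AS s
WHERE NOT EXISTS (SELECT 1 FROM [Master].[Site] AS m WHERE m.[Name] = s.[Name]);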
I'd recommend a combination of the 2 approaches indicated by Cade's answer.
Step 1. Load all the needed data into temp tables (on Sybase, for example, load data for table "db1..table1" into "temp..db1_table1"). In order to handle large datasets, use a bulk copy mechanism (whichever one your DB server supports) without writing to the transaction log.
Step 2. Run a script whose main step is to iterate over each table to be loaded, create indexes on the newly created temp table if needed, compare the data in the temp table to the main table, and insert/update/delete the differences. Then, as needed, the script can do auxiliary tasks like the security role setup you mentioned.
I have a big fat query that's written dynamically to integrate some data. Basically what it does is query some tables, join some others, treat some data, and then insert it into a final table.
The problem is that there's too much data, and we can't really trust the sources, because there could be erroneous or inconsistent data.
For example, I've spent almost an hour looking for an error while developing against a customer's database, because somewhere in the middle of my big fat query there was an error converting some varchar to datetime. It turned out they had some sales dated '2009-02-29', an out-of-range date.
And yes, I know. Why was that stored as varchar? Well, the source database has 3 columns for dates, 'Month', 'Day' and 'Year'. I have no idea why it's like that, but still, it is.
But how the hell would I treat that, if the source is not trustworthy?
I can't HANDLE exceptions; I really need the error to come up to another level with the original message, but I wanted to provide some more info, so that the user could at least try to solve it before calling us.
So I thought about displaying to the user the row number, or some ID that would at least give him some idea of which record he'd have to correct. That's also a hard job, because there will be times when the integration will run up to 80,000 records.
And in an 80,000-record integration, a single dummy error message - 'The conversion of a varchar data type to a datetime data type resulted in an out-of-range datetime value' - means nothing at all.
So any idea would be appreciated.
Oh I'm using SQL Server 2005 with Service Pack 3.
EDIT:
OK, so from what I've read in the answers, the best thing to do is to check each column that could be critical for raising errors, and if one fails the condition, raise an error myself with a message I find more descriptive, adding some info that could be stored in a separate table or in variables - for example the ID of the row, or some other identifying information.
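For example, something along these lines (the table and column names here are invented, and it assumes the three varchar date parts are zero-padded):

-- Look for a row whose year/month/day parts don't form a valid date.
DECLARE @BadId int;

SELECT TOP 1 @BadId = s.SaleID
FROM dbo.SourceSales AS s
WHERE ISDATE(s.[Year] + s.[Month] + s.[Day]) <> 1;

-- Raise a descriptive error that includes the offending row's ID.
IF @BadId IS NOT NULL
    RAISERROR('Sale row %d has an invalid date and was not imported.', 16, 1, @BadId);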
For dates you can use the ISDATE function:
SELECT ISDATE('20090229'), ISDATE('20090227')  -- returns 0 (invalid), 1 (valid)
I usually insert into a staging table, do my checks, and then insert into the real tables.
My suggestion would be to pre-validate the incoming data, and as you encounter errors, set aside the record. For example, check for invalid dates. Say you find 20 in a set of 80K. Pull those 20 out into a separate table, with the error message attached to the record. Run your other validation, then finally import the remaining (all valid) records into the desired target table(s).
This might have too much impact on performance, but would allow you to easily point out the errors and allow them to be corrected and then inserted in a second pass.
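A minimal sketch of that "set aside" step (the error table and source column names are placeholders):

-- Move rows with bad dates into an error table, keeping a message with each.
INSERT INTO dbo.ImportErrors (SourceRowId, ErrorMessage)
SELECT s.RowId, 'Invalid date: ' + s.SaleDate
FROM dbo.StagingSales AS s
WHERE ISDATE(s.SaleDate) <> 1;

-- Import only the rows that passed the check.
INSERT INTO dbo.Sales (RowId, SaleDate)
SELECT s.RowId, CONVERT(datetime, s.SaleDate)
FROM dbo.StagingSales AS s
WHERE ISDATE(s.SaleDate) = 1;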
This sounds like a standard ETL issue: Extract, Transform, and Load. (Unless you have to run this query over and over again against the same set of data, in which case you'd pretty much do the same thing, only over and over again. So how critical is performance?)
What kind of error handling and/or "reporting of bad data" are you allowed to provide? If you have everything as "one big fat query", your options become very limited -- either the query works or it doesn't, and if it doesn't I'm guessing you get at best one RAISERROR message to tell the caller what's what.
In a situation like this, the general framework I'd try to set up is:
Starting with the source table(s)
Produce an interim set of tables (SQLMenace's staging tables) that you know are consistent and properly formed (valid data, keys, etc.)
Write the "not quite so big and fat query" against those tables
Done this way, you should always be able to return (or store) a valid data set... even if it is empty. The trick will be in determining when the routine fails -- when is the data too corrupt to process and produce the desired results, so you return a properly worded error message instead?
Try something like this to find the bad rows:
...big fat query here...
WHERE ISDATE(YourBadVarcharColumn) != 1
Load the data into a staging table where most columns are varchar and allow NULLs, and which has a status column.
Run an UPDATE command like:
UPDATE Staging
SET Status = 'X'
WHERE ISDATE(YourCharYear + YourCharMonth + YourCharDay) != 1
   OR OtherColumn < 4 ...
Then just insert from your staging table where Status != 'X':
INSERT INTO RealTable
    (col1, col2, ...)
SELECT
    col1, col2, ...
FROM Staging
WHERE Status != 'X'