My current environment is three servers: one for the source database, one for the destination database, and one for the SSIS (Integration Services) packages. Now I need to size the configuration, like CPU and memory, for each server.
I believe that running SSIS packages will consume a lot of resources because of the large data volume. However, I do not know which server needs to be configured with more power for the SSIS packages. That is, which server's resources will be used most heavily when SSIS is running?
Also, I need to set up SQL Server Agent for daily ETL processing. Which DB server should I use for it, the source or the destination one?
I'm new to SSIS deployment, so thanks for any advice!
The data will be read from the source server and written to the destination server, so both need a nice fast I/O subsystem, ideally RAID 10. Also, provided your data is split across multiple disks on the source server, more cores will achieve more parallelism. This is less important on the destination, as inserts are normally single-threaded.
The server running SSIS needs lots of memory, as the data flow buffers will be on this server (provided you run SQL Server Agent here), and you need a fast network connection between all three.
SQL Server Agent should be on the ETL server; otherwise SSIS will consume resources on the box that Agent is on, and could therefore fight for threads with SQL Server while reading or writing.
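To make the Agent setup concrete, here is a minimal sketch of scheduling a file-based package from SQL Server Agent on the ETL server. All names, the schedule, and the package path are placeholders, and this assumes the legacy file-deployment model rather than the SSIS catalog:

```sql
-- Hypothetical job definition; names and the .dtsx path are placeholders.
USE msdb;
EXEC dbo.sp_add_job @job_name = N'Daily ETL';
EXEC dbo.sp_add_jobstep
    @job_name  = N'Daily ETL',
    @step_name = N'Run SSIS package',
    @subsystem = N'SSIS',
    @command   = N'/FILE "D:\ETL\DailyLoad.dtsx" /REPORTING E';
EXEC dbo.sp_add_schedule
    @schedule_name     = N'Daily 2am',
    @freq_type         = 4,       -- daily
    @freq_interval     = 1,
    @active_start_time = 020000;  -- 02:00:00
EXEC dbo.sp_attach_schedule @job_name = N'Daily ETL', @schedule_name = N'Daily 2am';
EXEC dbo.sp_add_jobserver @job_name = N'Daily ETL';
```

Because the job step runs under the SSIS subsystem, the package executes on whichever box hosts that Agent instance, which is exactly why Agent belongs on the ETL server.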
I am working on a project that processes big data (~3 TB) every day. The first stage of our data pipeline copies the data from our MS SQL Server to a host machine (a Linux server) using a tool called freebcp. More info on freebcp is here
Using this tool from the Linux server, we run a set of stored procedures on SQL Server, export the data, and transfer it in bulk. Recently I have observed that if the data is huge (~200 GB), the data transfer stalls after some time. I ran a couple of commands (sp_who2 and DBCC INPUTBUFFER(spid)) to monitor the execution of the stored procedure on SQL Server. We observe the CPU time and disk I/O used by this procedure. If these do not change for a few minutes, we assume the job is stalled and manually kill the stored procedures to continue our data processing tasks.
What are the probable reasons for this stalling of data copy?
Is there any better way to copy the data in bulk from SQL Server to a Linux host, maybe an alternative to freebcp? After this, we load the data into the Hadoop file system and run our MapReduce tasks.
If the SELECT query for the BCP source is not blocked, the likely cause of a stall is a problem on the client side consuming the results.
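One way to distinguish those two cases, instead of eyeballing sp_who2, is to check the wait state of the exporting session in the DMVs. A sketch, where the session id is a placeholder you would take from sp_who2:

```sql
-- @spid is the session running the BCP SELECT (placeholder value).
DECLARE @spid int = 53;

SELECT r.status,
       r.wait_type,
       r.wait_time,
       r.blocking_session_id,
       r.cpu_time,
       r.reads,
       r.writes
FROM sys.dm_exec_requests AS r
WHERE r.session_id = @spid;

-- Long ASYNC_NETWORK_IO waits suggest the client (freebcp) is not
-- consuming results fast enough; a non-zero blocking_session_id
-- instead points to blocking inside SQL Server.
```

If the request has no row in sys.dm_exec_requests at all, the server has finished its part and the stall is entirely on the client side.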
Have you considered the free SQL Server ODBC Driver for Linux? This includes a bcp command-line utility.
I am confused as to what problems ssis packages solve. I need to create an application to copy content from our local network to our live servers across a dedicated line, that may be unreliable. From our live server the content needs to be replicated across all other servers. The database also needs to be updated with all the files that arrived successfully so it may be available to the user.
I was told that SSIS can do this, but my question is: is it the right thing to use? SSIS is for data transformation, not for copying files from one network to the other. Can SSIS really do this?
My rule of thumb is: if no transformation, no aggregation, no data mapping and no disparate sources then no SSIS.
You may want to explore Transactional Replication:
http://technet.microsoft.com/en-us/library/ms151176.aspx
and if you are on SQL Server 2012 you can also take a look at Availability Groups: http://technet.microsoft.com/en-us/library/ff877884.aspx
I would use SSIS for this scenario. It has built in restart functionality ("checkpoints") which I would use to manage partial retries when your line fails. It is also easy to configure the Control Flow so tasks can run in parallel, e.g. Site 2 isn't left waiting for data if Site 1 is slow.
We use a central SQL Server (2008 Standard Edition) and several smaller, dedicated SQL Servers (Express Editions). We need to implement some mechanism for transferring data asynchronously from the dedicated, decentralized SQL Servers to the central one (the bigger volume, see below) and back from the central SQL Server (a few records, basically notifications for the machines and possibly some optimization hints).
The dedicated SQL Servers are physically located near the technology machines, and they collect, say, (datetime, temperature) rows at regular intervals (think a few seconds between samples). There are about 500 records per job, but the next job follows immediately (the machine does not know it is a new job -- it is quite simple in that sense -- and just keeps collecting temperatures on and on).
The technology machines must be able to work without the central SQL Server, and the central SQL Server must also work when a machine is not accessible (i.e. its dedicated SQL engine cannot be reached, switched off with the machine). In other words, the solution need not be super fast, but it must be robust in the sense that no collected data is lost.
The basic idea is to move the collected data from the dedicated SQL Server (preprocessed to a normalized format with the ID of the machine) to a well-known table on the central SQL Server. Only the newer data should be sent, to minimize the amount of data transferred. The transfer should be started by the dedicated SQL Server at regular intervals (say one hour) if the connection is OK. If the connection is not OK, the data will be sent after the next hour, and so on.
Another well-known table on the central SQL Server will be used to send notifications to the dedicated SQL Server engines. This way the dedicated engine can be told (for example) what data has already been processed/archived on the central SQL Server (i.e. a hint about which records may be deleted from the local database on the dedicated machine), or whatever other information is hinted from the central side (just hints, nothing with real-time requirements). The hints will be collected by the dedicated SQL Server (i.e. also the machine's responsibility). In other words, the central SQL Server only processes the well-known, local tables. It does not try to connect to the dedicated SQL Server machines.
The solution should use only the standard mechanisms -- SQL commands (via stored procedures), no external software. What kind of solution should I focus on?
Thanks,
Petr
[Edited later] The SQL servers are at the same Local Area Network.
If you are willing to make a mental switch and stop thinking in terms of tables and rows, and instead think in terms of data and messages, then Service Broker can handle all the communication, delivery, and message processing. Instead of locally (on the Express machines) doing INSERT INTO LocalTable(datetime, temperature) VALUES (...), you think in terms of:
BEGIN DIALOG CONVERSATION ... TO SERVICE 'CentralService' ...;
SEND ON CONVERSATION ... MESSAGE TYPE [Measurement] (<datetime ...><temperature ...>)
See Using Service Broker instead of Replication or High Volume Contiguous Real Time ETL
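To flesh out the idea, here is a minimal sketch of the Broker objects involved on each side. All object names are illustrative, and routing and dialog security between the instances are omitted:

```sql
-- On both servers (names are illustrative):
CREATE MESSAGE TYPE [Measurement] VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT [MeasurementContract] ([Measurement] SENT BY INITIATOR);

-- On the central server: the queue that receives the readings.
CREATE QUEUE MeasurementQueue;
CREATE SERVICE [CentralService] ON QUEUE MeasurementQueue ([MeasurementContract]);

-- On each Express machine (assumes a local [MachineService] was
-- created the same way), instead of a plain local INSERT:
DECLARE @h uniqueidentifier;
BEGIN DIALOG CONVERSATION @h
    FROM SERVICE [MachineService]
    TO SERVICE 'CentralService'
    ON CONTRACT [MeasurementContract];
SEND ON CONVERSATION @h
    MESSAGE TYPE [Measurement]
    (N'<measurement datetime="..." temperature="..." machine="..."/>');
```

Because Broker queues the SEND locally and delivers when the link is up, this gives you exactly the "robust when the central server is unreachable, no data lost" behavior you asked for, with no external software.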
Sounds like a job for merge replication.
We get daily files that need to be loaded into our database. The files are delivered to a separate server from the database server. Which of the two approaches is better for the ETL from a performance perspective?
Transfer the files from the delivery server to the database server, then do a bulk load.
Open a DB connection from the delivery server and load directly.
Edited to add: The servers are all on the same network.
It depends on whether the source servers are SQL Server or another technology, on the driver used (if it's Oracle, the Microsoft driver will hurt your performance badly; Oracle's own is better), on the amount of database overhead you want to impose (while one server is feeding the other, they are probably both I/O bound), and on the disk layout you have (i.e. reading from one RAID and writing to the other; compressing and transferring over a 1 Gb or 100 Mb link might be more efficient). Usually the dumps compress nicely, but as Beth noted, test it.
With dumps you can exploit parallelism (multiple disk shares, and multiple processors for compression -- use 7-Zip, period). Over Ethernet you probably won't get as much parallelism. The same considerations affect the target server.
All in all, as usual with performance, test, quantify, test, quantify, repeat:)
The universal response of 'it depends'. It depends particularly on what ETL technology you are using. If your ETL is tied to the database server for its processing power (SSIS, and BODI to a lesser degree), then you need to get your files onto the database server ASAP. If you have a more file-based ETL package (Ab Initio, Informatica), then you are free to do your transformation on the delivery server and then move your 'ready-to-load' data onto the database server for bulk loading.
In all cases, and especially if the files are very large, you can compress the data files before transporting them over the network.
What are my options for achieving a cold backup server for SQL Server Express instance running a single database?
I have an SQL Server 2008 Express instance in production that currently represents a single point of failure for my application. I have a second physical box sitting at the installation that is currently doing nothing. I want to somehow replicate my database in near real time (a little bit of data loss is acceptable) to the second box. The database is very small and resources are utilized very lightly.
In the case that the production server dies, I would manually reconfigure my application to point to the backup server instead.
Although Express doesn't support log shipping, I am thinking that I could manually script a poor man's version of it, where I use batch files to take the logs and copy them across the network and apply them to the second server at 5 minute intervals.
Does anyone have any advice on whether this is technically achievable, or if there is a better way to do what I am trying to do?
Note that I want to avoid paying for the full version of SQL Server and configuring mirroring, as I think it is overkill for this application. I understand that other DB platforms may offer suitable options (e.g. MySQL Cluster), but for the purposes of this discussion, let's assume we have to stick with SQL Server.
I would also advocate script-based log shipping. After all, this is how log shipping started. All you need is a time-based scheduler to run the scripts (i.e. Task Scheduler) and a smart(er) file copy (robocopy).
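The two halves of such a script boil down to a log backup on the primary and a restore on the standby. A minimal sketch, with placeholder paths and database name; a real script should generate a unique file name per backup so no log is overwritten before it is applied:

```sql
-- On the production server, run every 5 minutes
-- (use a unique file name per backup in practice):
BACKUP LOG MyAppDb
    TO DISK = N'\\standby\logship\MyAppDb_20240101_1200.trn';

-- On the standby server, after the file has been copied across.
-- NORECOVERY keeps the database restoring so further logs can be applied:
RESTORE LOG MyAppDb
    FROM DISK = N'D:\logship\MyAppDb_20240101_1200.trn'
    WITH NORECOVERY;

-- On failover only, bring the standby online:
RESTORE DATABASE MyAppDb WITH RECOVERY;
```

The standby must first be seeded from a full backup restored WITH NORECOVERY; after that, the scheduled log restores keep it within your 5-minute window.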