How do I Queue Distributed Computing Jobs at home using SQL?

I'm using an application written in LabVIEW (an engineering software framework/programming language) to run about twenty thousand simulations. Each simulation takes about 5 minutes to complete, and the results are dumped into a database hosted on a laptop on my local network. I'm using SQL Express as my database.
Each simulation job has a set of starting parameters that will be passed to the application. This could be as simple as a string of characters that the application would parse into valid simulation characteristics, but I'm not sure exactly how to structure this.
Because the simulations would take about 3 months to run on one computer, I want to add in the capability for the database computer to be able to "schedule" jobs. That way, I can run the application on any computer in my local network (I have 5 available) for a few simulations, and stop simulations when I need to use it for other things. The database computer will hand out these jobs as they get requested by the application, as well as continuously run jobs itself.
How would I go about setting up this queue from an SQL point of view? The framework I currently have in mind would work something like this: the database has 3 tables in addition to the tables used to store simulation data: CompletedJobs, RunningJobs, and JobsToRun. The application would request a job from JobsToRun and place that job's ID into the RunningJobs table. It would then parse the job's ID for the relevant information, run the simulation, and if it exits without errors, move the job ID to the CompletedJobs table.
Would this work?

I don't see the need for three tables - why not have one table, Jobs, with a JobStatus field that can take values (e.g.) ToRun, Running, Completed, and perhaps Failed - you can probably think of others. When a simulation starts a new job, it changes the status to Running; when it completes the job, it changes the status again to Completed or Failed.
You might want fields for StartTime and EndTime, perhaps ErrorCode if your simulation might fail with different types of error? What does the output of the simulation consist of - should you store a filename of the output file, or even upload the output data itself as a BLOB? Let the database take care of assigning each job a unique ID, which would be the primary key for the database table.
What sort of data actually are the starting parameters? If you can store them in database fields, do that. You could put those in a second table if you wanted, and have your Jobs table refer to the parameter set's ID in the job parameters table.
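A minimal sketch of that single-table layout, plus the query each worker machine could use to claim a job, might look like the following (the table, column and variable names here are only suggestions, not anything your application already defines):

CREATE TABLE dbo.Jobs (
    JobId      INT IDENTITY(1,1) PRIMARY KEY,
    Parameters NVARCHAR(MAX) NOT NULL,               -- serialized starting parameters
    JobStatus  VARCHAR(10) NOT NULL DEFAULT 'ToRun', -- ToRun / Running / Completed / Failed
    StartTime  DATETIME NULL,
    EndTime    DATETIME NULL,
    ErrorCode  INT NULL
);

-- Each worker claims one pending job atomically, so two machines never grab the same row.
UPDATE TOP (1) j
SET    JobStatus = 'Running',
       StartTime = GETUTCDATE()
OUTPUT inserted.JobId, inserted.Parameters
FROM   dbo.Jobs AS j WITH (ROWLOCK, UPDLOCK, READPAST)
WHERE  JobStatus = 'ToRun';

-- When the simulation finishes, the worker reports back with the JobId it claimed.
DECLARE @ClaimedJobId INT = 42;   -- illustrative value; the application supplies the real one
UPDATE dbo.Jobs
SET    JobStatus = 'Completed',   -- or 'Failed', together with an ErrorCode
       EndTime   = GETUTCDATE()
WHERE  JobId = @ClaimedJobId;

The READPAST hint lets a second worker skip rows another worker has just locked, which is what makes it safe for all five machines to poll the same table.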

Related

Parallel operations in SQL Server running "sequentially"

There are n databases in our Windows application, all with the same schema.
We have a requirement to export the databases as mirrors, but only a subset of the data should be copied.
For this we follow these steps:
Create a new empty DB.
Run a script which inserts selected data into the DB based on a key.
After the insert, we detach the DB.
Steps 1-3 are run for each database which needs to be shipped.
We run the inserts in batches. Our plan is to run the process in parallel for each database which needs to be exported.
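For reference, a rough sketch of what one iteration of that script does (the database, table and key names below are made up):

DECLARE @Key INT = 42;   -- the key that selects the subset to ship (illustrative)

CREATE DATABASE ExportDB;
-- (schema objects would be created here from a schema-only script, so the tables exist)

ALTER DATABASE ExportDB SET RECOVERY BULK_LOGGED;

-- One INSERT per table, batched in the real script; TABLOCK helps minimal logging under bulk-logged recovery.
INSERT INTO ExportDB.dbo.Orders WITH (TABLOCK)
SELECT o.*
FROM   SourceDB.dbo.Orders AS o WITH (NOLOCK)
WHERE  o.ShipKey = @Key;

ALTER DATABASE ExportDB SET RECOVERY SIMPLE;

EXEC sp_detach_db @dbname = N'ExportDB';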
When we run it for large data, it takes about 6 minutes for one database.
If I run the same script for 5 different databases in 5 different sessions simultaneously, the last script finishes after 30-32 minutes, which means that even though they run in parallel, the total time taken is the same as running them sequentially.
We disable all indexes and rebuild them at the end.
The databases are supposed to be in simple recovery mode; we switch them to bulk-logged and then back to simple.
We tried MAXDOP 0 and MAXDOP 1 -- no change.
The NOLOCK hint is used in all SELECT queries.
I want to understand what I should do to get the best performance, because these are 5 different databases being copied to 5 new databases, where all operations are supposed to be independent.

T-SQL: advice on copying data across to another database

I need advice on copying daily data to another server.
Just to give you a picture of the situation, I will explain a little. There are workstations posting transactions to 2 database servers (DB1 and DB2). These DB servers are hosted on 2 separate physical servers and are linked. Daily transactions are around 50,000 for now but will increase soon. There may be days when some workstations are down (operational but unable to post data), so their transactions get posted a few days later.
So, what I do is run a query across those 2 linked servers. The daily query output contains ~50,000 records and takes at least 15 minutes to fetch, as the linked servers have performance problems. I will create a stored procedure and schedule it to run at 2 AM.
My concern starts from here: the output will be copied across to another data warehouse (DW). This is our client's territory, which I do not know much about. This DW will be linked to these DB servers to make it possible to send the data (produced by my stored procedure) across.
Now, what would you do to copy the data across:
Create a dummy table on DB1 to hold the stored procedure output on the same server, so that it is available and we do not need to rerun the stored procedure. The client then retrieves it later (see the sketch after this list).
Use a "select into" statement to copy the content to a remote DW table. I do not know what happens with this one while fetching and sending the data across to the DW. Remember it takes ~15 minutes for my stored procedure to fetch the data.
Post the data (retrieved by the stored procedure) as an XML file over FTP.
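To make option 1 concrete, it could look something like this (the staging table, columns and procedure name are placeholders for whatever the stored procedure actually returns):

-- Staging table on DB1 that holds last night's output.
CREATE TABLE dbo.DailyExtract (
    TxnId   INT      NOT NULL,
    TxnDate DATETIME NOT NULL,
    Amount  MONEY    NOT NULL
);

-- Scheduled at 2 AM: run the slow linked-server query once and keep the result locally.
TRUNCATE TABLE dbo.DailyExtract;

INSERT INTO dbo.DailyExtract (TxnId, TxnDate, Amount)
EXEC dbo.usp_DailyTransactions;

-- The DW side can then pull from dbo.DailyExtract at its convenience,
-- without the 15-minute fetch being repeated.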
Please tell me if there is a way of setting an alert or notification on jobs.
I just want to take precautions so it will be easier to track when something goes wrong.
Any advice is appreciated very much. Thank you. Oz.
When it comes to copying data in SQL Server you need to look at High Availability solutions; depending on the version and edition of your SQL Server you will have different options.
http://msdn.microsoft.com/en-us/library/ms190202(v=sql.105).aspx
If you just need to move data for specific tables, you have options like an SSIS job or SQL Server Replication.
If you are looking to have all tables in a given database copied to another server, you should use Log Shipping, which allows you to copy the entire content of the source database to another location. Because this is done at smaller intervals, your load will be distributed over a larger period of time instead of one large transaction running at once.
Another great alternative is SQL Server Replication. This option will capture transactions on the source and push them to the target. This model requires a publisher (the source), a distributor (can be the source or another DB) and a subscriber (the target).
You can also create an SSIS job that runs on a frequent basis and just moves a specified amount of data.
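On the question about alerts and notifications: if the copy ends up running as a SQL Server Agent job, a common approach is a failure notification to an operator. A minimal sketch, assuming Database Mail is already configured (the operator, address and job names are placeholders):

USE msdb;

EXEC dbo.sp_add_operator
     @name          = N'DW Copy Operator',
     @email_address = N'dba-team@example.com';

EXEC dbo.sp_update_job
     @job_name                   = N'Copy daily data to DW',
     @notify_level_email         = 2,   -- 2 = notify on failure
     @notify_email_operator_name = N'DW Copy Operator';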

Remove duplicated records periodically in Sql Server/Compact Edition

I need to remove duplicated records, as a maintenance task, in my SQL Server instance or in my local Compact Edition testing database. I have a tool that reads a clock device which records workers' check-in/out times for the workday. I export the readings to Xml files as a backup and insert the parsed objects into the database.
So there are too many records to insert daily, and I would like to do it in an optimal manner without having to check the values already in the database every time I need to insert.
What do you recommend?
I'm using Entity Framework 6
Do I deal with EF and LINQ for managing duplicates, together with SqlBulkCopy?
Do I create temporary tables in SQL Server?
Do I create a SQL stored procedure that does this?
Do I use SSIS (I'm a newbie at that) for importing the Xml files?
I have two tables:
-Clock (Id, Name, Location)
-Stamp (Id, ClockId, WorkerId, StartDate, EndDate, State)
State: evaluation of worker attendance according to Start/End (in a normal workday: 8.00am-5.00pm).
-BadStart
-BadEnd
-Critical (Start/End out of the admissible range)
-Pending (those that have not yet been processed and normalized)
How I process the data:
There are 2 clock units (each creates its own stamps, but workers can check in/out at either of them).
-Read clock data from the device (another application does that; the physical machine has a scheduled task that runs a script that reads the clock unit. Output: Xml files)
-Parse the Xml files (compatibility issue: the Human Resources department has another application that reads them in that specific format)
-Insert/update records in database according to some normalizing rules
As you can see, the table can't have unique fields, because the same worker can check in/out several times (by mistake, as confirmation, or at the other clock), and all these stamps have to be unified/normalized for the day in progress.
The duplicates are created each time I run the parser, which reads all the Xml files in the directory and inserts them into the database.
I don't have permissions to modify the physical machine directory hierarchy.
So I'm looking for a better strategy to classify, store and remove redundant records.
The task should be performed daily, and several Xml files are created by each clock unit in a specific directory. The clock is connected to a physical machine via a serial cable.
Depending on your preference and data model, there are several ways to skin this cat.
See the following links for examples. Most of them use a CTE (Common Table Expression). You should easily be able to adapt one to your needs, and then schedule the script to run periodically as a SQL Server Job.
1) Different strategies for removing duplicate records in SQL Server.
2) Using CTE to remove duplicate records
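As an illustration of the CTE approach against the Stamp table described above (this runs on the full SQL Server instance rather than Compact Edition, and the definition of a "duplicate" here, same clock, worker and start date, is an assumption):

;WITH Ranked AS (
    SELECT Id,
           ROW_NUMBER() OVER (
               PARTITION BY ClockId, WorkerId, StartDate
               ORDER BY Id
           ) AS rn
    FROM dbo.Stamp
)
DELETE FROM Ranked
WHERE rn > 1;

Scheduling that as a SQL Server Agent job covers the "periodically" part.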

Warehouse PostgreSQL database architecture recommendation

Background:
I am developing an application that allows users to generate lots of different reports. The data is stored in PostgreSQL and has a natural unique group key, so the data with one group key is totally independent from the data with other group keys. Reports are built using only one group key at a time, so all of the queries use a "WHERE groupKey = X;" clause. The data in PostgreSQL is updated intensively by parallel processes which add data to different groups, but I don't need real-time reports. One update per 30 minutes is fine.
Problem:
There are about 4 GB of data already, and I found that some reports take significant time to generate (up to 15 seconds) because they need to query not a single table but 3-4 of them.
What I want to do is to reduce the time it takes to create a report without significantly changing the technologies or schemes of the solution.
Possible solutions
What I was thinking about this is:
Splitting the one database into several databases, one database per group key. Then I would get rid of WHERE groupKey = X (though I have an index on that column in each table), and the number of rows to process each time would be significantly smaller.
Creating a slave database for reads only. Then I would have to sync the data with PostgreSQL's replication mechanism, for example once per 15 minutes (can I actually do that, or do I have to write custom code?).
I don't want to change the database to NoSQL, because I would have to rewrite all the SQL queries and I don't want to. I might switch to another SQL database with column-store support if it is free and runs on Windows (sorry, I don't have a Linux server but might get one if I have to).
Your ideas
What would you recommend as the first simple steps?
Two thoughts immediately come to mind for reporting:
1). Set up some summary (aka "aggregate") tables that are precomputed results of the queries your users are likely to run. E.g. a table containing the counts and sums grouped by the various dimensions. This can be an automated process -- a db function (or script) gets run via your job scheduler of choice -- that refreshes the data every N minutes.
2). Regarding replication, if you are using Streaming Replication (PostgreSQL 9+), the changes in the master db are replicated to the slave databases (hot standby = read only) for reporting.
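A minimal sketch of suggestion 1), with made-up table and column names; on PostgreSQL 9.3+ a materialized view can stand in for a hand-rolled summary table:

CREATE MATERIALIZED VIEW report_summary AS
SELECT group_key,
       date_trunc('day', created_at) AS report_day,
       count(*)    AS row_count,
       sum(amount) AS total_amount
FROM   transactions
GROUP  BY group_key, date_trunc('day', created_at);

CREATE INDEX ON report_summary (group_key, report_day);

-- Refreshed every 30 minutes by whatever job scheduler you already use:
REFRESH MATERIALIZED VIEW report_summary;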
Tune the report queries. Use EXPLAIN. Avoid procedures when you could do it in pure SQL.
Tune the server: memory, disk, processor. Take a look at the server config.
Upgrade the Postgres version.
Run VACUUM.
Of these four, only the first will require significant changes in the application.
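For the first point, checking the plan of a slow report query looks like this (reusing the made-up names from the sketch above):

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*), sum(amount)
FROM   transactions
WHERE  group_key = 42;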

How do I ensure SQL Server replication is running?

I have two SQL Server 2005 instances that are geographically separated. Important databases are replicated from the primary location to the secondary using transactional replication.
I'm looking for a way that I can monitor this replication and be alerted immediately if it fails.
We've had occasions in the past where the network connection between the two instances has gone down for a period of time. Because replication couldn't occur and we didn't know, the transaction log blew out and filled the disk causing an outage on the primary database as well.
My Google searching some time ago led us to monitoring the MSrepl_errors table and alerting when there were any entries, but this simply doesn't work. The last time replication failed (last night, hence the question), errors only hit that table when it was restarted.
Does anyone else monitor replication and how do you do it?
Just a little bit of extra information:
It seems that last night the problem was that the Log Reader Agent died and didn't start up again. I believe this agent is responsible for reading the transaction log and putting records in the distribution database so they can be replicated on the secondary site.
As this agent runs inside SQL Server, we can't simply make sure a process is running in Windows.
We have emails sent to us for Merge Replication failures. I have not used Transactional Replication but I imagine you can set up similar alerts.
The easiest way is to set it up through Replication Monitor.
Go to Replication Monitor and select a particular publication. Then select the Warnings and Agents tab and then configure the particular alert you want to use. In our case it is Replication: Agent Failure.
For this alert, we have the Response set up to Execute a Job that sends an email. The job can also do some work to include details of what failed, etc.
This works well enough for alerting us to the problem so that we can fix it right away.
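A stripped-down sketch of what such an email-sending job step could contain (the mail profile and recipient are placeholders, and Database Mail must already be configured):

EXEC msdb.dbo.sp_send_dbmail
     @profile_name = N'DBA Mail Profile',
     @recipients   = N'dba-team@example.com',
     @subject      = N'Replication: Agent Failure',
     @body         = N'A replication agent has failed. Check Replication Monitor on the distributor.';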
You could run a regular check that data changes are taking place, though this could be complex depending on your application.
If you have some form of audit trail table that is very regularly updated (e.g. our main product has a base audit table that lists all actions that result in data being updated or deleted), then you could query that table on both servers and make sure the results you get back are the same. Something like:
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*))
FROM audit_base
WHERE action_timestamp BETWEEN <time1> AND <time2>
where <time1> and <time2> are rounded values to allow for different delays in contacting the databases. For instance, if you are checking at ten past the hour you might check items from the start of the last hour to the start of this hour. You now have two small values that you can transmit somewhere and compare. If they are different then something has most likely gone wrong in the replication process - have whatever process does the check/comparison send you a mail and an SMS so you know to check and fix any problem that needs attention.
By using CHECKSUM_AGG the amount of data returned for each table is very small, so the bandwidth used by the checks will be insignificant. You just need to make sure your checks are not too expensive in the load they apply to the servers, and that you don't check data that might be part of open replication transactions and so might be expected to be different at that moment (hence checking the audit trail a few minutes back in time instead of right now in my example), otherwise you'll get too many false alarms.
Depending on your database structure the above might be impractical. For tables that are not insert-only within the timeframe of your check (unlike an audit trail as above, which only sees inserts), working out what can safely be compared while avoiding false alarms is likely to be both complex and expensive, if not actually impossible to do reliably.
You could manufacture a rolling insert-only table if you do not already have one, by having a small table (containing just an indexed timestamp column) to which you add one row regularly - this data serves no purpose other than to exist, so you can check that updates to the table are being replicated. You can delete data older than your checking window, so the table shouldn't grow large. Testing only this one table does not prove that the other tables are replicating, but finding an error in it would be a good "canary" check (if this table isn't updating in the replica, then the others probably aren't either).
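A sketch of that rolling canary table (the names are invented, and the types are kept SQL Server 2005-compatible):

-- On the publisher, inside the replicated database:
CREATE TABLE dbo.ReplicationHeartbeat (
    beat_time DATETIME NOT NULL DEFAULT GETUTCDATE()
);
CREATE INDEX IX_ReplicationHeartbeat ON dbo.ReplicationHeartbeat (beat_time);

-- Scheduled every few minutes:
INSERT INTO dbo.ReplicationHeartbeat DEFAULT VALUES;

-- Purge anything older than the checking window so the table stays tiny:
DELETE FROM dbo.ReplicationHeartbeat
WHERE beat_time < DATEADD(hour, -24, GETUTCDATE());

-- On the subscriber: if MAX(beat_time) is stale, raise the alarm.
SELECT MAX(beat_time) FROM dbo.ReplicationHeartbeat;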
This sort of check has the advantage of being independent of the replication process - you are not waiting for the replication process to record exceptions in logs, you are instead proactively testing some of the actual data.
