How to import a large CSV file into SQL Server

I have a system that requires a large amount of names and email addresses (two fields only) to be imported via CSV upload.
I can deal with the upload easily enough, but how would I verify the email addresses before I process the import?
Also, how could I process this quickly, or as a background process, without requiring the user to watch a script churning away?
Using Classic ASP / SQL Server 2008.
Please, no jibes at the Classic ASP.

Do you need to do this upload via the ASP application? If not, whatever scripting language you feel most comfortable with and can get this done in with the least coding time is the best tool for the job. If you need users to be able to upload into the Classic ASP app and have a reliable process that inserts the valid records into the database and rejects the invalid ones, your options change.
Do you need to provide feedback to the users? Like telling them exactly which rows were invalid?
If that second scenario is what you're dealing with, I would have the ASP app simply store the file, and have another process (a .NET service, a scheduled task, or similar) do the importing and report on its progress in a text file which the ASP app can check. That brings you back to doing it in whatever scripting language you are comfortable with, and you don't have to deal with the HTTP request timing out.
If you google "regex valid email" you can find a variety of regular expressions out there for identifying invalid email addresses.
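If some of that checking ends up happening after the rows have already landed in a staging table, note that T-SQL has no real regular-expression support, but a LIKE-based sanity filter will catch the grossest problems. A rough sketch, assuming a hypothetical two-column staging table called dbo.EmailImport_Staging (the name is invented, not something from the question):

    -- Deliberately loose format filter; not a substitute for a proper regex in the
    -- scripting/.NET layer, and it cannot tell you whether the address actually exists.
    SELECT Name, Email
    FROM   dbo.EmailImport_Staging
    WHERE  Email NOT LIKE '%_@_%.__%'            -- must look like something@domain.tld
        OR Email LIKE '%[^a-zA-Z0-9.@_+-]%'      -- characters outside a simple whitelist
        OR Email LIKE '%@%@%'                    -- more than one @
        OR Email LIKE '%..%';                    -- consecutive dots

These rows can then be reported back to the user or moved to a rejects table; the real regex pass still belongs in whichever scripting language does the import.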

In a former life, I used to do this sort of thing by dragging the file into a working table using DTS and then working that over using batches of SQL commands. Today, you'd use Integration Services.
This allows you to get the data into SQL Server very quickly and prevent the script timing out, then you can use whatever method you prefer (e.g. AJAX-driven batches, redirection-driven batches, etc.) to work over discrete chunks of the data, or schedule it to run as a single batch (a SQL Server job) and just report on the results.
You might be lucky enough to get your 500K rows processed in a single batch by your upload script, but I wouldn't chance it.
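For what it's worth, the working-table approach can also be sketched in plain T-SQL on SQL Server 2008 without SSIS. The staging table, file path and destination table below are all placeholders, and BULK INSERT assumes the CSV is readable from the SQL Server machine and has a header row:

    -- Hypothetical two-column working table matching the CSV layout.
    CREATE TABLE dbo.EmailImport_Staging (
        Name  NVARCHAR(200) NULL,
        Email NVARCHAR(320) NULL
    );

    -- Fast load of the raw file; adjust terminators to the actual file format.
    BULK INSERT dbo.EmailImport_Staging
    FROM 'C:\Uploads\names_and_emails.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2, TABLOCK);

    -- Work the staging data over in chunks so no single statement runs too long.
    -- dbo.Subscribers stands in for the real destination table; note that an
    -- OUTPUT ... INTO target must not have triggers or foreign keys.
    DECLARE @batch INT = 10000;
    WHILE 1 = 1
    BEGIN
        DELETE TOP (@batch) s
        OUTPUT deleted.Name, deleted.Email INTO dbo.Subscribers (Name, Email)
        FROM dbo.EmailImport_Staging AS s
        WHERE s.Email LIKE '%_@_%.__%';   -- move only rows that pass a basic format check

        IF @@ROWCOUNT = 0 BREAK;
    END;

Anything that fails the format check simply stays behind in the staging table, which gives you a natural rejects list to show the user afterwards.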

Related

Transfer data between NoSQL and SQL databases on different servers

Currently, I'm working on a MERN web application that'll need to communicate with a Microsoft SQL Server database on a different server but on the same network.
Data will only be "transferred" from the Mongo database to the MSSQL one based on a user action. I think I can accomplish this by simply transforming the data to transfer into the appropriate format on my Express server and connecting to the MSSQL via the matching API.
On the flip side, data will be transferred from the MSSQL database to the Mongo one when a certain field is updated in a record. I think I can accomplish this with a Trigger, but I'm not exactly sure how.
Do either of these solutions sound reasonable, or are there better/industry-standard methods that I should be employing? Any and all help is much appreciated!
There are (in general) two ways of doing this.
If the data transfer needs to happen immediately, you may be able to use triggers to accomplish this, although be aware of your error handling.
The other option is to develop some form of worker process in your favourite scripting language and run it on a schedule. (This would be my preferred option, as my personal familiarity with triggers is fairly limited.) If option 1 isn't viable, you could set your schedule to be very frequent, say once per minute or every x seconds, as long as a new task doesn't spawn before the previous one has completed.
The broader question, though, is whether you need to have the data duplicated across two different sources at all. The obvious pitfall with this approach is consistency: should anything fail, you can end up with two data sources wildly out of sync with each other, and your approach will have to account for this.
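If the trigger route is chosen, one common pattern (sketched below with invented table and column names, not anything from the question) is to keep the trigger itself tiny: it only records which row changed in an "outbox" table, and the Express server or a worker process polls that table and pushes the changes to Mongo. That keeps slow network calls out of the SQL transaction and gives you somewhere to retry from when a push fails.

    -- Hypothetical outbox table that the worker process polls.
    CREATE TABLE dbo.MongoSyncQueue (
        QueueId        INT IDENTITY(1,1) PRIMARY KEY,
        OrderId        INT      NOT NULL,                 -- key of the changed row (placeholder)
        ChangedAtUtc   DATETIME NOT NULL DEFAULT GETUTCDATE(),
        ProcessedAtUtc DATETIME NULL
    );
    GO

    -- dbo.Orders and its Status column are placeholders for the table/field being watched.
    CREATE TRIGGER trg_Orders_QueueMongoSync
    ON dbo.Orders
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        IF UPDATE(Status)
        BEGIN
            INSERT INTO dbo.MongoSyncQueue (OrderId)
            SELECT i.OrderId
            FROM inserted AS i
            JOIN deleted  AS d ON d.OrderId = i.OrderId
            WHERE ISNULL(i.Status, '') <> ISNULL(d.Status, '');   -- queue only real changes
        END
    END;

The worker marks ProcessedAtUtc once Mongo has accepted the document, so anything left unprocessed is easy to spot when the two stores start to drift apart.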

Best way to communicate between two programs (VB.NET)

Afternoon,
I'm looking for some advice on a project I am working on.
Presently I have 2 console applications (that run 2 separate APIs: one for the phone system ACD and one for the call recorder).
These churn through traffic and store information in a database ready for a client application (Windows Form) to obtain and use.
At present I am storing the information in a SQL Table and using calls to and from the SQL database in both the Console Applications and Windows Form application.
Is this best practice? In my experience the SQL Server can slow down under load, and I don't like the idea of relying on DB requests to send the information back.
Is there a better way I could communicate between my console applications and my Windows Forms app? I was thinking that sending TCP messages from one to the other on a port would work, and then having the Windows Forms app use the 'stream' to perform its actions rather than getting the information from the database.
The reason being that ideally I'd like to have the 'database' stored in memory for quicker retrieval, rather than using the SQL DB.
Any and all advice is fully appreciated. Please give me an explanation as to why your suggestion is efficient, as I would like to learn from this rather than just 'do'.
Many Thanks!
James

Is it better to log to file or database?

We're still using old Classic ASP and want to log whenever a user does something in our application. We'll write a generic subroutine to take in the details we want to log.
Should we log this to, say, a txt file using FileSystemObject, or log it to an MS SQL database?
If we log to the database, should we add a new table to the one existing database, or should we use a separate database?
Edit
In hindsight, a better answer is to log to BOTH the file system (first, immediately) and then to a centralized database (even if delayed). Most modern logging frameworks follow a publish-subscribe model (often described in terms of logging sources and sinks), which allows multiple logging sinks (targets) to be defined.
The rationale behind writing to the file system is that if an external infrastructure dependency such as the network, the database, or a security issue prevents you from writing remotely, you at least have a fallback: you can recover the data from the server's hard disk (something akin to a black box in the airline industry). Log data written to the file system can be deleted as soon as it is confirmed that the central database has recorded it, so file system retention sizes or rotation times generally need not be large.
Enterprise log managers like Splunk can be configured to scrape your local server log files (e.g. as written by log4net, the EntLib Logging Application Block, et al) and then centralize them in a searchable database, where data logged can be mined, graphed, shown on dashboards, etc.
But from an operational perspective, where it is likely that you will have a farm or cluster of servers, and assuming that both the local file system and remote database logging mechanisms are working, the 99% use case for actually trying to find anything in a log file will still be via the central database (ideally with a decent front end system to allow you to query, aggregate, graph and build triggers or notifications from log data).
Original Answer
If you have the database in place, I would recommend using this for audit records instead of the filesystem.
Rationale:
typed and normalized classification of data (severity, action type, user, date ...)
it is easier to find audit data (SELECT ... FROM Audits WHERE ...) vs. grep
it is easier to clean up (e.g. DELETE FROM Audits WHERE Date < ...)
it is easier to back up
The decision to use existing db or new one depends - if you have multiple applications (with their own databases) and want to log / audit all actions in all apps centrally, then a centralized db might make sense.
Since you say you want to audit user activity, it may make sense to audit in the same db as your users table / definition (if applicable).
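To make the "typed and normalized" point concrete, a minimal audit table might look something like the sketch below; the table and column names are illustrative, not a prescription.

    -- Illustrative audit table; adjust the columns to the actions you actually record.
    CREATE TABLE dbo.Audits (
        AuditId    BIGINT IDENTITY(1,1) PRIMARY KEY,
        OccurredAt DATETIME      NOT NULL DEFAULT GETDATE(),
        UserName   NVARCHAR(100) NOT NULL,
        Severity   TINYINT       NOT NULL,     -- e.g. 1 = info, 2 = warning, 3 = error
        ActionType VARCHAR(50)   NOT NULL,     -- e.g. 'Login', 'UpdateProfile'
        Details    NVARCHAR(MAX) NULL
    );

    -- The queries mentioned above then stay trivial:
    SELECT * FROM dbo.Audits WHERE UserName = 'jsmith' AND Severity >= 2;
    DELETE FROM dbo.Audits WHERE OccurredAt < DATEADD(MONTH, -6, GETDATE());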
I agree with the above, with the perhaps obvious exception of logging database failures, which would make logging to the database problematic. This has come up for me in the past when I was dealing with infrequent but regular network failovers.
Either works. It's up to your preference.
We have one central database where ALL of our apps log their error messages. Every app we write is set up in a table with a unique ID, and the error log table contains a foreign key reference to the AppId.
This has been a HUGE bonus for us in giving us one place to monitor errors. We had done it as a file-system log or by sending emails to a monitored inbox in the past, but we were able to create a fairly nice web app for interacting with the error logs. We have different error levels, and we have an "acknowledged" flag field, so we have a page where we can view unacknowledged events by severity, etc.
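A rough sketch of that central error-log design, with made-up names (the poster's actual schema isn't shown), would be:

    -- One row per application, one row per logged error.
    CREATE TABLE dbo.Apps (
        AppId   INT IDENTITY(1,1) PRIMARY KEY,
        AppName NVARCHAR(100) NOT NULL UNIQUE
    );

    CREATE TABLE dbo.ErrorLog (
        ErrorId      BIGINT IDENTITY(1,1) PRIMARY KEY,
        AppId        INT           NOT NULL REFERENCES dbo.Apps (AppId),
        LoggedAt     DATETIME      NOT NULL DEFAULT GETDATE(),
        ErrorLevel   TINYINT       NOT NULL,
        Message      NVARCHAR(MAX) NOT NULL,
        Acknowledged BIT           NOT NULL DEFAULT 0
    );

    -- The "unacknowledged events by severity" page is then a simple query:
    SELECT a.AppName, e.ErrorLevel, e.LoggedAt, e.Message
    FROM   dbo.ErrorLog AS e
    JOIN   dbo.Apps     AS a ON a.AppId = e.AppId
    WHERE  e.Acknowledged = 0
    ORDER BY e.ErrorLevel DESC, e.LoggedAt DESC;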
Looking at the responses, I think the answer may actually be both.
If it's a user error that's likely to happen during expected usage (e.g. user enters an invalid email etc.), that should go into a database to take advantage of easy queries.
If it's a code error that shouldn't happen (e.g. can't get the username of a logged-in user), that should be reserved for a txt file.
This also nicely splits the errors between non-critical and critical. Hopefully the critical error list stays small!
I'm creating a quick prototype of a new project right now, so I'll stick with txt for now.
On another note, email is great for this. Arguably you could just email the errors to a "bug" account and not store them locally. However, this shares the database approach's risk of bad connections.
Should we log this to, say, a txt file using FileSystemObject, or log it to an MS SQL database?
Another idea is to write the log file in XML and then query it using XPath. I like to think that this is the best of both worlds.

Advice on how to write robust data transfer processes?

I have a daily process that relies on flat files delivered to a "drop box" directory on the file system. This kicks off a load of the comma-delimited data (from an external company's Excel, etc.) into a database via a piecemeal Perl/Bash application. This database is used by multiple applications and is also edited directly with some GUI tools. Some of the data then gets replicated by an additional Perl app into the database that I mainly use.
Needless to say, all of that is complicated and error prone: data coming in is sometimes corrupt, or sometimes an edit breaks it. My users often complain about missing or incorrect data. Diffing the flat files and DBs to analyze where the process breaks is time consuming, and with each passing day the data becomes more out of date and difficult to analyze.
I plan to fix or rewrite parts or all of this data transfer process.
I am looking for recommended reading before I embark on this; websites and articles on how to write robust, failure-resistant and auto-recoverable ETL processes, or other advice, would be appreciated.
This is precisely what Message Queue Managers are designed for. Some examples are here.
You don't say what database backend you have, but in SQL Server I would write this as an SSIS package. We have a system designed to also write data to a metadata database that tells us when the file was picked up, whether it processed successfully, and why if it did not. It also tells us things like how many rows the file had (which we can then use to determine whether the current row count is abnormal). One of the beauties of SSIS is that I can set up configurations on package connections and variables, so that moving the package from development to prod is easy (I don't have to go in and manually change the connections each time once I have a configuration set up in the config table).
In SSIS we do various checks to ensure the data is correct, or clean up the data before inserting it into our database. Actually, we do lots and lots of checks. Questionable records can be removed from the file processing and put in a separate location for the DBAs to examine and possibly pass back to the customer. We can also check whether the data in various columns (and the column names, if given; not all files have them) is what would be expected. So if the zip code field suddenly has 250 characters, we know something is wrong and can reject the file before processing. That way, when the client swaps the last-name column with the first-name column without telling us, we can reject the file before importing 100,000 new incorrect records. In SSIS we can also use fuzzy logic to find existing records to match, so if the record for John Smith says his address is at 213 State St., it can match our record that says he lives at 215 State Street.
It takes a lot to set up a process this way, but once you do, the extra confidence that you are processing good data is worth its weight in gold.
Even if you can't use SSIS, this should at least give you some ideas of the types of things you should be doing to get the information into your database.
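Even without SSIS, the metadata-database idea above is easy to approximate. A minimal sketch, with invented table and column names, is a load-audit table that every run writes to, which is also where the row-count and rejection information can live:

    -- Hypothetical load-audit table written to by each import run.
    CREATE TABLE dbo.FileLoadLog (
        LoadId        INT IDENTITY(1,1) PRIMARY KEY,
        FileName      NVARCHAR(260)  NOT NULL,
        PickedUpAt    DATETIME       NOT NULL DEFAULT GETDATE(),
        RowsInFile    INT            NULL,
        RowsLoaded    INT            NULL,
        RowsRejected  INT            NULL,
        Succeeded     BIT            NOT NULL DEFAULT 0,
        FailureReason NVARCHAR(1000) NULL
    );

    -- Each run records what happened to the file it picked up:
    INSERT INTO dbo.FileLoadLog (FileName, RowsInFile, RowsLoaded, RowsRejected, Succeeded, FailureReason)
    VALUES (N'customers_20240101.csv', 100000, 99850, 150, 1, NULL);

    -- And abnormal files stand out immediately next to their recent history:
    SELECT TOP (10) FileName, RowsInFile, RowsLoaded, RowsRejected, PickedUpAt
    FROM   dbo.FileLoadLog
    ORDER BY PickedUpAt DESC;

The same table also gives your users an answer to "where did my data go" without anyone having to diff flat files against the database.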
I found this article helpful for the error handling aspects of running cron jobs:
IBM DeveloperWorks article: "Build intelligent, unattended scripts"

Setting up a Reporting Server to liberate resources from a webserver

Yay, first post on SO! (Good work Jeff et al.)
We're trying to solve a bottleneck in one of our web-applications that was introduced when we started allowing users to generate reports on-demand.
Our infrastructure is as follows:
1 server acting as a Webserver/DBServer (ColdFusion 7 and MSSQL 2005)
It's serving a web-application for our backend users and a frontend website. The reports are generated by the users from the backend so there's a level of security where the users have to log in (web based).
During peak hours when reports are generated, it slows the web application and the frontend website to unacceptable speeds, due to SQL Server using resources for the huge queries and ColdFusion afterward generating multi-page PDFs.
We're not exactly sure what the best practice would be to remove some load, but restricting access to the reports isn't an option at the moment.
We've considered denormalizing data to other tables to simplify the most common queries, but that seems like it would just push the issue further.
So, we're thinking of getting a second server and using it as a "report server" with a replicated copy of our DB on which the queries would be run. This would fix one issue, but the second remains: generating PDFs is resource intensive.
We would like to offload that task to the reporting server as well, but since this is a secured web application, we can't simply fire an HTTP GET to create the PDFs: the user is logged in to the web application on server 1, and the PDF would be generated/fetched on server 2 and displayed back in the web application without ever validating the user's credentials...
Anyone have experience with this? Thanks in advance Stack Overflow!!
"We would like to offload that task to the reporting server as well, but being in a secured web-application we can't just fire HTTP GET to create PDFs with the user logged in the web-application from server 1 and displaying it in the web-application but generating/fetching it on server 2 without validating the user's credential..."
Why can't you? You're using the world's easiest language for writing web services. Here are my suggestions.
First, move the database to its own server, so that CF and SQL Server are on separate servers. The first reason to do this is performance; as already mentioned, having both CF and SQL on the same server isn't an ideal setup. The second reason is security: if someone is able to hack your web server, they're right there to get your data. You should have a firewall in place between your CF and SQL servers to give you more security. The last reason is scalability: if you ever need to throw more resources at your database or cluster it, that's easier when it's on its own server.
Now for the web services. What you can do is install CF on another server and write web services to handle the generation of reports. Just lock down the new CF server to accept only SSL connections and pass the login credentials of the users to the web service. Inside your web service, authenticate the user before invoking the methods that generate the report.
Now for the PDFs themselves. One of the methods I've used in the past is generating a hash based on some parameters passed in (the user credentials and the generated SQL used to run the query); once the PDF is generated, you assign the hash to the name of the PDF and save it on disk. Now you have a simple caching system where you can look to see if the PDF already exists: if so, return it, otherwise generate it and cache it.
In closing, your problem is not something that most haven't seen before. You just need to do a little work and your application will be magnitudes faster.
The most basic best practice is to not have the web server and db server on the same hardware. I'd start with that.
You have to separate generating the PDF from doing the calculations. They are two distinct steps.
What you can do is
1) Create a report calculation table, populated daily by a scheduled job with all the calculated values for all your reports.
2) When someone requests a PDF report, have the report do a simple select query against the pre-calculated values; it will be much less DB effort than calculating on the fly (a sketch follows below). You can use ColdFusion to generate the PDF if it needs the fancy PDF settings. Otherwise you may be able to get away with using the raw PDF format (it's similar to HTML markup) in text form, or use another library (cfx_pdf, a suitable Java library, etc.) to generate them.
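A minimal sketch of that pre-calculated table, with invented report and table names (your real schema will differ), keeping to SQL Server 2005-era syntax:

    -- Hypothetical pre-aggregated reporting table, refreshed by a nightly SQL Agent job.
    CREATE TABLE dbo.ReportDailySales (
        ReportDate   DATETIME      NOT NULL,
        Region       NVARCHAR(50)  NOT NULL,
        TotalOrders  INT           NOT NULL,
        TotalRevenue DECIMAL(18,2) NOT NULL,
        CONSTRAINT PK_ReportDailySales PRIMARY KEY (ReportDate, Region)
    );

    -- Nightly refresh; dbo.Orders and its columns stand in for your transactional tables.
    TRUNCATE TABLE dbo.ReportDailySales;
    INSERT INTO dbo.ReportDailySales (ReportDate, Region, TotalOrders, TotalRevenue)
    SELECT DATEADD(DAY, DATEDIFF(DAY, 0, o.OrderDate), 0),   -- truncate to midnight
           o.Region, COUNT(*), SUM(o.Amount)
    FROM   dbo.Orders AS o
    GROUP BY DATEADD(DAY, DATEDIFF(DAY, 0, o.OrderDate), 0), o.Region;

    -- The PDF request then only needs a cheap lookup instead of the huge query:
    DECLARE @Yesterday DATETIME;
    SET @Yesterday = DATEADD(DAY, DATEDIFF(DAY, 0, GETDATE()) - 1, 0);
    SELECT Region, TotalOrders, TotalRevenue
    FROM   dbo.ReportDailySales
    WHERE  ReportDate = @Yesterday;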
If the users don't need to download the reports and only need to view/print them, could you get away with FlashPaper?
An alternative is to build a report queue. Whether you put it on the second server or not, if you can get away with it, CF could put report requests into a queue and email the reports to the users as they get processed.
You can then control the queue through a scheduled process that runs as regularly as you like and creates only a few reports at a time. I'm not sure if it's a suitable approach for your situation.
As mentioned above, using a stored procedure may also help, and make sure you have your indexes set correctly in SQL Server. I once had a 3-minute query that I brought down to 15 seconds because I had forgotten to declare additional indexes in each of the tables that were being heavily used.
Let us know how it goes!
In addition to the advice to separate the web & DB servers, I'd try to:
a) move queries into stored procedures, if you're not using them yet;
b) generate reports on a scheduler and keep them cached in special tables in a ready-to-use state, so customers only select from them with a few fast queries -- this should also decrease report building time for customers.
Hope this helps.
