Extract & transform data from Sql Server to MongoDB periodically - sql-server

I have a Sql Server database which is used to store data coming from a lot of different sources (writers).
I need to provide users with some aggregated data, however in Sql Server this data is stored in several different tables and querying it is too slow ( 5 tables join with several million rows in each table, one-to-many ).
I'm currently thinking that the best way is to extract data, transform it and store it in a separate database (let's say MongoDB, since it will be used only for read).
I don't need the data to be live, just not older that 24 hours compared to the 'master' database.
But what's the best way to achieve this? Can you recommend any tools for it (preferably free) or is it better to write your own piece of software and schedule it to run periodically?

I recommend respecting the NIH principle here, reading and transforming data is a well understood exercise. There are several free ETL tools available, with different approaches and focus. Pentaho (ex Kettle) and Talend are UI based examples. There are other ETL frameworks like Rhino ETL that merely hand you a set of tools to write your transformations in code. Which one you prefer depends on your knowledge and, unsurprisingly, preference. If you are not a developer, I suggest using one of the UI based tools. I have used Pentaho ETL in a number of smaller data warehousing scenarios, it can be scheduled by using operating system tools (cron on linux, task scheduler on windows). More complex scenarios can make use of the Pentaho PDI repository server, which allows central storage and scheduling of your jobs and transformations. It has connectors for several database types, including MS SQL Server. I haven't used Talend myself, but I've heard good things about it and it should be on your list too.
The main advantage of sticking with a standard tool is that once your demands grow, you'll already have the tools to deal with them. You may be able to solve your current problem with a small script that executes a complex select and inserts the results into your target database. But experience shows those demands seldom stay the same for long, and once you have to incorporate additional databases or maybe even some information in text files, your scripts become less and less maintainable, until you finally give in and redo your work in a standard toolset designed for the job.

Related

Loading data from SQL Server to Elasticsearch

Looking for suggesting on loading data from SQL Server into Elasticsearch or any other data store. The goal is to have transactional data available in real time for Reporting.
We currently use a 3rd party tool, in addition to SSRS, for data analytics. The data transfer is done using daily batch jobs and as a result, there is a 24 hour data latency.
We are looking to build something out that would allow for more real time availability of the data, similar to SSRS, for our Clients to report on. We need to ensure that this does not have an impact on our SQL Server database.
My initial thought was to do a full dump of the data, during the weekend, and writes, in real time, during weekdays.
Thanks.
ElasticSearch's main use cases are for providing search type capabilities on top of unstructured large text based data. For example, if you were ingesting large batches of emails into your data store every day, ElasticSearch is a good tool to parse out pieces of those emails based on rules you setup with it to enable searching (and to some degree querying) capability of those email messages.
If your data is already in SQL Server, it sounds like it's structured already and therefore there's not much gained from ElasticSearch in terms of reportability and availability. Rather you'd likely be introducing extra complexity to your data workflow.
If you have structured data in SQL Server already, and you are experiencing issues with reporting directly off of it, you should look to building a data warehouse instead to handle your reporting. SQL Server comes with a number of features out of the box to help you replicate your data for this very purpose. The three main features to accomplish this that you could look into are AlwaysOn Availability Groups, Replication, or SSIS.
Each option above (in addition to other out-of-the-box features of SQL Server) have different pros and drawbacks. For example, AlwaysOn Availability Groups are very easy to setup and offer the ability to automatically failover if your main server had an outage, but they clone the entire database to a replica. Replication let's you more granularly choose to only copy specific Tables and Views, but then you can't as easily failover if your main server has an outage. So you should read up on all three options and understand their differences.
Additionally, if you're having specific performance problems trying to report off of the main database, you may want to dig into the root cause of those problems first before looking into replicating your data as a solution for reporting (although it's a fairly common solution). You may find that a simple architectural change like using a columnstore index on the correct Table will improve your reporting capabilities immensely.
I've been down both pathways of implementing ElasticSearch and a data warehouse using all three of the main data synchronization features above, for structured data and unstructured large text data, and have experienced the proper use cases for both. One data warehouse I've managed in the past had Tables with billions of rows in it (each Table terabytes big), and it was highly performant for reporting off of on fairly modest hardware in AWS (we weren't even using Redshift).

Migrate data from MariaDB to SQLServer

We are planning to migrate all the data from MariaDB to SQLServer. Can anyone please suggest any approach to migrate the data so that no downtime is required as well as no data is lost.
In context of that, I have gone through a few posts here, but did not get much idea.
You could look into SQL Server Integration Services functionality for migrating your data.
Or you could manually create a migration script using a linked server in your new SQL Server instance.
Or you could use BCP to perform bulk imports (which is quite fast, but requires intermediate steps to put the data in text files).
What's more important is how you want to realize the "no downtime" requirement. I suppose the migration routines need some functional requirement, which might be difficult to implement with a general migration tool, like:
the possibility to perform the migration in multiple batches/runs (where already migrated data is skipped), and
the possibility to implement different phases of the migration in different solutions, like bulk imports (using text files and staging tables) for history data (which will not change anymore), but live queries over a live database connection for the latest updates in the MariaDB/MySQL database.
The migration strategy might also largely depend on the size of the data in MariaDB/MySQL, and the structure of the database(s) and its data. Perhaps you want to keep auto-generated primary key values, because the system requires them to remain unchanged. Perhaps you need to use different data types for some exotic table fields. Perhaps you need to re-implement some database logic (like stored procedures and functions). Etc. etc.
It is very difficult to give some ad-hoc advice about these kind of migration projects; as Tim Biegeleisen already commented, this can be quite a complex job, even for "small" databases. It practically always requires a lot of research, extensive preparations, test runs (in a testing environment using database backups), some more test runs, a final test run, etc. And - of course - some analytics, metrics, logging, and reporting for troubleshooting (and to know what to expect during the actual migration). If the migration will be long-running, you want to make sure it does not freeze the live production environment, and you might also want some form of progress indication during the migration.
And - last but not least - you surely want to have a "plan B" or a quick return strategy in case the actual migration will fail (despite all those careful preparations).
Hope I did not forget something... ;-)

Consideration DECS vs SSIS?

I need solution to pump data from Lotus Notes to SqlServer. Data will be transfered in 2 modes
Archive data transfer
Current data transfer
Availability of data in Sql is not critical, data is used for reports. Reports could be created daily, weekly or monthly.
I am considering to choose from one of those solutions: DESC and SSIS. Could You please give me some tips about prons and cons of both technologies. If You suggest something else it could be also taken into consideration.
DECS - Domino Enterprise Connection Services
SSIS - Sql Sever Integration Services
I've personally used XML frequently to get data out of Lotus Notes in a way that can be read easily by other systems. I'd suggest you take a look and see if that fits your needs. You can create views that emit XML or use NotesAgents or Java Servlets, all of which can be accessed using HTTP.
SSIS is a terrific tool for complex ETL tasks. You can even write C# code if you need to. There are lots of pre-written available data cleaning components already out there for you to download if you want. It can pretty much do anything you need to do. It does however have a fairly steep learning curve. SSIS comes free with SQL Server so that is a plus. A couple of things I really like about SSIS are the ability to log errors and the way it handles configuration so that moving the package from the dev environment to QA and Prod is easy once you have set it up.
We have also set up a meta data database to record a lot of information about our imports such as the start and stop time, when the file was recieved, the number of records processed, types of errors etc. This has really helped us in researching data issues and has helped us write some processes that are automatically stopped when the file exceeds the normal parameters by a set amount. This is handy if you normally recive a file with 2 million records and the file comes in one day with 1000 records. Much better than delting 2,000,000 potential customer records because you got a bad file. We also now have the ability to do reporting on files that were received but not processed or files that were expected but not received. This has tremendously improved our importing porcesses (we have hundreds of imports and exports in our system). If you are designing from sratch, you might want to take some time and think about what meta data you want to have and how it will help you over time.
Now depending on your situation at work, if there is a possibility that data will also be sent to the SQL Server database from sources other than Lotus Notes as well as the imports from Notes that you are developing for, I would suggest it might be worth your time to go ahead and start using SSIS as that is how the other imports are likely to be done. As a database person, I would prefer to have all the imports I support using the same technology.
I can't say anything about DECS as I have never used it.
Just a thought - but as Lotus Notes tends to behave a bit "different" than relational databases (or anything else), you might be safer going with a tool which comes out of the Notes world, versus a tool from the sql world.
(I have used DECS in the past (prior to Domino 8) and it has worked fine for pumping data out into a SQL Server database. I have not used SSIS).

Swapping out databases?

It seems like the goal of a lot of ORM tools and custom data access layers (DAO pattern, etc.) is to abstract the database to the point where you could supposedly swap out the entire database system with minimal work.
Following the common DAL patterns is usually a good idea in code, but it seems like it would never be minimal work to swap out a database. (Cost, training, data migration, etc.)
Does anyone have any experience with swapping out one database for another in a large system, and dealing with the implications in code? Is it worth it to worry about abstracting the actual database from your code?
Question 1: Does anyone have any experience with
swapping out one database for another
in a large system, and dealing with
the implications in code?
Yes we tried it. Our customer is using a large MS Access based Delphi client server application. After about five years we considered switching to SQL Server. We analyzed the problem and concluded that swapping the database would be very costly and provide only a few advantages. Customer decided not to swap the database. The application is still running fine and the customer is still happy.
Note that:
MS Access is only being used for data storage and report generation.
The server application ensures that MS Access is only being accessed on the server. Normal multi-user MS Access applications will transfer large chunks of the Access database over the network - resulting in slow and unreliable database functionality. This is not the case for this application. Client <> Server <> MS Access. Only the server application communicates with the MS Access database. Actually the Server has exclusive access to the MS Access database. No other computer can open to the MS Access database. Conclusion: MS Access is being used as a true RDBMS, Relational DataBase Management System - please no flaming about MS Access being inferior and unstable - it has been running fine for more than 10 years.
The most important issues you will have to consider:
SQL statements: (SELECT, UPDATE, DELETE, INSERT, CREATE TABLE) and make sure they would be compatible with the SQL database. It's amazing how much all the RDBMS differ in the details (date formats, number formats, search formats, string formats, join syntax, create table syntax, stored procedures, user defined functions, (auto) primary keys, etc.)
Report generation: Depending on your database you might be using a different reporting tool. Our customer has over 200 complex reports. Converting all these reports is very time consuming.
Performance: all RDBMS have different performances in different environments. Normally performance optimalisations are very much RDBMS dependent.
Costs: the costs of tools, developers, server and user licenses varies greatly. It ranges from free to very expensive. Free does not mean cheap and expensive does not always equate to good. A cost/value comparison will have to be made.
Experience: making the best use of your RDBMS requires experience. If you have to develop for an "unknown" RDBMS your productivity will suffer.
Question 2: Is it worth it to worry about
abstracting the actual database from
your code?
Yes. In an ideal world, swapping a database would just be adjusting the data connection string. In the real world this is not possible because all databases are different. They all have tables and SQL support but the differences are in the details. If you can keep the differences of the databases shielded through abstraction - please do so. Make a list of the databases you need to support. Check the selected database systems for the differences. Provide centralized code to handle the differences. Support one RDBMS and provide stubs for future support of other RDBMS.
I disagree that the purpose is to be able to swap out databases, and I think you are correct in showing some suspicion about ORMs leading towards that goal.
However, I would still use an ORM, as it abstracts away the details of data access. Isn't this the goal of object oriented programming? Keep your concerns separated.
I think the primary use case for database abstraction (via ORM tools) is to be able to ship a product that works with multiple database brands. I believe it's a rarer occurrence for a company to switch between database vendors, but that's still one of the use cases.
I've worked jobs where we started out using MySQL for monetary reasons (think a startup) and, one we started making money, wanted to switch to Oracle. We didn't end up making the switch, but it was nice to have the option.
Still, ORM tools are not a completely leak-less abstractions and I know our migration still would have been painful and costly. It totally depends on what you are building, but it has been my experience that -- for performance reasons, usually -- you end up either working around your ORM solution or exploiting vendor-specific features at some point.
The only time I've seen a database switch was from HSQL during early development to Oracle as the project progressed. The ORM made this easy.
I often use the DAO pattern to swap out data services (from a database to web service or to swap a web service to a test stub).
For ORM I don't think the goal is to enable you to switch databases - it is to hide you from the complexities of different database implementations and removing the need to worry about the fine details of translating from relational to object represenations of your data.
By having someone smart write an ORM that handles caching, only updates fields that have changed, groups updates, etc I don't need to. Although in the cases where I need something special I can still revert to SQL if I want.

When is it time to change database backends?

Is there a general rule of thumb to follow when storing web application data to know what database backend should be used? Is the number of hits per day, number of rows of data, or other metrics that I should consider when choosing?
My initial idea is that the order for this would look something like the following (but not necessarily, which is why I'm asking the question).
Flat Files
BDB
SQLite
MySQL
PostgreSQL
SQL Server
Oracle
It's not quite that easy. The only general rule of thumb is that you should look for another solution when the current one can't keep up anymore. That could include using different software (not necessarily in any globally fixed order), hardware or architecture.
You will probably get a lot more benefit out of caching data using something like memcached than switching to another random storage backend.
If you think you are going to ever need one of the heavyweights (SqlServer, Oracle), you should start with one of those at the beginning. Data migrations are extremely difficult. In the long run it will cost you less to just start at the top and stay there.
I think you're being overly specific in your rankings. You can pretty much start with flat files and the like for very small data sets, go up to something like DBM for slightly bigger ones that don't require SQL-like syntax, and go to some kind of SQL database after that.
But who wants to do all that rewriting? If the application will benefit from access to joins, stored procedures, triggers, foreign key validation, and the like--just use a SQL database regardless of the dataset size.
Which one should depend more on the client's existing installations and what DBA skills are available than on the amount of data you're holding.
In other words, the size of your database is far from the only consideration, and maybe not the most important one.
There is no blanket answer to this, but ALMOST always, using flat files is not a good idea. You have to parse through them (i suppose) and they do not scale well. Starting with a proper database, like Oracle or SQL Server (or MySQL, Postgres if you are looking for free options) is a good idea. For very little overhead, you will save yourself a lot of effort and headache later on. They also allow you to structure your data in a non-stupid fashion, leaving you free to think of WHAT you will do with the data rather than HOW you will be getting it in/out.
It really depends on your data, and how you intend to use it. At one of my previous positions, we used Postgres due to the native geo-location and timezone extensions which existed because it allowed us to manage our data using polygonal datatypes. For us, we needed to do that, and we also wanted to use stored procedures, views and the like.
Now, another place I worked at used MySQL simply because the data was normalized, standard row by row data.
SQL Server, for a long time, had a 4gb database limit (see SQL Server 2000), but despite that limitation it remains a very stable platform for small to medium applications for which the old data is purged.
Now, from working with Oracle and SQL Server 05/08, all I can tell you is that if you want the creme of the crop for stability, scalability and flexibility, then these two are your best bet. For enterprise applications, I strongly recommend them (merely because that's what we use where I work now).
Other things to consider:
Language integration (ASP.NET session storage, role management, etc.)
Query types (Select, Update, Delete) [Although this is more of a schema design issue, not a DBMS issue)
Data storage requirements
Your application's utilization of the database is the most critical ones. Mainly what queries are used most often (SELECT, INSERT or UPDATE)?
Say if you use SQLite, it is gears for smaller application but for "web" application you might a bigger one like MySQL or SQL Server.
The way you write scripts and your web application platforms also matters. If you're developing on a Microsoft platform, then SQL Server is a better alternative.
Typically, I go with what is commonly accepted by whichever framework I am using. So, if I'm doing .NET => SQL Server, Python (via Django or Pylons) => MySQL or SQLite.
I almost never use flat files though.
There is more to choosing an RDBMS solution that just "back end horsepower". The ability to have commitment control, for example, so you can roll back a failed transaction is one. reason.
Unless you are in the megatransaction rate application, most database engines would be adequate - so it becomes a question of how much you want to pay for the software, whether it runs on the hardware and operating system environment you want, and what expertise you have in managing that software.
That progression sounds painful. If you're going to include MS products (especially the for-pay SQL Server) in there anywhere, you may as well use the whole stack, since you only have to pay for the last of these:
SQL Server Compact -> SQL Server Express -> SQL Server Enterprise (clustered).
If you target your app at SQL Server Compact initially, all your SQL code is guaranteed to scale up to the next version without modification. If you get bigger than SQL Server Enterprise, then congratulations. That's what they call a good problem to have.
Also: go back and check the SO podcasts. I believe they talked about this briefly.
This question depends on your situation really.
If you have control over the server you're deploying to and you can install whatever services you need, then the time to install a MySql or MSSQL Express server and code against an existing database framework VERSUS coding against flat file structure is not worth the effort of considering.
What about FireBird? Where would that fit into that list?
And lets not forget the requirements that the "customer" of your solution must also have in place. If your writing a commercial application for a small companies, then Oracle might not be a good choice... but if your writing a customized solution for a large enterprise which must share data among multiple campuses, and has a good sized IT department then the decision of Oracle vs Sql Server would come down to what does the customer most likely already have deployed.
Data migration nowdays isn't that bad since we have those great tools from Embarcadero, so I would instead let the customer needs drive the decision.
If you have the option SQL Server is a good choice from the word go, predominantly because you have access to solid procedures and functions and the database backup facilities are totally reliable. Wrapping up as much as your logic as you can inside the database itself (rather than in whatever language you are using) helps security and performance - indeed there's an good argument to be made for always using procedures for insert/update logic as these make you invulnerable to injection attacks.
If I have the choice the only time I'd consider MySQL in preference is with a large, fairly simple, database predominantly used for read access. This isn't to decry MySQL which has improved markedly of late and I happily use if I don't have the choice, but for more complex systems with update/insert activity MSSQL is generally the superior option.
I think your list is subjective but I will play your game.
Flat Files
BDB
SQLite
MySQL
PostgreSQL
SQL Server
Oracle
Teradata

Resources