I need to synchronize two databases.
These databases store the same semantic objects, but their physical representations differ between the two systems.
I plan to use the DTO pattern to give the objects a uniform representation:
DB ----> DTO ----> MAPPING (Getters / Setters) ----> DTO ----> DB
I think this is a better idea than synchronizing physically with SQL queries on each side; I use Hibernate to add abstraction and to synchronize the objects.
Do you think it's a good idea?
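For reference, here is a rough sketch of the mapping step I have in mind (shown in Python for brevity; the real code uses Hibernate entities, and all the names here are made up):

    # Minimal sketch of the DTO mapping step (hypothetical names; the real
    # implementation would map Hibernate entities rather than plain rows).
    from dataclasses import dataclass


    @dataclass
    class CustomerDTO:
        """Uniform representation shared by both databases."""
        customer_id: int
        full_name: str
        email: str


    def from_db_a(row: dict) -> CustomerDTO:
        # DB A stores first and last name in separate columns.
        return CustomerDTO(
            customer_id=row["id"],
            full_name=f'{row["first_name"]} {row["last_name"]}',
            email=row["mail"],
        )


    def to_db_b(dto: CustomerDTO) -> dict:
        # DB B expects a single display_name column.
        return {
            "cust_id": dto.customer_id,
            "display_name": dto.full_name,
            "email_address": dto.email,
        }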
Nice reference above to Hitchhiker's Guide.
My two cents: you need to consider using the right tool for the job. While it is tempting to write custom code to solve this problem, there are numerous tools out there that already do this for you: they map source to target, do custom transformations from attribute to attribute, and will more than likely deliver a faster time to market.
Look to ETL tools. I'm unfamiliar with the tools available in the open-source community, but if you lean in that direction, I'm sure you'll find some. Other tools you might look at are Informatica, Data Integrator, SQL Server Integration Services and, if you're dealing with spatial data, one called Alteryx.
Tim
Doing that with an ORM might be slower by an order of magnitude than a well-crafted SQL script. It depends on the size of the DB.
EDIT
I would add that the decision should depend on the amount of difference between the two schemas, not on your expertise with SQL. SQL is so common that developers should be able to write a simple script in a clean way.
SQL also has the advantage that everybody knows how to run a script, while not everybody will know how to run your custom tool (this is a problem I have encountered in practice when the migration is actually operated by somebody else).
For schemas that only slightly differ (e.g. in names, or with simple transformations of column values), I would go for a SQL script. This is probably more compact and more straightforward to use and communicate.
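As a toy illustration of the "slightly different" case, assuming for the sake of example that both databases were SQLite files and only column names plus one simple value transformation differ:

    # Toy illustration only; table, column and file names are hypothetical.
    import sqlite3

    con = sqlite3.connect("source.db")
    con.execute("ATTACH DATABASE 'target.db' AS tgt")
    con.execute(
        """
        INSERT INTO tgt.customers (id, full_name, created_on)
        SELECT cust_id,
               first_name || ' ' || last_name,   -- simple value transformation
               creation_date
        FROM customers
        """
    )
    con.commit()
    con.close()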
For schemas with major differences, with data organized in different tables or complex logic needed to map values from one schema to the other, a dedicated tool may make sense. Chances are the initial effort to write the tool is greater, but it can become an asset once created.
You should also consider non-functional aspects, such as exception handling, logging of errors, splitting the work into smaller transactions (because there is too much data), etc.
SQL scripts can indeed become "messy" under such conditions. If you have such constraints, SQL will require advanced skills and will tend to be hard to use and maintain.
The custom tool can evolve into a mini-ETL with the ability to chunk the work into small transactions, manage and log errors nicely, etc. This is more work, and can end up being a dedicated project.
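A skeleton of such a chunked loop might look like this (Python standard library only; all table, column and file names are hypothetical):

    # Chunked mini-ETL loop: one transaction per chunk, errors logged per chunk.
    import logging
    import sqlite3

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("sync")

    CHUNK_SIZE = 1000

    src = sqlite3.connect("source.db")
    dst = sqlite3.connect("target.db")

    cursor = src.execute("SELECT cust_id, first_name, last_name FROM customers")
    while True:
        rows = cursor.fetchmany(CHUNK_SIZE)
        if not rows:
            break
        try:
            with dst:  # commits the chunk, or rolls it back on error
                dst.executemany(
                    "INSERT OR REPLACE INTO customers (id, full_name) VALUES (?, ?)",
                    [(r[0], f"{r[1]} {r[2]}") for r in rows],
                )
            log.info("committed %d rows", len(rows))
        except sqlite3.Error:
            log.exception("chunk failed, continuing with the next chunk")

    src.close()
    dst.close()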
The decision is yours.
I have done that before, and I thought it was a pretty solid and straightforward way to map between 2 DBs. The only downside is that any time either database changes, I had to update the mapping logic, but it's usually pretty simple to do.
Related
Does it make sense to use an OR-mapper?
I am putting this question out there on Stack Overflow because this is the best place I know of to find smart developers willing to give their assistance and opinions.
My reasoning is as follows:
1.) Where does the SQL belong?
a.) In every professional project I have worked on, security of the data has been a key requirement. Stored Procedures provide a natural gateway for controlling access and auditing.
b.) Issues with Applications in production can often be resolved between the tables and stored procedures without putting out new builds.
2.) How do I control the SQL that is generated? I am trusting parse trees to generate efficient SQL.
I have quite a bit of experience optimizing SQL in SQL-Server and Oracle, but would not feel cheated if I never had to do it again. :)
3.) What is the point of using an OR-Mapper if I am getting my data from stored procedures?
I have used the repository pattern with a homegrown generic data access layer.
If a collection needed to be cached, I cache it. I also have experience using EF on a small CRUD application, and experience helping to tune an NHibernate application that was having performance issues. So I am a little biased, but willing to learn.
For the past several years we have all been hearing a lot of respectable developers advocating the use of specific OR-Mappers (Entity-Framework, NHibernate, etc...).
Can anyone tell me why someone should move to an ORM for mainstream development on a major project?
edit: http://www.codinghorror.com/blog/2006/06/object-relational-mapping-is-the-vietnam-of-computer-science.html seems to have a strong discussion on this topic but it is out of date.
Yet another edit:
Everyone seems to agree that Stored Procedures are to be used for heavy-duty enterprise applications, due to their performance advantage and their ability to add programming logic nearer to the data.
I am seeing that the strongest argument in favor of OR mappers is developer productivity.
I suspect a large motivator for the ORM movement is developer preference towards remaining persistence-agnostic (don’t care if the data is in memory [unless caching] or on the database).
ORMs seem to be outstanding time-savers for local and small web applications.
Maybe the best advice I am seeing is from client09: to use an ORM setup, but use Stored Procedures for the database intensive stuff (AKA when the ORM appears to be insufficient).
I was pro-SP for many, many years and thought it was the ONLY right way to do DB development, but the last 3-4 projects I have done were completed in EF 4.0 without SPs, and the improvements in my productivity have been truly awe-inspiring - I can do things in a few lines of code now that would have taken me a day before.
I still think SP's are important for some things, (there are times when you can significantly improve performance with a well chosen SP), but for the general CRUD operations, I can't imagine ever going back.
So the short answer for me is, developer productivity is the reason to use the ORM - once you get over the learning curve anyway.
A different approach... With the rise of the NoSQL movement, you might want to try an object/document database instead to store your data. That way you basically avoid the hell that is OR mapping. Store the data the way your application uses it, and do the transformation behind the scenes in a worker process that moves it into a more relational/OLAP format for further analysis and reporting.
Stored procedures are great for encapsulating database logic in one place. I've worked on a project that used only Oracle stored procedures, and am currently on one that uses Hibernate. We found that it is very easy to develop redundant procedures, as our Java developers weren't versed in PL/SQL package dependencies.
As the DBA for the project, I find that the Java developers prefer to keep everything in the Java code. You run into the occasional "Why don't I just loop through all the objects that were just returned?", which caused a number of "Why isn't the index taking care of this?" issues.
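The difference, sketched with a hypothetical Django-style model (the same point applies to Hibernate queries), is roughly:

    # Hypothetical model, assuming a configured Django project; the same
    # idea applies to Hibernate entities and HQL.
    from django.db import models

    class Task(models.Model):
        status = models.CharField(max_length=20, db_index=True)

    # Filtering in application code fetches every row and ignores the index:
    open_tasks = [t for t in Task.objects.all() if t.status == "OPEN"]

    # Filtering in the query lets the database and its index do the work:
    open_tasks = Task.objects.filter(status="OPEN")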
With Hibernate your entities can contain not only their linked database properties, but can also contain any actions taken upon them.
For example, we have a Task Entity. One could Add or Modify a Task among other things. This can be modeled in the Hibernate Entity in Named Queries.
So I would say go with an ORM setup, but use procedures for the database intensive stuff.
A downside of keeping your SQL in Java is that you run the risk of developers using non-parameterized queries, leaving your app open to SQL injection.
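For example (hypothetical table, shown with Python's DB-API for brevity; a JDBC PreparedStatement gives the same protection in Java):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE tasks (name TEXT)")

    user_input = "anything' OR '1'='1"

    # Vulnerable: the input is pasted into the SQL text, so the OR clause
    # becomes part of the query and matches every row.
    rows = con.execute(
        f"SELECT * FROM tasks WHERE name = '{user_input}'"
    ).fetchall()

    # Safe: the value is bound as a parameter and never interpreted as SQL.
    rows = con.execute(
        "SELECT * FROM tasks WHERE name = ?", (user_input,)
    ).fetchall()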
The following is just my private opinion, so it's rather subjective.
1.) I think that one needs to differentiate between local applications and enterprise applications. For local and some web applications, direct access to the DB is okay. For enterprise applications, I feel that the better encapsulation and rights management makes stored procedures the better choice in the end.
2.) This is one of the big issues with ORMs. They are usually optimized for specific query patterns, and as long as you use those, the generated SQL is typically of good quality. However, for complex operations which need to be performed close to the data to remain efficient, my feeling is that using manual SQL code is still the way to go, and in this case the code goes into SPs.
3.) Dealing with objects as data entities is also beneficial compared to direct access to "loose" datasets (even if those are typed). Deserializing a result set into an object graph is very useful, no matter whether the result set was returned by a SP or from a dynamic SQL query.
If you're using SQL Server, I invite you to have a look at my open-source bsn ModuleStore project, it's a framework for DB schema versioning and using SPs via some lightweight ORM concept (serialization and deserialization of objects when calling SPs).
For our reporting application, we have a process that aggregates several databases into a single 'reporting' database on a nightly basis. The schema of the reporting database is quite different than that of the separate 'production' databases that we are aggregating so there is a good amount of business logic that goes into how the data is aggregated.
Right now this process is implemented by several stored procedures that run nightly. As we add more details to the reporting database the logic in the stored procedures keeps growing more fragile and unmanageable.
What are some other strategies that could be used to populate this reporting database?
SSIS? This has been considered, but it doesn't appear to offer a much cleaner, more maintainable approach than just the stored procedures.
A separate C# (or whatever language) process that aggregates the data in memory and then pushes it into the reporting database? This would allow us to write Unit Tests for the logic and organize the code in a much more maintainable manner.
I'm looking for any new ideas or additional thoughts on the above. Thanks!
Our general process is:
1. Copy data from the source table(s) into tables with exactly the same structure in a loading database
2. Transform the data into staging tables, which have the same structure as the final fact/dimension tables
3. Copy data from the staging tables to the fact/dimension tables
SSIS is good for step 1, which is more or less a 1:1 copy process, with some basic data type mappings and string transformations.
For step 2, we use a mix of stored procs, .NET and Python. Most of the logic is in procedures, with things like heavy parsing in external code. The major benefit of pure TSQL is that very often transformations depend on other data in the loading database, e.g. using mapping tables in a SQL JOIN is much faster than doing a row-by-row lookup process in an external script, even with caching. Admittedly, that's just my experience, and procedural processing might be better for your data set.
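The mapping-table point looks roughly like this (sketched in Python with SQLite standing in for our T-SQL procedures; all table and column names are hypothetical):

    # One set-based statement applies the mapping table to the whole load,
    # instead of a row-by-row lookup in an external script.
    import sqlite3

    con = sqlite3.connect("loading.db")
    con.execute(
        """
        UPDATE staging_orders
        SET region_key = (
            SELECT m.region_key
            FROM region_map AS m
            WHERE m.source_region_code = staging_orders.region_code
        )
        """
    )
    con.commit()

    # The row-by-row equivalent would issue one lookup and one UPDATE per row,
    # which is dramatically slower on a large load.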
In a few cases we do have to do some complex parsing (of DNA sequences) and TSQL is just not a viable solution. So that's where we use external .NET or Python code to do the work. I suppose we could do it all in .NET procedures/functions and keep it in the database, but there are other external connections required, so a separate program makes sense.
Step 3 is a series of INSERT... SELECT... statements: it's fast.
So all in all, use the best tool for the job, and don't worry about mixing things up. An SSIS package - or packages - is a good way to link together stored procedures, executables and whatever else you need to do, so you can design, execute and log the whole load process in one place. If it's a huge process, you can use subpackages.
I know what you mean about TSQL feeling awkward (actually, I find it more repetitive than anything else), but it is very, very fast for data-driven operations. So my feeling is, do data processing in TSQL and string processing or other complex operations in external code.
I would take another look at SSIS. While there is a learning curve, it can be quite flexible. It has support for a lot of different ways to manipulate data including stored procedures, ActiveX scripts and various ways to manipulate files. It has the ability to handle errors and provide notifications via email or logging. Basically, it should be able to handle just about everything. The other option, a custom application, is probably going to be a lot more work (SSIS already has a lot of the basics covered) and is still going to be fragile - any changes to data structures will require a recompile and redeployment. I think a change to your SSIS package would probably be easier to make. For some of the more complicated logic you might even need to use multiple stages - a custom C# console program to manipulate the data a bit and then an SSIS package to load it to the database.
SSIS is a bit painful to learn and there are definitely some tricks to getting the most out of it but I think it's a worthwhile investment. A good reference book or two would probably be a good investment (Wrox's Expert SQL Server 2005 Integration Services isn't bad).
I'd look at ETL (extract/transform/load) best practices. You're asking about buying vs building, a specific product, and a specific technique. It's probably worthwhile to backup a few steps first.
A few considerations:
There are a lot of subtle tricks to delivering good ETL: making it run very fast, making it easy to manage, handling rule-level audit results, supporting high availability or even reliable recovery, and even being used as the recovery process for the reporting solution (rather than database backups).
You can build your own ETL. The downside is that commercial ETL solutions have pre-built adapters (which you may not need anyway), and that custom ETL solutions tend to fail because few developers are familiar with the batch-processing patterns involved (see your existing architecture). Since ETL patterns are not well documented, you are unlikely to succeed in writing your own ETL solution unless you bring in a developer who is very experienced in this space.
When looking at commercial solutions note that the metadata and auditing results are the most valuable part of the solution: The GUI-based transform builders aren't really any more productive than just writing code - but the metadata can be more productive than reading code when it comes to maintenance.
Complex environments are difficult to cover with a single ETL product - because of network access, performance, latency, data format, security or other requirements incompatible with your ETL tool. So a combination of custom and commercial tools often results anyway.
Open source solutions like Pentaho are really commercial solutions if you want support or critical features.
So, I'd probably go with a commercial product if pulling data from commercial apps, if the requirements (performance, etc) are tough, or if you've got a junior or unreliable programming team. Otherwise you can write your own. In that case I'd get an ETL book or consultant to help understand the typical functionality and approaches.
I've run data warehouses that were built on stored procedures, and I have used SSIS. Neither is that much better than the other IMHO. The best tool I have heard of to manage the complexity of modern ETL is called Data Build Tool (DBT) (https://www.getdbt.com/). It has a ton of features that make things more manageable. Need to refresh a particular table in the reporting server? One command will rebuild it, including refreshing all the tables it depends on back to the source. Need dynamic SQL? This offers Jinja for scripting your dynamic SQL in ways you never thought possible. Need version control for what's in your database? DBT has you covered. After all that, it's free.
I have developed several projects on the Python framework Django, and it greatly improved my productivity. But once a project was released and the number of visitors grew, the DB became the performance bottleneck.
I tried to address the issue and found that it is the ORM (Django's) that makes it so slow. Why? Because Django has to provide a uniform interface for the programmer no matter which DB backend you are using, so it necessarily sacrifices some database performance (turning what could be one raw SQL statement into several, and never using DB-specific operations).
I understand that the ORM is definitely useful and that it can:
Offer a uniform OO interface for the programmers
Make DB backend migration much easier (from MySQL to SQL Server or others)
Improve the robustness of the code (using an ORM means less code, and less code means fewer errors)
But if I don't have a migration requirement, what is the value of the ORM to me?
P.S. Recently my friend told me that what he is doing now is just rewriting the ORM code as raw SQL to get better performance. What a pity!
So what is the real value of the ORM, beyond what I mentioned above?
(Please correct me if I made a mistake. Thanks.)
You have mostly answered your own question when you listed the benefits of an ORM. There are definitely some optimisation issues that you will encounter, but the abstraction of the database interface probably outweighs these downsides.
You mention that the ORM sometimes uses many sql statements where it could use only one. You may want to look at "eager loading", if this is supported by your ORM. This tells the ORM to fetch the data from related models at the same time as it fetches data from another model. This should result in more performant sql.
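In Django, for example, eager loading looks like this (hypothetical models, assuming a configured project):

    from django.db import models

    class Author(models.Model):
        name = models.CharField(max_length=100)

    class Book(models.Model):
        title = models.CharField(max_length=200)
        author = models.ForeignKey(Author, on_delete=models.CASCADE)

    # Lazy loading: one query for the books, plus one extra query per author.
    for book in Book.objects.all():
        print(book.author.name)

    # Eager loading: a single query with a JOIN fetches books and authors together.
    for book in Book.objects.select_related("author"):
        print(book.author.name)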
I would suggest that you stick with your ORM and optimise the parts that need it, but, explore any methods within the ORM that allow you to increase performance before reverting to writing SQL to do the access.
A good ORM allows you to tune the data access if you discover that certain queries are a bottleneck.
But the fact that you might need to do this does not in any way remove the value of the ORM approach, because it rapidly gets you to the point where you can discover where the bottlenecks are. It is rarely the case that every line of code needs the same amount of careful hand-optimisation. Most of it won't. Only a few hotspots require attention.
If you write all the SQL by hand, you are "micro optimising" across the whole product, including the parts that don't need it. So you're mostly wasting effort.
Here is the definition from Wikipedia:
Object-relational mapping is a programming technique for converting data between incompatible type systems in relational databases and object-oriented programming languages. This creates, in effect, a "virtual object database" that can be used from within the programming language.
A good ORM (like Django's) makes it much faster to develop and evolve your application; it lets you assume that all related data is available without having to account for every use in hand-tuned queries.
But a simple one (like Django's) doesn't relieve you from good old DB design. If you're seeing a DB bottleneck with fewer than several hundred simultaneous users, you have serious problems: either your DB isn't well tuned (typically you're missing some indexes), or it doesn't appropriately represent the data design (if you need many different queries for every page, that is your problem).
So I wouldn't ditch the ORM unless you're Twitter or Flickr. First do all the usual DB analysis: do you see a lot of full-table scans? Add appropriate indexes. Lots of queries per page? Rethink your tables. Every user needs lots of statistics? Precalculate them in a batch job and serve them from there.
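In Django, for instance, a missing index is often just a model change and a migration away (model and field names here are hypothetical):

    from django.db import models

    class Order(models.Model):
        customer_email = models.CharField(max_length=254, db_index=True)  # single-column index
        status = models.CharField(max_length=20)
        created_at = models.DateTimeField(auto_now_add=True)

        class Meta:
            # composite index for the common "recent orders with a given status" query
            indexes = [models.Index(fields=["status", "created_at"])]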
ORM separates you from having to write that pesky SQL.
It's also helpful for when you (never) port your software to another database engine.
On the downside: you lose performance, which you fix by writing a custom flavor of SQL - the very thing it tried to insulate you from having to write in the first place.
An ORM generates SQL queries for you and then returns the results to you as objects; that's why it is slower than accessing the database directly. But I think it is only a little slower... I recommend you tune your database; maybe you need to check the table indexes, etc.
Oracle, for example, needs to be tuned if you want it to be faster (I don't know why, but my DB admin did that and queries involving lots of data ran faster).
One recommendation: if you need to run complex queries (e.g. reports) beyond plain CRUD, and your application won't switch to another database, you should use direct SQL (I think Django has a feature for that).
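For example, Django lets you drop to raw SQL through a cursor for exactly this kind of reporting query (model and table names here are hypothetical, and a configured project is assumed):

    from django.db import connection

    def sales_over(threshold):
        with connection.cursor() as cursor:
            cursor.execute(
                "SELECT product_id, SUM(amount) AS total "
                "FROM myapp_sale GROUP BY product_id HAVING SUM(amount) > %s",
                [threshold],
            )
            return cursor.fetchall()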
We have a database with ~100K business objects in it. Each object has about 40 properties which are stored amongst 15 tables. I have to get these objects, perform some transforms on them and then write them to a different database (with the same schema.)
This is ADO.Net 3.5, SQL Server 2005.
We have a library method to write a single property. It figures out which of the 15 tables the property goes into, creates and opens a connection, determines whether the property already exists and does an insert or update accordingly, and closes the connection.
My first pass at the program was to read an object from the source DB, perform the transform, and call the library routine on each of its 40 properties to write the object to the destination DB. Repeat 100,000 times. Obviously this is egregiously inefficient.
What are some good designs for handling this type of problem?
Thanks
This is exactly the sort of thing that SQL Server Integration Services (SSIS) is good for. It's documented in Books Online, same as SQL Server is.
Unfortunately, I would say that you need to forget your client-side library, and do it all in SQL.
How many times do you need to do this? If only once, and it can run unattended, I see no reason why you shouldn't reuse your existing client code. Automating the work of human beings is what computers are for. If it's inefficient, I know that sucks, but if you're going to do a week of work setting up a SSIS package, that's inefficient too. Plus, your client-side solution could contain business logic or validation code that you'd have to remember to carry over to SQL.
You might want to research Create_Assembly, moving your client code across the network to reside on your SQL box. This will avoid network latency, but could destabilize your SQL Server.
Bad news: you have many options
use flat-file transformations: extract all the data into flat files, manipulate them using grep, awk, sed, C or Perl into the required insert/update statements, and execute those against the target database
PRO: fast. CON: extremely ugly ... a nightmare for maintenance; don't do this if you need it for longer than a week, or for more than a couple dozen executions.
use pure SQL: I don't know much about SQL Server, but I assume it has a way to access one database from within the other, so one of the fastest ways to do this is to write it as a collection of insert/update/merge statements fed with select statements.
PRO: fast, one technology only. CON: requires a direct connection between the databases. You might reach the limits of SQL, or of the available SQL knowledge, pretty fast, depending on the kind of transformation.
use T-SQL, or whatever iterative language the database provides; everything else is similar to the pure SQL approach.
PRO: pretty fast, since you don't leave the database. CON: I don't know T-SQL, but if it is anything like PL/SQL, it is not the nicest language for complex transformations.
use a high-level language (Java, C#, VB ...): you would load your data into proper business objects, manipulate those, and store them in the database. Pretty much what you seem to be doing right now, although it sounds like there are better ORMs available, e.g. NHibernate.
PRO: even complex manipulations can be implemented in a clean, maintainable way, and you can use all the fancy tools like good IDEs, testing frameworks and CI systems to support you while developing the transformation.
CON: it adds a lot of overhead (retrieving the data out of the database, instantiating the objects, and marshalling the objects back into the target database). I'd go this way if it is a process that is going to be around for a long time.
use an ETL tool: there are special tools for extracting, transforming and loading data. They often support various databases, and have many strategies readily available for deciding whether an update or an insert is in order.
PRO: sorry, you'll have to ask somebody else for that; so far I have nothing but bad experience with those tools.
CON: a highly specialized tool that you need to master. In my personal experience: slower in implementation and execution of the transformation than handwritten SQL, and a nightmare for maintainability, since everything is hidden away in proprietary repositories, so for IDE, version control, CI and testing you are stuck with whatever the tool provider gives you, if any.
Building on the high-level-language option, you could further glorify the architecture by using messaging and web services, which could be relevant if you have more than one source database, or more than one target database. Or you could manually implement a multithreaded transformer in order to gain throughput. But I guess I am leaving the scope of your question.
I'm with John, SSIS is the way to go for any repeatable process to import large amounts of data. It should be much faster than the 30 hours you are currently getting. You could also write pure T-SQL code to do this if the two databases are on the same server or are linked servers. If you go the T-SQL route, you may need to do a hybrid of set-based and looping code to run in batches (of say 2000 records at a time) rather than lock up the table for the whole time a large insert would take.
I know that the title might sound a little contradictory, but what I'm asking is with regards to ORM frameworks (SQLAlchemy in this case, but I suppose this would apply to any of them) that allow you to define your schema within your application.
Is it better to change the database schema directly and then update the column types in your program manually, or does it make more sense to define the tables in your application and then use the ORM framework's table generation functions to make the schema and then build the tables on the database side for you?
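In SQLAlchemy, the second option would look roughly like this (SQLAlchemy 1.4+ assumed; table and column names are just for illustration):

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class User(Base):
        __tablename__ = "users"
        id = Column(Integer, primary_key=True)
        name = Column(String(100), nullable=False)

    engine = create_engine("sqlite:///app.db")
    Base.metadata.create_all(engine)  # emits the CREATE TABLE statements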
Bear in mind that applications and databases tend to live in a M:M relationship in any but the most trivial cases. If your application is at all likely to have interfaces to other systems, reports, data extracts or loads, or data migrated onto or off it from another system then the database has more than one stakeholder.
Be nice to the other stakeholders in your application. Take the time and get the schema right and put some thought into data quality in the design of your application. Keep an eye on anyone else using the application and make sure you don't break bits of the schema that they depend on without telling them. This means that the database has a life of its own to a greater or lesser extent. The more integration, the more independent the database.
Of course, if nobody else uses or cares about the data, feel free to ignore my advice.
My personal belief is that you should design the database on its own merits. The database is the best place to model your domain data. The database is also the biggest source of slowdowns in applications, and letting your ORM design your database seems like a bad idea to me. :)
Of course, I've only got a couple of big projects behind me. I'm still learning daily. :)
The best way to define your database schema is to start with modeling your application domain (domain driven design anyone?) and seeing what tables take shape based on the domain objects you define.
I think this is the best way because the database is really just one place to persist information from the application; it should never lead the design. It's not the only place to persist information, either. We have users that want to work from flat files or the database, for instance. They could also use XML files. So starting with your domain objects and then generating tables (or flat-file or XML schemas, or whatever) from there will lead to a much better design in the end.
While this may depend on you using an object-oriented language, using an ORM tool like Hibernate/NHibernate, SubSonic, etc. can really make this transition easy for you up to, and including generating the database creation scripts.
In reference to performance, performance should be one of the last things you look at in an application, it should never drive the design. After you get a good schema up and running based on your domain you can always make tweaks to improve its performance.
A lot depends on your skill level with the specific database product that you're going to use. Think of it as the difference between a "manual" and an "automatic" transmission car. ORMs provide you with that "automatic" transmission: just start designing your classes, and let the ORM worry about getting them stored in the database somehow.
Sounds good. The problem with most ORMs is that in their quest to be PI "persistence ignorant", they often don't take advantage of specific database features that can provide elegant solutions for a given task. Notice, I didn't say ALL ORMs, just most.
My take is to design the conceptual data model first yourself. Then you can go in either direction, up towards the application space, or down towards the physical database. But remember, only YOU know if it's more advantageous to use a view instead of a table, should you normalize or de-normalize a table, what non-clustered index(es) make sense with this table, is a natural or surrogate key more appropriate for this table, etc... Of course, if you feel that these questions are beyond your grasp, then let the ORM help you out.
One more thing: you really need to separate the application design from the database design. They are almost never the same. How important is that data? Could another application be designed to use that data? It's a lot easier to refactor an application than it is to refactor a database with a billion rows of data spread across thousands of tables.
Well, if you can get away with it, doing it in the application is probably the best way. Since it's a perfect example of the DRY principle.
Having said that, however, getting away with it is always going to be hard to pull off, since you're practically choosing to give up most database-specific optimizations (more so with querying, but it still applies to schemas: indexes, etc.).
You'll probably end up changing the schema by hand anyway, and then you'll be stuck with a brittle database schema that's going to be the source of your worst nightmares :)
My 2 Cents
Design each based on their own requirements as much as possible. Trying to keep them in too rigid sync is a good illustration of increased coupling/decreased cohesion.
Come to think of it, ORMs can easily be used to spread coupling (even though it can be avoided to some degree).