Transforming (Synchronizing) Data between SQL to HBase

Transforming (Synchronizing) Data between SQL to HBase - sql-server

We are overhauling our product by completely moving from Microsoft and .NET family to open source (well one of the reasons is cost cutting and exponential increase in data).
We plan to move our data model completely from SQL Server (relational data) to Hadoop (the famous key-Value pair ecosystem).
In the beginning, we want to support both versions (say 1.0 and new v2.0). In order to maintain the data consistency, we plan to sync the data between both systems, which is a fairly challenging task and error prone, but we don't have any other option.
A bit confused where to start from, I am looking up to the community of experts.
Any strategy/existing literature or any other kind of guidance in this direction would be greatly helpful.

I am not entirely sure how your code is structured, but if you currently have a data or persistence layer, or at least a database access class where all your SQL is executed through, you could override the save functions to write changes to both databases. If you do not have a data layer, you may want to considering writing one before starting the transition.
Otherwise, you could add triggers in MSSQL to update Hadoop, not sure what you can do in Hadoop to keep MSSQL in-sync.
Or, you could have a process that runs every x minutes, that manually syncs the two databases.
Personally, I would try to avoid trying to maintain two databases of record. Moving changes from a new, experimental database to your stable database seems risky. You stand the chance of corrupting your stable system. Instead, I would write a convertor to move data from your relational DB to Hadoop. Then every night or so, copy your data into Hadoop and use it for the development and testing of your new system. I think test users would understand if you said your beta version is just a test playground, and won't effect your live product. If you plan on making major changes to your UI and fear some will not want to transition to 2.0, then you might be trying to tackle too much at once.
Those are the solutions I came up with... Good luck!

Consider using a queuing tool like Flume (http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/) to split your input between both systems.

Related

Fooling Around With Databases, Online Development

I'm trying to design a database for a small project I'm working on. Eventually, I'd like to make it a web-app, but right now I don't mind just experimenting with data offline. However, I'm stuck in a crossroads.
The basic concept would be a user inputs values for 10 fields, to be compared against what is in the database, with each item having a weighted value. I know that if I was to code it, that I could use a look-up tables for each field, add up the values, and display the result to the end-user.
Another example would be having to get the distance between two points, each point stored in a row, with the X value getting its own column as well as the Y value.
Now, if I store data within a database, should I try to do everything within queries (which I think would involve temporary tables among other things), or just use simple queries, and manipulate the rows returned within the application code?
Right now, I'm thinking to go for the latter (manipulate data within the app) and just use queries to reduce the amount of data that I would have to go sort through. What would you guys suggest?
EDIT: Right now I'm using Microsoft Access to get the basics down pat and try to get a good design going. IIRC with my experience with Oracle and MySQL you can run commands together in a batch process and return just one result. But not sure if you can do that with Access.

If you're using a database I would strongly suggest using SQL to do all your manipulation. SQL is far more capable and powerful for this kind of job as compared to imperative programming languages.
Of course it does imply that you're comfortable in thinking about data as "sets" and programming in a declarative style. But spending time now to get really comfortable with SQL and manipulating data using SQL will pay off big time in the long run. Not only for this project but for projects in the future. I would also suggest using stored procedures over queries in code because stored procedure provide a beautiful abstraction layer allowing your table design to change over time without impacting the rest of the system.
A very big part of using and working with databases is understanding Data modeling, normalization and the like. Like everything else it will be a effort but in the long run it will pay off.
May I ask why you're using Access when you have a far better database available to you such as MSSQL Express? The migration path from MSSQL Express to MSSQL or SQL Azure even is quite seamless and everything you do and experience today (in this project) completely translates to MSSQL Server/SQL Azure for future projects as well as if this project grows beyond your expectations.
I don't understand your last statement about running a batch process and getting just one result, but if you can do it in Oracle and MySQL then you can do it in MSSQL Express as well.

What Shiv said, and also...
A good DBMS has quite a bit of solid engineering in it. There are two components that are especially carefully engineered, namely the query optimizer and the transaction controller. If you adopt the view of using the DBMS as just a stupid table retrieval tool, you will most likely end up inventing your own optimizer and transaction controller inside the application. You won't need the transaction controller until you move to an environment that supports multiple concurrent users.
Unless your engineering talents are extraordinary, you will probably end up with a home brew data management system that is not as good as the one in a good DBMS.
The learning curve for SQL is steep. You need to learn how to phrase queries that join, project, and restrict data from multiple tables. You need to learn how to handle updates in the context of a transaction.
You need to learn simple and sound table design and index design. This includes, but is not limited to, data normalization and data modeling. And you need a DBMS with a good optimizer and good transaction control.
The learning curve is steep. But the view from the top is worth the climb.

Using a common database for collaborative development

Some of the people in my project seem to think that using a common development database with everyone connecting to it is the best thing. I think that it isn't and each developer having his own database (with periodic updated data dumps) is the best. Am I right or wrong? Have you encountered any problems in any of these approaches?

Disk space and CPU should be cheap enough that every developer can run their own instance of the database, with an automated build under version control. This is needed to allow developers to be bold in hacking on the database, in isolation from any other developer's concurrent hacking.
The caveat being, of course, that any changes they make to their private instance are useless to anyone else unless it can be automatically applied during the build process. So there needs to be a firm policy that application code can't depend on any database state unless that state is represented by version-controlled, unit-tested changes to the DDL.
For an excellent guide on the theory and practice of treating the database definition as another part of the project code, and coordinating changes and refactorings, see Refactoring Databases: Evolutionary Database Design by Scott W. Ambler and Pramod Sadalage.

I like having my own copy of the database for development, because it gives you the flexibility to rapidly change things without worrying how it will impact others.
However, if all the developers are hacking away on their own copy of the database, it becomes more and more difficult to merge everyone's work together in the end.
I think you can get the best of both worlds by letting developers work on a local copy during day-to-day development, but each developer should probably merge their work into a common copy on a pretty regular basis. Writing a lot of unit tests helps too.

We share a single database amongst all our developer (20-odd) but we've got it structured so that everyone has their own tables.
You don't need a separate database per developer if you structure the application right. It should be configurable which database or table-prefix it uses anyway so you can easily move it between instances (unit test, system test, acceptance test, production, disaster recovery and so on).
The advantage to using a single database is that the cost of maintenance is amortized. You don't have your DBAs trying to handle a lot of databases (or, if you're a small-DB shop, you don't have every developer trying to maintain their own database when they're better utilized in developing).

Having a single point of Failure is not a good thing isn't it?

I prefer a single, shared database. But it's very dependent on the situation and the applications being developed.
What works for me may not work for you. Go with your gut.

If you are working with Hibernate or any hibernate-based platform you can configure your database to be created when you start your server (create-drop option). This is very useful when you are adding new attributes to your classes. If this is the case each developer must have his own copy of the DB.
If you are not changing the DB structure at all then you can use a single shared DB.
In this second case is not a must. I prefer to have my own DB where I can do whatever I want. On the other hand remember that some queries can take a lot of time and this will affect your whole team if you are sharing a DB.

In Memory Database

I'm using SqlServer to drive a WPF application, I'm currently using NHibernate and pre-read all the data so it's cached for performance reasons. That works for a single client app, but I was wondering if there's an in memory database that I could use so I can share the information across multiple apps on the same machine. Ideally this would sit below my NHibernate stack, so my code wouldn't have to change. Effectively I'm looking to move my DB from it's traditional format on the server to be an in memory DB on the client.
Note I only need select functionality.

I would be incredibly surprised if you even need to load all your information in memory. I say this because, just as one example, I'm working on a Web app at the moment that (for various reasons) loads thousands of records on many pages. This is PHP + MySQL. And even so it can do it and render a page in well under 100ms.
Before you go down this route make sure that you have to. First make your database as performant as possible. Now obviously this includes things like having appropriate indexes and tuning your database but even though are putting the horse before the cart.
First and foremost you need to make sure you have a good relational data model: one that lends itself to performant queries. This is as much art as it is science.
Also, you may like NHibernate but ORMs are not always the best choice. There are some corner cases, for example, that hand-coded SQL will be vastly superior in.
Now assuming you have a good data model and assuming you've then optimized your indexes and database parameters and then you've properly configured NHibernate, then and only then should you consider storing data in memory if and only if performance is still an issue.
To put this in perspective, the only times I've needed to do this are on systems that need to perform millions of transactions per day.
One reason to avoid in-memory caching is because it adds a lot of complexity. You have to deal with issues like cache expiry, independent updates to the underlying data store, whether you use synchronous or asynchronous updates, how you give the client a consistent (if not up-to-date) view of your data, how you deal with failover and replication and so on. There is a huge complexity cost to be paid.
Assuming you've done all the above and you still need it, it sounds to me like what you need is a cache or grid solution. Here is an overview of Java grid/cluster solutions but many of them (eg Coherence, memcached) apply to .Net as well. Another choice for .Net is Velocity.
It needs to be pointed out and stressed that something like NHibernate is only consistent so long as nothing externally updates the database and that there is exactly one NHibernate-enabled process (barring clustered solutions). If two desktop apps on two different PCs are both updating the same database with NHibernate the caching simply won't work because the persistence units simply won't be aware of the changes the other is making.

http://www.db4o.com/ can be your friend!

Velocity is an out of process object caching server designed by Microsoft to do pretty much what you want although it's only in CTP form at the moment.
I believe there are also wrappers for memcached, which can also be used to cache objects.

You can use HANA, express edition. You can download it for free, it's in-memory, columnar and allows for further analytics capabilities such as text analytics, geospatial or predictive. You can also access with ODBC, JDBC, node.js hdb library, REST APIs among others.

The best technology to synchronize data between different database schemas?

I have an existing SQL Server 2005 database that runs our accounting/inventory application. We are looking at using a new on-line ordering framework - which has it's own database.
If we use this new framework, we will need to transfer the on-line ordering data (inventory, prices, orders, customers) - almost realtime - to and from, our existing inventory database. The transfer of data doesn't have to be real-time, but it has to be quick. Both databases will be in SQL Server.
So my question is... what is the best way to transer data back and forth between two databases, with have different schemas?
Replication? SSIS? What would you suggest, and why?
Any help would be appreciated!

The Business Rules are the Hard Part
One-way sync? Two-way sync? Real-time push? Nightly updates? Dump and reload? Compare and update? Conflict resolution? Which side wins? Push read-only info one way, and order info the other way? What about changes/cancellations/etc? Do order statuses get pushed back?
You can see where I'm going here. Technology is a secondary question.
Because of the business rules issue, and because the two systems have different schemas (and different purposes), this isn't a standard data move, and most of the "standard" answers (replication, log shipping, etc) are off the table.
There are frameworks out there designed to help with this, like Microsoft BizTalk or Scribe Insight. These are cumbersome and expensive, though.
It isn't too difficult to create a custom queue-ing system either based on SQL triggers, or scheduled pushes (depending on your needs) in C# or your favorite language. That's probably the route I would go. It would probably involve a third "transfer" database to hold the queue of changes made by one side, and a module to apply the business rules and push the data to the other.

Personally I would run away from this nightmare as fast as I could. Since you have not yet bought this online ordering I would suggest that keeping the data in synch with the existing application is a valid reason for not doing such a thing. If you buy this you will eternally regret how mucked up your data will become and how much time and money you spend trying to get things to work properly. This is a disaster waiting to happen. You''ll end up having people order items supposedly in iventory when there are none in the warehouse. Do not do this. This is a guarantee of angry customers and angry managers. Far far cheaper over time to hire some developers to put together your own online ordering that accesses your data base. If they go ahead over your objections, I'd update my resume.

From personal experience, I would only use replication if there was no other choice. You have to tear it down for any schema change and it has a tendency to just blow up.
For this, I'd most likely use SSIS. It's fairly easy to build a transformation package and fairly simple to maintain.

Replication works well, and if it's two way, it might be your only viable option, since conflict resolution is built in.
If you're going one-way, SSIS or triggers on tables would fine, and would push the data real-time (for triggers) or at whatever interval you want (SSIS). The upside to SSIS would be that it's a background process, whereas triggers could potentially hold up transactions on the supply side while they push the data.
If you're looking to move massive amounts of data, there are other products out there that can do it for you, but if it's not too much data, a solution using SQL Servers tools should do all you need.

What are the advantages of using a single database for EACH client?

In a database-centric application that is designed for multiple clients, I've always thought it was "better" to use a single database for ALL clients - associating records with proper indexes and keys. In listening to the Stack Overflow podcast, I heard Joel mention that FogBugz uses one database per client (so if there were 1000 clients, there would be 1000 databases). What are the advantages of using this architecture?
I understand that for some projects, clients need direct access to all of their data - in such an application, it's obvious that each client needs their own database. However, for projects where a client does not need to access the database directly, are there any advantages to using one database per client? It seems that in terms of flexibility, it's much simpler to use a single database with a single copy of the tables. It's easier to add new features, it's easier to create reports, and it's just easier to manage.
I was pretty confident in the "one database for all clients" method until I heard Joel (an experienced developer) mention that his software uses a different approach -- and I'm a little confused with his decision...
I've heard people cite that databases slow down with a large number of records, but any relational database with some merit isn't going to have that problem - especially if proper indexes and keys are used.
Any input is greatly appreciated!

Assume there's no scaling penalty for storing all the clients in one database; for most people, and well configured databases/queries, this will be fairly true these days. If you're not one of these people, well, then the benefit of a single database is obvious.
In this situation, benefits come from the encapsulation of each client. From the code perspective, each client exists in isolation - there is no possible situation in which a database update might overwrite, corrupt, retrieve or alter data belonging to another client. This also simplifies the model, as you don't need to ever consider the fact that records might belong to another client.
You also get benefits of separability - it's trivial to pull out the data associated with a given client ,and move them to a different server. Or restore a backup of that client when the call up to say "We've deleted some key data!", using the builtin database mechanisms.
You get easy and free server mobility - if you outscale one database server, you can just host new clients on another server. If they were all in one database, you'd need to either get beefier hardware, or run the database over multiple machines.
You get easy versioning - if one client wants to stay on software version 1.0, and another wants 2.0, where 1.0 and 2.0 use different database schemas, there's no problem - you can migrate one without having to pull them out of one database.
I can think of a few dozen more, I guess. But all in all, the key concept is "simplicity". The product manages one client, and thus one database. There is never any complexity from the "But the database also contains other clients" issue. It fits the mental model of the user, where they exist alone. Advantages like being able to doing easy reporting on all clients at once, are minimal - how often do you want a report on the whole world, rather than just one client?

Here's one approach that I've seen before:
Each customer has a unique connection string stored in a master customer database.
The database is designed so that everything is segmented by CustomerID, even if there is a single customer on a database.
Scripts are created to migrate all customer data to a new database if needed, and then only that customer's connection string needs to be updated to point to the new location.
This allows for using a single database at first, and then easily segmenting later on once you've got a large number of clients, or more commonly when you have a couple of customers that overuse the system.
I've found that restoring specific customer data is really tough when all the data is in the same database, but managing upgrades is much simpler.
When using a single database per customer, you run into a huge problem of keeping all customers running at the same schema version, and that doesn't even consider backup jobs on a whole bunch of customer-specific databases. Naturally restoring data is easier, but if you make sure not to permanently delete records (just mark with a deleted flag or move to an archive table), then you have less need for database restore in the first place.

To keep it simple. You can be sure that your client is only seeing their data. The client with fewer records doesn't have to pay the penalty of having to compete with hundreds of thousands of records that may be in the database but not theirs. I don't care how well everything is indexed and optimized there will be queries that determine that they have to scan every record.

Well, what if one of your clients tells you to restore to an earlier version of their data due to some botched import job or similar? Imagine how your clients would feel if you told them "you can't do that, since your data is shared between all our clients" or "Sorry, but your changes were lost because client X demanded a restore of the database".

As for the pain of upgrading 1000 database servers at once, some fairly simple automation should take care of that. As long as each database maintains an identical schema, then it won't really be an issue. We also use the database per client approach, and it works well for us.
Here is an article on this exact topic (yes, it is MSDN, but it is a technology independent article): http://msdn.microsoft.com/en-us/library/aa479086.aspx.
Another discussion of multi-tenancy as it relates to your data model here: http://www.ayende.com/Blog/archive/2008/08/07/Multi-Tenancy--The-Physical-Data-Model.aspx

Scalability. Security. Our company uses 1 DB per customer approach as well. It also makes code a bit easier to maintain as well.

In regulated industries such as health care it may be a requirement of one database per customer, possibly even a separate database server.
The simple answer to updating multiple databases when you upgrade is to do the upgrade as a transaction, and take a snapshot before upgrading if necessary. If you are running your operations well then you should be able to apply the upgrade to any number of databases.
Clustering is not really a solution to the problem of indices and full table scans. If you move to a cluster, very little changes. If you have have many smaller databases to distribute over multiple machines you can do this more cheaply without a cluster. Reliability and availability are considerations but can be dealt with in other ways (some people will still need a cluster but majority probably don't).
I'd be interested in hearing a little more context from you on this because clustering is not a simple topic and is expensive to implement in the RDBMS world. There is a lot of talk/bravado about clustering in the non-relational world Google Bigtable etc. but they are solving a different set of problems, and lose some of the useful features from an RDBMS.

There are a couple of meanings of "database"
the hardware box
the running software (e.g. "the oracle")
the particular set of data files
the particular login or schema
It's likely Joel means one of the lower layers. In this case, it's just a matter of software configuration management... you don't have to patch 1000 software servers to fix a security bug, for example.
I think it's a good idea, so that a software bug doesn't leak information across clients. Imagine the case with an errant where clause that showed me your customer data as well as my own.