Should I use messaging instead of a database


I am designing a system that will allow users to take data from one system and send to other systems. One of the destination systems has a sophisticated SOA (web services) and the other is a mainframe that accepts flat files for input.
I have created a database that has a PublishEvent table and PublishEventType table. There are also normalized tables that are specific to the type of event being published.
I also have an "interface" table that is a flattened-out version of the normalized data tables. The end user has a process that puts data into the interface table. I am not sure of the exact process; I think it's some kind of reporting application that lets them export results to a SQL table. I then use an SSIS package to take the data out of the interface table, put it into the normalized data structure and create new rows in the PublishEvent table. I use the flat table because when I first showed them the relational tables they seemed to be very confused.
I have a windows service that watches for new rows in the PublishEvent table. The windows service is extended with plug-ins (using the MEF framework). Which plug-in is called depends on the value of the PublishEventTypeID field in the PublishEvent row.
PublishEventTypeID 1 calls the plug-in that reads data from one set of tables and calls the SOA web service. PublishEventTypeID 2 calls the plug-in that reads data from a different set of tables and creates the flat file to be sent to the mainframe.
This seems like I am implementing the "Database as IPC" anti-pattern. Should I change my design to use a messaging-based system? Is the process of putting data into the flat table and then into the normalized tables redundant?
EDIT: This is being developed in .NET 3.5
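For illustration, the polling-and-dispatch design described above can be sketched in a few lines. This is a language-neutral sketch written in Python for brevity, not the actual .NET service: the `Processed` flag, the `PublishEventID` column and both handler functions are assumptions, since the question names only the PublishEvent table and the PublishEventTypeID field.

```python
import sqlite3

# Hypothetical handlers standing in for the two plug-ins in the question.
def publish_to_web_service(event_id, sent):
    sent.append(("web_service", event_id))

def publish_to_mainframe(event_id, sent):
    sent.append(("flat_file", event_id))

# Maps PublishEventTypeID to a handler, much as the plug-in lookup would.
HANDLERS = {1: publish_to_web_service, 2: publish_to_mainframe}

def process_pending(conn, sent):
    """Dispatch each unprocessed event once, in insertion order."""
    rows = conn.execute(
        "SELECT PublishEventID, PublishEventTypeID FROM PublishEvent "
        "WHERE Processed = 0 ORDER BY PublishEventID"
    ).fetchall()
    for event_id, type_id in rows:
        HANDLERS[type_id](event_id, sent)
        # Mark the row as done only after the handler returns without raising.
        conn.execute(
            "UPDATE PublishEvent SET Processed = 1 WHERE PublishEventID = ?",
            (event_id,),
        )
        conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PublishEvent ("
    "PublishEventID INTEGER PRIMARY KEY, "
    "PublishEventTypeID INTEGER NOT NULL, "
    "Processed INTEGER NOT NULL DEFAULT 0)"  # assumed column, not in the question
)
conn.executemany(
    "INSERT INTO PublishEvent (PublishEventTypeID) VALUES (?)", [(1,), (2,), (1,)]
)
sent = []
first_pass = process_pending(conn, sent)   # handles all three rows
second_pass = process_pending(conn, sent)  # finds nothing left to do
```

The table here plays the role a queue would play in a messaging system: rows are the messages and the flag is the acknowledgement.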

A MOM (message-oriented middleware) is probably the better solution, but you also have to take the following points into account:
Do you already have a message-based system in place as part of your customer's architecture? If not, introducing one may be overkill.
Do you have any experience with message-based systems? As Jason Plank correctly mentioned, you have to take into account patterns specific to them, such as ensuring the chronological order of messages, managing dead letter channels and so on (see this book for more).
You mentioned a mainframe system which apparently has limited options for interfacing. Who will take care of the layer that transforms "messages" (either DB or MOM based) into something the mainframe can digest? Assuming it is you, would it be easier (for you) to do that by accessing the DB (maybe you have already worked on the problem in the past), or would the effort be different depending on whether you use a DB or a MOM?
To sum it up: if you are more confident going the DB route, maybe it's better to do that, even if, as you correctly suggested yourself, it is a bit of an "anti-pattern".

Some key items to keep in mind are:
Row order consistency - Does your data model depend on the order in which the data is generated? If so, does your scheme ensure that publish and subscribe activity happens in the same order in which the original data was created?
Do you have identity columns on either side? They are a problem since their value keeps changing based on the order the data is inserted. If Identity column is the sole primary key (surrogate key), a change in its value may make the data unusable.
How do you prove that you have not lost a record? This is the trickiest part of the solution, especially if you have millions of rows.
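One way to approach the lost-record question is to compare a row count and a content digest on both sides. The following is a minimal sketch under stated assumptions: the `orders` table, its columns and the helper names are made up for illustration, and rows are ordered by a natural key rather than an identity column so that insertion order does not affect the result.

```python
import hashlib
import sqlite3

def table_fingerprint(conn, table, key_column):
    """Return (row_count, digest) for a table, reading rows in key order."""
    digest = hashlib.sha256()
    count = 0
    # Table and column names are interpolated, so they must come from trusted code.
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key_column}"):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()

def make_db(rows):
    """Build an in-memory database holding the hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_no TEXT PRIMARY KEY, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn

source = make_db([("A-1", 10), ("A-2", 20), ("A-3", 30)])
# Same rows inserted in a different order: should still match the source.
complete_copy = make_db([("A-3", 30), ("A-1", 10), ("A-2", 20)])
# One row missing: should be detected.
lossy_copy = make_db([("A-1", 10), ("A-3", 30)])

source_print = table_fingerprint(source, "orders", "order_no")
complete_print = table_fingerprint(complete_copy, "orders", "order_no")
lossy_print = table_fingerprint(lossy_copy, "orders", "order_no")
```

With millions of rows, the same idea can be applied per key range so that a mismatch narrows down where the loss occurred.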
As for the architecture, you may want to check out XMPP: Smack for the client (if you use Java) and ejabberd for the server.

Have a look at NServiceBus, MassTransit or Rhino Service Bus if you're using .NET.

Related

Alternative to Apatar for scripted data migration

I'm looking for the fastest-to-success alternative solution for related data migration between Salesforce environments with some specific technical requirements. We were using Apatar, which worked fine in testing, but late in the game it has started throwing the dreaded socket "connection reset" errors and we have not been able to resolve it - it has some other issues that are leading me to ditch it.
I need to move a modest amount of data (about 10k rows total) between several sandboxes and ultimately to a production environment. The data is spread across eight custom objects. There is a four-level-deep master-detail relationship, which obviously must be preserved.
The target environment tables are 100% empty.
The trickiest object has a master-detail and two lookup fields.
Ideally, the data from one table near the top of the hierarchy should be filtered by a simple WHERE, and children that are not under matching rows should not be migrated, but I'll settle for a solution that migrates all the data wholesale.
My fallback in this situation is going to be good old Data Loader, but it's not ideal because our schema is locked down and does not contain external ID fields, so scripting a solution that preserves all the M-D and lookups will take a while and be more error prone than I'd like.
It's been a long time since I've done a survey of the tools available, and I don't have much time to do one now, so I'm appealing to the crowd. I need an app that will be simple (able to configure and test very quickly), repeatable, and rock-solid.
I've always pictured an SFDC data migration app that you can just check off eight checkboxes from a source environment, point it to a destination environment, and it just works, preserving all your relationships. That would be perfect. Free would be nice too. Does such a shiny thing exist?

Sesame Relational Junction seems to best match what you're looking for. I haven't used it, though, so I can't comment on its effectiveness for what you're attempting.
The other route you may want to look into is using the Bulk API or using the Data Loader CLI with Task Scheduling.
You may find this information (below), from an answer to a different question, helpful.
Here is a list of integration services (other than Apatar):
Informatica Cloud
Cast Iron
SnapLogic
Boomi
JitterBit
Sesame Relational Junction
Information on other tools, to integrate Salesforce with other databases, is available here:
Salesforce Web Services API
Salesforce Bulk API

Relational Junction has a unique feature set that supports cloning, splitting, and merging of Salesforce orgs, and will keep the relationships intact in a one-pass load. It works like this:
Download source org to an empty database schema (any relational DBMS)
Download target org to a second empty database schema
Run some scripts to condition the data; this varies by object. Sesame provides guidance and sample scripts, but essentially you have to set a control field to tell Relational Junction to create or update Salesforce. This is also where you may need to replace source IDs with target IDs if some objects have been pre-populated during sandbox creation
Replicate the second database to the target org
Relational Junction handles the socket disconnects, timeouts, and whatever havoc happens during the unload/reload process gracefully and without creating duplicates.
This process was developed for a proof of concept at a large Silicon Valley network vendor in 2007, who became a customer. The entire down and up of 15 GB of data took 46 hours, plus about 2 days of preparation.
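The ID replacement mentioned in the steps above is the core of any relationship-preserving migration, whatever tool performs it. The sketch below shows the general technique only; it is not Relational Junction's process and does not call any Salesforce API. The object names, field names and the `fake_insert` stand-in are all hypothetical.

```python
def migrate(source_rows, load_order, parent_fields, insert_row):
    """Copy rows object by object, rewriting parent references to new Ids.

    source_rows:   {object_name: [row_dict, ...]}, each row has an "Id"
    load_order:    object names, parents before children
    parent_fields: {object_name: [field names holding a parent Id]}
    insert_row:    callable(object_name, row_without_id) -> new Id
    """
    id_map = {}  # source Id -> target Id
    for object_name in load_order:
        for row in source_rows[object_name]:
            new_row = {k: v for k, v in row.items() if k != "Id"}
            for field in parent_fields.get(object_name, []):
                if new_row.get(field) is not None:
                    # The parent was loaded earlier, so its new Id is known.
                    new_row[field] = id_map[new_row[field]]
            id_map[row["Id"]] = insert_row(object_name, new_row)
    return id_map

# Stand-in for the target org: an in-memory store that assigns its own Ids.
target = {}

def fake_insert(object_name, row):
    new_id = f"T{len(target) + 1:03d}"
    target[new_id] = (object_name, row)
    return new_id

source = {
    "Account__c": [{"Id": "S001", "Name": "Acme"}],
    "Project__c": [{"Id": "S002", "Name": "Rollout", "Account__c": "S001"}],
    "Task__c": [{"Id": "S003", "Name": "Kickoff", "Project__c": "S002"}],
}
id_map = migrate(
    source,
    load_order=["Account__c", "Project__c", "Task__c"],
    parent_fields={"Project__c": ["Account__c"], "Task__c": ["Project__c"]},
    insert_row=fake_insert,
)
```

Loading parents before children is what makes a single pass possible; a filter on a top-level object could be applied by skipping any child whose parent Id is absent from `id_map`.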

Updating data from several different sources

I'm in the process of setting up a database with customer information. The database will handle customer data (customer ID, address, phone number etc.) as well as some basic information about which kinds of advertisement a specific customer has been subjected to, and how they reacted to them.
The data will be maintained from a central data warehouse, but additional information about customers and advertisements will also be updated from other sources. For example, if an external advertisement agency runs a campaign, I want them to be able to feed back data about opt-outs, e-mail bounces etc. I guess what I need is an API which can be easily handed out to any number of agencies.
My first thought was to set up a web service API for all external sources, but since we'll probably be talking large amounts of data (millions of records per batch) I'm not sure a web service is the best option.
So my question is, what's the best practice here? I need a solution simple enough for advertisement agencies (likely with moderately skilled IT-people) to make use of. Simplicity is of the essence – by which I mean “simplicity over performance” in this case. If the set up gets too complex, it won't work.
The system will very likely be based on Microsoft technology.
Any suggestions?

The process you're describing is commonly referred to as data integration using ETL processes. ETL stands for Extract-Transform-Load. The idea is to build up your central data warehouse by extracting information from a lot of different data sources, transforming it and then loading it into your data warehouse.
A variety of tools (some of them graphical) exist to implement such a process. Since you said you'll probably be running a Microsoft stack, I suggest having a look at SQL Server Integration Services (SSIS).
Regarding your suggestion to implement integration using a web service, I don't think that's a good idea. Similarly, I don't think shifting the burden of data integration to your customers is a good idea either. You should agree with your customers on some form of data exchange format; it could be as simple as a CSV file, or XML, Excel sheets or Access databases. Use whatever suits your needs.
Any modern ETL tool like SSIS is capable of working with those different data sources.
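To make the three ETL stages concrete, here is a minimal sketch that uses a CSV file as the agreed exchange format. It only illustrates the idea and is not how SSIS works internally; the column names, the `campaign_feedback` table and the sample rows are all assumptions.

```python
import csv
import io
import sqlite3

def extract(csv_text):
    """Extract: parse the agency's CSV export into dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(records):
    """Transform: normalise values and drop rows that fail validation."""
    clean = []
    for rec in records:
        customer_id = rec["customer_id"].strip()
        event = rec["event"].strip().lower()
        if customer_id.isdigit() and event in {"optout", "bounce"}:
            clean.append((int(customer_id), event, rec["date"].strip()))
    return clean

def load(conn, rows):
    """Load: insert the cleaned rows in a single transaction."""
    with conn:
        conn.executemany(
            "INSERT INTO campaign_feedback (customer_id, event, event_date) "
            "VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)

# Made-up sample file; the last two rows are deliberately invalid.
agency_file = """customer_id,event,date
1001,OptOut,2010-03-01
1002, bounce ,2010-03-02
abc,optout,2010-03-02
1003,clicked,2010-03-03
"""

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE campaign_feedback (customer_id INTEGER, event TEXT, event_date TEXT)"
)
loaded = load(warehouse, transform(extract(agency_file)))
```

Keeping validation in the transform step means the agencies only have to produce a file, which fits the "simplicity over performance" requirement in the question.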

Can you provide some advice on setting up my database?

I'm working on a MUD (Multi User Dungeon) in Python and am just now getting around to the point where I need to add some rooms, enemies, items, etc. I could hardcode all this in, but it seems like this is more of a job for a database.
However, I've never really done any work with databases before so I was wondering if you have any advice on how to set this up?
What format should I store the data in?
I was thinking of storing a Dictionary object in the database for each entity. In this way, I could then simply add new attributes to the database on the fly without altering the columns of the database. Does that sound reasonable?
Should I store all the information in the same database but in different tables, or different entities (enemies and rooms) in different databases?
I know this will be a can of worms, but what are some suggestions for a good database? Is MySQL a good choice?

1) There's almost never any reason to have data for the same application in different databases. Not unless you're a Fortune 500 size company (OK, I'm exaggerating).
2) Store the info in different tables.
As an example:
T1: Rooms
T2: Room common properties (applicable to every room), with a row per room
T3: Room unique properties (applicable to a minority of rooms), with a row per property per room; this makes it easy to add custom properties without adding new columns
T4: Room-Room connections
Having T2 AND T3 is important, as it allows you to combine the efficiency and speed of the row-per-room idea, where it's applicable, with the flexibility, maintainability and space saving of the attribute-per-entity-per-row schema (or object/attribute/value, as IIRC it's called in fancy terms).
Good discussion is here
3) Implementation-wise, try to write something reusable, e.g. have generic "Get_room" methods which access the DB underneath, ideally via Transact-SQL or ANSI SQL so you can survive a change of DB back end fairly painlessly.
For initial work, you can use SQLite. Cheap, easy and SQL compatible (the best property of all). Installation is pretty much nothing, and DB management can be done with freeware tools or even a Firefox plugin IIRC (all of Firefox 3's data stores, such as history, bookmarks and places, are SQLite databases).
For later, either MySQL or Postgres (I don't use either one professionally so can't recommend one). IIRC at some point Sybase had a free personal DB server as well, but I have no idea if that's still the case.
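The four-table layout and the generic "Get_room" accessor described above could look roughly like this in SQLite. This is a sketch under assumptions: the table names, column names and sample rooms are invented for illustration, and only the T1 to T4 split comes from the answer.

```python
import sqlite3

SCHEMA = """
CREATE TABLE rooms (room_id INTEGER PRIMARY KEY, name TEXT NOT NULL);   -- T1
CREATE TABLE room_common (                                              -- T2
    room_id INTEGER PRIMARY KEY REFERENCES rooms(room_id),
    description TEXT NOT NULL,
    is_dark INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE room_properties (                                          -- T3
    room_id INTEGER NOT NULL REFERENCES rooms(room_id),
    name TEXT NOT NULL,
    value TEXT NOT NULL,
    PRIMARY KEY (room_id, name)
);
CREATE TABLE room_connections (                                         -- T4
    from_room INTEGER NOT NULL REFERENCES rooms(room_id),
    direction TEXT NOT NULL,
    to_room INTEGER NOT NULL REFERENCES rooms(room_id),
    PRIMARY KEY (from_room, direction)
);
"""

def get_room(conn, room_id):
    """Assemble one room from the four tables; callers never see SQL."""
    name, description, is_dark = conn.execute(
        "SELECT r.name, c.description, c.is_dark FROM rooms r "
        "JOIN room_common c ON c.room_id = r.room_id WHERE r.room_id = ?",
        (room_id,),
    ).fetchone()
    extras = dict(conn.execute(
        "SELECT name, value FROM room_properties WHERE room_id = ?", (room_id,)
    ))
    exits = dict(conn.execute(
        "SELECT direction, to_room FROM room_connections WHERE from_room = ?",
        (room_id,),
    ))
    return {"name": name, "description": description, "is_dark": bool(is_dark),
            "extras": extras, "exits": exits}

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO rooms VALUES (?, ?)", [(1, "Cellar"), (2, "Hall")])
conn.executemany("INSERT INTO room_common VALUES (?, ?, ?)",
                 [(1, "A damp cellar.", 1), (2, "A long hall.", 0)])
conn.execute("INSERT INTO room_properties VALUES (1, 'smell', 'mildew')")
conn.executemany("INSERT INTO room_connections VALUES (?, ?, ?)",
                 [(1, "up", 2), (2, "down", 1)])
cellar = get_room(conn, 1)
hall = get_room(conn, 2)
```

Because the game code only calls `get_room`, moving from SQLite to another SQL back end later should mostly mean changing the connection, not the callers.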

This technique is called the entity-attribute-value model. It's normally preferable to have a DB schema that reflects the structure of the objects, and to update the schema when your object structure changes. Such a strict schema is easier to query, and it makes it easier to ensure at the database level that the data is correct.
One database with multiple tables is the way to go.
If you want a database server, I'd recommend PostgreSQL. MySQL has some advantages, like easy replication, but PostgreSQL is generally nicer to work with. If you want something smaller that works directly with the application, SQLite is a good embedded database.

Storing an entire object (serialized/encoded) as a value in the database is bad for querying. I am sure that some queries in your MUD will NOT need 100% of the attributes, and others will need to retrieve a list of objects by the value of an attribute.

"it seems like this is more of a job for a database"
True, although 'database' doesn't have to mean 'relational database'. Most existing MUDs store all data in memory, and read it in from flat files saved in a plain-text format. I'm not necessarily recommending this route, just pointing out that a traditional database is by no means necessary. If you do want to go the relational route, recent versions of Python come with the sqlite3 module, which provides SQLite, a lightweight embedded relational database with good SQL support.
Using relational databases with your code can be awkward. Any change to a game logic class can require a parallel change to the database, and changes to the code that read and write to the database. For this reason good planning will help you a lot, but it's hard to plan a good database schema without experience. At least get your entity classes planned first, then build a database schema around it. Reading up on normalizing a database and understanding the principles there will help.
You may want to use an 'object-relational mapper' which can simplify a lot of this for you. Examples in Python include SQLObject, SQLAlchemy, and Autumn. These hide a lot of the complexities for you, but as a result can hide some of the important details too. I'd recommend using the database directly until you are more familiar with it, and consider using an ORM in the future.
"I was thinking of storing a Dictionary object in the database for each entity. In this way, I could then simply add new attributes to the database on the fly without altering the columns of the database. Does that sound reasonable?"
Unfortunately not - if you do that, you waste 99% of the capabilities of the database and are effectively using it as a glorified data store. However, if you don't need aforementioned database capabilities, this is a valid route if you use the right tool for the job. The standard shelve module is well worth looking at for this purpose.
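For the dictionary-per-entity route, a minimal sketch with the standard `shelve` module might look like this. The key scheme (`"room:1"`, `"enemy:rat"`), the entity fields and the temporary save location are assumptions made for the example.

```python
import os
import shelve
import tempfile

# A temporary directory keeps the example self-contained; a real game would
# use a fixed path for its save file.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "world")

# Each entity is stored as a plain dictionary under a string key.
with shelve.open(path) as db:
    db["room:1"] = {"name": "Cellar", "exits": {"up": "room:2"}}
    db["enemy:rat"] = {"name": "Rat", "hp": 3}

# Adding an attribute later needs no schema change: read, modify, write back.
with shelve.open(path) as db:
    rat = db["enemy:rat"]
    rat["poisonous"] = True
    db["enemy:rat"] = rat  # reassign so the change is persisted

with shelve.open(path) as db:
    reloaded_rat = db["enemy:rat"]
    stored_keys = sorted(db.keys())
```

The trade-off is the one described above: finding, say, every enemy with fewer than five hit points means loading and inspecting every entry, because the store cannot query inside the dictionaries.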
"Should I store all the information in the same database but in different tables or different entities (enemies and rooms) in different databases."
One database. One table in the database per entity type. That's the typical approach when using a relational database (eg. MySQL, SQL Server, SQLite, etc).
"I know this will be a can of worms, but what are some suggestions for a good database? Is MySQL a good choice?"
I would advise sticking with SQLite until you're more familiar with SQL. Otherwise, MySQL is a reasonable choice for a free game database, as is PostgreSQL.

One database. Each database table should refer to an actual data object.
For instance, create a table for all items, all creatures, all character classes, all treasures, etc.
Spend some time now and figure out how objects will relate to each other, as this will affect your database structure. For example, can a character have more than one character class? Can monsters have character classes? Can monsters carry items? Can rooms have more than one monster?
It seems pedantic, but you'll save yourself a whole lot of trouble early by figuring out what database objects "belong" to which other database objects.

Synchronising data entities from different applications

I'm looking for some feedback on the best approach to a problem I've been tasked with. There are two systems with their own databases which store very similar business entities.
For each entity in question there needs to be a synchronization mechanism in place to make sure that changes in one database are delivered to the other when a change occurs and for the changes to be translated into the destination table structure. This translation means that replication is not an option but I don't want to start writing bespoke triggers or views etc to keep them in sync.
Is this something which BizTalk or a similar product could handle after an initial configuration / mapping process? Also, is BizTalk potentially overkill, and are there any other methods which I could employ to achieve this?
Thanks,
Brian.

It depends on the size of the "systems" (tables?) to synchronise.
EAI (enterprise application integration) tools are the general answer for this: they connect two systems that can't interact directly, effectively mapping one business object to another by applying a map that translates one into the other.
But such tools (like webMethods, for example) are enterprise tools; if you only need to synchronise two tables from two systems, EAI will clearly be overkill.
The principles can still help you, though. The EAI approach would be to have a generic business object that matches all of the properties found in both systems for the business objects you want to synchronise. Then you need some sort of map to translate each application-specific business object to and from your generic business object. Your object should describe not only the business data but also the operation to perform (create, update or delete).
Then you need a trigger (or two if you want to synchronise both ways) to detect when a change happens, and to use the map to transform the data the trigger gets into a generic object (with the operation to perform at the other end).
And finally you need an "updater" that will take the specific business object and perform the right operation in the database (insert/update/delete).
EAI tools provide connectors to take care of triggering the workflow and updating the database. You will still need to define some mappings, in a way that depends on the EAI tool used.
EAI tools are a lot more powerful than just synchronising two tables. Connectors come in various types and can interact with various systems (including proprietary ones), various databases, simple formats (XML, text) or specific protocols (FTP, web services, etc.).
EAI tools also ensure that any modification is effectively committed at the end.
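The principle of a generic business object, a map on each side and an updater can be sketched without any EAI product. Everything named below is hypothetical: the `CustomerChange` fields, the column names of both systems and the dictionary standing in for the destination table.

```python
from dataclasses import dataclass

OPERATIONS = {"create", "update", "delete"}

@dataclass
class CustomerChange:
    """Generic business object: the data plus the operation to perform."""
    operation: str
    customer_id: str
    full_name: str
    email: str

def from_system_a(operation, row):
    """Map a row from system A (hypothetical columns) to the generic object."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation}")
    return CustomerChange(
        operation=operation,
        customer_id=str(row["cust_no"]),
        full_name=f'{row["first_name"]} {row["last_name"]}',
        email=row["email_addr"].lower(),
    )

def to_system_b(change):
    """Map the generic object to system B's (hypothetical) column names."""
    return {"ID": change.customer_id, "NAME": change.full_name, "MAIL": change.email}

def apply_to_system_b(table, change):
    """Updater: perform the requested operation on system B's table."""
    if change.operation == "delete":
        table.pop(change.customer_id, None)
    else:  # create and update are both treated as an upsert in this sketch
        table[change.customer_id] = to_system_b(change)

system_b = {}  # stands in for the destination database table
row_a = {"cust_no": 42, "first_name": "Ada", "last_name": "Lovelace",
         "email_addr": "Ada@Example.com"}

apply_to_system_b(system_b, from_system_a("create", row_a))
after_create = dict(system_b)
apply_to_system_b(system_b, from_system_a("delete", row_a))
```

In a real setup the call to `from_system_a` would be made by whatever detects the change (a database trigger, a polling job or an EAI connector), and synchronising in the other direction needs the mirror-image pair of maps.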
Hope it helps.

SQL Server Integration Services could be a cheap candidate for solving the problem (it can connect to DBs and data sources other than SQL Server). SSIS is part of all SQL Server installations (with the exception of Express).

There is a nifty tool called "datariver" by the Swiss company Sowatec (where I worked a few years ago; I wasn't involved with this product, though, just so you know). It's meant to flow data from sources to sinks (just like a river).
The web site is in German but the guys behind it are happy to answer any of your questions in English by mail.

BizTalk would be an ideal solution for this kind of problem.
What can BizTalk do?
1. Define a schema which represents a common business entity; this is essentially all the fields which need to be kept in sync across several database tables.
2. Define the flow of communication (orchestrations) and end-points (web services), i.e. which update triggers which changes.
3. Use maps to map the common business entity into the specific data elements required by the databases. Note that BizTalk has built-in adapters to speed up the development process.
Adequate time must be spent on the design of this system, and the results would be fabulous.
For development purposes, refer to my articles (Google keywords: Biztalk + Karamchetti).

How do you handle small sets of data?

With really small sets of data, the policy where I work is generally to stick them into text files, but in my experience this can be a development headache. Data generally comes from the database and when it doesn't, the process involved in setting it/storing it is generally hidden in the code. With the database you can generally see all the data available to you and the ways with which it relates to other data.
Sometimes for really small sets of data I just store them in an internal data structure in the code (like a Perl hash), but then when a change is needed, it's in the hands of a developer.
So how do you handle small sets of infrequently changed data? Do you have set criteria of when to use a database table or a text file or..?
I'm tempted to just use a database table for absolutely everything but I'm not sure if there are any implications to this.
Edit: For context:
I've been asked to put a new contact form on the website for a handful of companies, with more to be added occasionally in the future. Except, companies don't have contact email addresses; the users inside these companies do (as they post jobs through their own accounts). Now though, we want a "speculative application" type functionality and the form needs an email address to send these applications to. But we also don't want to put an email address as a property in the form, or else spammers can just use it as an open email gateway. So clearly, we need an ID -> contact_email type relationship with companies.
So, I can either add a column to a table with millions of rows which will be used, literally, about 20 times OR create a new table that at most is going to hold about 20 rows. Typically, how we've handled this in the past is just to create a nasty text file and read it from there. But this creates maintenance nightmares, and these text files are frequently overlooked when data that they depend on changes. Perhaps this is a fault with the process, but I'm just interested in hearing views on this.

Put it in the database. If it changes infrequently, cache it in your middle tier.
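As a rough illustration of "database plus cache", the sketch below keeps the ID -> contact_email mapping from the question in a small table and memoises lookups in the application layer. The table name, the sample addresses and the query counter are assumptions added for the example.

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE company_contact ("
    "company_id INTEGER PRIMARY KEY, contact_email TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO company_contact VALUES (?, ?)",
    [(7, "jobs@example.com"), (9, "hr@example.org")],
)

stats = {"queries": 0}  # counts real database hits, for demonstration only

@lru_cache(maxsize=128)
def contact_email(company_id):
    """Return the contact address for a company, or None if there is none."""
    stats["queries"] += 1
    row = conn.execute(
        "SELECT contact_email FROM company_contact WHERE company_id = ?",
        (company_id,),
    ).fetchone()
    return row[0] if row else None

first = contact_email(7)
second = contact_email(7)   # served from the cache, no query issued
missing = contact_email(8)  # unknown company
```

Since the form submits only the company ID, the address never appears in the page; after an update to the table, `contact_email.cache_clear()` discards the cached values.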

The example that springs to mind immediately is what is appropriate to have stored as an enumeration and what is appropriate to have stored in a "lookup" database table.
I tend to "draw the line" with the rule that if it will result in a column in the database containing a "magic number" that maps to an enumeration value, then the enumeration should really exist as a lookup table. If it's unrelated to the data stored in the database (eg. Application configuration data rather than user generated data), then it's an enumeration all the way.

Surely it depends on the user of the software tool you've developed to consume the set of data, regardless of size?
It might just be that they know Excel, so your tool would have to parse a .csv file that they create.
If it's written for the developers, then who cares what you use. I'm not a fan of cluttering databases with minor or transient data however.

We have a standard config file format (key:value) and a class to handle it. We just use that on all projects. Mostly we're just setting persistent properties for our applications (mobile phone development) so that's an appropriate thing to do. YMMV
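A key:value format with a small handler class can be very little code. The sketch below is one possible shape, not the poster's actual class: the `Config` name, the `#` comment syntax and the sample settings are all assumptions.

```python
class Config:
    """Reads 'key:value' lines; blank lines and '#' comments are ignored."""

    def __init__(self, text):
        self._values = {}
        for line_no, line in enumerate(text.splitlines(), start=1):
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # Split on the first colon only, so values may contain colons.
            key, sep, value = line.partition(":")
            if not sep or not key.strip():
                raise ValueError(f"line {line_no}: expected 'key:value'")
            self._values[key.strip()] = value.strip()

    def get(self, key, default=None):
        """Return the raw string value, or the default if the key is absent."""
        return self._values.get(key, default)

    def get_int(self, key, default=0):
        """Return the value converted to int, or the default if absent."""
        return int(self._values.get(key, default))

settings = Config("""
# hypothetical application settings
server: example.com:8080
retries: 3
""")
server = settings.get("server")
retries = settings.get_int("retries")
```

Reusing one such class across projects gives the small data a single, predictable home, which addresses the "hidden in the code" complaint from the question.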

In cases where the program accesses a database, I'll store everything in there: easier for backup and moving data around.
For small programs without database access I store my data in the .NET settings, which are stored in an XML file. Of course this is a feature of C#, so it might not apply to you.
Anyway, I make sure to store all data in one place. Usually a database.

Have you considered SQLite? It's file-based, which addresses your feeling that "just a file might do" (zero configuration), but it's a perfectly good database and scales remarkably well. It supports a number of APIs and there are numerous front ends for administering it.

If these are small config-like data, I use some simple and common format. INI, JSON and YAML are usually OK. Java and .NET fans also like XML. In short, use something that you can easily read into an in-memory object and forget about it.

I would add it to the database in the main table:
Backup and recovery (you do want to recover this text file, right?)
Ad hoc querying (since you can do it with a SQL tool and join it to the other database data)
If the database column is empty, the storage requirements for it should be minimal (nothing if it's a NULL column at the end of the table in Oracle)
It will be easier if you want to have multiple application servers, as you will not need to keep multiple copies of some extra config file around
Putting it into a little child table only complicates the design without giving any real benefits
You may well already be going to that same row in the database as part of your processing anyway, so performance is not likely to be a problem. If you are not, you could cache it in memory.
