Alternative to Apatar for scripted data migration

Alternative to Apatar for scripted data migration - salesforce

I'm looking for the fastest-to-success alternative solution for related data migration between Salesforce environments with some specific technical requirements. We were using Apatar, which worked fine in testing, but late in the game it has started throwing the dreaded socket "connection reset" errors and we have not been able to resolve it - it has some other issues that are leading me to ditch it.
I need to move a modest amount of data (about 10k rows total) between several sandboxes and ultimately to a production environment. The data is spread across eight custom objects. There is a four-level-deep master-detail relationship, which obviously must be preserved.
The target environment tables are 100% empty.
The trickiest object has a master-detail and two lookup fields.
Ideally, the data from one table near the top of the hierarchy should be filtered by a simple WHERE, and then children not under matching rows not migrated, but I'll settle for a solution that migrates all the data wholesale.
My fallback in this situation is going to be good old Data Loader, but it's not ideal because our schema is locked down and does not contain external ID fields, so scripting a solution that preserves all the M-D and lookups will take a while and be more error prone than I'd like.
It's been a long time since I've done a survey of the tools available, and don't have much time to do one now, so I'm appealing to the crowd. I need an app that will be simple (able to configure and test very quickly), repeatable, and rock-solid.
I've always pictured an SFDC data migration app that you can just check off eight checkboxes from a source environment, point it to a destination environment, and it just works, preserving all your relationships. That would be perfect. Free would be nice too. Does such a shiny thing exist?

Sesame Relational Junction seems to best match what you're looking for. I haven't used it, though; so, I can't comment on its effectiveness for what you're attempting.
The other route you may want to look into is using the Bulk API or using the Data Loader CLI with Task Scheduling.
You may find this information (below), from an answer to a different question, helpful.
Here is a list of integration services (other than Apatar):
Informatica Cloud
Cast Iron
SnapLogic
Boomi
JitterBit
Sesame Relational Junction
Information on other tools, to integrate Salesforce with other databases, is available here:
Salesforce Web Services API
Salesforce Bulk API

Relational Junction has a unique feature set that supports cloning, splitting, and merging of Salesforce orgs, and will keep the relationships intact in a one-pass load. It works like this:
Download source org to an empty database schema (any relationship DBMS)
Download target org to a second empty database schema
Run some scripts to condition the data; this varies by object. Sesame provides guidance and sample scripts, but essentially you have to set a control field to tell Relational Junction to create or update Salesforce. This is also where you may need to replace source ID's with target ID's if some objects have been pre-populated during sandbox creation
Replicate the second database to the target org
Relational Junction handles the socket disconnects, timeouts, and whatever havoc happens during the unload/reload process gracefully and without creating duplicates.
This process was developed for a proof of concept at a large Silicon Valley network vendor in 2007, who became a customer. The entire down and up of 15 GB of data took 46 hours, plus about 2 days of preparation.

Related

How to structure/coordinate multiple databases?

Imagine a large corp with dozens of companies, each with their own website and each website will have their own unique functional requirements
Most data on each website will be specific to that website
Each website can edit its own data
Some data will be shared across all websites
There will be a central CMS that is allowed to edit this data, but other websites can read and use that data
e.g. say you're planning the infrastructure for a company that owns multiple sub-companies that make different kinds of products, some in the same category (cereal, food), others in completely different categories (books, instruments). Some are marketing websites, some are for CRM, some are online stores
there are a list of regulatory requirements that affect all products
each company should manage the status of compliance of its own products to each requirement
when a new requirement surfaces, details regarding that requirement should only be entered once
How would the multiple databases be coordinated?
edit: added more info per Bob's suggestions
Thanks for the incredibly insightful questions!
compliance data is not shared, silo'd within each site
shared data is only on the one enterprise-wide database, they will mostly be "types of [thing]"
no conclusive list of instances where they'll be used but currently it'd be to populate CMS dropdowns for individual sites.
changes to shared data would occur a few times a year.
Ideally changes would be reflected within a few minutes, but an hour or so should be acceptable
very low volume in shared data.
All DBs will be new, decision on which DB is pending current investigation.
Sub-systems will expose REST api

Here are some ways I have seen this handled, you need to think about the implications of each structure based on the details of your particular business domain. All can work, but all have to be carefully set up if they are going to work.
One database for shared information and one for each client for client-specific information. Set up the overall application so that the first thing you put in the application on log in is the client and it connects to the correct client. People might have to also have a way to change the client if users will handled multiples.
Separate servers for each client if they completely need to be siloed. Database changes are by script (and in source control) and are applied to each server as need be. So the changes to the central database might have a job that runs to push any data changes to the other servers
All the data in one database, but making sure each table has a client_id so that the data is always filtered correctly by client. You can set up separate views by client, so that the users can only see the clients they are supposed to see. This only works if the data for each client is substantially in the same form.
And since you are in a regulatory environment, I strongly urge that you create an audit database that is updated by database triggers (never audit from the application, you will lose changes to the data) for each database.

I agree with Chris that, even after both the sets of questions, there is still a big set of possible solutions. For instance, if the databases were the same technology, and the shared data were stored in the same way in each one, you could do db-level replication from the central db to the others. Is it OK to have 2 separate dbs per application (one with shared stuff and one with not-shared?) - this would influence the kind of replication.
Or you could have a purely code solution, where clicking publish in a GUI that updates the central db calls a set of APIs that also update the other dbs. Or micro-services - updating the central db also creates a message on a shared queue, that is picked up by services that each look after a different db and apply the updates in whatever form makes sense for that db.
It depends on (among the things already mentioned) what your organisation's technology strategy is, what technology and skills you already have in-house, and so on.
So this is as much an architecture question as it is a db question.

I don't think this question is sufficiently clear to get a single answer. However there are a few possibilities.
In many cases, where you have shared data you want to have a single point of ownership of that information. It could be in a database, in an excel file (which can then be turned into csv and periodically loaded on all dbs), or some other form. The specifics depend on what is shared exactly.
Now in this case it sounds like you are going to have some sort of legal department in charge of some shared information and they will manage that data, which will then be shared to the other sites. This might be done with an application they manage which aggregates information from the other companies or it could be data which is pushed to their systems.
A final point:
Software is at its best when it facilitates human solutions to human problems, not when it tries to solve those problems directly. In these cases, you probably want a good human solution in place and then to look at what software can do to support that. A lot of the issues (who owns the information?) will already have been solved and you will be simply automating what is already done.

What is a good web application SQL Server data mart implementation in ElasticSearch?

Coming from a RDBMS background and trying to wrap my head around ElasticSearch data storage patterns...
Currently in SQL Server, we have a star schema data mart, RecordData. Rows are organized by user ID, geographic location that pertains to the rest of the searchable record, title and description (which are free text search fields).
I would like to move this over to ElasticSearch, and have read about creating a separate index per user. If I understand this correctly, with this suggestion, I would be creating a RecordData type in each user index, correct? What is a recommended naming convention for user indices that will be simple for Kibana analysis?
One issue I have with this recommendation is, how would you organize multiple web applications on the ES server? You wouldn't want to have all those user indices all over the place?
Is it so bad to have one index per application, and type per SQL Server table?
Since in SQL Server, we have other tables for user configuration, based on user ID's, I take it that I could then create new ES types in user indices for configuration. Is this a recommended pattern? I would rather not have two data base systems for this web application.
Suggestions welcome, thank you.

I went through the same thing, and there are a few things to take into account.
Data Modeling
You say you use a star schema today. Elasticsearch is typically appropriate for denormalized data where the totality of the information resides in each document unlike with a star schema. If you can live with denormalized, that is fine but I assume that since you already have star schema, denormalized data is not an option because you don't want to go and update millions of documents each time the location name change for example(if i understand the use case). At least in my use case that wasn't an option.
What are Elasticsearch options for normalized data?
This leads us to think of how to put star schema like data in a system like Elasticsearch. There are a few options in the documentation, the main ones i focused were
Nested Objects - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html . In nested objects the entire information is kept in a single document, meaning one location and its related users would be in a single document. That may make it not optimal becasue the document will be huge and again, a change in the location name will require to update the entire document. So this is better but still not optimal.
Parent - Child Relationship - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html . In this case the location and the User records would be kepts in separate indices similarly to a relational database. This seems to be the right modeling for what we need. The only major issue with this option is the fact that Kibana 4 does not provide ways to manipulate/aggregate documents based on parent/child relationship as of this writing. So if you main driver for using Elasticsearch is Kibana(this was mine), that kind of eliminates the option. If you want to benefit from the elasticsearch speed as an engine this seems to be the desired option for your use case.
In my opinion once you got right the data modeling all of your questions will be easier to answer.
Regarding the organization of the servers themselves, the way we organize that is by having a separate cluster of 3 elasticsearch nodes behind a Load Balancer(all of that is hosted on a cloud) and then have all your Web Applications connect to that cluster using the Elasticsearch API.
Hope that helps.

Architecture recommendation using SQL Server for real-time aggregation and denormalization

We have an enterprise LOB application for managing millions of bibliographic (lots of text) records using SQLServer (2008). The database is very normalized (a complete record might easily be made of up ten joined tables plus nested collections). Write transactions are fine, and we have a very responsive search solution for now, which makes generous use of full-text indexing and indexed views.
The issue is that in reality, much of what the research users need could be better served by a read-only warehouse-type copy of the data, but it would need to be continually copied near real-time (latency of a few minutes is fine).
Our search is optimized by several calculated columns or composite tables already, and we would like to add more. Indexed views cannot cover all needs because of their constraints (such as no outer joins). There are dozens of 'aspects' to this data, much like a read-only data warehouse might provide, involving permissions, geography, category, quality, and counts of associated documents. We also compose complex xml representations of the records that are fairly static and could be composed and stored once.
The total amount of denormalization, calculation and search optimization provokes an unacceptable delay if done completely via triggers, and is also prone to lock conflicts.
I've researched some of Microsoft's SQL Server suggestions, and I would like to know if anyone having experience with similar requirements has can offer recommendation from the following three (or other suggestions that use the SQL Server/.Net stack):
Transactional replication to a read-only copy - but it is unclear from the documentation how much one can change the schema on the subscriber side and add triggers, calculated columns or composite tables;
Table partitioning - not to alter the data, but perhaps to segment large areas of data that currently are recalculated constantly, such as permissions, record type (60), geographical region, etc...would that allow triggers on the transactional side to run with less locks?
Offline batch processing - Microsoft uses that phrase often, but does not give great examples, except for 'checking for signs of credit card fraud' on the subscriber side of transaction replication...which would be a great sample, but how is that done exactly in practice? SSIS jobs that run every 5 minutes? Service Broker? External executables that poll continually? We want to avoid the 'run a long process at night' solution, and we also want to avoid locking up the transactional side of things by running an update-intensive aggregating/compositing routine every 5 minutes on the transactional server.
Update to #3: after posting, I found this SO answer with a link to Real Time Data Integration using Change Tracking, Service Broker, SSIS and triggers - looks promising - would that be a recommended path?
Another Update: which, in turn, has helped me find rusanu.com - all things ServiceBroker by SO user Remus Rusanu. The asyncrhonous messaging solutions seem to match our scenario much better than the Replication scenarios...

Service Broker technology is good for serving your task although there are maybe potential drawback depending on your particular system configuration. The most valuable feature IMO is ability to decouple two kind of processing - writing and aggregation. You will be able to do this even using different databases/SQL Server instances/physical servers in very reliable way. Of course you need to spend some time designing message exchange process - specifying message formats, planning conversations, etc., because this has huge influence on satisfaction from resulting system.
I've used SSBS for my task that was more or less similar - near real-time creation of analytic data warehouse based on regular data flow.

What strategy to migrate data from a spreadsheet to an RDBMS?

This is linked to my other question when to move from a spreadsheet to RDBMS
Having decided to move to an RDBMS from an excel book, here is what I propose to do.
The existing data is loosely structured across two sheets in a work-book. The first sheet contains main record. The second sheet allows additional data.
My target DBMS is mysql, but I'm open to suggestions.
Define RDBMS schema
Define, say, web-services to interface with the database so the same can be used for both, UI and migration.
Define a migration script to
Read each group of affiliated rows from the spreadsheet
Apply validation/constraints
Write to RDBMS using the web-service
Define macros/functions/modules in spreadsheet to enforce validation where possible. This will allow use of the existing system while the new comes up. At the same time, ( i hope ) it will reduce migration failures when the move is eventually made.
What strategy would you follow?

There are two aspects to this question.
Data migration
Your first step will be to "Define RDBMS schema" but how far are you going to go with it? Spreadsheets are notoriously un-normalized and so have lots of duplication. You say in your other question that "Data is loosely structured, and there are no explicit constraints." If you want to transform that into a rigourously-defined schema (at least 3NF) then you are going to have to do some cleansing. SQL is the best tool for data manipulation.
I suggest you build two staging tables, one for each worksheet. Define the columns as loosely as possible (big strings basically) so that it is easy to load the spreadsheets' data. Once you have the data loaded into the staging tables you can run queries to assess the data quality:
how many duplicate primary keys?
how many different data formats?
what are the look-up codes?
do all the rows in the second worksheet have parent records in the first?
how consistent are code formats, data types, etc?
and so on.
These investigations will give you a good basis for writing the SQL with which you can populate your actual schema.
Or it might be that the data is so hopeless that you decide to stick with just the two tables. I think that is an unlikely outcome (most applications have some underlying structure, we just have to dig deep enough).
Data Loading
Your best bet is to export the spreadsheets to CSV format. Excel has a wizard to do this. Use it (rather than doing Save As...). If the spreadsheets contain any free text at all the chances are you will have sentences which contain commas, so make sure you choose a really safe separator, such as ^^~
Most RDBMS tools have a facility to import data from CSV files. Postgresql and Mysql are the obvious options for an NGO (I presume cost is a consideration) but both SQL Server and Oracle come in free (if restricted) Express editions. SQL Server obviously has the best integration with Excel. Oracle has a nifty feature called external tables which allow us to define a table where the data is held in a CSV file, removing the need for staging tables.
One other thing to consider is Google App Engine. This uses Big Table rather than an RDBMS but that might be more suited to your loosely-structured data. I suggest it because you mentioned Google Docs as an alternative solution. GAE is an attractive option because it is free (more or less, they start charging if usage exceeds some very generous thresholds) and it would solve the app sharing issue with those other NGOs. Obviously your organisation may have some qualms about Google hosting their data. It depends on what field they are operating in, and the sensitivity of the information.

Obviously, you need to create a target DB and the necessary table structure.
I would skip the web services and write a groovy script which reads the .xls (using the POI library), validates and saves the data in the database.
In my view, anything more involved (web services, GUI...) is not justified: these kinds of tasks are very well suited for scripts because they're concise and extremely flexible while things like performance, code base scalability and such are less of an issue here. Once you have something that works, you will be able to adapt the script to any future document with different data anomalies you run into in a matter of minutes or a few hours.
This is all assuming your data isn't in perfect order and needs to be filtered and/or cleaned.
Alternatively, if the data and validation rules aren't too complex, you can probably get good results with using a visual data transfer tool like Kettle: you just define the .xls as your source, the database table as the table, some validation/filter rules if needed and trigger the loading process. Quite painless.

If you'd rather use a tool that roll your own, check out SeekWell, which lets you write to your database from Google Sheets. Once you define your schema, Select the tables into a Sheet, then edit or insert the records and mark them for the appropriate action (e.g., update, insert, etc.). Set the schedule for the update and you're done. Read more about it here. Disclaimer--I'm a co-founder.
Hope that helps!

You might be doing more work than you need to. Excel spreadsheets can be saved as CVS or XML files and many RDBMS clients support importing these files directly into tables.
This could allow you skip writing web service wrappers and migration scripts. Your database constraints would still be properly enforced during any import. If your RDBMS data model or schema is very different from your Excel spreadsheets, however, then some translation would of course have to take place via scripts or XSLT.

What are the advantages of using a single database for EACH client?

In a database-centric application that is designed for multiple clients, I've always thought it was "better" to use a single database for ALL clients - associating records with proper indexes and keys. In listening to the Stack Overflow podcast, I heard Joel mention that FogBugz uses one database per client (so if there were 1000 clients, there would be 1000 databases). What are the advantages of using this architecture?
I understand that for some projects, clients need direct access to all of their data - in such an application, it's obvious that each client needs their own database. However, for projects where a client does not need to access the database directly, are there any advantages to using one database per client? It seems that in terms of flexibility, it's much simpler to use a single database with a single copy of the tables. It's easier to add new features, it's easier to create reports, and it's just easier to manage.
I was pretty confident in the "one database for all clients" method until I heard Joel (an experienced developer) mention that his software uses a different approach -- and I'm a little confused with his decision...
I've heard people cite that databases slow down with a large number of records, but any relational database with some merit isn't going to have that problem - especially if proper indexes and keys are used.
Any input is greatly appreciated!

Assume there's no scaling penalty for storing all the clients in one database; for most people, and well configured databases/queries, this will be fairly true these days. If you're not one of these people, well, then the benefit of a single database is obvious.
In this situation, benefits come from the encapsulation of each client. From the code perspective, each client exists in isolation - there is no possible situation in which a database update might overwrite, corrupt, retrieve or alter data belonging to another client. This also simplifies the model, as you don't need to ever consider the fact that records might belong to another client.
You also get benefits of separability - it's trivial to pull out the data associated with a given client ,and move them to a different server. Or restore a backup of that client when the call up to say "We've deleted some key data!", using the builtin database mechanisms.
You get easy and free server mobility - if you outscale one database server, you can just host new clients on another server. If they were all in one database, you'd need to either get beefier hardware, or run the database over multiple machines.
You get easy versioning - if one client wants to stay on software version 1.0, and another wants 2.0, where 1.0 and 2.0 use different database schemas, there's no problem - you can migrate one without having to pull them out of one database.
I can think of a few dozen more, I guess. But all in all, the key concept is "simplicity". The product manages one client, and thus one database. There is never any complexity from the "But the database also contains other clients" issue. It fits the mental model of the user, where they exist alone. Advantages like being able to doing easy reporting on all clients at once, are minimal - how often do you want a report on the whole world, rather than just one client?

Here's one approach that I've seen before:
Each customer has a unique connection string stored in a master customer database.
The database is designed so that everything is segmented by CustomerID, even if there is a single customer on a database.
Scripts are created to migrate all customer data to a new database if needed, and then only that customer's connection string needs to be updated to point to the new location.
This allows for using a single database at first, and then easily segmenting later on once you've got a large number of clients, or more commonly when you have a couple of customers that overuse the system.
I've found that restoring specific customer data is really tough when all the data is in the same database, but managing upgrades is much simpler.
When using a single database per customer, you run into a huge problem of keeping all customers running at the same schema version, and that doesn't even consider backup jobs on a whole bunch of customer-specific databases. Naturally restoring data is easier, but if you make sure not to permanently delete records (just mark with a deleted flag or move to an archive table), then you have less need for database restore in the first place.

To keep it simple. You can be sure that your client is only seeing their data. The client with fewer records doesn't have to pay the penalty of having to compete with hundreds of thousands of records that may be in the database but not theirs. I don't care how well everything is indexed and optimized there will be queries that determine that they have to scan every record.

Well, what if one of your clients tells you to restore to an earlier version of their data due to some botched import job or similar? Imagine how your clients would feel if you told them "you can't do that, since your data is shared between all our clients" or "Sorry, but your changes were lost because client X demanded a restore of the database".

As for the pain of upgrading 1000 database servers at once, some fairly simple automation should take care of that. As long as each database maintains an identical schema, then it won't really be an issue. We also use the database per client approach, and it works well for us.
Here is an article on this exact topic (yes, it is MSDN, but it is a technology independent article): http://msdn.microsoft.com/en-us/library/aa479086.aspx.
Another discussion of multi-tenancy as it relates to your data model here: http://www.ayende.com/Blog/archive/2008/08/07/Multi-Tenancy--The-Physical-Data-Model.aspx

Scalability. Security. Our company uses 1 DB per customer approach as well. It also makes code a bit easier to maintain as well.

In regulated industries such as health care it may be a requirement of one database per customer, possibly even a separate database server.
The simple answer to updating multiple databases when you upgrade is to do the upgrade as a transaction, and take a snapshot before upgrading if necessary. If you are running your operations well then you should be able to apply the upgrade to any number of databases.
Clustering is not really a solution to the problem of indices and full table scans. If you move to a cluster, very little changes. If you have have many smaller databases to distribute over multiple machines you can do this more cheaply without a cluster. Reliability and availability are considerations but can be dealt with in other ways (some people will still need a cluster but majority probably don't).
I'd be interested in hearing a little more context from you on this because clustering is not a simple topic and is expensive to implement in the RDBMS world. There is a lot of talk/bravado about clustering in the non-relational world Google Bigtable etc. but they are solving a different set of problems, and lose some of the useful features from an RDBMS.

There are a couple of meanings of "database"
the hardware box
the running software (e.g. "the oracle")
the particular set of data files
the particular login or schema
It's likely Joel means one of the lower layers. In this case, it's just a matter of software configuration management... you don't have to patch 1000 software servers to fix a security bug, for example.
I think it's a good idea, so that a software bug doesn't leak information across clients. Imagine the case with an errant where clause that showed me your customer data as well as my own.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight