I have searched the existing answers and haven't found anything relating to the project I'm currently working on. If I've missed it, a link would be greatly appreciated.
Currently we have many sets in the field which run various applications to transfer field data to a master database. These applications were developed for different reasons at different times and currently send only specific data. We now need to send all data. I was hoping to leverage ApexSQL Data Diff, a product we already have, to transfer each in-field server's net-new row entries to the main database.
I was hoping to get some information from someone who may have already looked at this as a possibility or implemented it in the past.
The extra fun aspect of this project is that it must fall under PCI compliance, which I can figure out after the fact.
To transfer only new entries, you have to synchronize the records that exist in your source (the in-field servers) and are "missing" in your destination (the main database). ApexSQL Data Diff can do this, but you have to select all records in the Missing tab for the specific tables and unselect all records in the Additional tab. If you leave records selected in the Additional tab, the records that exist in the main database but not on the in-field servers will be deleted, so you have to be careful.
If you have the ApexSQL Diff API, you can use SynchronizeMissingRows and all missing rows in the database will be synchronized.
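If you ever need to script the same thing yourself rather than through ApexSQL, the "missing rows only" synchronization boils down to an insert of source rows that have no matching key in the destination. A minimal T-SQL sketch, with entirely made-up database, table, and column names, assuming the main database is reachable from the in-field server (directly or via a linked server):

```sql
-- Push rows that exist on the in-field server but not yet in the main database,
-- keyed on the primary key. Nothing in the destination is updated or deleted.
INSERT INTO MainDB.dbo.FieldReadings (ReadingID, DeviceID, ReadingValue, ReadingTime)
SELECT s.ReadingID, s.DeviceID, s.ReadingValue, s.ReadingTime
FROM   FieldDB.dbo.FieldReadings AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   MainDB.dbo.FieldReadings AS d
                   WHERE  d.ReadingID = s.ReadingID);
```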
I am now working on a project which requires showing the transaction history of a customer and whether the product the customer buys is under warranty or not. I need to use the data from the current system; the system can provide a Web API, which returns a .csv file. So how can I make use of the current system's data?
A solution I'm thinking of is to download all the .csv files and write scripts to insert every record into a database I build, which contains the necessary tables and relations to hold the data I retrieve. Then I would have the new database I want. Because I have never done this before, I want to know if it is feasible.
And one more question: should I store the data locally or use a cloud database like Firebase?
High-end databases like SQL Server and Oracle come with utilities that allow you to read directly from a csv file. Check the docs. Having done this many times, the best procedure I found was to read the file into one holding table. This gives you the chance to examine the data, find any unexpected quirks or missing fields, and correct the data where possible.
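For example, in SQL Server a BULK INSERT into a deliberately loose holding table does that first step; the file path and all table and column names below are placeholders:

```sql
-- Holding table uses wide, nullable varchar columns so a slightly bad file still loads
CREATE TABLE dbo.TransactionStaging
(
    CustomerId   varchar(100) NULL,
    CustomerName varchar(100) NULL,
    ProductCode  varchar(100) NULL,
    PurchaseDate varchar(100) NULL,   -- validated and converted later
    RawAmount    varchar(100) NULL
);

BULK INSERT dbo.TransactionStaging
FROM 'C:\imports\transactions.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);  -- FIRSTROW = 2 skips the header row
```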
Then write the scripts to move the data from the holding table into the proper tables you have designed. This must be done in a logical manner. For example, move the customer data before the buy transactions. Thus any error messages you get will not be because you tried to store a transaction before you stored the customer. (You will have referential integrity set up, yes?) This gives you more chances to correct or adjust the data or just identify problems more or less at your leisure.
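As a rough illustration of that ordering, continuing the invented staging example above (TRY_CONVERT needs SQL Server 2012 or later):

```sql
-- 1. Parents first: customers that are not already present
INSERT INTO dbo.Customer (CustomerId, CustomerName)
SELECT DISTINCT s.CustomerId, s.CustomerName
FROM   dbo.TransactionStaging AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.Customer AS c WHERE c.CustomerId = s.CustomerId);

-- 2. Children second: the transactions now satisfy the foreign key to Customer
INSERT INTO dbo.Purchase (CustomerId, ProductCode, PurchaseDate, Amount)
SELECT s.CustomerId,
       s.ProductCode,
       TRY_CONVERT(date, s.PurchaseDate),
       TRY_CONVERT(decimal(10, 2), s.RawAmount)
FROM   dbo.TransactionStaging AS s;
```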
Whether or not to store the data in the cloud is strictly according to the preferences of your employer.
I need a database (not even close to a warehouse yet) that will serve as a master record of office locations and certain attributes associated with them. I have an initial source file, but also will be adding to that information from a number of other supplemental files. Some of the additional information will simply be records I can join to after matching. And some will be information I create once I've validated some status or have some internal knowledge on a location to store.
The source data is mostly flat files - some updated every few months, some annually. I am looking for design advice including ideas for how to query the database later.
I want to be able to start with the records in the initial data source, leave it alone, and then do a number of things in a new database:
if we verify the address, etc. from the initial source, I want to flag that and be able to give it precedence
if we can match a record in the initial data source to a record in one of the other data sources, I want to pull one or both records via query
if a location is participating in a specific program, I want to store that flag, regardless of which source I will keep for the address, etc., i.e. there is information related to a record that is separate from any of the different sources.
So, I was starting to think about something similar to a Star schema, but with a fact table pointing to the separate source tables based on some matching logic. My fact table would have records that include the keys of any of the other sources that are related. And it might also have additional columns for our added info on each record that is separate from the sources.
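Very roughly, the central table I have in mind would look something like this (all names are placeholders, just to make the idea concrete):

```sql
CREATE TABLE dbo.LocationMaster
(
    LocationId       int IDENTITY(1,1) PRIMARY KEY,
    InitialSourceKey varchar(50) NULL,         -- key into the untouched initial source table
    SupplementalKey  varchar(50) NULL,         -- key into a matched supplemental source, if any
    AddressVerified  bit NOT NULL DEFAULT 0,   -- the "give this record precedence" flag
    InProgramX       bit NOT NULL DEFAULT 0,   -- program participation, independent of any source
    Notes            nvarchar(500) NULL        -- internal knowledge we add ourselves
);
```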
I was also thinking about just building a table that holds some kind of copy of the current info from each related source - or perhaps a view. But this may not be robust enough to serve as a basic source-of-truth database given all the flags and other additional info I expect to start adding.
This is not a large database - in the tens of thousands only - and does not need high availability to many users/queries. Eventually, I will want to join it to a list of individuals at the locations. I'm hoping I can join the people to the places easily after I've got the main locations table built.
I realize this question has a broad scope. I'm not an experienced database designer, and I've searched for advice already but haven't figured out how to describe my challenge so I get relevant search results. All advice and references are most appreciated!
I'm working in SQL Server 2012
Imagine a large corp with dozens of companies, each with its own website, and each website will have its own unique functional requirements
Most data on each website will be specific to that website
Each website can edit its own data
Some data will be shared across all websites
There will be a central CMS that is allowed to edit this data, but other websites can read and use that data
e.g. say you're planning the infrastructure for a company that owns multiple sub-companies that make different kinds of products, some in the same category (cereal, food), others in completely different categories (books, instruments). Some are marketing websites, some are for CRM, some are online stores
there is a list of regulatory requirements that affects all products
each company should manage the status of compliance of its own products to each requirement
when a new requirement surfaces, details regarding that requirement should only be entered once
How would the multiple databases be coordinated?
edit: added more info per Bob's suggestions
Thanks for the incredibly insightful questions!
compliance data is not shared; it is siloed within each site
shared data is only in the one enterprise-wide database; it will mostly be "types of [thing]"
there's no conclusive list of instances where it'll be used, but currently it would be to populate CMS dropdowns for individual sites.
changes to shared data would occur a few times a year.
Ideally changes would be reflected within a few minutes, but an hour or so should be acceptable
very low volume in shared data.
All DBs will be new, decision on which DB is pending current investigation.
Sub-systems will expose REST api
Here are some ways I have seen this handled; you need to think about the implications of each structure based on the details of your particular business domain. All can work, but all have to be carefully set up if they are going to work.
One database for shared information and one per client for client-specific information. Set up the overall application so that the first thing you enter on login is the client, and it connects to the correct client database. People might also need a way to change the client if users will handle multiple clients.
Separate servers for each client if they need to be completely siloed. Database changes are made by script (and kept in source control) and are applied to each server as need be. So the central database might have a job that runs to push any data changes to the other servers.
All the data in one database, but making sure each table has a client_id so that the data is always filtered correctly by client. You can set up separate views by client, so that the users can only see the clients they are supposed to see. This only works if the data for each client is substantially in the same form.
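For the one-database option, a minimal sketch of that per-client filtering (table, column, and client names are all invented):

```sql
-- Every table carries the client key
CREATE TABLE dbo.Invoice
(
    InvoiceId   int IDENTITY(1,1) PRIMARY KEY,
    ClientId    int NOT NULL,              -- always filter on this
    InvoiceDate date NOT NULL,
    Amount      decimal(12, 2) NOT NULL
);
GO

-- One view per client (or a single view driven by a user-to-client mapping table);
-- grant ClientA's users SELECT on the view only, not on the base table
CREATE VIEW dbo.Invoice_ClientA
AS
SELECT InvoiceId, InvoiceDate, Amount
FROM   dbo.Invoice
WHERE  ClientId = 1;   -- ClientA's id
```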
And since you are in a regulatory environment, I strongly urge you to create, for each database, an audit database that is updated by database triggers (never audit from the application; you will lose changes to the data).
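A minimal sketch of that trigger-based auditing, reusing the invented Invoice table from above (the audit table and trigger names are assumptions, and a real setup would write to a separate audit database):

```sql
-- Audit table: the base table's columns plus who/when/what
CREATE TABLE dbo.InvoiceAudit
(
    InvoiceId   int            NOT NULL,
    ClientId    int            NOT NULL,
    InvoiceDate date           NOT NULL,
    Amount      decimal(12, 2) NOT NULL,
    AuditAction char(1)        NOT NULL,                            -- 'I' = new image, 'D' = old image
    AuditUser   sysname        NOT NULL DEFAULT SUSER_SNAME(),
    AuditTime   datetime2      NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

CREATE TRIGGER dbo.trg_Invoice_Audit
ON dbo.Invoice
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- New images (inserts and the "after" side of updates)
    INSERT INTO dbo.InvoiceAudit (InvoiceId, ClientId, InvoiceDate, Amount, AuditAction)
    SELECT InvoiceId, ClientId, InvoiceDate, Amount, 'I'
    FROM   inserted;

    -- Old images (deletes and the "before" side of updates)
    INSERT INTO dbo.InvoiceAudit (InvoiceId, ClientId, InvoiceDate, Amount, AuditAction)
    SELECT InvoiceId, ClientId, InvoiceDate, Amount, 'D'
    FROM   deleted;
END;
```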
I agree with Chris that, even after both sets of questions, there is still a big set of possible solutions. For instance, if the databases were the same technology, and the shared data were stored in the same way in each one, you could do db-level replication from the central db to the others. Is it OK to have 2 separate dbs per application (one with shared data and one with non-shared)? This would influence the kind of replication.
Or you could have a purely code solution, where clicking publish in a GUI that updates the central db calls a set of APIs that also update the other dbs. Or micro-services: updating the central db also creates a message on a shared queue that is picked up by services that each look after a different db and apply the updates in whatever form makes sense for that db.
It depends on (among the things already mentioned) what your organisation's technology strategy is, what technology and skills you already have in-house, and so on.
So this is as much an architecture question as it is a db question.
I don't think this question is sufficiently clear to get a single answer. However there are a few possibilities.
In many cases, where you have shared data you want to have a single point of ownership of that information. It could be in a database, in an Excel file (which can then be turned into csv and periodically loaded into all dbs), or some other form. The specifics depend on what is shared exactly.
Now in this case it sounds like you are going to have some sort of legal department in charge of some shared information and they will manage that data, which will then be shared to the other sites. This might be done with an application they manage which aggregates information from the other companies or it could be data which is pushed to their systems.
A final point:
Software is at its best when it facilitates human solutions to human problems, not when it tries to solve those problems directly. In these cases, you probably want a good human solution in place and then to look at what software can do to support that. A lot of the issues (who owns the information?) will already have been solved and you will be simply automating what is already done.
A client has one of my company's applications which points to a specific database and tables within the database on their server. We need to update the data several times a day. We don't want to update the tables that the users are looking at in live sessions. We want to refresh the data on the side and then flip which database/tables the users are accessing.
What is the accepted way of doing this? Do we have two databases and rename the databases? Do we put the data into separate tables, then rename the tables? Are there other approaches that we can take?
Based on the information you have provided, I believe your best bet would be partition switching. I've included a couple links for you to check out because it's much easier to direct you to a source that already explains it well. There are several approaches with partition switching you can take.
Links: Microsoft and Cathrine Wilhelmsen's blog
Hope this helps!
I think I understand what you're saying: if the user is on a screen, you don't want the screen updating with new information while they're viewing it, only when they pull up a new screen after the new data has been loaded? Correct me if I'm wrong. And Mike's question is also a good one: how is this data being fed to the users? Possibly there's a way to pause that while the new data is being loaded. There are more elegant ways to load data, such as partitioning the table, using a staging table, replication, having the users view snapshots, etc. But we need to know what you mean by 'live sessions'.
Edit: with the additional information you've given me, partition switching could be the answer. The process takes virtually no time; it just changes the pointers from the old records to the new ones. The only issue is you have to partition on something partitionable, like a date or timestamp, to differentiate old and new data. It's also an Enterprise Edition feature and I'm not sure which edition you're running.
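A minimal sketch of the simplest flavour of the swap, switching an entire (non-partitioned) staging table into place; with a date-partitioned table you would SWITCH the relevant partition instead. All names are invented, and the three tables must have identical structure and indexes:

```sql
-- Refresh on the side: load the new data into the staging table while users keep querying dbo.Report
TRUNCATE TABLE dbo.Report_Staging;
INSERT INTO dbo.Report_Staging (ReportId, ReportDate, Amount)
SELECT ReportId, ReportDate, Amount
FROM   SourceDb.dbo.Report;

-- The flip itself is metadata-only, so users see the refreshed data almost instantly
TRUNCATE TABLE dbo.Report_Old;                               -- a SWITCH target must be empty
BEGIN TRAN;
    ALTER TABLE dbo.Report         SWITCH TO dbo.Report_Old; -- park the current live data
    ALTER TABLE dbo.Report_Staging SWITCH TO dbo.Report;     -- bring in the refresh
COMMIT;
```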
Possibly a better thing to look at is Read Committed Snapshot Isolation. It will ensure that your users only see new data after it's committed; it provides a statement-level consistent view of the data and has minimal concurrency issues, though there is more overhead in TempDB. Here are some resources for more research:
http://www.databasejournal.com/features/mssql/snapshot-isolation-level-in-sql-server-what-why-and-how-part-1.html
https://msdn.microsoft.com/en-us/library/tcbchxcb(v=vs.110).aspx
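Turning it on is a single database-level setting (the database name below is a placeholder; the ROLLBACK option kicks off existing connections so the change can take effect):

```sql
ALTER DATABASE YourClientDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;
```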
Hope this helps and good luck!
The question details are a little vague, so to clarify:
What is a live session? Is it a session in the application itself (with app code managing its own connections to the database) or is it a low-level connection-per-user/session situation? Are users just running reports or actively reading/writing from the database during the session? When is a session over, and how do you know?
Some options:
1) Pull all the data into the client for the entire session.
2) Use read committed snapshot or partitions as mentioned in other answers (however, this requires careful setup for your queries and increases requirements for the database)
3) Use replica database for all queries, pause/resume replication when necessary (updating data should be faster than your process but it still might take a while depending on volume and complexity)
4) Use a replica database and automate a backup/restore from the master (this might be the quickest depending on the overall size of your database; see the sketch after this list)
5) Use multiple replica databases with replication or backup/restore and then switch the connection string (this allows you to update the master constantly and then update a replica and switch over at a certain predictable time)
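For option 4, a rough sketch of the backup/restore approach (database names, paths, and logical file names are all assumptions):

```sql
-- Back up the freshly updated master copy
BACKUP DATABASE AppDataMaster
TO DISK = N'D:\Backups\AppDataMaster.bak'
WITH INIT, COMPRESSION;

-- Kick users off the replica so the restore isn't blocked
ALTER DATABASE AppDataReplica SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- Overwrite the replica with the master's backup
RESTORE DATABASE AppDataReplica
FROM DISK = N'D:\Backups\AppDataMaster.bak'
WITH REPLACE,
     MOVE N'AppDataMaster'     TO N'D:\Data\AppDataReplica.mdf',     -- logical file names assumed
     MOVE N'AppDataMaster_log' TO N'D:\Data\AppDataReplica_log.ldf';

-- Let users back in
ALTER DATABASE AppDataReplica SET MULTI_USER;
```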
I have a daily process that relies on flat files delivered to a "drop box" directory on the file system. This kicks off a load of this comma-delimited data (from an external company's Excel, etc.) into a database via a piecemeal Perl/Bash application. This database is used by multiple applications as well as edited directly with some GUI tools. Some of the data then gets replicated by an additional Perl app into the database that I mainly use.
Needless to say, all of that is complicated and error prone: incoming data is sometimes corrupt, or sometimes an edit breaks it. My users often complain about missing or incorrect data. Diffing the flat files and DBs to analyze where the process breaks is time consuming, and with each passing day the data becomes more out of date and difficult to analyze.
I plan to fix or rewrite parts or all of this data transfer process.
I am looking for recommended reading before I embark on this; websites and articles on how to write robust, failure-resistant, and auto-recoverable ETL processes, or other advice, would be appreciated.
This is precisely what Message Queue Managers are designed for. Some examples are here.
You don't say what database backend you have, but in SQL Server I would write this as an SSIS package. We have a system designed to also write data to a metadata database that tells us when the file was picked up, whether it processed successfully, and why if it did not. It also records things like how many rows the file had (which we can then use to determine if the current row count is abnormal). One of the beauties of SSIS is that I can set up configurations on package connections and variables, so that moving the package from development to prod is easy (I don't have to go in and manually change the connections each time once I have a configuration set up in the config table).
In SSIS we do various checks to ensure the data is correct, or clean up the data, before inserting it into our database. Actually we do lots and lots of checks. Questionable records can be removed from the file processing and put in a separate location for the DBAs to examine and possibly pass back to the customer. We can also check whether the data in various columns (and the column names if given; not all files have them) is what would be expected. So if the zipcode field suddenly has 250 characters, we know something is wrong and can reject the file before processing. That way, when the client swaps the lastname column with the firstname column without telling us, we can reject the file before importing 100,000 new incorrect records. In SSIS we can also use fuzzy logic to match against existing records. So if the record for John Smith says his address is at 213 State St., it can match our record that says he lives at 215 State Street.
It takes a lot to set up a process this way, but once you do, the extra confidence that you are processing good data is worth its weight in gold.
Even if you can't use SSIS, this should at least give you some ideas of the types of things you should be doing to get the information into your database.
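For instance, even without SSIS, the same kind of column sanity checks can be run in plain T-SQL against a staging table before the real load (all names are invented; TRY_CONVERT needs SQL Server 2012 or later):

```sql
-- Reject the whole file if a column clearly holds the wrong kind of data
IF EXISTS (SELECT 1
           FROM   dbo.ImportStaging
           WHERE  LEN(ZipCode) > 10
              OR  TRY_CONVERT(date, OrderDate) IS NULL)
BEGIN
    RAISERROR ('Import file failed validation; rows left in dbo.ImportStaging for review.', 16, 1);
    RETURN;
END;

-- Or keep the good rows and park the questionable ones for the DBAs to review
INSERT INTO dbo.ImportStaging_Rejects
SELECT *
FROM   dbo.ImportStaging
WHERE  LEN(ZipCode) > 10
   OR  TRY_CONVERT(date, OrderDate) IS NULL;
```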
I found this article helpful for the error handling aspects of running cron jobs:
IBM DeveloperWorks article: "Build intelligent, unattended scripts"