Best way to build a DataMart from multiple external systems? - sql-server

I'm in the planning stages of building a SQL Server DataMart for mail/email/SMS contact info and history. Each piece of data is located in a different external system. Because of this, email addresses do not have account numbers and SMS phone numbers do not have email addresses, etc. In other words, there isn't a shared primary key. Some data overlaps, but there isn't much I can do except keep the most complete version when duplicates arise.
Is there a best practice for building a DataMart with this data? Would it be an acceptable practice to create a key table with a column for each external key? Then, a unique primary ID can be assigned to tie this to other DataMart tables.
Looking for ideas/suggestions on approaches I may not have yet thought of.
Thanks.

The email address or phone number itself sounds like a suitable business key. Typically a "staging" database is used to load the data from multiple sources and then assign surrogate keys and do other transformations.
Are you familiar with data warehouse methods and design patterns? If you don't have previous knowledge or experience then consider hiring some help. BI / data warehouse projects have a very high failure rate and mistakes can be expensive.
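A minimal sketch of the staging and surrogate-key idea described above, using Python's sqlite3 as a stand-in for a SQL Server staging database. The table and column names (stg_email, stg_sms, contact_key) are made up for illustration. Note that SQLite lets a UNIQUE column hold many NULLs while SQL Server's UNIQUE constraint allows only one, so on SQL Server a filtered unique index would likely be needed instead.

```python
import sqlite3

# In-memory SQLite stands in for the staging database; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stg_email (email_address TEXT, sent_at TEXT);
CREATE TABLE stg_sms   (sms_number TEXT, sent_at TEXT);
CREATE TABLE contact_key (
    contact_id     INTEGER PRIMARY KEY,  -- surrogate key assigned by the mart
    account_number TEXT UNIQUE,          -- business key from the mail system
    email_address  TEXT UNIQUE,          -- business key from the email system
    sms_number     TEXT UNIQUE           -- business key from the SMS system
);
""")

# Sample rows as they might arrive from two external systems.
conn.executemany("INSERT INTO stg_email VALUES (?, ?)",
                 [("a@example.com", "2024-01-01"), ("a@example.com", "2024-01-02")])
conn.executemany("INSERT INTO stg_sms VALUES (?, ?)", [("5550100", "2024-01-03")])

def assign_keys(staging_table, key_column):
    """Give every business key not seen before a new surrogate contact_id."""
    conn.execute(f"""
        INSERT INTO contact_key ({key_column})
        SELECT DISTINCT {key_column} FROM {staging_table}
        WHERE {key_column} IS NOT NULL
          AND {key_column} NOT IN (SELECT {key_column} FROM contact_key
                                   WHERE {key_column} IS NOT NULL)
    """)

assign_keys("stg_email", "email_address")
assign_keys("stg_sms", "sms_number")
assign_keys("stg_email", "email_address")  # re-running adds nothing new

# History rows can now be tied to the surrogate key instead of the raw address.
email_history = conn.execute("""
    SELECT k.contact_id, COUNT(*)
    FROM stg_email e JOIN contact_key k ON k.email_address = e.email_address
    GROUP BY k.contact_id
""").fetchall()
key_count = conn.execute("SELECT COUNT(*) FROM contact_key").fetchone()[0]
```

Merging two rows that turn out to be the same person (for example, once an email address and a phone number are matched) is left out here; that de-duplication rule depends on the data.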

Found more information here:
http://en.wikipedia.org/wiki/Extract,_transform,_load#Dealing_with_keys

Well, with no other information to tie the disparate pieces together, your datamart is going to be pretty rudimentary. You'll be able to get the types of data (sms, email, mail), metrics for each type over time ("this week/month/quarter/year we averaged 42.5 sms texts per day, and 8000 emails per month! w00t!"). With just phone numbers and email addresses, your "other datamarts" will likely have to be phone company names, or internet domains. I guess you could link from that into some sort of geographical information (internet provider locations?), or maybe financial information for the companies. Kind of a blur if you don't already know which direction you want to head.
To be honest, this sounds like someone high-up is having a knee-jerk reaction to the "datamart" buzzword coupled with hearing something about how important communication metrics are, so they sent orders on down the chain to "get us some datamarts to run stats on all our e-mails!"
You need to figure out what it is that you or your employer is expecting to get out of this project, and then figure out if the data you're currently collecting gives you a trail to follow to that information. Right now it sounds like you're doing it backwards ("I have this data, what's it good for?"). It's entirely possible that you don't currently have the data you need, which means you'll need to buy it (who knows if you could) or start collecting it, in which case you won't have nice looking graphs and trend-lines for upper-management to look at for some time... falling right in line with the warning dportas gave you in his second paragraph ;)

Related

need of a separate database

I am working on my first web project. I have gone through many tutorials and PDFs, but they all had simple examples of the login and sign-up feature for a webpage, which used only a single database. I am very confused about whether the login and sign-up should have separate databases.
My main question is this: the project takes in users' personal information (name, email, address, telephone number, etc.) along with information specific to their vehicles (model, company, make, manufacture date, etc.). After logging into the website, both sets of data are important, but only some of it is in use, such as the user's name, his/her address, the model of the vehicle, and the company. So should I maintain separate databases for the two and reference each element with a foreign key? Or should I just keep it simple, use a single database, and complete my login and sign-up function? I ask because the number of columns I have is very large.
This might be a bit too academic, but a word you'll want to learn well is normalization. Here is a link to a pretty stiff definition: https://en.wikipedia.org/wiki/Database_normalization
This being your first web project, my advice would be the following:
Don't be afraid to make mistakes. I would strongly encourage trying approaches you think are good and then don't be afraid to change your mind. The lessons learned will stick with you.
Keep everything simple up front. Only add complexity when you need it.
Definitely don't be afraid to grow horizontally with tables (add more and more tables). When I first started working with databases I was afraid to have too many tables because it felt wrong. Try to resist the temptation to cram everything in one table.
Definitely separate login, users and vehicle information. Not a bad idea to also separate out user address information since people can have more than one address.
Use a single database to hold all the information for your project. Two separate databases is not a good idea; you can create many tables in one database, each designed to hold different information. For your example, you might choose the following tables in the same database:
UserLogin [store login information]
User [ store personal info]
Vehicle
and so on
There should be a one-to-one relationship between the UserLogin and User tables, and a one-to-many relationship between the User and Vehicle tables:
one user may have many vehicles.
Hopefully it will help
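The three tables suggested above could be sketched like this. It is only an illustration using Python's sqlite3; the column names are assumptions, and on SQL Server the name User is reserved, so it would need to be written as [User] or renamed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE User (
    user_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    email   TEXT NOT NULL UNIQUE
);
CREATE TABLE UserLogin (
    -- primary key doubles as foreign key: at most one login row per user
    user_id       INTEGER PRIMARY KEY REFERENCES User(user_id),
    password_hash TEXT NOT NULL          -- store a hash, never the password itself
);
CREATE TABLE Vehicle (
    vehicle_id INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES User(user_id),  -- many vehicles per user
    company    TEXT,
    model      TEXT
);
""")

conn.execute("INSERT INTO User (user_id, name, email) VALUES (1, 'Asha', 'asha@example.com')")
conn.execute("INSERT INTO UserLogin VALUES (1, 'hash-goes-here')")
conn.executemany("INSERT INTO Vehicle (user_id, company, model) VALUES (?, ?, ?)",
                 [(1, "Honda", "Civic"), (1, "Ford", "Focus")])

# After login, pull only the columns the page needs by joining on user_id.
rows = conn.execute("""
    SELECT u.name, v.company, v.model
    FROM User u JOIN Vehicle v ON v.user_id = u.user_id
    ORDER BY v.vehicle_id
""").fetchall()
```

Everything lives in one database; the foreign keys do the job the question imagined separate databases doing.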

Will creating separate databases in SQL Server give me better performance?

All, I'm a programmer by trade but for this particular project I'm finding myself being the DBA as well. Here is the scenario I'm faced with:
Web app with anywhere from 400-1000 customers. A customer is a "physical company", each of which has n-number of users. Each customer (company) has on average 1GB worth of data (total of about 200 million rows). Each company has probably 80% similar data in terms of the type of data stored. The other 20% is custom data that the companies can themselves define (basically custom fields).
I am trying to figure out the best way to scale this on the cheap when you consider that the customers need pretty good reaction time. For example, customer X might want to grab all records where last name like 'smith' and phone like '555', whereas customer Y might want to grab all records where account number equals '1526A'.
Bottom line, performance is key and I'm finding it hard to decide what to index and if that is even going to help me given the fact these guys can basically create their own query through the UI.
My question is, what would you do? Do you think it would be wise to break each customer out into its own DB? Total DB size at the moment is around 400GB.
It is a complete re-write so I have the fortune of being able to start fresh if needed. Any thoughts, hints would be greatly appreciated.
"Bottom line, performance is key and I'm finding it hard to decide what to index and if that is even going to help me given the fact these guys can basically create their own query through the UI."
Bottom line, you're ceding your DB performance to the whims of your clients. If they're able to "create their own query", then they're able to "create their own REALLY BAD queries".
So, if you run this in a shared environment (i.e. the same hardware), then customer A's awful table scans can saturate the I/O for everyone else.
If they're on the same database server, then Customer A's scans get to flush all of your other customers' data from the data cache.
Basically, the more you "share", the more one customer can impact the operations of other customers. If you give customers the capability to do expensive things, and share much of it, then everyone suffers.
So, the options are a) don't let the customers do silly things or b) keep the customers as separated as practical so that when one does do silly things, the phones don't light up from all of the other customers.
If you don't know "what to index" then you are not offering much control over what the customers can do, and thus the silly thing factor goes way up.
You would probably get quite far by offering several popular, pre-made SQL views that the customers can select from, and then they're limited to simply filtering and possibly ordering the results. Then you optimize around execution of those views.
It's likely that surprisingly few "general" views can cover a large amount of the use cases.
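As a rough sketch of the pre-made view idea: the application exposes a whitelist of views and of the columns customers may filter or sort on, and everything else is rejected. The names here (v_contact_search, run_view, ALLOWED) are hypothetical, sqlite3 stands in for SQL Server, and only equality filters are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contact (customer_id INTEGER, last_name TEXT, phone TEXT, account_number TEXT);
-- Index chosen to match the filter the view is expected to serve.
CREATE INDEX ix_contact_last_name ON contact (customer_id, last_name);
CREATE VIEW v_contact_search AS
    SELECT customer_id, last_name, phone, account_number FROM contact;
""")
conn.executemany("INSERT INTO contact VALUES (?, ?, ?, ?)", [
    (1, "Smith", "555-0100", "1526A"),
    (1, "Jones", "555-0101", "1600B"),
    (2, "Smith", "555-0199", "9000Z"),
])

# Whitelist: view name -> columns a customer may filter or sort on.
ALLOWED = {"v_contact_search": {"last_name", "phone", "account_number"}}

def run_view(view, customer_id, filters, order_by=None):
    """Query a pre-made view using equality filters on whitelisted columns only."""
    columns = ALLOWED[view]  # unknown view names raise KeyError
    bad = set(filters) - columns
    if bad or (order_by is not None and order_by not in columns):
        raise ValueError(f"column not allowed: {sorted(bad) or order_by}")
    # Identifiers come from the whitelist; values are always bound as parameters.
    sql = f"SELECT last_name, phone, account_number FROM {view} WHERE customer_id = ?"
    params = [customer_id]
    for col, value in filters.items():
        sql += f" AND {col} = ?"
        params.append(value)
    if order_by:
        sql += f" ORDER BY {order_by}"
    return conn.execute(sql, params).fetchall()

smiths = run_view("v_contact_search", 1, {"last_name": "Smith"})
```

Because the set of possible query shapes is now small and known, it becomes much easier to decide what to index.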
Generic, silly queries can be delegated to a batch process that runs overnight, during off hours, or to a separate machine that doesn't impact transactional performance, such as a nightly snapshot with "everything but today's data" on it. Let them run historic queries against that.
The SO question How to design a multi tenant database has a link to a decent article on the tradeoffs along the spectrum from "shared nothing" to "shared everything". Also, SO has a tag for those kinds of questions; I added it for you.
Creating separate databases on the same server won't help you get better performance. The performance optimisations available to you with multiple databases are just the same as you can achieve with one database.
Separate databases might make sense for administrative reasons - if different backup or availability requirements apply to different customers for example.
It's still probably sensible to build your application so that it can support multiple databases so that you have the option of scaling out over multiple DB servers.
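One way to keep that option open is to have all data access go through a single lookup that maps a customer to a database. The sketch below is only an illustration: TENANT_DATABASES and connection_for are hypothetical names, and in-memory SQLite databases stand in for what would be connection strings to one or more SQL Server instances.

```python
import sqlite3

# Hypothetical routing table: customer id -> database location.
# Each ":memory:" connection below is its own separate database.
TENANT_DATABASES = {"acme": ":memory:", "globex": ":memory:"}
_connections = {}

def connection_for(customer_id):
    """Return (and cache) the connection for the database holding this customer."""
    if customer_id not in _connections:
        conn = sqlite3.connect(TENANT_DATABASES[customer_id])
        conn.execute("CREATE TABLE IF NOT EXISTS contact (last_name TEXT)")
        _connections[customer_id] = conn
    return _connections[customer_id]

# Application code never names a database directly, only a customer.
connection_for("acme").execute("INSERT INTO contact VALUES ('Smith')")
acme_rows = connection_for("acme").execute("SELECT COUNT(*) FROM contact").fetchone()[0]
globex_rows = connection_for("globex").execute("SELECT COUNT(*) FROM contact").fetchone()[0]
```

Starting with every customer pointing at the same database costs nothing, and moving a heavy customer to another server later becomes a change to the routing table rather than to the application.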
If you have separate databases, the 80% that is the same becomes almost impossible to keep the same over time. You will end up spending far more money on maintenance.
Luckily, SQL Server has some options for you. First, put the customer-specific information in the same database in a separate schema and the common stuff in a different schema (create a common schema and a schema for each client).
Next, set up data partitioning by client. This can require the proper hardware to do effectively.
Now you have one code base for the common parts, which will propagate changes to all clients at once, and clients are separated for performance using the partitions.

Bad real-world database schemas

Our master's thesis project is creating a database schema analyzer. As a foundation to this, we are working on quantifying bad database design.
Our supervisor has tasked us with analyzing a real world schema, of our choosing, such that we can identify some/several design issues. These issues are to be used as a starting point in the schema analyzer.
Finding a good schema is a bit difficult because we do not want a schema which is well designed in all aspects, but a schema that is more "rare to medium".
We have already scheduled the following schemas for analysis: wikimedia, moodle and drupal. We are not sure which category each fits in. It is not necessary that the schema is open source.
The database engine used is not important, though we would like to focus on SQL Server, PostgreSQL and Oracle.
For now literature will be deferred, as this task is supposed to give us real world examples which can be used in the thesis. i.e. "Design X is perceived by us as bad design, which our analyzer identifies and suggests improvements to", instead of coming up with contrived examples.
I will update this post when we have some kind of a tool ready.
Check the Dell-dvd-store, you can use it for free.
"The Dell DVD Store is an open source simulation of an online ecommerce site with implementations in Microsoft SQL Server, Oracle and MySQL along with driver programs and web applications"
Bill Karwin has written a great book about bad designs: SQL antipatterns
I'm working on a project including a geographical information system. And in my opinion these designs are often "medium" to "rare".
Here are some examples:
1) Geonames.org
You can find the data and the schema here: http://download.geonames.org/export/dump/ (scroll down to the bottom of the page for the schema, it's in plain text on the site !)
It'd be interesting how this DB design performs with such a HUGE amount of data!
2) OpenGeoDB
This one is very popular in German-speaking countries (Germany, Austria, Switzerland) because it's a database containing nearly every city/town/village in the German-speaking region with zip code, name, hierarchy and coordinates.
This one comes with a .sql schema and the table fields are in English, so this shouldn't be a problem.
http://fa-technik.adfc.de/code/opengeodb/
The interesting thing in both examples is how they managed the hierarchy of entities like Country -> State -> County -> City -> Village etc.
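For comparison, one common way to store such a hierarchy is an adjacency list (each row points at its parent) walked with a recursive query. This is a generic sketch, not necessarily how Geonames or OpenGeoDB do it; the place table and its rows are invented, and sqlite3 is used only so the example runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE place (
    place_id  INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES place(place_id),  -- NULL for the top level
    level     TEXT NOT NULL,
    name      TEXT NOT NULL
);
INSERT INTO place VALUES
    (1, NULL, 'Country', 'Country A'),
    (2, 1,    'State',   'State B'),
    (3, 2,    'County',  'County C'),
    (4, 3,    'City',    'City D');
""")

# Walk from one place up to the root, then list the path top-down.
path = conn.execute("""
    WITH RECURSIVE ancestors(place_id, parent_id, name, depth) AS (
        SELECT place_id, parent_id, name, 0 FROM place WHERE name = 'City D'
        UNION ALL
        SELECT p.place_id, p.parent_id, p.name, a.depth + 1
        FROM place p JOIN ancestors a ON p.place_id = a.parent_id
    )
    SELECT name FROM ancestors ORDER BY depth DESC
""").fetchall()
breadcrumb = " -> ".join(name for (name,) in path)
```

A schema analyzer would have to recognise a self-referencing foreign key like parent_id as a deliberate pattern rather than a flaw.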
PS: Maybe you could judge my DB design too ;) DB Schema of a Role Based Access Control
vBulletin has a really bad database schema.
"we are working on quantifying bad database design."
It seems to me like you are developing a model, or process, or apparatus, that takes a relational schema as input and scores it for quality.
I invite you to ponder the following:
Can a physical schema be "bad" while the logical schema is nonetheless "extremely good" ? Do you intend to distinguish properly between "logical schema" and "physical schema" ? How do you dream to achieve that ?
How do you decide that a certain aspect of physical design is "bad" ? Take for example the absence of some index. If the relvar that that "supposedly desirable index" is to be on, is itself constrained to be a singleton, then what detrimental effects would the absence of that index cause for the system ? If there are no such detrimental effects, then what grounds are there for qualifying the absence of such an index as "bad" ?
How do you decide that a certain aspect of logical design is "bad" ? Choices in logical design are done as a consequence of what the actual requirements are. How can you make any judgment whatsoever about a logical design, without a formalized and machine-readable way to specify what the actual requirements are ?
Wow - you have an ambitious project ahead of you. To determine what is a good database design may be impossible, except for broadly understood principles and guidelines.
Here are a few ideas that come to mind:
I work for a company that does database management for several large retail companies. We have custom databases designed for each of these companies, according to how they intend for us to use the data (for direct mail, email campaigns, etc.), and what kind of analysis and selection parameters they like to use. For example, a company that sells musical equipment in stores and online will want to distinguish between walk-in and online customers, categorize the customers according to the type of items they buy (drums, guitars, microphones, keyboards, recording equipment, amplifiers, etc.), and keep track of how much they spent, and what they bought, over the past 6 months or the past year. They use this information to decide who will receive catalogs in the mail. These mailings are very expensive; maybe one or two dollars per customer, so the company wants to mail the catalogs only to those most likely to buy something. They may have 15 million customers in their database, but only 3 million buy drums, and only 750,000 have purchased anything in the past year.
If you were to analyze the database we created, you would find many "work" tables, that are used for specific selection purposes, and that may not actually be properly designed, according to database design principles. While the "main" tables are efficiently designed and have proper relationships and indexes, these "work" tables would make it appear that the entire database is poorly designed, when in reality, the work tables may just be used a few times, or even just once, and we haven't gone in yet to clear them out or drop them. The work tables far outnumber the main tables in this particular database.
One also has to take into account the volume of the data being managed. A customer base of 10 million may have transaction data numbering 10 to 20 million transactions per week. Or per day. Sometimes, for manageability, this data has to be partitioned into tables by date range, and then a view would be used to select data from the proper sub-table. This is efficient for this huge volume, but it may appear repetitive to an automated analyzer.
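To make that concrete, here is a small sketch of date-range tables behind a view, plus the kind of naive structural check that would flag them as repetitive. The table names and the monthly split are assumptions, and sqlite3 is used only so the example is runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One physical table per date range (monthly here, purely as an example).
CREATE TABLE txn_2024_01 (txn_date TEXT, customer_id INTEGER, amount REAL);
CREATE TABLE txn_2024_02 (txn_date TEXT, customer_id INTEGER, amount REAL);
-- The view presents the sub-tables as one logical table.
CREATE VIEW txn AS
    SELECT * FROM txn_2024_01
    UNION ALL
    SELECT * FROM txn_2024_02;
""")
conn.execute("INSERT INTO txn_2024_01 VALUES ('2024-01-15', 1, 20.0)")
conn.execute("INSERT INTO txn_2024_02 VALUES ('2024-02-03', 1, 5.5)")
conn.execute("INSERT INTO txn_2024_02 VALUES ('2024-02-20', 2, 12.0)")

# Queries go through the view and never mention a sub-table by name.
february_total = conn.execute(
    "SELECT SUM(amount) FROM txn WHERE txn_date >= '2024-02-01'"
).fetchone()[0]

def columns(table):
    """Column names and declared types, as a naive analyzer might compare them."""
    return [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]

# Identical shapes look like duplication unless the analyzer knows about partitioning.
looks_duplicated = columns("txn_2024_01") == columns("txn_2024_02")
```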
Your analyzer would need to be user configurable before the analysis began. Some items must be skipped, while others may be absolutely critical.
Also, how does one analyze stored procedures and user-defined functions, etc? I have seen some really ugly code that works quite efficiently. And, some of the ugliest, most inefficient code was written for one-time use only.
OK, I am out of ideas for the moment. Good luck with your project.
If you can get ahold of it, the project management system Clarity has a horrible database design. I don't know if they have a trial version you can download.

Questions and considerations to ask client for designing a database

As the title says, I would like to hear your advice on the most important questions to consider and ask end users before designing a database for their application. We are building a database-oriented app, with special attention paid to DB security (access control, encryption, integrity, backups). The database will also hold some personal information about people, which is considered sensitive under the law, so security must be good.
I have worked on school projects with databases, but this is my first time working "in the real world", where DB security has real implications.
I found some advice and questions to ask on the internet, but I always get the best ones here. All help appreciated!
Thank you!
Some other specifics besides what has already been said:
Do you have any regulatory requirements for data access and storage (Sarbanes-Oxley and HIPAA come to mind)?
Do you need to be able to audit record changes?
What internal controls do you need reflected in the database?
What business rules must be followed, and under what circumstances?
How large do you expect the data to get? The larger the expected data store, the more critical it is to design with performance in mind from the start.
How flexible do you want the system to be (do you want to be able to add columns on the fly, or add business rules)? Be careful with this one; make sure the client understands that flexibility often comes at the cost of performance.
Do you need a separate data warehouse for reporting?
How do you need the data populated? Will it come from an application, multiple applications, data imports, or a combination?
What databases do you currently have licenses for? Do you want this application to use one of them?
Will different groups of users need different access?
How is the process currently being handled? Can we have access to that database or see the current process in action? Observe, for a minimum of one day, the client using the current system. Take extensive notes; you will learn many things no one will think to tell you.
Do you need to migrate data from the old system?
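If the client answers "yes" to the audit question above, one possible shape for it is an audit table filled by a trigger. This is just a sketch in SQLite syntax via Python's sqlite3; the person and person_audit tables are invented, and SQL Server's trigger syntax differs (it uses the inserted and deleted pseudo-tables instead of OLD and NEW).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE person_audit (
    person_id  INTEGER,
    old_email  TEXT,
    new_email  TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Every change to person.email leaves a row behind in person_audit.
CREATE TRIGGER trg_person_email_audit
AFTER UPDATE OF email ON person
BEGIN
    INSERT INTO person_audit (person_id, old_email, new_email)
    VALUES (OLD.person_id, OLD.email, NEW.email);
END;
""")

conn.execute("INSERT INTO person VALUES (1, 'old@example.com')")
conn.execute("UPDATE person SET email = 'new@example.com' WHERE person_id = 1")

audit_rows = conn.execute(
    "SELECT person_id, old_email, new_email FROM person_audit"
).fetchall()
```

Who made the change is not captured here; that usually needs the application to pass the user identity along, which is another thing worth asking the client about.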
I would start with:
Please explain your business to me.
Which processes are you looking to automate or improve?
Do you have any reports you need to generate?
Do you need inputs to any other systems?
use cases (google for that, it does not need to be drawings, text is fine)
inputs
outputs
static data
historical data
From there you derive the info you need to store, you apply 4th NF, and go !
Good luck ! 8-))

Referrals DB schema

I'm coding a new {monthly|yearly} paid site with the now typical "referral" system: when a new user signs up, they can specify the {username|referral code} of another user (this can be detected automatically if they came through a special URL), which will cause the referrer to earn a percentage of anything the new user pays.
Before reinventing the wheel, I'd like to know if any of you have experience with storing this kind of data in a relational DB. Currently I'm using MySQL, but I believe any good solution should be easily adapted to any RDBMS, right?
I'm looking to support the following features:
Online billing system - once each invoice is paid, earnings for referrals are calculated and they will be able to cash-out. This includes, of course, having the possibility of browsing invoices / payments online.
Paid options vary - they are different in nature and in costs (which will vary sometime), so commissions should be calculated based on each final invoice.
Keeping track of referrals (relationship between users, date in which it was referred, and any other useful information - any ideas?)
A simple way to access historical referring data (how much has been paid) or accrued commissions.
In the future, I might offer to exchange accrued cash for subscription renewal (covering the whole of the new subscription or just a part of it, having to pay the difference if needed)
Multiple levels - I'm thinking of paying something around 10% of direct referred earnings + 2% the next level, but this may change in the future (add more levels, change percentages), so I should be able to store historical data.
Note that I'm not planning to use this in any other project, so I'm not worried about it being "plug and play".
Have you done any work with similar requirements? If so, how did you handle all this stuff? Would you recommend any particular DB schema? Why?
Is there anything I'm missing that would help making this a more flexible implementation?
Rather marvellously, there's a library of database schemas. Although I can't see something specific to referrals, there may be something related. At least (hopefully) you should be able to get some ideas.
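As a starting point for the requirements in the question, here is a minimal sketch, not a recommended final design. Every table and function name is an assumption, amounts are kept in integer cents, and sqlite3 is used only to make it runnable (the SQL should port to MySQL with small changes). The rate that applied is copied onto each commission row, so changing percentages or adding levels later does not rewrite history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (
    user_id     INTEGER PRIMARY KEY,
    username    TEXT NOT NULL UNIQUE,
    referrer_id INTEGER REFERENCES user(user_id),  -- who referred this user, if anyone
    referred_at TEXT
);
CREATE TABLE commission_rate (level INTEGER PRIMARY KEY, percent INTEGER NOT NULL);
CREATE TABLE invoice (
    invoice_id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL,
    amount_cents INTEGER NOT NULL, paid_at TEXT
);
CREATE TABLE commission (
    invoice_id INTEGER NOT NULL, beneficiary_id INTEGER NOT NULL,
    level INTEGER NOT NULL, percent INTEGER NOT NULL,  -- rate in force at the time
    amount_cents INTEGER NOT NULL
);
INSERT INTO user VALUES (1, 'alice', NULL, NULL),
                        (2, 'bob',   1, '2024-01-05'),
                        (3, 'carol', 2, '2024-02-10');
INSERT INTO commission_rate VALUES (1, 10), (2, 2);  -- 10% direct, 2% next level
""")

def record_paid_invoice(invoice_id, user_id, amount_cents, paid_at):
    """Store a paid invoice and one commission row per referral level above the payer."""
    conn.execute("INSERT INTO invoice VALUES (?, ?, ?, ?)",
                 (invoice_id, user_id, amount_cents, paid_at))
    rates = dict(conn.execute("SELECT level, percent FROM commission_rate"))
    current, level = user_id, 1
    while level in rates:
        row = conn.execute("SELECT referrer_id FROM user WHERE user_id = ?",
                           (current,)).fetchone()
        if row is None or row[0] is None:
            break  # reached the top of the referral chain
        referrer = row[0]
        conn.execute("INSERT INTO commission VALUES (?, ?, ?, ?, ?)",
                     (invoice_id, referrer, level, rates[level],
                      amount_cents * rates[level] // 100))
        current, level = referrer, level + 1

record_paid_invoice(1, 3, 5000, "2024-03-01")  # carol pays 50.00
commissions = conn.execute(
    "SELECT beneficiary_id, level, percent, amount_cents FROM commission ORDER BY level"
).fetchall()
```

Accrued earnings per user are then a SUM over commission grouped by beneficiary_id; cash-outs or credit applied to renewals could be recorded as negative rows in a similar ledger table.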
