Mini/Junk Dimension Advice - sql-server

I have a Client Dimension and a Fact table which tracks Sessions with Clients. They have the following columns:
Code:
[DimClient]
----------
PK_ClientKey
ClientNumber
EmailAddress
Postcode
PostcodeLongitude
PostcodeLatitude
DateOfBirth
Gender *
Sexuality *
CulturalIdentity *
LanguageSpokenAtHome *
CountryOfBirth
UsualAccommodation *
LivingWith *
OccupationStatus *
HighestLevelOfSchooling *
RegistrationDate
LastLoginDate
Status
[FactSession]
-------------
PK_SessionKey
FK_ClientKey
...
My first requirement was to group Clients by their age at the time of a specific Session (FactSession). The best way to approach this was to create an Age Group dimension (DimAgeGroup) and add a foreign key to it (FK_AgeGroupKey) in FactSession.
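As a rough illustration of the age grouping described here, this is a minimal Python sketch that works out a client's age on the session date and maps it to a band. The band boundaries and the names AGE_GROUPS, age_on and age_group_for_session are assumptions made for the example; the post does not say which groups DimAgeGroup holds.

```python
from datetime import date

# Hypothetical age bands; the real DimAgeGroup rows are not given in the post.
AGE_GROUPS = [
    (0, 17, "Under 18"),
    (18, 24, "18-24"),
    (25, 34, "25-34"),
    (35, 49, "35-49"),
    (50, 64, "50-64"),
    (65, 200, "65+"),
]


def age_on(date_of_birth: date, on_date: date) -> int:
    """Whole years between date_of_birth and on_date."""
    had_birthday = (on_date.month, on_date.day) >= (date_of_birth.month, date_of_birth.day)
    return on_date.year - date_of_birth.year - (0 if had_birthday else 1)


def age_group_for_session(date_of_birth: date, session_date: date) -> str:
    """Label of the band that contains the client's age on the session date."""
    age = age_on(date_of_birth, session_date)
    for low, high, label in AGE_GROUPS:
        if low <= age <= high:
            return label
    return "Unknown"  # e.g. a date of birth after the session date
```

Because the band is derived from the session date rather than from today's date, the same client can fall into different groups for different sessions.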
Now I'm thinking it would be good to track all the columns marked with an * above. These could (not yet proven) correlate strongly with Sessions. Reading through the DWH Toolkit, it seems a Mini Dimension that accommodates all the * columns along with the Age Group would suit best, so I put together the following structure:
Code:
[DimClient]
----------
PK_ClientKey
ClientNumber
...
Status
[DimDemographic]
-----------------
PK_DemographicKey
AgeGroup
Gender
Sexuality
...
HighestLevelOfSchooling
[FactSession]
-------------
PK_SessionKey
FK_ClientKey
FK_DemographicKey
The DimDemographic table would need to use SCD Type 2 to track changes over time. Would this be the best approach for my requirements?
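One way the FK_DemographicKey might be resolved while loading a session row is a get-or-create lookup on the attribute combination. The sketch below is only an illustration under assumptions: it uses Python's sqlite3 module as a stand-in for SQL Server, keeps just three of the starred attributes, and the function name demographic_key is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE DimDemographic (
        PK_DemographicKey INTEGER PRIMARY KEY,
        AgeGroup          TEXT NOT NULL,
        Gender            TEXT NOT NULL,
        OccupationStatus  TEXT NOT NULL,
        UNIQUE (AgeGroup, Gender, OccupationStatus)
    )
    """
)


def demographic_key(age_group, gender, occupation_status):
    """Return the key for this attribute combination, inserting it on first sight."""
    profile = (age_group, gender, occupation_status)
    conn.execute(
        "INSERT OR IGNORE INTO DimDemographic (AgeGroup, Gender, OccupationStatus) "
        "VALUES (?, ?, ?)",
        profile,
    )
    row = conn.execute(
        "SELECT PK_DemographicKey FROM DimDemographic "
        "WHERE AgeGroup = ? AND Gender = ? AND OccupationStatus = ?",
        profile,
    ).fetchone()
    return row[0]


# The same client at two points in time: each session row would store the
# key of the profile that applied when the session happened.
first = demographic_key("18-24", "Female", "Student")
later = demographic_key("25-34", "Female", "Employed")
again = demographic_key("18-24", "Female", "Student")
```

Here `first` and `again` resolve to the same row, so the table grows with the number of distinct combinations rather than with the number of clients.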
Additionally, I have RegistrationDate and LastLoginDate columns on my Client Dimension. When a Client registers but never logs in, what would be the best value to put in the LastLoginDate field: something like '1900-01-01', or NULL?
Sorry for the long post, but hopefully I have given enough information. Thanks in advance!

I would add a field to your Client dimension to indicate that the user has never logged in. Something like:
select * from DimClient where HasUserLoggedIn = 'NO';
It's very human readable and you won't have to teach your business users about nulls. Traditionally nulls are bad in a Data Warehouse except in the case of numeric fact values, due to the complexities of null != null.

Yes, the above solution should work fine. It supports your need to track changes over time; otherwise you could have included the DimDemographic link directly in DimClient.
Regarding the date question, I believe you should use NULL. It means that there is no value because there was no login. Identifying clients who have never logged in would then be:
select * from DimClient where LastLoginDate IS NULL
For me this reads much better than a query that uses an artificial date.
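The two suggestions in this thread can be compared side by side. The sketch below uses Python's sqlite3 module as a stand-in for SQL Server, with made-up client rows; it keeps LastLoginDate as NULL and derives a HasUserLoggedIn flag from it, and it also shows the null != null pitfall mentioned earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DimClient ("
    "PK_ClientKey INTEGER PRIMARY KEY, ClientNumber TEXT, "
    "RegistrationDate TEXT, LastLoginDate TEXT)"
)
# Illustrative rows only: C002 registered but never logged in.
conn.executemany(
    "INSERT INTO DimClient VALUES (?, ?, ?, ?)",
    [
        (1, "C001", "2014-01-05", "2014-02-01"),
        (2, "C002", "2014-01-09", None),
    ],
)

# NULL approach: test with IS NULL.
never_logged_in = conn.execute(
    "SELECT ClientNumber FROM DimClient WHERE LastLoginDate IS NULL"
).fetchall()

# The pitfall: '= NULL' is never true, so this matches no rows.
equals_null = conn.execute(
    "SELECT ClientNumber FROM DimClient WHERE LastLoginDate = NULL"
).fetchall()

# Flag approach: derive a readable column for business users.
flags = conn.execute(
    "SELECT ClientNumber, "
    "CASE WHEN LastLoginDate IS NULL THEN 'NO' ELSE 'YES' END AS HasUserLoggedIn "
    "FROM DimClient ORDER BY PK_ClientKey"
).fetchall()
```

Deriving the flag from the NULL keeps a single source of truth while still giving report users a plain 'YES'/'NO' column.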

Related

Is there any contradiction against too many values in a table (database)?

I was wondering if there's any argument against, or any future problem with, a database table that contains about 80 columns. They will be mostly VARCHARs, a few INTs and maybe 1 or 2 MESSAGE columns. I did some research on the net but found nothing that really discusses this kind of problem. In other words, is it okay, or even 'normal', to put that many values inside one table?
Thanks in advance!
You shouldn't have any real problems if the fields are mostly integers. Most DBMSes have a limit on row length, so a bunch of long columns can cause issues...but unless the varchar columns are very long, you're probably OK.
I've honestly never even needed to think about that, though -- with a properly normalized database, it's quite rare to ever need that many columns in a table.
The more columns you have, the more memory the server needs to process the records.
I recommend using a "many-to-one" relation scheme in this case.
Example of tables:
customer
    id
    name
    email
    ...

ins_app_form (Insurance application form)
    id
    customer_id (relation with customer)
    date
    ... (here comes some other data if you need)

ins_app_item (Insurance application form items/fields)
    id
    ins_app_form_id (relation with Insurance application form)
    question (the name of a question in application form)
    answer (customer's answer)
So to show the application form with this scheme you will need to run a query:
SELECT
    iaf.id AS application_id,
    iaf.date AS `date`,
    iai.question,
    iai.answer
FROM ins_app_form AS iaf
LEFT JOIN ins_app_item AS iai ON iai.ins_app_form_id = iaf.id
WHERE iaf.customer_id = <ID of a customer>
This query will bring you something like this:
application_id  date        question     answer
1               2014-03-31  "Year"       "2008"
1               2014-03-31  "Car make"   "Audi"
1               2014-03-31  "Car model"  "Q7"
...
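To try the scheme end to end, here is a small sketch that builds the three tables in an in-memory SQLite database from Python and runs the same join. SQLite stands in for the original DBMS, and the customer name, email address and ids are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE ins_app_form (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        "date" TEXT
    );
    CREATE TABLE ins_app_item (
        id INTEGER PRIMARY KEY,
        ins_app_form_id INTEGER REFERENCES ins_app_form(id),
        question TEXT,
        answer TEXT
    );
    -- Illustrative data only.
    INSERT INTO customer VALUES (7, 'Example Customer', 'customer@example.com');
    INSERT INTO ins_app_form VALUES (1, 7, '2014-03-31');
    INSERT INTO ins_app_item VALUES
        (1, 1, 'Year', '2008'),
        (2, 1, 'Car make', 'Audi'),
        (3, 1, 'Car model', 'Q7');
    """
)

rows = conn.execute(
    """
    SELECT iaf.id AS application_id, iaf."date", iai.question, iai.answer
    FROM ins_app_form AS iaf
    LEFT JOIN ins_app_item AS iai ON iai.ins_app_form_id = iaf.id
    WHERE iaf.customer_id = ?
    ORDER BY iai.id
    """,
    (7,),
).fetchall()

# Collapse the question/answer rows back into one record per form.
answers = {question: answer for _, _, question, answer in rows}
```

The trade-off of this layout is that every answer is stored as text, so type checks and per-field constraints move from the schema into application code.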

Fuzzy name matching algorithm

I have a database containing names of certain blacklisted companies and individuals.
The details of every transaction that is created need to be scanned against these blacklisted names. The names in a transaction may be misspelled; for example, "Wilson" may be written as "Wilson", "Vilson" or "Veelson". The fuzzy search logic or utility should match these against the name "Wilson" in the blacklist database and, based on the required accuracy percentage set by the user, show the names that match within that percentage.
The transactions will be sent in batches or in real time to be checked against the blacklisted names.
I would appreciate it if users who have had a similar requirement and implemented it could share their views and implementation.
T-SQL leaves a lot to be desired in the realm of fuzzy search. Your best options are third-party libraries, but if you don't want to mess with that, your best bet is the DIFFERENCE function built into SQL Server. For example:
SELECT * FROM tblUsers U WHERE DIFFERENCE(U.Name, @nameEntered) >= 3
DIFFERENCE returns a value from 0 to 4, and a higher value indicates a closer match. A drawback is that the algorithm compares SOUNDEX codes, so it favors words that sound alike, which may not be the characteristic you want.
This next example shows how to get the best match out of a table:
DECLARE @users TABLE (Name VARCHAR(255))
INSERT INTO @users VALUES ('Dylan'), ('Bob'), ('Tester'), ('Dude')
SELECT *, MAX(DIFFERENCE(Name, 'Dillon')) AS SCORE FROM @users GROUP BY Name ORDER BY SCORE DESC
It returns:
Name   | Score
Dylan    4
Dude     3
Bob      2
Tester   0
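The question asks for a user-set accuracy percentage, which the 0 to 4 scale of DIFFERENCE cannot express directly. As a hedged illustration outside SQL Server, the sketch below swaps in a different technique: the character-based similarity ratio from Python's standard difflib module, scaled to a percentage. The BLACKLIST contents and the function names are assumptions for the example, and this is not the approach the answer above describes.

```python
from difflib import SequenceMatcher

# Hypothetical blacklist; in practice this would come from the database.
BLACKLIST = ["Wilson", "Acme Trading"]


def similarity(a: str, b: str) -> float:
    """Similarity of two names as a percentage (0 to 100), ignoring case."""
    return 100.0 * SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def blacklist_hits(name: str, min_percent: float):
    """Blacklisted names scoring at least min_percent, best match first."""
    scored = [(entry, round(similarity(name, entry), 1)) for entry in BLACKLIST]
    hits = [hit for hit in scored if hit[1] >= min_percent]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

With this measure "Vilson" clears an 80 percent threshold against "Wilson", while "Veelson" only matches once the threshold is lowered to around 60, so the threshold directly controls how many misspellings are caught and how many false positives come with them.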

How to find the minimum fields needed to identify a unique row in a set of data

Say I have a bunch of data on some people. This could include Name, DOB, Address, Email, etc... Assume there are no unique identifiers (like an id column) on this data, but also assume that there are no repeating rows. I need to figure out the minimum set of fields I can use to query that data and return a unique row.
An example of a solution would be: "I can make a query that specifies a first name, dob, email, and zip, and that would return exactly one or zero rows."
Did I ask that in a way that makes sense? I am looking for a technique, algorithm, or software package that would solve this problem for a given set of data. Anything that could provide an answer would work. Thanks!
EXAMPLE DATA (the real stuff is much more complex):
FNAME    LNAME    DOB     ZIP    EMAIL
John     Smith    1/1/12  77777  dude@fake.com
Sean     Smith    1/2/08  77777  dude@fake.com
Sean     William  4/2/07  77789  stuff@fake.com
Richard  Ross     1/1/12  78989  foo@fake.com
The solution for this set of data would be (FNAME, LNAME) or (EMAIL, DOB) or (EMAIL, FNAME).
I think you will need an iterative approach.
Perhaps you can begin with each single column and attempt to create a unique index on it.
If that succeeds, you are done.
If you are unable to create the unique index, add another column and try again.
Repeat this across the column combinations until you can successfully create the index.
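The iterative idea can also be tried in memory before touching any indexes. The following minimal sketch checks column combinations from smallest to largest against the example rows from the question; the function names are made up for the illustration, and the search is brute force, so its cost grows exponentially with the number of columns.

```python
from itertools import combinations

COLUMNS = ("FNAME", "LNAME", "DOB", "ZIP", "EMAIL")
ROWS = [
    ("John", "Smith", "1/1/12", "77777", "dude@fake.com"),
    ("Sean", "Smith", "1/2/08", "77777", "dude@fake.com"),
    ("Sean", "William", "4/2/07", "77789", "stuff@fake.com"),
    ("Richard", "Ross", "1/1/12", "78989", "foo@fake.com"),
]


def is_unique(rows, indexes):
    """True if no two rows share the same values in the given column positions."""
    seen = set()
    for row in rows:
        key = tuple(row[i] for i in indexes)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_unique_column_sets(columns, rows):
    """All smallest column combinations that identify every row uniquely."""
    for size in range(1, len(columns) + 1):
        found = [
            tuple(columns[i] for i in indexes)
            for indexes in combinations(range(len(columns)), size)
            if is_unique(rows, indexes)
        ]
        if found:
            return found  # stop at the first size that works
    return []


candidates = minimal_unique_column_sets(COLUMNS, ROWS)
```

A result only holds for the rows that were checked; a combination that is unique today can stop being unique when new data arrives.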

DB Design: Sort Order for Lookup Tables

I have an application where the database back-end has around 15 lookup tables. For instance there is a table for Counties like this:
CountyID(PK) County
49001 Beaver
49005 Cache
49007 Carbon
49009 Daggett
49011 Davis
49015 Emery
49029 Morgan
49031 Piute
49033 Rich
49035 Salt Lake
49037 San Juan
49041 Sevier
49043 Summit
49045 Tooele
49049 Utah
49051 Wasatch
49057 Weber
The UI for this app has a number of combo boxes in various places for these lookup tables, and my client has asked that the boxes list their entries in a custom order. For counties, that order is:
CountyID(PK) County
49035 Salt Lake
49049 Utah
49011 Davis
49057 Weber
49045 Tooele
... (the rest, alphabetically)
The best plan I have for accomplishing this is to add a column to each lookup table for SortOrder(numeric). I had a colleague tell me he thought that would cause the tables to violate 3rd-Normal-Form, but I think the sort order still depends on the key and only the key (even though the rest of the list is alphabetical).
Is adding the SortOrder column the best way to do this, or is there a better way I am just not seeing?
I agree with @cletus that a sort order column is a good way to go and it does not violate 3NF (because, as you said, the sort order column entries are functionally dependent on the candidate keys of the table).
I'm not sure I agree that alphanumeric is better than numeric. In the specific case of counties, there are seldom new ones created. But there is no requirement that the numbers assigned are sequential; you can allocate them with numbers that are a multiple of a hundred, for example, leaving ample room for insertions.
Yes I agree a sort order column is the best solution when the requirements call for a custom sort order like the one you cite. I wouldn't go with a numeric column however. If the data is alphanumeric, the sort order should be alphanumeric. That way you can seed the value with whatever is in the county field.
If you use a numeric field you'll have to resequence the entire table (potentially) whenever you add a new entry. So:
Columns: ID, County, SortOrder
Seed:
UPDATE County SET SortOrder = CONCAT('M-', County)
and for the special cases:
UPDATE County
SET SortOrder = CONCAT('E-', County)
WHERE County IN ('Salt Lake', 'Utah', 'Davis', 'Weber', 'Tooele')
Arguably you may want to put another marker column in to indicate those entries are special.
I went with numeric and large multiples.
Even with the CONCAT('E-'.. example, I don't get the required sort order. That would give me Davis, SL, Tooele... and Salt Lake needs to be first.
I ended up using multiples of 10 and assigned the non-special-sort entries a value like 10000. That way the view for each lookup can have
ORDER BY SortOrder ASC, OtherField ASC
Another programmer suggested using DECODE in Oracle, or CASE statements in SQL Server, but this is a more general solution. YMMV.
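The numeric approach can be checked quickly with the county list from the question. This sketch uses Python's sqlite3 module as a stand-in for the real database; the exact SortOrder values (10 to 50 for the special entries, 10000 for everything else) are just one possible assignment in the spirit of the answer above.

```python
import sqlite3

COUNTIES = [
    (49001, "Beaver"), (49005, "Cache"), (49007, "Carbon"), (49009, "Daggett"),
    (49011, "Davis"), (49015, "Emery"), (49029, "Morgan"), (49031, "Piute"),
    (49033, "Rich"), (49035, "Salt Lake"), (49037, "San Juan"), (49041, "Sevier"),
    (49043, "Summit"), (49045, "Tooele"), (49049, "Utah"), (49051, "Wasatch"),
    (49057, "Weber"),
]
# Explicit positions for the entries the client wants first (multiples of 10).
SPECIAL = {"Salt Lake": 10, "Utah": 20, "Davis": 30, "Weber": 40, "Tooele": 50}
DEFAULT_SORT_ORDER = 10000  # shared by all entries that just sort alphabetically

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE County (CountyID INTEGER PRIMARY KEY, County TEXT, SortOrder INTEGER)"
)
conn.executemany(
    "INSERT INTO County VALUES (?, ?, ?)",
    [(cid, name, SPECIAL.get(name, DEFAULT_SORT_ORDER)) for cid, name in COUNTIES],
)

ordered = [
    name
    for (name,) in conn.execute(
        "SELECT County FROM County ORDER BY SortOrder ASC, County ASC"
    )
]
```

Since every non-special row shares the same SortOrder, the second ORDER BY column does the alphabetical work, and a new county needs no renumbering unless it has to join the special group.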

Select exclusively a field from a table

I have to add a coupon table to my db. There are 3 types of coupons : percentage, amount or 2 for 1.
So far I've come up with a coupon table that contains these 3 fields. If there's a percentage value not set to null then it's this kind of coupon.
I feel it's not the proper way to do it. Should I create a CouponType table and how would you see it? Where would you store these values?
Any help or cue appreciated!
Thanks,
Teebot
You're correct, I think a CouponType table would be a good fit for your problem.
Two tables: Coupons and CouponTypes. Store the CouponTypeId inside the Coupons table.
So, for example, you'll have a Coupon record called "50% off". It would reference the percent-off CouponType record, and from there you could determine the logic to take 50% off the cost of the item.
So now you can create unlimited coupons, if it's a dollar amount coupon type it will take the "amount" column and treat it as a dollar amount. If it's a percent off it will treat it as a percentage and if it's an "x for 1" deal, it will treat the value as x.
- Table Coupons
- ID
- name
- coupon_type_id # (or whatever fits your style guidelines)
- amount # Example: 10.00 (treated as $10 off for amount type, treated as
# 10% for percent type or 10 for 1 with the final type)
- expiration_date
- Table CouponTypes
- ID
- type # (amount, percent, <whatever you decided to call the 2 for 1> :))
In the future you might have many more coupon types. You could also have different business logic associated with them; you never know. It's always worth doing things properly in a case like this, so yes, definitely create a coupon type field and an associated dictionary table to go with it.
I would definitely create a CouponType lookup table. That way you avoid all the NULLs and allow for more coupon types in the future.
Coupon
coupon_id INT
name VARCHAR
coupon_type_id INT <- Foreign Key
CouponType
coupon_type_id INT
type_description VARCHAR
...
Or, I suppose, you could have a CHAR(1) coupon type column in your coupon table.
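To show how application code might branch on the coupon type with a single amount column, here is a minimal Python sketch. The type ids, the type names and the function discounted_total are hypothetical, and the "x for 1" rule (every full group of x items is charged as one item) is an assumption about what that coupon means.

```python
from decimal import Decimal

# Hypothetical contents of the CouponTypes lookup table: id -> type.
COUPON_TYPES = {1: "amount", 2: "percent", 3: "x_for_1"}


def discounted_total(unit_price: Decimal, quantity: int,
                     coupon_type_id: int, amount: Decimal) -> Decimal:
    """Order total after applying one coupon, interpreting amount by type."""
    total = unit_price * quantity
    kind = COUPON_TYPES[coupon_type_id]
    if kind == "amount":
        # amount is a currency value taken off the total, never below zero
        return max(total - amount, Decimal("0"))
    if kind == "percent":
        # amount is a percentage, e.g. 10 means 10% off
        return total * (Decimal("100") - amount) / Decimal("100")
    if kind == "x_for_1":
        # amount is x: each full group of x items is charged as one item
        x = int(amount)
        paid_items = quantity // x + quantity % x
        return unit_price * paid_items
    raise ValueError(f"unknown coupon type: {kind}")
```

For instance, with a unit price of 20.00, a 10.00 amount coupon on two items gives 30.00, a 10 percent coupon gives 36.00, and a 2 for 1 coupon on five items charges for three of them.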
