To create a DIM table or not

I have 3 fields from our transactional source table (Status, Abnormal, Type) which are all coded values. None of them has a code table to link to; the description for each code is only defined in the data dictionary.
I am sure that these codes rarely change, if at all.
In a scenario like this, is it still worth creating a DIM table for each field in order to have a proper FK in the fact table?
One concern I'm not sure about is the added requirement to handle reloading or patching of keys when new codes become available. Are there any issues with just storing the actual codes in the fact table and having the reporting tool do the decoding of descriptions instead?
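For illustration, the two alternatives being weighed might be sketched like this in SQL Server-style DDL (all table and column names here are made up, and the descriptions would be seeded by hand from the data dictionary):

-- Option 1: a small dimension per coded field, with the fact carrying a surrogate key
CREATE TABLE DimStatus (
    StatusKey  int IDENTITY PRIMARY KEY,
    StatusCode varchar(10) NOT NULL UNIQUE,
    StatusDesc varchar(100) NOT NULL
);

CREATE TABLE FactTransaction (
    TransactionKey int IDENTITY PRIMARY KEY,
    StatusKey      int NOT NULL REFERENCES DimStatus (StatusKey)
    -- ... Abnormal and Type handled the same way ...
);

-- Option 2: store the raw code in the fact and let the reporting tool decode it;
-- FactTransaction would simply carry StatusCode varchar(10) and no dimension table exists.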

Related

How to send data from OLE DB source to Anchor model tables using ETL procedure?

I'm currently solving this task: some data needs to be sent from AdventureWorks2012 to Anchor model tables on the same MS SQL Server instance.
This is my Anchor Model
At this point I have a pretty simple Integration Services project in Visual Studio and it looks like this.
Control flow:
For example Load_territories is:
The main requirement is to fill all of the Anchor model tables in MS SQL Server, but I keep running into a problem: the number of attributes per table differs, and some of them repeat.
In this picture, the second table's TR_ID, TR_GRP_TR_ID, TR_TID_TR_ID and TR_TNM_TR_ID columns basically contain the same values from dwh_key, but it's impossible to create a one-to-many relation between attributes. My tutor recommended using a Lookup, but I cannot figure out how to implement it in this project.
This may be considered cheating, but if you insert data into the latest view rather than the separate 6NF tables, all of those ID fields will be populated by the underlying trigger logic. I suspect that this would defeat the purpose of using SSIS, though, since you would effectively be loading attributes sequentially rather than in parallel.
Another option is to leave surrogate key management to the ETL tool. This would require that you switch the data type for your identities from integers to GUIDs. SSIS can then generate a GUID, and you can use that very same GUID to populate all the attributes. Note that the anchor would have to be loaded first, or you will get a foreign key violation.
The most common solution, though, is to leave surrogate key management to the database (and use integers). You would have a step in which you insert the desired number of new rows into the anchor, tagging them with a metadata (batch) identifier. Using that metadata number you can then select the newly generated identities and merge them into your data flow. It doesn't matter which number gets assigned to which row. After that, all attributes can be populated in parallel, including their ID columns.
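As a rough T-SQL sketch of that approach (the anchor table and column names TR_Territory, TR_ID and Metadata_TR are assumptions here; substitute whatever your generated Anchor model DDL actually uses):

DECLARE @batch int = 42      -- metadata/batch identifier for this load
DECLARE @rows  int = 1000    -- number of new territories found in the source
DECLARE @i     int = 0

-- Step 1: create the required number of new identities in the anchor, tagged with the batch id
WHILE @i < @rows
BEGIN
    INSERT INTO dbo.TR_Territory (Metadata_TR) VALUES (@batch)
    SET @i = @i + 1
END

-- Step 2: read the generated identities for this batch back out and merge them into the
-- data flow; every attribute table can then be loaded in parallel using the same TR_ID values
SELECT TR_ID
FROM dbo.TR_Territory
WHERE Metadata_TR = @batch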
Of course, if this is intended to be used for more than an initial load, you would also have to add steps to detect if the data you are loading is already known or not.
I can also recommend watching the video tutorial referenced in this blog post: https://clinthuijbers.wordpress.com/2013/06/14/ssis-anchor-modeling-example-tutorial/

How do I automatically populate a new table in MS-Access with info from an existing table?

I am very new to MS Access and I am struggling with some things that seem like they should be the most basic. I have imported a table of data from Excel and have defined the data types for the fields. I have no problem there, but now I want to make a new table that has as a primary key one of the fields from the imported table. It looks like I can manually create this table, set the relationship, and then go back and type in each record associated with the new primary key, but this seems completely ridiculous. Surely there must be a way to automatically create one record for each unique instance in the matching field from the original table. Yet, I've scrolled through hundreds of pages of Access tutorials and Googled the question and found no satisfactory guidance.
Do I completely misunderstand what Access is all about? How do I create a new table with entries from a field on an existing table? What am I missing?
You don't specify which version of Access you are using; the suggestions listed below apply to 2010, but should be similar in other versions.
You can create new tables from existing tables either by using a 'Make Table' query (after selecting 'Create' -> 'Query Design'), or by manually creating your table first and then using an 'Append' query.
Without knowing the design of your table it's hard to be more specific.
Are you populating your new table's primary key ahead of time, or relying on AutoNumber to do it (the preferred method)?
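As a hedged illustration of the Make Table approach, assuming the imported table is called ImportedData and the field you want as the new primary key is called Category (substitute your real names):

SELECT DISTINCT ImportedData.Category INTO Categories
FROM ImportedData;

This creates a Categories table with one row per unique value. You would then open it in Design view, mark Category as the primary key, and set up the relationship back to ImportedData.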

General database design: Is it ever considered "okay" to create a non-normalized table on purpose?

After-edit: Wow, this question got long. Please forgive =\
I am creating a new table consisting of over 30 columns. These columns are largely populated by selections made from dropdown lists and their options are largely logically related. For example, a dropdown labeled Review Period will have options such as Monthly, Semi-Annually, and Yearly. I came up with a workable method to normalize these options down to numeric identifiers by creating a primitives lookup table that stores values such as Monthly, Semi-Annually, and Yearly. I then store the IDs of these primitives in the table of record and use a view to join that table out to my lookup table. With this view in place, the table of record can contain raw data that only the application understands while allowing external applications and admins to run SQL against the view and return data that is translated into friendly information.
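A minimal sketch of the setup being described (SQL Server syntax assumed; the names Primitives, Records, ReviewPeriodId and RecordsFriendly are made up for illustration):

CREATE TABLE Primitives (
    Id    int IDENTITY PRIMARY KEY,
    Value varchar(50) NOT NULL UNIQUE   -- 'Monthly', 'Semi-Annually', 'Yearly', ...
);

CREATE TABLE Records (
    Id             int IDENTITY PRIMARY KEY,
    ReviewPeriodId int NOT NULL REFERENCES Primitives (Id)
    -- ... roughly 30 more columns like this ...
);
GO

-- The view translates the raw ids back into friendly text for admins and other applications
CREATE VIEW RecordsFriendly AS
SELECT r.Id, p.Value AS ReviewPeriod
FROM Records r
JOIN Primitives p ON p.Id = r.ReviewPeriodId;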
It just got complicated. Now these dropdown lists are going to have non-logically-related items. For example, the Review Period dropdown list now needs to have options of NA and Manual. This blows my entire grouping scheme out of the water.
Similar constructs that have been used in this application have resorted to storing repeated string values across multiple records. This means you could have hundreds of records with the string 'Monthly' stored in the table's ReviewPeriod column. The thought of this happening has made me cringe since I've started working here, but now I am starting to think that non-normalized data may be the best option here.
The only other way I can think of doing this using my initial method while allowing it to be dynamic and support the constant adding of new options to any dropdown list at any time is this: When saving the data to the database, iterate through every single property of my business object (.NET class in this case) and check for any string value that exists in the primitives table. If it doesn't, add it and return the auto-generated unique identifier for storage in the table of record. It seems so complicated, but is this what one has to go through for the sake of normalized data?
Anything is possible. Nobody is going to haul you off to denormalization jail and revoke your DBA card. I would say that you should know the rules and what breaking them means. Once you have those in hand, it's up to you and your best judgement to do what you think is best.
"I came up with a workable method to normalize these options down to numeric identifiers by creating a primitives lookup table that stores values such as Monthly, Semi-Annually, and Yearly. I then store the IDs of these primitives in the table of record and use a view to join that table out to my lookup table."
Replacing text with ID numbers has nothing at all to do with normalization. You're describing a choice of surrogate keys over natural keys. Sometimes surrogate keys are a good choice, and sometimes surrogate keys are a bad choice. (More often a bad choice than you might believe.)
"This means you could have hundreds of records with the string 'Monthly' stored in the table's ReviewPeriod column. The thought of this happening has made me cringe since I've started working here, but now I am starting to think that non-normalized data may be the best option here."
Storing the string "Monthly" in multiple rows has nothing to do with normalization. (Or with denormalization.) This seems to be related to the notion that normalization means "replace all text with id numbers". Storing text in your database shouldn't make you cringe. VARCHAR(n) is there for a reason.
"The only other way I can think of doing this using my initial method while allowing it to be dynamic and support the constant adding of new options to any dropdown list at any time is this: When saving the data to the database, iterate through every single property of my business object (.NET class in this case) and check for any string value that exists in the primitives table. If it doesn't, add it and return the auto-generated unique identifier for storage in the table of record."
Let's think about this informally for a minute.
Foreign keys provide referential integrity. Their purpose is to limit the values allowed in a column. Informally, the referenced table provides a set of valid values. Values that aren't in that table aren't allowed in the referencing column of other tables.
But no matter what the user types in, you're going to add it to that table of valid values.
If you're going to accept everything the user types in the first place, why use a foreign key at all?
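To make the point concrete, here is a minimal sketch (with hypothetical names) of a foreign key that limits a plain text column to a set of valid values, with no surrogate ids involved:

CREATE TABLE ReviewPeriods (
    ReviewPeriod varchar(20) NOT NULL PRIMARY KEY
);

INSERT INTO ReviewPeriods (ReviewPeriod)
VALUES ('Monthly'), ('Semi-Annually'), ('Yearly'), ('NA'), ('Manual');

CREATE TABLE Reviews (
    Id           int IDENTITY PRIMARY KEY,
    ReviewPeriod varchar(20) NOT NULL
        REFERENCES ReviewPeriods (ReviewPeriod)   -- only the listed values are allowed
);

-- If the application inserts every unknown string into ReviewPeriods before saving,
-- the constraint above no longer restricts anything.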
The main problem here is that you've been poorly served by the people who taught you (mis-taught you) the relational model. (And, probably, equally poorly by the people who taught you SQL.) I hope you can unlearn those mistaken notions quickly, and soon make real progress.

Generic version control strategy for select table data within a heavily normalized database

Sorry for the long-winded title, but the requirement/problem is rather specific.
With reference to the following sample (but very simplified) structure (in pseudo SQL), I hope to explain it a bit better.
TABLE StructureName {
Id GUID PK,
Name varchar(50) NOT NULL
}
TABLE Structure {
Id GUID PK,
ParentId GUID, -- FK to Structure
NameId GUID NOT NULL -- FK to StructureName
}
TABLE Something {
Id GUID PK,
RootStructureId GUID NOT NULL -- FK to Structure
}
As one can see, Structure is a simple tree structure (not worried about ordering of children for the problem). StructureName is a simplification of a translation system. Finally 'Something' is simply something referencing the tree's root structure.
This is just one of many tables that need to be versioned, but this one serves as a good example for most cases.
There is a requirement to version any changes to the name and/or the tree 'layout' of the Structure table. Previous versions should always be available.
There seem to be a few possibilities for tackling this issue, like copying the entire structure, but most approaches cause one to 'lose' referential integrity. For example, if one followed this approach, one would have to make a duplicate of the 'Something' record, given that the root structure will be a new record and have a new ID.
Other possible avenues are to look into how wikis handle this, or to go a lot further and look at how proper version control systems work.
Currently, I feel a bit clueless how to proceed on this in a generic way.
Any ideas will be greatly appreciated.
Thanks
leppie
Some quick ideas:
Full copy: Create a copy of the structure, but for every table add a version_id column to the PK and all FKs; thus you can create copies of the live data with complete referential integrity.
pro: easy to query the history
con: large amount of (redundant) data copied
Change copy: Only copy the stuff that actually changes, along with valid_from / valid_to data (see the sketch after this list).
pro: low data volume copied
con: hard to query, because one has to join on intervals
Variation: This applies to both schemes. Instead of creating a copy of the structure, you might keep the current record in the same table as the old versions, but tag it as current.
pro: smaller number of tables, easier mixing of history and current information
con: normal operation operates on much bigger tables, which will cause a performance impact
Auditing log: Depending on your actual requirements, it may be sufficient to just create an audit trail like this:
id, timestamp, changed_table, changed_column, old_value, new_value, changed_by
You might extend that to a full table structure:
transaction, table_change, changed_column
pro: generic, hence easy to implement for a large number of tables
con: if you need to reconstruct the state of a set of records at a given time, querying will become a nightmare
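A rough sketch of the 'change copy' option for the Structure table above, assuming SQL Server syntax (the history table and the @as_of variable are illustrative, not part of the original design):

-- History table holding copies of changed Structure rows, each valid over an interval
CREATE TABLE Structure_History (
    Id         uniqueidentifier NOT NULL,   -- same Id as the row in Structure
    ParentId   uniqueidentifier NULL,
    NameId     uniqueidentifier NOT NULL,
    valid_from datetime NOT NULL,
    valid_to   datetime NOT NULL,           -- use a far-future date for the current version
    PRIMARY KEY (Id, valid_from)
);

-- Reconstructing the tree as it looked at a point in time means joining on intervals
DECLARE @as_of datetime
SET @as_of = '2012-01-01'

SELECT h.Id, h.ParentId, h.NameId
FROM Structure_History h
WHERE h.valid_from <= @as_of
  AND h.valid_to > @as_of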
I wrote a blog about various approaches to versioning, but be warned: it's in German.
The data warehousing folks have several algorithms for "slowly-changing dimensions".
The more sophisticated algorithms provide date ranges around a dimension value to indicate when it's valid.
Depending on your versioning requirements you could do one of these things, cribbed from Kimball's The Data Warehouse Toolkit.
Assign a version number to rows of the structure table. This means you have to do some reasoning to collect a complete structure: it includes the rows with the selected version number, unioned with rows that are unchanged since an earlier version (a sketch follows after this list).
Assign a date range or version range to rows of the structure table. This means that some rows have start dates and end dates; some rows will have end dates at some epoch in the impossible future. Or, if you use version numbers, you'll have a start-end pair or a start-infinity pair that indicates the row is still current. You can then trivially query the rows that are valid "today" or that apply to the requested version.
Clone the structure for each version. This is unpleasant because the clone operation is costly. The queries, however, are trivial because the entire structure is available under a single, consistent version number.
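A hedged sketch of the first option, assuming a Structure_Versioned table keyed on (Id, version_number) with the same columns as Structure (all names illustrative). Collecting version @v means taking, for each row, its most recent version at or before @v:

DECLARE @v int
SET @v = 3   -- the version of the structure we want to reconstruct

SELECT s.Id, s.ParentId, s.NameId, s.version_number
FROM Structure_Versioned s
WHERE s.version_number = (
    SELECT MAX(s2.version_number)
    FROM Structure_Versioned s2
    WHERE s2.Id = s.Id
      AND s2.version_number <= @v
)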

Indexing URLs in SQL Server 2005

What is the best way to deal with storing and indexing URLs in SQL Server 2005?
I have a WebPage table that stores metadata and content about Web Pages. I also have many other tables related to the WebPage table. They all use URL as a key.
The problem is that URLs can be very large, and using them as a key makes the indexes larger and slower. How much slower I don't know, but I have read many times that using large fields for indexing is to be avoided. Assuming a URL is nvarchar(400), they are enormous fields to use as a primary key.
What are the alternatives?
How much pain would there likely be in using the URL as a key instead of a smaller field?
I have looked into the WebPage table having an identity column, and then using this as the primary key for a WebPage. This keeps all the associated indexes smaller and more efficient but it makes importing data a bit of a pain. Each import for the associated tables has to first look up what the id of a url is before inserting data in the tables.
I have also played around with using a hash on the URL, to create a smaller index, but am still not sure if it is the best way of doing things. It wouldn't be a unique index, and would be subject to a small number of collisions. So I am unsure what foreign key would be used in this case...
There will be millions of records about webpages stored in the database, and there will be a lot of batch updating. There will also be quite a lot of activity reading and aggregating the data.
Any thoughts?
I'd use a normal identity column as the primary key. You say:
"This keeps all the associated indexes smaller and more efficient but it makes importing data a bit of a pain. Each import for the associated tables has to first look up what the id of a url is before inserting data in the tables."
Yes, but the pain is probably worth it, and the techniques you learn in the process will be invaluable on future projects.
On SQL Server 2005, you can create a user-defined function GetUrlId that looks something like
CREATE FUNCTION GetUrlId (@Url nvarchar(400))
RETURNS int
AS BEGIN
    DECLARE @UrlId int
    SELECT @UrlId = Id FROM Url WHERE Url = @Url
    RETURN @UrlId
END
This will return the ID for URLs already in your Url table, and NULL for any URL not already recorded. You can then call this function inline in your import statements - something like
INSERT INTO
    UrlHistory (UrlId, Visited, RemoteIp)
VALUES
    (dbo.GetUrlId('http://www.stackoverflow.com/'), @Visited, @RemoteIp)
This is probably slower than a proper join statement, but for one-time or occasional import routines it might make things easier.
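For bulk imports, the set-based alternative alluded to above might look roughly like this (StagingVisits is an assumed staging table holding the raw rows to be imported):

INSERT INTO UrlHistory (UrlId, Visited, RemoteIp)
SELECT u.Id, s.Visited, s.RemoteIp
FROM StagingVisits s
JOIN Url u ON u.Url = s.Url   -- one set-based join instead of a function call per row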
Break up the URL into columns based on the bits you're concerned with and use the RFC as a guide. Reverse the host and domain info so an index can group like domains together (Google does this).
stackoverflow.com -> com.stackoverflow
blog.stackoverflow.com -> com.stackoverflow.blog
Google has a paper that outlines what they do, but I can't find it right now.
http://en.wikipedia.org/wiki/Uniform_Resource_Locator
I would stick with the hash solution. This generates a compact key with a fairly low chance of collision.
An alternative would be to create GUID and use that as the key.
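A minimal sketch of the hash idea on SQL Server 2005 (the WebPage column names here are assumptions): keep the URL itself, add a persisted checksum column for a small non-unique index, and always compare both the hash and the full URL so that collisions are filtered out.

ALTER TABLE WebPage ADD UrlHash AS CHECKSUM(Url) PERSISTED;
CREATE INDEX IX_WebPage_UrlHash ON WebPage (UrlHash);

-- Lookups hit the small hash index first, then the full URL comparison removes collisions
DECLARE @Url nvarchar(400)
SET @Url = N'http://www.stackoverflow.com/'

SELECT Id
FROM WebPage
WHERE UrlHash = CHECKSUM(@Url)
  AND Url = @Url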
I totally agree with Dylan. Use an IDENTITY column or a GUID column as the surrogate key in your WebPage table. That's a clean solution, and the lookup of the id while importing isn't that painful, I think.
Using a big varchar column as the key column wastes a lot of space and hurts insert and query performance.
Not so much a solution. More another perspective.
Storing the total unique URI of a page perhaps defeats part of the point of URI construction. Each forward slash is supposed to refer to a unique semantic space within the domain (whether that space is actual or logical). Unless the URIs you intend to store are something along the lines of www.somedomain.com/p.aspx?id=123456789, it might really be better to break a single URI meta-table into a table representing the subdomains you have in your site.
For example, if you're going to hold a number of "News" section URIs in the same table as the "Reviews" URIs, then you're missing a trick: have a "Sections" table whose content contains meta information about the section and whose own ID acts as a parent to all the URIs within it.

Resources