Organization of correlated columns in Netezza

Say there is a large table in Netezza with columns COUNTRY and REGION where each country rolls up to a region. If the table is organized explicitly only on COUNTRY, it is implicitly organized also on REGION because the data are correlated.
Does Netezza know this for queries that use REGION in the WHERE clause, or does Netezza still scan the whole table regardless of the zone maps?
Put another way: to get the performance benefit for REGION, must REGION be explicitly organized in the zone map?

The short answer is ‘yes’; the slightly longer answer is ‘it depends’.
Generally speaking, Netezza zone maps work for all columns except string columns, which only get zone maps if they are included in the ORGANIZE ON clause. So let’s assume that both columns are integer ‘codes’ and that small dimension tables translate each of the two codes to a string through a join; then it will work nicely.
Of course, this optimization works best if the number of distinct values is small compared to the size of the table and/or the region codes are for the most part ascending when the corresponding country codes are ascending.
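To make that concrete, here is a rough, untested sketch; the table and column names are invented for illustration:

-- Illustrative Netezza DDL: organize explicitly on country_code only.
CREATE TABLE sales
(
    sale_id      BIGINT,
    country_code INTEGER,
    region_code  INTEGER,
    amount       NUMERIC(18,2)
)
DISTRIBUTE ON (sale_id)
ORGANIZE ON (country_code);

-- GROOM rewrites the rows in the organizing order so the zone maps become effective.
GROOM TABLE sales RECORDS ALL;

-- region_code is an integer, so it gets a zone map automatically; if its values are
-- correlated with country_code, a filter like this can still skip extents even though
-- region_code is not named in ORGANIZE ON.
SELECT SUM(amount) FROM sales WHERE region_code = 42;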

Related

Database / table structure - similar entries vs. too much normalization?

I have designed this relational database that is keeping track of various assets and their owners over time. One of the most important pieces of analysis I want to do is to track the value of those assets over time: expected original cost, actual original cost, actual cost, etc. So I have been putting data relative to a cost / value in a separate table called “Support_Value”. To complicate things, some of the assets I’m tracking are in countries with foreign currencies, so I’m collecting cost / value data in US Dollars but also in local currencies (“LC”), which ends up doubling the number of columns I have in this table. I also use this table to keep track of the value of the asset owners themselves in a similar fashion.
- The columns of this table are the following:
My initial plan was to carve out separate tables to deal with (1) the various “qualities” of entries relative to cost and value (i.e. “planned”, “upper” bound, “lower” bound, “estimated” by analysts, and “actual”) and (2) another one for currencies. But I realize this is likely to break down, as it doesn’t allow an initial “planned” cost to be revised later unless we make that explicit by adding columns for the revisions, and then there can be more than one revision. So still not perfect.
What I’m now envisaging is to create a different value table that would have the following columns:
ID (PK representing individual instances of cost / value estimates)
Currency (FK to my currency table)
Asset (FK to my assets table) - i.e. what this cost or value is referring to
Date (FK to my date table) - i.e. to track revisions actually
Type (i.e. “cost” or “value”)
Quality (i.e. “planned”, “upper”, “lower”, “estimated”, “actual”)
Valuation - i.e. the actual absolute amount in the currency designated in the second column
What do you think of this approach? Is this an improvement?
Thanks for any suggestion you could have!
Both approaches are fine.
But if you think you may need additional similar columns,
then the second approach is more extensible.
Your second approach does look over-normalized, though;
I suggest splitting the "Quality" column back into its parts.
Something like:
"ID"
"Currency"
"Asset"
"Date"
"Type"
"Planned"
"Lower"
"Upper"
"Estimated"
"Actual"
"Valuation"
Cheers.
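If it helps to visualize that flattened shape, here is a rough CREATE TABLE sketch; this is generic SQL, the types are assumptions, and if Planned through Actual hold the amounts themselves then the separate Valuation column is probably redundant:

CREATE TABLE Support_Value
(
    ID        INT PRIMARY KEY,       -- one row per cost/value record
    Currency  INT NOT NULL,          -- FK to the currency table
    Asset     INT NOT NULL,          -- FK to the assets table
    Date      DATE NOT NULL,         -- revision date
    Type      VARCHAR(10) NOT NULL,  -- 'cost' or 'value'
    Planned   DECIMAL(18,2),
    Lower     DECIMAL(18,2),
    Upper     DECIMAL(18,2),
    Estimated DECIMAL(18,2),
    Actual    DECIMAL(18,2),
    Valuation DECIMAL(18,2)          -- as in the list above; see note on redundancy
);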

Can we create Bitmap Index on a table column in Oracle 11 which reloads daily using a Job

We have a table which stores information about clients and gets loaded by a scheduled job on a daily basis from the data warehouse. There are more than 1 million records in that table.
I wanted to define a bitmap index on the Country column, as there would be a limited number of values.
Does it have any impact on the indexes if we delete and reload data into the table on a daily basis? Do we need to explicitly rebuild the index after every load?
A bitmap index is dangerous when the indexed column is frequently updated, because DML on a single row can lock many rows in the table. That's why it is more of a data warehouse tool than an OLTP one. Also, the true power of bitmap indexes comes from combining several of them using logical operations and translating the result into ROWIDs (and then accessing the rows or aggregating them). In Oracle there are generally not many reasons to rebuild an index: when frequently modified it will adapt via 50/50 block splits, and it doesn't make sense to try to compact it into the smallest possible space. One million rows today is nothing unless each row contains a large amount of data.
Also be aware that bitmap indexes require an Enterprise Edition license.
The rationale for defining a bitmap index is not that a column has few values, but that there is a query (or queries) that can profit from it when accessing the table rows.
For example, if you have, say, 4 equally populated countries, Oracle will not use the index, as a full table scan comes out cheaper.
If you have some "exotic" countries (very few records), a bitmap index could be used, but you will most probably see no difference compared to a conventional index.
I wanted to define a bitmap index on the Country column, as there would be a limited number of values.
Just because a column is low cardinality does not mean it is a candidate for a bitmap index. It might be, it might not be.
Good explanation by Tom Kyte here.
Bitmap indexes are extremely useful in environments where you have
lots of ad hoc queries, especially queries that reference many columns
in an ad hoc fashion or produce aggregations such as COUNT. For
example, suppose you have a large table with three columns: GENDER,
LOCATION, and AGE_GROUP. In this table, GENDER has a value of M or F,
LOCATION can take on the values 1 through 50, and AGE_GROUP is a code
representing 18 and under, 19-25, 26-30, 31-40, and 41 and over.
For example, you have to support a large number of ad hoc queries that take the following form:
select count(*)
from T
where gender = 'M'
and location in ( 1, 10, 30 )
and age_group = '41 and over';
select *
from t
where ( ( gender = 'M' and location = 20 )
or ( gender = 'F' and location = 22 ))
and age_group = '18 and under';
select count(*) from t where location in (11,20,30);
select count(*) from t where age_group = '41 and over' and gender = 'F';
You would find that a conventional B*Tree indexing scheme would fail you. If you wanted to use an index to get the answer, you would need at least three and up to six combinations of possible B*Tree indexes to access the data via the index. Since any of the three columns or any subset of the three columns may appear, you would need large concatenated B*Tree indexes on:
GENDER, LOCATION, AGE_GROUP: For queries that used all three columns, GENDER with LOCATION, or GENDER alone
LOCATION, AGE_GROUP: For queries that used LOCATION and AGE_GROUP, or LOCATION alone
AGE_GROUP, GENDER: For queries that used AGE_GROUP with GENDER, or AGE_GROUP alone
Having only a single bitmap index on a table is of little use most of the time. The benefit of bitmap indexes comes when you have several of them on a table and your query combines them.
Maybe a list partition is more suitable in your case.
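To make the combination point concrete, here is a minimal sketch on Tom Kyte's example table t; the index names are invented:

CREATE BITMAP INDEX t_gender_bix    ON t (gender);
CREATE BITMAP INDEX t_location_bix  ON t (location);
CREATE BITMAP INDEX t_age_group_bix ON t (age_group);

-- Oracle can combine any subset of these with BITMAP AND / BITMAP OR operations,
-- so three single-column bitmap indexes cover all of the ad hoc query shapes above,
-- where the B*Tree approach needed several concatenated indexes.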

Is it a good idea to include flags in Fact table

The transactional fact table of one of the star schemas needs to answer questions like whether the first application is the final application. This is associated with one of the business processes.
Is it a good idea to keep this as part of the fact table, with a column named IsFirstAppLastFlag?
There are not enough flags to justify a separate dimension. Also, this flag (a calculated flag) is essential in report writing. In this context, should we keep it in a dimension or in the fact table?
I assume a junk dimension is created for those flags / low-cardinality columns which are not useful enough on their own, so that they can be kept inside a dimension?
This will depend on your own needs but if you like the purest view of the fact table then the answer is no, these fields should not be included in your fact table.
The fact table should include dimension keys, degenerate dimension keys, and facts.
IsStatusOne, IsStatusTwo, etc are attributes and as you rightly suggest would be well suited to a junk dimension in the absence of them belonging to a more suitable dimension, e.g., IsWeekDay would be suited to dimension "Date" table.
You may start off with only a few "Is" attributes in your fact table, but over time you may need more and more of these attributes, and you will look back and possibly wish you had created a junk dimension.
Performance:
Interestingly, if you are using bit columns for your flags, then there is little storage difference between having 8 bit flags in your fact table and having one tinyint dimension key. However, when your flags are more verbose or have multiple status values, you should use the junk dimension to improve performance on the fact table: less storage, less memory, more rows per page, etc.
Personally, I would junk them.
That seems fine, as long as it is an attribute of the fact, not of one of the dimensions. In some cases I think you might have a slowly changing dimension in which it would be more appropriately placed.
I would be concerned that this plan might require updates on the fact table, for example if you were intending to flag that a particular fact was the most recent for a customer. If that was the case it might be better to keep a transaction number in the fact table, and a "most recent transaction number" in the dimension table, and provide an indexing method to effectively retrieve the most recent per-customer.
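One way that pattern could look; the table and column names here are hypothetical, and the indexing remark is only a suggestion:

-- Sketch of the "most recent transaction number" pattern: the dimension remembers the
-- latest transaction number per customer, and the fact table keeps one row per transaction.
SELECT f.*
FROM FactTransaction f
JOIN DimCustomer c
  ON c.CustomerKey = f.CustomerKey
 AND c.MostRecentTransactionNo = f.TransactionNo;

-- A composite index on FactTransaction (CustomerKey, TransactionNo) lets this retrieve
-- the most recent row per customer without ever updating the fact table.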
You can use a junk dimension.
Instead of creating several dimensions with few rows, you can create one dimension with all possible combinations of values; then you add just one foreign key to your fact table.
You can populate your junk dimension with a query like the one below.
WITH cteFlags AS
(
SELECT 'N' AS Value
UNION ALL
SELECT 'Y'
)
SELECT
Flag1.Value,
Flag2.Value,
Flag3.Value
FROM
cteFlags Flag1
CROSS JOIN cteFlags Flag2
CROSS JOIN cteFlags Flag3
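To actually load the junk dimension you could prefix that query with an INSERT; the table DimJunkFlags, its flag columns, and the assumption that its surrogate key is an identity column are all illustrative:

WITH cteFlags AS
(
    SELECT 'N' AS Value
    UNION ALL
    SELECT 'Y'
)
INSERT INTO DimJunkFlags (Flag1, Flag2, Flag3)  -- surrogate key assumed to be an identity
SELECT Flag1.Value, Flag2.Value, Flag3.Value
FROM cteFlags Flag1
CROSS JOIN cteFlags Flag2
CROSS JOIN cteFlags Flag3;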

Storing Preferences/One-to-One Relationships in Database

What is the best way to store settings for certain objects in my database?
Method one: Using a single table
Table: Company {CompanyID, CompanyName, AutoEmail, AutoEmailAddress, AutoPrint, AutoPrintPrinter}
Method two: Using two tables
Table: Company {CompanyID, CompanyName}
Table: CompanySettings {CompanyID, AutoEmail, AutoEmailAddress, AutoPrint, AutoPrintPrinter}
I would take things a step further...
Table 1 - Company
CompanyID (int)
CompanyName (string)
Example
CompanyID 1
CompanyName "Swift Point"
Table 2 - Contact Types
ContactTypeID (int)
ContactType (string)
Example
ContactTypeID 1
ContactType "AutoEmail"
Table 3 - Company Contact
CompanyID (int)
ContactTypeID (int)
Addressing (string)
Example
CompanyID 1
ContactTypeID 1
Addressing "name#address.blah"
This solution gives you extensibility as you won't need to add columns to cope with new contact types in the future.
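As a rough DDL sketch of those three tables (SQL Server-style types; the names and sizes are illustrative):

CREATE TABLE Company
(
    CompanyID   INT IDENTITY PRIMARY KEY,
    CompanyName NVARCHAR(100) NOT NULL
);

CREATE TABLE ContactType
(
    ContactTypeID INT IDENTITY PRIMARY KEY,
    ContactType   NVARCHAR(50) NOT NULL
);

CREATE TABLE CompanyContact
(
    CompanyID     INT NOT NULL REFERENCES Company (CompanyID),
    ContactTypeID INT NOT NULL REFERENCES ContactType (ContactTypeID),
    Addressing    NVARCHAR(255) NOT NULL,
    PRIMARY KEY (CompanyID, ContactTypeID)
);

Reading a company's settings back is then a simple join: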
SELECT
[company].CompanyID,
[company].CompanyName,
[contacttype].ContactTypeID,
[contacttype].ContactType,
[companycontact].Addressing
FROM
[company]
INNER JOIN
[companycontact] ON [companycontact].CompanyID = [company].CompanyID
INNER JOIN
[contacttype] ON [contacttype].ContactTypeID = [companycontact].ContactTypeID
This would give you multiple rows for each company. A row for "AutoEmail" a row for "AutoPrint" and maybe in the future a row for "ManualEmail", "AutoFax" or even "AutoTeleport".
Response to HLEM.
Yes, this is indeed the EAV model. It is useful where you want to have an extensible list of attributes with similar data. In this case, varying methods of contact with a string that represents the "address" of the contact.
If you didn't want to use the EAV model, you should next consider relational tables, rather than storing the data in flat tables. This is because this data will almost certainly extend.
Neither EAV model nor the relational model significantly slow queries. Joins are actually very fast, compared with (for example) a sort. Returning a record for a company with all of its associated contact types, or indeed a specific contact type would be very fast. I am working on a financial MS SQL database with millions of rows and similar data models and have no problem returning significant amounts of data in sub-second timings.
In terms of complexity, this isn't the most technical design in terms of database modelling and the concept of joining tables is most definitely below what I would consider to be "intermediate" level database development.
I would consider whether you need one or two tables based on the following criteria:
First, are you close to the record storage limit? If so, then two tables, definitely.
Second, will you usually be querying the information you plan to put in the second table most of the time you query the first table? Then one table might make more sense. If you usually do not need the extended information, a separate (and less wide) table should improve performance of the main data queries.
Third, how strong a possibility is it that you will ever need multiple values? If it is one-to-one now, but is something like an email address or phone number that has a strong chance of morphing into multiple rows, go ahead and make it a related table. If you know there is no chance, or only a small one, then it is OK to keep it in one table, assuming the table isn't too wide.
EAV tables look like they are nice and will save future work, but in reality they don't. Generally, if you need to add another type, you still need to do future work to adjust queries etc. Writing a script to add a column takes all of five minutes; the other work will be needed regardless of the structure. EAV tables are also very hard to query when you don't know how many records you will need to pull, because normally you want them on one line and will get the information by joining to the same table multiple times. This causes performance problems and locking, especially if this table is central to your design. Don't use this method.
It depends on whether you will ever need more information about a company. If you notice yourself adding fields like CompanyPhoneNumber1, CompanyPhoneNumber2, etc., then method 2 is better, as you would separate your entities and just reference a company ID. If you do not plan to make these changes and you feel that this table will never change, then method 1 is fine.
Usually, if you don't have data duplication then a single table is fine.
In your case you don't so the first method is OK.
I use one table if I estimate the data from the "second" table will be used in more than 50% of my queries. I use two tables if I need multiple copies of the data (i.e. multiple phone numbers, email addresses, etc.).

SQL Optimization: how many columns on a table?

In a recent project I have seen tables with anywhere from 50 to 126 columns.
Should a table hold fewer columns, or is it better to separate them out into a new table and use relationships? What are the pros and cons?
Generally it's better to design your tables first to model the data requirements and to satisfy rules of normalization. Then worry about optimizations like how many pages it takes to store a row, etc.
I agree with other posters here that the large number of columns is a potential red flag that your table is not properly normalized. But it might be fine in this case. We can't tell from your description.
In any case, splitting the table up just because the large number of columns makes you uneasy is not the right remedy. Is this really causing any defects or performance bottleneck? You need to measure to be sure, not suppose.
A good rule of thumb that I've found is simply whether or not a table keeps gaining columns as a project continues.
For instance:
On a project I'm working on, the original designers decided to include site permissions as columns in the user table.
So now we are constantly adding more columns as new features are implemented on the site. Obviously this is not optimal. A better solution would be to have a table containing permissions and a join table between users and permissions to assign them.
However, for other more archival information, or tables that simply don't have to grow or need to be cached/minimize pages/can be filtered effectively, having a large table doesn't hurt too much as long as it doesn't hamper maintenance of the project.
At least that is my opinion.
Usually an excess of columns points to improper normalization, but it is hard to judge without some more details about your requirements.
I can picture times when it might be necessary to have this many, or more columns. Examples would be if you had to denormalize and cache data - or for a type of row with many attributes. I think the keys are to avoid select * and make sure you are indexing the right columns and composites.
If you had an object detailing the data in the database, would you have a single object with 120 fields, or would you be looking through the data to extract data that is logically distinguishable? You can inline Address data with Customer data, but it makes sense to remove it and put it into an Addresses table, even if it keeps a 1:1 mapping with the Person.
Down the line you might need to have a record of their previous address, and by splitting it out you've removed one major problem refactoring your system.
Are any of the fields duplicated over multiple rows? I.e., are the customer's details replicated, one per invoice? In which case there should be one customer entry in the Customers table, and n entries in the Invoices table.
One place where you should not try to fix broken normalisation is a fact table (for auditing, etc.) whose purpose is to aggregate data to run analyses on. These tables are usually populated from the properly normalised tables, however (overnight, for example).
It sounds like you have potential normalization issues.
If you really want to, you can create a new table for each of those columns (a little extreme) or group of related columns, and join it on the ID of each record.
It could certainly affect performance if people are running around with a lot of "Select * from GiantTableWithManyColumns"...
Here are the official statistics for SQL Server 2005
http://msdn.microsoft.com/en-us/library/ms143432.aspx
Keep in mind these are the maximums, and are not necessarily the best for usability.
Think about splitting the 126 columns into sections.
For instance, if it is some sort of "person" table
you could have
Person
ID, AddressNum, AddressSt, AptNo, Province, Country, PostalCode, Telephone, CellPhone, Fax
But you could separate that into
Person
ID, AddressID, PhoneID
Address
ID, AddressNum, AddressSt, AptNo, Province, Country, PostalCode
Phone
ID, Telephone, Cellphone, fax
In the second one, you could also save yourself from data replication by having all the people with the same address have the same addressId instead of copying the same text over and over.
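In DDL terms the split might look roughly like this; the types and lengths are illustrative:

CREATE TABLE Address
(
    ID         INT IDENTITY PRIMARY KEY,
    AddressNum VARCHAR(10),
    AddressSt  VARCHAR(100),
    AptNo      VARCHAR(10),
    Province   VARCHAR(50),
    Country    VARCHAR(50),
    PostalCode VARCHAR(10)
);

CREATE TABLE Phone
(
    ID        INT IDENTITY PRIMARY KEY,
    Telephone VARCHAR(20),
    CellPhone VARCHAR(20),
    Fax       VARCHAR(20)
);

CREATE TABLE Person
(
    ID        INT IDENTITY PRIMARY KEY,
    AddressID INT REFERENCES Address (ID),  -- shared by everyone at the same address
    PhoneID   INT REFERENCES Phone (ID)
);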
The UserData table in SharePoint has 201 fields but is designed for a special purpose.
Normal tables should not be this wide in my opinion.
You could probably normalize some more. And read some posts on the web about table optimization.
It is hard to say without knowing a little bit more.
Well, I don't know how many columns are possible in SQL, but one thing I am very sure of is that when you design a table, each table should be an entity, meaning it should contain information about either a person, a place, an event or an object. I have yet to come across a single thing that needs that much data/information.
The second thing you should notice is that there is a method called normalization, which is basically used to divide data/information into subsections so that one can easily maintain the database. I think this will clarify your idea.
I'm in a similar position. Yes, there truly is a situation where a normalized table has, as in my case, about 90 columns: a workflow application that tracks the many states a case can have, in addition to attributes that vary with each state. So as each case (represented by the record) progresses, eventually all columns are filled in for that case. Now in my situation there are 3 logical groupings (15 cols + 10 cols + 65 cols). So do I keep it in one table (the index is CaseID), or do I split it into 3 tables connected by a one-to-one relationship?
Maximum columns per table, by replication type:
Merge publication: 246
SQL Server snapshot or transactional publication: 1,000
Oracle snapshot or transactional publication: 995
So a table that takes part in merge replication can have at most 246 columns.
http://msdn.microsoft.com/en-us/library/ms143432.aspx
A table should have as few columns as possible.
In SQL Server, tables are stored on pages, and 8 pages make up an extent.
A page can hold about 8,060 bytes of row data; the more data you can fit on a page, the fewer I/Os you have to make to return it.
You probably want to normalize (or vertically partition) your database.

Resources