If I specify the distribution key to be the primary key of a table, then is it fair to assume that the data will be distributed evenly across all nodes?
And if I specify the distribution key to be a column with only one value, then all the data will end up in the first node?
If I specify the distribution key to be the primary key of a table,
then is it fair to assume that the data will be distributed evenly
across all nodes?
Quibbling about the definition of a node aside, generally speaking choosing the primary key of a table to be the distribution method will distribute the data evenly across the data slices, but this may not hold for tables with a relatively small number of rows.
And if I specify the distribution key to be a column with only one
value, then all the data will end up in the first node?
All of the data will be located on one data slice, but that may not be the "first one."
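As a rough illustration of both answers, here is a minimal sketch of hash distribution. The slice count and the use of MD5 are assumptions made for the example only; the real system uses its own internal hash function and slice layout.

```python
import hashlib
from collections import Counter

NUM_SLICES = 8  # hypothetical number of data slices


def slice_for(key, num_slices=NUM_SLICES):
    """Map a distribution-key value to a slice by hashing it (MD5 is a stand-in)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices


def distribution(keys, num_slices=NUM_SLICES):
    """Return the number of rows that land on each slice."""
    counts = Counter(slice_for(k, num_slices) for k in keys)
    return [counts.get(s, 0) for s in range(num_slices)]


unique_keys = distribution(range(100_000))    # primary-key-like column
constant_key = distribution(["X"] * 100_000)  # column holding a single value
few_rows = distribution(range(5))             # very small table
```

With many unique keys the per-slice counts come out close to each other; with a single value every row hashes to the same slice (whichever one that value maps to); and with only five rows some slices necessarily stay empty.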
I am trying to migrate a table that is currently in a relational database to BigTable.
Let's assume that the table currently has the following structure:
Table: Messages
Columns:
Message_id
Message_text
Message_timestamp
How can I create a similar table in BigTable?
From what I can see in the documentation, BigTable uses ColumnFamily. Is ColumnFamily the equivalent of a column in a relational database?
BigTable is different from a relational database system in many ways.
Regarding database structures, BigTable should be considered a wide-column, NoSQL database.
Basically, every record is represented by a row and for this row you have the ability to provide an arbitrary number of name-value pairs.
This row has the following characteristics.
Row keys
Every row is uniquely identified by a row key. It is similar to a primary key in a relational database. Rows are stored in lexicographic order of their row keys, and the row key is the only thing that is indexed in a table.
In the construction of this key you can choose a single field or combine several ones, separated by # or any other delimiter.
The construction of this key is the most important aspect to take into account when designing your tables. You must think about how you will query the information. Among other things, keep the following in mind (always remembering the lexicographic order):
Define prefixes by concatenating fields in a way that allows you to fetch information efficiently. BigTable allows you to scan rows whose keys start with a certain prefix.
Relatedly, model your key so that common information (think, for example, of all the messages that come from a certain origin) is stored together and can be fetched more efficiently.
At the same time, define keys in a way that maximizes dispersion and balances load across the different nodes in your BigTable cluster.
Column families
The information associated with a row is organized in column families. It has no correspondence with any concept in a relational database.
A column family allows you to group several related fields (columns).
You need to define the column families beforehand.
Columns
A column will store the actual values. It is similar in a certain sense to a column in a relational database.
You can have different columns for different rows. BigTable stores the information sparsely: if you do not provide a value for a column in a row, it consumes no space.
BigTable is a three-dimensional database: for every record, in addition to the actual value, a timestamp is stored as well.
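To make the row / column family / column / timestamp layering concrete, here is a toy in-memory model. It is only a sketch of the data model described above, not the BigTable client API; the class and method names are invented for the example.

```python
from collections import defaultdict


class SparseTable:
    """Toy model: row key -> column family -> column -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)  # families are declared up front
        self.rows = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

    def write(self, row_key, family, column, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][column][timestamp] = value

    def read_latest(self, row_key, family, column):
        # .get() avoids creating empty entries for cells that were never written
        versions = self.rows.get(row_key, {}).get(family, {}).get(column, {})
        if not versions:
            return None
        return versions[max(versions)]

    def cell_count(self):
        """Number of stored cell versions; unwritten cells take no space."""
        return sum(
            len(versions)
            for row in self.rows.values()
            for family in row.values()
            for versions in family.values()
        )


table = SparseTable(["message_details"])
table.write("row-a", "message_details", "message_text", "hello", timestamp=1)
table.write("row-a", "message_details", "message_text", "hello, edited", timestamp=2)
table.write("row-b", "message_details", "message_timestamp", 1330517000, timestamp=1)
```

Here "row-b" never receives a message_text value, so nothing is stored for it, while "row-a" keeps two timestamped versions of the same cell.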
In your use case, you can model your table like this (consider, for example, that you are able to identify the origin of the message as well, and that it is valuable information):
Row key = message_origin#message_timestamp#message_id (with message_timestamp truncated to the half hour, hour, etc.; see note 1)
Column family = message_details
Columns = message_text, message_timestamp
This will generate row keys like the following, for a message sent from a device with id MT43:
MT43#1330516800#1242635
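A minimal sketch of how such a key could be built and how a prefix scan over lexicographically sorted keys behaves. The function names, the 30-minute bucket, and the sorted-list "table" are assumptions for illustration; a real application would use the BigTable client instead.

```python
import bisect


def truncate_timestamp(ts, bucket_seconds=1800):
    """Round a Unix timestamp down to the start of its bucket (default: half hour)."""
    return ts - (ts % bucket_seconds)


def make_row_key(origin, ts, message_id, bucket_seconds=1800, sep="#"):
    """Build origin#truncated_timestamp#message_id."""
    return sep.join([origin, str(truncate_timestamp(ts, bucket_seconds)), str(message_id)])


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix from a lexicographically sorted list."""
    start = bisect.bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        result.append(key)
    return result


# Note: plain decimal timestamps only sort correctly as strings while they
# all have the same number of digits.
keys = sorted([
    make_row_key("MT43", 1330517000, 1242635),
    make_row_key("MT43", 1330518700, 1242640),
    make_row_key("MT44", 1330517000, 1242636),
])

all_from_mt43 = prefix_scan(keys, "MT43#")
one_half_hour = prefix_scan(keys, "MT43#1330516800#")
```

Because the origin comes first, all messages from one device sit next to each other, and adding the truncated timestamp to the prefix narrows the scan to a single time bucket.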
Please, as @norbjd suggested, see the relevant documentation for an in-depth explanation of these concepts.
One important difference from a relational database to note: BigTable only offers atomic single-row transactions, and only when using single-cluster routing.
1 See, for instance: How to round unix timestamp up and down to nearest half hour?
I'm starting work on a data warehousing project for a customer that has multiple physical locations with separate instances of the same LOB databases at each location. There's a good bit of "common" data between the sites, but the systems are siloed, so data that conceptually refers to the same thing has a different representation in the source.
Consider, for example, a product category. The list of product categories would be identical for each location, but the auto-generated key would differ. When the data is extracted, staged, and loaded into the corresponding product category dimension table in the warehouse, the categories are effectively duplicated because they have different source system, or "natural" keys.
Clearly, the data needs to be de-duplicated, but what then would become the surrogate key that's persisted on the de-duplicated dimension record? Keep in mind that data referencing the product category will use the natural key from its location of origination. So, if I have three distinct locations, I'm going to have three different natural keys for the same product category, and sales data corresponding to that product category which also references those three natural keys but ultimately refers to the same conceptual category. There are a couple of ways I could handle this:
1. If I have three locations, write three distinct natural keys to the single dimension record. This would make matching in the ETL process straightforward, but it's not very scalable because additional locations can and likely will be added. For every new location that came online, I would then need to add an additional natural key field to every dimension table with such de-duplicated records.
2. Create a lookup table that records a mapping between every natural key and its corresponding surrogate key in the corresponding dimension table. I'm not sure whether this approach is standard, nor am I sure about its maintainability.
Any input on how the above-referenced scenario could be handled would be greatly appreciated.
We use approach 2. Imagine one day having hundreds of locations, and you'll see that approach 1 is simply out of the question.
Approach 2 is scalable, and very easy to maintain, since your lookup table will only grow vertically.
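A small sketch of what approach 2 can look like, using an in-memory map in place of the lookup table. Matching categories by name is an assumption made for the example; the real matching rule depends on your data. All names below are hypothetical.

```python
class KeyMap:
    """Maps (location, natural_key) pairs to one surrogate key per de-duplicated record."""

    def __init__(self):
        self._by_natural = {}   # (location, natural_key) -> surrogate key
        self._by_business = {}  # matching attribute (here: category name) -> surrogate key
        self._next_surrogate = 1

    def resolve(self, location, natural_key, category_name):
        """Return the surrogate key for a source record, creating it on first sight."""
        source = (location, natural_key)
        if source in self._by_natural:
            return self._by_natural[source]
        surrogate = self._by_business.get(category_name)
        if surrogate is None:
            surrogate = self._next_surrogate
            self._next_surrogate += 1
            self._by_business[category_name] = surrogate
        self._by_natural[source] = surrogate  # one new row per source key, no new columns
        return surrogate

    def size(self):
        return len(self._by_natural)


key_map = KeyMap()
a = key_map.resolve("site_a", 101, "Beverages")
b = key_map.resolve("site_b", 7, "Beverages")
c = key_map.resolve("site_c", 5530, "Beverages")
d = key_map.resolve("site_a", 102, "Snacks")
```

Adding a fourth location only adds rows to the map; the dimension table itself does not change shape, which is the "grows vertically" property mentioned above.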
Can HashTables be used to create indexes in databases? What is the ideal Data structure to create indexes?
If a table has a foreign key referencing a field in another database, will it help if we create an index on the foreign key?
Can HashTables be used to create indexes in databases?
Some DBMSes support hash-based indexes, some don't.
What is the ideal Data structure to create indexes?
No data structure occupies 0 bytes, nor can it be manipulated in 0 CPU cycles, therefore no data structure is "ideal". It is up to us, the software engineers, to decide which data structure has the most benefits and the fewest detriments for the specific goal we are trying to accomplish.
For example, B-Trees are useful for range scans and hash indexes aren't. Does that mean the B-Trees are "better"? Well, they are if you need range scans, but may not necessarily be if you don't.
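The trade-off can be sketched with two toy "indexes" over the same keys. A sorted list searched with bisect stands in for a B-Tree here (it is not one, but it shares the ordered-access property), and a dict stands in for a hash index.

```python
import bisect

rows = [(17, "q"), (3, "a"), (42, "z"), (8, "c"), (23, "m")]

# Hash-style index: direct lookup by exact key, no ordering between keys.
hash_index = {key: value for key, value in rows}

# Ordered index: keys kept sorted so a range maps to one contiguous slice.
sorted_keys = sorted(key for key, _ in rows)


def ordered_range_scan(lo, hi):
    """Keys in [lo, hi], found with two binary searches."""
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[start:end]


def hash_range_scan(lo, hi):
    """Same result from the hash index, but every key has to be inspected."""
    return sorted(key for key in hash_index if lo <= key <= hi)


exact = hash_index[42]
in_range = ordered_range_scan(5, 25)
```

Both structures answer the exact-match lookup, but only the ordered one can answer the range query without touching every entry.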
If a table has a foreign key referencing a field in another database, will it help if we create an index on the foreign key?
You cannot normally have a foreign key toward another database, only another table.
And yes, it tends to help, since every time a row is updated or deleted in the parent table, the child table needs to be searched to see if the FK was violated. This search can significantly benefit from such an index. Some (but not all) DBMSes require an index on the FK (and might even create it automatically if it is not already there).
OTOH, if you only add rows to the parent table, you could consider leaving the child table unindexed on FK fields (assuming your DBMS allows you to do so).
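A small sketch using SQLite (via Python's standard sqlite3 module) as a stand-in DBMS; table and index names are made up for the example, and other systems will show different plans. It creates an index on the FK column, shows that a lookup of children by parent can use it, and shows the parent delete being checked against the child table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES parent(id)
    );
    CREATE INDEX idx_child_parent ON child(parent_id);
    INSERT INTO parent (id) VALUES (1), (2);
    INSERT INTO child (id, parent_id) VALUES (10, 1);
""")


def plan_for_child_lookup():
    """Query plan for finding the children of one parent row."""
    plan_rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM child WHERE parent_id = ?", (1,)
    ).fetchall()
    return " ".join(str(row[-1]) for row in plan_rows)  # last column is the detail text


def try_delete_parent(parent_id):
    """Attempt to delete a parent row; return False if the FK check rejects it."""
    try:
        with conn:
            conn.execute("DELETE FROM parent WHERE id = ?", (parent_id,))
        return True
    except sqlite3.IntegrityError:
        return False


plan = plan_for_child_lookup()
blocked = try_delete_parent(1)   # parent 1 still has a child row
allowed = try_delete_parent(2)   # parent 2 has no children
```

The lookup by parent_id is the same kind of search the engine has to perform on every parent update or delete, which is why the index pays off there.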
Oracle Perspective
Oracle supports clustering by hash value, either for single or multiple tables. This physically colocates rows having the same hash value for the cluster columns, and is faster than accessing via an index. There are disadvantages due to increased complexity and a certain need for preplanning.
You could also use a function-based index to index based on a hash function applied to one or more columns. I'm not sure what the advantage of that would be though.
Foreign key columns in Oracle generally benefit from indexing due to the obvious performance advantages.
Index Organized Tables (IOTs) are tables stored in an index structure. Whereas a table stored in a heap is unorganized, data in an IOT is stored and sorted by primary key (the data is the index). IOTs behave just like “regular” tables, and you use the same SQL to access them.
Every table in a proper relational database is supposed to have a primary key... If every table in my database has a primary key, should I always use an index organized table?
I'm guessing the answer is no, so when is an index organized table not the best choice?
Basically an index-organized table is an index without a table. There is a table object which we can find in USER_TABLES but it is just a reference to the underlying index. The index structure matches the table's projection. So if you have a table whose columns consist of the primary key and at most one other column then you have a possible candidate for INDEX ORGANIZED.
The main use case for index organized table is a table which is almost always accessed by its primary key and we always want to retrieve all its columns. In practice, index organized tables are most likely to be reference data, code look-up affairs. Application tables are almost always heap organized.
The syntax allows an IOT to have more than one non-key column. Sometimes this is correct. But it is also an indication that maybe we need to reconsider our design decisions. Certainly if we find ourselves contemplating the need for additional indexes on the non-primary key columns then we're probably better off with a regular heap table. So, as most tables probably need additional indexes, most tables are not suitable for IOTs.
Coming back to this answer I see a couple of other responses in this thread propose intersection tables as suitable candidates for IOTs. This seems reasonable, because it is common for intersection tables to have a projection which matches the candidate key: STUDENTS_CLASSES could have a projection of just (STUDENT_ID, CLASS_ID).
I don't think this is cast-iron. Intersection tables often have a technical key (i.e. STUDENT_CLASS_ID). They may also have non-key columns (metadata columns like START_DATE, END_DATE are common). Also there is no prevailing access path - we want to find all the students who take a class as often as we want to find all the classes a student is taking - so we need an indexing strategy which supports both equally well. I'm not saying intersection tables are not a use case for IOTs, just that they are not automatically so.
I'd consider them for very narrow tables (such as the join tables used to resolve many-to-many relationships). If (virtually) all the columns in the table are going to be in an index anyway, then why shouldn't you use an IOT?
Small tables can be good candidates for IOTs as discussed by Richard Foote here
I consider the following kinds of tables excellent candidates for IOTs:
"small" "lookup" type tables (e.g. queried frequently, updated infrequently, fits in a relatively small number of blocks)
any table that you already are going to have an index that covers all the columns anyway (i.e. may as well save the space used by the table if the index duplicates 100% of the data)
From the Oracle Concepts guide:
Index-organized tables are useful when
related pieces of data must be stored
together or data must be physically
stored in a specific order. This type
of table is often used for information
retrieval, spatial (see "Overview of
Oracle Spatial"), and OLAP
applications (see "OLAP").
This question from AskTom may also be of some interest, especially where someone gives a scenario and then asks whether an IOT would perform better than a heap-organised table. Tom's response is:
we can hypothesize all day long, but
until you measure it, you'll never
know for sure.
An index-organized table is generally a good choice if you only access data from that table by the key, the whole key, and nothing but the key.
Further, there are many limitations about what other database features can and cannot be used with index-organized tables -- I recall that in at least one version one could not use logical standby databases with index-organized tables. An index-organized table is not a good choice if it prevents you from using other functionality.
All an IOT really saves is the logical read(s) on the table segment, and as you might have spent two or three or more on the IOT/index this is not always a great saving except for small data sets.
Another feature to consider for speeding up lookups, particularly on larger tables, is a single table hash cluster. When correctly created they are more efficient for large data sets than an IOT because they require only one logical read to find the data, whereas an IOT is still an index that needs multiple logical I/Os to locate the leaf node.
I can't per se comment on IOTs, however if I'm reading this right then they're the same as a 'clustered index' in SQL Server. Typically you should think about not using such an index if your primary key (or the value(s) you're indexing if it's not a primary key) are likely to be distributed fairly randomly - as these inserts can result in many page splits (expensive).
Indexes such as identity columns (sequences in Oracle?) and dates 'around the current date' tend to make for good candidates for such indexes.
An Index-Organized Table--in contrast to an ordinary table--has its own way of structuring, storing, and indexing data.
Index organized tables (IOT) are indexes which actually hold the data which is being indexed, unlike the indexes which are stored somewhere else and have links to actual data.
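The difference between the two layouts can be sketched with two toy classes. This is a conceptual illustration only, not how Oracle stores blocks; sorted Python lists stand in for the index structure, and key uniqueness is not enforced.

```python
import bisect


class HeapWithIndex:
    """Rows kept in arrival order, plus a separate sorted index of (key, position)."""

    def __init__(self):
        self.rows = []   # the "heap": unordered row storage
        self.index = []  # sorted (key, position-in-heap) entries

    def insert(self, key, payload):
        self.rows.append((key, payload))
        bisect.insort(self.index, (key, len(self.rows) - 1))

    def get(self, key):
        i = bisect.bisect_left(self.index, (key, -1))
        if i < len(self.index) and self.index[i][0] == key:
            position = self.index[i][1]
            return self.rows[position][1]  # second hop: from index entry to heap row
        return None


class IndexOrganized:
    """Rows stored inside the sorted structure itself: the data is the index."""

    def __init__(self):
        self.keys = []
        self.payloads = []

    def insert(self, key, payload):
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.payloads.insert(i, payload)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.payloads[i]  # no separate table segment to visit
        return None


heap, iot = HeapWithIndex(), IndexOrganized()
for key, payload in [(30, "c"), (10, "a"), (20, "b")]:
    heap.insert(key, payload)
    iot.insert(key, payload)
```

The heap keeps rows in insertion order and needs the extra hop from index entry to row, while the index-organized variant keeps rows in key order and returns the payload straight from the index position.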
I have a table in my database which stores logs. The log files are timestamped with second accuracy and store the values of various sensors and their source:
log_id, log_date, primary_system_source, sub_system_source, values
Where log_id, primary_system_source and sub_system_source are integers and values is a variable length byte array (datatype: bytea).
In most cases a combination of the log_id, log_date, primary_system_source and sub_system_source fields would be sufficient as the primary key. Unfortunately, because of the one-second resolution of the timestamps in the logging system, some rows can only be told apart if the sensor values are also included in the primary key.
It appears I have a choice between having no primary key (bad?), and including the values field in the primary key. I am concerned at the second choice as I understand it could be seriously detrimental to performance (the table will have hundreds of millions of rows).
Any hints as to which is the best solution?
That's a difficult issue since your entire row functions as your primary key in the example you just presented. Since your logs timestamp without absolute precision, I would argue that your logs themselves may not contain unique values (two identical sensor readings in the same time period). If that holds true, you do not have any way to uniquely identify your data, and therefore cannot impose a unique constraint on it.
I would recommend simply adding a SERIAL PK field for links to other relations and not worrying about the uniqueness of your entries, since you cannot reasonably enforce it anyway. You can identify duplicated log entries if you have a greater number of entries within a certain time period than you were expecting. I'm not sure of the performance implications, but running SELECT DISTINCT may be more prudent than attempting to enforce uniqueness.
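A minimal sketch of that suggestion, using SQLite through Python's sqlite3 module as a stand-in for PostgreSQL: INTEGER PRIMARY KEY AUTOINCREMENT plays the role of SERIAL and BLOB the role of bytea. The column is named sensor_values here only because "values" is an SQL keyword; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE logs (
        log_pk INTEGER PRIMARY KEY AUTOINCREMENT,  -- stand-in for a SERIAL column
        log_id INTEGER,
        log_date TEXT,
        primary_system_source INTEGER,
        sub_system_source INTEGER,
        sensor_values BLOB
    )
""")

sample_rows = [
    (1, "2012-02-29 12:00:00", 4, 2, b"\x01\x02"),
    (1, "2012-02-29 12:00:00", 4, 2, b"\x01\x02"),  # same second, same readings
    (1, "2012-02-29 12:00:00", 4, 2, b"\x01\x03"),  # same second, different readings
]
with conn:
    conn.executemany(
        "INSERT INTO logs (log_id, log_date, primary_system_source,"
        " sub_system_source, sensor_values) VALUES (?, ?, ?, ?, ?)",
        sample_rows,
    )

total_rows = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
distinct_rows = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT DISTINCT log_id, log_date, primary_system_source,
                        sub_system_source, sensor_values
        FROM logs
    )
""").fetchone()[0]
generated_keys = [row[0] for row in conn.execute("SELECT log_pk FROM logs ORDER BY log_pk")]
```

Every row gets its own generated key, so other tables can reference it, and comparing the total row count with the DISTINCT count shows how many fully duplicated entries exist without any uniqueness constraint on the wide byte-array column.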