Worth a unique table for database values that repeat ~twice?

I have a static database of ~60,000 rows. There is a certain column for which there are ~30,000 unique entries. Given that ratio (60,000 rows/30,000 unique entries in a certain column), is it worth creating a new table with those entries in it, and linking to it from the main table? Or is that going to be more trouble than it's worth?
To put the question in a more concrete way: Will I gain a lot more efficiency by separating out this field into its own table?
** UPDATE **
We're talking about a VARCHAR(100) field, but in reality, I doubt any of the entries use that much space -- I could most likely trim it down to VARCHAR(50). Example entries: "The Gas Patch and Little Canada" and "Kora Temple Masonic Bldg. George Coombs"

If the field is a VARCHAR(255) that normally contains about 30 characters, and the alternative is to store a 4-byte integer in the main table and use a second table with a 4-byte integer and the VARCHAR(255), then you're looking at some space saving.
Old scheme:
T1: 30 bytes * 60 K entries = 1800 KiB.
New scheme:
T1: 4 bytes * 60 K entries = 240 KiB
T2: (4 + 30) bytes * 30 K entries = 1020 KiB
So, that's crudely 1800 - 1260 = 540 KiB space saving. If, as would be necessary, you build an index on the integer column in T2, you lose some more space. If the average length of the data is larger than 30 bytes, the space saving increases. If the ratio of repeated rows ever increases, the saving increases.
Whether the space saving is significant depends on your context. If you need half a megabyte more memory, you just got it, and you could squeeze out more if you're sure you won't need more than 65,535 distinct entries by using 2-byte integers instead of 4-byte integers (120 + 960 KiB = 1080 KiB; saving 720 KiB). On the other hand, if you really won't notice half a megabyte in the multi-gigabyte storage that's available, it becomes a more pragmatic problem. Maintaining two tables is harder work, but guarantees that the name is the same each time it is used. Maintaining one table means you have to make sure that repeated names are handled consistently — or, more likely, you ignore the possibility and end up without pairs where you should have pairs, or with triplets where you should have doubletons.
Clearly, if the type that's repeated is a 4-byte integer, using two tables will save nothing; it will cost you space.
A lot, therefore, depends on what you've not told us. The type is one key issue. The other is the semantics behind the repetition.
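For concreteness, a minimal sketch of the two-table scheme (the table and column names here are made up, since the question doesn't show a schema):

CREATE TABLE place_name (
    place_name_id INTEGER PRIMARY KEY,   -- 4-byte surrogate key (a 2-byte SMALLINT works below 65,536 names)
    name VARCHAR(100) NOT NULL UNIQUE    -- the repeated text, stored once
);

CREATE TABLE main_record (
    record_id INTEGER PRIMARY KEY,
    place_name_id INTEGER NOT NULL REFERENCES place_name(place_name_id)
    -- ... the rest of the original 60,000-row table's columns
);

-- reading the data back costs a join:
SELECT m.record_id, p.name
FROM main_record m
JOIN place_name p ON p.place_name_id = m.place_name_id;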

Choosing best datatype for numeric column in SQL Server

I have a table in SQL Server with a large amount of data - around 40 million rows. The base structure is like this:
Title         type             length   Null distribution
Customer-Id   number           8        60%
Card-Serial   number           5        70%
-             -                -        -
-             -                -        -
Note          string-unicode   2000     40%
Both numeric columns hold whole numbers of a fixed length.
I have no idea which data type to choose to keep the database as small as possible while getting good performance from an index on the customerId column. Referring to this post: if I choose CHAR(8), the database consumes 8 bytes per row even for null data.
I decided to use INT to reduce the database size and get a good index, but null values will still use 4 bytes per row. If I want to reduce that size I could use VARCHAR(8), but I don't know whether an index on that type performs well. The main question is whether reducing the database size matters more than having a good index on a numeric type.
Thanks.
If it is a number - then by all means choose a numeric datatype!! Don't store your numbers as char(n) or varchar(n) !! That'll just cause you immeasurable grief and headaches later on.
The choice is pretty clear:
if you have whole numbers - use TINYINT, SMALLINT, INT or BIGINT - depending on the number range you need
if you need fractional numbers - use DECIMAL(p,s) for the best and most robust behaviour (no rounding errors like FLOAT or REAL)
Picking the most appropriate datatype is much more important than any micro-optimization for storage. Even with 40 million rows - that's still not a big issue, whether you use 4 or 8 bytes. Whether you use a numeric type vs. a string type - that makes a huge difference in usability and handling of your database!
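To make that concrete, here is a minimal sketch (the table and index names are made up; only the columns from the question are shown):

CREATE TABLE dbo.CardData (
    CustomerId INT NULL,       -- 8-digit whole numbers fit comfortably in a 4-byte INT
    CardSerial INT NULL,       -- 5-digit values also need INT (SMALLINT tops out at 32,767)
    Note NVARCHAR(2000) NULL   -- the unicode string column from the question
);

CREATE INDEX IX_CardData_CustomerId ON dbo.CardData (CustomerId);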

Maximum Number of Cells in a Cassandra Table

I have a system that stores measurements from machines with many transducers, once per second. I'm considering using Cassandra and would like to store the 1 second sample of machine state measurements in a single table, which would be something like:
create table inst_samples (
machine_id text,
batch_id int,
sample_time timestamp,
var1 double,
var2 double,
.....
varN double,
PRIMARY KEY ((machine_id, batch_id), sample_time)
);
There are approximately 20 machines with 400 state variables each and the batch_id will update every 1-2 hours. I have reviewed the documentation on the 2 billion cells maximum per table and noted similar questions
here What are the maximum number of columns allowed in Cassandra and here Cassandra has a limit of 2 billion cells per partition, but what's a partition?
If I am understanding this limit correctly I would hit the 2 billion cell limit for a single machine in the inst_samples table in approximately 60 days?
(2e9 cells / 400 cols/row) / (3600 rows / hour) / (24 hours / day) =~ 58 days?
I am a total Cassandra newbie. Thanks.
This 2 billion limit is per partition, and with a good data model you should have many partitions. In practice it's recommended to keep the number of cells per partition under control - something like no more than 100,000 cells per partition - otherwise you may see performance problems. The practical limit depends on multiple factors, such as the Cassandra version, what queries are executed, etc.
In your case the partition key is machine_id + batch_id, which for a 2-hour batch gives 400 x 7200 = 2,880,000 - almost 3 million cells per partition. It may still work (it would be better to set the batch size to 1 hour), but it will require testing on real hardware - this could be done, for example, with NoSQLBench.
There are also other ways to optimize your data model - for example, instead of allocating a separate column for every variable, use frozen<map<text, double>>; in that case all measurements are stored as a single cell. The drawback is that you can't change an individual value without reading the whole map and re-inserting it with the changed value. Another drawback is that you'll need to read all measurements at once - but that may be acceptable.
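A hedged sketch of that map-based alternative (the table name is made up; the key structure mirrors the original definition):

create table inst_samples_map (
    machine_id text,
    batch_id int,
    sample_time timestamp,
    measurements frozen<map<text, double>>,  -- all ~400 readings for one second live in a single cell
    PRIMARY KEY ((machine_id, batch_id), sample_time)
);

-- each insert writes the whole map at once, e.g.:
-- insert into inst_samples_map (machine_id, batch_id, sample_time, measurements)
-- values ('m-01', 42, toTimestamp(now()), {'var1': 1.23, 'var2': 4.56});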

The total size of the index is too large or too many parts in index in informix

I am trying to run the following script on Informix:
CREATE TABLE REG_PATH (
REG_PATH_ID SERIAL UNIQUE,
REG_PATH_VALUE LVARCHAR(750) NOT NULL,
REG_PATH_PARENT_ID INTEGER,
REG_TENANT_ID INTEGER DEFAULT 0,
PRIMARY KEY(REG_PATH_ID, REG_TENANT_ID) CONSTRAINT PK_REG_PATH
);
CREATE INDEX IDX1 ON REG_PATH(REG_PATH_VALUE, REG_TENANT_ID);
But it gives the following error:
517: The total size of the index is too large or too many parts in index.
I am using Informix version 11.50FC9TL. My dbspace chunk size is 5M.
What is the reason for this error, and how can I fix it?
I believe 11.50 has support for large page sizes, and to create an index on a column that is LVARCHAR(750) (plus a 4-byte INTEGER), you will need to use a bigger page size for the dbspace that holds the index. Offhand, I think the page size will need to be at least 4 KiB, rather than the default 2 KiB you almost certainly are using. The rule of thumb I remember is 'at least 5 index keys per page', and at 754 bytes plus some overhead, 5 keys squeaks in at just under 4 KiB.
This is different from the value quoted by Bohemian in his answer.
See the IDS 12.10 Information Center for documentation about Informix 12.10.
Creating a dbspace with a non-default page size
CREATE INDEX statement
Index key specification
This last reference has a table of dbspace page sizes and maximum key sizes permitted:
Page Size      Maximum Index Key Size
2 kilobytes    387 bytes
4 kilobytes    796 bytes
8 kilobytes    1,615 bytes
12 kilobytes   2,435 bytes
16 kilobytes   3,245 bytes
If 11.50 doesn't have support for large page sizes, you will have to migrate to a newer version (12.10 recommended, 11.70 a possibility) if you must create such an index.
One other possibility to consider is whether you really want such a large key string; could you reduce it to, say, 350 bytes? That would then fit in your current system.
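If larger page sizes are available to you, the outline is roughly: create a dbspace with (at least) a 4 KiB page size and build the index there. A hedged sketch, with a made-up dbspace name:

-- assumption: dbs4k is a dbspace created with a 4 KiB page size, e.g. something like
--   onspaces -c -d dbs4k -k 4 -p /path/to/chunk -o 0 -s 102400
CREATE INDEX IDX1 ON REG_PATH(REG_PATH_VALUE, REG_TENANT_ID) IN dbs4k;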
From the Informix documentation:
You can include up to 16 columns in a composite index. The total width of all indexed columns in a single composite index cannot exceed 380 bytes.
One of the columns you want to add to your index is REG_PATH_VALUE LVARCHAR(750); 750 bytes is longer than the 380 maximum allowed.
You can't "fix" this per se; either make the column size smaller, or don't include it in the index.

Is there any benefit to my rather quirky character sizing convention?

I love things that are a power of 2. I celebrated my 32nd birthday knowing it was the last time in 32 years I'd be able to claim that my age was a power of 2. I'm obsessed. It's like being some Z-list Batman villain, except without the colourful adventures and a face full of batarangs.
I ensure that all my enum values are powers of 2, if only for future bitwise operations, and I'm reasonably assured that there is some purpose (even if latent) for doing it.
Where I'm less sure, is in how I define the lengths of database fields. Again, I can't help it. Everything ends up being a power of 2.
CREATE TABLE Person
(
PersonID int IDENTITY PRIMARY KEY
,Firstname varchar(64)
,Surname varchar(128)
)
Can any SQL super-boffins who know about the internals of how stuff is stored and retrieved tell me whether there is any benefit to my inexplicable obsession? Is it more efficient to size character fields this way? Can anyone pop in with an "actually, what you're doing works because ....."?
I suspect I'm just getting crazier in my older age, but it'd be nice to know that there is some method to my madness.
Well, if I'm your coworker and I'm reading your code, I don't have to use SVN blame to find out who wrote it. That's kind of cool. :)
The only relevant powers of two are 512 and 4096, which are the typical disk block size and memory page size respectively. If your total row length crosses these boundaries, you might notice disproportionate jumps in performance if you look very closely. For example, if your row is 513 bytes long, you need to read twice as many blocks for a single row as for a row that is 512 bytes long.
The problem is calculating the proper row size, as the internal storage format is not very well documented.
Also, I do not know whether the SQL Server actually keeps the rows block aligned, so you might be out of luck there anyways.
With varchar, you only store the actual characters plus 2 bytes for the length.
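A quick way to see that, with a throwaway table name:

CREATE TABLE dbo.len_demo (a varchar(64), b varchar(100));
INSERT INTO dbo.len_demo VALUES ('John', 'John');
-- both columns report 4 bytes of data; the declared maximum (64 vs 100) changes nothing
SELECT DATALENGTH(a) AS a_bytes, DATALENGTH(b) AS b_bytes FROM dbo.len_demo;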
Generally, the maximum row size is 8060 bytes:
CREATE TABLE dbo.bob (c1 char(3000), c2 char(3000), c3 char(3000))
Msg 1701, Level 16, State 1, Line 1
Creating or altering table 'bob' failed because the minimum row size would be 9007, including 7 bytes of internal overhead. This exceeds the maximum allowable table row size of 8060 bytes.
The power of 2 stuff is frankly irrational and that isn't good in a programmer...

Predicting Oracle Table Growth

How can I predict the future size / growth of an Oracle table?
Assuming:
linear growth of the number of rows
known columns of basic datatypes (char, number, and date)
ignore the variability of varchar2
basic understanding of the space required to store them (e.g. number)
basic understanding of blocks, extents, segments, and block overhead
I'm looking for something more proactive than "measure now, wait, measure again."
Estimate the average row size based on your data types.
Estimate the available space in a block. This will be the block size, minus the block header size, minus the space left over by PCTFREE. For example, if your block header size is 100 bytes, your PCTFREE is 10, and your block size is 8192 bytes, then the free space in a given block is (8192 - 100) * 0.9 = 7282.
Estimate how many rows will fit in that space. If your average row size is 1 kB, then roughly 7 rows will fit in an 8 kB block.
Estimate your rate of growth, in rows per time unit. For example, if you anticipate a million rows per year, your table will grow by roughly 1 GB annually given 7 rows per 8 kB block.
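As a rough worked sketch of those steps against the data dictionary (MY_TABLE is a placeholder; 7282 is the usable space per 8 KiB block from the example above; AVG_ROW_LEN and NUM_ROWS are only populated once statistics have been gathered):

SELECT table_name,
       avg_row_len,
       num_rows,
       FLOOR(7282 / avg_row_len)                         AS rows_per_block,
       CEIL(num_rows / FLOOR(7282 / avg_row_len)) * 8192 AS estimated_bytes
FROM   user_tables
WHERE  table_name = 'MY_TABLE';

-- to project growth, substitute the expected future row count for num_rows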
I suspect that the estimate will depend 100% on the problem domain. Your proposed method seems as good a general procedure as is possible.
Given your assumptions, "measure, wait, measure again" is perfectly predictive. In 10g+ Oracle even does the "measure, wait, measure again" for you. http://download.oracle.com/docs/cd/B19306_01/server.102/b14237/statviews_3165.htm#I1023436
