Predicting Oracle Table Growth

How can I predict the future size / growth of an Oracle table?
Assuming:
- linear growth of the number of rows
- known columns of basic datatypes (char, number, and date)
- ignore the variability of varchar2
- basic understanding of the space required to store them (e.g. number)
- basic understanding of blocks, extents, segments, and block overhead
I'm looking for something more proactive than "measure now, wait, measure again."

1. Estimate the average row size based on your data types.
2. Estimate the available space in a block. This will be the block size, minus the block header size, minus the space reserved by PCTFREE. For example, if your block header size is 100 bytes, your PCTFREE is 10, and your block size is 8192 bytes, then the free space in a given block is (8192 - 100) * 0.9 = 7282 bytes.
3. Estimate how many rows will fit in that space. If your average row size is 1 kB, then roughly 7 rows will fit in an 8 kB block.
4. Estimate your rate of growth, in rows per time unit. For example, if you anticipate a million rows per year, your table will grow by roughly 1 GB annually at 7 rows per 8 kB block.
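As a back-of-the-envelope illustration of those four steps, here is a small Python sketch. The block size, header overhead, PCTFREE, average row size, and growth rate are placeholder assumptions you would replace with your own figures, and it ignores row chaining, migration, and extent rounding.

import math

# Illustrative inputs - substitute your own values
block_size = 8192          # bytes per block
block_header = 100         # assumed block header overhead, bytes
pctfree = 10               # percent of each block reserved for future updates
avg_row_size = 1024        # estimated average row size, bytes (step 1)
rows_per_year = 1_000_000  # assumed growth rate (step 4)

# Step 2: usable space per block
usable = (block_size - block_header) * (1 - pctfree / 100)   # ~7282 bytes

# Step 3: rows per block (round down; assumes rows are not chained across blocks)
rows_per_block = math.floor(usable / avg_row_size)           # 7 rows

# Projected growth in blocks and bytes
blocks_per_year = math.ceil(rows_per_year / rows_per_block)
bytes_per_year = blocks_per_year * block_size
print(f"~{bytes_per_year / 2**30:.2f} GiB per year")         # ~1.09 GiB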

I suspect that the estimate will depend 100% on the problem domain. Your proposed method seems as good a general procedure as is possible.

Given your assumptions, "measure, wait, measure again" is perfectly predictive. In 10g+ Oracle even does the "measure, wait, measure again" for you. http://download.oracle.com/docs/cd/B19306_01/server.102/b14237/statviews_3165.htm#I1023436

Related

Maximum Number of Cells in a Cassandra Table

I have a system that stores measurements from machines with many transducers, once per second. I'm considering using Cassandra and would like to store the 1 second sample of machine state measurements in a single table, which would be something like:
create table inst_samples (
machine_id text,
batch_id int,
sample_time timestamp,
var1 double,
var2 double,
.....
varN double,
PRIMARY KEY ((machine_id, batch_id), sample_time)
);
There are approximately 20 machines with 400 state variables each, and the batch_id updates every 1-2 hours. I have reviewed the documentation on the 2 billion cells maximum per table and noted similar questions here ("What are the maximum number of columns allowed in Cassandra") and here ("Cassandra has a limit of 2 billion cells per partition, but what's a partition?").
If I am understanding this limit correctly, would I hit the 2 billion cell limit for a single machine in the inst_samples table in approximately 60 days?
(2e9 cells / 400 cols/row) / (3600 rows / hour) / (24 hours / day) =~ 58 days?
I am a total Cassandra newbie. Thanks.
This 2 billion limit is per partition, and if you have a good data model you should have many partitions. In practice it's recommended to keep the number of cells per partition under control - something like no more than 100,000 cells per partition - otherwise there could be performance problems. The actual practical limit depends on multiple factors, such as the Cassandra version and what queries are executed.
In your case, the partition key is machine_id + batch_id, which for a 2-hour batch gives 400 x 7200 = 2,880,000 cells - almost 3 million per partition. It may still work (it would be better if you set the batch size to 1 hour), but it will require testing on real hardware - this could be done, for example, with NoSQLBench.
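For a quick sanity check on that arithmetic, here is a rough Python sketch using the numbers from the question (400 variables, one row per second). The ~100,000-cell figure is the rough guideline mentioned above, not a hard rule.

# Rough cells-per-partition arithmetic for the inst_samples model.
vars_per_row = 400
rows_per_second = 1
hard_limit = 2_000_000_000       # Cassandra's hard cap on cells per partition
soft_target = 100_000            # rough practical guideline before performance suffers

for batch_hours in (1, 2):
    rows = batch_hours * 3600 * rows_per_second
    cells = rows * vars_per_row
    print(f"{batch_hours}h batch: {rows} rows, {cells:,} cells "
          f"({cells / soft_target:.0f}x the ~100K guideline, "
          f"{cells / hard_limit:.1%} of the hard limit)")

# 1h batch: 3600 rows, 1,440,000 cells (14x the ~100K guideline, 0.1% of the hard limit)
# 2h batch: 7200 rows, 2,880,000 cells (29x the ~100K guideline, 0.1% of the hard limit)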
There are also other ways to optimize your data model - for example, instead of allocating a separate column for every variable, use a frozen<map<text, double>>; in that case all measurements are stored as a single cell. The drawback is that you can't change an individual value without reading the whole map and re-inserting it with the changed value. Another drawback is that you'll need to read all measurements at once - but that could be acceptable.

Worth a unique table for database values that repeat ~twice?

I have a static database of ~60,000 rows. There is a certain column for which there are ~30,000 unique entries. Given that ratio (60,000 rows/30,000 unique entries in a certain column), is it worth creating a new table with those entries in it, and linking to it from the main table? Or is that going to be more trouble than it's worth?
To put the question in a more concrete way: will I gain a lot more efficiency by separating this field out into its own table?
** UPDATE **
We're talking about a VARCHAR(100) field, but in reality, I doubt any of the entries use that much space -- I could most likely trim it down to VARCHAR(50). Example entries: "The Gas Patch and Little Canada" and "Kora Temple Masonic Bldg. George Coombs"
If the field is a VARCHAR(255) that normally contains about 30 characters, and the alternative is to store a 4-byte integer in the main table and use a second table with a 4-byte integer and the VARCHAR(255), then you're looking at some space saving.
Old scheme:
T1: 30 bytes * 60 K entries = 1800 KiB.
New scheme:
T1: 4 bytes * 60 K entries = 240 KiB
T2: (4 + 30) bytes * 30 K entries = 1020 KiB
So, that's crudely 1800 - 1260 = 540 KiB space saving. If, as would be necessary, you build an index on the integer column in T2, you lose some more space. If the average length of the data is larger than 30 bytes, the space saving increases. If the ratio of repeated rows ever increases, the saving increases.
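A small Python sketch of that arithmetic, including the 2-byte key variant discussed just below; it treats the 30-byte average length and the 60 K / 30 K row counts as assumptions and ignores the index overhead mentioned above.

# Crude space comparison: inline VARCHAR vs. lookup table with an integer key.
K = 1024
rows, distinct = 60 * K, 30 * K
avg_len = 30                      # assumed average stored length of the VARCHAR data

def scheme_size(key_bytes):
    t1 = key_bytes * rows                     # main table now holds only the key
    t2 = (key_bytes + avg_len) * distinct     # lookup table: key + string
    return t1 + t2

inline = avg_len * rows                       # old scheme: string stored in every row
for key_bytes in (4, 2):
    new = scheme_size(key_bytes)
    print(f"{key_bytes}-byte key: {new // K} KiB vs {inline // K} KiB inline "
          f"-> saves {(inline - new) // K} KiB")

# 4-byte key: 1260 KiB vs 1800 KiB inline -> saves 540 KiB
# 2-byte key: 1080 KiB vs 1800 KiB inline -> saves 720 KiB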
Whether the space saving is significant depends on your context. If you need half a megabyte more memory, you just got it — and you could squeeze more if you're sure you won't need to go above 65535 distinct entries by using 2-byte integers instead of 4 byte integers (120 + 960 KiB = 1080 KiB; saving 720 KiB). On the other hand, if you really won't notice the half megabyte in the multi-gigabyte storage that's available, then it becomes a more pragmatic problem. Maintaining two tables is harder work, but guarantees that the name is the same each time it is used. Maintaining one table means that you have to make sure that the pairs of names are handled correctly — or, more likely, you ignore the possibility and you end up without pairs where you should have pairs, or you end up with triplets where you should have doubletons.
Clearly, if the type that's repeated is a 4 byte integer, using two tables will save nothing; it will cost you space.
A lot, therefore, depends on what you've not told us. The type is one key issue. The other is the semantics behind the repetition.

How do I figure out provisioned throughput for an AWS DynamoDB table?

My system is supposed to write a large amount of data into a DynamoDB table every day. These writes come in bursts, i.e. at certain times each day several different processes have to dump their output data into the same table. Speed of writing is not critical as long as all the daily data gets written before the next dump occurs. I need to figure out the right way of calculating the provisioned capacity for my table.
So for simplicity, let's assume that I have only one process writing data once a day and it has to write up to X items into the table (each item < 1KB). Is the capacity I would have to specify essentially equal to X / 24 / 3600 writes/second?
Thx
The provisioned capacity is in terms of writes/second. You need to make sure that you can handle the PEAK number of writes/second that you are going to expect, not the average over the day. So, if you have a single process that runs once a day and makes X number of writes, of Y size (in KB, rounded up), over Z number of seconds, your formula would be
capacity = (X * Y) / Z
So, say you had 100K writes over 100 seconds and each write < 1KB, you would need 1000 w/s capacity.
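Here is a small Python sketch of that formula. The helper name and the example numbers are illustrative assumptions, not anything from your actual workload.

import math

def write_capacity_units(writes, item_kb, seconds):
    """Peak write capacity needed to push `writes` items of `item_kb` (rounded up
    to whole KB) through the table in `seconds`, i.e. capacity = (X * Y) / Z."""
    return math.ceil(writes * math.ceil(item_kb) / seconds)

print(write_capacity_units(100_000, 0.7, 100))    # items under 1 KB, 100 s burst -> 1000 writes/s
print(write_capacity_units(100_000, 0.7, 3600))   # same work spread over an hour -> 28 writes/s

The second line illustrates the note below: spreading the same writes over a longer window sharply reduces the peak capacity you have to provision.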
Note that in order to minimize provisioned write capacity needs, it is best to add data into the system on a more continuous basis, so as to reduce peaks in necessary read/write capacity.

Is there any benefit to my rather quirky character sizing convention?

I love things that are a power of 2. I celebrated my 32nd birthday knowing it was the last time in 32 years I'd be able to claim that my age was a power of 2. I'm obsessed. It's like being some Z-list Batman villain, except without the colourful adventures and a face full of batarangs.
I ensure that all my enum values are powers of 2, if only for future bitwise operations, and I'm reasonably assured that there is some purpose (even if latent) for doing it.
Where I'm less sure is in how I define the lengths of database fields. Again, I can't help it. Everything ends up being a power of 2.
CREATE TABLE Person
(
PersonID int IDENTITY PRIMARY KEY
,Firstname varchar(64)
,Surname varchar(128)
)
Can any SQL super-boffins who know about the internals of how stuff is stored and retrieved tell me whether there is any benefit to my inexplicable obsession? Is it more efficient to size character fields this way? Can anyone pop in with an "actually, what you're doing works because ....."?
I suspect I'm just getting crazier in my older age, but it'd be nice to know that there is some method to my madness.
Well, if I'm your coworker and I'm reading your code, I don't have to use SVN blame to find out who wrote it. That's kind of cool. :)
The only relevant powers of two are 512 and 4096, which are the default disk block size and memory page size, respectively. If your total row length crosses these boundaries, you might notice disproportionate jumps in performance if you look very closely. For example, if your row is 513 bytes long, you need to read twice as many blocks for a single row as for a row that is 512 bytes long.
The problem is calculating the proper row size, as the internal storage format is not very well documented.
Also, I do not know whether SQL Server actually keeps rows block-aligned, so you might be out of luck there anyway.
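A trivial Python sketch of the boundary effect described above, assuming fixed 512-byte blocks and rows that are never split across blocks:

import math

BLOCK = 512   # assumed disk block size in bytes

def blocks_per_row(row_bytes):
    # A row that spills one byte past a block boundary costs a whole extra block.
    return math.ceil(row_bytes / BLOCK)

for size in (512, 513, 1024, 1025):
    print(f"{size:>4}-byte row -> {blocks_per_row(size)} block(s)")
# 512 -> 1, 513 -> 2, 1024 -> 2, 1025 -> 3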
With varchar, you only store the actual number of characters plus 2 bytes for the length.
Generally, the maximum row size is 8060 bytes:
CREATE TABLE dbo.bob (c1 char(3000), c2 char(3000), c31 char(3000))
Msg 1701, Level 16, State 1, Line 1
Creating or altering table 'bob' failed because the minimum row size would be 9007, including 7 bytes of internal overhead. This exceeds the maximum allowable table row size of 8060 bytes.
The power of 2 stuff is frankly irrational and that isn't good in a programmer...

How many is a "large" data set?

Assume infinite storage, so that size/volume (gigabytes/terabytes) doesn't matter, only the number of elements and their labels. Statistically, a pattern should already emerge at around 30 subsets, but would you agree that fewer than 1000 subsets is too few to test, and that at least 10,000 distinct subsets/elements/entries/entities counts as "a large data set"? Or should it be larger?
Thanks
I'm not sure I understand your question, but it sounds like you are asking how many elements of a data set you need to sample in order to ensure a certain degree of accuracy (30 is a magic number from the Central Limit Theorem that comes into play frequently).
If that is the case, the sample size you need depends on the confidence level and confidence interval. If you want a 95% confidence level and a 5% confidence interval (i.e. you want to be 95% confident that the proportion you determine from your sample is within 5% of the proportion in the full data set), you end up needing a sample size of no more than 385 elements. The greater the confidence level and the smaller the confidence interval that you want to generate, the larger the sample size you need.
Here is a nice discussion on the mathematics of determining sample size, and a handy sample size calculator if you just want to run the numbers.
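For reference, the 385 figure comes from the standard sample-size formula for estimating a proportion. A small Python sketch, assuming an effectively infinite population and the worst-case proportion p = 0.5:

import math

def sample_size(z, margin, p=0.5):
    """n = z^2 * p * (1 - p) / e^2: sample size needed to estimate a proportion
    to within `margin` at the confidence level implied by z, worst case p = 0.5."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(sample_size(1.96, 0.05))    # 95% confidence, +/-5% margin -> 385
print(sample_size(2.576, 0.03))   # 99% confidence, +/-3% margin -> 1844

As the answer notes, tightening the confidence level or the interval drives the required sample size up quickly.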
