Estimating database size [closed] - sql-server

I was wondering how you go about estimating database size when developing a new application.
E.g. I am planning to launch a website, and I am having a hard time estimating what size I could expect my database to grow to. I don't expect you to tell me what size my database will be, but I'd like to know if there are general principles for estimating this.
E.g. When Jeff developed StackOverflow, he (presumably) guesstimated his database size and growth.
My dilemma is that I am going for a hosted solution for my web application (it's about cost at this stage), and I'd prefer not to shoot myself in the foot by not purchasing enough SQL Server space (they charge a premium for it).

If you have a database schema, sizing is pretty straightforward ... it's just estimated rows * avg row size for each table * some factor for indexes * some other factor for overhead. Given the ridiculously low price of storage nowadays, sizing often isn't a problem unless you intend to have a very high traffic site (or are building an app for a large enterprise).
For my own sizing exercises, I've always created an Excel spreadsheet listing:
col 1: each table that will grow
col 2: estimated row size in bytes
col 3: estimated # of rows (per year or max, depending on the application)
col 4: index factor (I always set this to 2)
col 5: overhead factor (I always set this to 1.2)
col 6: total (col 2 x col 3 x col 4 x col 5)
The sum of col 6 (total column), plus the initial size of your database without growth tables, is your size estimate. You can get much more scientific, but this is my quick and dirty way.
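Here is that spreadsheet as a minimal Python sketch; the table names, row counts, and row sizes below are made-up placeholders, not recommendations:

# Per-table size estimate: rows * avg row size * index factor * overhead factor
tables = [
    # (name, estimated rows per year, average row size in bytes) -- hypothetical
    ("users", 50_000, 400),
    ("posts", 500_000, 1_200),
    ("comments", 2_000_000, 300),
]
INDEX_FACTOR = 2        # rough allowance for indexes
OVERHEAD_FACTOR = 1.2   # rough allowance for page/row overhead

total_bytes = sum(rows * row_size * INDEX_FACTOR * OVERHEAD_FACTOR
                  for _, rows, row_size in tables)
print(f"Estimated growth: {total_bytes / 1024**3:.2f} GiB per year")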

Determine:
how many visitors per day, V
how many records of each type will be created per visit, N1, N2, N3...
the size of each record type, S1, S2, S3...
EDIT: I forgot the index factor; a good rule of thumb is a factor of 2
Total growth per day = 2* V * (N1*S1 + N2*S2 + N3*S3 + ...)
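As a quick sanity check with made-up numbers (none of these figures come from the question):

# Hypothetical example: 2,000 visits/day, each creating one 500-byte record
# and three 200-byte records, with the 2x index factor applied
V = 2_000
growth_per_day = 2 * V * (1 * 500 + 3 * 200)    # bytes per day
growth_per_year = growth_per_day * 365
print(growth_per_day, "bytes/day =", round(growth_per_year / 1e6), "MB/year")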

My rules-of-thumb to follow are
how many users do I expect?
what content can they post?
how big is a user record?
how big is each content item a user can add?
how much will I be adding?
how long will those content items live? forever? just a couple weeks?
Multiply the user record size by the number of users; add the number of users times the number of content items each will add times the content item size; multiply by two (for a convenient fudge factor).
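For a rough illustration with purely made-up numbers: 10,000 users at 2 KB per user record is about 20 MB; 10,000 users each adding 100 content items of 1 KB is about 1 GB; doubled for the fudge factor, you would plan for roughly 2 GB.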

The cost of estimating is likely to be larger than the cost of the storage.
Most hosting providers sell capacity by the amount used at the end of each month, so just let it run.

Related

Array or Map is better in database? [closed]

I am using AWS DynamoDB.
option 1
[{"id":"01","avaliable":true},
{"id":"02","avaliable":true},
{"id":"03","avaliable":false},
{"id":"04","avaliable":true}
{"id":"05","avaliable":false}]
option 2
"avaliable":[true,true,false,true,false]
The ids are always sequential and start at 01, so I think it is a waste to include "id" as an attribute. With option 2 I can just update "avaliable" using {id-1} as the array index. But I am not sure whether there would be any other issues if I use option 2. I originally used option 1, checking that the id is correct before each update. I am afraid option 2 will lead to mistakes more easily.
Which structure do you think is better?
Personally, I prefer to use the Map type in DynamoDB because it lets you update by key rather than guessing which index you need in an array. However, that would be option 3:
"mymap":{
"id01":{"avaliable":true},
"id02":{"avaliable":true},
"id03":{"avaliable":true},
"id04":{"avaliable":true},
"id05":{"avaliable":true}
}
This allows you to modify elements without first trying to figure out what position in the array they are at, which sometimes requires you to read the item first and can cause concurrency issues.
I do notice you mention that you can derive the position of an item in the array from its id; however, I feel the map is a more fool-proof approach for a general implementation. For example, if you need to remove a value from the middle of the list, it would not cause any issues.
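For example, with boto3 (a sketch only: the table name and key are hypothetical, while the attribute names come from the options above), the map entry is addressed by key, whereas the list version requires you to already know the element's index:

import boto3

table = boto3.resource("dynamodb").Table("MyTable")    # hypothetical table

# Option 3 (map): address the entry by its key
table.update_item(
    Key={"pk": "slots"},                                # hypothetical key
    UpdateExpression="SET mymap.id03.avaliable = :v",
    ExpressionAttributeValues={":v": False},
)

# Option 2 (list): you must already know the element's position
table.update_item(
    Key={"pk": "slots"},
    UpdateExpression="SET avaliable[2] = :v",
    ExpressionAttributeValues={":v": False},
)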
That is one thing that can influence your decision; the other two are item size and total storage.
If your item size is substantially less than 1 KB, you will have no issue using option 1 or 3, which increase your item size slightly compared to option 2, as long as the extra characters do not push your average item size over the next 1 KB boundary; that would increase the capacity consumed by write requests.
The other consideration is total storage size. DynamoDB provides a free tier of 25 GB of storage. If you have millions of items and your storage grows substantially, you may decide to use option 2.

Google App Engine - Search API index growth

I would like to know how I can estimate the growth (how much the size increases over a period of time) of an App Engine Search API (FTS) index, based on the number of entities inserted and the amount of information. For this I would basically like to know how the index size is calculated (what it depends on). Specifically:
When inserting new entities, is the growth (size) influenced by the number of previously existing entities (i.e. is the growth exponential)? For example, if I have 1000 entities and insert 10, the index will grow by X bytes. But if I have 100000 entities and insert 10, will it grow by X or by much more than X (exponentially, let's say 10*X)?
Does the number of fields (properties) influence the size exponentially? For example, if I have entity A with 2 fields and entity B with 4 fields (identical in values, for mathematical simplicity), will the size increase when adding entity B be twice that of entity A or much more than that?
What other means can I use to find statistical information? Are there other tools in the App Engine cloud console, or can I do this programmatically?
Thank you.
You can check the size of a given index by running the code below.
import logging

from google.appengine.api import search

# Log the current storage usage of every index
for index in search.get_indexes(fetch_schema=True):
    logging.info("index %s", index.storage_usage)

# Pseudo code: insert a batch of documents, then measure again.
# search_api_insert(data) stands in for your own code that builds a
# search.Document and calls index.put() on it.
amount_of_items_to_add = 100
for _ in range(amount_of_items_to_add):
    search_api_insert(data)

# Re-run the loop to see how much the size increased
for index in search.get_indexes(fetch_schema=True):
    logging.info("index %s", index.storage_usage)
This code is obviously not a complete working example, but you should be able to build a simple method that takes some data, inserts it into the Search API, and returns how much the used storage increased.
I have run a number of tests with different numbers of entities and different numbers of indexed properties per entity, and it seems that the growth of the index size reported by the API is linear, not exponential.
The most interesting thing to know is that although the reported size is close to real-time for inserts, after deleting documents from the index it may take 12, 24, or even 36 hours to update.

A/B testing sorting algorithm [closed]

I want to design an algorithm that enables A/B testing over a variable number of subjects with a variable number of properties per subject.
For example, I have 1000 people with the following properties: they come from two departments, some are managers, some are women, etc. These properties may increase or decrease according to the situation.
I want an algorithm that splits the population in two with the best possible representation of all the properties in both A and B. So I want two groups of 500 people with an equal number from each department in both, an equal number of managers, and an equal number of women. More specifically, I would like to maintain the ratio of each property in both A and B. So if we have 10% managers, I want 10% of sample A and 10% of sample B to be managers.
Any pointers on where to begin? I am pretty sure that such an algorithm exists. I have a gut feeling that this may be unsolvable in some cases, as there may be an odd number of managers AND women AND Dept. 1.
Make a list of permutations of all a/b variables.
Dept1,Manager,Male
Dept1,Manager,Female
Dept1,Junior,Male
...
Dept2,Junior,Female
Go through all the people and assign them to their respective permutation. Maybe randomise the order of the people first just to be sure there is no bias in the order they are added to each permutation.
Dept1,Manager,Male-> Person1, Person16, Person143...
Dept1,Manager,Female-> Person7, Person10, Person83...
Have a second process that goes through each permutation and assigns half the people to one test group and half to the other. You will need to account for odd numbers of people in a group, but that should be fairly easy to factor in; obviously a larger sample size will reduce the impact of those odd numbers on the final results.
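A rough sketch of that two-pass approach in Python (assuming each person is a dict of properties; the dimension names in the usage comment are made up):

import random
from collections import defaultdict

def split_ab(people, dims):
    """Stratify people by every combination of the given dimensions,
    then deal each stratum out alternately to groups A and B."""
    strata = defaultdict(list)
    for person in people:
        strata[tuple(person[d] for d in dims)].append(person)
    group_a, group_b = [], []
    for members in strata.values():
        random.shuffle(members)        # remove any ordering bias
        group_a.extend(members[0::2])  # A gets the extra person on odd counts
        group_b.extend(members[1::2])
    return group_a, group_b

# e.g. group_a, group_b = split_ab(people, ["department", "role", "gender"])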
The algorithm for splitting the groups is simple: take each group of people who have all dimensions in common and assign half to the treatment and half to the control. You don't need to worry about odd numbers of people; whatever statistical test you are using will account for that. If some dimension is heavily skewed (e.g. there are only 2 women in your entire sample), it may be wise to throw that dimension out.
Simple A/B tests usually use a t-test or G-test, but in your case you'd be better off using an ANOVA to determine the significance of the treatment on each of the individual dimensions.
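As a sketch, a one-way ANOVA per dimension slice can be run with scipy; the outcome metric and the numbers below are invented purely for illustration:

from scipy import stats

# Hypothetical outcome metric (e.g. task completion time) for managers
# in group A versus group B; real values would come from the experiment.
managers_a = [12.1, 9.8, 11.4, 10.9, 13.0]
managers_b = [11.7, 10.2, 12.5, 9.9, 12.8]
f_stat, p_value = stats.f_oneway(managers_a, managers_b)
print(f"F={f_stat:.2f}, p={p_value:.3f}")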

Estimation of the Data logging size [closed]

I have a device generating some number of values, say N, each value being 32 bits.
I am logging these values every 10 seconds by writing a new row to an Excel file. I will be creating a new file every day.
I have to estimate the hard disk storage capacity necessary to store these log files for a period of 10 years.
Can someone give any hints regarding the calculation of the size of the log file generated per day?
Assuming worst-case 2's-complement 32-bit values stored as ASCII text...
-2147483648 is 11 characters; allow 13 characters per value to cover a delimiter
1 value / 10 seconds
3600 seconds / hour
24 hours / day
that's 112,320 bytes per day for each of the N values,
"round" that up to 112,640 bytes (divisible by 1024) per day
365.25 days per year
10 years
that's N * 411,417,600 bytes, or slightly more than N * 400 MB.
So if N were 10, that would be slightly more than 4 GB.
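The same arithmetic as a quick script (the 13-characters-per-value figure above is carried over as an assumption):

# Rough log-size estimate: N 32-bit values written as text every 10 seconds
chars_per_value = 13                      # "-2147483648" plus a delimiter
samples_per_day = 24 * 3600 // 10
bytes_per_day = chars_per_value * samples_per_day    # per value of N
bytes_10_years = bytes_per_day * 365.25 * 10

N = 10
print(f"{N * bytes_10_years / 1024**3:.2f} GiB over 10 years")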
Create a sample spreadsheet, add 1000 rows, and save it under a different name.
That will give you an estimate of the per-row cost.
Incremental writing is not a good fit for a complex format such as a spreadsheet; a plain-text log file can simply be appended to.
A spreadsheet tends to rewrite the whole file on each flush.

clustering a SUPER large data set [closed]

I am working on a project as part of my class curriculum. It's a project for Advanced Database Management Systems and it goes like this:
1) Download a large number of images (1,000,000) --> Done
2) Cluster them according to their visual similarity
a) Find the histogram of each image --> Done
b) Now group (cluster) the images according to their visual similarity.
Now, I am having a problem with part 2b. Here is what I did:
A) I found the histogram of each image using MATLAB and represented it as a 1D vector (16 x 16 x 16), so there are 4096 values in a single vector.
B) I generated an ARFF file with the following format: there are 1,000,000 histograms (one per image, thus 1,000,000 rows in the file) and 4097 values in each row (image_name + 4096 double values representing the histogram).
C) The file size is 34 GB. THE BIG QUESTION: HOW THE HECK DO I CLUSTER THIS FILE???
I tried using WEKA and other online tools, but they all hang. Weka gets stuck at "Reading a file".
I have 8 GB of RAM on my desktop and no access to a compute cluster. I tried googling but couldn't find anything helpful about clustering large datasets. How do I cluster these entries?
This is what I thought:
Approach One:
Should I do it in batches of 50,000 or so? That is, cluster the first 50,000 entries and find as many clusters as possible, call them k1, k2, k3... kn.
Then take the next 50,000 and assign them to one of these clusters, and so on? Would this be an accurate representation of all the images, given that the clustering is done only on the basis of the first 50,000?
Approach Two:
Do the above process using a random 50,000 entries?
Does anyone have any input?
Thanks!
EDIT 1:
Any clustering algorithm can be used.
Weka isn't your best tool for this. I found ELKI to be much more powerful (and faster) when it comes to clustering. The largest dataset I've run was ~3 million objects in 128 dimensions.
However, note that at this size and dimensionality, your main concern should be result quality.
If you run e.g. k-means, the result will essentially be random because you are using 4096 histogram bins (far too many, in particular with squared Euclidean distance).
To get a good result, you need to step back and think some more.
What makes two images similar? How can you measure similarity? Verify your similarity measure first.
Which algorithm can use this notion of similarity? Verify the algorithm on a small data set first.
How can the algorithm be scaled up using indexing or parallelism?
In my experience, color histograms worked best with around 8 bins for hue x 3 bins for saturation x 3 bins for brightness. Beyond that, the binning is too fine-grained, and it degrades your similarity measure.
If you run k-means, you gain almost nothing by adding more data. It searches for statistical means, and adding more data won't find a different mean, just a few more digits of precision. So you may just as well use a sample of 10k or 100k pictures, and you will get virtually the same results.
Running it several times on independent sets of pictures produces different clusters that are difficult to merge, so two similar images may end up in different clusters. I would run the clustering algorithm on one random set of images (as large as possible) and use those cluster definitions to sort all the other images.
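A minimal sketch of that sample-then-assign idea with scikit-learn (assuming the histograms are already available as a NumPy array; the file name and the MiniBatchKMeans settings here are just placeholder choices):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# (n_images, n_bins) float array of per-image histograms -- hypothetical file
histograms = np.load("histograms.npy")

# Fit on a random sample only; more data mostly adds precision, not new means
rng = np.random.default_rng(0)
sample = histograms[rng.choice(len(histograms), size=100_000, replace=False)]
km = MiniBatchKMeans(n_clusters=50, random_state=0).fit(sample)

# Then assign every image to the nearest of those cluster centers
labels = km.predict(histograms)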
Alternative: reduce the complexity of your data, e.g. to a histogram of 1024 double values.
