My system is supposed to write a large amount of data into a DynamoDB table every day. These writes come in bursts, i.e. at certain times each day several different processes have to dump their output data into the same table. Write speed is not critical as long as all the daily data gets written before the next dump occurs. I need to figure out the right way of calculating the provisioned capacity for my table.
So, for simplicity, let's assume that I have only one process writing data once a day, and it has to write up to X items into the table (each item < 1KB). Is the capacity I would have to specify essentially equal to X / 24 / 3600 writes per second?
Thx
The provisioned capacity is in terms of writes/second. You need to make sure that you can handle the PEAK number of writes/second that you are going to expect, not the average over the day. So, if you have a single process that runs once a day and makes X number of writes, of Y size (in KB, rounded up), over Z number of seconds, your formula would be
capacity = (X * Y) / Z
So, say you had 100K writes over 100 seconds and each write < 1KB, you would need 1000 w/s capacity.
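To make the formula concrete, here is a minimal sketch in Python; the function name and example numbers are illustrative, not from the original answer:

import math

def required_write_capacity(items, item_size_kb, burst_seconds):
    # One write capacity unit covers one write per second of an item up to 1 KB,
    # so each item consumes ceil(item_size_kb) units.
    units_per_item = math.ceil(item_size_kb)                  # Y, rounded up to whole KB
    return math.ceil(items * units_per_item / burst_seconds)  # (X * Y) / Z

# The worked example above: 100K writes of < 1 KB each over 100 seconds.
print(required_write_capacity(items=100_000, item_size_kb=1, burst_seconds=100))  # 1000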
Note that in order to minimize provisioned write capacity needs, it is best to add data into the system on a more continuous basis, so as to reduce peaks in necessary read/write capacity.
We have a data stream that continuously dumps data into our data lake. Is there a good solution, with minimal ongoing processing, to get a 10% random sample of that data?
I'm currently using the code snippet below, but this will grow beyond 10% total sampling as new batches arrive. I've also tried calculating 10 batches of 100 records each with a 0.1 mean, but that resulted in ~32% sampling.
select id,
(uniform(0::float, 1::float, random(1)) < .10)::boolean as sampling
from temp_hh_mstr;
Before this, I considered sampling via Snowflake's TABLESAMPLE, subtracting the IDs already present in the sample from the total count in the table. But that requires recomputation every time a batch arrives, which will increase the cost.
Some additional references I've been considering:
Wilson Score Interval With Continuity Correction
Binomial Confidence Interval
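To tie those references back to the numbers above: a binomial confidence interval is a quick way to check whether an observed sampling fraction is consistent with the 10% target. Below is a minimal Python sketch of the plain Wilson score interval (without the continuity correction mentioned above), fed with the ~32% result over 10 batches of 100 records as a hypothetical input:

import math

def wilson_interval(successes, trials, z=1.96):
    # Wilson score interval (no continuity correction) for a proportion.
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

# Roughly 32% sampled across 10 batches of 100 records (1000 rows total).
low, high = wilson_interval(successes=320, trials=1000)
print(f"95% interval: {low:.3f} to {high:.3f}")  # well above 0.10, so the excess is not noise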
I have a system that stores measurements from machines with many transducers, once per second. I'm considering using Cassandra and would like to store the 1 second sample of machine state measurements in a single table, which would be something like:
create table inst_samples (
    machine_id text,
    batch_id int,
    sample_time timestamp,
    var1 double,
    var2 double,
    ...
    varN double,
    PRIMARY KEY ((machine_id, batch_id), sample_time)
);
There are approximately 20 machines with 400 state variables each and the batch_id will update every 1-2 hours. I have reviewed the documentation on the 2 billion cells maximum per table and noted similar questions
here ("What are the maximum number of columns allowed in Cassandra") and here ("Cassandra has a limit of 2 billion cells per partition, but what's a partition?").
If I am understanding this limit correctly I would hit the 2 billion cell limit for a single machine in the inst_samples table in approximately 60 days?
(2e9 cells / 400 cols/row) / (3600 rows / hour) / (24 hours / day) =~ 58 days?
I am a total Cassandra newbie. Thanks.
This 2 billion limit is per partition, and if you have a good data model you should have many partitions. In practice it's recommended to keep the number of cells per partition under control - something like no more than 100,000 cells per partition - otherwise there can be performance problems. The actual limit depends on multiple factors, such as the Cassandra version, what queries are executed, etc.
In your case the partition key is machine_id + batch_id, so for a batch size of 2 hours you get 400 x 7200 = 2,880,000 - almost 3 million cells per partition. It may still work (it would be better if you set the batch size to 1 hour), but it will require testing on real hardware - this could be done, for example, with NoSQLBench.
There are also other ways to optimize your data model - for example, instead of allocating a separate column for every variable, use frozen<map<text, double>>; in this case all measurements are stored as a single cell. The drawback is that you can't change an individual value without reading the map and re-inserting it with the changed value. Another drawback is that you'll need to read all measurements at once - but this could be OK.
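A minimal sketch of that map-based alternative, using the DataStax Python driver against a local node; the keyspace, table, and column names here are hypothetical, not part of the original schema:

from datetime import datetime, timezone
from cassandra.cluster import Cluster  # DataStax Python driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("monitoring")  # hypothetical keyspace

# All ~400 measurements for one second go into a single frozen map cell.
session.execute("""
    CREATE TABLE IF NOT EXISTS inst_samples_map (
        machine_id  text,
        batch_id    int,
        sample_time timestamp,
        vars        frozen<map<text, double>>,
        PRIMARY KEY ((machine_id, batch_id), sample_time)
    )
""")

measurements = {"var1": 1.23, "var2": 4.56}  # ... up to varN
session.execute(
    "INSERT INTO inst_samples_map (machine_id, batch_id, sample_time, vars) "
    "VALUES (%s, %s, %s, %s)",
    ("machine-01", 42, datetime.now(timezone.utc), measurements),
)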
I have a text file containing md5 hashes, 100 million rows of them. I have another, smaller file with a few thousand md5 hashes. I want to find the corresponding indices of the md5 hashes from this new smaller file in the old bigger file.
What is the most efficient way to do it? Is it possible to do it in, say, 15 minutes?
I have tried a lot of things but they do not work. First I tried to import the bigger data into a database table and create an index on the md5 hash column. Creating this index takes forever, and I am not even sure it would increase the query speed much. Suggestions?
Don't do this in a database - use a simple program.
Read the md5 hashes from the small file into a hash map in memory, which allows fast look-ups.
Then read through the md5s in the big file one row at a time, and check whether each row is in the hash map.
Average look-up time in the hash map ought to be close to O(1), so the processing time is basically how fast you can read through the big file.
The 15 minutes is easily obtained with today's hardware with this approach.
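A minimal sketch of that approach in Python, assuming one hex md5 per line; the file names are hypothetical:

# Load the few thousand hashes from the small file into a set for O(1) membership tests.
with open("small.txt") as f:
    wanted = {line.strip() for line in f}

# Stream the big file once; print the row index of every hash we are looking for.
with open("big.txt") as f:
    for index, line in enumerate(f):  # 0-based row index in the big file
        md5 = line.strip()
        if md5 in wanted:
            print(index, md5)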
First of all: 100 million rows at 32 bytes each is about 3.2 GB of data. Reading that in 15 minutes translates to roughly 3.5 MB per second, which should easily be doable with modern hardware.
I recommend not using a database, but a process consisting of a few easy steps:
Sort your data (the big file) - you have to do this only once, and you can parallelize much of it
Read the small file into memory (sorted into an array)
Cycle through this array:
Read the big file line by line, comparing with the current element of your array (first compare the first byte, then the first and second, ...) until you either reach a match (output the index) or pass the value (output "not found")
Move to the next array element
The initial sort might easily take longer than 15 minutes, but the lookups should be quite fast: if you have enough RAM (and an OS that supports processes bigger than 2 GB) you should be able to get a compare rate at least an order of magnitude faster!
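A minimal sketch of that merge-style scan, assuming both inputs are already sorted lexicographically and that the big file was sorted as lines of the form "md5<TAB>original_index" so the original row numbers survive the sort (file names are hypothetical):

def merge_lookup(small_sorted_path, big_sorted_path):
    # Both inputs must be sorted; the small file fits comfortably in memory.
    with open(small_sorted_path) as f:
        targets = sorted(line.strip() for line in f)

    with open(big_sorted_path) as big:
        line = big.readline()
        for target in targets:
            # Advance through the big file while its current hash sorts before the target.
            while line and line.split("\t", 1)[0] < target:
                line = big.readline()
            if line and line.split("\t", 1)[0] == target:
                print(target, line.rstrip("\n").split("\t", 1)[1])  # original row index
            else:
                print(target, "not found")

merge_lookup("small_sorted.txt", "big_sorted.tsv")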
There are algorithms specifically designed for searching for multiple strings in a large file. One of them is Rabin-Karp. I have a blog post about this.
More simply, the following pseudo-code should get you there in no time:
Load your few thousand strings in a set data structure
For each line (index: i) in your file
    If that line appears in your set of values
        print i
This will be very fast: the set data structure has almost-instant lookups, so I/O will be the bottleneck, and 100 million hashes will fit within 15 minutes without too much difficulty.
Assumptions:
(1) every record in the small file appears in the large file
(2) the data in each file is randomly ordered.
Options:
(1) For each record in the large file, search the small file linearly for a match. Since most searches will not find a match, the time will be close to
Nlarge * Nsmall * k
where k represents the time to attempt one match.
(2) For each record in the small file, search the large file linearly for a match. Since every search will find a match, the time will be about
Nlarge/2 * Nsmall * k.
This looks twice as fast as option (1) -- but only if you can fit the large file completely into fast memory. You would probably need 6 GB of RAM.
(3) Sort the small file into an easily searchable form. A balanced binary tree is best, but a sorted array is almost as good. Or you could trust the author of some convenient hash table object to have paid attention in CS school. For each record in the large file, search the structured small file for a match. The time will be
log2 Nsmall * s
to sort the small file, where s represents the time to sort one record, plus
log2 Nsmall * Nlarge * k
for the scan. This gives a total time of
log2 Nsmall * (s + Nlarge * k).
(4) Sort the large file into an easily searchable form. For each record in the small file, search the structured large file for a match. The time will be
log2 Nlarge * s
to sort the large file plus
log2 Nlarge * Nsmall * k
for the scan, giving a total of
log2 Nlarge * (s + Nsmall * k).
Option (4) is obviously the fastest, as reducing any coefficient of Nlarge dominates all other improvements. But if the sortable structure derived from the large file will not fit completely into RAM, then option (3) might turn out to be faster.
(5) Sort the large file into an easily searchable form. Break this structure into pieces that will fit into your RAM. For each such piece, load the piece into RAM, then for each record in the small file, search the currently loaded piece for a match. The time will be
log2 Nlarge * s
to sort the large file plus
log2 Nlarge * Nsmall * k * p
for the scan, where the structure was broken into p pieces, giving a total of
log2 Nlarge * (s + Nsmall * k * p).
With the values you indicated for Nlarge and Nsmall, and enough RAM so that p can be kept to a single digit, option (5) seems likely to be the fastest.
Over the connections that most people in the USA have in their homes, what is the approximate length of time to send a list of 200,000 integers from a client's browser to an internet server (say Google App Engine)? Does it change much if the data is sent from an iPhone?
How does the length of time increase as the size of the integer list increases (say with a list of a million integers)?
Context: I wasn't sure whether I should write code to do some simple computations and sorting of such lists in JavaScript for the browser or in Python for the server, so I wanted to explore how long it takes to send the output data from a browser to a server over the web, in order to help me decide where (the client's browser or the App Engine server) is the best place for such computations.
More Context:
Type of Integers: I am dealing with 2 lists of integers. One is a list of ids for the 200,000 objects, whose values look like {0, 1, 2, 3, ..., 99,999}. The second list of 100,000 is just single digits {..., 4, 5, 6, 7, 8, 9, 0, 1, ...}.
Type of Computations: From the browser, a person will create her own custom index (or ranking) by changing the weights associated with about 10 variables referenced to the 100,000 objects: INDEX = w1*Var1 + w2*Var2 + ... + wN*VarN. So the computations involve multiplying a vector (array) by a scalar, adding two vectors, and sorting the final INDEX vector of 100,000 values.
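For reference, the computation described above amounts to something like the following sketch (in Python, since that is the server-side option being considered; the weights and variable values are made up):

import random

n_objects = 100_000
n_vars = 10

# Hypothetical per-object variables and user-chosen weights.
variables = [[random.random() for _ in range(n_vars)] for _ in range(n_objects)]
weights = [0.5, 1.0, -0.25, 2.0, 0.1, 0.0, 1.5, 0.75, -1.0, 0.3]

# INDEX = w1*Var1 + w2*Var2 + ... + wN*VarN for each object, then sort by it.
index = [sum(w * v for w, v in zip(weights, row)) for row in variables]
ranking = sorted(range(n_objects), key=lambda i: index[i], reverse=True)
print(ranking[:10])  # ids of the top 10 objects under this weighting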
In a nutshell...
This is probably a bad idea,
in particular with/for mobile devices where, aside from the delay associated with the transfer(s), limits and/or extra fees on monthly volumes exceeding various plan limits make this a poor economic option...
A rough estimate (more info below) is that the one-way transmission takes between 0.7 and 5 seconds.
There is a lot of variability in this estimate, due mainly to two factors:
Network technology and plan
The compression ratio which can be obtained for 200k integers.
Since the network characteristics are more or less a given, the most significant improvement would come from the compression ratio. This in turn depends greatly on the statistical distribution of the 200,000 integers. For example, if most of them are smaller than, say, 65,000, it is quite likely that the list would compress to about 25% of its original size (a 75% size reduction). The time estimates provided assume only a 25 to 50% size reduction.
Another network consideration is the availability of a binary MIME extension (8-bit MIME), which would avoid the 33% overhead of B64, for example.
Other considerations / ideas:
This type of network usage will not fare very well on iPhone / mobile-device plans!
AT&T will love you (maybe); your end-users will hate you, at least the ones with plan limits, which many (most?) have.
Rather than sending one big list, you could split it into 3 or 4 chunks, allowing the server-side sorting to take place [mostly] in parallel with the data transfer.
One gets a better compression ratio for integers when they are [roughly] sorted, so maybe you can do a first-pass sort of some kind client-side.
How do I figure? ...
1) Amount of data to transfer (one-way)
200,000 integers
= 800,000 bytes (assumes 4-byte integers)
= 400,000 to 600,000 bytes compressed (you'll want to compress!)
= 533,000 to 800,000 bytes in B64 format for MIME encoding
2) Time to upload (varies greatly...)
Low-end home setup (ADSL) = 3 to 5 seconds
broadband (eg DOCSIS) = 0.7 to 1 second
iPhone = 0.7 to 5 seconds possibly worse;
possibly a bit better with high-end plan
3) Time to download (back from server, once list is sorted)
Assume same or slightly less than upload time.
With portable devices, the differential is more notable.
The question is unclear about what would have to be done with the resulting
(sorted) array, so I didn't worry too much about the "return trip".
==> Multiply by 2 (or 1.8) for a safe estimate of a round trip, or inquire
about the specific network/technology.
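A minimal sketch of the arithmetic above, with the compression ratio, Base64 overhead, and upload speed exposed as adjustable assumptions:

def upload_seconds(n_integers, bytes_per_int=4, compression_ratio=0.5,
                   base64_overhead=4 / 3, upload_kbps=1000):
    # Rough one-way transfer time; every default here is an assumption.
    raw_bytes = n_integers * bytes_per_int
    wire_bytes = raw_bytes * compression_ratio * base64_overhead
    return wire_bytes * 8 / (upload_kbps * 1000)  # kbps -> bits per second

# 200,000 integers, 50% compression, B64 encoding, ~1 Mbps ADSL upload.
print(f"{upload_seconds(200_000):.1f} s")  # ~4.3 s, in line with the 3-5 s low-end estimate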
By default, typically integers are stored in a 32-bit value, or 4 bytes. 200,000 integers would then be 800,000 bytes, or 781.25 kilobytes. It would depend on the client's upload speed, but at 640Kbps upload, that's about 10 seconds.
Well, that is 800,000 bytes, or 781.3 KB - roughly the size of a normal JPEG photo. On broadband, that would be a matter of seconds, and you could always consider compression (there are libraries for this).
The time increases linearly with the amount of data.
Since you're sending the data from JavaScript to the server, you'll be using a text representation. The size will depend a lot on the number of digits in each integer. Are we talking about 200,000 two- to three-digit integers or six- to eight-digit integers? It also depends on whether HTTP compression is enabled and whether Safari on the iPhone supports it (I'm not sure).
The amount of time will scale linearly with the size. Typical upload speeds on an iPhone will vary a lot depending on whether the user is on business wifi, public wifi, home wifi, 3G, or the EDGE network.
If you're that dependent on performance, perhaps this is more appropriate for a native app than an HTML app. Even if you don't do the calculations on the client, you can send/receive binary data and compress it, which will reduce the transfer time.
How can I predict the future size / growth of an Oracle table?
Assuming:
linear growth of the number of rows
known columns of basic datatypes (char, number, and date)
ignore the variability of varchar2
basic understanding of the space required to store them (e.g. number)
basic understanding of blocks, extents, segments, and block overhead
I'm looking for something more proactive than "measure now, wait, measure again."
Estimate the average row size based on your data types.
Estimate the available space in a block. This will be the block size, minus the block header size, minus the space left over by PCTFREE. For example, if your block header size is 100 bytes, your PCTFREE is 10, and your block size is 8192 bytes, then the free space in a given block is (8192 - 100) * 0.9 = 7282.
Estimate how many rows will fit in that space. If your average row size is 1 kB, then roughly 7 rows will fit in an 8 kB block.
Estimate your rate of growth, in rows per time unit. For example, if you anticipate a million rows per year, your table will grow by roughly 1 GB annually given 7 rows per 8 kB block.
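A minimal sketch of those steps in Python, with the block header size, PCTFREE, and growth rate treated as assumptions to adjust for your own system:

import math

def annual_growth_bytes(avg_row_bytes, rows_per_year,
                        block_size=8192, block_header=100, pctfree=10):
    # Estimate yearly table growth from the steps above; every default is an assumption.
    usable = (block_size - block_header) * (1 - pctfree / 100)  # free space per block
    rows_per_block = max(1, int(usable // avg_row_bytes))
    blocks_needed = math.ceil(rows_per_year / rows_per_block)
    return blocks_needed * block_size

# The worked example above: ~1 KB rows, one million rows per year, 8 kB blocks.
growth = annual_growth_bytes(avg_row_bytes=1024, rows_per_year=1_000_000)
print(f"{growth / 2**30:.2f} GiB per year")  # roughly 1 GiB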
I suspect that the estimate will depend 100% on the problem domain. Your proposed method seems as good a general procedure as is possible.
Given your assumptions, "measure, wait, measure again" is perfectly predictive. In 10g+ Oracle even does the "measure, wait, measure again" for you. http://download.oracle.com/docs/cd/B19306_01/server.102/b14237/statviews_3165.htm#I1023436