how youtube can get really huge database?
from google searching 18 ~ 20 terabytes HDD is the largest HDD in the world
and there are more than 800 million videos on youtube
each videos have lots of memories
if each video's memory is 50GB then 800 million * 50GB is 40 billion GB == 39,062,500 TB
then they need at least 1,953,125 HDD
is youtube really have like that numbers of HDD? or is there other way to storage data?
Related
I have a large dataset comprised of over 1000 .csv files that are very large in size (approximately 700 mb each) and I had scheduled it for upload to DocumentDB with AWS Glue, and after 48 hours it timed out. I know some of the data had made the upload but some was left out because of the time out.
I haven’t tried anything because I am not sure where to go from here. I only want one copy of the data in the DocuDB and if I reupload I will likely get 1.5 times the amount of data. I also know that the connection was not at fault because I had seen the DocuDB CPU spike, and checked to see if data was moving there.
I have a large dataset (>40G) which I want to store in S3 and then use Athena for query.
As suggested by this blog post, I could store my data in the following hierarchical directory structure to enable usingMSCK REPAIR to automatically add partitions while creating table from my dataset.
s3://yourBucket/pathToTable/<PARTITION_COLUMN_NAME>=<VALUE>/<PARTITION_COLUMN_NAME>=<VALUE>/
However, this requires me to split my dataset into many smaller data files and each will be stored under a nested folder depending on the partition keys.
Although using partition could reduce amount of data to be scanned by Athena and therefore speed up a query, would managing large amount of small files cause performance issue for S3? Is there a tradeoff here I need to consider?
Yes, you may experience an important decrease of efficiency with small files and lots of partitions.
Here there is a good explanation and suggestion on file sizes and number of partitions, which should be larger than 128 MB to compensate the overhead.
Also, I performed some experiments in a very small dataset (1 GB), partitioning my data by minute, hour and day. The scanned data decreases when you make the partitions smaller, but the time spent on the query will increase a lot (40 times slower in some experiments).
I will try to get into it without veering too much into the realm of opinion.
For the use cases which I have used Athena, 40 GB is actually a very small dataset by the standards of what the underlying technology (Presto) is designed to handle. According to the Presto web page, Facebook uses the underlying technology to query their 300 PB data warehouse. I routinely use it on datasets between 500 GB and 1 TB in size.
Considering the underlying S3 technology, S3 was used to host Dropbox and Netflix, so I doubt most enterprises could come anywhere near taxing the storage infrastructure. Where you may have heard about performance issues and S3 relates to websites storing multiple, small, pieces of static content on many files scattered across S3. In this case, a delay in retrieving one of these small pieces of content might affect user experience on the larger site.
Related Reading:
Presto
I have a SOLR instance running with a couple of cores, each one of them having between 15 to 25 million documents. Normally the size of each core index (on the disk) is around 30-50 GB each, but there is one particular core index that keeps increasing until hard disk space is full (raising up to 200 GB and more).
When I look at other indexes all files are from the current day, but this one core keeps files also from 4-5 days ago (I guess the data is always duplicated on every import).
What could be causing such behavior and what should I look for when debugging it? Thanks.
We are planning our new EBS structure on amazon to get the best performance out of SQL Server. During the process some doubts appeared:
1 - Using the Amazon calculator (http://calculator.s3.amazonaws.com/index.html) we got the costs below:
General purpose (SSD) - 1000GB - 3000 IOPS = $184,30
Provisioned IOPS (SSD) - 1000GB - 3000 IOPS = $511,00
This amount is a huge diference in a month for the same performance (???), I'm aware about the "IOPS burst implementation" on General purpose SSD, but according to documentation:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html
When the volume size is 1000 GB the burst duration is "infinite" (Always 3000 IOPS).
Is it safe to say that the performance between the two disks above are exactly the same?
2 - We need about 1700 GB for 100 databases, what layout should we use?
Options:
Get two disks (GP SSD) with 1000GB (3000 IOPS) each and distribute the workload among this two.
Get two disks (GP SSD) with 1000GB (3000 IOPS) each and put then together with RAID 0 ? (We will be able to have 6000 IOPS burst, but should I be worried about EBS fault?)
Get four disks (GP SSD) with 1000GB (3000 IOPS) each and use RAID 10? (Is it necessary with EBS?)
Give your suggestion, i will be glad to hear.
From Amazon support, hope this helps!
Greetings
The disk cost question is easy enough to answer. General purpose (SSD) and Provisioned IOPS (SSD) use similar technology. Side by side they can achieve the same speeds, the only difference being that GP2 maximum sped is 3000 and PIOPs is 4000, per volume. One reason PIOPS is much more expensive is that you also pay for the number of IO you use, where as GP2 there is no per IO cost.
As for the design of the 1700GB datastore, there are 2 main factors. Redundancy and Performance. And of course cost is a big factor. To provide proper guidance here we would need to know what your actual needs are going to be then we could suggest some solutions. However, there are a couple of main RAID levels etc that match what you suggested that we can talk about.
Get two disks (GP SSD) with 1000GB (3000 IOPS) each and distribute the workload among this two.
No RAID. I take it you mean just have some databases on one volume and some on the other? This to me, is actually fine. All i would do in addition is backup the DBs to some other locally attached EBS volumes. This would be for workloads no greater that 3000 IO (read and writes combined). It's also easily expandable. Just add more disk.
Get two disks (GP SSD) with 1000GB (3000 IOPS) each and put then together with RAID 0 ? (We will be able to have 6000 IOPS burst, but should I be worried about EBS fault?)
RAID 0. All you have done here is make things twice as fast. But lose one disk and you lose everything. Again, if you are happy to restore from backup if a disk fails, this is a fast cheap config, for upto 6000 IO. Not easily expandable.
Get four disks (GP SSD) with 1000GB (3000 IOPS) each and use RAID 10? (Is it necessary with EBS?)
RAID 5, 6, and 10. These are all faster and more redundant. Arguably, RAID 10 is the best config for database, and probably the right config for you. With 1700 GB of data, if things go wrong there will be lots and lots of unhappy people.
Any suggestions?
Have you considered Amazon RDS? RDS has lots of advantages. We do all the heavy lifting, including multi AZ deployments, and RDS can expand vertically (CPU) and horizontally (Space) as your needs grow.
http://aws.amazon.com/rds/details/
The other thing to consider with GP2 is.... you 'might' not need to provision 1TB volumes. You probably do not need the 3000 IO 'infinity' burst model. Lets say you do want to run at 3000 IO all the time. Why not provision 5 x 200GB volumes, where each volume has 3 IO per GB. So 5x200x3=3000IO baseline. Put the 5 volumes in raid 5 (for example) and you should get around 3000IO all day long, and never run out of credit if you dont go over that (and IO is equally distributed)
However, those volumes can each burst to 3000 IO for 30 minutes continuous before you get rate limited to 600IO per vol. Which is still 3000IO in total. So... in this config you can burst to 15,000IO at anytime and when you do get limited you still have the 3000IO you predicted you need. Just don't run at over 3000 for more than needed or you'll have no burst left.
Neat huh? I think it is worthwhile to call or chat in to discuss your actual needs and answer any questions. Ultimately though, you will need to test and benchmark which ever design you decide to go with as talking about things and actual results will always differ! I imagine you guys are quite advanced but - here is a great example benchmark if you want to do some simple tests on various designs to help you decide what is best.
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/benchmark_piops.html
What is Redis's database size to memory ratio?
For instance, if I have an 80MB database, how much RAM will Redis use (when used with a normal web app)?
Redis will use a bit more RAM than disk. The dumpfile format is probably a bit more densely packed. This is some numbers from a real production system (a 64 bit EC2 large instance running Redis 2.0.4 on Ubuntu 10.04):
$ redis-cli info | grep used_memory_human
used_memory_human:1.36G
$ du -sh /mnt/data/redis/dump.rdb
950M /mnt/data/redis/dump.rdb
As you can see, the dumpfile is a few hundred megs smaller than the memory usage.
In the end it depends on what you store in the database. I have mainly hashes in mine, with only a few (perhaps less than 1%) sets. None of the keys contain very large objects, the average object size is 889 bytes.
Redis databases are stored in memory, so an 80mb database would take up 80mb in ram.
Redis is an extremely low memory using program, and you can see that from this example from the website "1 Million keys with the key being the natural numbers from 0 to 999999 and the string "Hello World" as value uses 100MB [of Ram]". My Redis app uses around 300kb to 500kb of ram, so you would need a lot of data to reach a database of 80mb. Redis also saves to disk snapshots of the database, so 80mb in ram and 80mb on the hard drive.