Upload millions of images from tablet to server - sql-server

I want to architect a system that will allow thousands of users to upload images from a tablet to a content management system. In one upload, each user can upload up to 12 images at a time and there could be up to 20,000 uploads per day. As the numbers are <240,000 images per day, I've been wondering what is the best approach to avoid bottle necking during peak times.
I'm thinking along the lines of using a web server farm (IIS) to upload the images though HTTP POST. Where each image is less than 200kB and I could store the images on a file system. This would be 48GB per day and only 16TB per year.
Then I could store the image metadata in SQL Server DB along with other textual data. At a later time, the users will want to recall the images and other (text) data from a DB to the tablet for further processing.
On a small scale, this is no problem ,but I'm interested in what everyone thinks is the best approach for uploading/retrieving such a large number of images/records per day?

I've been wondering what is the best approach to avoid bottle necking during peak times.
Enough hardware. Period.
I'm thinking along the lines of using a web server farm (IIS) to upload the images though
HTTP POST.
No alternative to that that is worth mentioning.
This would be 48GB per day and only 16TB per year.
Yes. Modern storage is just fantastic ;)
Then I could store the image metadata in SQL Server DB along with other textual data.
Which makes this ia pretty smal ldatbase - which is good. At the end that means the problem runs down into the image storage, the database is not really that big.
On a small scale, this is no problem ,but I'm interested in what everyone thinks is the
best approach for uploading/retrieving such a large number of images/records per day?
I am not sure you are on a large scale yet. Problems will be around:
Number of files. You need to split them into multiple folders and best have the concept of buckets in the database so you can split them into multiple buckets each being their own server(s) - good for long term maintenance.
Backup / restore is a problem, but a lot less when you use (a) tapes and (b) buckets as said above - the chance of a full problem is tiny. ALso "3-4 copies on separate machines" may work well enough.
Except the bucket problem - i.e. yo can not put all those files into a simple folder, that iwll be seirously unwieldely - you are totally fine. This is not exactly super big. Keep the web level stateless so you can scale it, same on the storage backend, then use the database to tie it all together and make sure you do FREQUENT database backups (like all 15 minutes).

One of possible way is upload from client directly to Amazon S3. It will scale and receive any amount of files thrown at it. After upload to S3 is complete, save a link to S3 object along with useful meta to your DB. In this setup you will avoid file upload bottleneck and only have to be able to save ~240,000 records per day to your DB which should not be a problem.
If you want to build service that adds a value and save some (huge amount actually) time on file uploads, consider using existing 3rd party solutions that are built to solve this particular issue. For example - Uploadcare and some of its competitors.

Related

Scale database that receives streaming data with small resources

My use case is the following: I run about 60 websockets from 7 data sources in parallel that record stock tickers (so time-series data). Currently, I'm writing the data into a mongodb that is hosted on a Google Cloud VM such that every data source has its own collection and all collections are hosted inside the same database.
However, the database has grown to 0.6 GB and ~ 10 million rows after only five days of data. I'm pretty new to such questions, but I have a feeling that this is not a viable long-term solution. I will never need all of the data at once, but I need all of the data in order to query by date / currency. However, as I understood those queries might become impossible once the dataset is bigger than my RAM, is that true?
Moreover, this is a research project, but unfortunately I'm currently not able to use a university cluster, therefore I'm hosting the data on a private VM. However, this is subject to a budget constraint, and highly performant machines quickly become very expensive. That's why I'm questioning my design choice. Currently, I'm thinking of either switching to another kind of database, but fear that I'm running into the same issues again, or exporting the database once per week / month / whatever to CSV and wiping out. This would be quite a hastle though and I'm also scared of losing data.
So my question is, how can I design this database such that I can subset the data per one of the keys (either datetime or ticker_id) even when the database grows larger than my machine's RAM? Diskspace is not an issue.
On top of what Alex Blex already commented about storage and performance.
Query response time,in 5 days you have close to 10M rows, will worsen as data set grows. You can look at sharding to break the table down to reasonable chunks and still have acees to all data for query purpose.

Continuously updated database shared between multiple AWS EC2 instances

For a small personal project, I've been scraping some data every 5 minutes and saving it in a SQL database. So far I've been using a tiny EC2 AWS instance in combination with a 100GB EBS storage. This has been working great for the scraping, but is becoming unusable for analysing the resulting data, as the EC2 instance doesn't have enough memory.
The data analysis only happens irregularly, so it would feel a waste to pay 24/7 to have a bigger EC2 instance, so I'm looking for something more flexible. From reading around I've learned:
You can't connect EBS to two EC2 instances at the same time, so spinning up a second temporary big instance whenever analysis needed isn't an option.
AWS EFS seems a solution, but is quite a lot more expensive and considering my limited knowledge, I'm not a 100% sure this is the ideal solution.
The serverless options like Amazon Athena look great, but this is based on S3 which is a no-go for data that needs continuous updating (?).
I assume this is quite a common usecase for AWS, so I'm hoping to try to get some pointers in the right direction. Are there options I'm overlooking that fit my problem? Is EFS the right way to go?
Thanks!
Answers by previous users are great. Let's break them down in options. It sounds to me that your initial stack is a Custom SQL Database you installed in EC2.
Option 1 - RDS Read Replicas
Move your DB to RDS, this would give you a lot of goodies, but the main one we are looking for is Read Replicas if your reading/s grows you can create additional read replicas and put them behind a load balancer. This setup is the lowest hanging fruit without too many code changes.
Option 2 - EFS to Share Data between EC2 Instances
Using EFS is not straightforward, to no fault of EFS. Some databases save unique IDs to the filesystem, meaning you can't share the hard drive. EFS is a service and will add some lag to every read/write operation. Depending on how your installed Database distribution it might not even be possible.
Option 3 - Athena and S3
Having the workers save to S3 instead of SQL is also doable, but it means rewriting your web scraping tool. You can call S3 -> PutObject on the same key multiple times, and it will overwrite the previous object. Then you would need to rewrite your analytics tool to query S3. This option is excellent, and it's likely the cheapest in 'operation cost,' but it means that you have to be acquainted with S3, and more importantly, Athena. You would also need to figure out how you will save new data and the best file format for your application. You can start with regular JSON or CSV blobs and then later move to Apache Parquet for lower cost. (For more info on how that statement means savings see here: https://aws.amazon.com/athena/pricing/)
Option 4 - RedShift
RedShift is for BigData, I would wait until querying regular SQL is a problem (multiple seconds per query), and then I would start looking into it. Sure it would allow you query very for cheap, but you would probably have to set up a Pipeline that listens to SQL (or is triggered by it) and then updates RedShift. Reason is because RedShift scales depending on your querying needs, and you can spin up multiple machines easily to make querying faster.
As far as I can see S3 and Athena is good option for this. I am not sure about your concern NOT to use S3, but once you can save scraped data in S3 and you can analyse them with Athena (Pay Per Query model).
Alternatively, you can use RedShift to save data and analyse which has on demand service similar to ec2 on demand pricing model.
Also, you may use Kenisis Firehose which can be used to analyse data real time as and when you ingest them.
Your scraping workers should store data in Amazon S3. That way, worker instances can be scaled (and even turned off) without having to worry about data storage. Keep process data (eg what has been scraped, where to scrape next) in a database such as DynamoDB.
When you need to query the data saved to Amazon S3, Amazon Athena is ideal if it is stored in a readable format (CSV, ORC, etc).
However, if you need to read unstructured data, your application can access the files directly S3 by either downloading and using them, or reading them as streams. For this type of processing, you could launch a large EC2 instance with plenty of resources, then turn it off when not being used. Better yet, launch it as a Spot instance to save money. (It means your system will need to cope with potentially being stopped mid-way.)

Partition data for AWS Athena results in a lot of small files in S3

I have a large dataset (>40G) which I want to store in S3 and then use Athena for query.
As suggested by this blog post, I could store my data in the following hierarchical directory structure to enable usingMSCK REPAIR to automatically add partitions while creating table from my dataset.
s3://yourBucket/pathToTable/<PARTITION_COLUMN_NAME>=<VALUE>/<PARTITION_COLUMN_NAME>=<VALUE>/
However, this requires me to split my dataset into many smaller data files and each will be stored under a nested folder depending on the partition keys.
Although using partition could reduce amount of data to be scanned by Athena and therefore speed up a query, would managing large amount of small files cause performance issue for S3? Is there a tradeoff here I need to consider?
Yes, you may experience an important decrease of efficiency with small files and lots of partitions.
Here there is a good explanation and suggestion on file sizes and number of partitions, which should be larger than 128 MB to compensate the overhead.
Also, I performed some experiments in a very small dataset (1 GB), partitioning my data by minute, hour and day. The scanned data decreases when you make the partitions smaller, but the time spent on the query will increase a lot (40 times slower in some experiments).
I will try to get into it without veering too much into the realm of opinion.
For the use cases which I have used Athena, 40 GB is actually a very small dataset by the standards of what the underlying technology (Presto) is designed to handle. According to the Presto web page, Facebook uses the underlying technology to query their 300 PB data warehouse. I routinely use it on datasets between 500 GB and 1 TB in size.
Considering the underlying S3 technology, S3 was used to host Dropbox and Netflix, so I doubt most enterprises could come anywhere near taxing the storage infrastructure. Where you may have heard about performance issues and S3 relates to websites storing multiple, small, pieces of static content on many files scattered across S3. In this case, a delay in retrieving one of these small pieces of content might affect user experience on the larger site.
Related Reading:
Presto

What's the best way to send pictures to a browser when they have to be stored as blobs in a database?

I have an existing database containing some pictures in blob fields. For a web application I have to display them.
What's the best way to do that, considering stress on the server and maintenance and coding efforts.
I can think of the following:
"Cache" the blobs to external files and send the files to the browser.
Read them from directly the database every time it's requested.
Some additionals facts:
I cannot change the database and get rid of the blobs alltogether and only save file references in the database (Like in the good ol' Access days), because the database is used by another application which actually requires the blobs.
The images change rarely, i.e. if an image is in the database it mostly stays that way forever.
There'll be many read accesses to the pictures, 10-100 pictures per view will be displayed. (Depending on the user's settings)
The pictures are relativly small, < 500 KB
I would suggest a combination of the two ideas of yours: The first time the item is requested read it from the database. But afterwards make sure they are cached by something like Squid so you don't have to retrieve the every time they are requested.
one improtant thing is to use proper HTTP cache control ... that is, setting expiration dates properly, responding to HEAD requests correctly (not all plattforms/webservers allow that) ...
caching thos blobs to the file system makes sense to me ... even more so, if the DB is running on another server ... but even if not, i think a simple file access is a lot cheaper, than piping the data through a local socket ... if you did cache the DB to the file system, you could most probably configure any webserver to do a good cache control for you ... if it's not the case already, you should maybe request, that there is a field to indicate the last update of the image, to make your cache more efficient ...
greetz
back2dos

Advice on using a web server as a cache

I'd like advice on the following design. Is it reasonable? Is it stupid/insane?
Requirements:
We have some distributed calculations that work on chunks of data that are sometimes up to 50Mb in size.
Because the calculations take a long time, we like to parallelize the calculations on a small grid (around 20 nodes)
We "produce" around 10000 of these "chunks" of binary data each day - and want to keep them around for up to a year... Most of the items aren't 50Mb in size though, so the total daily space requirement is more around 5Gb... But we'd like to keep stuff around for as long as possible, (a year or more)... But hey, you can get 2TB hard disks nowadays.
Although we'd like to keep the data around, this is essentially a "cache". It's not the end of the world if we lose data - it just has to get recalculated, which just takes some time (an hour or two).
We need to be able to efficiently get a list of all "chunks" that were produced on a particular day.
We often need to, from a support point of view, delete all chunks created on a particular day or remove all chunks created within the last hour.
We're a Windows shop - we can't easily switch to Linux/some other OS.
We use SQLServer for existing database requirements.
However, it's a large and reasonably bureaucratic company that has some policies that limit our options: for example, conventional database space using SQLServer is charged internally at extremely expensive prices. Allocating 2 terabytes of SQL Server space is prohibitively expensive. This is mainly because our SQLServer instances are backed up, archived for 7 years, etc. etc. But we don't need this "gold-plated" functionality because we can just recreate the stuff if it goes missing. At heart, it's just a cache, that can be recreated on demand.
Running our own SQLServer instance on a machine that we maintain is not allowed (all SQLServer instances must be managed by a separate group).
We do have fairly small transactional requirement: if a process that was producing a chunk dies halfway through, we'd like to be able to detect such "failed" transactions.
I'm thinking of the following solution, mainly because it seems like it would be very simple to implement:
We run a web server on top of a windows filesystem (NTFS)
Clients "save" and "load" files by using HTTP requests, and when processes need to send blobs to each other, they just pass the URLs.
Filenames are allocated using GUIDS - but have a directory for each date. So all of the files created on 12th November 2010 would go in a directory called "20101112" or something like that. This way, by getting a "directory" for a date we can find all of the files produced for that date using normal file copy operations.
Indexing is done by a traditional SQL Server table, with a "URL" column instead of a "varbinary(max)" column.
To preserve the transactional requirement, a process that is creating a blob only inserts the corresponding "index" row into the SQL Server table after it has successfully finished uploading the file to the web server. So if it fails or crashes halfway, such a file "doesn't exist" yet because the corresponding row used to find it does not exist in the SQL server table(s).
I like the fact that the large chunks of data can be produced and consumed over a TCP socket.
In summary, we implement "blobs" on top of SQL Server much the same way that they are implemented internally - but in a way that does not use very much actual space on an actual SQL server instance.
So my questions are:
Does this sound reasonable. Is it insane?
How well do you think that would work on top of a typical windows NT filesystem? - (5000 files per "dated" directory, several hundred directories, one for each day). There would eventually be many hundreds of thousands of files, (but not too many directly underneath any one particular directory). Would we start to have to worry about hard disk fragmentation etc?
What about if 20 processes are all, via the one web server, trying to write 20 different "chunks" at the same time - would that start thrashing the disk?
What web server would be the best to use? It needs to be rock solid, runs on windows, able to handle lots of concurrent users.
As you might have guessed, outside of the corporate limitations, I would probably set up a SQLServer instance and just have a table with a "varbinary(max)" column... But given that is not an option, how well do you think this would work?
This is all somewhat out of my usual scope so I freely admit I'm a bit of a Noob in this department. Maybe this is an appalling design... but it seems like it would be very simple to understand how it works, and to maintain and support it.
Your reasons behind the design are insane, but they're not yours :)
NTFS can handle what you're trying to do. This shouldn't be much of a problem. Yes, you might eventually have fragmentation problems if you run low on disk space, but make sure that you have copious amounts of space and you shouldn't have a problem. If you're a Windows shop, just use IIS.
I really don't think you will have much of a problem with this architecture. Just keep it simple like you're doing and things should be fine.

Resources