Querying a large number (300K) of CSV files stored on S3

I am performing some scraping jobs on EC2 and plan to keep the data on S3 once it is fetched. The data will be:
300K individual .csv files
each CSV file has about 3,000 lines and 60 columns, mostly string data
each CSV file is about 3 MB in size
They are stored on AWS S3.
I will be analyzing these data in detail later. I should note that:
This is not for production purposes but for an academic research project;
We care more about query efficiency than cost;
We will probably not be querying the data constantly, perhaps a few hundred times over the next couple of months.
I imagine I will have to use some AWS service (Athena, or write the data to DynamoDB, or RDS?). I have no hands-on experience with any of these three services, so I am looking for advice.
Another thought: should I save the data as .parquet? I have read about its efficiency compared to other formats.
Thank you very much.

Without more information from you it is difficult to know what the right solution is, but if the data is already in S3, I'd try to use Athena first. If that doesn't do what you want or costs too much, I'd then look at RDS Aurora (MySQL or PostgreSQL) or Amazon DocumentDB.
If you are going to build a user-facing, high-performance app where you know the access patterns users will follow in a repeatable fashion, I'd look at DynamoDB.
First though, you really need to figure out what you want to achieve with this data. That should guide you to the correct solution.
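For concreteness, here is a rough sketch of the Athena route using boto3. Every name below (region, bucket, database, table, columns) is a placeholder, and the DDL assumes plain comma-separated files with a header row; adjust the SerDe and column list to match the real data.

    import time
    import boto3  # AWS SDK for Python

    athena = boto3.client("athena", region_name="us-east-1")

    def run_athena_query(sql: str):
        """Submit a query to Athena and poll until it finishes."""
        qid = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": "scrape_db"},
            ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
        )["QueryExecutionId"]
        while True:
            state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                break
            time.sleep(2)
        if state != "SUCCEEDED":
            raise RuntimeError(f"query {qid} ended in state {state}")
        return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]

    # One external table over the whole S3 prefix; Athena reads the 300K files in place.
    run_athena_query("""
        CREATE EXTERNAL TABLE IF NOT EXISTS scraped_pages (
            url string, fetched_at string, title string
        )
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
        LOCATION 's3://my-scrape-bucket/csv/'
        TBLPROPERTIES ('skip.header.line.count'='1')
    """)

    print(run_athena_query("SELECT count(*) FROM scraped_pages"))

On the Parquet question: converting the CSVs to Parquet (for example with a Glue job or pandas' to_parquet) and pointing the same kind of external table at the Parquet files typically makes Athena scan far less data per query, which lowers both cost and latency.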

Related

What is the best way to run a report in a notebook when connected via the Snowflake connector?

My last couple of questions have been about how to connect to Snowflake and add and read data with the Python connector in an IPython notebook. However, I am having trouble with the next step: creating a report from the data I want to visualize.
I would like to upload all of the data, store it, then analyze it, kind of like a homemade dashboard.
So what I have done so far is a small version:
Staged my data from a local file, and I will add new data each time I open the notebook
Then I will use the Python connector to pull any data from storage
Create visualizations with NumPy objects in the local notebook.
My data will start out very small, but over time I imagine I will have to move the computation to the cloud to minimize the memory used locally for the small dashboard.
My question is: my data comes from an API that returns JSON files; new data is no bigger than 75 MB a day across 8 columns, with two aggregate calls on the data done in the SQL query. If I run these visualizations monthly, is it better to aggregate the information in Snowflake or locally?
Put the raw data into Snowflake. Use tasks and procedures to aggregate it and store the result. Or better yet, don't do any aggregations except for when you want the data - let Snowflake do the aggregations in real-time off the raw data.
I think what you might be asking is whether you should ETL your data or ELT your data:
ETL: Extract, Transform, Load (in that order) - Extract data from your API. Transform it locally on your computer. Load it into Snowflake.
ELT: Extract, Load, Transform (in that order) - Extract data from your API. Load it into Snowflake. Transform it after it's in Snowflake.
Both ETL and ELT are valid, and many companies use both approaches with Snowflake interchangeably. But Snowflake was built to act as something of a data lake; the idea being, "Just throw all your data up here and then use our compute and storage resources to transform it quickly and easily."
Do a Google search on "Snowflake ELT" or "ELT vs ETL" for more information.
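As a concrete illustration of the ELT flow with the Python connector (connection parameters, stage, table, and JSON field names below are all made up for the example):

    import snowflake.connector  # pip install snowflake-connector-python

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="REPORTING_WH", database="RAW", schema="PUBLIC",
    )
    cur = conn.cursor()

    # Load: land the raw API output untouched in a VARIANT column (the "L" in ELT).
    cur.execute("CREATE TABLE IF NOT EXISTS raw_events (v VARIANT)")
    cur.execute("CREATE STAGE IF NOT EXISTS api_stage FILE_FORMAT = (TYPE = JSON)")
    cur.execute("PUT file:///tmp/api_dump.json @api_stage")
    cur.execute("COPY INTO raw_events FROM @api_stage FILE_FORMAT = (TYPE = JSON)")

    # Transform: reshape inside Snowflake, after the data has been loaded (the "T").
    cur.execute("""
        CREATE OR REPLACE TABLE daily_totals AS
        SELECT v:event_date::date   AS event_date,
               SUM(v:amount::float) AS total_amount
        FROM raw_events
        GROUP BY 1
    """)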
Here are some considerations either way off the top of my head:
Tools you're using: Some tools, like SSIS, were built with ETL in mind: transforming the data before you store it in your warehouse. That's not to say you can't ELT with them, but they weren't built with ELT in mind. More modern tools, like Fivetran or even Snowpipe, assume you're going to land all your data in Snowflake and then transform it once it's up there. I really like the ELT paradigm: just get your data into the cloud, then transform it quickly once it's there.
Size and growth of your data: If your data is growing, it becomes harder and harder to manage on local resources. It might not matter while your data is in gigabytes or millions of rows, but as you get into billions of rows or terabytes of data, the scalability of the cloud can't be matched. If you feel this might happen and you don't think putting it into the cloud is a premature optimization, I'd load your raw data into Snowflake and transform it after it's up there.
Compute and Storage Capacity: Maybe you have a massive amount of storage and compute at your fingertips. Maybe you have an on-prem cluster you can provision resources from at the drop of a hat. Most people don't have that.
Short-Term Compute and Storage Cost: Maybe you have some modest resources you can use today and you'd rather not pay Snowflake while your modest resources can do the job. Having said that, it sounds like the compute to transform this data will be pretty minimal, and you'll only be doing it once a day or once a month. If that's the case, the compute cost will be very minimal.
Data Security or Privacy: Maybe you need to anonymize data before moving it to the public cloud. If this is important to you, you should look into Snowflake's security features; but if you're in an organization where it's very difficult to get a security review and you need to move forward with something, transforming the data on-prem while waiting for the security review is a good alternative.
Data Structure: Do you have duplicates in your data? Do you need access to other data already in Snowflake to join on in order to perform your transformations? As you put more and more data into Snowflake, it makes sense to transform it after it's loaded; it is easier to join, query, and transform in the cloud, where all your other data already lives.
To the specific question: my data comes from an API that returns JSON files; new data is no bigger than 75 MB a day across 8 columns, with two aggregate calls on the data done in the SQL query. If I run these visualizations monthly, is it better to aggregate the information in Snowflake or locally?
I would flatten your data in Python or in Snowflake, depending on which you feel more comfortable using and how complex the data is. You could do everything against the raw JSON, although I would rarely design something that way myself (it's going to be the slowest to query).
As far as aggregating the data, I'd always do that in Snowflake. If you would like to slice and dice the data in various ways, you might design a data mart data model and have your dashboard simply aggregate data on the fly via queries. Snowflake should be pretty good at that, but for additional speed, pre-aggregating up to monthly granularity may be a good idea too.
You can probably also mature your process from being driven by a local Python script to something serverless and event-driven, like a Lambda with a scheduler.
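To make the "aggregate in Snowflake, visualize locally" suggestion concrete, here is a rough sketch that reuses the hypothetical raw_events VARIANT table from the earlier example; the JSON paths and column names are assumptions, and fetch_pandas_all requires the pandas extra of the connector.

    import snowflake.connector  # connection parameters below are placeholders

    conn = snowflake.connector.connect(account="my_account", user="my_user", password="...",
                                       warehouse="REPORTING_WH", database="RAW", schema="PUBLIC")
    cur = conn.cursor()

    # Flatten the nested JSON and aggregate by month inside Snowflake, so only the
    # small monthly summary travels to the notebook for plotting.
    cur.execute("""
        SELECT DATE_TRUNC('month', f.value:ts::timestamp) AS month,
               COUNT(*)                                   AS events,
               SUM(f.value:amount::float)                 AS total_amount
        FROM raw_events r,
             LATERAL FLATTEN(input => r.v:records) f
        GROUP BY 1
        ORDER BY 1
    """)
    monthly = cur.fetch_pandas_all()  # small pandas DataFrame, ready to plot locally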

How do I query heterogeneous JSON data in S3?

We have an Amazon S3 bucket that contains around a million JSON files, each one around 500KB compressed. These files are put there by AWS Kinesis Firehose, and a new one is written every 5 minutes. These files all describe similar events and so are logically all the same, and are all valid JSON, but have different structures/hierarchies. Also their format & line endings are inconsistent: some objects are on a single line, some on many lines, and sometimes the end of one object is on the same line as the start of another object (i.e., }{).
We need to parse/query/shred these objects and then import the results into our on-premise data warehouse SQL Server database.
Amazon Athena can't deal with the inconsistent spacing/structure. I thought of creating a Lambda function that would clean up the spacing, but that still leaves the problem of different structures. Since the files are laid down by Kinesis, which forces you to put the files in folders nested by year, month, day, and hour, we would have to create thousands of partitions every year. The limit to the number of partitions in Athena is not well known, but research suggests we would quickly exhaust this limit if we create one per hour.
I've looked at pumping the data into Redshift first and then pulling it down. Amazon Redshift external tables can deal with the spacing issues, but can't deal with nested JSON, which almost all these files have. COPY commands can deal with nested JSON, but require us to know the JSON structure beforehand, and don't allow us to access the filename, which we would need for a complete import (it's the only way we can get the date). In general, Redshift has the same problem as Athena: the inconsistent structure makes it difficult to define a schema.
I've looked into using tools like AWS Glue, but they just move data, and they can't move data into our on-premise server, so we have to find some sort of intermediary, which increases cost, latency, and maintenance overhead.
I've tried cutting out the middleman and using ZappySys' S3 JSON SSIS task to pull the files directly and aggregate them in an SSIS package, but it can't deal with the spacing issues or the inconsistent structure.
I can't be the first person to face this problem, but I just keep spinning my wheels.
Rumble is an open-source (Apache 2.0) engine that allows you to use the JSONiq query language to directly query JSON (specifically, JSON Lines files) stored on S3, without having to move it anywhere else or import it into any data store. Internally, it uses Spark and DataFrames.
It was successfully tested on collections of more than 20 billion objects (10+ TB), and it also works seamlessly if the data is nested and heterogeneous (missing fields, extra fields, different types in the same field, etc). It was also tested with Amazon EMR clusters.
Update: Rumble also works with Parquet, CSV, ROOT, AVRO, text, and SVM, and on HDFS, S3, and Azure.
I would suggest two types of solutions.
I believe MongoDB/DynamoDB/Cassandra are good at processing heterogeneous JSON structures. I am not sure about the inconsistency in your JSON, but as long as it is valid JSON it should be ingestible into one of these databases. Please provide a sample JSON if possible. These tools have their own advantages and disadvantages, and the data modelling for these NoSQL databases is entirely different from traditional SQL databases.
I am not sure why your Lambda is not able to do the cleanup. I assume you would trigger a Lambda when an S3 PUT happens in the bucket; it should be able to clean up the JSON unless there are complex processes involved (a minimal cleanup sketch follows this answer).
Unless the JSON is in a proper format, no tool will be able to process it perfectly. More than Athena or Spectrum, I believe MongoDB/DynamoDB/Cassandra would be the right fit for this use case.
It would be great if you could share the limitations you ran into when you created a lot of partitions.
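For what it's worth, here is a minimal sketch of the Lambda cleanup idea mentioned above: it splits objects that contain back-to-back JSON ("}{") into individual objects and rewrites them as JSON Lines. The bucket name is a placeholder, the code assumes the objects are uncompressed UTF-8, and error handling is omitted.

    import json
    import boto3

    s3 = boto3.client("s3")
    decoder = json.JSONDecoder()

    def split_concatenated_json(blob: str):
        """Yield each JSON object found in a string of back-to-back objects."""
        idx, n = 0, len(blob)
        while idx < n:
            while idx < n and blob[idx].isspace():  # raw_decode will not skip whitespace
                idx += 1
            if idx >= n:
                break
            obj, idx = decoder.raw_decode(blob, idx)
            yield obj

    def handler(event, context):
        # Triggered by the S3 PUT notification on the raw Firehose bucket.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
            cleaned = "\n".join(json.dumps(o) for o in split_concatenated_json(body))
            s3.put_object(Bucket="my-cleaned-bucket", Key=key, Body=cleaned.encode("utf-8"))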

How to handle very large data?

I'm about to start a new project, which is basically a reporting tool that needs a very large database.
The number of tables will not be large (<200); the majority of the data (80%) will be contained in 20 tables, and the data is almost entirely insert/read only (no updates).
The estimated amount of data in that one table is going to grow at 240,000 records per minute (roughly 126 billion records per year), and we need to keep at least 1 to 3 years of it to be able to produce various reports, which an administrator will view online.
I don't have first-hand experience with databases that large, so I'm asking those who do which DB is the best choice in this situation. I know that Oracle is the safe bet, but I am more interested in hearing from anyone with experience of alternatives like HadoopDB or Google's Bigtable.
Please guide me.
Thanks in advance.
Oracle is going to get very expensive to scale up enough. MySQL will be hard to scale. It's not their fault; an RDBMS is overkill for this.
Let me start with a dumb question: what are you doing with this data? "Various reports" could be a lot of things. If these reports can be generated in bulk, offline, then why not keep your data in flat files on a shared file system?
If it needs to be more online, then yes the popular wisdom from the past 2 years is to look at NoSQL databases like Mongo, Couch and Cassandra. They're simpler, faster creatures that scale easily and provide more random access to your data.
Doing analytics on NoSQL is all the rage this year. For example, I'd look at what Acunu is doing to embed analytics into their flavor of Cassandra: http://www.acunu.com/blogs/andy-twigg/acunu-analytics-preview/
You can also use Apache Solr and MongoDB. Both are used for handling big data in the NoSQL world, and they are very fast at inserting and retrieving data. So you could use Apache Solr or a MongoDB database.
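If MongoDB is on the table, a rough sketch of that approach with pymongo might look like this (database, collection, and field names are made up; a real ingest path would batch far more aggressively and add indexes):

    from datetime import datetime, timezone
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    events = client.reporting.events

    # High-volume inserts: batch them rather than writing one document at a time.
    batch = [{"ts": datetime.now(timezone.utc), "sensor": i % 50, "value": i * 0.1}
             for i in range(10_000)]
    events.insert_many(batch, ordered=False)

    # Reports: aggregate server-side instead of pulling raw rows to the client.
    pipeline = [
        {"$group": {"_id": "$sensor", "avg_value": {"$avg": "$value"}, "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]
    for row in events.aggregate(pipeline):
        print(row)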

Storing Images: DB or File System

I have read some posts on this subject, but I still don't understand what the best solution is in my case.
I'm starting to write a new web app, and the backend is going to serve about 1-10 million images (average size 200-500 KB per image).
My site will serve content and images to 100-1000 users at the same time.
I'd also like to keep provider costs as low as possible (but this is a secondary requirement).
I'm thinking that file system space is less expensive than the equivalent amount of DB storage.
Personally I like the idea of having all my images in the DB, but any suggestion will be really appreciated :)
Do you think that in my case the DB approach is the right choice?
Putting all of those images in your database will make it very, very large. This means your DB engine will be busy caching all those images (a task it's not really designed for) when it could be caching hot application data instead.
Leave the file caching up to the OS and/or your reverse proxy - they'll be better at it.
Some other reasons to store images on the file system:
Image servers can run even when the database is busy or down.
File systems are made to store files and are quite efficient at it.
Dumping data in your database means slower backups and other operations.
No server-side code needed to serve up an image, just plain old IIS/Apache.
You can scale up faster with dirt-cheap web servers, or potentially to a CDN.
You can perform related work (generating thumbnails, etc.) without involving the database.
Your database server can keep more of the "real" table data in memory, which is where you get your database speed for queries. If it uses its precious memory to cache image files, that hardly buys you anything speed-wise compared with keeping more of the photo index in memory.
Most large sites use the filesystem.
See Store pictures as files or in the database for a web app?
When dealing with binary objects, follow a document-centric approach in your architecture and don't store documents like PDFs and images in the database; you will eventually have to refactor them out once you start seeing all kinds of performance issues with your database. Just store the file on the file system and keep its path in a table in your database. There is also a physical limit on the size of the data type you would use to serialize and store it in the database. Just store it on the file system and access it from there.
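A minimal sketch of that pattern, using SQLite purely for illustration (paths, table, and columns are invented; a real app would also handle name collisions, transactions, and orphaned files):

    import shutil
    import sqlite3
    from pathlib import Path
    from uuid import uuid4

    IMAGE_ROOT = Path("/var/app/images")

    db = sqlite3.connect("app.db")
    db.execute("""CREATE TABLE IF NOT EXISTS images (
                      id INTEGER PRIMARY KEY,
                      original_name TEXT,
                      path TEXT NOT NULL)""")

    def save_image(src: str) -> int:
        """Copy the image to the file system and record only its path in the DB."""
        dest = IMAGE_ROOT / f"{uuid4().hex}{Path(src).suffix}"
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dest)
        cur = db.execute("INSERT INTO images (original_name, path) VALUES (?, ?)",
                         (Path(src).name, str(dest)))
        db.commit()
        return cur.lastrowid

    def image_path(image_id: int) -> str:
        """Look up the stored path; the web server (IIS/Apache/nginx) serves the bytes."""
        (path,) = db.execute("SELECT path FROM images WHERE id = ?", (image_id,)).fetchone()
        return path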
Your first sentence says that you've read some posts on the subject, so I won't bother putting in links to articles that cover this. In my experience, and based on what you've posted as far as the number of images and sizes of the images, you're going to pay dearly in DB performance if you store them in the DB. I'd store them on the file system.
What database are you using? MS SQL Server 2008 provides FILESTREAM storage, which allows storage of and efficient access to BLOB data using a combination of SQL Server 2008 and the NTFS file system. The documentation covers choices for BLOB storage, configuring Windows and SQL Server to use FILESTREAM data, considerations for combining FILESTREAM with other features, and implementation details such as partitioning and performance; see it for details.
We use FileNet, a server optimized for imaging. It's very expensive. A cheaper solution is to use a file server.
Please don't consider storing large files on a database server.
As others have mentioned, store references to the large files in the database.

Do you think it's a good idea to save billions of images in a database?

Recently my colleagues and I have been discussing how to build a huge storage system that could store billions of pictures that can be searched and downloaded quickly.
Something like Flickr, but not for an online gallery, which means most of these pictures will never be downloaded.
My colleagues suggest that we should save all these files directly in a database. I really feel that's not a good idea; I think a database is not designed for storing a huge number of binary files, but I don't have a very strong argument for why it's not a good idea.
What do you think?
If you are really talking about billions of images, I would store them in the file system, because retrieval will be faster than serializing and de-serializing the images.
The answers above appear to assume the database is an RDBMS. If your database is a document-oriented database with support for binary documents of the size you expect, then it may be perfectly wise to store them in the database.
It's not a good idea. The point of a database is that you can quickly resolve complex queries to retrieve textual data. While binary data can be stored in a database, it can slow transactions. This is especially true when the database is on a separate server from the running application. In the database, store meta-data and the location/filename of the images. Images themselves should be on static server(s).
