Load data from Firebase to Amazon Redshift

I have around 500 MB of data in Firebase and I want to move it to Amazon Redshift on a daily basis. What is the best way to do this?
Thanks in advance.

What is "the best way" depends on your criteria, and often highly subjective. But a few pointers may help you get started:
Don't download the entire data set with a single ref.once('value'). Loading that much data will take time, and all your regular users will be blocked while your read is being fulfilled.
Do consider using Firebase's private backups. These come out of a different data stream, so they will not interfere with your regular users. The downside is that you'll need a paid plan to be able to use this feature.
Do consider how you can make your backup process streaming instead of daily. Firebase is a real-time database, and it typically works best when you treat the data flow as real-time too.
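For the streaming option, here is a minimal sketch in Python, assuming the firebase-admin SDK: listen for changes on a node and append them as newline-delimited JSON, which a periodic job can then upload to S3 and load into Redshift with COPY. The database URL, node path, and file names are placeholders, not anything from the original question.

    # Stream Firebase changes into newline-delimited JSON for a later
    # S3 upload + Redshift COPY. All names below are placeholders.
    import json
    import firebase_admin
    from firebase_admin import credentials, db

    cred = credentials.Certificate("service-account.json")  # hypothetical key file
    firebase_admin.initialize_app(cred, {
        "databaseURL": "https://your-app.firebaseio.com"    # placeholder URL
    })

    out = open("changes.ndjson", "a", encoding="utf-8")

    def on_change(event):
        # Each event carries the path that changed and the new data; write it
        # as one JSON line, ready for Redshift's COPY ... FORMAT AS JSON.
        out.write(json.dumps({"path": event.path, "data": event.data}) + "\n")
        out.flush()

    # listen() returns a registration and fires on_change on a background
    # thread, so regular users are not blocked by a bulk read.
    registration = db.reference("/data").listen(on_change)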

Related

Cloud solutions (e.g. Snowflake) for server data filling up fast

Firstly, I'm new to development, and currently I have a problem with server data filling up rapidly. I'm looking at solutions such as watcher programs to help me detect when the server data is reaching its limit, but I wanted to know if cloud solutions could help in this regard. I also wanted to know whether companies such as Snowflake can help handle fast-growing data, how a developer would use them, and whether this approach would be too costly from an enterprise point of view.
I have tried to look up Snowflake's documentation, but I am unable to reach any conclusion as to whether it can help me. I could only find articles about storage, saying that data is stored compressed, but I wanted more clarity on this solution.
Snowflake stores data using cloud storage services (AWS S3, Google Cloud Storage, or Microsoft Azure), so under normal conditions you won't run out of storage (I have never heard of S3 being full in any region).
Check the pricing page to see if it will be costly for you (or not):
https://www.snowflake.com/pricing/

Snowflake Tasks and Streams - Complexity and Visualization

We're heading into a POC and need to determine whether Snowflake tasks and streams are useful for CDC and data transformation. I have read the Snowflake documentation, and the more I read, the more it seems like it will be a complex mess to handle. Thinking about thousands of tables and complex transformations, how will tasks and streams scale up? Consider a table that gets loaded from 5 other feeds - what will the process look like? On top of that, Snowflake doesn't offer any visualization to work with tasks.
Can those of you who have worked with Snowflake streams/tasks comment and share your opinion of using them? If you went with an alternative after trying them out, was it a commercial ETL tool or Databricks? If we're already using Qlik to bring data into AWS S3 (our data lake), would it make sense to use streams to ingest from the data lake into Snowflake?
TIA
This question seems too broad for the typical Stack Overflow process (so the community might choose to close it).
In the meantime, I'll reply here to one of the stated questions: "On top of that, snowflake doesn't offer any visualization to work with tasks"
There is a tool to visualize tasks, created by a Snowflake SE:
https://medium.com/snowflake/visualizing-task-hierarchies-and-dependencies-in-snowflake-snowsight-d28298d0f0ed
For the larger picture: Snowflake streams and tasks are basic building blocks for more complex solutions. As your use case grows more complex, you'll need to find ways to manage this complexity - either with your own tools, Snowflake's, or a third party's.
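To make those building blocks concrete, here is a minimal sketch using the snowflake-connector-python package. The table, warehouse, and column names are hypothetical; this only illustrates the stream-plus-task pattern, not a production design.

    # A stream records changes (CDC); a task periodically consumes them.
    import snowflake.connector

    conn = snowflake.connector.connect(
        user="...", password="...", account="...",  # fill in your account
        database="ETL_DB", schema="PUBLIC",
    )
    cur = conn.cursor()

    # The stream tracks rows that changed in raw_orders since the last read.
    cur.execute("CREATE OR REPLACE STREAM raw_orders_stream ON TABLE raw_orders")

    # The task wakes on a schedule, but only runs when the stream has data,
    # and consumes the pending changes in one transactional statement.
    cur.execute("""
        CREATE OR REPLACE TASK load_orders
          WAREHOUSE = etl_wh
          SCHEDULE = '5 MINUTE'
        WHEN SYSTEM$STREAM_HAS_DATA('RAW_ORDERS_STREAM')
        AS
          INSERT INTO orders
          SELECT id, amount FROM raw_orders_stream
          WHERE METADATA$ACTION = 'INSERT'
    """)
    cur.execute("ALTER TASK load_orders RESUME")  # tasks are created suspended

With a table fed by 5 feeds, you would typically point one stream at each feed table and either chain tasks (AFTER another task) or merge in a single downstream task; the complexity the question worries about is real, and it lives in how you organize these statements.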
Since you are running a POC: Make sure to ask your Snowflake sales contact. Engineers like Dave are ready and eager to find a solution that fits your needs.

What kind of connector to Snowflake that automatically uploads new data would you use for IoT data?

I am just starting to set up a project to keep track of some open home devices on an at-home network. I have a program that saves this data, and I am putting together a process to upload that data to Snowflake automatically. I would like to know what you would recommend so I can easily access the home device information from anywhere.
The two options I am considering are AWS's and Snowflake's auto-ingest options using the Snowpipe REST API, which I have tested with only a few devices.
I am considering this factor: which method can I set up to upload and select data quickly from a mobile app written in Python or Ruby, depending on the device?
Any advice or resources you can point me to on this?
Thank you!
Your question is a pretty open one, so more details from you might make this answer more detailed as well. In general, though, I would suggest that if your IoT data can be stored directly to blob storage (S3 in the case of AWS), you should leverage Snowflake's Snowpipe for continuous ingestion. Also look into Tasks and Streams to automate moving that data through whatever processes you'll set up once the data is in Snowflake.
A good reference for you:
https://docs.snowflake.net/manuals/user-guide/data-pipelines-intro.html
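As a rough illustration of the auto-ingest route (not necessarily the exact setup the asker tested): assuming a pipe already created with AUTO_INGEST = TRUE over an external stage on the bucket, the client only needs to land files in S3, for example with boto3. The bucket and key names are placeholders.

    # Land IoT readings in S3 as newline-delimited JSON; a Snowpipe defined
    # with AUTO_INGEST = TRUE loads each new file via the bucket's event
    # notifications. Assumes boto3 with credentials in the environment.
    import json
    import time
    import boto3

    s3 = boto3.client("s3")

    def flush_readings(readings):
        # One file per batch; the S3 object-created event triggers the pipe.
        key = "iot/readings-%d.ndjson" % int(time.time())
        body = "\n".join(json.dumps(r) for r in readings)
        s3.put_object(Bucket="my-iot-landing-bucket", Key=key, Body=body.encode())

    flush_readings([{"device": "thermostat-1", "temp_c": 21.5}])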

What database is good enough for a logging application?

I am writing a web application with Node.js that other applications can use to store logs and access them later through a web interface, or via an API the applications themselves use. It is similar to Graylog2, but schema-free.
I've already tried CouchDB, in which each document would be a log entry, but since I'm not really using revisions it seems to me I'm not using all of its features. Besides that, I think that if the logs exceed a certain size it would be pretty hard to manage them in CouchDB.
What I'm really looking for is a big array of logs that can be sorted, filtered, searched, and capped, with the latest events readily accessible. It should be schema-free, and writing to it should be non-blocking.
I'm considering Cassandra (which I'm not really familiar with) due to the points made here. MongoDB seems good here too, since Graylog2 uses MongoDB, and there are some good points about it here.
I've already seen this question, but I'm not satisfied with the answers.
Edit:
For some reasons I can't use Cassandra in production, so now I'm trying MongoDB.
One more reason to use MongoDB:
http://www.slideshare.net/WombatNation/logging-app-behavior-to-mongo-db
More edits:
It is similar to Graylog2, but the difference I want is that instead of having a single message field, the fields are defined by the client, which is why I want it to be schema-free. Because of that, I may need to query on the user-defined fields. We could build this on SQL, but querying on user-defined fields would be reinventing the wheel. The same goes for files.
Technically, what I'm looking for is rich statistical data in the end, easy debugging, and a lot of other things that we can't easily get out of plain log files.
Where should it be stored, and how should it be retrieved?
I guess it depends on how much data you are dealing with. If you have a huge amount of logs (terabytes or petabytes per day), then Apache Kafka, which is designed to allow data to be pulled by HDFS in parallel, is an interesting solution (it is still in the incubation stage). I believe that if you want to consume Kafka messages with MongoDB, you'd need to develop your own adapter to ingest them as a consumer of a particular Kafka topic. Although MongoDB data (e.g. shards and replicas) is distributed, ingesting each message may be a sequential process, so there may be a bottleneck or even race conditions depending on the rate and size of the message traffic. Kafka is optimized to pump and append data to HDFS nodes using message brokers fast; once it is in HDFS you can map/reduce to analyze your information in a variety of ways.
If MongoDB can handle the ingestion load, then it is an excellent, scalable, real-time solution for finding information, particularly documents. Otherwise, if you have more time to process data (i.e. batch processes that take hours and sometimes days), then Hadoop or some other map/reduce database is warranted. Finally, Kafka can distribute that load of messages and hook up that fire hose to a variety of consumers. Overall, these new technologies spread the load and huge amounts of data across cheap hardware, using software to manage failure and recovery with a very low probability of losing data.
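A minimal sketch of the adapter idea above, assuming the kafka-python and pymongo packages; the topic, server, and collection names are placeholders.

    # Drain a Kafka topic and write each message into MongoDB as one
    # schema-free log document.
    import json
    from kafka import KafkaConsumer
    from pymongo import MongoClient

    consumer = KafkaConsumer(
        "app-logs",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    logs = MongoClient("mongodb://localhost:27017")["logging"]["events"]

    for message in consumer:
        logs.insert_one(message.value)  # sequential; shard/batch if this lags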
Even with a small amount of data, MongoDB is a nice alternative to traditional relational database solutions, which require more overhead in developer resources to design, build, and maintain.
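Since the asker is now trying MongoDB and explicitly wants a capped, schema-free array of logs, it is worth noting that capped collections are built for exactly that. A small pymongo sketch, assuming a local server; the names and sizes are placeholders.

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["logging"]
    if "events" not in db.list_collection_names():
        # Fixed-size collection: the oldest entries are discarded automatically
        # once the 512 MB cap is hit, and insertion order is preserved.
        db.create_collection("events", capped=True, size=512 * 1024 * 1024)

    db.events.insert_one({"level": "info", "msg": "started", "user_field": 42})
    # Natural order is insertion order, so the latest events are cheap to read:
    for doc in db.events.find().sort("$natural", -1).limit(10):
        print(doc)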
General Approach
You have a lot of work ahead of you. Whichever database you use, you have many features which you must build on top of the DB foundation. You have done good research about all of your options. It sounds like you suspect that all have pros and cons but all are imperfect. Your suspicion is correct. At this point it is probably time to start writing code.
You could just choose one arbitrarily and start building your application. If your guess was correct that the pros and cons balance out and it's all about the same, then why not simply start building immediately? When you hit difficulty X on your database, remember that it gave you convenience Y and Z and that's just life.
You could also establish the fundamental core of your application and implement various prototypes on each of the databases. That might give you true insight to help discriminate between the databases for your specific application. For example, besides the interface, indexing, and querying questions, what about deployment? What about backups? What about maintenance and security? Maybe "wasting" time to build the same prototype on each platform will make the answer very clear for you.
Notes about CouchDB
I suppose CouchDB is "NoSQL" if you say so. Other things which are "no SQL" include bananas, poems, and cricket. It is not a very meaningful word. We have general-purpose languages and domain-specific languages; similarly CouchDB is a domain-specific database. It can save you time if you need the following features:
Built-in web API: clients may query directly
Incremental map-reduce: CouchDB runs the job once, but you can query repeatedly at no cost. Updates to the data set are immediately reflected in the map/reduce result without full re-processing
Easy to start small but expand to large clusters without changing application code.
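To illustrate the first point: the built-in web API means a client can query a map/reduce view over plain HTTP, with no driver in between. A sketch using Python's requests package; the database, design document, and view names are hypothetical.

    import requests

    # Query a (hypothetical) view for the ten most recent events.
    resp = requests.get(
        "http://localhost:5984/logs/_design/events/_view/by_time",
        params={"descending": "true", "limit": 10},
    )
    for row in resp.json()["rows"]:
        print(row["key"], row["value"])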
Have you considered Apache Kafka?
Kafka is a distributed messaging system developed at LinkedIn for collecting and delivering high volumes of log data with low latency. Our system incorporates ideas from existing log aggregators and messaging systems, and is suitable for both offline and online message consumption.

How important is a database in managing information?

I have been hired to help write an application that manages certain information for the end user. It is intended to manage a few megabytes of information, but also to manage scanned images at full resolution. Should this project use a database, and why or why not?
Any question "Should I use a certain tool?" comes down to asking exactly what you want to do. You should ask yourself - "Do I want to write my own storage for this data?"
Most web-based applications are written against a database because most databases provide many features "for free": you can have multiple web servers, you can use standard tools to edit, verify, and back up your data, and you can have a robust storage solution with transactions.
The database won't help you much in dealing with the image data itself, but anything that manages a bunch of images is going to have meta-data about the images that you'll be dealing with. Depending on the meta-data and what you want to do with it, a database can be quite helpful indeed with that.
And just because the database doesn't help you much with the image data, that doesn't mean you can't store the images in the database. You would store them in a BLOB column of a SQL database.
If the amount of data is small, or the application is installed on many client machines, you might not want the overhead of a database.
Is it intended to be installed on many users' machines? Adding the overhead of ensuring you can run whatever database engine you choose in a client-installed app is not optimal. Since the amount of data is small, I think XML would be adequate here. You could Base64-encode the images and store them as CDATA.
Will the application run on a server? If you have concurrent users, then databases have concepts for handling those scenarios (transactions), and that can be helpful. The scanned image data would be appropriate for a BLOB.
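A sketch of the client-install XML suggestion above, using only Python's standard library. Base64 output happens to be XML-safe, so plain element text works here; wrapping it in CDATA, as suggested, is equally valid. The file names are placeholders.

    import base64
    import xml.etree.ElementTree as ET

    root = ET.Element("records")
    rec = ET.SubElement(root, "record", name="scan-001")
    with open("scan-001.png", "rb") as f:
        # Base64 turns the binary image into text that can live inside XML.
        ET.SubElement(rec, "image").text = base64.b64encode(f.read()).decode("ascii")

    ET.ElementTree(root).write("records.xml", encoding="utf-8")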
You shouldn't store images in the database, as is the general consensus here.
The file system is just much better at storing images than your database is.
You should use a database to store meta information about those images, such as a title, description, etc, and just store a URL or path to the images.
When it comes to storing images in a database, I try to avoid it. In your case, from what I can gather from your question, there is a possibility of a substantial number of fairly large images, so I would probably strongly oppose it.
If this is a web application, I would use a database for quick searching and indexing of images using keywords and other parameters, with a column pointing to the location of each image in a filesystem - if possible with some kind of folder structure to help further decrease image load time.
If you need greater security because the directory is exposed (e.g. a network share) and the application is local, then you should probably bite the bullet and store the images in the database.
My gut reaction is "why not?" A database is going to provide a framework for storing information, with all of the input/output/optimization functions provided in a documented format. You can go with a server-side solution, or a local database such as SQLite or the local version of SQL Server. Either way you have a robust, documented data management framework.
This post should give you most of the opinions you need about storing images in the database. Do you also mean 'should I use a database for the other information?' or are you just asking about the images?
A database is meant to manage large volumes of data and is supposed to give you fast reads and writes in spite of that size. Put simply, databases manage scale for data - scale that you don't want to deal with yourself. If you have only a few users (hundreds?), you could just as easily manage the data on disk (say, as XML) and keep it in memory. The images should clearly not go into the database, so the question is how much data, or how many users, you are maintaining this database instance for.
If you want to have a structured way to store and retrieve information, a database is most definitely the way to go. It makes your application flexible and more powerful, and lets you focus on the actual application rather than incidentals like trying to write your own storage system.
For individual applications, SQLite is great. It fits right into an app as a single file; no need for a whole DBMS juggernaut.
There are a lot of factors to this. But, being a database weenie, I would err on the side of having a database. It just makes life easier when things change - and things will change.
Depending on the images, you might store them on the file system or actually BLOB them and put them in the database (not supported in all DBMSs). If the files are very small, I would BLOB them. If they are big, I would keep them on the file system and manage them yourself.
There are so many free or cheap DBMSs out there that there really is no excuse not to use one. I'm a SQL Server guy, but if your application is that simple, then the free version of MySQL should do the job. In fact, it has some pretty cool stuff in there.
Our CMS stores all of the check images we process. It uses a database for metadata and lets the file system handle the scanned images.
A simple database like SQLite sounds appropriate - it will let you store file metadata in a consistent, transactional way. Then store the path to each image in the database and let the file system do what it does best - manage files.
SQL Server 2008 has a new data type built for in-database files, but before that BLOB was the way to store files inside the database. On a small scale that would work too.
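A minimal sketch of that metadata-in-the-database, files-on-disk pattern, using Python's built-in sqlite3 module; the paths and columns are placeholders.

    import sqlite3

    conn = sqlite3.connect("images.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS images (
        id INTEGER PRIMARY KEY,
        title TEXT,
        description TEXT,
        path TEXT NOT NULL  -- where the scan lives on the file system
    )""")
    conn.execute(
        "INSERT INTO images (title, description, path) VALUES (?, ?, ?)",
        ("Invoice 42", "Full-resolution scan", "/data/scans/invoice-42.tiff"),
    )
    conn.commit()

    # The database answers the queries; the file system serves the bytes.
    for title, path in conn.execute("SELECT title, path FROM images"):
        print(title, "->", path)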
