Fast JSON/flat data server for mostly reads - database

I ask this question apprehensively because it is not a pure programming question, and because I am seeking a (well informed) suggestion.
I have an analytic front end, written in JavaScript, with lots of aggregations and charting happening in the browser (dimple.js, even stats.js, ...)
I want to feed this application with JSON or delimited data from some high-performance data structure server. No writes except for loading. Data will be maybe 1-5 GB in size, and there could be dozens, if not hundreds, of concurrent readers, but only during peak hours. This data is produced in Apache Hive and loaded from there.
Now my question is about selecting a database/datastore server for this.
(I have a pretty good command of the SQL/NoSQL landscape, so I am really seeking advice for these very specific requirements.)
Requirements and specifications for this datastore are:
Most, if not all, queries will be reads, initiated by the web-based JS front end.
Data can be served as JSON or flat tabular CSV, PSV, or TSV.
Total data size on this store will be 1-5 GB, with possible future growth, but nothing imminent (6-12 months)
Data on this datastore will be refreshed/loaded daily, probably never in real time.
Data will/can be accessed via some RESTful web services, Socket IO, etc.
The faster the read access, the better. Speed matters.
There has to be a security/authentication method for sensitive data protection.
It needs to be reasonably stable, not bleeding edge that requires constant patching.
Liberal, open source license.
So far, my initial candidates for examination were Postgres (optimized for large cache) and Mongo. Just because I know them pretty well.
I am also familiar with Redis, Couch.
I did not benchmark them myself, but I have seen benchmarks where Postgres was faster than Mongo (while also offering a JSON format). Mongo is web-friendlier.
I am considering in-memory stores with persistence such as Redis, Aerospike, Memcached. Redis 3.0 is my favorite so far.
So I am asking here whether you have any recommendations for a production-quality datastore that fits these requirements.
Any civil and informed suggestions are welcome.

What exactly does your data look like? Since you mentioned CSV-like exports, I'm assuming this is tabular, structured data that would usually be found in a relational database?
Some options:
1. Don't use a database
Given the small dataset, just serve it out of memory. You can probably spend a few hours writing a quick app with any decent web framework that loads the data into memory (for example, from a flat file) and then searches and returns it in whatever format and way you need.
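As a rough illustration of this approach, here is a minimal sketch using only the Python standard library. It assumes a hypothetical data.csv export from Hive and supports a simple equality-filter query string; adjust to your own schema.

```python
# Minimal sketch of option 1: load a flat file into memory once and serve it as JSON.
# "data.csv" and the filter behaviour are placeholders, not a production server.
import csv
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

with open("data.csv", newline="") as f:      # load once at startup
    ROWS = list(csv.DictReader(f))           # a few GB fits in RAM on a big box

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        qs = parse_qs(urlparse(self.path).query)
        rows = ROWS
        # optional equality filter, e.g. /?region=EMEA
        for key, values in qs.items():
            rows = [r for r in rows if r.get(key) == values[0]]
        body = json.dumps(rows).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), Handler).serve_forever()
```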
2. Use an embedded database
You can also try an embedded database like SQLite which gives you in-memory performance but with a reliable SQL interface. Since it's just a single-file database, you can have another process generate a new DB file, then swap it out when you update the data for the app.
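For example, a minimal sketch of that rebuild-and-swap pattern with the built-in sqlite3 module (the file names and the metrics table are placeholders):

```python
# Rebuild the SQLite file offline from the daily export, then swap it in atomically.
import os
import sqlite3

def rebuild(tmp_path, rows):
    """Write the daily Hive export into a fresh database file."""
    con = sqlite3.connect(tmp_path)
    con.execute("CREATE TABLE metrics (region TEXT, day TEXT, value REAL)")
    con.executemany("INSERT INTO metrics VALUES (?, ?, ?)", rows)
    con.execute("CREATE INDEX idx_metrics_region ON metrics(region)")
    con.commit()
    con.close()

def swap_in(tmp_path, live_path="analytics.db"):
    os.replace(tmp_path, live_path)   # atomic rename; readers reopen the new file

def query(live_path="analytics.db"):
    # open read-only so the web tier can never write to the live file
    con = sqlite3.connect(f"file:{live_path}?mode=ro", uri=True)
    return con.execute(
        "SELECT region, AVG(value) FROM metrics GROUP BY region"
    ).fetchall()
```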
3. Use a full database system
Use a regular relational database. MySQL, PostgreSQL, and SQL Server (Express Edition) are all free, can handle that dataset easily, and will just cache it all in RAM. If it's all read queries, I don't see any issues with a few hundred concurrent users. You can also use MemSQL Community Edition if you need more performance. They all support security, are very reliable, and you can't beat SQL for data access.
Use a key/value system if your data isn't relational or tabular and is a better fit as simple values or documents. However, remember that KV stores aren't great at scans or aggregations and don't have joins. Memcached is just a distributed cache; don't use it for real data. Redis and Aerospike are both great key/value systems, with Redis giving you lots of nice data structures to use. Mongo is good for data flexibility. Elasticsearch is a good option for advanced search-like queries.
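To make the key/value route concrete, here is a minimal sketch with Redis. It assumes the redis-py 3.x client and a local Redis server; the key names and payload are invented for illustration.

```python
# Store pre-aggregated results as JSON blobs, with a sorted set as a simple index.
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# write path (daily load): one JSON blob per region/day under a predictable key
payload = {"region": "EMEA", "day": "2015-06-01", "value": 42.0}
r.set("metrics:EMEA:2015-06-01", json.dumps(payload))
r.zadd("metrics:EMEA:index", {"2015-06-01": 20150601})   # score = sortable date

# read path: fetch the latest 7 days for a region
latest_days = r.zrevrange("metrics:EMEA:index", 0, 6)
docs = [json.loads(r.get(f"metrics:EMEA:{d.decode()}")) for d in latest_days]
print(docs)
```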
If you go with one of these database systems, though, you will still need a thin app layer somewhere to interface with the database and then return the data in the proper format for your frontend.
If you want to skip that part, then just use CouchDB or Riak instead. Both are document-oriented and have a native HTTP interface with JSON responses, so you can consume them directly from your frontend, although this might cause security issues since anyone can see the JavaScript calls.

What are the approaches to the Big-Data problems? [closed]

Let us consider the following problem. We have a system containing a huge amount of data (Big Data), so in fact we have a database. As the first requirement, we want to be able to write to and read from the database quickly. We also want to have a web interface to the database (so that different clients can write to and read from the database remotely).
But the system we want should be more than a database. First, we want to be able to run different data-analysis algorithms on the data to find regularities, correlations, abnormalities, and so on (as before, we care a lot about performance). Second, we want to attach machine-learning machinery to the database, which means we want to run machine-learning algorithms on the data to learn "relations" present in the data and, based on that, predict the values of entries that are not yet in the database.
Finally, we want a nice click-based interface that visualizes the data, so that users can see it in the form of nice graphics, graphs, and other interactive visualization objects.
What are the standard and widely recognized approaches to the problem described above? Which programming languages should be used to deal with it?
I will approach your question like this: I assume you are already firmly interested in using a big-data database and have a real need for one, so instead of repeating textbooks upon textbooks of information about them, I will highlight some options that meet your five requirements, mainly Cassandra and Hadoop.
1) The first requirement: we want to be able to write to and read from the database quickly.
You'll want to explore NoSQL databases, which are often used for storing "unstructured" Big Data. Some open-source options include Hadoop and Cassandra. Regarding Cassandra,
Facebook needed something fast and cheap to handle the billions of status updates, so it started this project and eventually moved it to Apache where it's found plenty of support in many communities (ref).
References:
Big Data and NoSQL: Five Key Insights
NoSQL standouts: New databases for new applications
Big data woes: Which database should I use?
Cassandra and Spark: A match made in big data heaven
List of NoSQL databases (currently 150)
2) We also want to have a web interface to the database
See the list of 150 NoSQL databases to see all the various interfaces available, including web interfaces.
Cassandra has a cluster admin, a web-based environment, a web-admin based on AngularJS, and even GUI clients.
References:
150 NoSQL databases
Cassandra Web
Cassandra Cluster Admin
3) We want to be able to run different data-analysis algorithms on the data
Cassandra, Hive, and Hadoop are well-suited for data analytics. For example, eBay uses Cassandra for managing time-series data.
References:
Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack
Cassandra at eBay - Cassandra Summit
An Introduction to Real-Time Analytics with Cassandra and Hadoop
4) We want to run machine learning algorithms on the data to be able to learn "relations"
Again, Cassandra and Hadoop are well-suited. Regarding Apache Spark + Cassandra,
Spark was developed in 2009 at UC Berkeley AMPLab, open sourced in 2010, and became a top-level Apache project in February 2014. It has since become one of the largest open source communities in big data, with over 200 contributors in 50+ organizations (ref).
Regarding Hadoop,
With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets.
References:
Getting Started with Apache Spark and Cassandra
What is Apache Mahout?
Data Science with Apache Hadoop: Predicting Airline Delays
5) Finally, we want to have a nice click-based interface that visualizes the data.
Visualization tools (paid) that work with the above databases include Pentaho, JasperReports, and Datameer Analytics Solutions. Alternatively, there are several open-source interactive visualization tools such as D3 and Dygraphs (for big data sets).
References:
Data Science Central - Resources
Big Data Visualization
Start looking at:
What kind of data do you want to store in the database?
What kind of relationships exist between the data?
How will this data be accessed? (For instance, do you need to access a certain set of data quite often?)
Are they documents? Text? Something else?
Once you have answers to all of those questions, you can start looking at which NoSQL database would give you the best results for your needs.
You can choose between four different types: key-value stores, document stores, column-family stores, and graph databases.
Which one will be the best fit can be determined by answering the questions above.
There is a ready-to-use stack that may really help you get started with your project:
Elasticsearch would be your database (it has a REST API that you can use to write documents to it and to run queries and analysis); see the sketch after this list.
Kibana is a visualization tool; it allows you to explore and visualize your data. It is quite powerful and will be more than enough for most of your needs.
Logstash can centralize the data processing and help you process and save it in Elasticsearch; it already supports quite a few sources of logs and events, and you can write your own plugins as well.
Some people refer to them as the ELK stack.
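As a minimal sketch of the Elasticsearch part, using plain REST calls via the requests library: the index name "events" and the field names are made up, and the exact endpoints and mapping details differ a little between Elasticsearch versions.

```python
# Index one schema-free document, then run a small aggregation over the index.
import json
import requests

ES = "http://localhost:9200"

# index one event document; on older Elasticsearch versions the document type
# goes in the URL instead of "_doc"
doc = {"user": "alice", "action": "click", "ts": "2015-06-01T12:00:00"}
requests.post(f"{ES}/events/_doc", json=doc)

# simple aggregation: count events per action (".keyword" assumes the default
# dynamic mapping on a recent Elasticsearch)
query = {"size": 0,
         "aggs": {"per_action": {"terms": {"field": "action.keyword"}}}}
resp = requests.post(f"{ES}/events/_search", json=query)
print(json.dumps(resp.json().get("aggregations", {}), indent=2))
```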
I don't believe you should worry about the programming language at this point. Try to select the tools first; sometimes the choices are limited by the tools you want to use, and you can still use a mixture of languages, making that effort only if/when it makes sense.
A common way to solve such requirements is to use Amazon Redshift and the ecosystem around it.
Redshift is a petabyte-scale data warehouse (it can also start at gigabyte scale) that exposes an ANSI SQL interface. Since you can put as much data as you like into the DWH and run any type of SQL you wish against it, this is a good infrastructure on which to build almost any agile, big-data analytics system.
Redshift has many analytics functions, mainly window functions. You can calculate averages and medians, but also percentiles, dense rank, etc.
You can connect almost any SQL client you want using JDBC/ODBC drivers, whether from R, RStudio, psql, or even MS Excel.
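For illustration, a minimal sketch of running such analytic functions from Python via psycopg2 (Redshift speaks the PostgreSQL wire protocol); the cluster endpoint, credentials, and the sales table are placeholders.

```python
import psycopg2

# connection details are placeholders for your own cluster
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="analyst", password="...",
)

# window functions: a per-region average next to an overall dense rank;
# MEDIAN() and PERCENTILE_CONT() are available in the same spirit
sql = """
    SELECT region,
           amount,
           AVG(amount)  OVER (PARTITION BY region)  AS region_avg,
           DENSE_RANK() OVER (ORDER BY amount DESC) AS overall_rank
    FROM sales;
"""
with conn.cursor() as cur:
    cur.execute(sql)
    for row in cur.fetchmany(10):
        print(row)
```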
AWS recently added a new service for machine learning. Amazon ML integrates nicely with Redshift. You can build predictive models based on data from Redshift by simply providing an SQL query that pulls the data needed to train the model, and Amazon ML will build a model that you can use for both batch and real-time predictions. You can check this post from the AWS Big Data blog that shows such a scenario: http://blogs.aws.amazon.com/bigdata/post/TxGVITXN9DT5V6/Building-a-Binary-Classification-Model-with-Amazon-Machine-Learning-and-Amazon-R
Regarding visualization, there are plenty of great visualization tools that you can connect to Redshift. The most common ones are Tableau, QlikView, Looker, and YellowFin, especially if you don't have an existing DWH; if you do, you might want to keep using tools like JasperSoft or Oracle BI. Here is a link to a list of partners that provide free trials of their visualization tools on top of Redshift: http://aws.amazon.com/redshift/partners/
BTW, Redshift also provides a two-month free trial, so you can quickly test whether it fits your needs: http://aws.amazon.com/redshift/free-trial/
Big Data is a tough problem primarily because it isn't one single problem. First, if your original database is a normal OLTP database handling business transactions throughout the day, you will not want to also do your big-data analysis on that system, since the analysis will interfere with normal business traffic.
Problem #1 is what type of database you want to use for data analysis. You have many choices, ranging from RDBMS to Hadoop, MongoDB, and Spark. If you go with an RDBMS, you will want to change the schema to be more amenable to data analysis: create a data warehouse with a star schema. Doing this makes many tools available to you, because this method of data analysis has been around for a very long time. The other "big data" and data-analysis databases do not have the same level of tooling available, but they are quickly catching up. Each one will require research into which you want to use based on your problem set. If you have big batches of data, RDBMS and Hadoop will be good. If you have streaming data, look at MongoDB and Spark. If you are a Java shop: RDBMS, Hadoop, or Spark. If you are a JavaScript shop: MongoDB. If you are good with Scala: Spark.
Problem #2 is getting your data from your transactional database into your big data storage. You will need to find a programming language that has libraries to talk to both databases and you will have to decide when and where you will be moving this data. You can use Python, Java or Ruby to do this work.
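A minimal sketch of Problem #2 in Python: a small ETL job that pulls yesterday's rows out of the transactional database and loads them into the analysis store (here MongoDB). The table, database, and field names are invented for illustration.

```python
import psycopg2
from pymongo import MongoClient

src = psycopg2.connect("dbname=shop user=etl")
dst = MongoClient("mongodb://localhost:27017").warehouse.orders

with src.cursor() as cur:
    cur.execute("SELECT id, customer_id, total, created_at FROM orders "
                "WHERE created_at >= now() - interval '1 day'")
    batch = [{"order_id": oid, "customer_id": cid, "total": float(total),
              "created_at": created_at}
             for oid, cid, total, created_at in cur]

if batch:
    dst.insert_many(batch)   # run this on a schedule (cron, Airflow, etc.)
```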
Problem #3 is your UI. If you decide to go with an RDBMS, then you can use many of the tools already available, or you can build your own. The other data-storage solutions have tool support, but it isn't as mature as what is available for the RDBMS. You are most likely going to build your own here anyway, because your analysts will want tools built to their specifications. Java works with all of these storage mechanisms, but you can probably get Python to work too. You may want to provide a service layer built in Java that exposes a RESTful interface, and then put a web layer in front of that service layer. If you do this, then your web layer can be built in any language you prefer.
These three languages are the most commonly used for machine learning and data mining on the server side: R, Python, and SQL. If you are aiming for heavy mathematical functions and graph generation, Haskell is very popular.

What database is good enough for a logging application?

I am writing a web application with Node.js that other applications can use to store logs and access them later through a web interface, or that the applications themselves can use via an API. It is similar to Graylog2, but schema-free.
I've already tried CouchDB, in which each document would be a log doc, but since I'm not really using revisions it seems to me I'm not using all of its features. Besides that, I think that if the logs exceed a certain size they would be pretty hard to manage in CouchDB.
What I'm really looking for is a big array of logs that can be sorted, filtered, searched, and capped, with the most recent events easily accessible. It should be schema-free, and writing to it should be non-blocking.
I'm considering Cassandra (I'm not really familiar with it) due to the points made here. MongoDB seems good too, since Graylog2 uses MongoDB, and there are some good points about it here.
I've already seen this question, but I'm not satisfied with the answers.
Edit:
For various reasons I can't use Cassandra in production, so now I'm trying MongoDB.
One more reason to use MongoDB:
http://www.slideshare.net/WombatNation/logging-app-behavior-to-mongo-db
More edits:
It is similar to Graylog2, but the difference I want to make is that instead of having a single message field, the fields are defined by the client, which is why I want it to be schema-free; because of that, I may need to query on the user-defined fields. We could build it on SQL, but querying on user-defined fields would be reinventing the wheel. The same goes for files.
Technically, what I'm looking for is to get rich statistical data in the end, or easy debugging and a lot of other things that we can't currently get out of the logs.
Where shall it be stored and how shall it be retrieved?
I guess it depends on how much data you are dealing with. If you have a huge amount (terabytes and petabytes per day) of logs, then Apache Kafka, which is designed to allow data to be PULLED by HDFS in parallel, is an interesting solution, still in the incubation stage. I believe that if you want to consume Kafka messages with MongoDB, you'd need to develop your own adapter to ingest them as a consumer of a particular Kafka topic. Although MongoDB data (e.g. shards and replicas) is distributed, ingesting each message may be a sequential process, so there may be a bottleneck or even race conditions depending on the rate and size of message traffic. Kafka is optimized to pump and append that data to HDFS nodes using message brokers FAST. Then, once it is in HDFS, you can map/reduce to analyze your information in a variety of ways.
If MongoDB can handle the ingestion load, then it is an excellent, scalable, real-time solution to find information, particularly documents. Otherwise, if you have more time to process data (i.e. batch processes that take hours and sometimes days), then Hadoop or some other map/reduce database is warranted. Finally, Kafka can distribute that load of messages and hook up that fire hose to a variety of consumers. Overall, these new technologies spread the load and huge amounts of data across cheap hardware, using software to manage failure and recovery with a very low probability of losing data.
Even with a small amount of data, MongoDB is a nice alternative to traditional relational database solutions, which require more overhead in developer resources to design, build, and maintain.
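For the logging use case described in the question, here is a minimal sketch of what this can look like with a capped collection in MongoDB: the collection keeps only the most recent data, documents stay schema-free, and client-defined fields remain queryable. It assumes a recent pymongo client; the collection and field names are invented.

```python
from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")
db = client.logging

# create a capped collection once (~1 GB); oldest entries are discarded automatically
if "events" not in db.list_collection_names():
    db.create_collection("events", capped=True, size=1024 * 1024 * 1024)

# writes: whatever fields the client application sends
db.events.insert_one({"app": "billing", "level": "error", "order_id": 1234,
                      "msg": "payment declined"})

# reads: filter on user-defined fields, newest first
recent_errors = (db.events.find({"app": "billing", "level": "error"})
                 .sort("$natural", DESCENDING).limit(50))
for doc in recent_errors:
    print(doc)
```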
General Approach
You have a lot of work ahead of you. Whichever database you use, you have many features which you must build on top of the DB foundation. You have done good research about all of your options. It sounds like you suspect that all have pros and cons but all are imperfect. Your suspicion is correct. At this point it is probably time to start writing code.
You could just choose one arbitrarily and start building your application. If your guess was correct that the pros and cons balance out and it's all about the same, then why not simply start building immediately? When you hit difficulty X on your database, remember that it gave you convenience Y and Z and that's just life.
You could also establish the fundamental core of your application and implement various prototypes on each of the databases. That might give you true insight to help discriminate between the databases for your specific application. For example, besides the interface, indexing, and querying questions, what about deployment? What about backups? What about maintenance and security? Maybe "wasting" time to build the same prototype on each platform will make the answer very clear for you.
Notes about CouchDB
I suppose CouchDB is "NoSQL" if you say so. Other things which are "no SQL" include bananas, poems, and cricket. It is not a very meaningful word. We have general-purpose languages and domain-specific languages; similarly CouchDB is a domain-specific database. It can save you time if you need the following features:
Built-in web API: clients may query directly
Incremental map-reduce: CouchDB runs the job once, and then you can query the result repeatedly at no extra cost. Updates to the data set are immediately reflected in the map/reduce result without full re-processing (see the sketch after this list).
Easy to start small but expand to large clusters without changing application code.
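As a minimal sketch of those first two features over CouchDB's plain HTTP API: the database name "logs", the view definition, and the credentials in the URL are placeholders.

```python
# Store a log document, define an incremental map/reduce view, and query it.
import requests

COUCH = "http://admin:secret@localhost:5984"

requests.put(f"{COUCH}/logs")                            # create the database
requests.post(f"{COUCH}/logs", json={"app": "billing",   # store one document
                                     "level": "error"})

# design document: map emits one row per (app, level); reduce counts them
design = {
    "views": {
        "by_app_level": {
            "map": "function(doc){ emit([doc.app, doc.level], 1); }",
            "reduce": "_count",
        }
    }
}
requests.put(f"{COUCH}/logs/_design/stats", json=design)

# query the view; CouchDB only re-maps documents changed since the last query
resp = requests.get(f"{COUCH}/logs/_design/stats/_view/by_app_level",
                    params={"group": "true"})
print(resp.json())
```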
Have you considered Apache Kafka?
Kafka is a distributed messaging system developed at LinkedIn for collecting and delivering high volumes of log data with low latency. Our system incorporates ideas from existing log aggregators and messaging systems, and is suitable for both offline and online message consumption.
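For a feel of what the write path looks like, a minimal sketch of shipping log events into Kafka with the kafka-python client; the broker address and the topic name "app-logs" are placeholders.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# each application pushes its events; downstream consumers (HDFS, MongoDB, a
# web UI) read the same topic at their own pace
producer.send("app-logs", {"app": "billing", "level": "error",
                           "msg": "payment declined"})
producer.flush()
```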

Best highperf database for simple read/write (no update) scenario

I'm interested in opinions on what database system to select for this project, where I basically need to persist a constant stream of messages at potentially high speed. There are basically four types of messages with some commonalities. No relations needed. I guess you could call it an event store.
I will need to read (query by a non-unique key), but I don't need to update any data. I will have to delete old data though.
Considerations:
Database must be able to scale out
Performance is crucial
as is uptime (a system allowing live updates would be nice)
Preferably something running on Windows Server, but this is not a requirement
I'm familiar with document databases (MongoDB), but don't know what other kinds of NoSQL solutions would fit my problem, or how they compare.
MongoDB would be ideal. But if all you want to do is read from the stream and serve up content, then more than the choice of database (use any DB: MySQL, Access, SQL Server Express, XML files), I would suggest you look at putting all your data in memory (maybe at app startup) and then serving it from memory.
You should also look at some caching solutions like Memcached (http://memcached.org/)
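A minimal sketch of combining the two suggestions, with Memcached as a shared cache in front of whatever store you pick (uses the pymemcache client; the key scheme and function names are invented):

```python
import json
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def get_messages(stream_id, load_from_store):
    """Return the rendered JSON for a message stream, hitting the cache first."""
    key = f"stream:{stream_id}"
    cached = cache.get(key)                                # bytes or None
    if cached is not None:
        return cached.decode("utf-8")
    payload = json.dumps(load_from_store(stream_id))       # e.g. read from your DB or files
    cache.set(key, payload.encode("utf-8"), expire=300)    # refresh every 5 minutes
    return payload
```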

Is SQLite a good choice for a large, read only database for research?

I have a large number of records (say around 10 to 100 million), which I want to be able to query.
This is a research project, the database is going to be mostly read only, and I only need one connection at a time. I would like the queries to be reasonably fast.
Is SQLite a reasonable choice for this purpose?
My experience with SQLite is that it may be quite slow on large recordsets, depending on how you structure your queries. If your data is de-normalized and you can get by querying a single table against its primary key then it's acceptably fast, but if your data is fully normalized and your queries involve several joins then it can be much slower than a client-server database.
SQLite's principal advantage is its small size and single file nature that make it easy to distribute embedded in an app. As that doesn't seem to be a requirement for you though, I think you'd be better off going with something else. SQL Server Express is good if you're using Windows, MySQL or Postgres otherwise would be a good choice.
As pointed out in the previous posts, SQLite is a great SQL library, but it can run out of gas when the data set gets very large. Berkeley DB recently introduced a SQL API which is completely SQLite compatible. It was added to Berkeley DB in order to provide the best of both worlds to SQLite users -- the ubiquity, simplicity and ease of use of SQLite with the concurrency, scalability and reliability of Berkeley DB.
The Berkeley DB SQL API was designed to be a drop-in replacement for SQLite applications, especially those that specifically need the Berkeley DB features and scalability that isn't available in native SQLite. You can read more about it in the Berkeley DB SQL API documentation.
Disclaimer: I'm one of the Product Managers for Berkeley DB, so I'm a little biased. But your use case is one of the reasons that we worked with Dr. Hipp and the SQLite developers in order to combine the SQLite API with the Berkeley DB storage manager. It allows SQLite application developers to take their applications into new areas with added capabilities, while remaining compatible with their existing implementation.
Please let us know if you have any questions or if there is anything that we can do to help. You can find an active community of Berkeley DB developers on the OTN Forums.
Best of luck with your project.
Regards,
Dave
SQLite is not particularly fast when you get into the millions of entries. Results will vary according to what you put in there: schema, number of columns, indexes.
The advantage of SQLite (especially in your case) is that it is so light that trying it with some data is probably worth the time and effort. It's very straightforward, and its ideal use case is indeed single-user access.
I'd say try and build it up with a representative amount of data (you can do an import from a CSV file from the command line, or use one of the many wrappers available out there). If the speed is not satisfactory, you might have to switch to something with more power, but admittedly a bit more setup too, like MySQL.
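A minimal sketch of such a trial run with Python's built-in sqlite3 module; the file and column names are placeholders for your own records.

```python
# Bulk import from CSV, index the column you query on, and time a read-only query.
import csv
import sqlite3
import time

con = sqlite3.connect("research.db")
con.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER, category TEXT, value REAL)")

with open("records.csv", newline="") as f:       # assumes no header row
    con.executemany("INSERT INTO records VALUES (?, ?, ?)", csv.reader(f))
con.execute("CREATE INDEX IF NOT EXISTS idx_records_category ON records(category)")
con.commit()

# read-only usage afterwards: open with mode=ro and measure the query you care about
ro = sqlite3.connect("file:research.db?mode=ro", uri=True)
start = time.perf_counter()
rows = ro.execute("SELECT COUNT(*), AVG(value) FROM records WHERE category = ?",
                  ("A",)).fetchall()
print(rows, f"{time.perf_counter() - start:.3f}s")
```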

Which NoSQL backend to store trace data from webpage

In our web application we need to trace what users click, what they type into the search box, etc. Lots of data will be sent via AJAX. In general, the functionality is a bit similar to Google Analytics, but we need to customize it in various ways.
Data will be collected and, once per day, aggregated and exported to PostgreSQL, so the backend should be able to handle dozens of inserts. I'm not considering a traditional SQL database, because it probably won't handle so many inserts efficiently.
I wonder which backend you would use for such a task. Actually, I'm thinking about MongoDB or Cassandra, but maybe you know better software for this task, or maybe something other than a NoSQL database?
The web application is written in Ruby on Rails, so support for Ruby would be nice, but that's definitely not the most important thing.
Sounds like you need to analyse your specific requirements.
It may be that the best solution is to split / partition / shard a conventional database and then push the data up from there.
Depending on what your tolerance for data loss is, there are a lot of options. If you choose a system which has single-server durability, a major write bottleneck will be fdatasync() (assuming you use hard drives to store your data on).
If you can tolerate syncing less often than on every commit, then you may be able to tune your database to commit at timed intervals.
Depending on your table, index structure etc, I'd expect that you can get rather a lot of inserts with a "conventional" db (e.g. postgresql), if you manage it correctly and tune the durability (if it supports that) to your liking.
Sharding this into several instances of course will enable you to scale this up. However, you need to be mindful of operational requirements (i.e. what happens if some of the instances are down). Talk to your Ops team about what they're comfortable managing.
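As a minimal sketch of the durability tuning mentioned above on PostgreSQL: synchronous_commit is a standard setting that can be changed per session, while the click_events table and its columns are invented for illustration.

```python
import psycopg2

conn = psycopg2.connect("dbname=tracking user=tracker")
with conn.cursor() as cur:
    # commits no longer wait on the disk flush; a crash can lose the last few
    # moments of events, which is usually acceptable for click-trace data
    cur.execute("SET synchronous_commit TO OFF")
    cur.executemany(
        "INSERT INTO click_events (user_id, url, clicked_at) VALUES (%s, %s, %s)",
        [(1, "/search?q=shoes", "2015-06-01 12:00:00"),
         (1, "/product/42",     "2015-06-01 12:00:05")],
    )
conn.commit()
```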
