Architecture Setup and Database

I am trying to create a social app in which users can follow their friends, and each user's personalised feed updates in real time.
Question: Is a graph database the best option for this kind of problem? What is the experience like once the data reaches millions of records? Also, what is the right way to proceed for feeds: do we keep a Kafka stream for each user? And how do I start with the whole setup without over-engineering it, i.e. what is a sensible starting point and flow?

As usual, it depends 100% on how you use these technologies.
Neo4j (a graph database) can store fairly large amounts of data:
A graph database with a quadrillion nodes? Such a monstrous entity is
beyond the scope of what technologists are trying to do now. But with
the latest release of the Neo4j database from Neo Technology, such a
graph is theoretically possible.
There is effectively no limit to the sizes of graphs that people can
run with Neo4j 3.0, which was announced today, says Neo Vice President
of Products Philip Rathle.
“Before Neo4j 3.0, graph sizes were limited to tens of billions of
records,” Rathle says. “Even though they may not have tens of billions
of data items to actually store in a graph, just having a ceiling made
them nervous.”
By adopting dynamically sized pointers, Neo4j can now scale up to run
the biggest graph workloads that customers can throw at it. The
company expects some of its customers will begin to put that extra
capacity to use, for things such as crunching IoT data, identifying
fraud, and generating product recommendations.
Source: https://www.datanami.com/2016/04/26/neo4j-pushes-graph-db-limits-past-quadrillion-nodes/
Start with something simple; Neo4j sounds like a good starting point. Once you start hitting bottlenecks or scaling issues, then you can start looking at other solutions. It's very hard to predict where your bottlenecks will be without real-world data.
Realtime feeds at scale are hard to build. First define how realtime you want the feed to be: is 1 minute still considered realtime? Maybe 5 minutes?
The number you choose here will directly affect your technology choices.
Either way, more information is needed to give a more detailed answer.
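As a concrete, deliberately un-over-engineered starting point, here is a minimal sketch of a pull-based feed: query the posts of the people a user follows at read time, instead of maintaining a Kafka stream per user. It uses the official Neo4j Python driver and assumes a simple (:User)-[:FOLLOWS]->(:User) and (:User)-[:POSTED]->(:Post) model; the URI, credentials and all names are illustrative, not a prescribed design.

    from neo4j import GraphDatabase

    # Illustrative connection details; replace with your own deployment.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

    FEED_QUERY = """
    MATCH (me:User {id: $user_id})-[:FOLLOWS]->(friend:User)-[:POSTED]->(post:Post)
    RETURN friend.name AS author, post.text AS text, post.created_at AS created_at
    ORDER BY post.created_at DESC
    LIMIT $limit
    """

    def get_feed(user_id, limit=20):
        # Pull model ("fan-out on read"): compute the feed when the user asks for it.
        with driver.session() as session:
            result = session.run(FEED_QUERY, user_id=user_id, limit=limit)
            return [record.data() for record in result]

If read latency ever becomes the bottleneck, that is the point to consider fan-out on write (precomputing per-user feeds, possibly with Kafka in the pipeline), not before.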

Related

What is the right database technology for this simple outlined BI tool use case?

Reaching out to the community to pressure test our internal thinking.
We are building a simplified business intelligence platform that will aggregate metrics (e.g. traffic, backlinks) and text lists (e.g. search keywords, used technologies) from several data providers.
The data will be somewhat loosely structured and may change over time with vendors potentially changing their response formats.
Long-term, data volume may be around 100,000 rows x 25 input vectors.
Data would be updated and read continuously but not at massive concurrent volume.
We'd expect to need to do some ETL transformations on the gathered data from partners along the way to the UI (e.g show trending information over the past five captured data points).
We'd want to archive every single data snapshot (i.e. version it) vs just storing the most current data point.
The persistence technology should be readily available through AWS.
Our assumption is our requirements lend themselves best towards DynamoDB (vs Amazon Neptune or Redshift or Aurora).
Is that fair to assume? Are there any other questions / information I can provide to elicit input from this community?
Because of your requirement to have a schema-less structure, and to version each item, DynamoDB is a great choice. You will likely want to build the table as a composite Partition/Sort key structure, with the Sort key being the Version, and there are several techniques you can use to help you locate the 'latest' version etc. This is a very common pattern, and with DDB Autoscaling you can ensure that you only provision the amount of capacity that you actually need.
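For illustration only, a minimal boto3 sketch of that composite-key pattern, with the partition key identifying the metric and the sort key acting as the version; the table and attribute names are hypothetical:

    import time
    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("MetricSnapshots")  # hypothetical table name

    def put_snapshot(metric_id, payload):
        # Every write is a new, immutable version of the item.
        table.put_item(Item={
            "MetricId": metric_id,        # partition key
            "Version": int(time.time()),  # sort key; an epoch timestamp works as a version
            "Payload": payload,           # loosely structured vendor data, stored as a map
        })

    def get_latest(metric_id):
        # Query newest-first and take a single item to get the 'latest' version.
        resp = table.query(
            KeyConditionExpression=Key("MetricId").eq(metric_id),
            ScanIndexForward=False,
            Limit=1,
        )
        return resp["Items"][0] if resp["Items"] else None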

What software should I use for distributed graph storage and processing?

Problem in a nutshell:
There's a huge amount of input data in JSON format. Like right now it's about 1 Tb, but it's going to grow. I was told that we're going to have a cluster.
I need to process this data, make a graph out of it and store it in a database. So every time I get a new JSON, I have to traverse the whole graph in a database to complete it.
Later I'm going to have a thin client in a browser, where I'm going to visualize some parts of the graph, search in it, traverse it, do some filtering, etc. So this system is not high load, just a lot of processing and data.
I have no experience in distributed systems, NoSQL databases and other "big data"-like stuff. During my little research I found out that there are too many of them and right now I'm just lost.
What I've got on my whiteboard at the moment:
1. Apache Spark's GraphX (GraphFrames) for distributed computing on top of some storage (HDFS, Cassandra, HBase, ...) and some resource manager (YARN, Mesos, Kubernetes, ...).
2. Some graph database. I think it's good to use a graph query language like Cypher in Neo4j or Gremlin in JanusGraph/TitanDB. Neo4j is good, but it has clustering only in EE and I need something open source. So now I'm thinking about the latter ones, which have Gremlin + Cassandra + Elasticsearch by default.
3. Maybe I don't need any of these, and can just store the graph as an adjacency matrix in some RDBMS like Postgres and that's it.
I don't know if I need Spark in options 2 or 3. Do I need it at all?
My chief told me to check out Elasticsearch. But I guess I can use it only as an additional full-text search engine.
Thanks for any reply!
Let us start with a couple of follow-up questions:
1) 1 TB is not a huge amount of data if that is also (close to) the total amount of data. Is it? How much new data are you expecting, and at what rate will it arrive?
2) Why would you have to traverse the whole graph if each JSON document merely refers to a small part of the graph? It's either new data or an update of existing data (which you should be able to pinpoint), isn't it? (See the sketch below.)
3) Yes, that's how you use a graph database ...
The rest sort of depends on your answer to 1). If we're talking about IoT numbers of arriving events (tens of thousands per second ... sustained) you might need a big data solution. If not, your main problem is getting the initial load done, and it's easy sailing from there ;-).
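On 2), a minimal sketch of that "pinpoint" approach with the official Neo4j Python driver: each incoming JSON document is merged into the graph with MERGE, so only the nodes it mentions are touched. The document shape, the Entity label and the :RELATES_TO relationship are assumptions for illustration, not your actual model.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

    def ingest(doc):
        # doc is assumed to look like:
        # {"entities": [{"id": "a", "name": "A"}, ...], "edges": [{"from": "a", "to": "b"}, ...]}
        with driver.session() as session:
            for entity in doc["entities"]:
                # MERGE creates the node if it is new, otherwise reuses the existing one.
                session.run(
                    "MERGE (n:Entity {id: $id}) SET n.name = $name",
                    id=entity["id"], name=entity.get("name"),
                )
            for edge in doc["edges"]:
                session.run(
                    "MERGE (a:Entity {id: $src}) "
                    "MERGE (b:Entity {id: $dst}) "
                    "MERGE (a)-[:RELATES_TO]->(b)",
                    src=edge["from"], dst=edge["to"],
                )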
Hope this helps.
Regards,
Tom

What database is good enough for a logging application?

I am writing a web application with nodeJS that other applications can use to store logs, with the logs accessed later in a web interface or by the applications themselves through an API. Similar to Graylog2, but schema-free.
I've already tried CouchDB, in which each document would be a log doc, but since I'm not really using revisions it seems to me I'm not using all of its features. Besides that, I think that if the logs exceed a certain limit it would be pretty hard to manage them in CouchDB.
What I'm really looking for is a big array of logs that can be sorted, filtered, searched and capped, with the latest events easily accessible. It should be schema-free, and writing to it should be non-blocking.
I'm considering Cassandra (I'm not really familiar with it) due to the points made here. MongoDB seems good here too, since Graylog2 uses MongoDB, and there are some good points about it here.
I've already seen this question, but I'm not satisfied with the answers.
Edit:
For some reasons I can't use Cassandra in production, so now I'm trying MongoDB.
One more reason to use MongoDB:
http://www.slideshare.net/WombatNation/logging-app-behavior-to-mongo-db
More edits:
It is similar to Graylog2, but the difference I want to make is that instead of having a single message field, the fields are defined by the client, which is why I want it to be schema-free; because of that, I may need to query on the user-defined fields. We could build it on SQL, but querying on user-defined fields would be reinventing the wheel. The same goes for files.
Technically what I'm looking for is to get rich statistical data in the end, or easy debugging and a lot of other stuff that we can't get out of the logs.
Where shall it be stored and how shall it be retrieved?
I guess it depends on how much data you are dealing with. If you have a huge amount (terabytes and petabytes per day) of logs then Apache Kafka, which is designed to allow data to be PULLED by HDFS in parallel, is an interesting solution - still in the incubation stage. I believe if you want to consume Kafka messages with MongoDb, you'd need to develop your own adapter to ingest it as a consumer of a particular Kafka topic. Although MongoDb data (e.g. shards and replicas) is distributed, ingesting each message may be a sequential process. So there may be a bottleneck or even race conditions depending on the rate and size of message traffic. Kafka is optimized to pump and append that data to HDFS nodes using message brokers FAST. Then once it is in HDFS you can map/reduce to analyze your information in a variety of ways.
If MongoDb can handle the ingestion load, then it is an excellent, scalable, real-time solution to find information, particularly documents. Otherwise, if you have more time to process data (i.e. batch processes that take hours and sometimes days), then Hadoop or some other Map Reduce database is warranted. Finally, Kafka can distribute that load of messages and hookup that fire-hose to a variety of consumers. Overall, these new technologies spread the load and huge amounts of data across cheap hardware using software to manage failure and recover with a very low probability of losing data.
Even with a small amount of data, MongoDb is a nice alternative to traditional relational database solutions, which require more overhead of developer resources to design, build and maintain.
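If MongoDB is the route taken, here is a minimal pymongo sketch of the "big, capped, schema-free array of logs" idea from the question, using a capped collection plus an index on client-defined fields; the collection name, size and fields are illustrative:

    from datetime import datetime, timezone
    from pymongo import MongoClient, DESCENDING

    client = MongoClient("mongodb://localhost:27017")
    db = client.logging

    # Capped collection: fixed size, insertion order preserved, oldest entries age out.
    if "events" not in db.list_collection_names():
        db.create_collection("events", capped=True, size=512 * 1024 * 1024)
    events = db.events
    events.create_index([("app", 1), ("level", 1)])  # client-defined fields are just fields

    def log(app, level, **fields):
        # Schema-free: whatever extra fields the client sends are stored as-is.
        events.insert_one({
            "ts": datetime.now(timezone.utc),
            "app": app,
            "level": level,
            **fields,
        })

    def last_errors(app, limit=50):
        # Latest error events for one application, newest first.
        cursor = events.find({"app": app, "level": "error"}).sort("ts", DESCENDING).limit(limit)
        return list(cursor)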
General Approach
You have a lot of work ahead of you. Whichever database you use, you have many features which you must build on top of the DB foundation. You have done good research about all of your options. It sounds like you suspect that all have pros and cons but all are imperfect. Your suspicion is correct. At this point it is probably time to start writing code.
You could just choose one arbitrarily and start building your application. If your guess was correct that the pros and cons balance out and it's all about the same, then why not simply start building immediately? When you hit difficulty X on your database, remember that it gave you convenience Y and Z and that's just life.
You could also establish the fundamental core of your application and implement various prototypes on each of the databases. That might give you true insight to help discriminate between the databases for your specific application. For example, besides the interface, indexing, and querying questions, what about deployment? What about backups? What about maintenance and security? Maybe "wasting" time to build the same prototype on each platform will make the answer very clear for you.
Notes about CouchDB
I suppose CouchDB is "NoSQL" if you say so. Other things which are "no SQL" include bananas, poems, and cricket. It is not a very meaningful word. We have general-purpose languages and domain-specific languages; similarly CouchDB is a domain-specific database. It can save you time if you need the following features:
Built-in web API: clients may query directly
Incremental map-reduce: CouchDB runs the job once, and you can query repeatedly at no cost; updates to the data set are reflected in the map/reduce result without full re-processing (see the sketch after this list)
Easy to start small but expand to large clusters without changing application code.
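To illustrate the incremental map/reduce point, a small sketch against CouchDB's HTTP API using the requests library. It assumes a "logs" database already exists and that these credentials are valid; the view simply counts log events per level.

    import requests

    COUCH = "http://admin:secret@localhost:5984"   # assumed credentials and host
    DB = f"{COUCH}/logs"

    # Design document with one map/reduce view: count log documents per level.
    # One-time setup; re-running it without the current _rev returns a conflict.
    design = {
        "views": {
            "by_level": {
                "map": "function (doc) { emit(doc.level, 1); }",
                "reduce": "_count",
            }
        }
    }
    requests.put(f"{DB}/_design/stats", json=design).raise_for_status()

    # The first query builds the index; afterwards new documents only trigger
    # incremental updates instead of re-running the whole job.
    resp = requests.get(f"{DB}/_design/stats/_view/by_level", params={"group": "true"})
    print(resp.json()["rows"])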
Have you considered Apache Kafka?
Kafka is a distributed messaging system developed at LinkedIn for
collecting and delivering high volumes of log data with low latency.
Our system incorporates ideas from existing log aggregators and
messaging systems, and is suitable for both offline and online message
consumption.
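As a flavour of what that looks like for an application emitting logs, here is a minimal kafka-python producer sketch; the topic name and broker address are made up, and any number of consumers (a MongoDB ingester, HDFS, the web UI) could read the same topic independently.

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def publish_log(app, level, message, **fields):
        # Non-blocking append to the log topic; consumers decide what to do with it.
        producer.send("app-logs", {"app": app, "level": level, "message": message, **fields})

    publish_log("billing", "error", "payment gateway timeout", order_id=12345)
    producer.flush()  # make sure buffered messages actually reach the broker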

How to gain high performance with a very big database

I always wondered how a very big site like Facebook can be faster than most other sites, despite the very large amount of data that is stored every day.
What do they use to store information? And if I use SQL Server to store, e.g., the news feed, is that OK? (The news feed would be stored in a separate table called News.)
On the other hand, what would happen if I joined many huge tables with each other? Would it be slow (maybe), or does it not matter how big the tables are?
Thanks :)
When you talk about scaling at the size of Facebook, it is a whole different ball park. Latest estimates put Facebook's datacenter at about 60,000 servers (sixty thousand). The cache alone is estimated at about 30 TB (terabytes) in a massive Memcached cluster. Although their back end is still MySQL, it is used as a pure key-value store, according to publicly available information:
Facebook uses MySQL, but
primarily as a key-value persistent
storage, moving joins and logic onto
the web servers since optimizations
are easier to perform there (on the
“other side” of the Memcached layer).
There are various other technologies in use there:
HipHop to compile PHP into native code
Haystack for media (photo) storage
BigPipe for HTTP delivery
Cassandra for Inbox search
You can also watch this year's SIGMOD 2010 keynote address, Building Facebook: Performance at big scale. They even present their basic internal API:
cache_get ($ids,
'cache_function',
$cache_params,
'db_function',
$db_params);
So if you connect the dots you'll see that at such scale you no longer talk about a 'big database'. You talk about huge clusters of services, key-value storage partitioned across thousands of servers, many technologies used together and so on and so forth.
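That cache_get signature is essentially the cache-aside pattern: try Memcached first, fall back to the database for the misses, then write those back into the cache. A rough sketch of the same idea in Python with pymemcache; everything here (the key scheme, the TTL, the db_fn callback) is hypothetical and not Facebook's actual code.

    import json
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def cache_get(ids, cache_key_fn, db_fn, ttl=300):
        # Cache-aside: serve hits from Memcached, fetch the misses from the database
        # in one batch, then populate the cache for next time.
        results, misses = {}, []
        for item_id in ids:
            raw = cache.get(cache_key_fn(item_id))
            if raw is not None:
                results[item_id] = json.loads(raw)
            else:
                misses.append(item_id)
        if misses:
            for item_id, row in db_fn(misses).items():  # db_fn returns {id: row}
                cache.set(cache_key_fn(item_id), json.dumps(row), expire=ttl)
                results[item_id] = row
        return results

    # Hypothetical usage:
    # users = cache_get([1, 2, 3], lambda i: f"user:{i}", load_users_from_db)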
As a side note, you can also see a pretty good presentation of MySpace internals. Although the technology stack is completely different (Microsoft .Net and SQL Server based, with a huge emphasis on message passing via Service Broker) there are similar points in how they approach storage. To sum up: application layer partitioning.
It depends. Facebook is very fast because they have a server farm, so queries are optimised and each single query can hit many servers.
In regards to huge tables, they can be fast as long as you have enough physical memory to index whatever you need to search on. Having the correct indexes can improve database performance hugely (when it comes to retrieving data).
As long as it makes sense to join many huge tables together, then yes; but if they're separate and not related, then no. If you provide more details on what kind of tables you would be looking to merge, we might be able to help you more.
According to link text and other pages Facebook uses a technique called Sharding.
It simply uses a bunch of databases with a small portion of the site on each database. A simple algorithm for deciding which database to use could be using the first letter in the username as an index for the database. One database for 'a', one for 'b', etc. I'm sure Facebook has a more advanced scheme than that, but the principle is the same.
The result is many small independent databases that are small enough to handle the load. Facebook and all other major sites have all sorts of similar tricks to make their sites fast and responsive.
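To make the routing idea concrete, here is a small sketch that hashes the username rather than taking its first letter, which spreads users more evenly across shards; the shard count and naming are invented for illustration.

    import hashlib

    SHARD_COUNT = 8  # hypothetical number of user databases

    def shard_for_user(username):
        # Hash the username and map it onto one of the shards.
        digest = hashlib.md5(username.encode("utf-8")).hexdigest()
        return "users_shard_%d" % (int(digest, 16) % SHARD_COUNT)

    print(shard_for_user("alice"))  # always routes "alice" to the same shard

Changing the shard count later means moving data around, which is why larger systems tend to use consistent hashing or a lookup/directory service instead of a fixed modulo.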
They continuously monitor the sites for performance and other metrics and come up with solutions to the issues they find.
I think the monitoring part is more important to the performance success than the actual techniques used to gain the performance. You cannot make a site fast by blindly throwing some "good performance spells" at it. You have to know where and why you have bottlenecks before you can remove them.
It depends what the performance bottleneck is. One problem is often using the wrong technology for the problem, e.g. using a relational DB when an object DB or document store would be better, or vice versa of course.
Some people try and use the same DB for everything which is not always the answer. Sometimes it is useful to have multiple denormalizations of the same data for different purposes.
Thinking about the nature of the data and how it is written, read, queried etc is important. You can put all write-once data in one DB and optimize that db for that. Other data that is written frequently could be stored on a db optimized for that.
Distribution techniques can also assist with upscaling.

The difficulty of choosing the right database for analytics

I need some help deciding which database we should choose for our project. We are developing a web application that collects data about users' behaviour and analyses it (bad explanation, but I can't provide much more detail; web analytics data is one of our core datasets). We have estimated that we will insert approx 200 million rows per week into the database, plus data calculated from that raw data. The data must be retained for at least six months.
I have spent the last week and a half gathering information about different solutions, but there seem to be so many that I feel lost. The most promising ones I found are Cassandra, HBase and Hive. I also looked at MongoDB, Redis and some others, but they looked like they suited different needs, or the community wasn't that active.
The whole app will be run in Amazon's EC2. As a startup company pay-as-you-go pricing model fits us like a glove. The easier the database is to manage in the cloud, the better.
Scalability is important. The amount of data we will generate varies quite much and will grow over time.
We can't pay huge licensing fees. Otherwise we would probably use something like http://www.vertica.com/.
We need to do all sorts of analysis on the data, and the easier they are to write the better. I thought about using Map/Reduce for the task; HBase seems to have better support for this than Cassandra, and Hive has its own query language. Real-time analysis isn't needed; we can calculate results once a day and shovel those back into the database for fast retrieval.
Compression support would be nice, but not necessary (disk space is cheap :).
I also thought about using MySQL (because we will use that for all the user information etc. anyway), but scaling will be much harder in the future and I think at some point we would have to move to some other DB anyway. We are also more than willing to commit some time and effort to pushing the selected database forward in terms of development.
We have decided to go with Hadoop (& Hive/HBase) as our primary data store. The main reasons for this are:
It is proven technology, and many big sites are using it (Facebook...).
Lots of documentation is around, and Hadoop books have even been written.
Hive provides a nice SQL-like query language and command line, so even people who don't know Java/Python/etc. can write queries easily (see the sketch below).
It's free, and the community people seem to be helpful :)
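As a flavour of how approachable those queries are, here is a small PyHive sketch running a daily aggregation; the host, database, table and columns are made up for illustration.

    from pyhive import hive

    conn = hive.Connection(host="hive-server.internal", port=10000, database="analytics")
    cursor = conn.cursor()

    # Count yesterday's events per type from a (hypothetical) partitioned raw-event table.
    cursor.execute("""
        SELECT event_type, COUNT(*) AS events
        FROM user_events
        WHERE dt = '2011-01-31'
        GROUP BY event_type
        ORDER BY events DESC
    """)

    for event_type, events in cursor.fetchall():
        print(event_type, events)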
