Is a graph database really good for monitoring an IT enterprise?

I have a simple question about how useful graph databases are for monitoring an IT enterprise.
Many graph database vendors suggest that graph databases are optimal for monitoring IT resources (see, for example, Ref 1 or Ref 2).
In order to monitor IT resources, it is necessary to manage log data from the IT enterprise effectively.
However, many books on graph database modeling lead us to model time-series data with a time-tree approach.
I think the time-tree method will generate too many edges and will not be good for the performance of the graph database.
So... is a graph database really good for monitoring an IT enterprise?

In my opinion, the questioner is conflating two things: "monitoring IT assets as an example of a graph's strength in tracing complex relationships" and "monitoring log data for device history management".
For an IT enterprise that provides numerous services, if one logical or physical system fails, the impact can spread to other assets associated with that asset.
(This is similar to how a disease spreads through a network.)
A graph database is very well suited to modeling the system so that this impact can be traced (see the figure below).
[Figure: dependency graph showing how a failure in one asset propagates to related assets]
Next, the questioner's concern is a bit more fundamental: in order to identify the cause of a specific malfunction in IT asset management, the asset's log data must also be analyzed.
This is the part where the question of how to model time-series data in a graph database arises.
In conclusion, an RDB is better than a GDB for log data analysis and history management (to be specific, a time-series database is best).
The reason is the same one the questioner mentioned.
(It is true that a graph database beats an RDB at edge traversal, but paradoxically, blind faith in that strength leads to models that generate too many edges, which actually hurts performance. This is known as the dense node, or supernode, problem.)
Therefore, it is recommended to implement the log analysis function in a time-series database and link it via a DB link or FDW, or to use a multi-model database combining GDB and RDB, such as AgensGraph, Oracle, or OrientDB. A conceptual sketch of this split follows.
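To make that split concrete, here is a minimal, purely illustrative Python sketch. The asset names, the in-memory depends_on map, and the logs store are hypothetical stand-ins for a real graph database and time-series database: the dependency topology is what you traverse on the graph side, while raw log events are range-scanned by timestamp on the time-series side, so no per-event edges are ever created.

from collections import defaultdict
from datetime import datetime, timedelta

# Asset topology: the part that belongs in the graph database.
# Edges model "depends on" relationships between assets.
depends_on = {
    "web-frontend": ["app-server"],
    "app-server":   ["postgres-primary", "redis-cache"],
}

# Log events: the part that belongs in the time-series store,
# keyed by asset id and filtered by timestamp -- no per-event edges.
logs = defaultdict(list)

def record_log(asset_id, ts, level, message):
    logs[asset_id].append({"ts": ts, "level": level, "msg": message})

def impacted_assets(root):
    """Traverse the dependency graph (what the GDB is good at)."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return seen

def recent_errors(asset_id, since):
    """Range-scan the time-series store (what the TSDB is good at)."""
    return [e for e in logs[asset_id] if e["ts"] >= since and e["level"] == "ERROR"]

# Example: find everything the frontend depends on, then pull each
# asset's recent error logs from the time-series side.
record_log("postgres-primary", datetime.utcnow(), "ERROR", "replication lag")
for asset in impacted_assets("web-frontend"):
    print(asset, recent_errors(asset, datetime.utcnow() - timedelta(hours=1)))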


Graph database vs. RDB with link/bridge tables

I work in the fraud/AML (anti-money laundering) field, and we are exploring using a graph database to unearth hidden connections and links. I've read a fair amount about graph databases lately (mostly Neo4j, but I think the concepts are similar across different products?), and from what I can tell, they seem to be well suited to this domain. The issue is that I'm having a hard time getting buy-in from tech management, as they seem to think that we can do the same things with our existing data reporting model, which is in Hadoop and is essentially a data warehouse with specific tables that provide many-to-many link tables between the core tables (I believe Kimball calls them 'bridge' tables?).
In a way, they seem to provide the same functionality as the relationship tables in a graph DB. Given that we have already constructed the link tables in Hadoop, would a graph database provide any performance advantage for the kinds of things we may want to do (e.g. how is Customer A connected to Customer B), or have we largely negated any performance advantage of a graph DB by building all of the link tables?
On similar hardware platforms, a relational database will never be able to keep up with a well-constructed graph database when performing "path-between" queries. Never.
Every graph database product has its own internal storage representation, but they are all fundamentally designed to store nodes and edges and to support navigational queries across those nodes and edges. Without the addition of new graph-support features, relational databases will struggle to provide graph-like capabilities.
The other advantage of using a native graph database is that the graph query languages are specifically designed to support path-between queries. In Objectivity/DB, a massively scalable and distributable object/graph database, we can use the DO query language to find all of the paths between two entities up to a specified number of degrees apart in milliseconds or seconds. A DO query might look like the following:
Match p = (:Account { accountId = "1234"})
-[*..100]->
(:Account { accountId = "5678"})
return p;
Here, we are saying: Find all paths (p) from Account 1234 to Account 5678, where they are between 1 and 100 degrees apart.
To create and execute this same query in a relational database would be much more complicated (without the addition of graph features to the database), and the execution of a query like this in a relational database would be much more resource-intensive (memory, CPU, I/O).
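To see why, here is a rough Python sketch of what an application typically ends up doing by hand against a bridge/link table: pull the account-to-account links and run a breadth-first search, effectively re-joining at every hop. The table contents and column layout here are hypothetical; a native graph database performs this traversal internally over direct node and edge references instead of repeated joins or round trips.

from collections import deque

# Hypothetical rows from a bridge/link table: (from_account, to_account).
links = [("1234", "9999"), ("9999", "5678"), ("1234", "4242")]

# Build an adjacency map once; in SQL each hop would be another self-join
# or another round trip against the link table.
adjacency = {}
for src, dst in links:
    adjacency.setdefault(src, []).append(dst)

def paths_between(start, goal, max_depth=100):
    """Breadth-first search for all simple paths up to max_depth hops."""
    results, queue = [], deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            results.append(path)
            continue
        if len(path) > max_depth:
            continue
        for nxt in adjacency.get(path[-1], []):
            if nxt not in path:          # avoid cycles
                queue.append(path + [nxt])
    return results

print(paths_between("1234", "5678"))     # [['1234', '9999', '5678']]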
If you have the opportunity to explore graph databases for your project, make sure you understand your scalability and data distribution requirements. That information will be key to selecting the correct product.
Disclaimer: I am the Director of Field Operations for Objectivity.

What database is good enough for logging application?

I am writing a web application with Node.js that can be used by other applications to store logs, which can then be accessed later through a web interface or by the applications themselves via an API. It is similar to Graylog2 but schema-free.
I've already tried CouchDB, in which each document would be a log entry, but since I'm not really using revisions it seems to me that I'm not using all of its features. Besides that, I think that if the logs exceed a certain size they would be pretty hard to manage in CouchDB.
What I'm really looking for is a big array of logs that can be sorted, filtered, searched, and capped, with the latest events readily accessible. It should be schema-free, and writing to it should be non-blocking.
I'm considering Cassandra (I'm not really familiar with it) due to the points made here. MongoDB seems good too, since Graylog2 uses MongoDB, and here are some good points about it.
I've already seen this question, but I'm not satisfied with the answers.
Edit:
For some reasons I can't use Cassandra in production, so now I'm trying MongoDB.
One more reason to use MongoDB:
http://www.slideshare.net/WombatNation/logging-app-behavior-to-mongo-db
More edits:
It is similar to Graylog2, but the difference I want to make is that instead of having a single message field, the fields are defined by the client, which is why I want it to be schema-free; because of that, I may need to query on the user-defined fields. We could build it on SQL, but querying on user-defined fields would be reinventing the wheel. The same goes for files.
Technically what I'm looking for is to get rich statistical data in the end, or easy debugging and a lot of other stuff that we can't get out of the logs.
Where shall it be stored and how shall it be retrieved?
I guess it depends on how much data you are dealing with. If you have a huge amount (terabytes and petabytes per day) of logs, then Apache Kafka, which is designed to allow data to be PULLED by HDFS in parallel, is an interesting solution (still in the incubation stage). I believe that if you want to consume Kafka messages with MongoDB, you'd need to develop your own adapter to ingest them as a consumer of a particular Kafka topic. Although MongoDB data (e.g. shards and replicas) is distributed, ingesting each message may be a sequential process, so there may be a bottleneck or even race conditions depending on the rate and size of message traffic. Kafka is optimized to pump and append that data to HDFS nodes using message brokers FAST. Once it is in HDFS you can map/reduce to analyze your information in a variety of ways.
If MongoDB can handle the ingestion load, then it is an excellent, scalable, real-time solution for finding information, particularly documents. Otherwise, if you have more time to process data (i.e. batch processes that take hours and sometimes days), then Hadoop or some other MapReduce database is warranted. Finally, Kafka can distribute that load of messages and hook up that fire hose to a variety of consumers. Overall, these new technologies spread the load and huge amounts of data across cheap hardware, using software to manage failure and recover, with a very low probability of losing data.
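As a rough illustration of that adapter idea, here is a minimal sketch using the kafka-python and pymongo client libraries. The topic name, broker address, and database/collection names are assumptions, not anything prescribed above; it consumes log events from a Kafka topic and writes them into MongoDB in small batches rather than one write per event.

import json
from kafka import KafkaConsumer          # pip install kafka-python
from pymongo import MongoClient          # pip install pymongo

# Assumed addresses and names -- adjust for your environment.
consumer = KafkaConsumer(
    "app-logs",                          # Kafka topic carrying log events
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
events = MongoClient("mongodb://localhost:27017").logging_db.events

batch = []
for record in consumer:                  # blocks, yielding one message at a time
    batch.append(record.value)
    if len(batch) >= 500:                # bulk insert instead of per-event writes
        events.insert_many(batch)
        batch.clear()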
Even with a small amount of data, MongoDB is a nice alternative to traditional relational database solutions, which require more developer overhead to design, build, and maintain.
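If MongoDB is the choice, the questioner's "big array of logs that can be capped" maps fairly directly onto a capped collection. A minimal pymongo sketch (the collection name, size, and fields are just examples) might look like this:

from pymongo import MongoClient, DESCENDING

db = MongoClient("mongodb://localhost:27017").logging_db

# Capped collection: fixed size, insertion order preserved, and old entries
# automatically aged out -- a good fit for "a big array of logs".
if "logs" not in db.list_collection_names():
    db.create_collection("logs", capped=True, size=512 * 1024 * 1024)

logs = db.logs
# Schema-free insert: clients can send whatever fields they like.
logs.insert_one({"app": "billing", "level": "error", "order_id": 42, "msg": "timeout"})

# Query on a client-defined field, newest entries first.
recent = logs.find({"app": "billing", "level": "error"}).sort("$natural", DESCENDING).limit(20)
for entry in recent:
    print(entry)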
General Approach
You have a lot of work ahead of you. Whichever database you use, you have many features which you must build on top of the DB foundation. You have done good research about all of your options. It sounds like you suspect that all have pros and cons but all are imperfect. Your suspicion is correct. At this point it is probably time to start writing code.
You could just choose one arbitrarily and start building your application. If your guess was correct that the pros and cons balance out and it's all about the same, then why not simply start building immediately? When you hit difficulty X on your database, remember that it gave you convenience Y and Z and that's just life.
You could also establish the fundamental core of your application and implement various prototypes on each of the databases. That might give you true insight to help discriminate between the databases for your specific application. For example, besides the interface, indexing, and querying questions, what about deployment? What about backups? What about maintenance and security? Maybe "wasting" time to build the same prototype on each platform will make the answer very clear for you.
Notes about CouchDB
I suppose CouchDB is "NoSQL" if you say so. Other things which are "no SQL" include bananas, poems, and cricket. It is not a very meaningful word. We have general-purpose languages and domain-specific languages; similarly, CouchDB is a domain-specific database. It can save you time if you need the following features (a small usage sketch follows the list):
Built-in web API: clients may query directly
Incremental map-reduce: CouchDB runs the job once, but you can query repeatedly at no cost. Updates to the data set are immediately reflected in the map/reduce result without full re-processing
Easy to start small but expand to large clusters without changing application code.
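As a small illustration of the first two points, the sketch below uses Python's requests library against a local CouchDB instance. The database name, design document, and view are hypothetical, and it assumes an installation that allows unauthenticated access; it creates an incremental map/reduce view, writes a log document over plain HTTP, and queries the view.

import requests

couch = "http://localhost:5984"
requests.put(f"{couch}/logs")                      # create the database (no-op if it exists)

# Design document with an incremental map/reduce view: counts log docs by level.
requests.put(f"{couch}/logs/_design/stats", json={
    "views": {
        "by_level": {
            "map": "function (doc) { emit(doc.level, 1); }",
            "reduce": "_count",
        }
    }
})

# Any HTTP client can write documents directly -- the "built-in web API".
requests.post(f"{couch}/logs", json={"level": "error", "app": "api", "msg": "timeout"})

# Query the view; CouchDB only re-processes documents changed since the last query.
counts = requests.get(f"{couch}/logs/_design/stats/_view/by_level",
                      params={"group": "true"}).json()
print(counts["rows"])                              # e.g. [{"key": "error", "value": 1}]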
Have you considered Apache Kafka?
Kafka is a distributed messaging system developed at LinkedIn for collecting and delivering high volumes of log data with low latency. Our system incorporates ideas from existing log aggregators and messaging systems, and is suitable for both offline and online message consumption.

Suggest: Non RDBMS database for a noob

For a new application based on Erlang and Python, we are thinking of trying out a non-RDBMS database (just for the sake of it). Some of the databases I've researched are MongoDB, CouchDB, Cassandra, Redis, Riak, and Scalaris. Here is a list of simple requirements.
Ease of development - I need to make a quick proof-of-concept demo. So the database needs to have good adapters for Erlang and Python.
I'm working on a new application where we have lots of "connected" data. Somebody recommended Neo4j for graph-like data. Any ideas on that?
Scalable - We are looking at a distributed architecture, hence scalability is important.
For the moment, performance (in any form) isn't exactly at the top of my list, and I don't think we'll be hitting the limitations of any of the above-mentioned databases anytime soon.
I'm just looking for a starting point for non-RDBMS database. Any recommendations?
We have used Mnesia in building an enterprise application. Mnesia performs at its best when its tables are fragmented, because then it does not hit table size limits. Mnesia has performed well for the last year and is still running. We have around 15 million records per table on average and around 24 tables in a given database schema.
I recommend the Mnesia database, especially the one shipped with Erlang R14B03 from the Erlang.org website. We have used CouchDB and Membase Server (http://www.couchbase.com) for some parts of the system, but Mnesia is the main (primary) data storage. Backups have been automated very well, and the system scales well against increasing data size, even with tables running under many checkpoints. Its distribution, auto-replication, and complex data model enabled us to build the application very quickly without worrying about replication, scalability, and fail-over/take-over of systems.
Mnesia scales well, and its schema can be configured and changed while the database is running. Tables can be moved, copied, altered, etc. while the system is live. Generally, it has all the features of powerful systems built on top of Erlang/OTP. When you google Mnesia DBMS, you will find a number of books and papers that will tell you more.
Most importantly, our application is web based, powered by the Yaws web server (yaws.hyber.org), and we are impressed with Mnesia's performance. Its record lookup speeds are very good, and the system feels light yet renders a lot of data. Do give Mnesia a try and you will not regret it.
EDIT: To quickly use it in your application, look at the answer given here
Ease of development - I need to make a quick proof-of-concept demo. So the database needs to have good adapters for Erlang and Python.
Riak is written in Erlang => speaks Erlang natively
I'm working on a new application where we have lots of "connected" data. Somebody recommended Neo4j for graph-like data. Any ideas on that?
Neo4j is great for "connected" data. It has Python bindings, and there are some Erlang adapters (see "How to Use Neo4j From Erlang"). One thing to note: Neo4j is not as easy to scale out, at least for free. But it is fully transactional (even JTA), it persists things to disk, and it is baked into Spring Data.
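For the Python side specifically, a minimal sketch using the official neo4j Python driver could look like the following; the connection details and the tiny data model are assumptions, and the Erlang adapters linked above would look quite different.

from neo4j import GraphDatabase          # pip install neo4j

# Assumed connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two connected people (hypothetical data model).
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:KNOWS]->(b)",
        a="Alice", b="Bob",
    )
    # Traverse the "connected" data: who does Alice know, directly or indirectly?
    result = session.run(
        "MATCH (a:Person {name: $name})-[:KNOWS*1..3]->(p) RETURN DISTINCT p.name AS name",
        name="Alice",
    )
    print([record["name"] for record in result])

driver.close()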
Scalable - We are looking at a distributed architecture, hence scalability is important.
For the moment, performance (in any form) isn't exactly at the top of my list, and I don't think we'll be hitting the limitations of any of the above-mentioned databases anytime soon.
I believe given your input, Riak would be the best choice for you:
Written in Erlang
Naturally Distributed
Very easy to develop for/with
Lots of features (secondary indices, virtual nodes, fully modular, pluggable persistence [LevelDB, Bitcask, InnoDB, flat file, etc.], extremely reliable, built-in full-text search, and more; see the sketch after this list)
Has an extremely passionate and helpful community with Basho backing it up
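For completeness, a basic store-and-fetch with Basho's riak Python client might look like the sketch below. The bucket and key names are made up, it assumes a Riak node on the default local ports, and you should check the client documentation for the exact version you install.

import riak                               # pip install riak (Basho's Python client)

# Assumes a local Riak node reachable on its default ports.
client = riak.RiakClient()
bucket = client.bucket("sessions")

# Riak is a schema-free key/value store: put a JSON value under a key...
obj = bucket.new("user:42", data={"user": "alice", "logged_in": True})
obj.store()

# ...and fetch it back by key.
print(bucket.get("user:42").data)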

Transforming (Synchronizing) Data between SQL and HBase

We are overhauling our product by completely moving from the Microsoft/.NET family to open source (one of the reasons being cost cutting, another the exponential increase in data).
We plan to move our data model completely from SQL Server (relational data) to Hadoop (the famous key-value ecosystem).
In the beginning, we want to support both versions (say v1.0 and the new v2.0). In order to maintain data consistency, we plan to sync the data between both systems, which is a fairly challenging and error-prone task, but we don't have any other option.
A bit confused about where to start, I am looking to the community of experts.
Any strategy/existing literature or any other kind of guidance in this direction would be greatly helpful.
I am not entirely sure how your code is structured, but if you currently have a data or persistence layer, or at least a database access class that all your SQL is executed through, you could override the save functions to write changes to both databases (a sketch of this follows the options below). If you do not have a data layer, you may want to consider writing one before starting the transition.
Otherwise, you could add triggers in MSSQL to update Hadoop; I'm not sure what you can do on the Hadoop side to keep MSSQL in sync.
Or, you could have a process that runs every x minutes, that manually syncs the two databases.
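The first of these options might look something like the following sketch, where a thin repository writes through to both stores. The SqlServerStore/HBaseStore objects are hypothetical placeholders for whatever drivers you actually use (e.g. pyodbc on one side, happybase on the other), not a prescribed API.

class DualWriteRepository:
    """Writes every change to the old (SQL Server) and new (Hadoop/HBase) stores."""

    def __init__(self, sql_store, hbase_store, logger):
        # Both stores are injected; they wrap whatever client libraries you choose.
        self.sql_store = sql_store
        self.hbase_store = hbase_store
        self.logger = logger

    def save_order(self, order: dict) -> None:
        # The relational database stays the system of record...
        self.sql_store.save_order(order)
        try:
            # ...and the new store is updated on a best-effort basis.
            self.hbase_store.put("orders", row_key=str(order["id"]), data=order)
        except Exception as exc:
            # Don't fail the user's request because the experimental side is down;
            # record the miss so a reconciliation job can repair it later.
            self.logger.warning("HBase write failed for order %s: %s", order["id"], exc)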
Personally, I would try to avoid maintaining two databases of record. Moving changes from a new, experimental database to your stable database seems risky; you stand the chance of corrupting your stable system. Instead, I would write a converter to move data from your relational DB to Hadoop. Then every night or so, copy your data into Hadoop and use it for the development and testing of your new system. I think test users would understand if you said your beta version is just a test playground and won't affect your live product. If you plan on making major changes to your UI and fear some users will not want to transition to 2.0, then you might be trying to tackle too much at once.
Those are the solutions I came up with... Good luck!
Consider using a queuing tool like Flume (http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/) to split your input between both systems.

Database selection for a web-scale analytics application

I want to build a web-application similar to Google-Analytics, in which I collect statistics on my customers' end-users, and show my customers analysis based on that data.
Characteristics:
High scalability, handle very large volume
Compartmentalized - Queries always run on a single customer's data
Support analytical queries (drill-down, slices, etc.)
Due to the analytical needs, I'm considering using an OLAP/BI suite, but I'm not sure it's meant for this scale. Would a NoSQL database fit? Would a simple RDBMS do?
This is what I am using at work in a production environment, and it works like a charm.
I coupled three things:
PostgreSQL + LucidDB + Mondrian (more generally, the whole set of Pentaho BI suite components)
PostgreSQL: I am not going to describe PostgreSQL; it is a really strong open-source RDBMS that will certainly let you do everything you need. I use it to store my operational data.
LucidDB: LucidDB is an open-source column-store database. It is highly scalable and provides a real gain in processing time compared to PostgreSQL when retrieving large amounts of data. It is optimized for intensive reads rather than transaction processing. This is my data warehouse database.
Mondrian: Mondrian is an open-source ROLAP cube engine. LucidDB makes it easy to connect the two programs together.
I would recommend you look at the whole Pentaho BI Suite; it is worth it, and you might want to use some of its components.
Hope this helps.
There are two main architectures you could opt for at true web scale:
1. "BI" architecture
Event journaller (e.g. LWES Journaller) or immutable event store (e.g. HDFS) feeds
Analytics/column-store database (e.g. Greenplum, InfiniDB, LucidDB, Infobright) feeds
Business intelligence reporting tool (e.g. Microstrategy, Pentaho Business Analytics)
2. "NoSQL" architecture
(Optional) Event journaller or immutable event store feeds
NoSQL database (e.g. Cassandra, Riak, HBase) feeds
A custom analytics UI (e.g. using D3.js)
The immutable event store or journaller is there because in most cases you want to be batching your analytics events and doing bulk updates to your database (even with something like HDFS) - rather than doing an atomic write for every single page view etc.
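As a tiny illustration of that batching idea, the Python sketch below buffers events in memory and flushes them as newline-delimited JSON files that a bulk loader could later pick up. The directory, threshold, and event shape are arbitrary; in production the flush target would be something like S3 or HDFS.

import json, time, uuid
from pathlib import Path

BATCH_DIR = Path("event-batches")        # stand-in for S3/HDFS in production
BATCH_DIR.mkdir(exist_ok=True)

_buffer = []

def track(event: dict) -> None:
    """Collect page views etc. in memory instead of writing them one by one."""
    _buffer.append({**event, "ts": time.time()})
    if len(_buffer) >= 1000:
        flush()

def flush() -> None:
    """Write one newline-delimited JSON file per batch for later bulk loading."""
    if not _buffer:
        return
    path = BATCH_DIR / f"{uuid.uuid4().hex}.ndjson"
    path.write_text("\n".join(json.dumps(e) for e in _buffer))
    _buffer.clear()

track({"customer": "acme", "type": "page_view", "url": "/pricing"})
flush()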
For SnowPlow, our open-source analytics platform built on Hadoop and Hive, the event logs are all collected on S3 first before being batch loaded into Hive.
Note that the "NoSQL architecture" will involve a fair bit more development work. Remember that with either architecture, you can always shard by customer if the volumes grow truly epic (billions of rows per customer) - because there's no need (I'm guessing) for cross-customer analytics.
I'd say that having OLAP analysis in place is always nice, and it opens up great potential for sophisticated data analysis using MDX.
What do you mean by large volume?
Where is your customers' end-user information stored?
What kind of front-end and reporting tools are you going to use?
Cheers.
Disclaimer: I'm going to make some publicity for my own solution. Have a look at www.icCube.com and contact me for more details.
