Data modelling tools used by technology startups

I am familiar with the data modelling tools used in enterprise software and IT organizations.
I want to know which data modelling tools technology startups use in particular, preferably with some explanation of why.
Early-stage and high-growth startups have different imperatives that influence engineering and development choices, hence the question.

If you are using a MySQL database, MySQL Workbench is a good open-source tool.
Otherwise, you can use other data modelers such as Navicat, ERBuilder or Dezign, which support multiple RDBMSs.
These three tools are easy to use.

Related

Which community edition graph database supports a highly available cluster and has good online query performance?

I am currently building a knowledge graph for an e-commerce company, and it mainly consists of the product category hierarchies, properties, and relations among them. Besides the common relational queries, we care about the following points very much:
Master-slave cluster support. This graph database will be used for online search query processing, so high availability is crucial to us. The data volume won't be as big as millions of nodes, so we don't need a distributed cluster that can spread data across multiple machines. Rather, we need multiple machines that can be read simultaneously, so that the service won't go down even if one of the machines is offline.
Fast online query performance. Reasoning about relations can be done offline, so its performance is not that important. But we need to do a lot of online queries like "find the nodes whose property P equals value V", so we need good performance for online query processing. This database will be read-intensive and won't change very much after its initialization.
Community and documentation. Since our team is new to the field of graph databases, we expect user-friendly documentation for deployment and development, and an active community for solving problems.
Based on the requirements above, I investigated some candidates:
Neo4j. We first tried Neo4j since it's the most popular one in the field. Actually, I liked it, especially the Cypher query language. But we are about to abandon it because the community edition does not support any cluster, and currently, we don't have the budget to pay for the enterprise edition.
OrientDB. OrientDB is probably the second most popular one on the market, and it seems to support clustering in its community edition. I use the word "seems" because it is not clearly stated on its website. Can anyone clear this up? Besides, I found a negative article about OrientDB which makes me hesitate: http://orientdbleaks.blogspot.jp/2015/06/the-orientdb-issues-that-made-us-give-up.html
Titan. Titan is also great, but since its original company has been acquired and its original developers are developing a different product, its future development and maintenance are in doubt.
ArangoDB. This one seems to be very fast, according to the performance report (https://www.arangodb.com/2015/10/benchmark-postgresql-mongodb-arangodb/), but I don't know if its online query processing ability is good enough, and I also don't know how well it supports clustering.
As for documentation and community, I really have no idea since these are the kind of things that you only get to know after you start doing it.
To sum up, based on my requirements, I think OrientDB and ArangoDB may be my candidates, but I don't know which one to choose because of the points I stated above. Or is there perhaps another suitable candidate that I'm missing?
Thanks.
Max working for ArangoDB here. ArangoDB does not only do online queries for graphs, but due to its multi-model nature you can mix graph queries with document queries (using secondary indexes), key lookups and joins. It has a sophisticated query engine with an optimizer that is fully aware of the ArangoDB cluster structure and can optimize and distribute query executions across all instances.
In a cluster, sharding, synchronous replication and self-healing are all fully automatic with configurable parameters. Deployment of an ArangoDB cluster is particularly simple (literally two clicks) on Apache Mesos or DC/OS, but is also relatively straightforward with other orchestration frameworks. ArangoDB on DC/OS additionally allows you to scale up and down via the graphical user interface or REST API calls, and failed tasks are automatically replaced.
As to performance, all our benchmarks show very good results. The just-released version 3.1 even has vertex-centric indexes, which are particularly important for graph queries.
We do our best to provide extensive documentation, which you can find at https://www.arangodb.com/documentation/ . We have a user manual, a manual for our query language AQL as well as one for the HTTP/REST API. Furthermore, we have tutorials, frequently asked questions, a "Cookbook" for standard tasks, and we try to answer questions on StackOverflow and GitHub issues in a timely manner.
All of this is included in the Community Edition, which is available with the Apache 2.0 open source license.
If you have more questions, do not hesitate to reach out to our team or to me personally.
OrientDB Community Edition is free, open-source software that is built by a community of developers and is constantly improving. Features such as horizontal scaling, fault tolerance, clustering, sharding and replication aren’t disabled in the Community Edition.
For more information about clustering, take a look at the official OrientDB guide: http://orientdb.com/docs/last/Tutorial-Clusters.html
Hope it helps.
Regards
Neo4j Enterprise Edition can be used under the AGPL license. I am surprised that a lot of people aren't aware of this. If you are using Neo4j Enterprise as a server and communicating with it via REST or the Bolt protocol (Apache licensed), then you don't have to worry about releasing the code of the system connecting to it under the AGPL.
If you are using it embedded, then you have to adhere to the AGPL, but then why would you need Neo4j Enterprise in that situation?
Remember to clone and compile Neo4j Enterprise from GitHub if you plan on using it under the AGPL; don't download the trial.
Neo Technology gives great support, and that is essentially what you are paying for with the enterprise subscription.

What can an RDBMS do that Neo4j (and graph databases) can't?

“A Graph Database –transforms a–> RDBMS”
The Neo4j site seems to imply that whatever you can do in RDBMS, you can do in Neo4j.
Before choosing Neo4j as a replacement for an RDBMS, I need some doubts answered.
I am interested in Neo4j for:
the ability to quickly modify the data "schema"
the ability to express entities naturally instead of through relations and normalization
...which leads to highly expressive code (better than an ORM)
This is a NoSQL solution I am interested in for its features, not for high performance.
Question: Does Neo4j present any issues that may make it unsuitable as an RDBMS replacement?
I am particularly concerned about these:
Is there any DB feature I must implement in application logic? (For example, with a few NoSQL DBs you must implement joins in the application layer.)
Are the fields "indexed" to allow a lookup faster than O(n)?
How do I handle hot backups and replication?
Are there any issues with "altering" the schema, or with letting entities with different versions of the schema live together?
This is an extremely broad topic covering everything from modeling and implementation to IT and support. It's impossible to really answer all those questions here, especially without details on your situation. However, you seem to be exploring options and avenues. So, I'll just pass on some general food for thought as someone that's implemented a number of systems.
Everybody seems to think their new database paradigm is a replacement for relational databases. So, take those claims with a grain of salt.
I like to think in terms of 3 fundamental models: Relational, Document, and Graph. Depending on your problem space, one or even more of these is the right answer. I would not do financial transactions in anything but relational (SQL based). If you are building a CMS, then a Document DB is the way to go. If my application is modeling networks (roads, people, connections, networks etc.), I use Neo4j.
As far as production quality, there are solid options in each category. Relational has a bunch. For document databases I'd go MongoDB or a higher level JCR system like Apache Jackrabbit. For graphing, I only have experience with Neo4j and it is rock solid for me.
Whatever you do, don't buy into the hype that "We have the one technology that solves all your problems." It's not there and it narrows your thinking.
I'm convinced that Neo4j is by now a good replacement for relational databases.
It is ACID compliant
Though the community version lacks some features like hot backups, the enterprise edition has them
You can get support for it
At first sight (and in the new releases, where you don't need a START clause), its query language Cypher can do almost anything SQL can
but
it's harder to find a Cypher developer than a SQL one
and it does not have an equivalent optimizer: how you write the query matters more than with SQL
Though it supports replication and Neo explicitly markets it as a big data product, I can't confirm it is scalable enough and I did not study security aspects.
In recent releases (newer than the question above), one can define indexes on labels, which work like indexes on tables in a relational DB, allowing for O(log(n)) lookups.
(FYI: Neo4j has no tables, but each node (~= row) can have different labels, comparable to Gmail labels. This is more flexible: you don't have to choose whether or not to put cars and bicycles in one table for vehicles; a bicycle would have both a :vehicle and a :bicycle label.)
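To make the label and index idea concrete, here is a minimal sketch in plain Python. It is not Neo4j code and does not use any Neo4j API; the class name, the node data and the sorted-list index are all assumptions chosen for illustration. It only shows why a sorted index per label and property lets a lookup use binary search instead of a scan over every node.

```python
from bisect import bisect_left, insort
from collections import defaultdict


class TinyGraph:
    """Toy in-memory node store with labels and a sorted property index."""

    def __init__(self):
        self.nodes = {}                    # node id -> (labels, properties)
        self.index = defaultdict(list)     # (label, key) -> sorted [(value, node id)]

    def add_node(self, node_id, labels, **props):
        self.nodes[node_id] = (set(labels), props)
        # A node is indexed once per label it carries.
        for label in labels:
            for key, value in props.items():
                insort(self.index[(label, key)], (value, node_id))

    def find(self, label, key, value):
        """Binary-search the sorted index instead of scanning every node."""
        entries = self.index[(label, key)]
        pos = bisect_left(entries, (value,))
        found = []
        while pos < len(entries) and entries[pos][0] == value:
            found.append(entries[pos][1])
            pos += 1
        return found


graph = TinyGraph()
graph.add_node(1, ["vehicle", "car"], brand="Volvo")
graph.add_node(2, ["vehicle", "bicycle"], brand="Gazelle")
graph.add_node(3, ["vehicle", "bicycle"], brand="Volvo")

volvo_vehicles = graph.find("vehicle", "brand", "Volvo")
volvo_bicycles = graph.find("bicycle", "brand", "Volvo")
```

Because node 3 carries both labels, it shows up in lookups through either one. A real database would use a tree structure rather than a Python list, so that inserts are cheap as well as lookups.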
To answer the original question: Neo4j has hardly any support for schema enforcement. Neo advises implementing automated consistency tests on your database, which you run on your acceptance test instance as part of your release cycle.
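Such a consistency test could look roughly like the sketch below. It is plain Python over an in-memory dictionary, not a query against a real Neo4j instance, and the rule table, labels and property names are made up for illustration. In practice you would fetch the nodes from the acceptance database first.

```python
# Hypothetical rules: the properties a node with a given label must carry.
REQUIRED_PROPERTIES = {
    "vehicle": {"brand"},
    "bicycle": {"brand", "gears"},
}


def find_violations(nodes):
    """Return (node id, label, missing property) for every broken rule."""
    violations = []
    for node_id, (labels, props) in sorted(nodes.items()):
        for label in sorted(labels):
            required = REQUIRED_PROPERTIES.get(label, set())
            for prop in sorted(required - set(props)):
                violations.append((node_id, label, prop))
    return violations


# Stand-in for data exported from the acceptance test instance.
nodes = {
    1: ({"vehicle", "car"}, {"brand": "Volvo"}),
    2: ({"vehicle", "bicycle"}, {"brand": "Gazelle"}),  # "gears" is missing
}

violations = find_violations(nodes)
```

A release pipeline would fail the build whenever the returned list is not empty.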
Using an enterprise DB such as Oracle will give you many, many features which may or may not be part of Neo4j. These include:
ACID transactions
High availability / backups / standby
The ability to use SQL to get data in the most efficient way using a cost-based optimizer: the DB determines the best way to retrieve the data based on your latest statistics
Scalability, partitioning
Support
Security
If you are going to implement most of the functionality of your application in code by yourself and don't require the structure and advanced features offered by an RDBMS, or if your data structures are better suited to a graph-based DB, then by all means trial Neo4j. There is a reason that most corporate apps use one of the traditional RDBMS servers, but this may not always be the case in the future.

Real World Experience of db4o and/or Eloquera Database

I am evaluating two object databases, db4o (http://www.db4o.com) and Eloquera Database (http://eloquera.com), for an upcoming project. I have to choose one. My basic requirements are scalability, multi-user support and easy type evolution for RAD.
Please share your real world experience.
If you have both, can you compare these two? Which do you prefer?
For the last 2 years I've been using DB4O, and I'm now switching to Eloquera.
My reasons, in order:
I'm building a commercial product, and the royalty-based licensing on DB4O is WAY too high; DB4O said we could "talk about it", but I'm a very small development shop, and giving away a huge chunk of each sale I make just doesn't make any sense when there's a perfectly good alternative.
I'm using Db4oTool.exe to modify my assemblies in a post-build step, and it really slows down the build process. Eloquera doesn't need to modify my assemblies.
I found a bug in the DB4O code, and it took many, many months before the fix was integrated into their codebase. I have found bugs in Eloquera, and they fixed them in a day or two.
DB4O is not yet on .NET 4 (although they finally have an early beta). DB4O is the ONLY thing holding me back from using VS2010 (and .NET 4). I tried migrating to VS2010 but VS2010 automatically converts all unit tests to .NET 4, so all of my persistence related unit tests immediately failed.
DB4O is not really designed to be thread-safe.
DB4O has many features, including much of its API, that are obviously ported from Java.
Robert
Eloquera ( www.eloquera.com ) was originally designed and developed for use in the Web environment, and it is designed as a native .NET application in C#.
Eloquera wasn’t ported from Java, as many other databases were.
As part of its architecture, Eloquera natively supports:
Simultaneous user access
Security settings
A genuine client/server architecture, with a desktop mode available.
A maximum database size of 1 TB+. At large data scale, Eloquera maintains fast query response; it has patent-pending technologies, including a virtual file system, indexing, and an adaptive cache. Eloquera has state-of-the-art reflection written in MSIL that allows it to outperform many databases that use Microsoft’s standard reflection.
An in-memory database for fast data processing
Since most Web users come from the relational database world, it was natural for Eloquera to support SQL and LINQ
EF support is due next month
Unlike some databases, Eloquera does not blindly put objects in the database. If you change a type's fields from two ints (int; int;) to a single long (long;), it will not keep returning wrong query results because it still sees two ints; instead, it will notify the user to update the definition.
Eloquera provides native indexing for properties and fields. Most databases do not provide property indexing.
I might argue with Carl about db4o being the easiest database on the market, since Eloquera can do the same things from an API perspective.
Eloquera is younger than Versant and still has some enterprise features coming.
Last month the Eloquera R&D department started work on Eloquera Parallel Server, which will provide horizontal scaling that will arguably be an order of magnitude cheaper than Versant’s VOD.
Some distinguishing points:
Eloquera is FREE for commercial use. You are not required to pay any royalties. You get all the features above for FREE.
Eloquera has commercial support available.
Eloquera is designed for the modern world with a modern architecture. It was not adapted piecemeal to market needs over time; these features are a natural part of Eloquera’s architecture.
If you are interested in hearing about user experiences with db4o, I suggest you also ask in our db4o user forums.
While db4o was originally developed for embedded use in applications with limited resources (and now runs very well on constrained platforms like Android, CompactFramework and Silverlight) I know that we do have many users that are happily using db4o for web applications.
Indeed, there is some truth in the db4o-bashing post by leatrop: the db4o server core currently only allows one thread at a time to enter for storing and querying tasks in a particular database.
However there are a couple of ways to make db4o applications scale very well:
Since the setup cost for a db4o database is very low (one single API call), it is possible to work with multiple databases. You can use the db4o replication system (dRS) to distribute objects between multiple databases. It is also possible to create backups of db4o databases while they are running and to replicate these backups to multiple machines. The approach of using multiple databases (for time slices of data or for different use cases in your application) can be very nice for backup and debugging purposes. You don't need to copy the entire database if you want to test only some aspects of your live app.
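As a rough illustration of the time-slice approach, the sketch below routes dates to one database file per month. It is language-neutral logic written in Python, not db4o API code; the file naming scheme and the function names are assumptions made up for this example.

```python
from datetime import date


def database_file_for(day, prefix="events"):
    """Map a date to the (hypothetical) monthly database file that holds it."""
    return f"{prefix}-{day.year:04d}-{day.month:02d}.db4o"


def files_for_range(start, end, prefix="events"):
    """List every monthly file that a query over [start, end] has to open."""
    files = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        files.append(database_file_for(date(year, month, 1), prefix))
        # Step to the next month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return files


single = database_file_for(date(2011, 3, 15))
spanning = files_for_range(date(2010, 11, 20), date(2011, 1, 5))
```

With this layout, testing one aspect of the live application only requires copying the one or two monthly files involved.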
If you still find that db4o does not scale well enough for your number of concurrent users or your database size, you can later switch to our high-end object database, Versant VOD. It was built to run in the cloud, and it has a proven track record of working for thousands of concurrent users with multi-terabyte databases. VOD for .NET also comes with a LINQ provider, so the interfaces of db4o and VOD are compatible.
My recommendation: Start with db4o. It is the easiest object database to get started with and to develop with. Just store any object with one line of code, without setting up schemas or mapping files. Use LINQ to query (or native queries, if you work with Java).
db4o is open source and it's free (under the GPL).
I'm creating a 2nd generation Social Media Platform based completely on JavaFX and db4o. We are able to do things with db4o that would be impossible with any other database. Semantic OWL ontologies, complex relationships between objects and our user-definable canvas make db4o an amazing fit for us. We have no worries about scaling either, and have found several solutions. Carl is one of the most intelligent people in software. This fact is obvious when you learn about his product.
Mike Tallent
CEO
Objectwheel

Data Mining - Predictive Analysis

We are looking at acquiring Data Mining software to primarily run predictive analysis processes.
How does the SQL Server Data Mining solution compare to other solutions such as SPSS from IBM?
Since SQL Server DM is included in the SQL Server Enterprise license, what would be the justification for spending an extra couple of 100K on separate software just to do DM?
I would look into open source options as well, including R, RapidMiner and Weka.
I would recommend checking out the Rexer survey, as it shows popularity and satisfaction measures for a variety of data mining products:
http://www.kdnuggets.com/2010/03/f-annual-rexer-analytics-data-miner-survey-results.html
Depending on what you are looking to accomplish, and obviously your budget, there are certainly some great things being done in R. Check out Rattle for R and Revolution Computing.
I am a big fan of SPSS. Unfortunately I have not used their Modeler package, but it seems like it may be worth considering. I have used SAS Enterprise Miner, and while it is powerful, I am not a big fan.
I haven't dabbled with Weka that much. I found RapidMiner to have a steep learning curve, but it does have a lot of capability.
If you want to keep everything in the Microsoft stack, check out www.predixionsoftware.com, which is planning the release of a disruptive Excel add-in as an update to the current MS DM add-ins.
You might want to give KNIME a try before paying for something else. It works well with databases and is excellent for exploratory analysis.
I would suggest checking out open-source data mining software. There are some very good open-source packages that are free.
I would start by building some data mining models in SSAS using both Multidimensional and Tabular, and then get a Google Analytics account. I built a social networking website that members had to join, used Google Analytics to start building reporting dashboards, and have probably built nearly a thousand. That is a good starting point. R is good. Omniture used to be the top dog, but Adobe bought them. There are also ClickTracks, QlikView, Sisense, Tableau and Actuate. However, I would wait and see how the product Microsoft releases turns out. Chances are it will set itself apart, as they have in the BI market, having shot up to 2nd in market share and 1st in growth in the database market.

Which NonSQL database is production ready?

With the rise of non-SQL database usage on high-traffic websites, I'm interested in using one for my project. I've heard several names, such as Voldemort, MongoDB and CouchDB. But which of these NonSQL databases are production ready? I've looked at the download pages, and it seems that none of them is production ready, because none has reached version 1.0 yet. Are there any others besides these 3 that can be recommended for production use?
What do you mean by production ready? As far as I know, all of them are being used on live systems.
You should make your choice based on how the features they provide fit your needs.
You can also add Tokyo Cabinet to the list as well as the mnesia database provided by the Erlang VM.
I think you need to start out from your project requirements to see what kind of database you really need. There are many non-relational DBMSs out there, and they differ a lot in what kind of problems they are good at solving. I think the article Should you go Beyond Relational Databases? by Martin Kleppmann is a good starting point for finding out what you need. There are also a lot of Stack Overflow threads on similar topics; these are my favorites:
The Next-gen Databases
Non-Relational Database Design
When shouldn’t you use a relational database?
Good reasons NOT to use a relational database?
When you have narrowed down what you actually need you can take a deeper look into the alternatives to see which DBMS are production ready for your use case. Production readiness isn't a yes/no thing: people may successfully deploy some solution that for example lacks in tool support - in another project this could be a no-go.
As for version numbers, different projects have different takes on this, so you can't just compare the version numbers. I'm involved in the graph database project Neo4j, and even though it has been in production use for 5+ years by now, we still haven't released a version 1.0 final yet.
I'm tempted to answer "use SIRA_PRISE".
It's definitely non-SQL.
And its current version is 1.2, meaning that someone like you must definitely assume it's "production-ready".
But perhaps I shouldn't be answering at all.
Nice article comparing rdbms with 'next gen' and listing some providers:
Is the Relational Database Doomed?
http://readwrite.com/2009/02/12/is-the-relational-database-doomed
I would suggest using ArangoDB.
ArangoDB is a multi-model mostly-memory database with a flexible data model for documents and graphs. It is designed as a “general purpose database”, offering all the features you typically need for modern web applications.
ArangoDB is supposed to grow with the application—the project may start as a simple single-server prototype, nothing you couldn’t do with a relational database equally well. After some time, some geo-location features are needed and a shopping cart requires transactions. ArangoDB’s graph data model is useful for the recommendation system. The smartphone app needs a lean API to the back-end—this is where Foxx, ArangoDB’s integrated Javascript application framework, comes into play.
Another unique feature is ArangoDB’s query language AQL — it makes querying powerful and convenient. AQL enables you to describe complex filter conditions and joins in a readable format, much in the same way as SQL.
You can model your data in several ways:
in key/value pairs
as collections of documents
as graphs with nodes, edges, and properties for both
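The three shapes can be pictured with plain Python data structures, as in the sketch below. This is only an analogy and not ArangoDB client code; the collection contents and the helper function are invented for illustration, and the `_from`/`_to` values are simplified to bare keys.

```python
# Key/value: an opaque value stored under a unique key.
key_value = {"session:42": "logged-in"}

# Documents: a collection of records, each with its own fields.
products = {
    "p1": {"name": "City bike", "price": 499},
    "p2": {"name": "Helmet", "price": 59},
}

# Graph: edges connect documents and can carry properties of their own.
edges = [
    {"_from": "p1", "_to": "p2", "type": "often_bought_with", "weight": 0.8},
]


def neighbours(start, edge_type):
    """Follow outgoing edges of one type and return the target product names."""
    return [
        products[edge["_to"]]["name"]
        for edge in edges
        if edge["_from"] == start and edge["type"] == edge_type
    ]


recommended = neighbours("p1", "often_bought_with")
```

The same product documents serve both the document view and the graph view, which is the point of a multi-model store.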
You can access data in ArangoDB:
using the general HTTP REST API via curl/wget, or your browser
via the ArangoDB shell (“arangosh”)
using a programming language specific client library
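As a sketch of the first option, the snippet below prepares an AQL query for the HTTP cursor endpoint (`POST /_api/cursor`) using only the Python standard library. The server address, the `products` collection and the `color` attribute are assumptions for illustration. Authentication and database selection are left out, and the request is only built, not sent.

```python
import json
from urllib.request import Request


def build_aql_request(base_url, query, bind_vars):
    """Prepare (but do not send) a POST request to the cursor endpoint."""
    body = json.dumps({"query": query, "bindVars": bind_vars}).encode("utf-8")
    return Request(
        url=f"{base_url}/_api/cursor",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_aql_request(
    "http://localhost:8529",  # assumed local server address
    "FOR p IN products FILTER p.color == @color RETURN p",  # hypothetical collection
    {"color": "red"},
)
# Passing `request` to urllib.request.urlopen() would send it to a running server.
```

Passing the value through `bindVars` rather than pasting it into the query string keeps the query text fixed and avoids quoting problems.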
Server requirements for ArangoDB:
ArangoDB runs on Linux, OS X and Microsoft Windows.
It runs on 32bit and 64bit systems, though using a 32bit system will limit you to using only approximately 2 to 3 GB of data with ArangoDB.
