I have been working on a project with the OrientDB graph database. I managed to fill the database and run my queries against it without problems. But then I needed to run my queries using OrientDB's distributed feature, and I ran into an important (maybe trivial) question.
I've managed to use the distributed mode, also without problems, across three different machines, but I wanted to be sure that OrientDB is really storing my database across those three machines. Is there any way to check that?
While researching this, I came to the conclusion that OrientDB replicates the entire database across all the machines; is that correct? My goal in using the distributed architecture was to improve performance, but if OrientDB works by replication and I run a query on a specific machine, will the query be processed using all the machines, or only one?
In short: when using distributed mode, does OrientDB distribute the vertices and edges across the machines, and does it process queries using all the machines?
I've read the entire documentation (http://orientdb.com/docs/2.0/orientdb.wiki/Distributed-Architecture.html) and could not find a clear explanation for these questions.
Thanks in advance!
OrientDB, by default, replicates the entire DB on all the servers. What you're looking for is called "sharding". OrientDB supports manual sharding (automatic sharding is planned for the future), which means you (the application) decide where to store the vertices/edges.
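For a concrete feel, here is a minimal sketch of manual sharding from Python using the pyorient driver; the database, credentials, class, and cluster names are assumptions for illustration:

    # Minimal sketch of OrientDB manual sharding via pyorient.
    # Database name, credentials, class, and cluster names are made up.
    import pyorient

    client = pyorient.OrientDB("localhost", 2424)
    client.connect("root", "root_passwd")
    client.db_open("mygraph", "admin", "admin")

    # Add a dedicated cluster to the Person class; in a distributed setup,
    # clusters can be assigned to specific servers in the configuration.
    client.command("ALTER CLASS Person ADDCLUSTER person_europe")

    # The application picks the shard by inserting into that cluster.
    client.command("INSERT INTO Person CLUSTER person_europe SET name = 'Luca'")

    # Queries against the class still span all of its clusters.
    results = client.command("SELECT FROM Person WHERE name = 'Luca'")

Which servers host which clusters is controlled by the distributed configuration, so the application-level choice of cluster is what determines placement.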
I'm looking for a portable database solution I can use with a website that is designed to handle service outages. I need to retrieve a list of users from SQL Server nightly and upsert their details into a portable database. It's roughly 250,000 users (and growing), and each one has about 25 required fields. Of those fields, I'd say fewer than 5 need to be searched on; the rest just need retrieving.
The idea is that, in times of a service outage, we can use a website that's designed to work from the portable database rather than SQL Server. Our long-term goal is to move to the cloud and handle things in an entirely different way, but for the short term this is our aim.
The website is going to be a .NET Core web API, so it will be accessed by multiple users on multiple threads. The website will only ever need read access; it will not be updating these details whatsoever.
To keep the portable database up to date, I'm thinking of having another application that runs nightly to update the data. Our business runs 24 hours (albeit quieter overnight), so there is a chance this updater is running while the website is in use. While a service outage would suggest the SQL Server is down, this may not be the case; there are other factors in play that could cause what we would describe as outages. This will be the only piece of software updating the database.
I've tried using LiteDB, but I couldn't get it working in a way that met my concurrency requirements. It did seem to do some of the job, and it was easy to get running. However, I'd often run into locked files due to the nature of a web API. I did work out a solution for that, but then the updater app couldn't access the database file.
Does anyone have any recommendations I can look into?
Given the description of the problem (1 table, 250k rows with, I assume, a relatively fast growth rate) and your requirements, I don't think a relational database is what you are looking for.
I think NoSQL databases, or, more specifically, document-oriented databases, are better suited to your requirements. There are many choices: Mongo, Cassandra, CouchDB, ... the choice is yours.
Personally, I have some experience with ElasticSearch (https://www.elastic.co/elasticsearch), which is quite easy to learn, is portable (it runs on Linux, Windows, containers, etc.), is scalable, and is fast. I mean really, really fast: you can get results in 10-20 milliseconds (sometimes even less).
The NEST NuGet package acts as a high-level .NET client for working with ElasticSearch (https://www.elastic.co/guide/en/elasticsearch/client/net-api/7.x/nest-getting-started.html).
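To give a feel for the indexing and search model (the same operations NEST wraps for .NET), here is a minimal sketch using the official Python client; the index name, fields, and local URL are assumptions:

    # Sketch of the nightly-upsert / read-only-search workflow with
    # ElasticSearch. Index name, fields, and URL are assumptions.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Nightly updater: index (upsert) one document per user, keyed by the
    # SQL Server id so reruns overwrite rather than duplicate.
    es.index(index="users", id="12345", document={
        "name": "Jane Smith",
        "email": "jane@example.com",
        "active": True,
        # ...the other ~20 retrieve-only fields just ride along...
    })

    # Read-only website: fetch by id, or search the few queryable fields.
    doc = es.get(index="users", id="12345")
    hits = es.search(index="users", query={"match": {"name": "jane"}})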
I have five computers networked together: one master and four slaves.
Each slave has its own set of data (a very big integer matrix). I want to run four different clustering programs on the four different slaves, then bring the results back to the master for further processing (such as visualization).
I initially thought of using Hadoop, but I cannot find any nice way to map this problem (specifically the output results) onto the MapReduce framework.
Is there a nice open-source distributed computing framework with which I can perform this task easily?
Thanks in advance.
You should use YARN to manage multiple clusters or resources.
YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.
Reference
It seems that you have already stored the data on each of the nodes, so you have already solved the "distributed storage" element of the problem.
Since each node's dataset is different, this isn't a parallel processing problem either.
It seems to me that you don't need Hadoop or any other big data framework. However, you can embrace the philosophy of Hadoop by taking the code to the data: run the clustering algorithm on each node, then handle the results in whatever way you need. A caveat would be if you also have a problem loading the data or running the clustering algorithm on a node, but that is a different problem.
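As a minimal sketch of that pattern, assuming each slave can run Python with NumPy/scikit-learn and the master can reach the slaves over SSH (all hostnames and paths are made up):

    # cluster_local.py -- runs on EACH SLAVE against its own local matrix.
    import numpy as np
    from sklearn.cluster import KMeans

    matrix = np.load("/data/local_matrix.npy")       # the slave's own data
    labels = KMeans(n_clusters=8, n_init=10).fit_predict(matrix)
    np.save("/data/cluster_labels.npy", labels)      # result stays on the slave

And on the master, something like:

    # collect.py -- runs on the MASTER: launch the job on every slave over
    # SSH, then copy each result back for visualization.
    import subprocess

    slaves = ["slave1", "slave2", "slave3", "slave4"]

    for host in slaves:
        subprocess.run(["ssh", host, "python3 cluster_local.py"], check=True)

    for host in slaves:
        subprocess.run(
            ["scp", f"{host}:/data/cluster_labels.npy", f"labels_{host}.npy"],
            check=True,
        )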
I am developing an analytics tool similar to Google Analytics that will store keywords, visits, and pages in a database.
So the database can grow very quickly, because I want many people to use it.
How should I set up the database? One database for all the accounts and all the monitored websites, or one database per account?
Also, I am planning to start with one dedicated server, but I'm sure I will need more than one server in the future, so I have to build it with that in mind.
I also know that if I go with one database per account, I will have to run upgrade scripts on all of them whenever the app's schema changes.
What kind of database do you plan to use? There is a BIG difference between relational (PostgreSQL, MySQL) and "NoSQL" (MongoDB, CouchDB) databases.
I'm only going to talk about PostgreSQL on the relational side since it's the only database I have experience with.
First, I would keep everything in one database. There's no benefit in using a database per account.
Second, you should be absolutely sure you WILL outgrow a single machine. Given the kind of application, you'll be dealing with a lot more writes than reads, so master-slave replication will only give you high availability, and multi-master replication with PostgreSQL is NOT easy.
From my last research, the least painful way to do that was to use a tool like Postgres-XC, which is designed to be write-scalable, but I have no idea how production-ready it is.
Another solution is using tools like Bucardo or SkyTools. I have no experience with SkyTools, but I had a lot of trouble getting Bucardo to work last year.
The last solution is sharding. The naive way to shard is to compute something like shard_number = id % 10; however, with this scheme you would need to rebalance your cluster whenever you add or remove a shard. It also requires that you make your application "shard-aware" so that it directs queries to the correct shard.
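A minimal sketch of what that shard-aware routing might look like in Python with psycopg2; the connection strings, table, and columns are made up:

    # Naive "shard-aware" routing as described above. Connection strings,
    # table, and columns are hypothetical.
    import psycopg2

    N_SHARDS = 10
    SHARD_DSNS = [f"dbname=analytics_{i} host=shard{i}" for i in range(N_SHARDS)]

    def shard_for(account_id):
        # Naive modulo sharding: adding/removing a shard changes almost
        # every mapping, which is exactly the rebalancing problem above.
        return account_id % N_SHARDS

    def fetch_visits(account_id):
        conn = psycopg2.connect(SHARD_DSNS[shard_for(account_id)])
        with conn, conn.cursor() as cur:
            cur.execute(
                "SELECT page, visited_at FROM visits WHERE account_id = %s",
                (account_id,),
            )
            return cur.fetchall()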
Anyway like I said before, make sure you will NEED to shard/clusterize first.
Now for the "NoSQL" side: I have no experience with any of the solutions, but I do know that MongoDB and CouchDB handle sharding themselves, so it's much easier with those solutions; however, you give up quite a lot.
I'll expand a bit on Vincent's answer.
As for sharding, we have had good experience with PL/Proxy, and with sharding you can outgrow a single machine without issues (for reads or writes).
As for replication, Londiste from SkyTools is very easy to set up and use, and with it you get PgQ, quite a nice messaging solution for Postgres.
I am interested in knowing a little more about databases than I currently do. I know how to set up a database backend for any webapp I happen to be creating, but that is all. For example, if I were creating three different apps, I would simply create three different databases and then configure each database for its particular app. This is all simple knowledge, and I would now like to have a deeper understanding of how databases actually work.
Let's say I developed an application that needed a lot of space and processing power, so the database had to be spread over numerous machines. How exactly would a database be spread across numerous machines and still be able to write records and then retrieve them? Would each table get its own machine, and what software is needed to make sure the different machines have all performed their transactions successfully?
As you can see, I am quite a database ignoramus, lol.
Any help in clearing this up would be greatly appreciated.
I don't know what RDBMS you're using but I have two book suggestions.
For theory (which should come first, in my opinion): Database in Depth: Relational Theory for Practitioners
For implementation: High Performance MySQL: Optimization, Backups, Replication, and More
I own both these books and they are both pretty great, especially the first one.
That's quite a broad topic... You might want to start with Multi-master replication, High-availability clustering and Massively parallel processing.
If you want to know how to keep a database running under ever-increasing load, that's not a basic question. Several well-known web companies are struggling to find the right way to make their databases scale.
Using memcached to cache database information is one way to decrease the load on your database if your application is read-intensive. If your application is write-intensive, then you may want to consider using a NoSQL datastore like MongoDB or Redis.
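As a rough sketch of the cache-aside pattern with memcached, using the pymemcache library; query_db is a hypothetical stand-in for your real database query:

    # Cache-aside sketch: check memcached first, fall back to the database,
    # then populate the cache. query_db is a hypothetical placeholder.
    import json
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def query_db(user_id):
        # Placeholder for the real (expensive) database query.
        return {"id": user_id, "name": "example"}

    def get_user(user_id):
        key = "user:%d" % user_id
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)                 # cache hit
        user = query_db(user_id)                      # cache miss
        cache.set(key, json.dumps(user), expire=300)  # keep for 5 minutes
        return user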
Database Design for Mere Mortals
This is the best book on the subject if you don't have any experience with databases. It has historical background and practical examples. Most books skip the historical stuff because they assume you know what a DB is, or think it doesn't matter, and jump right to the practical. This book gives you the complete picture.
With the rise of non-SQL databases on high-traffic websites, I'm interested in using one for my project. I've heard several names like Voldemort, MongoDB and CouchDB, but which of these NoSQL databases are production-ready? I've looked at the download pages, and it seems that none of them is production-ready because none has reached version 1.0 yet. Are there any other names, besides these three, that are recommendable for use in production?
What do you mean by production ready? As far as I know, all of them are being used on live systems.
You should make your choice based on how the features they provide fit your needs.
You can also add Tokyo Cabinet to the list, as well as the Mnesia database provided by the Erlang VM.
I think you need to start from your project's requirements to see what kind of database you really need. There are many non-relational DBMSs out there, and they differ a lot in what kinds of problems they are good at solving. I think the article Should you go Beyond Relational Databases? by Martin Kleppmann is a good starting point for finding out what you need. There are also a lot of Stack Overflow threads on similar topics; these are my favorites:
The Next-gen Databases
Non-Relational Database Design
When shouldn't you use a relational database?
Good reasons NOT to use a relational database?
When you have narrowed down what you actually need, you can take a deeper look into the alternatives to see which DBMSs are production-ready for your use case. Production readiness isn't a yes/no thing: people may successfully deploy a solution that, for example, lacks tool support, while in another project that could be a no-go.
As for version numbers, different projects take different approaches, so you can't just compare version numbers. I'm involved in the graph database project Neo4j, and even though it has been in production use for 5+ years by now, we still haven't released a 1.0 final yet.
I'm tempted to answer "use SIRA_PRISE".
It's definitely non-SQL.
And its current version is 1.2, meaning that someone like you must definitely assume it's "production-ready".
But perhaps I shouldn't be answering at all.
A nice article comparing RDBMSs with "next gen" databases and listing some providers:
Is the Relational Database Doomed?
http://readwrite.com/2009/02/12/is-the-relational-database-doomed
I suggest you use ArangoDB.
ArangoDB is a multi-model, mostly-memory database with a flexible data model for documents and graphs. It is designed as a "general purpose database", offering all the features you typically need for modern web applications.
ArangoDB is supposed to grow with the application—the project may start as a simple single-server prototype, nothing you couldn’t do with a relational database equally well. After some time, some geo-location features are needed and a shopping cart requires transactions. ArangoDB’s graph data model is useful for the recommendation system. The smartphone app needs a lean API to the back-end—this is where Foxx, ArangoDB’s integrated Javascript application framework, comes into play.
Another unique feature is ArangoDB’s query language AQL — it makes querying powerful and convenient. AQL enables you to describe complex filter conditions and joins in a readable format, much in the same way as SQL.
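To give a taste, here is a small sketch of an AQL filter plus a join-like lookup, run from Python with the python-arango driver; the database, credentials, and collections are assumptions:

    # Sketch of an AQL query via the python-arango driver. Database name,
    # credentials, collections, and fields are made up.
    from arango import ArangoClient

    client = ArangoClient(hosts="http://localhost:8529")
    db = client.db("shop", username="root", password="passwd")

    # A filter plus a join-like lookup across two collections.
    cursor = db.aql.execute(
        """
        FOR o IN orders
          FILTER o.total > @min_total
          FOR c IN customers
            FILTER c._key == o.customer_key
            RETURN { customer: c.name, total: o.total }
        """,
        bind_vars={"min_total": 100},
    )
    for row in cursor:
        print(row)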
You can model your data in several ways:
in key/value pairs
as collections of documents
as graphs with nodes, edges, and properties for both
You can access data in ArangoDB in several ways (see the sketch after this list):
using the general HTTP REST API via curl/wget, or your browser
via the ArangoDB shell (“arangosh”)
using a programming language specific client library
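For example, a minimal sketch of the HTTP REST API using Python's requests library; the URL, database, collection, and key are assumptions, and the exact paths vary between ArangoDB versions:

    # Sketch of ArangoDB's document REST API via requests. URL, database,
    # collection, and key are hypothetical.
    import requests

    base = "http://localhost:8529/_db/shop/_api"
    auth = ("root", "passwd")

    # Create a document in the "customers" collection...
    resp = requests.post(f"{base}/document/customers",
                         json={"_key": "jane", "name": "Jane"},
                         auth=auth)
    resp.raise_for_status()

    # ...and fetch it back by key.
    doc = requests.get(f"{base}/document/customers/jane", auth=auth).json()
    print(doc["name"])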
Server requirements for ArangoDB:
ArangoDB runs on Linux, OS X and Microsoft Windows.
It runs on 32-bit and 64-bit systems, though a 32-bit system will limit you to roughly 2 to 3 GB of data with ArangoDB.