As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I would like to remove the SQL dependency for small chunks of data that I load on (almost) every request in a web application. Most of the data is key-value/document structured, but a relational solution is not excluded. The data is not too big, so I want to keep it in memory for higher availability.
What solution would you recommend?
The simplest and most widely used in-memory key-value store is Memcached. The introduction page reiterates what you are asking for:
Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.
The client list is impressive, it has been around for a long time, and the documentation is good. It has an API for almost every programming language, and horizontal scaling is pretty simple. In my experience Memcached is good.
You may also want to look into MemBase.
Redis is perfect for this kind of data. It also supports some fundamental data structures and provides operations on them.
I recently converted my Django forum app to use it for all real-time/tracking data - it's so good to no longer have the icky feeling you get when you do this kind of stuff (SET views = views + 1 and other writes on every page view) with a relational database.
Here's an example of using Redis to store data required for user activity tracking, including keeping an ordered set of last seen users up to date, in Python:
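import datetime
import time

def seen_user(user, doing, item=None):
    """
    Stores what a User was doing when they were last seen and updates
    their last seen time in the active users sorted set.
    """
    last_seen = int(time.mktime(datetime.datetime.now().timetuple()))
    # Older redis-py signature; redis-py 3.0+ expects a mapping instead:
    # redis.zadd(ACTIVE_USERS, {user.pk: last_seen})
    redis.zadd(ACTIVE_USERS, user.pk, last_seen)
    # setnx only writes the username if the key does not exist yet
    redis.setnx(USER_USERNAME % user.pk, user.username)
    redis.set(USER_LAST_SEEN % user.pk, last_seen)
    if item:
        # Three values need three placeholders: activity, link URL, link text
        doing = '%s <a href="%s">%s</a>' % (
            doing, item.get_absolute_url(), escape(str(item)))
    redis.set(USER_DOING % user.pk, doing)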
If you don't mind the SQL but want to keep the db in memory, you might want to check out SQLite (see http://www.sqlite.org/inmemorydb.html).
If you don't want the SQL and you really only have key-value pairs, why not just store them in a map / hash / associative array and be done with it?
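Both suggestions can be sketched with nothing but the Python standard library. This is only an illustration: the `kv` table, the `settings` dict and the `lookup` helper are made-up names, and a real web application would also have to decide how the data is shared between worker processes.

```python
import sqlite3

# Option 1: a plain dict is enough for pure key-value lookups.
settings = {"site_name": "Example", "items_per_page": "20"}

# Option 2: an in-memory SQLite database keeps SQL available.
# ":memory:" creates a database that lives only as long as the connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)")
conn.executemany("INSERT INTO kv (key, value) VALUES (?, ?)", settings.items())


def lookup(key, default=None):
    """Return the value stored under key, or default if it is missing."""
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return row[0] if row else default
```

The dict is the simpler choice when every access is by exact key; the SQLite variant only pays off if you still want to filter or join the data.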
If you end up needing an in-memory database, H2 is a very good option.
One more database to consider: Berkeley DB. Berkeley DB allows you to configure the database to be in-memory, on-disk or both. It supports both a key-value (NoSQL) and a SQL API. Berkeley DB is often used in combination with web applications because it's embedded, easily deployed (it deploys with your application), highly configurable and very reliable. There are several e-Retail web sites that rely on Berkeley DB for their e-Commerce applications, including Amazon.com.
I'm not sure this is what you are looking for, but you should look into a caching framework (something that may be included in the tools you are using now). With a repository pattern you ask for the data, and the repository checks by key whether it is in the cache. If it isn't, you fetch it from the database; if it is, you fetch it from the cache.
It will depend on what kind of data you are handling, so it's up to you to decide how long to keep data in the cache. Perhaps a sliding timeout is best, as you'll keep the data as long as the key keeps being requested. That means if the cache has data for a user, the data will expire from the cache once the user goes away.
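The cache-aside lookup with a sliding timeout described above could look roughly like this. It is a sketch, not any particular framework's API: `SlidingCache`, `get_or_load` and the `loader` callback are hypothetical names, and the injectable `clock` exists only to make the behaviour easy to test.

```python
import time


class SlidingCache:
    """Cache-aside helper whose entries expire after `ttl` idle seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            value = entry[0]      # cache hit
        else:
            value = loader(key)   # miss or expired: go to the database
        # Sliding timeout: every access pushes the expiry forward.
        self._entries[key] = (value, now + self.ttl)
        return value
```

Note that this sketch never evicts idle entries on its own; an expired entry is only replaced when its key is requested again, so a long-running process would want a periodic sweep as well.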
Can you shard this data? Is data access pattern simple and stable (does not change with changing business requirements)? How critical is this data (session context, for example, is not too hard to restore, whereas some preferences a user has entered on a settings page should not be lost)?
Typically, provided you can shard and your data access patterns are simple and do not mutate too much, you choose Redis. If you look for something more reliable and supporting more advanced data access patterns, Tarantool is a good option.
Please check this out:
http://www.mongodb.org/
It's a really good NoSQL database with drivers and support for all major languages.
I am really interested in non-relational databases, but for various reasons I am familiar with only a small part of them. So I would like to list the NoSQL technologies you use, with basic use cases, pros and cons.
If you have run into specific issues while working with some of these technologies, or have interesting experience, you are welcome to share it with the community.
Personally I worked with:
MongoDB:
Use cases: In my opinion it is one of the best choices if you need good aggregation features and automatic replication. It scales well. It has many features that make it usable as an everyday database, so if for some reason you don't want to use an SQL solution, Mongo could be a great choice. Mongo is also great if you need dynamic queries, and it supports indexing, which is an important feature too.
Pros: Fast, scales well, easy to use, built-in geospatial indexes.
Cons: Comparatively slow write operations; blocking atomic operations can cause a lot of problems. The memory-hungry process can "eat" all available memory.
CouchDB:
Use cases: I used it in a wiki-like project, and I think it is the perfect database for such cases. Each document is automatically saved as a new revision on update, which makes it easy to see all the changes. Good for accumulating, occasionally changing data on which predefined queries are to be run.
Pros: Easy to use, REST-oriented interface, versions.
Cons: Performance problems when the number of documents gets large (more than half a million); somewhat poor query features (this can be solved by adding Lucene).
SimpleDB:
Use cases: This is a data service from Amazon, the cheapest of everything they provide. It is very limited in features, so the main use case is when you want to use an Amazon service while paying as little as possible.
Pros: Cheap; all data is stored as text, so it is simple to operate and easy to use.
Cons: Many limitations (document size, collection size, attribute count, attribute size). Storing all data as text also creates problems when sorting by date or by number, because it uses lexicographical sorting, which needs a workaround when saving dates or numbers.
Cassandra:
Use cases: Cassandra is a perfect solution if writing is your main goal. It is designed for heavy writes (in some cases writing can be faster than reading), so it is perfect for logging. It is also very useful for data analysis. Besides that, Cassandra has built-in geographical distribution features.
Pros: Supported by Apache (good community and high quality), fast writes, no single point of failure. Easy to manage at scale (easy to deploy and to enlarge a cluster).
Cons: The index implementation has problems: querying by index has some limitations, and insert performance decreases if you use indexes. Problems with streaming data transfer.
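The lexicographic-sorting workaround mentioned for SimpleDB usually comes down to encoding values so that text order matches natural order. A minimal sketch, independent of any Amazon SDK; `WIDTH`, `encode_int` and `encode_date` are illustrative names, and the padding width is an assumption you would pick for your own data:

```python
from datetime import datetime, timezone

WIDTH = 10  # assumed maximum number of digits; pick one that fits your data


def encode_int(n):
    """Zero-pad a non-negative integer so string order matches numeric order."""
    if n < 0:
        raise ValueError("negative numbers need an offset before padding")
    return str(n).zfill(WIDTH)


def encode_date(dt):
    """ISO 8601 in UTC sorts chronologically when compared as text."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


raw = ["9", "10", "100"]
padded = [encode_int(int(s)) for s in raw]
```

Sorting `raw` as text puts "9" last, while sorting `padded` gives the numeric order. Dates work the same way as long as every value is stored in one fixed format and time zone.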
I'm gathering information for an upcoming massive online game. I have experience with MEGA MASSIVE farm-like games (millions of DAU), where SQL databases were a great solution. I also worked on a massive online game where a NoSQL db was used, and this particular db (Mongo) was not the best fit: it behaved badly with lots of connections and lots of concurrent writes.
I'm looking for facts, benchmarks, and presentations about modern massive online games and technical details about their backend infrastructure, databases in particular.
For example I'm interested in:
Can it manage thousands of connections? Maybe some external tool can help (like pgbouncer for postgres).
Can it manage tens of thousands of concurrent read-writes?
What about disk space fragmentation? Can it be optimized without stopping database?
What about some smart replication? Can it tell that some data is missing from a replica when the master fails? Can I safely promote a slave to master, know exactly what data is missing, and act appropriately?
Can it fail gracefully? (like postgres, for example)
Good reviews from using in production
Start with the premise that hard crashes are exceedingly rare, and when they occur
it won't be a tragedy if some information is lost.
Use of the database shouldn't be strongly coupled to the routine management of the
game. Routine events ought to be managed through more ephemeral storage. Some
secondary process should organize ephemeral events for eventual storage in a database.
At the extreme, you could imagine there being just one database read and one database
write per character per session.
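At that extreme, the idea of one read and one write per character per session might be sketched like this. Everything here is hypothetical (the `characters` table, the `gold` column, the `SessionBuffer` class), and SQLite merely stands in for whatever database the game uses:

```python
import sqlite3


class SessionBuffer:
    """Keeps per-character state in memory and persists it once per session."""

    def __init__(self, conn):
        self.conn = conn
        self.live = {}  # character_id -> state dict (ephemeral)

    def login(self, character_id):
        row = self.conn.execute(
            "SELECT gold FROM characters WHERE id = ?", (character_id,)
        ).fetchone()  # the single read
        self.live[character_id] = {"gold": row[0] if row else 0}

    def earn(self, character_id, amount):
        self.live[character_id]["gold"] += amount  # no database traffic

    def logout(self, character_id):
        state = self.live.pop(character_id)
        with self.conn:  # the single write, committed as one transaction
            self.conn.execute(
                "INSERT OR REPLACE INTO characters (id, gold) VALUES (?, ?)",
                (character_id, state["gold"]),
            )
```

The trade-off is exactly the one stated in the premise: if the process dies mid-session, whatever happened since login is lost, so a real system would probably flush periodically as well.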
Have you considered NoSQL?
NoSQL database systems are often highly optimized for retrieval and
appending operations and often offer little functionality beyond
record storage (e.g. key–value stores). The reduced run-time
flexibility compared to full SQL systems is compensated by marked
gains in scalability and performance for certain data models.
In short, NoSQL database management systems are useful when working
with a huge quantity of data when the data's nature does not require a
relational model. The data can be structured, but NoSQL is used when
what really matters is the ability to store and retrieve great
quantities of data, not the relationships between the elements. Usage
examples might be to store millions of key–value pairs in one or a few
associative arrays or to store millions of data records. This
organization is particularly useful for statistical or real-time
analyses of growing lists of elements (such as Twitter posts or the
Internet server logs from a large group of users).
There are higher-level NoSQL solutions, for example CouchDB, which has built-in replication support.
When a web app queries some information frequently, how can I improve performance by caching the query results?
(The information is something like the top news on a website; my database is SQL Server 2008 and the application runs on Tomcat.)
I can suggest the following:
In your database you can use indexed views; please check: How to mimick Oracle Materialized Views on MS SQL Server?.
If you use JPA or Hibernate, it can cache entities (objects).
http://docs.jboss.org/hibernate/orm/3.3/reference/en/html/performance.html#performance-cache
http://en.wikibooks.org/wiki/Java_Persistence/Caching
If you're looking for a cache system that is external to the database and the ORM, you could look at Memcached or Ehcache.
http://memcached.org/
http://ehcache.org/
Another option, though not recommended, is to manage a cache in your application. For example, you can store the list of countries in the ServletContext (also known as the application context), but you then need to implement the cache logic yourself (updating, deleting and inserting objects), and you need to be careful with heap memory.
You can use a combination of the above strategies; it depends on the context of your business.
Best regards,
Ernesto.
This is a pretty general question and, as you'd expect, there are many options.
Closest to the UI, your web platform might have 'content caching.' ASP.NET, for example, will cache portions of a page for specified periods of time.
You could use a caching tool like memcached and cache a recordset (or whatever the stand-alone Java data structure is).
Some ORMs provide caching too.
And (probably not finally) you could define structure in your database to 'cache' results like this by running complex queries and saving the results into tables that are queried more often but are cheaper to query.
Just some ideas.
The answer for a really big site is all of the above. We do all our queries via stored procs. That helps because the query is compiled and one execution plan is reused.
We have a wicked complicated table-valued function. It's so expensive that we built a cache table. The table has the same general format as the function's output but with two extra columns. One is an expire time. The other is a search key, which is the parameters that go into the function concatenated together. Whenever we're about to query that table we run a proc to check if the data is stale. If it is, we start a transaction, delete the rows, and then run the function and insert the rows. This means we run the function maybe 2 or 3% of the times we used to, and the proc call we make to check for staleness is much cheaper. Whenever the app updates the relevant data it marks the cache rows as stale, but it doesn't delete them; we leave that to the cache check proc. Why? Well, maybe nobody will need that data right now, so fewer db hits.
Then we hit the second layer. We cache many recordsets in memcached, including all of the procs that call that function, and many more. That actually happens in our ASP layer, which we still have. ADO recordsets can be persisted to XML natively, which then goes into memcached as a string.
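The cache-table layer described above can be sketched in a few lines. This is not the poster's T-SQL; it is an illustration in Python with SQLite standing in for SQL Server, and `report_cache`, `expensive_function`, `cached_rows` and `mark_stale` are invented names.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE report_cache (search_key TEXT, expires_at REAL, payload TEXT)"
)

call_log = []  # records every real invocation of the expensive function


def expensive_function(a, b):
    """Stand-in for the costly table-valued function."""
    call_log.append((a, b))
    return ["%s-%s-%d" % (a, b, i) for i in range(3)]


def cached_rows(a, b, ttl=60.0, now=None):
    now = time.time() if now is None else now
    key = "%s|%s" % (a, b)  # parameters concatenated into one search key
    fresh = conn.execute(
        "SELECT payload FROM report_cache WHERE search_key = ? AND expires_at > ?",
        (key, now),
    ).fetchall()
    if fresh:
        return [r[0] for r in fresh]
    rows = expensive_function(a, b)
    with conn:  # delete stale rows and insert fresh ones in one transaction
        conn.execute("DELETE FROM report_cache WHERE search_key = ?", (key,))
        conn.executemany(
            "INSERT INTO report_cache VALUES (?, ?, ?)",
            [(key, now + ttl, r) for r in rows],
        )
    return rows


def mark_stale(a, b):
    """Called on writes: expire the rows but leave deletion to the next read."""
    with conn:
        conn.execute(
            "UPDATE report_cache SET expires_at = 0 WHERE search_key = ?",
            ("%s|%s" % (a, b),),
        )
```

Marking rows stale instead of deleting them keeps the write path cheap; the rebuild only happens if somebody actually asks for that key again. One gap in this sketch is that an empty result is never cached, so a real version would need a marker row for that case.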
I have used sql databases a fair bit and can see a lot of benefit in normalised databases that can be joined and searched and relationships built in them.
What are the advantages to the sort of 'object database' that google has in Appengine's datastore?
GAE's BigTable datastore is not object-oriented or even object-relational. It has more in common with a hash map than with a standard relational database like MySQL or Oracle. The main advantage is scalability and a tighter guarantee on the amount of time a query will take (sort of like CPU time). The scalability comes from the way records are distributed: if you set up your keys correctly, the data associated with those keys will be closer together physically (the data is distributed so there is no single point of failure).
As with many NoSQL databases, the main advantage of the Datastore is flexibility; nevertheless, the programmer must forget everything about traditional SQL databases.
See this article on techrepublic.com about NoSQL databases.
Data model flexibility. The programmer doesn't have to worry about mapping the object model to a relational model; just put your entities in the Datastore.
Object relationship flexibility. The Datastore supports multiple values for a single property, which lets you establish a 1-N relationship just like in object-oriented programming, e.g. by inserting a List as the value of one property.
The rest of the advantages/disadvantages come from the PaaS (Platform as a Service) model, which means you only worry about writing good code while Google takes care of the infrastructure and scalability. See PaaS on Wikipedia.
Technically it's a lot easier to program since the datastore is bundled with the SDK, and it is easier to share source code and collaborate since you're getting all components from the same vendor rather than patching together an RDBMS, a scripting engine and hosting.
Economically, the cost-effectiveness of GAE is a huge advantage since you only pay for what you use. With other services and other hosting you pay like a subscriber, while with GAE's model you pay per quota.
Programming-wise, everything is harder.
The advantages are in scalability, price, and administration. Considering that with many web-apps, programming is easier than administering/scaling/paying for it, GAE/datastore is well worth it.
I have used relational DBs a lot and decided to venture out to the other types available.
This particular product looks good and promising: http://neo4j.org/
Has anyone used graph-based databases? What are the pros and cons from a usability perspective?
Have you used these in a production environment? What was the requirement that prompted you to use them?
I used a graph database in a previous job. We weren't using neo4j, it was an in-house thing built on top of Berkeley DB, but it was similar. It was used in production (it still is).
The reason we used a graph database was that the data being stored by the system and the operations the system was doing with the data were exactly the weak spot of relational databases and were exactly the strong spot of graph databases. The system needed to store collections of objects that lack a fixed schema and are linked together by relationships. To reason about the data, the system needed to do a lot of operations that would be a couple of traversals in a graph database, but that would be quite complex queries in SQL.
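To make "a couple of traversals" concrete, here is a toy sketch in plain Python rather than any real graph database API. The nodes, relationship names and the `reachable` helper are all made up; the point is that schema-less nodes plus typed links reduce "what is connected to X within N hops" to a short breadth-first walk, where SQL would need one self-join per hop or a recursive query.

```python
from collections import deque

# Nodes carry arbitrary properties; there is no fixed schema.
nodes = {
    "alice": {"kind": "person"},
    "bob": {"kind": "person", "team": "ops"},
    "report": {"kind": "document", "pages": 12},
    "server": {"kind": "machine"},
}
# Outgoing edges as (relationship, target) pairs.
edges = {
    "alice": [("wrote", "report")],
    "report": [("mentions", "server")],
    "bob": [("maintains", "server")],
    "server": [],
}


def reachable(start, max_depth):
    """Breadth-first traversal returning {node: depth} within max_depth hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if seen[current] == max_depth:
            continue  # do not expand past the depth limit
        for _relationship, target in edges.get(current, []):
            if target not in seen:
                seen[target] = seen[current] + 1
                queue.append(target)
    return seen
```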
The main advantages of the graph model were rapid development time and flexibility. We could quickly add new functionality without impacting existing deployments. If a potential customer wanted to import some of their own data and graft it on top of our model, it could usually be done on site by the sales rep. Flexibility also helped when we were designing a new feature, saving us from trying to squeeze new data into a rigid data model.
Having a weird database let us build a lot of our other weird technologies, giving us lots of secret-sauce to distinguish our product from those of our competitors.
The main disadvantage was that we weren't using the standard relational database technology, which can be a problem when your customers are enterprisey. Our customers would ask why we couldn't just host our data on their giant Oracle clusters (our customers usually had large datacenters). One of the team actually rewrote the database layer to use Oracle (or PostgreSQL, or MySQL), but it was slightly slower than the original. At least one large enterprise even had an Oracle-only policy, but luckily Oracle bought Berkeley DB. We also had to write a lot of extra tools - we couldn't just use Crystal Reports for example.
The other disadvantage of our graph database was that we built it ourselves, which meant when we hit a problem (usually with scalability) we had to solve it ourselves. If we'd used a relational database, the vendor would have already solved the problem ten years ago.
If you're building a product for enterprisey customers and your data fits into the relational model, use a relational database if you can. If your application doesn't fit the relational model but it does fit the graph model, use a graph database. If it only fits something else, use that.
If your application doesn't need to fit into the current blub architecture, use a graph database, or CouchDB, or BigTable, or whatever fits your app and you think is cool. It might give you an advantage, and it's fun to try new things.
Whatever you choose, try not to build the database engine yourself unless you really like building database engines.
We've been working with the Neo team for over a year now and have been very happy. We model scholarly artifacts and their relationships, which is spot on for a graph db, and run recommendation algorithms over the network.
If you are already working in Java, I think that modeling using Neo4j is very straightforward, and it has the flattest / fastest R/W performance of any of the solutions we tried.
To be honest, I have a hard time not thinking in terms of a Graph/Network because it's so much easier than designing convoluted table structures to hold object properties and relationships.
That being said, we do store some information in MySQL simply because it's easier for the Business side to run quick SQL queries against. To perform the same functions with Neo we would need to write code that we simply don't have the bandwidth for right now. As soon as we do though, I'm moving all that data to Neo!
Good luck.
Two points:
First, on the data I've been working with for the past 5 years in SQL Server, I've recently hit the scalability wall with SQL for the type of queries we need to run (nested relationships...you know...graphs). I've been playing around with neo4j, and my lookup times are several orders of magnitude faster when I need this kind of lookup.
Second, to the point that graph databases are outdated. Um...no. Early on, as people were trying to figure out how to store and look up data efficiently, they created and played with graph and network style database models. These were designed so the physical model reflected the logical model, so their efficiency wasn't that great. This type of data structure was good for semi-structured data, but not as good for structured dense data. So, this IBM dude named Codd was researching efficient ways to arrange and store structured data and came up with the idea for the relational database model. And it was good, and people were happy.
What do we have here? Two tools for two different purposes. Graph database models are very good for representing semi-structured data and the relationships between entities (that may or may not exist). Relational databases are good for structured data that has a very static schema, and where join depths do not go very deep. One is good for one kind of data, the other is good for other kinds of data.
To borrow a phrase, there is no silver bullet. It's very short-sighted to say that graph database models are out of date and that using one gives up 40 years of progress. That's like saying using C is giving up all the technological progress we've gone through to get things like Java and C#. That's not true though. C is a tool that is needed for certain tasks. And Java is a tool for other tasks.
I've been using MySQL for years to manage engineering data, and it worked well, but one of the problems we had (but didn't realise we had) was that we always had to plan the schema up-front. Another problem we knew we had was mapping the data up to domain objects and back.
Now we've just started trying out neo4j and it looks like it is solving both problems for us. The ability to add different properties to each node (and relation) has allowed us to re-think our entire approach to data. It is like dynamic versus static languages (Ruby versus Java), but for databases. Building the data model in the database can be done in a much more agile and dynamic way, and that is dramatically simplifying our code.
And since the object model in code is generally a graph structure, mapping from the database is also simpler, with less code and consequently fewer bugs.
And as an additional bonus, our initial prototype code for loading our data into neo4j is actually performing faster than the previous MySQL version. I have no solid numbers on this (yet), but that was a nice additional feature.
But at the end of the day, the choice probably should be based mostly on the nature of your domain model. Does it map better to tables or graphs? Decide by doing some prototypes, load the data and play with it. Use neoclipse to look at different views of the data. Once you've done that, hopefully you know if you're on to a good thing or not.
Here is a good article that talks about the needs that non relational databases fill: http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php
It does a good job of pointing out (aside from the name) that relational databases aren't flawed or wrong; it's just that these days people are starting to process more and more data in mainstream software and web sites, and relational databases just won't scale for these needs.
I am building an intranet at my company.
I am interested in understanding how to take data that is stored in tables (Oracle, MySQL, SQL Server, Excel, Access, various random lists) and load it into Neo4j, or some other graph database. Specifically, what happens when common data overlaps existing data already in the system?
Yes, I know some data is best modeled in RDBMS, but I have this idea itching me, that when you need to superimpose several distinct tables, the graph model is better than the table structure.
For instance, I work in a manufacturing environment. There is a major project we are working on and, because of the complexity, each department has created a separate Excel spreadsheet that has a BOM (Bill Of Materials) hierarchy in a column on the left and then several columns of notes and checks made by the individuals who made these sheets.
So one of the problems is merging all these notes together into one "view" so that someone can see all the issues that need to be addressed in any particular part.
The second problem is that an Excel spreadsheet sucks at representing a hierarchical BOM when a common component is used in more than one subassembly. Meaning that, if someone writes a note about the P34 relay in the ignition subassembly, the same comment should be associated with the P34 relays used in the motor driver subassembly. This won't occur in the Excel spreadsheet.
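A toy model of the two problems, in plain Python rather than Neo4j: if each part number exists exactly once and subassemblies merely point at it, a note written against P34 is automatically visible wherever P34 is used, and notes from different departments merge on the same node. The structure, part names other than P34, and helper names are invented for illustration.

```python
from collections import defaultdict

# Each part number exists once; subassemblies refer to it by key.
children = {
    "engine": ["ignition", "motor_driver"],
    "ignition": ["P34", "coil"],
    "motor_driver": ["P34", "heatsink"],
}
notes = defaultdict(list)  # part number -> notes from any department


def add_note(part, department, text):
    notes[part].append((department, text))


def notes_for_assembly(root):
    """Collect notes for every part under root, visiting shared parts once."""
    result, stack, seen = {}, [root], set()
    while stack:
        part = stack.pop()
        if part in seen:
            continue
        seen.add(part)
        if part in notes:
            result[part] = list(notes[part])
        stack.extend(children.get(part, []))
    return result
```

With this shape, importing another department's spreadsheet becomes a matter of calling `add_note` against existing part numbers instead of reconciling two copies of the hierarchy.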
For the company intranet, I want to be able to search for anything easily. Such as data related to a part number, a BOM structure, a phone number, an email address, a company policy, or procedure. I want to even extend this to manage computer hardware assets, and installed software.
I envision that once the information network starts to get populated you can start doing cool traversals such as "I want to write an email to everyone working on the XYZ project". People will have been associated with the project because they will be tagged as creating and modifying the data within the XYZ project. So by using the XYZ project as a search key, a huge set with everything related to the XYZ project will be created, including links to the people who built the XYZ project. The people links will connect to their email addresses. So by their involvement in the XYZ project, they will be included in my email. This is in stark contrast to some secretary trying to maintain a list of people working on the project. We generate a lot of lists. We spend a lot of time maintaining lists and making sure they are up to date. And most of it doesn't add any value to our products.
Another cool traversal could report all the computers that have a certain piece of software installed, by version. That report could be used to generate tasks to remove extra copies of old software and to update people who need to have the latest copy. It would also be useful for license tracking.
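Both traversals can be prototyped on a handful of triples before committing to a graph database. The data, relationship names and function names below are invented for illustration; a real graph store would replace the list scans with indexed lookups.

```python
# Hypothetical data: (source, relationship, target) triples.
triples = [
    ("bom-17", "PART_OF", "XYZ"),
    ("spec-3", "PART_OF", "XYZ"),
    ("memo-9", "PART_OF", "ABC"),
    ("ann", "MODIFIED", "bom-17"),
    ("raj", "CREATED", "spec-3"),
    ("lee", "CREATED", "memo-9"),
    ("ann", "HAS_EMAIL", "ann@example.com"),
    ("raj", "HAS_EMAIL", "raj@example.com"),
    ("lee", "HAS_EMAIL", "lee@example.com"),
    ("pc-01", "HAS_INSTALLED", ("editor", "1.2")),
    ("pc-02", "HAS_INSTALLED", ("editor", "2.0")),
    ("pc-03", "HAS_INSTALLED", ("editor", "1.2")),
]


def project_emails(project):
    """Project -> items in it -> people who touched them -> their emails."""
    items = {s for s, r, t in triples if r == "PART_OF" and t == project}
    people = {s for s, r, t in triples
              if r in ("CREATED", "MODIFIED") and t in items}
    return sorted(t for s, r, t in triples
                  if r == "HAS_EMAIL" and s in people)


def computers_by_version(software):
    """Group computers by the installed version of one piece of software."""
    report = {}
    for s, r, t in triples:
        if r == "HAS_INSTALLED" and t[0] == software:
            report.setdefault(t[1], []).append(s)
    return report
```

Nobody maintains the mailing list here: it falls out of who created or modified what, which is the point of the paragraph above.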
It might be a bit late, but there is a growing number of projects using Neo4j; the better-known ones are listed at Neo4j. Also Neo Technology, the company behind Neo4j, has some references on their customers page.
Note: I am part of the Neo4j team