Anyone using CouchDB? [closed] - database

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
I've followed the CouchDB project with interest over the last couple of years, and see it is now an Apache Incubator project. Prior to that, the CouchDB web site was full of do not use for production code type disclaimers, so I'd done no more than keep an eye on it. I'd be interested to know your experiences if you've been using CouchDB either for a live project, or a technology pilot.

After 18 Months of prototypes, testing and waiting for CouchDb to get ready we moved an internal application over to CouchDB in December 2008. So far I'm very happy with that move. It gets rid of a lot of filesystem objects for us (PDFs and JPEGs, now stored as attachments in CouchDB). This enables us to get rid of NFS and easier cluster/replicate our frontend webservers.
To what degree CouchDB is ready for you depends very much on the culture of your organization. We have an in-house development team maintaining several internal Erlang applications. Since CouchDB is written in Erlang and the codebase is of quite decent quality we felt confident that we could fix show stopper issues in CouchDB should the need arise - or at least get our data back out. We also hired one of the CouchDB core team as an consultant - just in case.
But CouchDB for sure isn't 1.0 yet. There are crashes in the Web worker processes all the time (if you misuse them). Replication breaks for us and we don't get error messages about it. Documentation is still very lacking. Still I'm confident that it will not eat our data and development moves forward with reasonable pace.
To give you an idea about our application: currently our biggest database is about 512000 records taking 7.5 GB of diskspace.

I use the CouchDB to power a Facebook application (over 35k monthly active users). For a while it was using MySQL but after porting the entire project over from Perl to Erlang, I decided to go for the gold and migrate all of the data into CouchDB and use that instead.
CouchDB has been a great data store to work with. I think that it is on track to becoming a major player in web-based services.

I got to know one of the people (Jan) working on it a while ago (like 6 months) and have been playing with it ever since. I found the community around CouchDB to be both very knowledgable and helpful so that whenever I ran into an issue it was resolved in a matter of minutes or hours at least.
We just kicked off a project the other week which basically requires us to store data in the non-relational way and due to CouchDB's document oriented store we selected it as one of the technologies to use. So this is actually the first time that I will run it in production, but I'm still pretty confident about it. :)
Just an update here (2009-10-25):
Our first CouchDB install is 20 GB, it hosts 40 million records. It's been running in production since January 2009, and it's been great. Read (GET) speed is outstanding and we use it as a store for complex data, and then it's just pull.
Our second couchdb installment has two databases, one is 160,000,000+ documents (210 GB), and growing between 150,000-300,000 documents a day. The other is only 35,000,000 documents (7 GB). This setup has a lot more reads and writes and initial tests are performing very well.
View building on the 160,000,000 document database took roughly a week, but since then we upgraded to a larger Amazon EC2 instance and we are also getting ready to update to CouchDB 0.10.x (from 0.9.1) as this release includes a lot of performance improvements in view building.

I am using couchdb in a few scenarios, as a document store for http://devk.it (under development) and in a much larger scale as a template store for a distributed email delivery system.
CouchDB is very slick for what it does, but I was not able to get it to run at as high of a concurrency level as I would have expected. Also note that the maximum document size is fairly limiting at 1MB due to the hardcoded max input buffer size in mochiweb. You can however alter a header file and recompile to get around this limit.

I'm using CouchDB to store (and serve) article ratings on my blog. It's not exactly heavy traffic but it's been rock solid so far.
Also planning on adding comments sometime which I'll most likely also store in CouchDB.
I've found it quite easy to get started with, on OSX you can just download CouchDBX to get started quickly. I use a Sinatra backend with RestClient to interact with 'the couch' through straight HTTP verbs and such.
Great fun.

At the moment I'm working with CouchDB for a computer science thesis. I'm writing about my progresses and opinions on my blog, http://metalelf0dev.blogspot.com. I think the project is well done, but the existing documentation isn't organized as it should. A quick tutorial about the Futon web interface could be really useful for starters IMHO :)

I used couchdb twice in production. First was the wiki likes project and I think that couchdb was perfect candidate for that role. Saving the version of all docs helps a lot.
The second project was quite query loaded and idea was dumping social data first, then query it with various filters. It was looked like standard CouchDB query features looks a bit pure for our needs. But we add Lucene like a full text indexer and after that doing many queries during Lucene part. And that solution looks good enough.

Related

GAE Vs AWS 2012 [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
Is GAE is a good option for the backend when compared with AWS? The information found mainly discusses the issues that GAE has resolved as of today. The mobile application under consideration deals with the images. Sharing and editing of images simultaneously with multiple users.
I am mainly concerned with the Scalability & flexibility in the implementation. The robust & compability layer, Storage and data analysis (analysis (identifying the patterns)of the data stored).
AWS lets use the popular open source technologies & tools and has a granular pricing. GAE is good to get to market really fast, no administration pain, and a free quota.
Can you please point out some important pros & cons that I must consider before taking the decision.
I think that GAE is good for its fast startup and for proof of concept. It is really simple and cheaper to begin with, but it locks you in to google.
If your idea works well and it becomes popular you can rewrite it using open source technology in future.
I have 25 GB appengine DB. Every 1-10 minutes I add record.
It costs $2.5 per week.
But originally it was more expensive to upload then I ever expected.
My upload script was uploading chunks 500 records per request.
Requests were ending in 10-15 seconds but logs showed datastore time much more like 5 minutes vs 15 real seconds!
Also upload servlet was waiting 99% of time doing nothing and I had to pay for that as well.
It took days to upload 15 GB of indexed data.
AppEngine has certain pricing risks
GAE is essentially for "throw-away" proof of concepts or "very very small" applications. I say this because, I would not invest is a large amount of money into a completely vendor locked system...Other people might but I wouldn't take that kind of risk since I'd be at Google's whim on their availability and pricing.
So if you have a large project or product, you're probably better off using EC2 since all it's providing is the infrastructure...there's no code requirements imposed on you.
That being said, if I had a small project I wanted to toss on the web for my friends, I'd definitely take advantage of GAE's free tier.
I think the biggest difference is that, in a general sense, EC2 hosts servers whereas GAE hosts code. If you're building a system where you're going to want to do things like tail logs, have cron jobs managed by a sys admin, use open source tools like rsync and have fine grained control over OS and configuration or co-locate services on one box, then EC2 is very compelling.
GAE is "upload your app and it works". Very cool in its own right but personally I'd rather deal with VMs in EC2 because it's a more natural dynamic for systems development for me at least.

Store large amount of images on multiple servers

I would like to know what is the best solution for storing large amount of images on multiple servers like google, facebook.
It seems that storing in filesystem is better then inside a database but what about using a noSQL DB like cassandra.
Do Google/Facebooke store the same image in multiple servers for the load balancing.
How does it work? What is the best solution?
Thx a lot
There's nothing wrong with the approach you're taking. As mentioned, there are caveats, however, the possibilities do exist, and a lot of people and companies are successfully storing files in Apache Cassandra.
zjffdu/cassandra-fs is the first solution i'd look into. Now, this was last developed 2 years ago, so I'd be a bit cautious on it working the first time, out of the box. Apache Cassandra is now at version 1.0.x, with 1.1.x on the way. 2 years ago, that was version 0.6.x maybe? A lot has changed & improved in 24 months.
semantico/cassandra-fs a fork ... last touched 7 months ago
favoritas37/cassandra-fs another fork ... last touched 3 months ago and indicates compatibility with the 1.0.5 branch of Cassandra
The principal behind this is to take a file, break it into a set of chunks and store those chunks as columns in a row. When retrieving, pull each column, reassemble the file and voila.
Cassandra FAQ: large file and blog storage
...files of around 64Mb and smaller can be easily stored in the database without splitting them into smaller chunks...
Lucene indexes in Cassandra
...its files are broken down into blocks (whose sizes are capped), where each block (see FileBlock) is stored as the value of a column in the corresponding row...
You'll get more positive feedback on the Cassandra mailing list and on the IRC channel.
Finally, this is from 2009, and written by folks at Facebook, which should go some way to help answer more of the fundamental questions you have: Cassandra - A Decentralized Structured Storage System.
Note, I know this is an old question, I just want to counter balance some misconceptions about cost as I'm doing this right now as a test.
Unlike what DavidB thinks, it does not cost millions - even if you were to run dedicated hosted hardware, you'd easily be under a couple thousand/month (BTDT, one of my clients is running a 8 node cluster for about $800/month). That said, that's a maintenance headache you want to avoid, and Cassandra on EC2 is far easier to deal with.
You could easily run a substantial production cloud on EC2 for less than $1000/month and you can do R&D clouds for less than $100/month (I spend about $52 last month for an 10 machine test cluster). I highly recommend using TurnKey Linux to manage & provision your R&D farm, as their tools will allow you to migrate instances from your desktop to pretty much any virtualized hosting platform in a few minutes (and vice versa). Plus they have really slick integration with EC2.
For really serious levels of traffic, Pintrest once stated they spend $15 to $50/hour depending on server load, auto-scaling to meet traffic demands, see http://www.theregister.co.uk/2012/04/30/inside_pinterest_virtual_data_center/ for details
The real cost is in setup and manage of your distributed Cassandra instance. Luckily, NetFlix has just release a ton of manage tools just for this. You can find them here: https://github.com/netflix - there are also a ton of interesting videos about NetFlix's use of AWS, particularly moving stuff from Cassandra to S3 - see their blog here http://techblog.netflix.com/2012/12/videos-of-netflix-talks-at-aws-reinvent.html
If you want to store in a "cloud" environment you're best going with a cloud solution that has the resources such as Google App Engine or Amazon Web Services. You're not going to be able to setup your own if that is the question. It will costs millions of dollars and resources to manage them. And yes, Google and Facebook use thousands of servers to distribute their data in "clouds".

Pros & Cons of Google App Engine [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
[An Updated List 21st Aug 09]
Help me Compile a List of all the Advantages & Disadvantages of Building an Application on the Google App Engine
Pros:
No need to buy servers or server space (no maintenance).
Makes solving the problem of scaling easier.
Free up to a certain level of consumed resources.
Cons:
Locked into Google App Engine ?
Developers have read-only access to the filesystem on App Engine.
App Engine can only execute code called from an HTTP request (except for scheduled background tasks).
Users may upload arbitrary Python modules, but only if they are pure-Python; C and Pyrex modules are not supported.
App Engine limits the maximum rows returned from an entity get to 1000 rows per Datastore call. (Update - App Engine now supports cursors for accessing larger queries)
Java applications may only use a subset (The JRE Class White List) of the classes from the JRE standard edition.
Java applications cannot create new threads.
Known Issues!! : http://code.google.com/p/googleappengine/issues/list
Hard limits
Apps per developer - 10
Time per request - 30 sec
Files per app - 3,000
HTTP response size - 10 MB
Datastore item size - 1 MB
Application code size - 150 MB
Update Blob store now allows storage of files up to 50MB
Pro or Con?
App Engine's infrastructure removes many of the system administration and development challenges of building applications to scale to millions of hits. Google handles deploying code to a cluster, monitoring, failover, and launching application instances as necessary.
While other services let users install and configure nearly any *NIX compatible software, App Engine requires developers to use Python or Java as the programming language and a limited set of APIs. Current APIs allow storing and retrieving data from a BigTable non-relational database; making HTTP requests; sending e-mail; manipulating images; and caching. Most existing Web applications can't run on App Engine without modification, because they require a relational database.
Pros:
Scalable
Easy and cheaper (in short term).
Nice option for start-ups/individuals.
Suitable for apps that just store and retrieve data.
Cons:
Not suitable for CPU intensive calculations. They are slower and expensive.
Scalability doesn't matter much cuz if an app works at Google scale then probably it makes enough money to run on its own servers.
They have lots of limitations thrown here and there, as a result deep data analysis is difficult. Like you cannot produce a social graph using GAE.
I would say its not meant for serious businesses and expensive in long run.
(A huge new) PRO: GAE now supports MySQL :
https://developers.google.com/cloud-sql/
Pros:
built-in ui for unified logs
built-in web interface for task queues
built-in indexes on list of primary objects.
Cons:
loose logs very fast
VERY expensive
VERY expensive
VERY expensive
Un-hackable. Scales because you're obligated to code in a way that scales.
Longer development cycles. Sometimes you just want to hack something together and throw it away after 5 hors. With appengine you have to proper code it and write a lot of stuff to make it sure it scales. You can't just do a "find . | grep .avi | xargs ffmpeg -compress ...." :)
You will loose hours trying to do the simplest tasks like sending push notifications to APNS (iPhone). Although it's fine if you only want to support android in the future.
Terrible to make cleanups on the database. It's a HUGE pain in the ass to fix rows in the database, mainly because terribly slow, but it also requires a lot of code to loop properly within it's time constraints.
It was a pain to port Lucene to work on it's "filesystem".
Slow for what you pay.
Even MORE expensive if your app has spikes of traffic. My app has those spikes if a user that has many followers makes an action and we have to push notifications to his followers. Because of that I have to keep 10 inactive servers always on ($$$$$) to handle spikes.
Appengine isn't too bad due to the fact that I have the option to burn $$$$ instead of being concerned about scalability and fixing bottlenecks to reduce server usage. Sometimes it worth it.
My advice to people starting new products is to go with hetzner.de which is where I host my other products servers. It's cheap and extremely hackable. I have one server at hetzner that is handling 3x more traffic than the product that I have on appengine. The difference in price is $100 a month versions $2700 a month!
I have system admin experience, so the bottom line is that I would never choose appengine over having my own ROOT server. Don't be that bored software engineer wanting to experiment new things instead of building great products!
Pro: Unlimited scalabity to your application and scales with demand.
Con: Not available in some countries (Argentina).
Edit
Available worldwide, but only through Google Groups for App Engine.
When assessing pros and cons, I think it is important to clarify the market for which one is representing. Developers looking for a cost-effective solution to help them with the steep part of their planned hockey-stick growth curve will weigh heavily the cons already listed. For a small business owner, however, GAE is a God-send. These folks most often are looking to "the cloud" as a means to more effectively run their business (i.e. sell physical product and services). For the SMB, GAE the pros already listed can be much more valuable compared to the hockey-stick seeking dev, whilst the cons weight in at a fraction of the devs' measure. I don't see the GAE team doing anything related to SMB positioning, so I guess answers like this are me just pulling on Superman's cape, and spitting into the wind. Really GAE should be absolutely ruling the SMB space now. If not (I have no insights re: user base), then its is a greatly lamentable failure.
I believe , GAE is yet to mature in terms of providing the basic features for serious business such as Datastore with complex primary key, java.awt.* support, these are just a few I'm naming.
Other than the free space and to build some "Hobby" websites, I strongly feel GAE is NOT the place java guys should looking into.
I'm having applications built on the JSP/Servlets and MySQL, thinking about migrating to GAE, but I find I will be spending more "value time" on the migration than just buying a space from some java hosting provider such as EATJ, etc (Sorry not marketing, just an experience).
Another big issue I've got is migration of my existing mySQL data into GAE, bulkupload is really pathetic and has no client support.
No support for Local Db to Server DB upload.
Once the GAE is ready with "all the Cons" mentioned by above, then I'll think we can look in to this migration.
You are force to own a cell phone line, and your country+carrier must be able to receive international SMSs.
(I hate cell phones, and my mom's or co-workers won't get the SMSs)
Con: No Other RDBMS or NoSQL databases are not possible ....
Con: All your base are belong to us
... On a serious note:
Con: You don't control the environment your application runs in. The same cons as with outsourcing any component. Fun for toys, not for business (yet) IMHO.
Various things like API for Google proprietary backends such as their database system and other 'lockdowns' and frameworks that mean your code is tied, in some loose sense to their system can create cost issues later if you want to migrate from GAE. Of course, you could abstract these.
I like GAE, AppJet and others. They are cool. But everything has its place. If you want freedom and the ability to control your language's modules, API, syntax/stdlib versions and whatnot ... don't relinquish control to a service provider.
The lack of standards for environments and specifications for what your app can expect worries me in the cloud arena.
common sense stuff really.
Con: Limited to Java and Python

Google App Engine as production platform [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 9 years ago.
Improve this question
We are about to start working on new commercial web project and considering Google App Engine as a potential platform.
Questions:
Does Google App Engine is really
scalable and may be considered as a
production platform for commercial
project?
Is it more expensive (or
cheaper) than good hosting company
service in long run?
Is it
possible (and pretty cheap) to move
the app from Google App Engine to
independent server/farm (e.g. to use it
as a private system, to exploit our
own hardware etc.)?
Is there some
mechanism to deal with DDoS attacks?
Can I make a full backup of the
app data?
Sorry for such silly questions.
I'll answer question 1:
I'm in the pilot phase of a new web application on app engine. We've spent about a month writing code and getting things ready for our first customer. They went live last week. They love the software but a couple of days ago I started to get random deadline exceeded errors in the application. You look up a record or a list and it would come back in miliseconds. The next go it would take 30 seconds and come back with a deadline exceeded error.
The stack traces in the dashboard give random results. I've tried everything, even stripping the app down to a hello world. I put a log message into our django process request middleware, the first bit of our code that gets executed. It was showing that on the timeout requests it took 25 seconds from google getting the request to running our process_request code. I posted to the google forum and got nothing. I contacted someone at google and they answered back quickly but only said they would contact the team. Nothing since.
It is possible there's something I'm doing to cause this but I really doubt it. Google doesn't provide support so I'm basically out of luck.
If this was a full blown commercial application I'd be out of business.
tl;dr: google app engine has great promise but needs to mature and is not yet suitable for comercial production
Watch google IO (Whre among other they say that: "yep it's scallable".
That depends ... It can even be free for you (you pay for load that you've got).
You can move to Amazon for example using appdrop. It's also a good idea to use app-engine-patch.
... Good question. I realy do not know.
Use GAEBar.
It all depends on your needs.
For a project that has the need to scale from very few users to possible millions of users in short time, google app engine might be exactly what you're looking for.
However, note that you might be surprised by the limitations that GAE comes with. Datastore can amongst others not do full-text search or queries using the IN statement.
So be carefull to specify what needs your application will have, and what data you're going to store and search for.
This also means that moving your application from GAE to a separate server might be troublesome, since the database architecture will most likely be different.
My answers:
BuddyPoke runs on gae (probably the biggest app), check their millions numbers.
You don't pay until your app grows quite a bit
If you are familiar with python, web2py offers this feature with some limitations
Dos protection (java, python)
Gaebar, here a great article.
You're question #3 raises a red flag. If this is an important issue, I'd caution against App Engine at this time. I love the platform, and don't doubt that their will be viable migration paths to a self-hosted solution at some point, but not now. Things like appdrop prove it would be possible to do, but would the effort and investment be worth it? That's the question I'd ask. I'd love to know if somebody has successfully ported a real-world production app engine app to another host.
Backups should be easily scripted or there are tools like GAEbar as Bolotov mentioned.
Regarding cost, you can probably get tens (maybe hundreds) of thousands of objects (records) and decent traffic/use for free. Beyond that, I'm not sure about comparative hosting costs, sounds like a good area to do some research in (note to self).
Finally, Silfverstrom is right about limitations, especially around full-text search. There are some projects underway to tackle this, but probably nothing as robust as a mature RDBMS.
To update with some more recent info (2013), GAE now has a text search API. You can't search data in the database directly; you create searchable documents from your data, and add those to a searchable index. It's not terribly hard to do, but it's a hassle. In particular, whenever your data changes, you need to re-regenerate the changed documents and update them in the index.
It's also fairly easy to export data into Google Big Query, which makes it easy to do reporting.

Any scalable OLAP database (web app scale)?

I have an application that requires analytics for different level of aggregation, and that's the OLAP workload. I want to update my database pretty frequently as well.
e.g., here is what my update looks like (schema looks like: time, dest, source ip, browser -> visits)
(15:00-1-2-2010, www.stackoverflow.com, 128.19.1.1, safari) --> 105
(15:00-1-2-2010, www.stackoverflow.com, 128.19.2.1, firefox) --> 110
...
(15:00-1-5-2010, www.cnn.com, 128.19.5.1, firefox) --> 110
And then I want to ask what is the total visit to www.stackoverflow.com from a firefox browser last month.
I understand Vertica system can do this in a relatively cheap way (performance and scalability wise, but not cost-wise probably). I have two questions here.
1) Is there an open-source product that I can build upon to solve this problem? In particular, how well does a Mondrian system work? (scalability, and performance)
2) Is there an HBase or Hypertable base solution (obviously, a naked HBase/Hypertable can't do this) for this? -- but if there is a project based on HBase/Hypertable, scalability probably won't be an issue IMO)?
Thanks!
You can download a free edition (the single node edition) of the greenplum database. I haven't tried it myself but I think/guess it is a powerful beast. Read here: http://www.dbms2.com/2009/10/19/greenplum-free-single-node-edition/
Another option is MongoDB, it is fast and free and you can write MapReduce functions with JavaScript to do analytics.
My reputation here is to low to add a hyperlink to mongodb, so you have to google . I can add only one hyper link per post.
The zohmg project aims to solve this problem using Hadoop and HBase.
Facebook also built Hive on-top of Hadoop. Pretty simple to get going - reasonable query API too.
http://mirror.facebook.net/facebook/hive/
Is your data model more complex than that? If it isn't you might be beter of just writing custom code for it. Then you can really tune it to your data. Real products have to offer a lot of flexibility, need a lot of complexiy to achieve that, and suffer in speed as a result.
Your question is not clear in one aspect: when you talk about scalable, what do you mean by that? Are you collecting data from lots of sites but only have a limited amount of query users, or do you also have a lot of users? That situation leads to a significantly different model.

Resources