I have a Postgres database with many millions of records that I want to migrate to CouchDB. I know how I want to represent the records as documents: each document will have 9 items (4 integers, 4 text strings, and a date string).
My question is: do I really need to write something that will make millions and millions of POST requests to create my initial database from the existing data? I understand that CouchDB is generally fast, but doing this over HTTP strikes me as extremely inefficient and time consuming, even over localhost.
HTTP is the only API that I see, so is this normally what is done when someone has to create a database with a huge number of initial documents?
Thanks
Yes, it is done via HTTP. It is not inefficient, though, since you can create many documents in a single request by using the _bulk_docs API.
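For illustration, a minimal Python sketch of such a bulk load might look like the following (this assumes the psycopg2 and requests packages; the database, table, and column names are made up, so adapt them to your schema):

# Stream rows out of Postgres and POST them to CouchDB in batches via _bulk_docs.
import psycopg2
import requests

COUCH_DB = "http://localhost:5984/mydb"      # target database (hypothetical name)
BATCH_SIZE = 10000                           # documents per _bulk_docs request

pg = psycopg2.connect("dbname=source user=postgres")
cur = pg.cursor(name="migration")            # named cursor, so rows are streamed
cur.execute("SELECT i1, i2, i3, i4, t1, t2, t3, t4, created_on FROM records")

requests.put(COUCH_DB)                       # create the database if it doesn't exist

while True:
    rows = cur.fetchmany(BATCH_SIZE)
    if not rows:
        break
    docs = [
        {"i1": r[0], "i2": r[1], "i3": r[2], "i4": r[3],
         "t1": r[4], "t2": r[5], "t3": r[6], "t4": r[7],
         "created_on": str(r[8])}            # store the date as a string
        for r in rows
    ]
    # One HTTP round trip per batch instead of one per document.
    requests.post(COUCH_DB + "/_bulk_docs", json={"docs": docs}).raise_for_status()

With batches of a few thousand documents, millions of rows come down to a few hundred requests rather than millions.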
This is a request for a general recommendation about how to organize data storage in my case.
I'm developing a Spring Boot app in Java to collect and save measurements and provide access to the saved data via a REST API. I expect around 10 million measurements per hour, and I need to store the history for the most recent 2-3 months. The total number of measurements stored can reach tens of billions. The data model is not sophisticated; there will be around ten tables. No editing is planned, only cleaning out obsolete data and vacuuming. I'm planning to use Postgres as the DBMS.
Once stored, the data can be retrieved as-is (using temporal or spatial filters) or used to create aggregated data products. Despite performance tuning, indexes, and optimized queries, data retrieval can take significant time, but this is for research purposes and I understand the price of having that many records. Up to this point things are clear.
On the other hand, the most recent measurements (e.g. those collected during the last ten minutes) must be accessible immediately, or at least as fast as possible. This data must be served by the REST API and shown in a front-end app as graphs updated in real time. Obviously, retrieving the last few minutes of data from a table with billions of records will take an amount of time that is unacceptable for that kind of display.
What would be a typical solution for such a situation?
So far I have come up with the idea of using two datasources: Postgres for history and in-memory H2 for keeping recent data ready to be served. That way I will have a small DB duplicating the recent data in memory. With this approach I expect to reuse my queries and entity classes. Does this seem OK?
I found a multi-datasource solution that perfectly matches my case. The author of this article is dealing with a project "where an in-memory database was needed for the high performance and a persistent database for storage".
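Independent of Spring and H2, the idea can be sketched roughly like this in Python: each measurement is written to the durable store and also to a small in-memory buffer that only keeps the last few minutes for the real-time endpoints (the db.insert call and the ten-minute window are assumptions, not actual code from the project):

import time
from collections import deque
from threading import Lock

class RecentBuffer:
    """Keeps only the measurements from the last `window_seconds` in memory."""
    def __init__(self, window_seconds=600):
        self.window = window_seconds
        self.items = deque()              # (timestamp, measurement), oldest first
        self.lock = Lock()

    def add(self, measurement, ts=None):
        ts = time.time() if ts is None else ts
        with self.lock:
            self.items.append((ts, measurement))
            self._evict(ts)

    def recent(self):
        with self.lock:
            self._evict(time.time())
            return [m for _, m in self.items]

    def _evict(self, now):
        while self.items and self.items[0][0] < now - self.window:
            self.items.popleft()

buffer = RecentBuffer(window_seconds=600)

def save_measurement(db, measurement):
    db.insert(measurement)                # hypothetical call to the history store
    buffer.add(measurement)               # the "H2" role: recent data, in memory

The two-datasource approach in the question plays the same role, with Postgres as the history store and H2 as the recent buffer.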
We have an ad search website, and all the searches are done through Entity Framework directly querying the SQL Server database.
It was working very well when the database had around 1000 ads, but now it is reaching 300k and there are lots of users searching. Searches are now very slow (using raw SQL didn't help much), and I was instructed to consider Elasticsearch.
I've been through some tutorials and I get the idea of how it works now, but what I don't know is:
Should I stop using SQL Server to store the ads and start using Elasticsearch instead? What about all the other related data? Is Elasticsearch an alternative to SQL Server?
Each ad has some related data stored in different tables; how would I load it into Elasticsearch? As a single JSON document?
I read a lot about Elasticsearch handling billions of records, so I don't think I would have performance problems with 300k rows in it, correct?
Could anybody help me understand these questions better?
1- You could still use it; you don't want to search over the complete database, right? Just over the ads. Elasticsearch works with a NoSQL, document-based format, so it is very scalable. It also works with JSON, so you have an easy way to access it.
2- When indexing data, you should try to put all the necessary data into the same document (the equivalent of a SQL row), which is a single JSON object, within reason. Storage is cheap, but computing time isn't.
To index your data, you could either use Filebeat, a program somewhat similar to Logstash, or create your own solution, e.g. a program that reads data from your DB and passes it to Elasticsearch in bulk (see the sketch after this answer).
3- Correct, 300k rows is a small quantity, but performance also depends on the memory of the machine hosting Elasticsearch.
Hope this helps.
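As a rough sketch of the "program that reads data from your DB and passes it to Elasticsearch in bulk" idea from point 2, using the plain HTTP _bulk endpoint (the pyodbc connection string, table and column names, and the "ads" index are placeholders, not your real schema):

import json
import pyodbc
import requests

ES_BULK_URL = "http://localhost:9200/_bulk"
BATCH_SIZE = 1000

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=localhost;DATABASE=AdsDb;Trusted_Connection=yes")
cur = conn.cursor()
# Denormalize the related tables into one row per ad (see point 2 above).
cur.execute("""
    SELECT a.Id, a.Title, a.Description, c.Name, a.City
    FROM Ads a JOIN Categories c ON c.Id = a.CategoryId
""")

while True:
    rows = cur.fetchmany(BATCH_SIZE)
    if not rows:
        break
    lines = []
    for ad_id, title, description, category, city in rows:
        # The bulk body alternates an action line and the document itself.
        lines.append(json.dumps({"index": {"_index": "ads", "_id": str(ad_id)}}))
        lines.append(json.dumps({"title": title, "description": description,
                                 "category": category, "city": city}))
    body = "\n".join(lines) + "\n"        # the bulk body must end with a newline
    requests.post(ES_BULK_URL, data=body,
                  headers={"Content-Type": "application/x-ndjson"}).raise_for_status()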
If you had to make provision for 80 million records (one for each page on the internet) and store the relationships between those records (which is 80 billion to the nth power), which database would be the best for this?
I started this project thinking we would only map a portion of the internet, but unfortunately it has grown far beyond the limits of MySQL. I need a better way to keep track of this data. The frontend is PHP, but I suppose the backend can be anything, as long as it can handle that amount of data?
I won't say there is one holy database for your needs; maybe it would be better for your company to split your database into logical parts to handle the amount of data in a better way. Maybe you could offload some data into the file system, since you won't need everything in your database all the time.
If you scan the web, you probably save the HTML, CSS, or any other large data you crawl into your filesystem, while you save links and everything metadata-related in your database. But I think you've already considered that.
The best advice I can give here is to make sure your database structure is whatever fits your processes best before you think about switching databases. If you really need to switch (because MySQL won't give you more performance), there are MongoDB and/or WebScaleSQL. WebScaleSQL is reportedly used by Facebook to handle the amount of data they have.
A big question is whether you could improve your performance just by improving your hardware. You should check that too, AFTER you have checked your structure and processes!
I'm used to working with MySQL, but for my next series of projects CouchDB (NoSQL) seems to be the way to go, basically to avoid EAV in MySQL and to embrace all the cool features it has to offer.
After lots of investigation and reading of the documentation, there is one thing I still don't quite understand.
Let's assume I host three web applications on my server and thus need three databases accordingly. For instance, one is a webshop with product and invoice tables, one is a weblog with article and comment tables, and another is a web-based game with game-stats tables (a simplification, obviously).
So I host multiple sites on one installation of MySQL, and each application I run on my server gets its own database with tables, fields, and content.
Now, with CouchDB I want to do the exact same thing. The problem seems to be that creating a database in CouchDB is more similar to creating a table in MySQL; i.e. I would create databases called 'comments', 'articles', etc. for my weblog, and inside each I would create a document per article or a document per comment.
So my question is: how can I separate my data from multiple web applications on one CouchDB installation?
I think I am doing something fundamentally wrong here but hopefully one of you guys can help me get on the right track.
In CouchDB, there's no explicit need to separate unrelated data into multiple databases. If you've constructed your documents and views correctly, only relevant data will appear in your queries.
If you do decide to separate your data into separate databases, simply create a new database.
$ curl -X PUT http://localhost:5984/somedb
{"ok":true}
From my experience with CouchDB, separating unrelated data into different databases is very important for performance, and also a no-brainer. View generation is a painful part of CouchDB. Every time the database is updated, the views (think of them as indexes in a traditional relational SQL database) have to be regenerated. This involves iterating over every document in the database. So if you have, say, 2 million documents of type A and 300 documents of type B, and you need to regenerate a view that queries type B, then all 2,000,300 documents will be enumerated during view generation, and it will take a long time (it might even hit a read timeout).
Therefore, having multiple databases is a no-brainer when it comes to keeping views (how you query in CouchDB, an obviously important and unavoidable feature) up to date.
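To make that concrete, here is a small sketch (hypothetical database, design document, and field names) of defining and querying a type-filtered view over HTTP from Python; building the view still requires CouchDB to run the map function over every document in the database, which is exactly why separating unrelated document types into their own databases helps:

import requests

DB = "http://localhost:5984/somedb"

design_doc = {
    "views": {
        "type_b": {
            # The map function is run against ALL documents in the database,
            # even though it only emits rows for the 300 type-B ones.
            "map": "function (doc) { if (doc.type === 'B') { emit(doc._id, null); } }"
        }
    }
}
requests.put(DB + "/_design/by_type", json=design_doc).raise_for_status()

# Querying the view triggers (re)indexing over the whole database if it is stale.
rows = requests.get(DB + "/_design/by_type/_view/type_b").json()["rows"]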
#Zombies is absolutely right about performance. CouchDB isn't suited to working with a lot of documents in a single database. If you need to work with, let's say, more than 5000 documents, MongoDB will outperform CouchDB.
Views in CouchDB are essential but painful, with limited JavaScript options for building your queries (don't even think about document references or nested objects). Having multiple databases for different document types is quite a reasonable solution. Some people will say something like:
CouchDB is a NoSQL database, and as such you should not need to order your documents nor filtering them using something else than views. NoSQL database core feature is the ability to store scheme-less documents [...]
And I find it very annoying to need workarounds for performance and querying. You should not mind creating a few databases if it allows you to split your data; it will still be a 'single CouchDB installation'. Don't forget that CouchDB is suited to small databases: the smaller a database is, the faster your queries will be and the better the performance.
(I do not know if there are any English mistakes; pardon me if so.)
EDIT
Some companies, like ArangoDB, have published comparisons between themselves, MongoDB, and CouchDB, and the results confirm what I said about the number of documents.
There are a lot of other resources on their website. On the other hand, this statement also comes from personal experience: I benchmarked them during my internship with a PHP benchmarking tool I found on the Internet.
I hear a lot about CouchDB, but after reading some documentation about it, I still don't get why I would use it, or how.
Could you clarify this mystery for me?
It's a non-relational database, open-source, distributed (incremental, bidirectional replication), schema-free. A CouchDB database is a collection of documents; each document is a bunch of string "keys" and corresponding "values" (which can be numbers, strings, lists, dates, ...). You can have indices, queries, views.
If a relational DB feels confining to you (you find schemas too rigid, can't spread the DB engine work around a very large numbers of servers, etc), CouchDB is worth considering (it's one of the most interesting of the many non-relational DBs that are emerging these days).
But if all of your work happily fits in a relational database, that's what you probably want to continue using for production work (even though "playing around" with some non-relational DB is still well worth your time, just for personal growth and edification, that's quite different from transferring huge production systems over from a relational DB!-).
It sounds like you should be reading Why CouchDB
To quote from Wikipedia:
It is not a relational database management system. Instead of storing data in rows and columns, the database manages a collection of JSON documents. The documents in a collection need not share a schema, but retain query abilities via views.
CouchDB provides a different model for data storage than a traditional relational database: it does not represent data as rows within tables; instead, it stores data as "documents" in JSON format.
This difference in data storage model is what differentiates CouchDB from products like MySQL and SQL Server.
In terms of programmatic access to CouchDB, it exposes a REST API which you can access by sending HTTP requests from your code.
I hope this has been somewhat helpful, though I acknowledge it may not be, given my minimal familiarity with the product.
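For example, a minimal sketch using Python's requests library (the database and document names are made up):

import requests

BASE = "http://localhost:5984"

requests.put(f"{BASE}/blog")                              # create a database
requests.put(f"{BASE}/blog/first-post",                   # create a document
             json={"type": "article", "title": "Hello", "tags": ["couchdb"]})

doc = requests.get(f"{BASE}/blog/first-post").json()      # read it back
print(doc["title"], doc["_rev"])                          # CouchDB adds _id and _rev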
I'm far from an expert (all I've done is play around with it some...) but here's how I'm thinking of using it:
Usually when I'm designing an app I've got a bunch of app servers behind a load balancer. Often I've got sticky sessions, so that each user will go back to the same app server during that session. What I'm thinking of doing is having a CouchDB instance tied to each app server.
That way you can use that local CouchDB to access user preferences, product data... whatever data you've got that doesn't have to be perfectly up to date.
So... now you've got data on these local CouchDBs. CouchDB allows replication. So, every fixed time period (every X seconds?), merge the data back into its peers to keep them up to date.
On the whole you shouldn't have to worry about conflicts, because each app server has its own CouchDB and users are attached to their app server, and you've got eventual consistency because you've got replication.
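As a rough sketch of that periodic merge, each node can ask its local CouchDB to replicate to its peers via the _replicate endpoint (the host names, database name, and 30-second interval are made up; CouchDB can also keep this running for you with continuous replication):

import time
import requests

PEERS = ["http://app1:5984", "http://app2:5984"]
DB = "userprefs"

def sync_once():
    for source in PEERS:
        for target in PEERS:
            if source != target:
                requests.post(source + "/_replicate",
                              json={"source": DB,
                                    "target": f"{target}/{DB}"}).raise_for_status()

while True:                 # the "every X seconds" from above
    sync_once()
    time.sleep(30)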
Does that answer your question?
A good example is when you have to deal with people data in either a website or an application. If you set out to design the data model and keep each individual's information separate, that makes a good case for CouchDB, which stores data in documents rather than relational tables. In a production deployment, my users may end up adding ad-hoc data about 10% of the people and some other odd details for another 5%. In a relational context, this could add up to loads of redundancy, but not with CouchDB.
And it's not just about the fact that CouchDB is non-relational: if you focus too much on that, you're missing the point. CouchDB is plugged into the web: all you need to get started is HTTP for creating documents and making queries (GET/PUT/POST/DELETE...), and it's RESTful; add to that the fact that it's portable and great for peer-to-peer sharing. It can also serve up web applications in what are termed 'CouchApps', where CouchDB holds the images, CSS, and markup as data stored under special documents called design documents.
Check out this collection of videos introducing non-relational databases; the one on CouchDB should give you a better idea.