Which is the best database to use to index the internet? - database

If you had to make provision for 80 million records (one for each page on the internet) and store the relationships between those records (which is 80 billion to the nth power), which database would be the best for this?
I started this project thinking we would only map a portion of the internet, but unfortunately it has gone far beyond the limits of MySQL. I need a better way to keep track of this data. The frontend is PHP, but I suppose the backend can be anything, as long as it can handle that amount of data?

I won't say there is one holy database for your needs. It may be better for your company to split the database into logical parts to handle the amount of data. You could also move some data out to the file system, since you won't need everything in the database all the time.
If you scan the web, you would probably save the HTML, CSS, or any other large data you crawl to the filesystem, while keeping the connections and all metadata in the database. But I expect you have thought of that already.
The best advice I can give is to make sure the structure of your database fits your processes as well as possible before you think about switching databases. If you really need to switch (because MySQL cannot give you more performance), there are MongoDB and/or WebScaleSQL. WebScaleSQL seems to be used by Facebook to handle its volume of data.
A big question is whether you can improve performance simply by improving your hardware. You should check that too, AFTER you have checked your structure and processes!
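To make the filesystem/database split concrete, here is a minimal sketch, not a production crawler. It assumes SQLite as a stand-in for whatever database you end up with, and the names open_index, store_page, pages and links are made up for illustration. Page bodies go to disk under a content hash; only the URL, the file path and the outgoing links go into the database.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def open_index(db_path=":memory:"):
    """Create the metadata tables: one row per page, one row per link."""
    conn = sqlite3.connect(db_path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS pages (
            id        INTEGER PRIMARY KEY,
            url       TEXT UNIQUE NOT NULL,
            body_path TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS links (
            from_page INTEGER NOT NULL REFERENCES pages(id),
            to_url    TEXT NOT NULL,
            PRIMARY KEY (from_page, to_url)
        );
        """
    )
    return conn


def store_page(conn, root, url, html, out_links):
    """Write the HTML to disk and record only metadata in the database."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    # Fan files out over subdirectories so no single directory gets huge.
    path = Path(root) / digest[:2] / (digest + ".html")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")

    cur = conn.execute(
        "INSERT INTO pages (url, body_path) VALUES (?, ?)", (url, str(path))
    )
    page_id = cur.lastrowid
    # Duplicate links from the same page are ignored via the primary key.
    conn.executemany(
        "INSERT OR IGNORE INTO links (from_page, to_url) VALUES (?, ?)",
        [(page_id, target) for target in out_links],
    )
    conn.commit()
    return page_id


# Usage: store one page in a temporary directory.
root = tempfile.mkdtemp()
conn = open_index()
page_id = store_page(
    conn,
    root,
    "http://example.com/",
    "<html>hi</html>",
    ["http://example.com/a", "http://example.com/b", "http://example.com/a"],
)
```

The database rows stay small and fixed-size-ish, which is the part that benefits from indexes; the bulky HTML never touches it.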

Related

Best practice to implement cache

I have to implement caching for a function that processes strings of varying lengths (a couple of bytes up to a few kilobytes). My intention is to use a database for this - basically one big table with input and output columns and an index on the input column. The cache would try to find the string in the input column and get the output column - probably one of the simplest database applications imaginable.
What database would be best for this application? A fully-featured database like mysql or a simple one like sqlite3? Or is there even a better way by not using a database?
Key-value stores are made for this. I highly recommend Redis for this specific problem. It is a "key-value" store, meaning it does not have relations, it does not have schemas, all it does is map keys to values. Which sounds like just what you need.
Alternatives are document stores such as MongoDB and CouchDB. Look around and see what suits you best. My recommendation stays with Redis though.
Reading: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Joe has some good recommendations for data stores that are commonly used for caching. I would say Redis, Couchbase (not CouchDB though - it goes to disk fairly frequently and is not that fast, in my experience) and just plain Memcached.
MongoDB can be used for caching, but I don't think it's quite as tuned for pure caching as something like Redis is. Mongo can hit the disk quite a bit.
Also I highly recommend using time to live (TTL) as your main caching strategy. Just give a value some time to expire and then re-populate it later. It is a very hard problem to pro-actively find all instances of some data in a cache and refresh it.
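As a rough illustration of TTL-based caching, here is a small in-process sketch. The class name TTLCache and the get_or_compute method are invented for this example; with Redis you would get the same effect by setting an expiry on each key rather than tracking it yourself.

```python
import time


class TTLCache:
    """In-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: serve from cache
        value = compute(key)  # missing or expired: recompute and store
        self._entries[key] = (now + self.ttl, value)
        return value
```

Nothing here ever hunts down stale entries; an entry simply stops being served once its time is up and gets repopulated on the next request.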

Storage and fast access to huge picture data

I am currently involved in designing a face matching system and we have to deal with more than 3 million images. But I only have a vague idea of how to store those images and access them fast enough to achieve the highest possible performance.
MySQL Server:
This is easy because I am familiar with it, but I doubt the performance would be acceptable. Of course, I have not tried it. From what I have read, there is a new datatype called filestream that lets us store images and access them faster. Another option is to store the file path in the database and load the image after querying for its path.
Other
I also have an idea for a NoSQL solution, though I have no experience with it. From what I have found, MongoDB is a good option: it is the most popular NoSQL database and can be used as a file store as well.
I am thinking of using a good amount of concurrency, which should also allow more concurrent data access.
Can somebody give me a heads-up on this issue and on the best database technology available?
Edit:
Use case: The user gives an image of a person as input, and the system has to return the most likely set of matching face images from the database.
I thought about processing the images separately: with X cores, the images are placed in X different queues, which the application then uses for image processing.
Thanks in advance.

Storing a small number of images: blob or fs?

I'm adding some functionality to my site so that users can upload their own profile pictures, so I was wondering about whether to store them in the database as a BLOB, or put them in the file system.
I found a question similar to this here: Storing images in DB: Yea or Nay, but the answers given were geared more towards people expecting many many thousands or even millions of images, whereas I'm more concerned about small images (JPEGs up to maybe 150x150 pixels), and small numbers of them: perhaps up to one or two thousand.
What are the feelings about DB BLOB vs Filesystem for this scenario? How do clients go with caching images from the DB vs from the filesystem?
If BLOBs stored in the DB are the way to go - is there anything I should know about where to store them? Since I imagine that a majority of my users won't be uploading a picture, should I create a user_pics table to (outer) join to the regular users table when needed?
Edit: I'm reopening this question, because it's not a duplicate of those two you linked to. This question is specifically about the pros/cons of using a DB or FS for a SMALL number of images. As I said above, the other question is targeted towards people who need to store thousands upon thousands of large images.
To answer parts of your question:
How do clients go with caching images from the DB vs from the filesystem?
For a database: Have a last_modified field in your database. Use the Last-Modified HTTP header so the client's browser can cache properly. Be sure to send the appropriate response when the browser requests an image only "if newer" (the If-Modified-Since request header): reply 304 Not Modified when the image has not changed.
For a filesystem: Do the same thing, but with the file's modified time.
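A minimal sketch of that check, independent of any web framework. The function name conditional_image_response is made up, and it assumes last_modified is a timezone-aware UTC datetime (from the database column or from the file's modification time).

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_image_response(last_modified, if_modified_since=None):
    """Return (status, headers) for an image last changed at last_modified."""
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            client_time = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            client_time = None  # unparseable header: send the full response
        if (
            client_time is not None
            and client_time.tzinfo is not None
            and last_modified <= client_time
        ):
            return 304, headers  # browser copy is current; send no body
    return 200, headers
```

On a 304 you skip reading the BLOB (or the file) entirely, which is where the saving comes from.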
If BLOBs stored in the DB are the way to go - is there anything I should know about where to store them? Since I imagine that a majority of my users won't be uploading a picture, should I create a user_pics table to (outer) join to the regular users table when needed?
I would put the BLOB and related metadata in its own table, with some kind of relation between it and your user table. Doing this will make it easier to optimize the table storage method for your data, makes things tidier, and leaves room for expandability (e.g. a general "files" table).
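For illustration, here is roughly what that layout could look like, using SQLite only because it needs no server. The column names mime_type, last_modified and image are assumptions, not something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- One optional picture per user, kept out of the main users table.
    CREATE TABLE user_pics (
        user_id       INTEGER PRIMARY KEY REFERENCES users(id),
        mime_type     TEXT NOT NULL,
        last_modified TEXT NOT NULL,
        image         BLOB NOT NULL
    );
    """
)
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.execute(
    "INSERT INTO user_pics VALUES (?, ?, ?, ?)",
    (1, "image/jpeg", "2015-10-21T07:28:00Z", b"\xff\xd8\xff"),
)
conn.commit()

# LEFT OUTER JOIN keeps users without a picture; their image column is NULL.
rows = conn.execute(
    """
    SELECT u.name, p.image
    FROM users AS u
    LEFT OUTER JOIN user_pics AS p ON p.user_id = u.id
    ORDER BY u.id
    """
).fetchall()
```

Queries that only need user data never touch user_pics, so the BLOBs cost nothing unless you actually ask for them.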
I once faced a similar question with a small DMS for PDF files. The scenario was different from yours: A maximum of maybe 100 files with sizes up to 10 MB each - not what you expect for profile pictures. But the answer a friend gave me back then applies to your case as well:
Use each storage system for what it is designed to do.
Store data in a database. Store files in a file system.
This is not the ultimate answer(*), but it's a good rule of thumb for starters.
I have never heard of the Windows FS being slow and sometimes unreliable, as Aaron Digulla states in his answer. If there are such problems, this certainly needs to be factored in. But for avatar pictures, it does not strike me as important.
(*) I know, I know, 42...
DB is optimized for latency, transactions, etc.
Image storage is optimized for read latency, storage cost, etc.
A blob store is ideal for storing millions of images. I work on SeaweedFS. It was based on Facebook's design for storing their user photos.
What would be more convenient, from the perspective of serving them, writing the code to serve them, backup procedures, etc.? You want the right answer for you, not the right answer for someone else.
From my point of view, anything that can be left outside the database should stay outside. It may be the file system or separate tables which you do not replicate or back up every day. This makes the database much lighter; it grows more slowly and is easier to understand and maintain.
If you are on MSSQL make sure that blobs are stored in separate data file. Not in PRIMARY as everything else.
On Windows, put as much as you can in the database. The filesystem is somewhat slow and sometimes even unreliable.
On Linux, you have more options. Here, you should consider moving big files into a filesystem and just keep the name in the DB. If you use a modern filesystem like Ext3 or ReiserFS, you can even create many small files with pretty good performance.
You also need to take into account how you can access the data. If you have everything in the DB, you have one access path, need not worry about another set of permissions, but you have to deal with the extra complexity of reading/writing BLOBs. In many DBs, BLOBs can't be searched.
On the filesystem, you can run other tools on your data which isn't possible if the files are stored in a DB.
I would store them in the database:
Backup/restore is easy (if you backup files and also the database, point-in-time recovery is more complicated)
Transactions in the db mean you should never end up pointing at a file-name that is not there
Less chance someone is going to figure out a sneaky way of putting a script onto your server via a dodgy image upload hack
Since you are talking about a small number of images, ease of use/administration should take preference over performance issues which are debated in the linked questions.
I think there is a manageability advantage in storing them in the database; they can be backed up and restored consistently with the other data - you won't forget to delete obsolete ones (well, you might, but it's a bit less likely), and if you migrate the database to another machine, the images go with it.

Extreme Sharding: One SQLite Database Per User

I'm working on a web app that is somewhere between an email service and a social network. I feel it has the potential to grow really big in the future, so I'm concerned about scalability.
Instead of using one centralized MySQL/InnoDB database and then partitioning it when that time comes, I've decided to create a separate SQLite database for each active user: one active user per 'shard'.
That way backing up the database would be as easy as copying each user's small database file to a remote location once a day.
Scaling up will be as easy as adding extra hard disks to store the new files.
When the app grows beyond a single server I can link the servers together at the filesystem level using GlusterFS and run the app unchanged, or rig up a simple SQLite proxy system that will allow each server to manipulate sqlite files in adjacent servers.
Concurrency issues will be minimal because each HTTP request will only touch one or two database files at a time, out of thousands, and SQLite only blocks on reads anyway.
I'm betting that this approach will allow my app to scale gracefully and support lots of cool and unique features. Am I betting wrong? Am I missing anything?
UPDATE I decided to go with a less extreme solution, which is working fine so far. I'm using a fixed number of shards - 256 sqlite databases, to be precise. Each user is assigned and bound to a random shard by a simple hash function.
Most features of my app require access to just one or two shards per request, but there is one in particular that requires the execution of a simple query on 10 to 100 different shards out of 256, depending on the user. Tests indicate it would take about 0.02 seconds, or less, if all the data is cached in RAM. I think I can live with that!
UPDATE 2.0 I ported the app to MySQL/InnoDB and was able to get about the same performance for regular requests, but for that one request that requires shard walking, InnoDB is 4-5 times faster. For this reason, and others, I'm dropping this architecture, but I hope someone somewhere finds a use for it...thanks.
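For anyone wanting to try the fixed-shard variant, this is a sketch of what binding a user to one of 256 shards by hash might look like. The question does not say which hash function was used; SHA-256 and the shards/ file naming here are assumptions, and shard_for and shard_path are hypothetical names.

```python
import hashlib

NUM_SHARDS = 256


def shard_for(user_id):
    """Map a user id to a stable shard number in [0, NUM_SHARDS)."""
    # Python's built-in hash() is randomized per process for strings,
    # so use a digest to get the same shard on every run and every server.
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


def shard_path(user_id):
    """File name of the SQLite database holding this user's data."""
    return "shards/shard_%03d.sqlite" % shard_for(user_id)
```

Note that changing NUM_SHARDS later moves most users to a different shard, so the count is effectively fixed once data exists.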
The place where this will fail is if you have to do what's called "shard walking" - which is finding out all the data across a bunch of different users. That particular kind of "query" will have to be done programmatically, asking each of the SQLite databases in turn - and will very likely be the slowest aspect of your site. It's a common issue in any system where data has been "sharded" into separate databases.
If all of the data is self-contained to the user, then this should scale pretty well - the key to making this an effective design is to know how the data is likely going to be used and if data from one person will be interacting with data from another (in your context).
You may also need to watch out for file system resources - SQLite is great, awesome, fast, etc - but you do get some caching and writing benefits when using a "standard database" (i.e. MySQL, PostgreSQL, etc) because of how they're designed. In your proposed design, you'll be missing out on some of that.
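To show what shard walking means in practice, here is a toy sketch. In-memory SQLite databases stand in for the per-user files, and the messages table and the names make_shard and walk_shards are invented for the example.

```python
import sqlite3


def make_shard(rows):
    """Build one in-memory SQLite database standing in for a shard file."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (user_id INTEGER, body TEXT)")
    conn.executemany("INSERT INTO messages VALUES (?, ?)", rows)
    return conn


def walk_shards(shards, sql, params=()):
    """Run the same query on every shard in turn and merge the results in Python."""
    results = []
    for conn in shards:
        results.extend(conn.execute(sql, params).fetchall())
    return results


shards = [
    make_shard([(1, "hello"), (1, "world")]),
    make_shard([(2, "hello")]),
    make_shard([]),
]
matches = walk_shards(shards, "SELECT user_id FROM messages WHERE body = ?", ("hello",))
```

The cost grows with the number of shards touched, and any sorting, limiting or aggregating across shards has to be redone in application code.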
Sounds to me like a maintenance nightmare. What happens when the schema changes on all those DBs?
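One way people keep this manageable, sketched here as an assumption rather than anything the asker did, is to record a schema version inside each SQLite file (PRAGMA user_version) and upgrade a database lazily when it is opened. The MIGRATIONS list and the migrate function are hypothetical.

```python
import sqlite3

# Ordered migrations; entry i upgrades a database from version i to i + 1.
MIGRATIONS = [
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)",
    "ALTER TABLE messages ADD COLUMN created_at TEXT",
]


def migrate(conn):
    """Apply any migrations this database has not seen yet; return the new version."""
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    for target, statement in enumerate(MIGRATIONS[version:], start=version + 1):
        conn.execute(statement)
        # PRAGMA values cannot be bound as parameters; target is always an int here.
        conn.execute("PRAGMA user_version = %d" % target)
    conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]
```

Calling migrate every time a user's file is opened means inactive users are upgraded only when they come back, at the price of having databases at mixed versions on disk.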
http://freshmeat.net/projects/sphivedb
SPHiveDB is a server for SQLite databases. It uses JSON-RPC over HTTP to expose a network interface to SQLite databases. It supports combining multiple SQLite databases into one file. It also supports the use of multiple files. It is designed for the extreme sharding schema -- one SQLite database per user.
One possible problem is that having one database for each user will use disk space and RAM very inefficiently, and as the user base grows the benefit of using a light and fast database engine will be lost completely.
A possible solution to this problem is to create "minishards" consisting of maybe 1024 SQLite databases housing up to 100 users each. This will be more efficient than the DB-per-user approach, because data is packed more efficiently, and lighter than the InnoDB database server approach, because we're using SQLite.
Concurrency will also be pretty good, but queries will be less elegant (shard_id yuckiness). What do you think?
If you're creating a separate database for each user, it sounds like you're not setting up relationships... so why use a relational database at all?
If your data is this easy to shard, why not just use a standard database engine, and if you scale large enough that the DB becomes the bottleneck, shard the database, with different users in different instances? The effect is the same, but you're not using scores of tiny little databases.
In reality, you probably have at least some shared data that doesn't belong to any single user, and you probably frequently need to access data for more than one user. This will cause problems with either system, though.
I am considering this same architecture, as I basically wanted to use the server-side SQLite databases as backup and syncing copies for clients. My idea for querying across all the data is to use Sphinx for full-text search, run Hadoop jobs from flat dumps of all the data to Scribe, and then expose the results as web services. This post gives me some pause for thought, however, so I hope people will continue to respond with their opinions.
Having one database per user would make it really easy to restore an individual user's data, of course, but as @John said, schema changes would require some work.
Not enough to make it hard, but enough to make it non-trivial.

Storing a file in a database as opposed to the file system?

Generally, how bad of a performance hit is storing a file in a database (specifically mssql) as opposed to the file system? I can't come up with a reason outside of application portability that I would want to store my files as varbinaries in SQL Server.
Have a look at this answer:
Storing Images in DB - Yea or Nay?
Essentially, the space and performance hit can be quite big, depending on the number of users. Also, keep in mind that Web servers are cheap and you can easily add more to balance the load, whereas the database is the most expensive and hardest to scale part of a web architecture usually.
There are some opposite examples (e.g., Microsoft Sharepoint), but usually, storing files in the database is not a good idea.
Unless, possibly, you write desktop apps and/or know roughly how many users you will ever have. But on something as random and unpredictable as a public web site, you may pay a high price for storing files in the database.
If you can move to SQL Server 2008, you can take advantage of the FILESTREAM support which gives you the best of both - the files are stored in the filesystem, but the database integration is much better than just storing a filepath in a varchar field. Your query can return a standard .NET file stream, which makes the integration a lot simpler.
Getting Started with FILESTREAM Storage
I'd say, it depends on your situation. For example, I work in local government, and we have lots of images like mugshots, etc. We don't have a high number of users, but we need to have good security and auditing around the data. The database is a better solution for us since it makes this easier and we aren't going to run into scaling problems.
What's the question here?
Modern DBMSs such as SQL Server 2008 have a variety of ways of dealing with BLOBs that aren't just sticking them in a table. There are pros and cons, of course, and you might need to think about it a little deeper.
This is an interesting paper, by the late (?) Jim Gray
To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem
In my own experience, it is always better to store files as files. The reason is that the filesystem is optimised for file storage, whereas a database is not. Of course, there are some exceptions (e.g. the much heralded next-gen MS filesystem is supposed to be built on top of SQL Server), but in general that's my rule.
While performance is an issue, I think modern database designs have made it much less of an issue for small files.
Performance aside, it also depends on just how tightly-coupled the data is. If the file contains data that is closely related to the fields of the database, then it conceptually belongs close to it and may be stored in a blob. If it contains information which could potentially relate to multiple records or may have some use outside of the context of the database, then it belongs outside. For example, an image on a web page is fetched on a separate request from the page that links to it, so it may belong outside (depending on the specific design and security considerations).
Our compromise, and I don't promise it's the best, has been to store smallish XML files in the database but images and other files outside it.
We made the decision to store as varbinary for http://www.freshlogicstudios.com/Products/Folders/ halfway expecting performance issues. I can say that we've been pleasantly surprised at how well it's worked out.
I agree with @ZombieSheep.
Just one more thing - I generally don't think that databases actually need be portable because you miss all the features your DBMS vendor provides. I think that migrating to another database would be the last thing one would consider. Just my $.02
Having to parse a blob (image) into a byte array, write it to disk under the proper file name, and then read it back is enough overhead to discourage you from doing this too often, especially if the files are rather large.
Not to be vague or anything, but I think the type of 'file' you will be storing is one of the biggest determining factors. If you are essentially talking about a large text field which could be stored as a file, my preference would be for DB storage.
Interesting topic.
There is no single correct answer to this question.
There are a few key elements to consider:
What’s your database engine?
What’s the route of the file from the database to the end user and/or back?
What are the security requirements?
If the files are meant for a public audience and are accessible via a website, you shouldn’t even consider storing them in the database. Use some smart indexing for the files instead.
If the files contain highly sensitive information, then it might be worth storing them in the database. But you have to implement proper secure gateways too.
If performance is crucial, it’s better not to store files in the database.
Backing up, restoring, and migrating the database might become a nightmare if it grows big just because of files. If you are a DBA, you would like to kill the person who “invented” the idea of putting files into the database.
I recommend storing files in the database only as a last option, when there is absolutely no better alternative available.
