Strategies for building a database of 30M images [closed]

Summary
I am facing the task of building a searchable database of about 30 million images (of different sizes) associated with their metadata. I have no real experience with databases so far.
Requirements
There will be only a few users and the database will be almost read-only; anything that does get written will come from a controlled automatic process. Downtime for maintenance should be no big issue. We will probably run more or less complex queries on the metadata.
My Thoughts
My current idea is to save the images in a folder structure and build a relational database on the side that contains the metadata as well as links to the images themselves. I have read about document-based databases. I am sure they are reliable, but then the images would presumably only be accessible through a database query; is that true? In that case I am worried that future users of the data might be faced with the problem of learning how to query the database before actually getting anything done.
Question
What database could/should I use?

For certain database systems it is even recommended to store big fields that are not used in queries outside the "lookup table", so keeping the 30M images in the file system does not seem unusual.
As to "which database", that depends on the frameworks you intend to work with, how complicated your queries usually are, and what resources you have available.
I had some complicated queries run for minutes on MySQL that were done in seconds on PostgreSQL and vice versa. Didn't do the tests with SQL Server, which is the third RDBMS that I have readily available.
One thing I can tell you: whatever you can do in the DB, do it in the DB. You won't get anywhere near the same performance if you pull all the data out of the database and then do the matching in the framework code.
A second thing I can tell you: Indexes, indexes, indexes!

It doesn't sound like the data is very relational, so a non-relational DBMS like MongoDB might be the way to go. With any DBMS you will have to use queries to get information out of it. However, if you're worried about future users, you could put a software layer between the user and the DB that makes querying easier.
Storing images in the filesystem and metadata in the DB is a much better idea than storing large BLOBs in the DB (IMHO). I would also note that filesystem performance will be better if you spread the files over many folders and subfolders rather than keeping 30M images in one big folder (citation needed).
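Putting the two answers above together, here is a minimal sketch of the "images on disk, metadata plus path in the database" layout: a hashed two-level folder structure so no single directory has to hold millions of files, plus an indexed metadata table that stores the path. The table, column names, hash-based layout and connection string are assumptions made up for illustration, not anything prescribed by the answers.

    import hashlib
    import pathlib
    import psycopg2  # assumes a PostgreSQL metadata DB; any RDBMS would do

    IMAGE_ROOT = pathlib.Path("/data/images")  # placeholder mount point

    def path_for(image_id: str) -> pathlib.Path:
        """Spread ~30M files over 256*256 subfolders so no directory gets huge."""
        h = hashlib.md5(image_id.encode()).hexdigest()
        return IMAGE_ROOT / h[:2] / h[2:4] / f"{image_id}.jpg"

    conn = psycopg2.connect("dbname=images user=app")  # placeholder credentials
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS image_metadata (
                image_id    text PRIMARY KEY,
                file_path   text NOT NULL,        -- where the file lives on disk
                captured_at timestamptz,
                camera      text
            )
        """)
        # "Indexes, indexes, indexes": index whatever the metadata queries filter on.
        cur.execute(
            "CREATE INDEX IF NOT EXISTS image_metadata_captured_at_idx "
            "ON image_metadata (captured_at)"
        )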

Related

Should databases be separated based on size and load? [closed]

I'm developing a web backend with two modules. One handles a relatively small amount of data that doesn't change often. The other handles real-time data that's constantly being dumped into the database and never gets changed or deleted. I'm not sure whether to have separate databases for each module or just one.
The data between the modules is interconnected quite a bit, so it's a lot more convenient to have it in a single database.
But if anything fails, I need the first database to be available for reads as soon as possible; the second one can wait.
Also, I'm not sure how much of a performance impact the constantly growing second database would have on the first one.
I'd also like to make dumps of the data available to the public, and I don't want users downloading gigabytes that they don't need.
And if I decide to use a single one, how easy is it to separate them later? I use Postgres, btw.
Sounds like you have a website with its content being the first DB, and some kind of analytics being the second DB.
It makes sense to separate those physically (as in on different servers), especially if one of them is required to be available as much as possible. Separating mission-critical parts from something less important is good design. Also, a smaller DB means shorter recovery times from a backup, should the need arise.
For the data that is interconnected, if you need remote lookup from one DB into another, Foreign Data Wrappers may help.
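To make the Foreign Data Wrapper suggestion concrete, here is a minimal sketch of exposing one table from the second (analytics-style) database inside the first one via postgres_fdw, driven from Python. Every host, database name, credential and the events table are placeholders invented for the example.

    import psycopg2

    # Connect to the "content" database and teach it about the remote one.
    conn = psycopg2.connect("dbname=content user=app host=localhost")  # placeholder
    conn.autocommit = True

    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS postgres_fdw")
        cur.execute("""
            CREATE SERVER analytics_srv FOREIGN DATA WRAPPER postgres_fdw
                OPTIONS (host 'analytics-host', dbname 'analytics', port '5432')
        """)
        cur.execute("""
            CREATE USER MAPPING FOR CURRENT_USER SERVER analytics_srv
                OPTIONS (user 'readonly', password 'secret')
        """)
        # Expose one remote table locally; afterwards it can be joined like any
        # ordinary table, just with remote-lookup latency on every access.
        cur.execute("""
            IMPORT FOREIGN SCHEMA public LIMIT TO (events)
                FROM SERVER analytics_srv INTO public
        """)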

What type of database is best for storing array- or object-like data [closed]

I'm just curious what the best method would be if I'm trying to have a bot running on my Node server that I could play Blackjack against.
For multiple clients connected via sockets, each connected socket will have its own bot to play against, but I need some way to keep track of the bot's available cards each time a client sends a POST request with whatever card they pull out of their deck.
I figured MySQL would get messy really quickly because I cannot just store an array or an object and splice out each card as it gets used, but I'm not really familiar with which database would specialize in this kind of use.
If I didn't make any sense, basically:
I need to store cards for the bot (but for each connected users session) not just 1 deck for 1 person but multiple decks for multiple people.
I'm not asking you to write any code for me, just point me in the direction of which database would be ideal for this kind of setup.
I was thinking maybe Redis or MongoDB?
Redis would probably be fastest, especially if you don't need a durability guarantee - most of the game can be played out using Redis' in-memory datastore, which is probably gonna be faster than writing to any disk in the world. Perhaps periodically, you can write the "entire game" to disk. If the project is not meant for commercial purposes, i.e. computer errors aren't gonna cause players to lose money, this is definitely an enticing choice.
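To make the Redis suggestion concrete, here is a minimal sketch that keeps one shuffled deck per connected session as a Redis list and pops a card on every request. The key naming, one-hour expiry and use of the redis-py client are assumptions for illustration only.

    import random
    import redis  # redis-py client

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    RANKS = "23456789TJQKA"  # T = 10
    SUITS = "shdc"

    def new_game(session_id: str) -> None:
        """Create and shuffle a fresh 52-card deck for this socket session."""
        deck = [rank + suit for rank in RANKS for suit in SUITS]
        random.shuffle(deck)
        key = f"deck:{session_id}"
        r.delete(key)
        r.rpush(key, *deck)
        r.expire(key, 3600)  # drop abandoned games after an hour

    def draw_card(session_id: str):
        """Pop the bot's next card; returns None when the deck is empty."""
        return r.lpop(f"deck:{session_id}")

    # Usage inside a POST handler (rough flow): call new_game(sid) at the start
    # of a round, then draw_card(sid) each time the client reports a pull.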
MongoDB is popular, especially easy to get started with on Node, and is definitely faster than most relational SQL solutions, but transactions may be a problem. For a prototype or proof-of-concept project, it should do fine. But you may want to look at other "NoSQL" solutions as well.
Cassandra is another popular NoSQL DB (a wide-column store rather than a document store), and many people prefer it over MongoDB for various reasons, most notably better scalability.
The choice really depends on how you model your data. In your current scenario, I know you want to simply store an object/array, which sounds like you are basically going the way of the aggregated document (MongoDB). You are, in effect, "denormalizing" the entire DB into an aggregate and performing reads/writes on the entire object every single time in order to achieve consistency. This is a prevalent technique in MongoDB and other document-oriented DBs. But do note that this solution only works because you are not operating across partitions. Think about what happens when you have multiple servers serving the application and writing to a separate DB cluster.
You've really got to analyze and decide for yourself what is the best way to model the data, if scalability is a concern. Would it be a better model to NOT continually write to this array? For example, generate the sequence of cards once, store it in the DB as a Game, and only do reads on it to draw cards. Then each player's move can be stored as a very succinct data structure, a Hit, referencing a card from the Game. The data becomes very relational again (back to old-school SQL), but the writes are much smaller, and your server never gets into a lock state waiting for players to release the Game object. It may or may not work for your use case, but think about how to model the data for maximum reads and minimum independent writes.
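As a hedged sketch of that "store the deck once, record each draw" model: the games/hits tables and the card encoding below are invented, and SQLite is used in memory only so the example runs anywhere without a server.

    import random
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE games (
            id   INTEGER PRIMARY KEY,
            deck TEXT NOT NULL               -- comma-separated, pre-shuffled cards
        );
        CREATE TABLE hits (
            game_id  INTEGER NOT NULL REFERENCES games(id),
            position INTEGER NOT NULL,       -- index into the game's deck
            PRIMARY KEY (game_id, position)
        );
    """)

    def new_game() -> int:
        """Shuffle once, write the whole deck once."""
        deck = [r + s for r in "23456789TJQKA" for s in "shdc"]
        random.shuffle(deck)
        cur = conn.execute("INSERT INTO games (deck) VALUES (?)", (",".join(deck),))
        return cur.lastrowid

    def draw(game_id: int) -> str:
        """Each draw is a tiny insert; the Game row itself is never rewritten."""
        (drawn,) = conn.execute(
            "SELECT COUNT(*) FROM hits WHERE game_id = ?", (game_id,)
        ).fetchone()
        conn.execute(
            "INSERT INTO hits (game_id, position) VALUES (?, ?)", (game_id, drawn)
        )
        deck = conn.execute(
            "SELECT deck FROM games WHERE id = ?", (game_id,)
        ).fetchone()[0]
        return deck.split(",")[drawn]

    gid = new_game()
    print(draw(gid), draw(gid))  # two distinct cards from this game's deck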
Personally (IMO), if this project is for fun, I'd go with Redis as an in-memory cache layer where most reads/writes happen, and write the game logs into Cassandra. But if this is serious business and I need some real consistency guarantees, I'd probably go back to relational DBs, with a Redis cache layer to speed up reads.
Because there is no one correct answer, the only advice anyone can give is to weigh your application's persistence needs against the strengths/weaknesses of each DB solution, and do a hell of a lot of research before making an important decision like "which technology to use for persistence". For example, there may be long-term problems with MongoDB that you've overlooked; just Google "MongoDB problems" or "MongoDB sucks". Hell, there may even be long-term problems with all current NoSQL offerings with regard to transactions or consistency.

How to decide whether I need to transition away from SQLite [closed]

I'm creating a website using django. It's just about done but it hasn't gone live yet. I'm trying to decide whether SQLite is good enough for the site or if it would be worthwhile to use PostgreSQL now, at the beginning, rather than risk needing to transition to it later. (In this post, I mention PostgreSQL because that's the other contender for me. I'm sure similar analysis could be made with MySQL or Oracle.)
I could use some input from people about how they decide upon what database to use with their django projects.
Here's what I currently understand about this:
From my experience, SQLite is super easy. I don't need to worry about installing some other dependency for it and it pretty much just works out of the box with django.
From my research online, it seems that SQLite is able to handle quite a bit of load before becoming a performance bottleneck.
Here's what I don't know:
What would be involved with me transitioning to PostgreSQL from SQLite? Again, I'm currently in a development-only phase and therefore would not need to transition any database data from SQLite. Is it pretty much just a matter of installing PostgreSQL on the server and then tweaking the settings.py file to use it? I doubt it, but would any of my django code need to change? (I don't have any raw SQL queries - my database access is limited to django's model API.)
From a performance perspective, is PostgreSQL simply better in every way over SQLite? Or does SQLite have certain advantages over PostgreSQL?
Performance aside, does using PostgreSQL offer other deployment benefits over SQLite?
Essentially I'm thinking that SQLite is good enough for my little site. What are the odds of it becoming really popular? Probably not that great. SQLite is working for me now and would require no changes on my end. However, I'm concerned that switching to PostgreSQL at the beginning would be easy, and that I'll kick myself a year from now for not making the transition. I'm torn, though: if I go to PostgreSQL, perhaps it will be unnecessary hassle on my part with no benefit.
Does anyone have general guidelines for deciding between SQLite and something else?
Thanks!
Here are a few things to consider.
SQLite does not allow concurrent writes. If an insert or an update is issued, the entire database is locked, and even readers are not allowed during the short moment of the actual update. If your application is going to have many users that update its state (posting comments, adding likes, etc.), this is going to become a bottleneck. Unpleasant slowdowns will occur from time to time even with a relatively small number of users.
SQLite does not allow several processes to efficiently access a database. There can only be one writing process, even if you have multiple CPUs, and even then the locking mechanism is very inefficient. To ensure data integrity you'd need to jump through many hoops, and every update will be horribly slow. Postgres can reorder locks optimally, lock tables at row level or even update without locking, so it will run circles around SQLite performance-wise, unless your database is strictly read-only.
SQLite does not allow data partitioning or even putting different tables in different tablespaces; everything lives in one file. If you have a "hot" table that is touched very often (e.g. sessions, authorization, stats), you cannot tweak its parameters, put it on an SSD, etc. You can use a separate database for that, though, if relational integrity is not crucial.
SQLite does not have replication or failover features. If your app's downtime is going to cost you money, you'd better have a hot backup database server, ready to take over if the main server goes down. With Postgres, this is relatively painless; with SQLite, hardly.
SQLite does not have online backup or point-in-time recovery capability. If data you receive from users is going to cost you money (e.g. merchant orders or user data under an SLA), you had better back up your data regularly, even continuously. Postgres, of course, can do this; SQLite cannot.
In short: by the time your site stops being a toy, you should have switched already, ideally some time before your first serious load spike, so you can iron out any obvious problems.
Fortunately, the Django ORM makes switching very easy on the Python side: you mostly change the connection string in settings.py. On the actual database side you'll have to do a bit more: profile your most important queries, tweak certain column types and indexes, etc. Unless you know how to tune Postgres yourself, seek the help of people who do; databases have many non-obvious subtleties that influence performance significantly. Deploying Postgres is definitely more involved than SQLite (though not really hard), but the result is far more functional when it comes to operation and maintenance under load.
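To illustrate the "mostly change the connection string in settings.py" step, here is what the switch looks like in a recent Django version. The database name, user and password are placeholders, and the psycopg2 driver has to be installed for the PostgreSQL backend.

    # settings.py -- replacing the default SQLite backend, which looks like:
    #
    #     DATABASES = {
    #         "default": {
    #             "ENGINE": "django.db.backends.sqlite3",
    #             "NAME": BASE_DIR / "db.sqlite3",
    #         }
    #     }
    #
    # ...with PostgreSQL (credentials below are placeholders):
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "mysite",
            "USER": "mysite_app",
            "PASSWORD": "change-me",
            "HOST": "localhost",
            "PORT": "5432",
        }
    }
    # After this, `python manage.py migrate` recreates the schema in Postgres;
    # any existing data would need dumpdata/loaddata or a migration tool.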

Large data - storage and query [closed]

We have a huge dataset of about 300 million records, which will get updated every 3-6 months. We need to query this data (continuously, in real time) to get some information. What are the options: an RDBMS (MySQL), or something else like Hadoop? Which would be better?
300M records is well within the bounds of regular relational databases and live querying should be no problem if you use indexes properly.
Hadoop sounds like overkill unless you really need highly distributed and redundant data, and it will also make it harder to find support if you run into trouble or need help optimizing.
Well, I have a few PostgreSQL databases with some tables with more than 700M records and they are updated all the time.
A query in those tables works very fast (a few milliseconds) and without any problems. Now, my data is pretty simple, and I have indexes on the fields I query.
So I'd say it all depends on what kind of queries you'll be making, and whether you have enough money to spend on fast disks.
As others said, modern RDBMSs can handle such tables, depending on the queries and schema (some optimizations would have to be made). If you have a good key to split the rows by (such as a date column), then partitioning/sharding techniques will help you split the table into several smaller ones (see the sketch below).
You can read more on those and other scaling techniques in a question I asked sometime ago here - Scaling solutions for MySQL (Replication, Clustering)
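To make the date-based partitioning idea concrete, here is a minimal sketch using PostgreSQL declarative partitioning (PostgreSQL 11 or later); MySQL offers a comparable PARTITION BY RANGE. The table, column names and connection string are invented for illustration.

    import psycopg2

    conn = psycopg2.connect("dbname=bigdata user=app")  # placeholder credentials
    conn.autocommit = True

    with conn.cursor() as cur:
        # The parent table is split by a date column, as suggested above.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS readings (
                device_id   bigint      NOT NULL,
                recorded_at timestamptz NOT NULL,
                payload     jsonb
            ) PARTITION BY RANGE (recorded_at)
        """)
        # One partition per year keeps each child table (and its indexes) small.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS readings_2024 PARTITION OF readings
                FOR VALUES FROM ('2024-01-01') TO ('2025-01-01')
        """)
        # The index the continuous, real-time queries will rely on.
        cur.execute("""
            CREATE INDEX IF NOT EXISTS readings_device_time_idx
                ON readings (device_id, recorded_at)
        """)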
300 million records should pose no problems to a top-end RDBMS like Oracle, SQL Server, or DB2. I'm not sure about MySQL, but I'm pretty sure it gets used for some pretty big databases these days.
300 Million does not really count as huge these days :-).
If you are mostly querying, and, you know more or less what form the queries will take then MySQL tables with the appropriate indexes will work just fine.
If you are constantly applying updates at the same time as you are running queries, then choose PostgreSQL, as it has better concurrency handling.
MS SQLServer, Sybase, Oracle and DB2 will all handle these volumes with ease if your company prefers to spend money.
If on the other hand you intend to do truly free format queries on unstructured data then Hadoop or similar would be a better bet.

Data Model for Workflow/Business Process Application [closed]

What should the data model be for a workflow application? Currently we are using an Entity-Attribute-Value (EAV) based model in SQL Server 2000, with users having the ability to create dynamic forms (on ASP.NET), but as the data grows performance is degrading, it is hard to generate reports, and things get worse when too many users query the data concurrently.
As you have probably realized, the problem with an EAV model is that tables grow very large and queries grow very complex very quickly. For example, EAV-based queries typically require lots of subqueries just to get at the same data that would be trivial to select if you were using more traditionally-structured tables.
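To illustrate the point about subqueries, here is a small self-contained example; the form_values table, its attributes and the shipping data are invented, and SQLite is used only so it runs anywhere. Pulling three "columns" out of an EAV table already takes three correlated subqueries, where a normalized table would need a plain SELECT.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE form_values (          -- classic EAV: one row per field value
            entity_id INTEGER NOT NULL,
            attribute TEXT    NOT NULL,
            value     TEXT
        );
        INSERT INTO form_values VALUES
            (1, 'ship_date', '2024-01-05'),
            (1, 'carrier',   'DHL'),
            (1, 'weight_kg', '12.5');
    """)

    # Every "column" you want back needs its own correlated subquery against the
    # same table, which is why EAV reporting queries balloon so quickly.
    eav_query = """
        SELECT e.entity_id,
               (SELECT value FROM form_values
                 WHERE entity_id = e.entity_id AND attribute = 'ship_date') AS ship_date,
               (SELECT value FROM form_values
                 WHERE entity_id = e.entity_id AND attribute = 'carrier')   AS carrier,
               (SELECT value FROM form_values
                 WHERE entity_id = e.entity_id AND attribute = 'weight_kg') AS weight_kg
        FROM (SELECT DISTINCT entity_id FROM form_values) AS e
    """
    print(conn.execute(eav_query).fetchall())

    # The same data in a fixed, normalized table would be a trivial query:
    #   SELECT entity_id, ship_date, carrier, weight_kg FROM shipping_forms;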
Unfortunately, it is quite difficult to move to a traditionally-structured relational model while simultaneously leaving old forms open to modification.
Thus, my suggestion: consider closing changes on well-established forms and moving their data to standard, normalized tables. For example, if you have a set of shipping forms that are not likely to change (or whose change you could manage by changing the app because it happens so rarely), then you could create a fixed table and then copy the existing data out of your EAV table(s). This would A) improve your ability to do reporting, B) reduce the amount of data in your existing EAV table(s) and C) improve your ability to support concurrent users / improve performance because you could build more appropriate indices into your data.
In short, think of the dynamic EAV-based system as a way to collect users' needs (they tell you by building their forms) and NOT as the permanent storage. As the forms evolve into their final form, you transition to fixed tables in order to gain the benefits discussed above.
One last thing. If all of this isn't possible, have you considered segmenting your EAV table into multiple, category-specific tables? For example, have all of your shipping forms in one table, personnel forms in a second, etc. It won't solve the querying structure problem (needing subqueries) but it will help shrink your tables and improve performance.
I hope this helps - I do sympathize with your plight as I've been in a similar situation myself!
Typically, when your database schema becomes very large and multiple users are trying to access the same information in many different ways, data warehousing is applied in order to reduce the major load on the database server. Unlike your traditional schema, where you are more than likely using normalization to maintain data integrity, a data warehouse is optimized for speed, and multiple copies of your data are stored.
Try using the relational model of data. It works.
