Large data - storage and query [closed]

We have a huge dataset of about 300 million records, which will get updated every 3-6 months. We need to query this data (continuously, in real time) to get some information. What are the options: an RDBMS (MySQL), or some other option like Hadoop? Which would be better?

300M records is well within the bounds of regular relational databases and live querying should be no problem if you use indexes properly.
Hadoop sounds like overkill unless you really need highly distributed and redundant data, and it will also make it harder to find support when you run into trouble or need to optimize.
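To make "use indexes properly" concrete, here is a minimal sketch (hypothetical table and column names, assuming the MySQL Connector/Python driver) of an index that matches a frequent lookup's WHERE clause:

import mysql.connector  # assumes the MySQL Connector/Python package is installed

conn = mysql.connector.connect(host="localhost", user="app", password="secret",
                               database="warehouse")
cur = conn.cursor()

# Hypothetical table: one composite index per frequent lookup pattern.
cur.execute("""
    CREATE INDEX idx_events_customer_date
        ON events (customer_id, event_date)
""")

# The composite index lets MySQL locate matching rows directly
# instead of scanning all 300M records.
cur.execute("""
    SELECT customer_id, event_date, amount
      FROM events
     WHERE customer_id = %s
       AND event_date >= %s
""", (12345, "2023-01-01"))

for row in cur.fetchmany(10):
    print(row)

conn.close()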

Well, I have a few PostgreSQL databases with some tables with more than 700M records and they are updated all the time.
A query in those tables works very fast (a few milliseconds) and without any problems. Now, my data is pretty simple, and I have indexes on the fields I query.
So, I'd say it will all depend on what kind of queries you'll be making, and whether you have enough money to spend on fast disks.
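If you want to verify that such queries really do hit an index, a quick sketch like this (hypothetical table and columns, assuming the psycopg2 driver) asks PostgreSQL for the actual plan:

import psycopg2  # assumes the psycopg2 driver is installed

conn = psycopg2.connect("dbname=metrics user=app password=secret host=localhost")
cur = conn.cursor()

# EXPLAIN ANALYZE runs the query and reports whether an index scan was used
# and how long it actually took.
cur.execute("""
    EXPLAIN ANALYZE
    SELECT *
      FROM readings
     WHERE sensor_id = %s
       AND taken_at BETWEEN %s AND %s
""", (42, "2023-01-01", "2023-01-31"))

for (line,) in cur.fetchall():
    print(line)   # look for "Index Scan using ..." rather than "Seq Scan"

conn.close()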

As others have said, modern RDBMSs can handle such tables, depending on the queries and schema (some optimizations would have to be made). If you have a good key to split the rows by (such as a date column), then partitioning/sharding techniques will help you split the table into several smaller ones.
You can read more on those and other scaling techniques in a question I asked some time ago here - Scaling solutions for MySQL (Replication, Clustering)
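As a rough illustration of the date-based partitioning idea (hypothetical table and column names, assuming MySQL Connector/Python and MySQL's native range partitioning):

import mysql.connector  # assumes MySQL Connector/Python

conn = mysql.connector.connect(host="localhost", user="app", password="secret",
                               database="warehouse")
cur = conn.cursor()

# Range-partitioning by year keeps each partition small; queries that filter
# on the date column only touch the relevant partitions (partition pruning).
# MySQL requires the partitioning column to be part of the primary key.
cur.execute("""
    CREATE TABLE events_partitioned (
        id          BIGINT NOT NULL,
        event_date  DATE   NOT NULL,
        payload     VARCHAR(255),
        PRIMARY KEY (id, event_date)
    )
    PARTITION BY RANGE (YEAR(event_date)) (
        PARTITION p2022 VALUES LESS THAN (2023),
        PARTITION p2023 VALUES LESS THAN (2024),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    )
""")
conn.close()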

300 million records should pose no problems to a top-end RDBMS like Oracle, SQL Server, or DB2. I'm not sure about MySQL, but I'm pretty sure it gets used for some pretty big databases these days.

300 Million does not really count as huge these days :-).
If you are mostly querying, and you know more or less what form the queries will take, then MySQL tables with the appropriate indexes will work just fine.
If you are constantly applying updates at the same time as you are running queries, then choose PostgreSQL, as it has better concurrency handling.
MS SQL Server, Sybase, Oracle, and DB2 will all handle these volumes with ease if your company prefers to spend money.
If, on the other hand, you intend to do truly free-format queries on unstructured data, then Hadoop or something similar would be a better bet.

Related

In my Oracle DB table there are 21 indexes; is that good or bad? [closed]

My application is running slowly, and while digging through the code I found this; please advise.
It seems to be a lot, but whether it is or not depends heavily on the table structure.
If you've a hundred columns, it may not be a lot of indices.
Of course if you have a hundred columns, your database design is probably pretty bad.
That out of the way, a high number of indices probably leads to poor insert and update performance as all those indices have to be updated with every transaction that hits the table.
Also, you're more likely to face corrupt indices if you have a lot of them, which can/will affect performance.
Anyhow, it's an indication to take a good hard look at your database design and see what can be improved.
But simply blaming your performance issues on the database is not something we can confirm or deny. There are far more factors; the network, for example. Several years ago I had a severe performance problem that was caused neither by the application nor the database, but by a very slow network between the application server and the database server: query results took more than 10 seconds to reach the application server even though the queries themselves took only milliseconds.
The answer is, it depends.
Tables with lots of columns might need a lot of indexes, especially if the table is a crucial one referenced by many different queries. On the other hand, such a table points towards a poor data model that has been kludged by encrusting the table with indexes instead of addressing the real problem and re-modelling it into two or more tables.
Generally, lots of indexes cause performance problems with inserts and deletes, and to a lesser extent updates, because all of the affected indexes have to be synchronised. A multiplicity of indexes doesn't necessarily lead to poor SELECT performance, provided your statistics are up to date so the optimizer makes sensible decisions about which index to use. Remember that for some queries the smartest choice might be to use no index at all and go for a full table scan.
Beyond that I agree with @david. There's no value in guessing about your particular problem. There are many possible reasons why your application is slow. You need to trace your application to find the bottlenecks. Once you know where it spends most of its time, you will know where to start investigating.
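Before dropping anything, it helps to see what those 21 indexes actually cover. Here is a small sketch (assuming the cx_Oracle driver and a hypothetical ORDERS table in your own schema) that lists each index and its columns from Oracle's data dictionary:

import cx_Oracle  # assumes the cx_Oracle driver is installed

conn = cx_Oracle.connect("app", "secret", "localhost/XEPDB1")
cur = conn.cursor()

# user_ind_columns is Oracle's data-dictionary view of index columns
# for tables owned by the current user.
cur.execute("""
    SELECT index_name, column_name, column_position
      FROM user_ind_columns
     WHERE table_name = :t
     ORDER BY index_name, column_position
""", t="ORDERS")   # hypothetical table name

for index_name, column_name, position in cur:
    print(index_name, position, column_name)

conn.close()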

I want to move data from a SQL Server DB to HBase/Cassandra etc. How do I decide which big data database to use? [closed]

I need to develop a plan to move data from a SQL Server DB to one of the big data databases. Some of the questions I have thought of are:
How big is the data?
What is the expected growth rate for this data?
What kind of queries will be run frequently? eg: look-up, range-scan, full-scan etc
How frequently will the data be moved from source to destination?
Can anyone help add to this questionnaire?
Firstly, how big the data is doesn't matter! It can barely be used to decide which NoSQL DB to use, as most NoSQL DBs are built for easy scalability and storage. All that matters is the queries you fire, rather than how much data there is (unless, of course, you intend to use it to store and access very small amounts of data, because that would be a little expensive in many NoSQL DBs). Your first question should be: why consider NoSQL at all? Can't an RDBMS handle it?
Expected growth rate is a parameter worth considering, but again it is not decisive, since most NoSQL DBs support storage of large amounts of data without scalability issues.
The most important one on your list is: what kind of queries will be run?
This matters most because an RDBMS stores data as tuples, which makes it easy to select tuples and output them for smaller amounts of data, and it is faster at executing SELECT * queries thanks to its row-wise storage. Most NoSQL DBs, by contrast, are columnar, i.e. column-oriented DBMSs.
Row-oriented systems: as data is inserted into the table, it is assigned an internal ID, the rowid, which the system uses internally to refer to the data. Records therefore have sequential rowids independent of any user-assigned key (such as an empid).
Column-oriented systems: a column-oriented database serializes all of the values of one column together, then the values of the next column, and so on.
Comparisons between row-oriented and column-oriented databases are typically concerned with the efficiency of hard-disk access for a given workload, as seek time is incredibly long compared to the other bottlenecks in computers.
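A toy illustration of the two layouts in plain Python (hypothetical employee data):

# Three records of a hypothetical employee table.
rows = [
    (10, "Smith", "Sales"),
    (11, "Jones", "Sales"),
    (12, "Brown", "IT"),
]

# Row-oriented storage: each record's values are kept together,
# which suits "give me everything about employee 11".
row_layout = [value for row in rows for value in row]
# -> [10, 'Smith', 'Sales', 11, 'Jones', 'Sales', 12, 'Brown', 'IT']

# Column-oriented storage: all values of one column are kept together,
# which suits scans and aggregations over a single column.
col_layout = [list(col) for col in zip(*rows)]
# -> [[10, 11, 12], ['Smith', 'Jones', 'Brown'], ['Sales', 'Sales', 'IT']]

print(row_layout)
print(col_layout)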
How frequently will the data be moved/accessed? is again a good question, as accesses are costly and some NoSQL DBs are very slow the first time a query is run (e.g. Hive).
Other parameters you may consider are :
Are updates to rows (data in the table) required? (Hive has problems with updates; you usually have to delete and insert again.)
Why are you using the database? (Search, deriving relationships, analytics, etc.) What types of operations do you want to perform on the data?
Will it require relationship searches? Like in case of Facebook Db(Presto)
Will it require aggregations?
Will it be used to relate various columns to derive insights?(like analytics to be done)
Last, but very important: do you want to store the data on HDFS (Hadoop Distributed File System) as files, in your DB's specific storage format, or something else? This matters because your processing depends on how your data is stored: whether it can be accessed directly or needs a query call, which may be time-consuming.
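As one concrete example of a storage format that engines like Hive, Presto, or Spark can query in place, here is a minimal sketch that writes a tiny hypothetical dataset to Parquet, assuming the pyarrow package (written to local disk here; on HDFS in a real cluster):

import pyarrow as pa
import pyarrow.parquet as pq  # assumes the pyarrow package is installed

# A tiny hypothetical dataset written in a columnar format.
table = pa.table({
    "user_id":  [1, 2, 3],
    "event":    ["click", "view", "click"],
    "duration": [1.2, 0.4, 2.5],
})

# Columnar files like this can be queried directly by Hive/Presto/Spark
# without a separate load step.
pq.write_table(table, "events.parquet")
print(pq.read_table("events.parquet", columns=["event"]))  # read just one column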
A couple more pointers:
The type of NoSQL DB that suits your requirements, i.e. key-value, document, column-family, or graph databases.
The CAP theorem, to decide which is more critical among Consistency, Availability, and Partition tolerance.

Strategies to building a database of 30m images [closed]

Summary
I am facing the task of building a searchable database of about 30 million images (of different sizes) associated with their metadata. I have no real experience with databases so far.
Requirements
There will be only a few users; the database will be almost read-only (if things get written, it will be by a controlled automatic process), and downtime for maintenance should be no big issue. We will probably perform more or less complex queries on the metadata.
My Thoughts
My current idea is to save the images in a folder structure and build a relational database on the side that contains the metadata as well as links to the images themselves. I have read about document-based databases. I am sure they are reliable, but would the images then only be accessible through a database query? In that case I am worried that future users of the data might have to learn how to query the database before actually getting things done.
Question
What database could/should I use?
Storing big fields that are not used in queries outside the "lookup table" is recommended for certain database systems, so it does not seem unusual to store the 30m images in the file system.
As to "which database", that depends on the frameworks you intend to work with, how complicated your queries usually are, and what resources you have available.
I had some complicated queries run for minutes on MySQL that were done in seconds on PostgreSQL and vice versa. Didn't do the tests with SQL Server, which is the third RDBMS that I have readily available.
One thing I can tell you: Whatever you can do in the DB, do it in the DB. You won't even nearly get the same performance if you pull all the data from the database and then do the matching in the framework code.
A second thing I can tell you: Indexes, indexes, indexes!
It doesn't sound like the data is very relational, so a non-relational DBMS like MongoDB might be the way to go. With any DBMS you will have to use queries to get information out of it. However, if you're worried about future users, you could put a software layer between the user and the DB that makes querying easier.
Storing images in the filesystem and metadata in the DB is a much better idea than storing large BLOBs in the DB (IMHO). I would also note that filesystem performance will be better if you spread the images over many folders and subfolders rather than keeping 30M images in one big folder (citation needed).
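A minimal sketch of that layout (hypothetical metadata columns, using SQLite purely for illustration): the image bytes go into a hashed folder structure on disk, and the database stores only the metadata plus the file path.

import hashlib
import sqlite3
from pathlib import Path

db = sqlite3.connect("images_meta.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS images (
        id     INTEGER PRIMARY KEY,
        sha256 TEXT UNIQUE,
        path   TEXT,
        width  INTEGER,
        height INTEGER,
        camera TEXT
    )
""")

def store_image(data: bytes, width: int, height: int, camera: str) -> str:
    """Write the image to a hashed folder structure and record its metadata."""
    digest = hashlib.sha256(data).hexdigest()
    # e.g. images/ab/cd/abcd....jpg -- avoids 30M files in a single folder
    path = Path("images") / digest[:2] / digest[2:4] / f"{digest}.jpg"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    db.execute(
        "INSERT OR IGNORE INTO images (sha256, path, width, height, camera) "
        "VALUES (?, ?, ?, ?, ?)",
        (digest, str(path), width, height, camera),
    )
    db.commit()
    return str(path)

# Metadata queries stay in SQL; the files themselves are read from disk.
store_image(b"\xff\xd8\xff\xe0 fake jpeg bytes", 640, 480, "CameraA")
for row in db.execute("SELECT path FROM images WHERE camera = ?", ("CameraA",)):
    print(row[0])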

How to decide whether I need to transition away from sqlite [closed]

I'm creating a website using Django. It's just about done, but it hasn't gone live yet. I'm trying to decide whether SQLite is good enough for the site, or whether it would be worthwhile to use PostgreSQL now, at the beginning, rather than risk needing to transition to it later. (In this post I mention PostgreSQL because that's the other contender for me; I'm sure a similar analysis could be made with MySQL or Oracle.)
I could use some input from people about how they decide which database to use with their Django projects.
Here's what I currently understand about this:
From my experience, SQLite is super easy. I don't need to worry about installing some other dependency for it and it pretty much just works out of the box with django.
From my research online, it seems that SQLite is able to handle quite a bit of load before becoming a performance bottleneck.
Here's what I don't know:
What would be involved in transitioning from SQLite to PostgreSQL? Again, I'm currently in a development-only phase, so I would not need to migrate any data out of SQLite. Is it pretty much just a matter of installing PostgreSQL on the server and tweaking the settings.py file to use it? I doubt it, but would any of my Django code need to change? (I don't have any raw SQL queries; my database access is limited to Django's model API.)
From a performance perspective, is PostgreSQL simply better in every way over SQLite? Or does SQLite have certain advantages over PostgreSQL?
Performance aside, does using PostgreSQL offer other deployment benefits over SQLite?
Essentially I'm thinking that SQLite is good enough for my little site. What are the odds of it becoming really popular? Probably not that great. SQLite is working for me now and would require no changes on my end. However, I'm concerned that using PostgreSQL from the beginning would be easy, and that I'll kick myself a year from now for not making the transition. I'm torn, though: if I go to PostgreSQL, perhaps it will be an unnecessary hassle on my part with no benefit.
Does anyone have general guidelines for deciding between SQLite and something else?
Thanks!
Here are a few things to consider.
SQLite does not allow concurrent writes. If an insert or an update is issued, the entire database is locked, and even readers are shut out during the short moment of the actual update. If your application is going to have many users who update its state (posting comments, adding likes, etc.), this is going to become a bottleneck. Unpleasant slowdowns will occur from time to time even with a relatively small number of users.
SQLite does not allow several processes to efficiently access a database. There can only be one writing process, even if you have multiple CPUs, and even then the locking mechanism is very inefficient. To ensure data integrity you'd need to jump through many hoops, and every update will be horribly slow. Postgres can reorder locks optimally, lock tables at row level or even update without locking, so it will run circles around SQLite performance-wise, unless your database is strictly read-only.
SQLite does not allow data partitioning or even putting different tables into different tablespaces; everything lives in one file. If you have a "hot" table that is touched very often (e.g. sessions, authorization, stats), you cannot tweak its parameters, put it on an SSD, and so on. You can use a separate database for that, though, if relational integrity is not crucial.
SQLite does not have replication or failover features. If your app's downtime is going to cost you money, you'd better have a hot backup database server, ready to take over if the main server goes down. With Postgres, this is relatively painless; with SQLite, hardly.
SQLite does not have online backup and point-in-time recovery capabilities. If the data you receive from users is going to cost you money (e.g. merchant orders or user data under an SLA), you had better back up your data regularly, even continuously. Postgres, of course, can do this; SQLite cannot.
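To illustrate the single-writer limitation above, here is a tiny sketch with Python's standard sqlite3 module (hypothetical file and table names): while one connection holds a write transaction, a second writer is refused.

import sqlite3

# timeout=0 makes the second writer fail immediately instead of waiting;
# isolation_level=None lets us issue explicit BEGIN/COMMIT statements.
writer_a = sqlite3.connect("site.db", timeout=0, isolation_level=None)
writer_b = sqlite3.connect("site.db", timeout=0, isolation_level=None)

writer_a.execute("CREATE TABLE IF NOT EXISTS likes (post_id INTEGER, user_id INTEGER)")
writer_a.execute("BEGIN IMMEDIATE")                    # take the single write lock
writer_a.execute("INSERT INTO likes VALUES (1, 42)")

try:
    writer_b.execute("BEGIN IMMEDIATE")                # second writer is locked out
except sqlite3.OperationalError as exc:
    print("second writer refused:", exc)               # "database is locked"

writer_a.execute("COMMIT")
writer_a.close()
writer_b.close()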
In short: by the time your site stops being a toy, you should have switched already, ideally some time before your first serious load spike, so you can iron out any obvious problems.
Fortunately, the Django ORM makes switching very easy on the Python side: you mostly change the connection settings in settings.py. On the actual database side you'll have to do a bit more: profile your most important queries, tweak certain column types and indexes, and so on. Unless you know how to cook Postgres yourself, seek help from people who do; databases have many non-obvious subtleties that influence performance significantly. Deploying Postgres is definitely more involved than SQLite (though not really hard), and the result is far more functional when it comes to operation and maintenance under load.
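For reference, the settings.py switch is roughly this (a sketch with placeholder credentials; older Django versions use the "django.db.backends.postgresql_psycopg2" engine string):

# settings.py fragment -- SQLite during early development:
from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent.parent

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": BASE_DIR / "db.sqlite3",
    }
}

# ...and the PostgreSQL equivalent you would switch to (only one of these
# assignments lives in the real file; credentials are placeholders):
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mysite",
        "USER": "mysite_app",
        "PASSWORD": "secret",
        "HOST": "localhost",
        "PORT": "5432",
    }
}

After switching, you would run python manage.py migrate to recreate the schema in the new database.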

Sorting on database server or application server in n-tier architecture [closed]

Suppose I am developing an application with a single database server and multiple application servers, where it is cheap and easy to add application servers but difficult to scale the database. Suppose I want to retrieve some information from the database that needs to be sorted. All else being equal, it seems like I should prefer to sort on the application servers, since that shifts the load away from the database, which is hard to scale.
Now there are certainly some cases in which sorting on the database server is a no-brainer:
Sorting is necessary in order to obtain the correct result set. For example, if I want the top N according to some criterion, I obviously have to sort before I even know which rows I want. Sorting on the application server isn't an option here (unless I'm willing to suck down the entire table, which is typically not what I want to do).
There is an index that supports my sort order. In this case sorting on the database server is essentially free.
But other than that, am I generally correct to prefer sorting on the application server? Are there some cases I should consider in addition to those listed above?
My instinct is to sort the data on the database server, as that is one of its prime functions and it is likely extremely efficient at it. The danger, however, is that the data may get re-sorted anyway at the client level, thus wasting work.
If you have a database server that is so stressed it can no longer sort data quickly, you have bigger problems.
If the majority of the queries running on a server have been optimized, if the schema is rational, and the indexes are in place, a database server can do an enormous amount of work without even breaking a sweat.
I'll supplement Jaimal's comment with my own experience using the PostgreSQL DBMS. If you have a large shared buffer pool and you can prepare the statements whose sort performance concerns you, you get a high-performance cache "for free" from your DBMS. If your queries cannot be prepared but you can limit the attributes you need in the result set, you can create an index on those attributes with your sort predicate. If you cannot perform any of these optimizations on the back end, then sorting in the application server will work well.
Regarding performance differences between sorting in an application and in the DBMS, I would expect the application language to have some overhead depending on its object model. For example, I would expect sorting 1,000,000 Ruby objects to be slower than sorting 1,000,000 PostgreSQL tuples.
I believe you are right. In the absence of an index, the database has no performance advantage over sorting on your application server. In fact, on your app server you have control over which sorting algorithm you use, so in principle you could use something like radix sort (O(n) time) rather than quicksort if it applies to your case.
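A sketch of the application-side option (hypothetical row shape): fetch the rows unsorted and sort them in Python, which also lets you choose the sort key freely.

# Rows as they might come back from the database driver, unsorted
# (hypothetical shape: (id, name, score)).
rows = [
    (3, "carol", 71),
    (1, "alice", 94),
    (2, "bob",   87),
]

# Sorting on the application server: Timsort via sorted(), here by score
# descending then name ascending. The database just streams the rows.
ranked = sorted(rows, key=lambda r: (-r[2], r[1]))

for row in ranked:
    print(row)

# The database-side equivalent would push the work into the query instead:
#   SELECT id, name, score FROM players ORDER BY score DESC, name ASC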
If your data doesn't change too often (so you're willing to cache it) and you have a limited number of possible result sets, then you could sort on the database but cache the result set, or cache an array of keys for the result set, saving you from always performing the same sort of the same data.
