Optimizing Performance of a Large Dataset in SQL Server - sql-server

I'm using SQL Server running on an Azure VM with 8 SSDs. The SSDs are grouped together in Storage Spaces as 1 disk - in order to increase the capacity and also to combine the IOPS/Throughput. But the "combine the IOPS" part just doesn't seem to be working as far as I can tell by all of my tests/benchmarks (the "combine the throughput" part is working though). In fact, it looks like the SSD performance (IOPS) is better on a single disk than on the whole 8-physical-disk virtual disk. So, I'm thinking about just forgetting about Storage Spaces and splitting up my data across 8 disks.
But what would be the best way to do that? (I don't have much experience with multiple files, or filegroups, or partitioning tables, and that sort of thing.)
Just make 8 mdf files (1 on each disk) and let SQL Server redistribute the data across all of these files? If so, I would like to know how SQL Server knows which disk a given record is on. Would doing this speed things up?
And maybe split up the ldf files too?
What about multiple filegroups? I really don't know what the practical difference is between multiple files and filegroups.
What about splitting up the big tables somehow by using a partitioning function? Would that help, since now, maybe, SQL Server would "have a better idea" of where (in which file) a given record would be - since that is defined by a partition function?
Please don't try to close this question because it seems very general or open-ended. Life is tough enough as it is. This is a very good question. And I'm sure there are a lot of people out there who could give very helpful, experienced answers to this which would help a lot of people. Just because there might not be one exact answer to this question, doesn't mean it's a bad question. And anyway, if you think about it, there IS one best answer to this question - there is a best way to do things in this - very common - situation.

The details you are asking for in a single thread would require too much in-depth research, and the right answer varies from project to project.
I recommend going through Microsoft's official document, "Storage: Performance best practices for SQL Server on Azure VMs", in depth. Work through its checklist and choose the disk type best suited to your use case based on IOPS. You will find the answers to your questions in that document.

Related

What arguments to use to explain why SQL Server is far better than a flat file

The higher-ups in my company were told by good friends that flat files are the way to go, and we should switch from SQL Server to them for everything we do. We have over 300 servers and hundreds of different databases. From just the few I'm involved with we have > 10 billion records in quite a few of them with upwards of 100k new records a day and who knows how many updates... Me and a couple others need to come up with a response saying why we shouldn't do this. Most of our stuff is ASP.NET with some legacy ASP. We thought that making a simple console app that tests/times the same interactions between a flat file (stored on the network) and SQL over the network doing large inserts, searches, updates etc along with things like network disconnects randomly. This would show them how bad flat files can be, especially when you are dealing with millions of records.
What things should I use in my response? What should I do with my demo code to illustrate this?
My short list so far:
Security
Concurrent access
Performance with large amounts of data
Amount of time to do such a massive rewrite/switch and huge $ cost
Lack of transactions
PITA to map relational data to flat files
NTFS doesn't support tons of files in a directory well
Lack of Adhoc data searching/manipulation
Enforcing data integrity
Recovery from network outage
Client delay while waiting for other clients changes to commit
Most everybody stopped using flat files for this type of storage long ago for good reason
Load balancing/replication
I fear that this will be a great post on the Daily WTF someday if I can't stop it now.
Additionally
Does anyone know if anything about HIPAA could be used in this fight? Many of our records are patient records...
Data integrity. First, you can enforce it in a database and cannot in a flat file. Second, you can ensure you have referential integrity between different entities to prevent orphaning rows.
Efficiency in storage depending on the nature of the data. If the data is naturally broken into entities, then a database will be more efficient than lots of flat files from the standpoint of the additional code that will need to be written in the case of flat files in order to join data.
Native query capabilities. You can query against a database natively whereas you cannot with a flat file. With a flat file you have to load the file into some other environment (e.g. a C# application) and use its capabilities to query against it.
Format integrity. The database format is more rigid which means more consistent. A flat file can easily change in a way that the code that reads the flat file(s) will break. The difference is related to #3. In a database, if the schema changes, you can still query against it using native tools. If the flat file format changes, you have to effectively do a search because the code that reads it will likely be broken.
"Universal" language. SQL is somewhat ubiquitous, whereas the structure of a flat file is far more malleable.
I'd also mention data corruption. Most modern SQL databases can have the power killed on the server, or have the server instance crash, and you won't (shouldn't) lose data. Flat files aren't really that way.
Also I'd mention search times. Perhaps even write a simple flat file database with 1mil entries and show search times vs MS SQL. With indexes you should be able to search a SQL database thousands of times faster.
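The comparison suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 as a stand-in for MS SQL, a hypothetical "id,name" file layout, and a smaller row count than the 1 million suggested, so the absolute timings will not match a real SQL Server setup.

```python
import os
import sqlite3
import tempfile
import time

ROWS = 100_000  # hypothetical size; raise it towards 1 million for a real demo


def build_flat_file(path, rows):
    """Write one 'id,name' record per line (assumed flat-file layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for i in range(rows):
            f.write(f"{i},user{i}\n")


def flat_file_lookup(path, name):
    """Linear scan: lines are read one by one until the name matches."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record_id, record_name = line.rstrip("\n").split(",")
            if record_name == name:
                return int(record_id)
    return None


def build_database(rows):
    """Load the same records into an in-memory SQLite table with an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        ((i, f"user{i}") for i in range(rows)),
    )
    conn.execute("CREATE INDEX idx_users_name ON users (name)")
    conn.commit()
    return conn


def indexed_lookup(conn, name):
    """Index seek instead of a full scan."""
    row = conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Look up the last record, which is the worst case for the linear scan.
target = f"user{ROWS - 1}"
fd, flat_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    build_flat_file(flat_path, ROWS)
    flat_result, flat_seconds = timed(flat_file_lookup, flat_path, target)
finally:
    os.remove(flat_path)

conn = build_database(ROWS)
db_result, db_seconds = timed(indexed_lookup, conn, target)

print(f"flat file scan: id={flat_result} in {flat_seconds:.6f}s")
print(f"indexed lookup: id={db_result} in {db_seconds:.6f}s")
```

Both lookups return the same record; only the elapsed times differ. For a fair demo you would put both stores on the same network share and repeat each measurement several times.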
I'd also be careful how quickly you write off flat files. I'd go so far as saying "it's a good idea for many cases, but in our case....". This way you won't sound like you're not listening to the other views. Tact in situations like this is a major thing to consider. They may be horribly wrong, but you have to convince your boss of that.
What do they gain from using flat files? The conversion process will be hundreds of hours - hours they pay for. How quickly can flat files generate a positive return on that investment? Provide a rough cost estimate. Translate the technical considerations into money (costs), and it puts the problem in their perspective.
On top of just the data conversion, add in the hidden costs for duplicating a database's capabilities...
Indexing
Transaction processing
Logging
Access control
Performance
Security
Databases allow you to easily index your data so that you can find particular records or groups of records by searching any number of different columns.
With flat files you have to write your own indexing mechanisms. There is no need to do all that work again when the database does it for you already.
If you use "text files", you'll need to build an interface on top of it which Microsoft has already done for you and called it SQL Server.
Ask your managers if it makes sense to your company to spend all these resources building a home-made database system (because really that's what it is), or would these resources be better spent focusing on the business.
Performance: SQL Server is built for storing conveniently searchable data. It has optimized data structures in memory built with searching/inserting/deleting in mind. Usage of the disk is lowered, as data regularly queried is kept in memory.
Business partners: if you ever plan to do B2B with 3rd party companies, SQL Server has built-in functionality for it called Linked Servers. If you have only a bunch of files, your business partner will give up on you as no data interconnection is possible. Unless you want to re-invent the wheel again, and build an interface for each business partner you have.
Clustering: you can easily cluster servers in SQL Server for high availability and speed, a lot more than what's possible with text based solution.
You have a nice start to your list. The items I would add include:
Data integrity - SQL engines provide built-in mechanisms (relationships, constraints, triggers, etc.) that make it very simple to reduce the amount of "bad" data in your system. You would need to hand code every data constraint separately if you use flat files.
Ad hoc data retrieval - SQL engines, through the use of SELECT statements, provide a means of filtering and summarizing your data with very little code. If you are using flat files, considerably more code is needed to get the same results.
These items can be replicated if you want to take the time to build a data engine, but what would be the point? SQL engines already provide these benefits.
I don't think I can even start to list the reasons. I think my head is going to explode. I'll take the risk though to try to help you...
Simulate a network outage and show what happens to one of the files at that point
Demo the horrors of a half-committed transaction because text files don't pass the ACID test
If it's a multi-user application, show how long a client has to wait when 500 connections are all trying to update the same text file
Try to politely explain why the best approach to making business decisions is to listen to the professionals who you are paying money and who know the domain (in this case, IT) and not your buddy who doesn't have a clue (maybe leave out that last bit)
Mention the fact that 99% (made up number) of the business world uses relational databases for their important data, not text files and there's probably a reason for that
Show what happens to your application when someone goes into the text file and types in "haha!" for a column that's supposed to be an integer
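Two of the demos in this list (the half-committed transaction and the "haha!" in an integer column) can be sketched from the database side. This is a minimal illustration, not the poster's code: it assumes Python's built-in sqlite3 as a stand-in for SQL Server, and the `accounts` table and `transfer` helper are hypothetical names invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint rejects non-integer values and negative balances.
conn.execute(
    "CREATE TABLE accounts ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL"
    " CHECK (typeof(balance) = 'integer' AND balance >= 0))"
)
with conn:
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])


def transfer(conn, src, dst, amount):
    """Move money atomically: both updates commit, or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )


def balances(conn):
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


# Demo 1: the second UPDATE violates the CHECK, so the first one is undone too.
try:
    transfer(conn, src=1, dst=2, amount=500)
    overdraft_rejected = False
except sqlite3.IntegrityError:
    overdraft_rejected = True

# Demo 2: typing "haha!" into an integer column is refused by the engine.
try:
    with conn:
        conn.execute("UPDATE accounts SET balance = ? WHERE id = ?", ("haha!", 1))
    bad_value_rejected = False
except sqlite3.IntegrityError:
    bad_value_rejected = True

print(overdraft_rejected, bad_value_rejected, balances(conn))
```

The flat-file side of the demo would perform the same two writes with plain file I/O and kill the process between them, leaving the file with only one of the two changes applied.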
If you are a public company, the shareholders would be well served to know this is being seriously contemplated. "We" all know this is a ridiculous suggestion given the size and scope of your operation. Patient records must be protected, not only from security breaches but from irresponsible exposure to loss - lives may depend upon the data. If the Executives care at all about the patients, THIS should be their highest concern.
I worked with IBM 370 mainframes from '74 onwards and the day that DB2 took over from plain old flat files, VSAM and ISAM was a milestone day. Haven't looked back to flat-file storage, except for streaming data, in my 25 years with RDBMSs of 4 flavors.
If I owned stock in "you", dumping it in a hurry the moment the project took off would seem appropriate...
Your list is a great start of reasons for sticking with a database.
However, I would recommend that if you're talking to a technical person, to shy away from technical reasons in a recommendation because they might come across as biased.
Here are my 2 points against flat file data storage:
1) Security - HIPAA audits require that patient data remain in a secure environment. The common database systems (Oracle, Microsoft SQL, MySQL) have methods for implementing HIPAA-compliant security access. Doing so on a flat file would be difficult, at best.
Side note: I've also seen medical practices that encrypt the patient name in the database to add extra layers of protection & compliance to ensure even if their DB is compromised that the patient records are not at risk.
2) Reporting - Reporting from any structured database system is simple and common. There are hundreds of thousands of developers that can perform this task. Reporting from flat-files will require an above-average developer. And, because there is no generally accepted method for doing reporting off of a flat-file database, one developer might do things different than another. This could impact the talent pool able to work on a home-grown flat-file system, and ultimately drive costs up by having to support that type of a system.
I hope that helps.
How do you create a relational model with plain text files?
Or are you planning to use a different file for each entity?
Pro file system:
Stable (less lines of code = less bugs, easier to understand, more reliable)
Faster with huge data blobs
Searching/sorting is somewhat slow (but sort can be faster than SQL's order by)
So you'd choose a filesystem to create log files, for example. Logging into a DB is useless unless you need to do complex analysis of the data.
Pro DB:
Transactions (which includes concurrent access)
It can search through huge amounts of records (but not through huge blobs of data)
Chopping the data in all kinds of ways with queries is easy (well, if you know your SQL and the special "oddities" of your DB)
So if you need to add data rarely but search it often, select parts of it by certain criteria or aggregate values, a DB is for you.
NTFS does not support mass amounts of .txt files well. Depending on how a flat file system is developed, the health of a harddrive can become an issue. A lot of older file systems use mass amount of small .txt files to store data. It's bad design, but tends to happen as a flat file system gets older.
Fragmentation becomes an issue, and you lose a text file here and there, causing you to lose small amounts of data. Health of a hard drive should not be an issue when it comes to database design.
This is indeed, on the part of your employer, a MAJOR WTF if he's seriously proposing flat files for everything...
You already know the reasons (oh - add Replication / Load Balancing to your list) - what you need to do now is to convince him of them. My approach on this would be twofold.
First of all, I would write a script in whatever tool you currently use to perform a basic operation using SQL, and have it timed. I would then write another script in which you sincerely try to get a flat text solution working, and then highlight the difference in performance. Give him both sets of code so he knows you aren't cheating.
Point out that technology evolves, and that just because someone was successful 20 years ago, this does not automatically entitle them to a credible opinion now.
You might also want to mention the scope for errors in decoding / encoding information in text files, that it would be trivial for someone to steal them, and the costs (justify your estimate) in adapting the current code base to use text files.
I would then ask serious questions of management - foremost amongst them, and I would ask this DIRECTLY, is "Why are you prepared to overrule your technical staff on technical matters" based on one other individual's opinion - especially when said individual is not as familiar with our set up as we are...
I'd also then use the phrase "I do not mean to belittle you, but I seriously feel I have to intervene at this point for the good of the company..."
Another approach - turn the tables - have Mr. Wonderful supply arguments as to why text files are the way forward. You'll then either a) Learn something (not likely), or b) Be in a position to utterly destroy his arguments.
Good luck with this - I feel your pain...
Martin
I suggest you get your retaliation in first, post on Daily WTF now.
As to your question: a business reason would be why does your boss want to rewrite all your systems. From scratch as you would, effectively, have to write your own database system.
For a development reason, you would lose access to the SQL server ecosystem, all the libraries, tools, utilities.
Perhaps the guy that suggested this is actually thinking of going into competition with your company.
Simplest way to refute this argument - name a Fortune 500 company that processes data on this scale using flat files.
Now name a fortune 500 company that doesn't use a relational database...
Case closed.
Something is really fishy here. For someone to get the terminology right ("flat file") but not know how overwhelmingly stupid an idea that is, it just doesn't add up. I would be willing to bet your manager is non-technical, but the person your manager is talking to is. This sounds more like a lost-in-translation problem.
Are you sure they don't mean NoSQL? If you are in a document-centric environment, moving away from a relational database actually does make sense in some regards, while still keeping many of the positives of a traditional RDBMS.
So, instead of justifying why SQL is better than flat files, I would invert the problem and ask what problems flat files are meant to solve. I would put odds on money that this is a communication problem.
If it's not, and your company is actually considering replacing its DB with a home-grown flat file system on the recommendation of "a friend", convincing your manager why he is wrong is the least of your worries. Instead, dust off your resume and start circulating it.
• Amount of time to do such a massive rewrite/switch and huge $ cost
It's not just the amount of time, it is the introduction of new bugs. A rewrite of these proportions would cause things that currently work to break.
I'd suggest giving him a cost estimate of the hours to do such a rewrite for just one system, and then the number of systems that would need to change. Once they have a cost estimate, they will run from this as fast as they can.
Managers like numbers, so do a formal written decision analysis. Compare the two proposals by benefits and risks, side by side with numeric values. When you get to cost 0 to maintain and 100,000,000 to convert they will get the point.
People who don't distinguish between flat files and SQL won't understand all the arguments listed above.
The explanation must be as simple as possible, something like this:
SQL is a kind of search/concurrency wrapper around flat files.
All the problems that exist currently will remain even if the company writes that wrapper from scratch.
You must also offer some other way to resolve the current problems; use smart words like advanced BLL or install/uninstall scripting environment. :)
You have to speak executive. Without saying it, make them realize they're in way over their heads here. Here's some ammunition:
Database theory is hardcore computer science. We're talking about building a scalable system that can handle millions of records and tolerate disasters without putting everyone out of business.
This is the work of PhD-level specialists. They've been refining the field for a good 20 years now, and the great thing about that is this: it allows us to specialize in building business systems.
If you have to, come right out and say that this just isn't done in the enterprise. It would be costly and the result would be inferior. It's exactly the kind of wheel that developers love to reinvent, and in my opinion the only time you should is if the result is going to be a product or service that you can sell. And it won't be.

What settings for a read database, and what settings for a write database?

I am implementing replication for a project I am developing, and would like to replicate changes in the Write database to the Read database.
While this isn't a problem, I want to tune one database for reading from, and the other to writing to, so they would have different settings.
Is there any resource/guide which will tell me what concepts to look into? I'm not looking for a how-to guide (then again, at this level, these tasks are probably too involved to have guides).
Thanks
Index your databases differently. You probably need different indexes (maybe fewer indexes) to support the process of writing to the Write database than you do for the Read database. If an index is only used for reading, then leave it off the Write database.
I'm no expert on this, and my thinking might be fuzzy, but consider the hardware/memory/and even RAID configurations. I can't remember.... would one RAID configuration be more suited for writing and another for reading, or is that wrong...?
The most obvious difference will be the differing indexes required.
Disk IO patterns will also be different, but don't forget that the read database is also being written to by the replication procedure, so you can't just optimise it completely for reads.
Other differences may also be evident in things like optimum memory configuration and the amount of CPU horsepower the 2 servers require. Your first step will be to get some idea of the sort of workload each server will have to handle, and how much work.
I haven't got any specific links, but the Microsoft site does have several papers on sizing SQL Server hardware. Once you know the workload the 2 servers will have to handle, you should be able to use the same guide to size both of them and get ideas for their configuration.

Hiring a SqlServer OLTP Specialist, What experience or requirements should I look for? [closed]

I'm running into DB performance issues with an OLTP project I'm working on. Another developer and I have reached the end of our accumulated performance knowledge and are looking for an individual to join the team and help us speed up our application.
For some background we've done schema changes to denormalize pieces of the data, optimized every query, ran multiple database tuning advisers to get our indexes just right, tuned MSSql's server options.
We don't need somebody to come in and tell us joins can be slow and what deadlocking is, we need somebody who knows what to do after exhausting all the steps listed above.
Anybody have any tips or experiences hiring OLTP DBAs to share? What kind of questions can we ask the DBA during the interview process?
It's an odd situation to be in: we know we need somebody who knows more than the current team, but we don't know what questions to ask because we don't know what the next steps are. Does that make sense?
Ok, this tells me something:
For some background we've done schema changes to denormalize pieces of the data, optimized every query, ran multiple database tuning advisers to get our indexes just right, tuned MSSql's server options.
You've already matched or exceeded what 90% of people who call themselves DBAs will be able to do.
The problem is that a lot of DBAs aren't really programmers, they are more on the system administration side of things. You need a DBA programmer who is not only really good at TSQL, but who knows your other programming language(s) as well.
I spend a significant portion of my time on database tuning issues like this, and the solution often involves significant redesigns all the way from the front end back to the database schema. You can't solve these problems in isolation, and without complete control (and total understanding) of the entire architecture, you'll never get the performance you need.
You might be the best person for the job, and it might be smarter for you to hire somebody to take busy work off your plate so you can concentrate on the OLTP performance problems.
You have to be very careful here: you can end up hiring a guru DBA, have him improve the database performance significantly, and still have issues with your application that are rooted in its architecture.
A few ideas:
Take the most complex query you optimized, give it to the candidate DBA with QA, and get him/her to optimise it again. Have them describe what they did and how they did it.
Make sure this person understands hardware: when you would use multiple filegroups, RAID arrays, data partitioning, 64 vs 32 bit performance, etc.
Look for someone who also has some software architecture background.
Ask them a few harder SQL Server questions, e.g. what is the OVER clause? Are GUIDs good primary keys and why? Is int identity preferable?
De-tune your DB back a little to before your own optimization and give it to them. See if they can tune it to perform as good or better compared to the changes you made.
Ask why they chose that technique.
Take up references and find out about the environments they've come from which are most likely to match your own.
A good DBA will be able to tell you in the interview at a high level what steps they would take next. You should pay more attention to the thought processes here, rather than the solution to the problem. When the DBA has given some solutions, go back and quiz them and ask them why the problem should be solved in those ways.
This method will very quickly distinguish the men from the boys, as it were.
How close are you to the max performance of the db? It is very easy to create an OLTP problem that is unsolvable with this technology. As Eric said, total architecture redesign might be in order. More ram, just add more ram :)
Certainly, without seeing the database it is hard to say what the best way to optimize would be. Given what you have done, you likely need to hire a database designer, one with experience designing and tuning databases in the size range that you have. When you ask how they would approach the problem, see whether the interviewee would first look at the poorly performing queries and run Profiler to identify them and see what is happening. The person should be able to answer specific questions on parameter sniffing and how to avoid it, what methods can be used to avoid cursors, why statistics need to be updated, and what makes a query sargable.
There are some common things that I would look at in performance tuning. Is the network maxed out (sometimes it isn't the database)? Is the overall design poorly thought out? Are you using SQL code that in general performs badly? If all your searches allow a wildcard as the first character, for instance, it isn't even possible to get them to be fast. If your joins are on multi-column natural keys, they are slower than they should be. Are you storing more than one piece of information in a field, causing lots of manipulation to get data back out? Are you using cursors? Are you using functions? Are you reusing code when you shouldn't be? Are you always returning the minimum information needed? Are you closing connections? Are you getting deadlocks? Are your table rows too wide? Do you have too many records in particular tables (purging old records or putting them in an archive database can make a huge difference)? How much of your code is row oriented and not set oriented?
These are examples of the kinds of things an experienced database person would look at, and thus the kinds of things they should talk about in the interview.
Some bad code examples (you know what you have already optimized) can give you a good idea of how they would approach tuning. You want someone who is methodical and has depth of SQL knowledge.
There are some good books on performance tuning - I would suggest getting them and getting familiar with them before interviewing.
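To make the "sargable" question concrete, here is a small sketch. It is an assumption-laden illustration rather than a SQL Server test: it uses Python's built-in sqlite3 and a hypothetical `people` table, and it relies on SQLite's EXPLAIN QUERY PLAN, whose output wording differs from SQL Server's execution plans. The point carries over: a predicate on the bare indexed column can seek the index, while wrapping the column in a function forces a scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO people (name) VALUES (?)",
    ((f"name{i}",) for i in range(1000)),
)
conn.execute("CREATE INDEX idx_people_name ON people (name)")


def plan(sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


# Sargable: the indexed column is compared directly, so the index can be searched.
sargable_plan = plan("SELECT id FROM people WHERE name = ?", ("name42",))

# Not sargable: the function call hides the column from the index lookup.
non_sargable_plan = plan("SELECT id FROM people WHERE upper(name) = ?", ("NAME42",))

print("sargable:    ", sargable_plan)
print("non-sargable:", non_sargable_plan)
```

A candidate who can explain why the second plan falls back to a scan, and how to rewrite the predicate or index to avoid it, is showing the kind of depth described here.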

which db should i select if performance of postgres is low

In a web app that supports more than 5000 users, postgres is becoming the bottleneck.
It takes more than 1 minute to add a new user (even after optimizations, and on Win 2k3).
So, as a design issue, which other DB's might be better?
Most likely, it's not PostgreSQL, it's your design. Changing shoes most likely will not make you a better dancer.
Do you know what is causing slowness? Is it contention, time to update indexes, seek times?
Are all 5000 users trying to write to the user table at the same exact time as you are trying to insert 5001st user? That, I can believe can cause a problem. You might have to go with something tuned to handling extreme concurrency, like Oracle.
MySQL (I am told) can be optimized to do faster reads than PostgreSQL, but both are pretty ridiculously fast in terms of # transactions/sec they support, and it doesn't sound like that's your problem.
P.S.
We were having a little discussion in the comments to a different answer -- do note that some of the biggest, storage-wise, databases in the world are implemented using Postgres (though they tend to tweak the internals of the engine). Postgres scales for data size extremely well, for concurrency better than most, and is very flexible in terms of what you can do with it.
I wish there was a better answer for you, 30 years after the technology was invented, we should be able to make users have less detailed knowledge of the system in order to have it run smoothly. But alas, extensive thinking and tweaking is required for all products I am aware of. I wonder if the creators of StackOverflow could share how they handled db concurrency and scalability? They are using SQLServer, I know that much.
P.P.S.
So as chance would have it I slammed head-first into a concurrency problem in Oracle yesterday. I am not totally sure I have it right, not being a DBA, but what the guys explained was something like this: We had a large number of processes connecting to the DB and examining the system dictionary, which apparently forces a short lock on it, despite the fact that it's just a read. Parsing queries does the same thing.. so we had (on a multi-tera system with 1000s of objects) a lot of forced wait times because processes were locking each other out of the system. Our system dictionary was also excessively big because it contains a separate copy of all the information for each partition, of which there can be thousands per table. This is not really related to PostgreSQL, but the takeaway is -- in addition to checking your design, make sure your queries are using bind variables and getting reused, and pressure is minimal on shared resources.
Please change the OS under which you run Postgres - the Windows port, though immensely useful for expanding the user base, is still not on a par with the (much older and more mature) Un*x ports (and especially the Linux one).
I think your best choice is still PostgreSQL. Spend the time to make sure you have properly tuned your application. After you're confident you have reached the limits of what can be done with tuning, start caching everything you can. After that, start thinking about moving to an asynchronous master-slave setup. Also, are you running OLAP-type functionality on the same database you're doing OLTP on?
Let me introduce you to the simplest, most practical way to scale almost any database server if the database design is truly optimal: just double your ram for an instantaneous boost in performance. It's like magic.
PostgreSQL scales better than most, if you are going to stay with a relational db, Oracle would be it. ODBMS scale better but they have their own issues, as in that it is closer to programming to set one up.
Yahoo uses PostgreSQL; that should tell you something about its scalability.
As highlighted above the problem is not with the particular database you are using, i.e. PostgreSQL but one of the following:
Schema design, maybe you need to add, remove, refine your indexes
Hardware: maybe you are asking too much of your server - you said 5k users, but then again very few of them are probably querying the db at the same time
Queries: perhaps poorly defined resulting in lots of inefficiency
A pragmatic way to find out what is happening is to analyse the PostgreSQL log files and find out which queries stand out in terms of:
Most frequently executed
Longest running
etc. etc.
A quick review will tell you where to focus your efforts and you will most likely resolve your issues fairly quickly. There is no silver bullet, you have to do some homework but this will be small compared to changing your db vendor.
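As a rough idea of what such a review involves, the sketch below tallies statements by frequency and duration. It is only an approximation under stated assumptions: the sample lines are invented, and the "duration: ... ms  statement: ..." layout assumes statement durations are being logged (for example via the `log_min_duration_statement` setting); a real log has prefixes and multi-line statements that the dedicated tools handle far better.

```python
import re
from collections import defaultdict

# Assumed line layout: "... duration: <ms> ms  statement: <sql>"
LINE_RE = re.compile(r"duration: (?P<ms>\d+(?:\.\d+)?) ms\s+statement: (?P<sql>.+)")

# Invented sample data, standing in for a real log file.
SAMPLE_LOG = """\
LOG:  duration: 1200.500 ms  statement: INSERT INTO users (name) VALUES ('alice')
LOG:  duration: 3.100 ms  statement: SELECT * FROM users WHERE id = 7
LOG:  duration: 2.900 ms  statement: SELECT * FROM users WHERE id = 42
LOG:  duration: 950.000 ms  statement: INSERT INTO users (name) VALUES ('bob')
LOG:  duration: 3.000 ms  statement: SELECT * FROM users WHERE id = 9
"""


def normalize(sql):
    """Replace literals with '?' so identical query shapes group together."""
    sql = re.sub(r"'[^']*'", "?", sql)  # quoted strings
    sql = re.sub(r"\b\d+\b", "?", sql)  # bare integers
    return sql


def summarize(log_text):
    """Return {query shape: {'count', 'total_ms', 'max_ms'}}."""
    stats = defaultdict(lambda: {"count": 0, "total_ms": 0.0, "max_ms": 0.0})
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not timed statements
        entry = stats[normalize(match["sql"])]
        ms = float(match["ms"])
        entry["count"] += 1
        entry["total_ms"] += ms
        entry["max_ms"] = max(entry["max_ms"], ms)
    return dict(stats)


stats = summarize(SAMPLE_LOG)
most_frequent = max(stats, key=lambda q: stats[q]["count"])
slowest = max(stats, key=lambda q: stats[q]["max_ms"])

print("most frequent:", most_frequent, stats[most_frequent])
print("slowest:      ", slowest, stats[slowest])
```

On this invented sample, the SELECT by id is the most frequent shape while the INSERT into users is by far the slowest, which is the kind of signal that tells you where to look first.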
Good news ... there are lots of utilities to analyse your log files that are easy to use and produce easy-to-interpret results. Here are two:
pgFouine - a PostgreSQL log analyzer (PHP)
pgFouine: Sample report
PQA (ruby)
PQA: Sample report
First, I would make sure the optimizations are, indeed, useful. For example, if you have many indexes, sometimes adding or modifying a record can take a long time.
I know there are several big projects running over PostgreSQL, so take a look at this issue.
I'd suggest looking here for information on PostgreSQL's performance: http://enfranchisedmind.com/blog/2006/11/04/postgres-for-the-win
What version of PG are you running? As the releases have progressed, performance has improved greatly.
I had the same issue previously at my current company. When I first joined them, their queries were huge and very slow; they took 10 minutes to run. I was able to optimize them down to a few milliseconds, or 1 to 2 seconds at most. I learned many things during that time and I will share a few highlights.
Check your query first. Doing an inner join of all the tables that you need will always take some time. One thing I would suggest is to always start off with the table that lets you cut your data down to the rows you actually need.
e.g. SELECT * FROM (SELECT * FROM person WHERE person ilike '%abc') AS person;
If you look at the example above, this will cut your results down to something that you know you need, and you can refine them further by doing an inner join. This is one of the best ways to speed up your query, but there is more than one way to skin a cat. I cannot explain all of them here because there are just too many, but you can modify the example above to suit your needs.
It depends on your Postgres version. Older versions do less to optimize the query internally. One example is that on Postgres 8.2 and below, IN statements are slower than on 8.4.
EXPLAIN ANALYZE is your friend. If your query is running slowly, run EXPLAIN ANALYZE on it to determine which part of it is causing the slowness.
Vacuum your database (VACUUM ANALYZE). This keeps the planner's statistics close to the actual data. A big difference between the statistics and the actual data will make your queries run slowly.
If none of this helps, try modifying your postgresql.conf. Increase the shared memory and experiment with the configuration to better suit your needs.
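The first tip above (cut the data down in a subquery, then join) can be sketched as follows. This is only an illustration: it uses SQLite through Python's sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere, the orders table and the name column are hypothetical, and SQLite's LIKE (case-insensitive for ASCII) plays the role of ILIKE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, person_id INTEGER, total REAL);
    INSERT INTO person VALUES (1, 'Xabc'), (2, 'Bob'), (3, 'yABC');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 2, 7.0), (12, 3, 9.0), (13, 3, 1.0);
""")

# Filter person first, then join only the surviving rows to orders.
rows = conn.execute("""
    SELECT p.name, o.total
    FROM (SELECT id, name FROM person WHERE name LIKE '%abc') AS p
    JOIN orders AS o ON o.person_id = p.id
    ORDER BY o.id
""").fetchall()
print(rows)
```

A query planner may already push such a filter down on its own, so it is worth comparing both forms of the query with EXPLAIN ANALYZE before rewriting anything.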
Hope this helps, but of course, these are just for postgres optimization.
By the way, 5000 users is not many. My databases hold about 200k to a million users.
If you do want to switch away from PostgreSQL, Sybase SQL Anywhere is number 5 in terms of price/performance on the TPC-C benchmark list. It's also the lowest price option (by far) on the top 10 list, and is the only non-Microsoft and non-Oracle entry.
It can easily scale to thousands of users and terabytes of data.
Full disclosure: I work on the SQL Anywhere development team.
We need more details: What version are you using? What is the memory usage of the server? Are you vacuuming the database? Your performance problems might be unrelated to PostgreSQL.
If reads far outnumber writes, you may want to try MySQL, assuming that the problem is with Postgres; but your problem is a write problem.
Still, you may want to look into your database design, and possibly consider sharding. For a really large database you may still have to look at the above two issues regardless.
You may also want to look at non-RDBMS database servers or document-oriented ones, like Mnesia and CouchDB, depending on the task at hand. No single tool will manage all tasks, so choose wisely.
Just out of curiosity, do you have any stored procedures that may be causing this delay?

How would you approach this data processing task? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I have a file containing 250 million website URLs, each with an IP address, page title, country name, server banner (e.g. "Apache"), response time (in ms), number of images and so on. At the moment, these records are in a 25 GB flat file.
I'm interested in generating various statistics from this file, such as:
number of IP addresses represented per country
average response time per country
number of images vs. response time
etc etc.
My question is: how would you achieve this type and scale of processing in a reasonable time, and what platform and tools would you use?
I am open to all suggestions, from MS SQL on Windows to Ruby on Solaris, all suggestions :-) Bonus points for DRY (don't repeat yourself), I'd prefer not to write a new program each time a different cut is required.
Any comments on what works and what's to be avoided would be greatly appreciated.
Step 1: get the data into a DBMS that can handle the volume of data. Index appropriately.
Step 2: use SQL queries to determine the values of interest.
You'll still need to write a new query for each separate question you want answered. However, I think that is unavoidable. It should save you replicating the rest of the work.
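The two steps can be sketched in a few lines. This is a minimal illustration, not a recommendation of a specific product: it uses Python's built-in sqlite3 module with an in-memory database, and the tab-separated column layout (url, ip, country, response_ms, images) is an assumption, since the question does not say how the file is organised.

```python
import csv
import io
import sqlite3

# Hypothetical tab-separated layout: url, ip, country, response_ms, images
sample = io.StringIO(
    "http://a.example/\t192.0.2.1\tFrance\t120\t3\n"
    "http://b.example/\t192.0.2.1\tFrance\t80\t10\n"
    "http://c.example/\t192.0.2.7\tFrance\t100\t0\n"
    "http://d.example/\t198.51.100.4\tJapan\t40\t5\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sites (url TEXT, ip TEXT, country TEXT,"
    " response_ms INTEGER, images INTEGER)"
)

# Step 1: load the rows, then index the column the queries group on.
conn.executemany(
    "INSERT INTO sites VALUES (?, ?, ?, ?, ?)",
    csv.reader(sample, delimiter="\t"),
)
conn.execute("CREATE INDEX idx_sites_country ON sites (country)")

# Step 2: one query per question; only the SQL text changes between cuts.
per_country = conn.execute("""
    SELECT country, COUNT(DISTINCT ip), AVG(response_ms)
    FROM sites
    GROUP BY country
    ORDER BY country
""").fetchall()
print(per_country)
```

Each new statistic is then a new SELECT against the same table rather than a new program.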
Edited:
Note that although you probably can do a simple upload into a single table, you might well get better performance out of the queries if you normalize the data after loading it into the single table. This isn't completely trivial, but will likely reduce the volume of data. Making sure you have a good procedure (which will probably not be a stored procedure) for normalizing the data will help.
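One possible shape for that normalization step, sketched with Python's sqlite3 module. The table and column names (raw_sites, country, site) are hypothetical, and only the country field is factored out here; the same pattern would apply to other repeated values such as the server banner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging table as it might look right after the raw load.
    CREATE TABLE raw_sites (url TEXT, country TEXT, server TEXT);
    INSERT INTO raw_sites VALUES
        ('http://a.example/', 'France', 'Apache'),
        ('http://b.example/', 'France', 'nginx'),
        ('http://c.example/', 'Japan',  'Apache');

    -- Each distinct country name is stored once and referenced by id.
    CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    INSERT INTO country (name)
        SELECT DISTINCT country FROM raw_sites ORDER BY country;

    CREATE TABLE site (
        url TEXT,
        country_id INTEGER REFERENCES country(id),
        server TEXT
    );
    INSERT INTO site
        SELECT r.url, c.id, r.server
        FROM raw_sites AS r JOIN country AS c ON c.name = r.country;

    DROP TABLE raw_sites;
""")

countries = conn.execute("SELECT name FROM country ORDER BY name").fetchall()
sites = conn.execute(
    "SELECT s.url, c.name FROM site AS s"
    " JOIN country AS c ON c.id = s.country_id ORDER BY s.url"
).fetchall()
print(countries)
print(sites)
```

With 250 million rows, replacing a repeated country string by a small integer key is where the reduction in data volume would come from.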
Load the data into a table in a SQL Server (or any other mainstream db) database, and then write queries to generate the statistics you need. You would not need any tools other than the database itself and whatever UI is used to interact with the data (e.g. SQL Server Management Studio for SQL Server, TOAD or SqlDeveloper for Oracle, etc.).
If you happen to use Windows, take a look at Log Parser. It can be found as a standalone download and is also included as part of the IIS Resource Kit.
Log Parser can read your logs and upload them to the database.
Database Considerations:
For your database server you will want something that is fast (Microsoft SQL Server, IBM's DB2, PostgreSQL or Oracle). MySQL might be useful too, but I have no experience with large databases on it.
You will want all the memory you can afford. If you will be using the database regularly, I'd say 4 GB at least. It can be done with less, but you WILL notice a big difference in performance.
Also, go for multicore/multi-CPU servers if you can afford it and, again, if you will be using this database regularly.
Another recommendation is to analyze the kind of queries you will be doing and plan the indexes accordingly. Remember: every index you create will require additional storage space.
Of course, turn off the indexing or even drop the indexes before massive data load operations. That will make the load a lot faster. Re-index or re-create the indexes after the data load operation.
Now, if this database will be an ongoing operation (i.e. it is not just to investigate/analyze something and then discard it), you may want to design a database schema with catalog and detail tables. This is called database normalization, and the exact amount of normalization you will want depends on the usage pattern (data load operations versus query operations). An experienced DBA is a must if this database will be used on an ongoing basis and has performance requirements.
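The advice above about dropping indexes before a massive load and re-creating them afterwards might look like this in outline. It is a sketch using Python's sqlite3 module; the sites table, the index name and the synthetic rows are all made up for the example, and the exact statements differ between database servers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (ip TEXT, country TEXT)")
conn.execute("CREATE INDEX idx_sites_country ON sites (country)")

def index_names(conn):
    """List the indexes currently defined in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name"
    )
    return [name for (name,) in rows]

def bulk_load(conn, rows):
    """Drop the index, load everything in one transaction, rebuild the index."""
    conn.execute("DROP INDEX IF EXISTS idx_sites_country")
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO sites VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_sites_country ON sites (country)")

# Synthetic rows standing in for the real file.
rows = (
    (f"10.0.{i // 256}.{i % 256}", "France" if i % 2 else "Japan")
    for i in range(10_000)
)
bulk_load(conn, rows)

loaded = conn.execute("SELECT COUNT(*) FROM sites").fetchone()[0]
print(loaded, index_names(conn))
```

The index is built once over the finished table instead of being updated on every inserted row, which is where the time saving is expected to come from.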
P.S.
I will take the risk to include something obvious here but...
I think you may be interested in a Log Analyzer. These are computer programs that generate statistics from Web Server log files (some can analyze also ftp, sftp and mail server log files).
Web Log Analyzers generate reports with the statistics. Usually the reports are generated as HTML files and include graphics. There is a fair variety in depth of analysis and options. Some are very customizable and some are not. You will find both commercial products and open source ones.
For the amount of data you will be managing, double check each candidate product and take a closer look at its speed and ability to handle it.
One thing to keep in mind when you're importing the data is to try to create indexes that will allow you to do the kinds of queries you want to do. Think about what sort of fields you will be querying on and what those queries might look like. That should help you decide what indexing you will need.
25GB of flat file. I don't think writing any component on your own to read this file will be a good idea.
I would suggest that you go for a SQL import and load all the data into SQL Server. I agree that it would take ages to get this data into SQL Server, but once it is there you can do anything you want with it.
I hope that once this data is in the DB, all you will receive afterwards is deltas, not another 25 GB flat file.
You haven't said how the data in your flat file is organised. The RDBMS suggestions are sensible, but presume that your flat file is formatted in some delimited way and a db import is a relatively simple task. If that is not the case then you first have the daunting task of decompiling the data cleanly into a set of fields on which you can do your analysis.
I'm going to presume that your data is not a nice CSV or TXT file, since you haven't said either way and nobody else has answered this part of the possible problem.
If the data have a regular structure, even without nice clean field delimiters you may be able to turn an ETL tool onto the job, such as Informatica. Since you are a techy and this is a one-off job, you should definitely consider writing some code of your own which does some regex comparisons for extraction of the parts that you want and spits out a file which you can then load into a database. Either way you are going to have to invest some significant effort in parsing and cleansing your data, so don't think of this as an easy task.
If you do write your own code then I would suggest you choose a compiled language and make sure you process the data a single row at a time (or in a way that buffers the reads into manageable chunks).
Either way you are going to have a pretty big job making sure that the results of any process that you apply to the data have been consistently executed; you don't want IP addresses turning up as decimal numbers in your calculations. On data of that scale it can be hard to detect a fault like that.
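To make the row-at-a-time processing and the consistency check concrete, here is a small sketch. It is written in Python for brevity (the answer above recommends a compiled language for the real job), and the three-field tab-separated layout is an assumption. Rows whose IP field is not a well-formed address, such as one that has turned into a plain decimal number, are rejected and their line numbers recorded.

```python
import io
import ipaddress

def parse_line(line, delimiter="\t"):
    """Return (url, ip, response_ms) or raise ValueError for a malformed row."""
    url, ip, response_ms = line.rstrip("\n").split(delimiter)
    ipaddress.ip_address(ip)          # raises ValueError if not a valid address
    return url, ip, int(response_ms)  # raises ValueError if not an integer

def parse_stream(stream, rejected):
    """Yield good rows one at a time; append bad line numbers to rejected."""
    for lineno, line in enumerate(stream, start=1):
        try:
            yield parse_line(line)
        except ValueError:
            rejected.append(lineno)

# Made-up sample: line 2 has a decimal IP, line 3 a non-numeric response time.
sample = io.StringIO(
    "http://a.example/\t192.0.2.1\t120\n"
    "http://b.example/\t3221225985\t80\n"
    "http://c.example/\t198.51.100.4\tfast\n"
    "http://d.example/\t2001:db8::1\t40\n"
)

rejected = []
good = list(parse_stream(sample, rejected))
print(good)
print("rejected lines:", rejected)
```

Because parse_stream is a generator, only one line is held in memory at a time; the list() call is there only to show the result on a tiny sample.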
Once you have parsed it then I think that an RDBMS is the right choice to store and analyse your data.
Is this a one-time thing, or will you be processing things on a daily or weekly basis? Either way, check out vmarquez's answer; I've heard great things about Log Parser. Also check out http://awstats.sourceforge.net/, which is a full-fledged web stats application.
SQL Server Analysis Services is designed for doing exactly that type of data analysis. The learning curve is a bit steep, but once you set up your schema you will be able to do any kind of cross-cutting queries that you want very quickly.
If you have more than one computer at your disposal, this is a perfect job for MapReduce.
Sounds like a job for Perl to me. Just keep count of the stats you want, and use a regex to parse each line. It would probably take less than 10 minutes to parse a file of that size. My computer reads through a 2 GB file (13 million lines) in about 45 seconds with Perl.
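The same single-pass idea, sketched in Python rather than Perl: parse each line with a regex and keep running totals per country. The four-field tab-separated layout is an assumption, and the sample data is made up.

```python
import io
import re
from collections import defaultdict

# Hypothetical layout: url, ip, country, response time in ms
LINE_RE = re.compile(r"^(\S+)\t(\S+)\t([^\t]+)\t(\d+)$")

def aggregate(stream):
    """Return {country: (distinct IP count, average response time in ms)}."""
    ips = defaultdict(set)       # country -> distinct IPs seen
    total_ms = defaultdict(int)  # country -> sum of response times
    rows = defaultdict(int)      # country -> number of rows
    for line in stream:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue  # skip lines that do not fit the assumed layout
        _url, ip, country, ms = m.groups()
        ips[country].add(ip)
        total_ms[country] += int(ms)
        rows[country] += 1
    return {c: (len(ips[c]), total_ms[c] / rows[c]) for c in rows}

sample = io.StringIO(
    "http://a.example/\t192.0.2.1\tFrance\t120\n"
    "http://b.example/\t192.0.2.1\tFrance\t80\n"
    "http://c.example/\t192.0.2.7\tFrance\t100\n"
    "http://d.example/\t198.51.100.4\tJapan\t40\n"
    "garbage line\n"
)

stats = aggregate(sample)
print(stats)
```

Sums and counts need almost no memory, but the sets of distinct IPs grow with the data, so on 250 million rows that part of the state is the one to watch.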

Resources