Why shouldn't I give outsiders access to my database?


Lots of sites today have APIs that allow users to get data from the site as XML or JSON using an HTTP GET request. Flickr and del.icio.us are examples of sites with APIs. These APIs require the server to access the database, and then output the result as either XML or JSON.
Why do we need this translation though? Why not just create a user on the database (for example MySQL)? The user would be given limited access to the database, only being allowed to SELECT, and only from certain tables and certain columns in those tables. Wouldn't this be a lot more efficient for the server (it wouldn't have to deal with the HTTP request), and easier for developers, who could now access exactly the data they need, the way they need it?

Security considerations aside, an API exists so that you can change your database structure without affecting your clients. Also, poorly formed queries tie up your server, not the clients.

Can you prevent a malicious individual from crafting a super-complex SQL query that will peg your database's CPU at 100%? Can you prevent a lot of innocent programmers from crafting inefficient queries that will never be optimized that will do the same thing?

Coding to Contract - with APIs, you may change everything behind them without affecting outsiders' use of them. Here you'd be tying them not just to MySQL but to your schema.
Caching - Allowing them to run any query removes almost any opportunity for the caching that predictable queries over HTTP make possible. Caching is probably the number one way to relieve what is often the number one bottleneck, the database.
Security - with this approach, a denial of service attack would be easy, even by accident. Not to mention that you'd have to give access to the data layer, which is often put in a restricted zone where security can be tightened.
Usability - not everyone is a developer or wants to understand your internal domain. They probably prefer a pre-baked, straightforward and self-explanatory API. An extreme example would be giving managers DB privileges rather than reports.
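
To make the caching point concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the photos table, the photo_titles function, and SQLite standing in for the real database. Because the API exposes one predictable query shape, repeated requests can be answered from a cache without touching the database at all, which is not possible when clients send arbitrary SQL.

```python
import sqlite3
from functools import lru_cache

# Hypothetical schema, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, owner TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO photos (owner, title) VALUES (?, ?)",
    [("alice", "Sunset"), ("alice", "Harbour"), ("bob", "Bridge")],
)


@lru_cache(maxsize=1024)
def photo_titles(owner: str) -> tuple:
    """The only query shape this 'API' exposes: titles for one owner."""
    rows = conn.execute(
        "SELECT title FROM photos WHERE owner = ? ORDER BY id", (owner,)
    ).fetchall()
    return tuple(title for (title,) in rows)
```

Calling photo_titles("alice") twice runs the SELECT once; photo_titles.cache_info() reports the hit. A real service would also need a cache timeout or invalidation on writes, which this sketch leaves out.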

An API:
Makes it easier to monitor and control usage (implementing 'limited queries per X' for DB users may be harder)
Allows for presenting simpler structures to the user than may be used in the DB.
Means the user doesn't have to understand your DB structure.
Allows for DB portability. (Oh you've grown massive and now need to implement: sharding, move to bigtable, etc. - With an API the user doesn't need to know)
Allows for different (better? / variable?) caching of requests.
Means you don't have to pay for extra DB users (If that's how the DB is licensed.)
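
As a rough sketch of the 'limited queries per X' idea at the API layer (the RateLimiter class, its parameters and the notion of an API key are all hypothetical names for this example, not part of any particular framework):

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most `limit` requests per `window` seconds for each API key."""

    def __init__(self, limit: int, window: float, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # api_key -> timestamps of recent requests

    def allow(self, api_key: str) -> bool:
        now = self.clock()
        hits = self._hits[api_key]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

The API handler would call allow(api_key) before running any query and refuse the request when it returns False. This keeps state in one process only; a deployment with several web servers would need shared storage for the counters.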

Portability too. Let's say that for licensing and scaling reasons you make the business decision to move from MSSQL to MySQL. The syntax ain't quite the same and your clients will all have to change their code.
Much better to just buffer it all off and keep the implementation abstracted away. Who's to say you're not persisting the state of the application using trained monkeys scratching marks on bottletops?

Security is the number 1 reason but I hope those reasons are obvious. The user tying up precious resources with bad queries is another good reason.
Beyond that though, why an abstraction layer?
Might you ever want to add some logging to database queries to diagnose speed or to help debug?
Might you ever go from MySQL to MS SQL or vice versa where SQL other than pure ANSI might break?
Should the customer really have to learn your schema rather than a more logical abstraction?
When a new programmer learns of normalization and can now see your whole schema including your carefully balanced denormalizations, do you want to put up with every uninformed criticism?
When a more experienced db person points out improvements, do you want to be stuck with your old schema?
The question of why to use an API is really the question of why to use abstractions, and my list here barely scratches the surface.

The web server gives you a buffer that you can control. If there is some bug in your SQL server or whatever, you don't want it exposed directly to the internet. True, if the web server has bugs, it might be just as bad ... except you have that extra layer between the data and the world.
-don

It's not so much a 'why not' as a 'why should you' question. Handling HTTP requests is a small penalty for complete control over what data you allow or disallow a user from accessing. Further, should the nature / quantity / security level of the data change in future, you will be better off with a JSON / XML response than with allowing total access.

The thing to bear in mind when you're thinking of security issues is that it's really hard to anticipate all of the possible vectors that someone could use to attack you. For instance, are you really sure you've gotten your database permissions set so that people can't mess things up?
Therefore, you want to try restricting actions to only what you know to be good, not just trying to restrict the things you know to be bad. This can be done with a web service that you have absolute control over, but it's difficult to allow somebody to access the database directly and be sure that you're secure.
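
One way to picture "restricting actions to only what you know to be good" is an allow-list of named requests. This is only a sketch under assumptions: the users table, the request names and the run_request helper are invented for the example, and SQLite stands in for the real database.

```python
import sqlite3

# Hypothetical allow-list: request name -> fixed, parameterized SQL.
ALLOWED_QUERIES = {
    "user_by_id": "SELECT id, name FROM users WHERE id = ?",
    "recent_users": "SELECT id, name FROM users ORDER BY id DESC LIMIT ?",
}


def run_request(conn: sqlite3.Connection, name: str, *params):
    """Run a named query; anything not on the allow-list is refused."""
    sql = ALLOWED_QUERIES.get(name)
    if sql is None:
        raise PermissionError(f"request {name!r} is not allowed")
    # Parameters are bound by the driver, never spliced into the SQL text.
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, password_hash TEXT)")
conn.execute("INSERT INTO users (name, password_hash) VALUES ('Joe', 'x')")
```

With this shape, the password_hash column is unreachable because no allowed query selects it, and a caller cannot invent a new statement. The list of permitted actions is explicit rather than being whatever the database permissions happen to leave open.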

An API is a kind of wrapper around the database. Users do not need to know anything about the database's internal representation of data; they only need to send a number of uniform requests and get uniform responses. How and when the data is processed on the server is not their headache.

Related

Best practices to structure a database to be scaling-ready

I know this is a very generic and subjective question, so feel free to vote to close it if it does not meet the StackOverflow netiquette.. but for me, it's worth trying ;)
I've never built a high-traffic application until now, so I'm not aware (except for some reading on the web) of scaling practices.
How can I design a database so that, when scaling is needed, I don't have to refactor the database structure or the application code?
I know that development (and optimization) should come step by step, optimizing bottlenecks as they happen, and that it is nearly impossible to design the perfect structure when you don't know how many users you'll have and how they will use the database (e.g. read/write ratio). I'm just looking for a good base to start from.
What are the best practices for making a structure almost ready to be scaled with partitioning and sharding, and what hacks must be absolutely avoided?
Edit: some details about my application:
The application will run in a multisite fashion
I'll have a database for each application version (db_0_0_1, db_0_0_2, etc..)*
Every 'site' will have a schema inside a database* and a role that can access only its own schema
Application code will be mostly PHP and few things (daemons and maintenance things) in Python
Web server will probably be Nginx and lighttpd or node.js as support for long-polling tasks (e.g. chat)
Caching will be done with memcached (plus apc for things strictly related to the php code, as it can be used outside php)

The question is really generic, but here are a few tips:
Do not use any session variables (pg_backend_pid(), inet_client_addr()) or per-session control (SET ROLE, SET SESSION) in application code.
Do not use explicit transaction control (BEGIN/COMMIT/SET TRANSACTION) in application code. All such logic should be wrapped in UDFs. This enables stateless, statement-mode pooling which enables fastest possible DB pooling. (see pgbouncer docs, and pg wiki for more info)
Encapsulate all App<->Db communication in well defined DB API of UDFs - this will let you use PL/Proxy. If doing this with all SELECTs is too hard, do it at least for all data writes (INSERT/UPDATE/DELETE). Example: instead of INSERT INTO users(name) VALUES('Joe') you need SELECT create_user('Joe').
Check your DB schema - is it easy to separate all data belonging to a given user? (Most probably this will be the partitioning key.) All that's left is common, shared data which will need to be replicated to all nodes.
Think of caching before you need it. What will the cache key be? What will the cache timeout be? Will you use memcached?
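
To illustrate the partitioning-key tip, here is a minimal sketch of routing by key. It is an assumption-laden example, not how PL/Proxy itself computes its hash: shard_for is a hypothetical helper, and SHA-256 is chosen only because it gives the same answer in every process (Python's built-in hash() does not for strings).

```python
import hashlib


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partitioning key (e.g. a user name) to a shard index."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a shard index.
    return int.from_bytes(digest[:8], "big") % shard_count
```

If every write goes through a function such as create_user('Joe'), the routing decision lives in one place and all of a user's rows land on the same node. Note that plain modulo routing moves most keys when shard_count changes, so the number of shards is usually fixed generously up front.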

Using LDAP server as a storage base, how practical is it?

I want to learn how practical it is to use an LDAP server (say AD) as a storage base. To be more clear: how much sense does it make to use an LDAP server instead of an RDBMS to store data?
I can guess that most of you might just say "it doesn't" but there might be some reasons to make it meaningful (especially business wise);
A few points first;
Each table becomes a container entity and each row becomes a new child entity. Row entities contain attributes for the columns. So you represent your data in this way. (This should be the most meaningful representation I think, suggestions are welcome)
So storing data as in a DB server is possible, but the lack of FK and PK (not sure about PK) support is an issue. On the other hand it supports indexing of attributes (the equivalent of columns), though I'm not sure how efficiently. So consistency of data is the responsibility of the application layer.
Why would somebody do this ever?
The data that the application uses/stores closely matches the existing data in AD (Users, Machines, Department Info etc.). (But some customization of the existing entity schema is still required, and new schema definitions are needed for data that is not closely related.)
(I think the strongest reason would be this business-related one.) Most mid-sized companies have very well configured AD servers (replicated, backed-up etc.) but they don't have such a DB setup (you can comment on this as much as you want). Say you sell software which requires a DB setup to these companies: they must manage their DB setup. But if you say "you don't need DB setup and management; you can just use existing AD", it sounds appealing.
Obviously there are many disadvantages of giving up using DB, feel free to mention them but let's assume they are acceptable. (I can mention more if question is not clear enough.)

LDAP is a terrible tool for maintaining most business data.
Think about a typical one-to-many relationship - say, customer and orders. One customer has many orders.
There is no good way to represent this data in an LDAP directory.
You could try having a mock "foreign key" by making every entry of that given object class have a "foreign key" attribute, but your referential integrity just went out the window. Cascade deletes are impossible.
You could try having a "customer" object that has "order" children. However, you've just introduced a specific hierarchy - you're now tied to it.
And that's the simplest use case. Once you start getting into more complex relationships, you're basically re-inventing an RDBMS in a system explicitly designed for a different purpose. The clue's in the name - directory.
If you're storing a phonebook, then sure, use LDAP. For anything else, use a real database.
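
For comparison, this is roughly what the customer/orders case looks like when the database enforces the relationship itself. It is a sketch with assumed table and column names, using SQLite from Python's standard library as a stand-in for whatever RDBMS you would actually deploy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id) ON DELETE CASCADE,
        item TEXT NOT NULL
    )
    """
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Acme')")
conn.executemany(
    "INSERT INTO orders (customer_id, item) VALUES (?, ?)",
    [(1, "widget"), (1, "gadget")],
)


def order_count() -> int:
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


def try_orphan_order() -> bool:
    """Return True if the database accepted an order for a missing customer."""
    try:
        conn.execute("INSERT INTO orders (customer_id, item) VALUES (99, 'ghost')")
        return True
    except sqlite3.IntegrityError:
        return False
```

Here try_orphan_order() returns False because the foreign key rejects the row, and deleting customer 1 removes both of its orders through the cascade. With a directory, both behaviours would have to be rebuilt in application code.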

For relatively small, flexible data sets I think an LDAP solution is workable. However an RDBMS provides a number of significant advantages:
Backup and Recovery: just about any database will provide ACID properties. And RDBMS backups are generally easy to script and provide several options (e.g. full vs. differential). I just don't know with LDAP, but I imagine these qualities are not as widespread.
Reporting: AFAIK LDAP doesn't offer a way to JOIN values easily, much less do things like calculate summations. So you would put a lot of effort into application code to reproduce those behaviors when you do need reporting. And what application doesn't ultimately?
Indexing: looks like LDAP solutions have indexing, but again, seems hit or miss. Whereas seemingly all databases out there have put some real effort into getting this right.
I think any serious business system's storage should be backed up in the same fashion you believe LDAP is in most environments. If what you're really after is its flexibility in terms of representing hierarchy and ability to define dynamic schemas I'd suggest looking into NoSQL solutions or the Java Content Repository.

LDAP is very useful for storing that kind of information and, if you want to, you may use it. An RDBMS is just more comfortable to use with ORM systems. Your persistence logic with LDAP will be much more complex.
It's also worth mentioning that this is not a standard approach, so the people who support the project will spend more time on analysis.
I've used this approach for fun: I generate a phonebook from Active Directory. But I don't think it's a good idea to use LDAP as a store for business applications.

In short: Use the right tool for the right job.
When people see LDAP, you have already set an expectation of your system. Don't forget that the L stands for Lightweight. LDAP was designed for accessing directories over a network.
With a “directory database” you can build a certain type of application. If you can map your data to a tree-like data structure it will work. I surely would not want to stream videos from LDAP! You can probably hack something together, but I would prefer a streaming server.
There might be some hidden gotchas down the line if you use a tool not designed for what it is supposed to do. So, the downside is you'll have to test stuff that would have been a given in some cases.
It's not just a technical concern. Your operational support team might “frown” on your application, as they would have certain expectations/preconceptions based on your application's architectural nature. Imagine their surprise if you give them a CRM system (website + files and popped email etc.) with an LDAP server as the database to maintain.

If I was in your position, I would steer towards one of the NoSQL db solutions rather than trying to use LDAP. LDAP is fine for things like storing user and employee information, but is terrible to interact with when you need to make changes. A NoSQL db will allow you to store your data how you want without the RDBMS overhead you would like to avoid.

The answer is actually easy. Think of CRUD (Create, Read, Update, Delete). If your system will mostly perform reads, you can think of using LDAP, because LDAP is designed to be quick at read operations. If the other operations will dominate, an RDBMS would be a better option.

An API which allows users to connect directly to the database

I've worked with many APIs and it's never usually an easy task. Messing about with POST requests and then trying to handle the XML is a pain. And I thought wouldn't it be easier for both user and developer if they could just directly interact with the database.
Is it possible to create a database user that API users would connect as, and then assign it certain privileges? For example, they would only be able to select from particular tables and columns. Basically, make it so they can't do anything malicious or anything you don't want.
I realise that there is a lot more than just taking data so there would be certain limitations there however selecting is probably what goes on the most when it comes to API usage.
Is this a practical idea? Is it secure? I'm really not sure, I'm the furthest thing from a professional here, it's just an idea.

You could set up a RESTful API that can speak directly to a MySQL database, like PHPRestSQL. It can do all the dirty work for you, but you would have full freedom in implementing new functions or restrictions.

What do you mean exactly by API, which API are you talking about?
This sounds more like a design decision. If I understand correctly, you want to interact with the User layer and Database / Persistence layer of an application. In general this is a bad idea. First it really reduces code reuse. This may not be a concern at your point in the development but it is a good idea to learn best practices. The layers I usually follow are:
Model-View-Controller
Service
Persistence
Model / Domain
You can see here that MVC (user interface) is separated from the model by at least two layers. This is usually more secure, and promotes code reuse.

Yes you can do this with any client / server database system (if it is a database server there must be a way to connect to it.)
It is not done much because of a number of issues.
Maintenance is hard
Security is worse
In general there is no benefit.
Basically it causes headaches and does not really provide anything which is good.

The two most important counter-questions are:
1) Is the underlying DB already determined, or can you choose one?
2) What sort of DB operations do your users really need to perform? If "select" is really enough then yes, it probably does make sense to expose the data via a "read only" web service. But if you want to update, delete, make stored procedure calls, etc. then you're going to need something like SQL and it's way hard to build a web services API for that.
If the answer to counter-question 1 is "I can choose", then take a look at CouchDB, which already has a RESTful API (http://wiki.apache.org/couchdb/HTTP_REST_AP) built for it.

Yes, almost all databases allow you to create users with only select access to a specific schema. I've used this to give advanced Excel users ODBC access without worrying that they will mess anything up. Use very sparingly--it has always created maintenance difficulty because people end up using parts of your schema in ways that you didn't intend (or had plans to replace).

You can connect Access to any database - Oracle, for example.
However, it's not necessarily a good idea - for security and data integrity reasons.

How much business logic should be in the database?

I'm developing a multi-user application which uses a (postgresql-)database to store its data. I wonder how much logic I should shift into the database?
e.g. When a user is going to save some data he just entered. Should the application just send the data to the database and the database decides if the data is valid? Or should the application be the smart part in the line and check if the data is OK?
In the last (commercial) project I worked on, the database was very dumb. No constraints, no views etc.; everything was ruled by the application. I think that's very bad, because every time a certain table was accessed in the code, the same code to check whether the access was valid was repeated over and over again.
By shifting the logic into the database (with functions, triggers and constraints), I think we can save a lot of code in the application (and a lot of potential errors). But I'm afraid that putting too much of the business logic into the database will be a boomerang and someday it will be impossible to maintain.
Are there some real-life-approved guidelines to follow?

If you don't need massive distributed scalability (think companies with as much traffic as Amazon or Facebook etc.) then the relational database model is probably going to be sufficient for your performance needs. In which case, using a relational model with primary keys, foreign keys, constraints plus transactions makes it much easier to maintain data integrity, and reduces the amount of reconciliation that needs to be done (and trust me, as soon as you stop using any of these things, you will need reconciliation -- even with them you likely will due to bugs).
However, most validation code is much easier to write in languages like C#, Java, Python etc. than it is in languages like SQL because that's the type of thing they're designed for. This includes things like validating the formats of strings, dependencies between fields, etc. So I'd tend to do that in 'normal' code rather than the database.
Which means that the pragmatic solution (and certainly the one we use) is to write the code where it makes sense. Let the database handle data integrity because that's what it's good at, and let the 'normal' code handle data validity because that's what it's good at. You'll find a whole load of cases where this doesn't hold true, and where it makes sense to do things in different places, so just be pragmatic and weigh it up on a case by case basis.
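
A small sketch of that split, under assumptions: the signup form, its field names and the users table are invented for the example, and SQLite stands in for the real database. Format checks and dependencies between fields live in ordinary code, while the database keeps the guarantees it is good at (NOT NULL, UNIQUE, keys).

```python
import re
import sqlite3

# Deliberately loose pattern: it checks shape only, not whether the address exists.
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_signup(form: dict) -> list:
    """Application-side validity: string formats and dependencies between fields."""
    errors = []
    if not EMAIL_SHAPE.match(form.get("email") or ""):
        errors.append("email does not look like an address")
    if form.get("contact_by_phone") and not form.get("phone"):
        errors.append("phone is required when contact_by_phone is set")
    return errors


conn = sqlite3.connect(":memory:")
# Database-side integrity: constraints that hold whoever writes the row.
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE, phone TEXT)"
)


def save_signup(form: dict) -> None:
    errors = validate_signup(form)
    if errors:
        raise ValueError("; ".join(errors))
    conn.execute(
        "INSERT INTO users (email, phone) VALUES (?, ?)",
        (form["email"], form.get("phone")),
    )
```

A second save_signup with the same email passes the application checks but is still rejected by the UNIQUE constraint with sqlite3.IntegrityError, which is the kind of case where reconciliation would otherwise be needed.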

Two cents: if you choose smart, remember not to go in the "too smart" field. The database should not deal with inconsistencies that are inappropriate for its level of understanding of the data.
Example: suppose you want to insert a valid (checked with a confirmation mail) email address in a field. The database could check whether the email actually conforms to a given regular expression, but asking the database to check whether the email address is valid (e.g. checking if the domain exists, sending the email and handling the response) is a bit too much.
It's not meant to be a real-case example, just an illustration that a smart database has limits to its smartness anyway: if a nonexistent email address gets into it, the data is still not valid, but for the database it is fine. As in the OSI model, everything should handle data at its level of understanding. Ethernet does not care whether it's transporting ICMP or TCP, or whether they are valid or not.

I find that you need to validate in both the front end (either the GUI client, if you have one, or the server) and the database.
The database can easily assert for nulls, foreign key constraints etc. i.e. that the data is the right shape and linked up correctly. Transactions will enforce atomic writes of this. It's the database's responsibility to contain/return data in the right shape.
The server can perform more complex validations (e.g. does this look like an email, does this look like a postcode etc.) and then re-structure the input for insertion into the database (e.g. normalise it and create the appropriate entities for insertion into the tables).
Where you put the emphasis on validation depends to some degree on your application. e.g. it's useful to validate a (say) postcode in a GUI client and immediately provide feedback, but if your database is used by other applications (e.g. an application to bulkload addresses) then your layer surrounding the database needs to validate as well. Sometimes you end up providing validation in two different implementations (e.g. in the above, perhaps a Javascript front-end and a Java DAO backend). I've never found a good strategic solution to this.

Using the common features of relational databases, like primary and foreign key constraints, datatype declarations, etc. is good sense. If you're not going to use them then why bother with a relational db?
That said, all data should be validated for both type and business rules before it hits the db. Type validation is just defensive programming- assume users are out to hack you and then you'll get fewer unpleasant surprises. Business rules are what your application is all about. If you make them part of the structure of your db they become much more tightly bound to how your app works. If you put them in the application layer, it's easier to change them if business requirements change.
As a secondary consideration: clients often have less choice about which db they use (postgresql, mysql, Oracle, etc) than about which application language they have available. So if there is a good chance that your application will be installed on many different systems, your best bet is to make sure that your SQL is as standard as possible. This may well mean that building db features such as triggers in a database-agnostic way will be more trouble than putting that same logic in your application layer.

It depends on the application :)
For some applications the dumb database is the best. For example Google's applications run on a big dumb database that can't even do joins, because they need amazing scalability to be able to serve millions of users.
On the other hand, for some internal enterprise app it can be beneficial to go with a very smart database, as those are often used by more than just one application and therefore you want a single point of control - think of an employee database.
That said if your new application is similar to the previous one, I would go with dumb database. In order to eliminate all the manual checks and database access code I would suggest using an ORM library such as Hibernate for Java. It will essentially automate your data access layer but will leave all the logic to your application.
Regarding validation it must be done on all levels. See other answers for more details.

One other item of consideration is deployment. We have an application where the deployment of database changes is actually much easier for remote installations than the actual code base is. For this reason, we've put a lot of application code in stored procedures and database functions.
Deployment is not your #1 consideration, but it can play an important role in deciding between various choices.

This is as much a people question as it is a technology question. If your application is the only application that's ever going to manipulate the data (which is rarely the case, even if you think that's the plan), and you've only got application coders to hand, then by all means keep all the logic in the application.
On the other hand, if you've got DBAs who can handle it, or you know that more than one app will need to have its access validated, then managing data actually in the database makes a lot of sense.
Remember, though, that the best things for the database to be validating are a) the types of the data and b) relational constraints, which anything calling itself an RDBMS should have a handle on anyway.
If you've got any transactions in your application code, it's also worthwhile asking yourself whether they should be pushed to the database as a stored procedure so that it's impossible for them to be incorrectly reimplemented elsewhere.
I do know of shops where the only access allowed to the database is via stored procedures, so the DBAs have full responsibility for both the data storage semantics and access restrictions, and anyone else has to go through their gateways. There are obvious advantages to this, especially if more than one application has to have access to the data. Whether you go quite that far is up to you, but it's a perfectly valid approach.

While I believe that most data should be validated from the user interface (why send known bad stuff across the network tying up resources?), I also believe it is irresponsible not to put constraints on the database as the user interface is unlikely to be the only way that data ever gets into the database. Data also comes in from imports, other applications, quick script fixes for problems run at the query window, mass updates run (to update all prices by 10% for example). I want all bad records rejected no matter what their source and the database is the only place where you can be assured that will happen. To skip the database integrity checks because the user interface does it is to guarantee that you will most likely eventually have data integrity issues and then all of your data become meaningless and useless.
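
As an illustration of the database rejecting bad records whatever their source (a sketch only; the products table, the CHECK rule and the mass_update helper are assumptions for the example, with SQLite as the stand-in database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE products (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        price REAL NOT NULL CHECK (price > 0)
    )
    """
)
conn.execute("INSERT INTO products (name, price) VALUES ('widget', 10.0)")
conn.commit()


def mass_update(sql: str) -> bool:
    """Run an ad hoc statement, as a quick script or import might; report success."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False
```

A legitimate "raise all prices by 10%" update goes through, while a mistaken script that would push prices below zero is refused by the CHECK constraint and rolled back, even though it never went near the user interface.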

"e.g. When a user is going to save some data he just entered. Should the application just send the data to the database and the database decides if the data is valid? Or should the application be the smart part in the line and check if the data is OK?"
It's better to have the validation in the front end as well as on the server side, so that if the data is invalid the user is notified immediately. Otherwise he will have to wait for the DB to respond after a post back.
Where security is concerned it's better to validate at both ends, the front end as well as the DB. Otherwise, how can the DB trust all the data that is sent by the application ;-)

Validation should be done on the client side and on the server side, and once the data is valid it should be stored.
The only work that the database should do is any querying logic: updating rows, inserting rows and selects. Everything else should be handled by the server-side logic, since that's where the real meat of the application lives.
Structuring your insert properly will handle any foreign key constraints. Getting your business logic to call a sproc will insert data in the correct format. I don't really consider this validation but some people might.

My decision is: never use stored procedures in the database. Stored procedures are not portable.

Preventing bad data input

Is it good practice to delegate data validation entirely to the database engine constraints?
Validating data from the application doesn't prevent invalid insertion from another software (possibly written in another language by another team). Using database constraints you reduce the points where you need to worry about invalid input data.
If you put validation both in database and application, maintenance becomes boring, because you have to update code for who knows how many applications, increasing the probability of human errors.
I just don't see this being done very much, looking at code from free software projects.

Validate at input time. Validate again before you put it in the database. And have database constraints to prevent bad input. And you can bet in spite of all that, bad data will still get into your database, so validate it again when you use it.
It seems like every day some web app gets hacked because they did all their validation in the form or worse, using Javascript, and people found a way to bypass it. You've got to guard against that.
Paranoid? Me? No, just experienced.

It's best to, where possible, have your validation rules specified in your database and use or write a framework that makes those rules bubble up into your front end. ASP.NET Dynamic Data helps with this and there are some commercial libraries out there that make it even easier.
This can be done both for simple input validation (like numbers or dates) and related data like that constrained by foreign keys.
In summary, the idea is to define the rules in one place (the database most of the time) and have code in other layers that will enforce those rules.
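
A very small sketch of rules "bubbling up" from the schema, assuming SQLite and a made-up contacts table; this is not how ASP.NET Dynamic Data works internally, just the same idea in miniature. The front-end check reads the NOT NULL columns from the database instead of repeating them in code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
    "email TEXT NOT NULL, nickname TEXT)"
)


def required_columns(table: str) -> set:
    """Read NOT NULL columns from the schema (primary key excluded)."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    # PRAGMA arguments cannot be bound as parameters, so `table` must be a trusted name.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {name for _, name, _, notnull, _, pk in rows if notnull and not pk}


def missing_fields(table: str, form: dict) -> set:
    """Fields the schema requires that the submitted form left empty."""
    return {col for col in required_columns(table) if form.get(col) in (None, "")}
```

If a later migration makes nickname NOT NULL, missing_fields starts reporting it without any change to the application code.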

The disadvantage to leaving the logic to the database is then you increase the load on that particular server. Web and application servers are comparatively easy to scale outward, but a database requires special techniques. As a general rule, it's a good idea to put as much of the computational logic into the application layer and keep the interaction with the database as simple as possible.
With that said, it is possible that your application may not need to worry about such heavy scalability issues. If you are certain that database server load will not be a problem for the foreseeable future, then go ahead and put the constraints on the database. You are quite correct that this improves the organization and simplicity of your system as a whole by keeping validation logic in a central location.

There are other concerns than just SQL injection with input. You should take the most defensive stance possible whenever accepting user input. For example, a user might be able to enter a link to an image into a textbox, which is actually a PHP script that runs something nasty.
If you design your application well, you should not have to laboriously check all input. For example, you could use a Forms API which takes care of most of the work for you, and a database layer which does much the same.
This is a good resource for basic checking of vulnerabilities:
http://ha.ckers.org/xss.html

It's far too late by the time the data gets to your database to provide meaningful validation for your users and applications. You don't want your database doing all the validation since that'll slow things down pretty good, and the database doesn't express the logic as clearly. Similarly, as you grow you'll be writing more application level transactions to complement your database transactions.

I would say it's potentially a bad practice, depending on what happens when the query fails. For example, if your database could throw an error that was intelligently handled by an application, then you might be ok.
On the other hand, if you don't put any validation in your app, you might not have any bad data, but you may have users thinking they entered stuff that doesn't get saved.

Implement as much data validation as you can at the database end without compromising other goals. For example, if speed is an issue, you may want to consider not using foreign keys, etc. Furthermore, some data validation can only be performed on the application side, e.g., ensuring that email addresses have valid domains.

Another disadvantage to doing data validation from the database is that often you don't validate the same way in every case. In fact, it often depends on application logic (user roles), and sometimes you might want to bypass validation altogether (cron jobs and maintenance scripts).

I've found that doing validation in the application, rather than in the database, works well. Of course then, all the interaction needs to go through your application. If you have other applications that work with your data, your application will need to support some sort of API (hopefully REST).

I don't think there is one right answer, it depends on your use.
If you are going to have a very heavily used system, with the potential that the database performance might become a bottleneck, then you might want to move the responsibility for validation to the front-end where it is easier to scale with multiple servers.
If you have multiple applications interacting with the database, then you might not want to replicate and maintain the validation rules across multiple applications, so then the database might be the better place.
You might want a slicker input screen that doesn't just hit the user with validation warnings when they try to save a record. Maybe you want to validate a field after data has been entered and it loses focus, or even as the user types, changing the font colour as validation fails/passes.
Also related to constraints are warnings about suspect data. In my application I have hard constraints in the database (e.g. someone can't start a job before their date of birth), but then in the front-end I have warnings for data that is possibly correct, but suspect (e.g. an eight-year-old starting a job).
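
The hard-constraint versus warning distinction could be sketched like this. The function name, the messages and the MIN_PLAUSIBLE_AGE threshold of 14 are all assumptions for illustration; in the setup described above, the first rule would also exist as a constraint in the database.

```python
from datetime import date

MIN_PLAUSIBLE_AGE = 14  # hypothetical threshold below which a start date looks suspect


def check_job_start(date_of_birth: date, job_start: date):
    """Return (errors, warnings): errors block the save, warnings only prompt the user."""
    errors, warnings = [], []
    if job_start < date_of_birth:
        errors.append("job starts before date of birth")
        return errors, warnings
    # Age in whole years on the start date.
    had_birthday = (job_start.month, job_start.day) >= (date_of_birth.month, date_of_birth.day)
    age = job_start.year - date_of_birth.year - (0 if had_birthday else 1)
    if age < MIN_PLAUSIBLE_AGE:
        warnings.append(f"person would be {age} years old at job start")
    return errors, warnings
```

The front-end refuses to save when errors is non-empty, but for warnings it can ask "are you sure?" and let the record through.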