SQL Injection-safe call of polymorphic function - database

Several times I've found myself refactoring web application code and end up wanting to do something like this (Groovy in this case, but could be anything):
Map getData(String relationName, Integer rowId) {
    def sql = Sql.newInstance([...])
    def result = sql.firstRow('SELECT getRelationRow(?,?)', relationName, rowId)
    sql.close()
    return new HashMap(result)
}
where the stored procedure getRelationRow(relname text, rowid integer) executes dynamic SQL to retrieve the row with the specified rowid from the requested relation. The best example I've seen of such a function is this polymorphic function using the anyelement type, which is called as:
SELECT * FROM data_of(NULL::pcdmet, 17);
However to call this in the above code would require
def result = sql.firstRow("SELECT * FROM data_of(NULL::${relationName},?)", rowId)
that is, it requires the relation name to be pasted into the query, which risks SQL injection. So is there a way to keep the polymorphic goodness of the stored procedure but allow it to be called with generic relation names?

I don't think it can be done this way. I assume Groovy is using prepared statements here, which requires that input and return types are known at prepare time, while my function derives the return type from the polymorphic input type.
I am pretty sure you need string concatenation. But don't fret, there are functions like pg_escape() to sanitize table names and make SQLi impossible. Don't know Groovy, but it should have that, too.
Or does it?
Based on the function data_of(..) at the end of this related answer:
Refactor a PL/pgSQL function to return the output of various SELECT queries
With PREPARE I can make this work by declaring the return type explicitly:
PREPARE fooplan ("relationName") AS -- table name works as row type
SELECT * FROM data_of($1, $2);
Then I can hand in NULL, which is cast to "relationName" from the prepared context:
EXECUTE fooplan(NULL, 1);
So this might work after all, if your interface supports this. But you still have to concatenate the table name as the return data type (and therefore defend against SQL injection). Catch-22, I guess.
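If the identifier really does have to be spliced into SQL somewhere, another option is to give up the polymorphic return type and do the quoting server-side. A minimal sketch, assuming a hypothetical PL/pgSQL wrapper (PostgreSQL 9.1+ for format(), 9.2+ for row_to_json()) that returns json so the relation name can travel as an ordinary bind parameter:
CREATE OR REPLACE FUNCTION get_relation_row(relname text, row_id integer)
  RETURNS json AS
$$
DECLARE
  result json;
BEGIN
  -- format('%I', ...) quotes the relation name as an identifier, so relname
  -- arrives as a plain parameter and is never concatenated unquoted.
  -- Assumes the target table has an integer "id" column.
  EXECUTE format('SELECT row_to_json(t) FROM %I t WHERE t.id = $1', relname)
    INTO result
    USING row_id;
  RETURN result;
END;
$$ LANGUAGE plpgsql;
-- Callable from the client with plain placeholders: SELECT get_relation_row(?, ?);
The trade-off is that the caller gets json rather than typed columns, but nothing user-supplied ends up in the SQL text.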

Related

Generate sql query by anorm, with all nulls except one

I'm developing a web application with Play Framework 2.3.8 and Scala, with a complex architecture on the backend and front-end sides. The backend is MS SQL with many stored procedures, which we call through Anorm. Here is one of the problems.
I need to update some fields in the database. The front end calls the Play application and passes the name of a field and its value. I parse the field name and then need to generate an SQL query that updates that field, assigning null to all parameters except the one received. I tried to do it like this:
def updateCensusPaperXX(name: String, value: String, user: User) = {
  DB.withConnection { implicit c =>
    try {
      var sqlstring = "Execute [ScXX].[updateCensusPaperXX] {login}, {domain}"
      val params = List(
        "fieldName1",
        "fieldName2",
        ...,
        "fieldNameXX"
      )
      for (p <- params) {
        sqlstring += ", "
        if (name.endsWith(p))
          sqlstring += value
        else
          sqlstring += "null"
      }
      SQL(sqlstring)
        .on(
          "login" -> user.login,
          "domain" -> user.domain
        ).execute()
    } catch {
      case e: Throwable => Logger.error("update CensusPaper04 error", e)
    }
  }
}
But actually that doesn't work in all cases. For example, when I try to save a string value, it gives me an error like:
com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near 'some phrase'
What is the best way to generate sql query using anorm with all nulls except one?
This is happening because when you write a string value directly into the SQL statement, it needs to be quoted. One way to solve this would be to determine which of the fields are strings and add conditional logic to decide whether to quote the value, but this is probably not the best way to go about it. As a general rule, you should be using named parameters rather than building a string with the parameter values pasted in. This has a few benefits:
It will probably be easier for you to diagnose issues because you will get more sensible error messages back at runtime.
It protects against the possibility of SQL injection.
You get the usual performance benefit of reusing the prepared statement although this might not amount to much in the case of stored procedure invocation.
What this means is that you should treat your list of fields as named parameters as you do with user and domain. This can be accomplished with some minor changes to your code above. First, you can build your SQL statement as follows:
val params = List(
  "fieldName1",
  "fieldName2",
  ...,
  "fieldNameXX"
)
val sqlString = "Execute [ScXX].[updateCensusPaperXX] {login}, {domain}," +
  params.map("{" + _ + "}").mkString(",")
The point here is that you don't need to insert the values directly, so you can just build the string by appending the list of parameter placeholders to the end of your query.
Then you can go ahead and build your parameter list. Note that the parameters to SQL's on method are a vararg list of NamedParameter. Basically, we need to create a Seq of NamedParameter values that covers "login", "domain", and the list of fields you are populating. Something like the following should work:
val userDomainParams: Seq[NamedParameter] = Seq("login" -> user.login, "domain" -> user.domain)
val additionalParams = params.map(p =>
  if (name.endsWith(p))
    NamedParameter(p, value)
  else
    NamedParameter(p, None)
).toSeq
val fullParams = userDomainParams ++ additionalParams
// At this point you can execute as follows
SQL(sqlString).on(fullParams:_*).execute()
What is happening here is that you're building the list of parameters and then using the splat operator :_* to expand the sequence into the varargs needed as arguments for the on method. Note that the None used in the NamedParameter above is converted into a JDBC NULL by Anorm.
This takes care of the issue with strings because you are no longer writing the string directly into the query, and it has the added benefit of eliminating other issues that come with building the SQL string yourself rather than using parameters.
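For reference, Anorm turns the {name} placeholders into ordinary JDBC bind markers, so the statement it prepares looks roughly like this sketch (field list abbreviated to four markers):
-- One bind marker per named parameter; the values (including NULLs) are bound
-- separately by the driver, so string values never need quoting in the SQL text.
Execute [ScXX].[updateCensusPaperXX] ?, ?, ?, ?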

Creating a table method on a user defined type (like 'nodes' on the XML data type)

I've created working SQLCLR-based user defined table-valued functions as well as user defined types.
What I want now is a method on a SQLCLR UDT that returns a table, similar to the nodes method on the XML data type.
The TableDefinition and FillRowMethodName properties of the SqlMethod attribute / decoration seem to imply that it should be possible, but nothing actually works. When I call the method like this (which I expect to fail):
SELECT @Instance.AsTable();
I get:
invalid data type
which is not the same error I get if I call a SQLCLR table-valued function that way.
When I call like this:
SELECT * FROM @Instance.AsTable();
I get the (possibly expected) error:
The table (and its columns) returned by a table-valued method need to be aliased.
When I call like this:
SELECT * FROM @Instance.AsTable() t(c1,c2);
It says:
Table-valued function 'AsTable' cannot have a column alias
I had a feeling the problem might be related to the fact that table functions are expected to be static, so I also tried implementing it as an extension method, but in that case it won't even compile. I get:
Extension method must be defined in a non-generic static class
Obviously I could implement it as a normal SQLCLR table function that expects a parameter of my UDT type. In fact I've actually done this and SELECT * FROM dbo.AsTable(@Instance); works but I don't like that solution. I'd really like to get this syntax to work: SELECT * FROM @Instance.AsTable() t(c1,c2);
Here's one non-working version:
[SqlMethod(IsDeterministic = false, DataAccess = DataAccessKind.None, OnNullCall = true,
    TableDefinition = "RowKey nvarchar(32),RowValue nvarchar(1000)",
    FillRowMethodName = "FormatLanguageRow")]
public IEnumerable AllRows()
{
    return _rowCollection;
}
Sadly, it is not possible to create a method on a UDT that returns a table (i.e. returns IEnumerable). Your confusion regarding the two properties of the SqlMethod attribute is quite understandable, as they are misleading. If you take a look at the MSDN page for SqlMethodAttribute Class, you will see the following note in the "Remarks" section:
SqlMethodAttribute inherits from a SqlFunctionAttribute, so SqlMethodAttribute inherits the FillRowMethodName and TableDefinition fields from SqlFunctionAttribute. Note that it is not possible to write a table-valued method, although the names of these fields might suggest that it is possible.
Outside of the work-around that you have already tried (i.e. creating a TVF), you could return the data as XML (i.e. SqlXml) from a UDT method. It certainly won't be as efficient as a streaming TVF, but it is returning a value that can be turned into a table inline via the nodes() method of the XML datatype.
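For illustration, a sketch of consuming such a method inline, where AsXml() is a hypothetical UDT method returning SqlXml shaped like <row RowKey="..." RowValue="..."/>:
-- @Instance is assumed to be an already-initialized instance of the UDT.
SELECT n.value('@RowKey',   'nvarchar(32)')   AS RowKey,
       n.value('@RowValue', 'nvarchar(1000)') AS RowValue
FROM   (SELECT @Instance.AsXml() AS x) AS src
CROSS APPLY src.x.nodes('/row') AS t(n);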
UPDATE (2017-01-16)
I have formally submitted this as a Microsoft Connect suggestion:
Allow CLR UDT to return IEnumerable / Table / TVF from a method ( SqlMethod ) like XML.nodes()

Does PostgreSQL v9.X have a real "array of record"?

This query works fine,
WITH test AS (
SELECT array_agg(t) as x FROM (
SELECT 1111 as id, 'aaaaa' as cc
) AS t
) SELECT x[1] FROM test;
but can I access the record elements? I tried SELECT x[1].id; SELECT x[1][1]; ... nothing works.
PS: with Google I find only OLD solutions... The context here is v9.X; is there any news about "array of record"?
I also tried
select x[1] from (select array[row(1,2)] as x) as t;
but found no way to access only item 1 or only item 2.
A clue that I was unable to follow: postgresql.1045698.n5.nabble.com solves the problem using a CREATE TYPE... OK, but I need an "all in one query" solution. Where is the "dynamic typing" of PostgreSQL? How can I CAST or express the type without a CREATE TYPE clause?
At present there does not appear to be any syntax for accessing a record of anonymous type except via the function call syntax or via hstore. That's unfortunate, but not likely to get fixed in a hurry unless someone who really cares comes along to make it happen. There are other priorities.
You have two workaround options:
CREATE TYPE
hstore
CREATE TYPE
The issue is with records of anonymous type. So make it not anonymous. Unfortunately this is only possible before it becomes an anonymous record type; you can't currently cast from record to a usertype. So you'd need to do:
CREATE TYPE some_t AS (id integer, cc text);
WITH test AS (
SELECT array_agg(t::some_t) as x FROM (
SELECT 1111 as id, 'aaaaa' as cc
) AS t
) SELECT x[1].id FROM test;
Note the cast of the subquery output to some_t before aggregation.
I can't say I understand why this cast can't be performed after indexing the array instead.
hstore
As usual, hstore rides to the mostly-rescue with difficult type problems.
regress=> WITH test AS (
SELECT array_agg(t) as x FROM (
SELECT 1111 as id, 'aaaaa' as cc
) AS t
) SELECT hstore(x[1])->'id' FROM test;
?column?
----------
1111
(1 row)
You need the hstore extension, and I'm sure it's not efficient, but it works. This builds on the hstore support for creating a hstore from anonymous records that was added to support NEW and OLD in triggers, a past pain-point.
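If the extension isn't installed yet, enabling it (on 9.1 or later, with sufficient privileges) is a one-liner:
CREATE EXTENSION IF NOT EXISTS hstore;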
Wrapper function?
Turns out you can't get around it with a simple wrapper function to let you specify the type on the call-site:
regress=> CREATE OR REPLACE FUNCTION identity(record) RETURNS record AS $$
SELECT $1;
$$ LANGUAGE sql IMMUTABLE;
ERROR: SQL functions cannot have arguments of type record
so you'd have to use a higher-overhead procedural language, at which point you might as well use hstore instead, it'll be faster and easier.
Making it better?
So, this is all a bit ugly. It's not ever likely to be possible to directly index a field from an anonymous record, since it might not exist and its type cannot be deduced. But there's no reason we can't use the type system feature that allows us to return record from a function and specify its type on the caller-side to also do so in casts.
It should be possible to make PostgreSQL support something like:
WITH test AS (
SELECT array_agg(t) as x FROM (
SELECT 1111 as id, 'aaaaa' as cc
) AS t
) SELECT (x[1] AS some_t(id integer, cc text)).id FROM test;
it'd just involve appropriate parser-hacking, and a way to make sure that was never ambiguously parsed in conflict with a column-alias.
Really, even type-inference could be possible if someone was willing to put the work in and convince the team that the rather large amount of query-planner processor time required was worth it. (unlikely).
This is an irritating, but minor, corner in the type system. If you want it to change you will need to make noise on pgsql-general, and accompany that noise with the willingness to do real work to improve the problem. This may involve learning more than you ever wanted to know about the innards of PostgreSQL's type system, learning the fun of "backward compatibility", and having frustrating arguments around and around in circles. Welcome to open source!

How do Django queries 'cast' argument strings into the appropriate field-matching types?

Let's take the Django tutorial. In the first part we can find this model:
class Poll(models.Model):
question = models.CharField(max_length=200)
pub_date = models.DateTimeField('date published')
with which Django generates the following SQL:
CREATE TABLE "polls_poll" (
"id" serial NOT NULL PRIMARY KEY,
"question" varchar(200) NOT NULL,
"pub_date" timestamp with time zone NOT NULL
);
One can note that Django automatically added an AutoField, gloriously named id, which is akin to an IntegerField in that it handles integers.
On part 3, we build a custom view, reachable through the following url pattern:
(r'^polls/(?P<poll_id>\d+)/$', 'polls.views.detail'),
The tutorial helpfully explains that a subsequent HTTP request will result in the following call:
detail(request=<HttpRequest object>, poll_id='23')
A few scrolls later, we can find this snippet:
def detail(request, poll_id):
try:
p = Poll.objects.get(pk=poll_id)
Notice how the URL tail component becomes the poll_id argument with a string value of '23', which is happily churned through the Manager's (and therefore QuerySet's) get method to produce the result of an SQL query whose WHERE clause contains an integer value of 23, certainly looking something like this:
SELECT * FROM polls_poll WHERE id=23
Django certainly performed the conversion based on the fact that the id field is an AutoField. The question is how, and when. Specifically, I want to know which internal methods are called, and in what order (kind of like what the docs explain for form validation).
Note: I took a look at the sources in django.db.models and found a few *prep* methods, but I don't know when or where they are called, let alone whether they're what I'm looking for.
PS: I know it's not casting stricto sensu, but I think you get the idea.
I think it's in django.db.models.query.get_where_clause

Implementing check constraints with SQL CLR integration

I'm implementing 'check' constraints that simply call a CLR function for each constrained column.
Each CLR function is one or two lines of code that attempts to construct an instance of the user-defined C# data class associated with that column. For example, a "Score" class has a constructor which throws a meaningful error message when construction fails (i.e. when the score is outside a valid range).
First, what do you think of that approach? For me, it centralizes my data types in C#, making them available throughout my application, while also enforcing the same constraints within the database, so it prevents invalid manual edits in management studio that non-programmers may try to make. It's working well so far, although updating the assembly causes constraints to be disabled, requiring a recheck of all constraints (which is perfectly reasonable). I use DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS to make sure the data in all tables is still valid for enabled and disabled constraints, making corrections as necessary, until there are no errors. Then I re-enable the constraints on all the tables via ALTER TABLE [tablename] WITH CHECK CHECK CONSTRAINT ALL. Is there a T-SQL statement to re-enable with check all check constraints on ALL tables, or do I have to re-enable them table by table?
Finally, for the CLR functions used in the check constraints, I can either:
Include a try/catch in each function to catch data construction errors, returning false on error, and true on success, so that the CLR doesn't raise an error in the database engine, or...
Leave out the try/catch, just construct the instance and return true, allowing that aforementioned 'meaningful' error message to be raised in the database engine.
I prefer 2, because my functions are simpler without the error code, and when someone using management studio makes an invalid column edit, they'll get the meaningful message from the CLR like "Value for type X didn't match regular expression '^p[1-9]\d?$'" instead of some generic SQL error like "constraint violated". Are there any severe negative consequences of allowing CLR errors through to SQL Server, or is it just like any other insert/update failure resulting from a constraint violation?
For example, a "Score" class has a constructor which throws a meaningful error message when construction fails (i.e. when the score is outside a valid range). First, what do you think of that approach?
It worries me a bit, because calling a ctor requires memory allocation, which is relatively expensive. For each row inserted, you're calling a ctor -- and only for its side-effects.
Exceptions are also expensive. They're great when you need them, but this is a case where you would use them in a ctor context, but not in a check context.
A refactoring could reduce both costs by making the check a static method; then both the check constraint and the ctor could call it:
class Score
{
    private int score;

    public static bool Valid( int score )
    {
        return score > 0;
    }

    public Score( int s )
    {
        if ( !Valid( s ) )
            throw new ArgumentException( "score out of range" );
        score = s;
    }
}
The check constraint calls Score.Valid(); no construction or exception needed.
Of course, you still have the overhead, for each row, of a CLR call. Whether that's acceptable is something you'll have to decide.
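For concreteness, wiring a CLR function into a check constraint looks roughly like this sketch (the table, column, and wrapper-function names are hypothetical):
-- dbo.IsScoreValid is assumed to be the SQLCLR wrapper around Score.Valid.
ALTER TABLE dbo.Results WITH CHECK
    ADD CONSTRAINT CK_Results_Score CHECK (dbo.IsScoreValid(Score) = 1);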
Is there a T-SQL statement to re-enable with check all check constraints on ALL tables, or do I have to re-enable them table by table?
No, but you can do this to generate the commands:
select 'ALTER TABLE ' + name + ' WITH CHECK CHECK CONSTRAINT ALL;'
from sys.tables;
and then run the resultset against the database.
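If tables live in more than one schema, a schema-qualified variant of that generator (a sketch, not part of the original suggestion) avoids producing commands that can't resolve their tables:
SELECT 'ALTER TABLE ' + QUOTENAME(SCHEMA_NAME(schema_id)) + '.' + QUOTENAME(name)
     + ' WITH CHECK CHECK CONSTRAINT ALL;'
FROM sys.tables;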
Comments from the OP:
I use base classes called ConstrainedNumber and RegexConstrainedString for all my data types. I could easily move those two classes' simple constructor code to a separate public boolean IsValueValid method as you suggested, and probably will.
The CLR overhead (and memory allocation) would only occur for inserts and updates. Given the simplicity of the methods, and the rate at which table updates will occur, I don't think the performance impact will be anything to worry about for my system.
I still really want to raise exceptions for the information they'll provide to management studio users. I like the IsValueValid method, because it gives me the 'option' of not throwing errors. Within applications using my data types, I could still get the exception by constructing an instance :)
I'm not sure I agree with the exception throwing, but again, the "take-home message" is that by decomposing the problem into parts, you can select which parts you're willing to pay for, without paying for parts you don't use. You don't use the ctor, because you were only calling it for its side-effect. So we decomposed creation and checking. We can further decompose throwing:
class Score
{
    private int score;

    public static bool IsValid( int score )
    {
        return score > 0;
    }

    public static void CheckValid( int score )
    {
        if ( !IsValid( score ) )
            throw new ArgumentException( "score out of range" );
    }

    public Score( int s )
    {
        CheckValid( s );
        score = s;
    }
}
Now a user can call the ctor and get the check, the possible exception, and construction; call CheckValid and get the check and exception; or call IsValid to get just the validity, paying the runtime cost only for what he needs.
Some clarification: these data classes sit one level above the primitive types, constraining data to make it meaningful.
Actually, they sit just above the RegexConstrainedString and ConstrainedNumber<T> classes, which is where we're talking about refactoring the constructor's validation code into a separate method.
The problem with refactoring the validation code is that the Regex necessary for validation exists only in the subclasses of RegexConstrainedString, since each subclass has a different Regex. This means that the validation data is only available to the RegexConstrainedString's constructor, not to any of its methods. So, if I factor out the validation code, callers would need access to the Regex.
public class Password: RegexConstrainedString
{
    internal static readonly Regex regex = CreateRegex_CS_SL_EC( @"^[\w!""#\$%&'\(\)\*\+,-\./:;<=>\?#\[\\\]\^_`{}~]{3,20}$" );

    public Password( string value ): base( value.TrimEnd(), regex ) {} //length enforced by regex, so no min/max values specified
    public Password( Password original ): base( original ) {}
    public static explicit operator Password( string value ) {return new Password( value );}
}
So, when reading a value from the database or reading user input, the Password constructor forwards the Regex to the base class to handle the validation. Another trick is that it trims the end characters automatically, in case the database type is char rather than varchar, so I don't have to remember to do it. Anyway, here is what the main constructor for RegexConstrainedString looks like:
protected RegexConstrainedString( string value, Regex subclasses_static_regex, int? min_length, int? max_length )
{
    _value = (value ?? String.Empty);

    if (min_length != null)
        if (_value.Length < min_length)
            throw new Exception( "Value doesn't meet minimum length of " + min_length + " characters." );

    if (max_length != null)
        if (_value.Length > max_length)
            throw new Exception( "Value exceeds maximum length of " + max_length + " characters." );

    value_match = subclasses_static_regex.Match( _value ); //Match.Synchronized( subclasses_static_regex.Match( _value ) );
    if (!value_match.Success)
        throw new Exception( "Invalid value specified (" + _value + "). \nValue must match regex:" + subclasses_static_regex.ToString() );
}
Since callers would need access to the subclass's Regex, I think my best bet is to implement a IsValueValid method in the subclass, which forwards the data to the IsValueValid method in the RegexConstrainedString base class. In other words, I would add this line to the Password class:
public static bool IsValueValid( string value ) {return IsValueValid( value.TrimEnd(), regex, min_length, max_length );}
I don't like this, however, because I'm replicating the subclass's constructor code, having to remember to trim the string again and pass the same min/max lengths when necessary. This requirement would be forced upon all subclasses of RegexConstrainedString, and it's not something I want to do. Data classes like Password are so simple because RegexConstrainedString handles most of the work, implementing operators, comparisons, cloning, etc.
Furthermore, there are other complications with factoring out the code. The validation involves running and storing a Regex match in the instance, since some data types may have properties that report on specific elements of the string. For example, my SessionID class contains properties like TimeStamp, which return a matched group from the Match stored in the data class instance. The bottom line is that this static method is an entirely different context. Since it's essentially incompatible with the constructor context, the constructor cannot use it, so I would end up replicating code once again.
So... I could factor out the validation code by replicating it and tweaking it for a static context and imposing requirements on subclasses, or I could keep things much simpler and just perform the object construction. The relative extra memory allocated would be minimal, as only a string and Match reference is stored in the instance. Everything else, such as the Match and the string itself would still be generated by the validation anyway, so there's no way around that. I could worry about the performance all day, but my experience has been that correctness is more important, because correctness often leads to numerous other optimizations. For example, I don't ever have to worry about improperly formatted or sized data flowing through my application, because only meaningful data types are used, which forces validation to the point-of-entry into the application from other tiers, be it database or UI. 99% of my validation code was removed as a no-longer-necessary artifact, and I find myself only checking for nulls nowadays. Incidentally, having reached this point, I now understand why including nulls was the billion dollar mistake. Seems to be the only thing I have to check for anymore, even though they are essentially non-existent in my system. Complex objects having these data types as fields cannot be constructed with nulls, but I have to enforce that in the property setters, which is irritating, because they otherwise would never need validation code... only code that runs in response to changes in values.
UPDATE:
I simulated the CLR function calls both ways, and found that when all data is valid, the performance difference is only fractions of a millisecond per thousand calls, which is negligible. However, when roughly half the passwords are invalid, throwing exceptions in the "instantiation" version, it's three orders of magnitude slower, which equates to about 1 extra second per 1000 calls. The magnitudes of difference will of course multiply as multiple CLR calls are made for multiple columns in the table, but that's a factor of 3 to 5 for my project. So, is an extra 3 to 5 seconds per 1000 updates acceptable to me, as a trade-off for keeping my code very simple and clean? Well, that depends on the update rate. If my application were getting 1000 updates per second, a 3 to 5 second delay would be devastating. If, on the other hand, I was getting 1000 updates a minute or an hour, it may be perfectly acceptable. In my situation, I can tell you now that it's quite acceptable, so I think I'll just go with the instantiation method, and allow the errors through. Of course, in this test I handled the errors in the CLR instead of letting SQL Server handle them. Marshalling the error info to SQL Server, and then possibly back to the application, could definitely slow things down much more. I guess I will have to fully implement this to get a real test, but from this preliminary test, I'm pretty sure what the results will be.
