The problem is I need to ignore the stray letters in the numbers, e.g. 19A or B417.
Take a look here:
Extracting Numbers with SQL Server
There are several hidden "gotchas" that are explained pretty well in the article.
It depends on how much data you're dealing with, but doing that in SQL is probably going to be slow. Not everyone will agree with me here, but I think all data processing should be done in application code.
I would just take the rows you want and filter them in the application you're dealing with.
The easiest thing to do here would be to create a CLR function which takes the address. In the CLR function, you would take the first part of the address (assuming it is the house number), which should be delimited by whitespace.
Then, replace any non-numeric characters with an empty string.
You should have a string representing an integer at that point which you can pass to the Parse method on the Int32 class to produce an integer, which you can then check to see if it is odd.
I recommend a CLR function (assuming you are using SQL Server 2005 and above, and can set the compatibility level of the database) here because it's easier to perform string manipulations in .NET than it is in T-SQL.
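If CLR isn't an option, a rough pure-T-SQL sketch of the same three steps (take the first whitespace-delimited token, strip non-digits, test parity) might look like this; the @Address variable is purely illustrative:

DECLARE @Address varchar(100), @Token varchar(100)
SET @Address = '19A Main St'
-- take the first whitespace-delimited part (assumed to be the house number)
SET @Token = CASE WHEN CHARINDEX(' ', @Address) > 0
                  THEN LEFT(@Address, CHARINDEX(' ', @Address) - 1)
                  ELSE @Address END
-- replace non-numeric characters with an empty string, one at a time
WHILE PATINDEX('%[^0-9]%', @Token) > 0
    SET @Token = STUFF(@Token, PATINDEX('%[^0-9]%', @Token), 1, '')
-- what's left represents an integer; check whether it is odd
SELECT CASE WHEN CAST(@Token AS int) % 2 = 1 THEN 'Odd' ELSE 'Even' END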
Assuming [Address] is the column with the address in it...
-- Looks at the last digit anywhere in the address (the first digit of the reversed string)
Select Case Cast(Substring(Reverse(Address),
                           PatIndex('%[0-9]%', Reverse(Address)), 1) as Integer) % 2
            When 0 Then 'Even'
            When 1 Then 'Odd'
       End
From Table
I've been through this drill before. The best alternative is to add a column to the table or to a subsidiary joinable table that stores the inferred numerical value for the purpose. Then use iterative queries to set the column repeatedly until you get sufficient accuracy and coverage. You'll end up encountering stuff like "First, Third," "451a", "1200 South 19th Blvd East", and worse.
Then filter new and edited records as they occur.
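For illustration, a first pass in T-SQL might look like this (table and column names are hypothetical):

ALTER TABLE address ADD house_number int NULL

-- first pass: rows whose street value begins with a clean run of digits
UPDATE address
SET house_number = CAST(LEFT(street, PATINDEX('%[^0-9]%', street + ' ') - 1) AS int)
WHERE street LIKE '[0-9]%'
-- later passes chip away at '451a', 'First', and the rest until coverage is sufficient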
As usual, UDFs should be avoided as being slow and (comparatively) less debuggable.
We are storing wine data in our database. The vintage of the wine might be a number like 2010 or a string such as Non-Vintage.
2010 means the grapes were harvested in 2010.
Non-Vintage means the grapes were harvested across an unknown time-period.
At first we decided to store the field as a string, since 2010 and Non-Vintage are both potentially strings. However, we need to be able to sort the years or perform comparisons on them (e.g. year > 2010).
We are considering either:
Store the data as a number and "Non-Vintage" would be assigned 0. However, we'd have to provide weird validations everywhere in the app for handling the 0 value.
Store the year as a number and provide a boolean "non_vintage" field for non-vintage wines.
The data pulled from the database will be delivered to an AngularJS front-end via an API. The JavaScript code will have to parse and use the year at various points, e.g. "show me all wines where year > 2010".
Anyone have any thoughts on which is better and why?
Are you using MySQL? You can cast strings as ints in your MySQL statements.
select * from table order by cast(string AS signed) asc;
That would mean you can store as a string.
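One caveat worth knowing (a sketch, assuming a wines table with a vintage string column): MySQL casts non-numeric strings to 0, so the 'Non-Vintage' rows sort first and fall out of numeric comparisons:

select * from wines order by cast(vintage as signed) asc;  -- 'Non-Vintage' casts to 0
select * from wines where cast(vintage as signed) > 2010;  -- numeric filtering still works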
Option 1
Since you're not really treating the year as a number, nor do you need to, the simplest option you have is to just create a string column and have at it.
This has some advantages and disadvantages right off the bat:
Advantage:
Simplest option; nothing complex. Any database can handle this.
Logic is fairly simple, although a special case to handle your "Non-Vintage" is needed.
Disadvantage:
potentially allows values you don't want/expect/etc.:
e.g. "NonVintage", "Non-vantage", "Unknown", "Spider-man!" ... O.o
This might be mitigated by whatever your process for putting values in is (i.e. if it's mostly automated, this might be a smaller issue) :)
========================================
Option 2
A stricter way would be to use two columns:
a number and a string.
Store the vintage year in a 4-digit number field, and you know it'll always be a "proper" year. (you could add a check constraint to prevent years < 1000 if you want ;) )
Store the code "NV" in a 2 (or 3?) digit string "code" column. This gives you good flexibility going forward in case other requirements in future start asking for additional types or such.
You could just use a boolean; it would work. However, if things change in future (they always do), you'd have to redesign. The string code column has no real downside compared to the boolean and gives you simple flexibility going forward ;)
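A minimal sketch of option 2 in generic SQL (table and column names are mine):

CREATE TABLE wine (
    id           INT PRIMARY KEY,
    vintage_year SMALLINT NULL CHECK (vintage_year >= 1000), -- NULL when no year applies
    vintage_code CHAR(2)  NULL                               -- e.g. 'NV' for non-vintage
);

-- "all wines after 2010" stays a plain numeric comparison:
SELECT * FROM wine WHERE vintage_year > 2010;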
========================
It would depend on your system and what you know of it, and how the data's coming in ... but I'd probably lean towards option 2 (number + string) myself.
I am working in a SQL Server environment, heavy on stored procedures, where a lot of the procedures use 0 and '' instead of Null to indicate that no meaningful parameter value was passed.
These parameters appear frequently in the WHERE clauses. The usual pattern is something like
WHERE ISNULL(SomeField, '') =
    CASE @SomeParameter
        WHEN '' THEN ISNULL(SomeField, '')
        ELSE @SomeParameter
    END
For various reasons, it's a lot easier to make a change to a proc than a change to the code that calls it. So, given that the calling code will be passing empty strings for null parameters, what's the fastest way to compare to an empty string?
Some ways I've thought of:
@SomeParameter = ''
NULLIF(@SomeParameter, '') IS NULL
LEN(@SomeParameter) = 0
I've also considered inspecting the parameter early on in the proc and setting it to NULL if it's equal to '', and just doing a @SomeParameter IS NULL test in the actual WHERE clause.
What other ways are there? And what's fastest?
Many thanks.
Sorting out the parameter at the start of the proc must be faster than multiple conditions in a where clause or using a function in one. The more complex the query, or the more records that have to be filtered, the greater the gain.
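A sketch of that normalization, reusing the names from the question:

-- at the top of the proc: treat '' as "not supplied"
IF @SomeParameter = '' SET @SomeParameter = NULL

-- the WHERE clause then collapses to the usual optional-filter pattern:
WHERE (@SomeParameter IS NULL OR SomeField = @SomeParameter)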
The bit that would scare me is if this lack of nullability in the procedure arguments has got into the data as well. If it has, then when you start locking things down, your queries are going to come back with the "wrong" results.
If this product has some longevity, then I'd say easy is the wrong solution long term, and it should be corrected in the calling applications. If it doesn't, then maybe you should just leave it alone, as all you'd be doing is sweeping the mess from under one rug to under another...
How are you going to test these changes? The chances of introducing a wee error, while bored out of your skull making the same change again and again and again, are very high.
I came across some existing code in our production environment given to us by our vendor. They use a string of comma-separated values to store filtered results from a DB. Keep in mind that this is for a proprietary scripting language called PowerOn that interfaces with a database residing on an AIX system, but it's a language that supports strings, integers, and arrays.
For example, we have:
Account
----------------
123
234
3456
28390
The pseudocode might look like:
Define accounts As String
For Each Account
    accounts = accounts + CharCast(Account) + ","
End
as opposed to something I would expect to see like
Define accounts As Integer Array(99)
Define index As Integer = 0
For Each Account
    accounts(index) = Account
    index = index + 1
End
By the time the loop is done, accounts will look like "123,234,3456,28390,". The string is later used to test whether a specific instance exists, like so:
If CharSearch("28390", accounts) > 0 Then Call DoSomething
In the example, the statement evaluates to true and DoSomething gets called. Given the option of arrays, why would one want to store integer values within a string of comma-separated values? In every language I've come across, string-based operations are almost always more expensive than integer-based operations.
Considering I haven't seen this technique before and my experience is somewhat limited, is there a name for this? Is this common practice, or is this just another example of being too stringly typed? To extend the existing code, should I continue using the string method? Did we get cruddy code from our vendor?
What I put in the comment still holds, but my real answer is: it's probably a design decision with respect to compatibility/portability. In your integer-array case (and at a low enough level of the API) you'd typically find yourself asking questions like: what's a safe guess of the size of an integer on today's machines? What about endianness?
The most portable and most flexible of all data formats always has been, and always will be, printed representation. It may not be as fast to process, but that's where adapters/converters kick in. I wouldn't be surprised to find a (human-readable) printed representation of something, especially in database APIs like the one you describe.
If you want something fast, just take whatever is given to you, convert it to a more efficient internal format, do your processing, and convert it back.
There's nothing inherently wrong with using comma-separated strings instead of arrays. Sure, you can't readily access a random nth element of such a collection, but if such random access is not needed then there's no penalty for it, right?
As far as I know, Oracle stores NUMBER values in a decimal rather than native binary format (and, if my memory is correct, DATEs similarly) for very practical reasons.
In your specific example, it looks like using strings is overkill when passing data around without crossing process boundaries. But could it be that the choice of a string data type makes more sense when sending data over the wire or storing it on disk?
Three small number columns [Number(1)] >>
OptionA | 0/1
OptionB | 0/1
OptionC | 0/1
or one larger string column [Varchar2(29)] >>
Options | OptionA=0/1|OptionB=0/1|OptionC=0/1
I'm not sure how the database handles this internally, but I think that maintaining three Number(1) columns is better than one Varchar2(29) column!
-EDIT-
Let me explain the situation a bit more:
I am working on a common framework where all incoming/outgoing requests/responses are tracked. These interactions can be channeled to a DB, a file, or JMS. All the configuration is loaded from a table which has a column that corresponds to the output type. Currently I'm using "DB=1|FILE=1|JMS=0" as the value of that column, so that later, if anyone wants to add this for their module, they can easily understand what is going on. In my code I've written simple logic which splits the string by "|" and then uses the exclusive-or operator to switch between choices in a switch case.
Everything is already done, but I don't like the idea that one large column is better than three small ones; plus, three columns would remove the string-splitting I'm doing.
-EDIT-
I finally got it clarified: there may be a situation where we have to add more options. If we add the data column-wise, that means modifying the table, changing the entity, adding more ifs, and so on. Instead, I ended up making an enum out of it with simple bit-wise logic to switch between options; this way I only need to modify the enum and add a handler for the new option, and we are good to go.
Using a single column to store multiple pieces of data is probably the worst thing you can do in a database.
Violating first normal form has at least the following disadvantages:
More difficult to query. OptionA = 1 and OptionB = 1 and OptionC = 0 versus substr(options, 9, 1) = '1' and substr(options, 19, 1) = '1' and substr(options, 29, 1) = '0'.
Less flexible. What happens when you need to add another option? Adding a new column is easy. Changing the format could mess up old queries; for example, if someone tries to read OptionC with substr(options, -1, 1). (Although this is a good reason to use a third option: a separate table.)
No type safety. This can be a very subtle and tricky problem. Let's say you write substr(options, 9, 1) = 1 instead of substr(options, 9, 1) = '1'. If anyone ever gets the format wrong, a single value could ruin lots of queries. Or worse, it only intermittently crashes a small number of queries, because the access paths keep changing. (Although you can prevent this with a check constraint.)
Slower queries. Normally the amount of work done in an expression or condition isn't a significant cost for a query. But adding a lot of unnecessary string manipulation can make a difference.
Less optimizing. Oracle can only build efficient query plans if it can understand your data. For example, let's say that OptionA is "0" 99.9% of the time. When you filter on OptionA = 0, Oracle can use a histogram to make a very accurate prediction about the number of rows returned. But for substr(options, 9, 1) = '1' you'll only get a wild guess. If you have complicated queries using this column, you may spend a lot of time trying to "fix" the cardinality estimates. (Although maybe expression statistics could help with this?)
There are times when denormalizing is a good idea. For example, if you have terabytes of data, and compress the table, the single column may take up less space. (But if you're trying to save space, why not use a format like "000" instead?).
If there really is a good reason for this, it definitely needs to be documented. Perhaps add a comment on the column.
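For comparison, a hedged sketch of the normalized version in Oracle syntax (the names are mine):

CREATE TABLE interaction_config (
    module  VARCHAR2(30) PRIMARY KEY,
    to_db   NUMBER(1) DEFAULT 0 NOT NULL CHECK (to_db   IN (0, 1)),
    to_file NUMBER(1) DEFAULT 0 NOT NULL CHECK (to_file IN (0, 1)),
    to_jms  NUMBER(1) DEFAULT 0 NOT NULL CHECK (to_jms  IN (0, 1))
);

-- queries stay simple, type-safe, and optimizer-friendly:
SELECT module FROM interaction_config WHERE to_db = 1 AND to_jms = 0;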
For a start, if I am reading your question right, you want each of the options to have one of just two possible values, correct?
If so then you could:
have a separate integer (or boolean) column for each option
have an options column that is a string of 1's and 0's, one digit for each options e.g. "001"
use an 'options' column that is an integer and use a bit value for each option, e.g. optionA == options & 1, optionB == options & 2, etc. (see the sketch after this list)
Some databases have a bit vector data type which you could use. For MySQL there is the BIT data type, which can store bit strings up to 64 bits long.
There will be a trade-off between code complexity and efficiency for each of these. Ask yourself, how much of the machine's time or storage will be saved by employing each of these options? And how much of your time will be saved?
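As a sketch of the bit-value idea from the list above (assuming an integer options column and MySQL-style bitwise operators; OptionA = 1, OptionB = 2, OptionC = 4):

-- rows where OptionA is set and OptionC is clear
SELECT * FROM config WHERE (options & 1) = 1 AND (options & 4) = 0;

-- turning OptionB on for one row
UPDATE config SET options = options | 2 WHERE id = 42;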
In this instance the three-column approach is the one I would recommend. Not only does it keep things simple in terms of extracting data, but should you ever wish, you could set values against all three columns rather than being limited to one Varchar2 field. If you opt for the single Varchar2 column, it is fairly simple to extract the info you need using substr or another variation, but although this isn't heavy work for an Oracle DB, it does put extra work on the server that is not necessary.
I am wondering how to represent an end-of-time (positive infinity) value in the database.
When we were using a 32-bit time value, the obvious answer was the actual 32-bit end of time - something near the year 2038.
Now that we're using a 64-bit time value, we can't represent the 64-bit end of time in a DATETIME field, since 64-bit end of time is billions of years from now.
Since SQL Server and Oracle (our two supported platforms) both allow years up to 9999, I was thinking that we could just pick some "big" future date like 1/1/3000.
However, since customers and our QA department will both be looking at the DB values, I want it to be obvious and not appear like someone messed up their date arithmetic.
Do we just pick a date and stick to it?
Use the max collating date, which, depending on your DBMS, is likely going to be 9999-12-31. You want to do this because queries based on date ranges will quickly become miserably complex if you try to take a "purist" approach like using Null, as suggested by some commenters or using a forever flag, as suggested by Marc B.
When you use max collating date to mean "forever" or "until further notice" in your date ranges, it makes for very simple, natural queries. It makes these kind of queries very clear and simple:
Find me records that are in effect as of a given point in time.
... WHERE effective_date <= @PointInTime AND expiry_date >= @PointInTime
Find me records that are in effect over the following time range.
... WHERE effective_date <= @StartOfRange AND expiry_date >= @EndOfRange
Find me records that have overlapping date ranges.
... WHERE A.effective_date <= B.expiry_date AND B.effective_date <= A.expiry_date
Find me records that have no expiry.
... WHERE expiry_date = @MaxCollatingDate
Find me time periods where no record is in effect.
OK, so this one isn't simple (it's the classic gaps-and-islands problem), but it's still simpler using max collating dates for the end point than it would be with nulls.
Using this approach can create a bit of an issue for some users, who might find "9999-12-31" confusing in a report or on a screen. If this is going to be a problem for you, then drdwicox's suggestion of translating to a user-friendly value is good. However, I would suggest that the user interface layer, not the middle tier, is the place to do this, since the most sensible or palatable presentation may differ depending on whether you are talking about a report or a data entry form, and on whether the audience is internal or external. In some places you might want a simple blank; in others, the word "forever"; in others, an empty text box with a check box that says "Until Further Notice".
In PostgreSQL, the end of time is 'infinity'. It also supports '-infinity'. The value 'infinity' is guaranteed to be later than all other timestamps.
create table infinite_time (
ts timestamp primary key
);
insert into infinite_time values
(current_timestamp),
('infinity');
select *
from infinite_time
order by ts;
2011-11-06 08:16:22.078
infinity
PostgreSQL has supported 'infinity' and '-infinity' since at least version 8.0.
You can mimic this behavior, in part at least, by using the maximum date your dbms supports. But the maximum date might not be the best choice. PostgreSQL's maximum timestamp is some time in the year 294,276, which is sure to surprise some people. (I don't like to surprise users.)
2011-11-06 08:16:21.734
294276-01-01 00:00:00
infinity
A value like this is probably more useful: '9999-12-31 11:59:59.999'.
2011-11-06 08:16:21.734
9999-12-31 11:59:59.999
infinity
That's not quite the maximum value in the year 9999, but the digits align nicely. You can wrap that value in an infinity() function and in a CREATE DOMAIN statement. If you build or maintain your database structure from source code, you can use macro expansion to expand INFINITY to a suitable value.
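For example, a sketch of those two ideas in PostgreSQL (the names infinity and end_ts are mine):

CREATE FUNCTION infinity() RETURNS timestamp AS
$$ SELECT timestamp '9999-12-31 11:59:59.999' $$
LANGUAGE sql IMMUTABLE;

CREATE DOMAIN end_ts AS timestamp
    NOT NULL
    DEFAULT '9999-12-31 11:59:59.999';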
We sometimes pick a date, then establish a policy that the date must never appear unfiltered. The most common place to enforce that policy is in the middle tier. We just filter the results to change the "magic" end-of-time date to something more palatable.
Representing the notion of "until eternity" or "until further notice" is an iffy proposition.
Relational theory proper says that there is no such thing as null, so you're obliged to have whatever table it is split in two: one part with the rows for which the end date/end time is known, and another for the rows for which the end time is not yet known.
But (like having a null) splitting the tables in two will make a mess of your query writing too. Views can somewhat accommodate the read-only parts, but updates (or writing the INSTEAD OF triggers on your views) will be tough no matter what, and likely to affect performance negatively at that.
Having the null represent "end time not yet known" will make updating a bit "easier", but the read queries get messy with all the CASE ... or COALESCE ... constructs you'll need.
Using the theoretically correct solution mentioned by dportas gets messy in all those cases where you want to "extract" a DATE from a DATETIME. If the DATETIME value at hand is "the end of (representable) time (billions of years from now as you say)", then this is not just a simple case of invoking the DATE extractor function on that DATETIME value, because you'd also want that DATE extractor to produce the "end of representable DATEs" for your case.
Plus, you probably do not want to show "absent end of time" as being a value 9999-12-31 in your user interface. So if you use the "real value" of the end of time in your database, you're facing a bit of work seeing to it that that value won't appear in your UI anywhere.
Sorry for not being able to say that there's a way to stay out of all messes. The only choice you really have is which mess to end up in.
Don't make a date be "special". While it's unlikely your code would be around in 9999 or even in 2^63-1, look at all the fun that using '12/31/1999' caused just a few years ago.
If you need to signal an "endless" or "infinite" time, then add a boolean/bit field to signal that state.
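A minimal sketch of that flag approach (T-SQL flavor, hypothetical names):

ALTER TABLE subscription ADD is_endless bit NOT NULL DEFAULT 0

-- rows in effect at a point in time:
SELECT * FROM subscription
WHERE effective_date <= @PointInTime
  AND (is_endless = 1 OR expiry_date >= @PointInTime)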