Confusion about nvarchar(MAX) in Fluent NHibernate/MSSQL - sql-server

I've been looking around for the best way to store a large string value (like a blog post, or a text description, etc.) with Fluent nHibernate and the answer I keep seeing is to use an nvarchar(MAX). Which, if my reading is correct (which it very often isn't) is 4000+. So I have a field like so...
Map(x => x.Description)
.Column("[description]")
.Length(4001)
.Access.Property()
.Not.Nullable();
In theory, this should do it, right? I'm a little confused about this though. In school, we were taught pretty clearly that you want to make each column size as small as possible.
If I make that column max size, doesn't that go against that very principle and make the table very large, and wasteful? Can anyone shed some very clear, stupid, blonde-proof logic on this for me? I've been left with a lot of confusion over the whole ordeal.

Have a look at this, maybe be it will help: http://www.crankingoutcode.com/post/Storing-large-strings-with-Fluent-NHibernate-(automapping).aspx
And Gotcha's: http://ayende.com/blog/1969/nhibernate-and-large-text-fields-gotchas

Note that max means that you can store characters upto 2^31-1 bytes of data. However it will consume space based on the actual length of the data

Related

Is it possible to get the original UUID1 from a timestamp?

Say, for example, I generate a time based UUID with the following program.
import uuid
uuid = uuid.uuid1()
print uuid
print uuid.time
I get the following:
47702997-155d-11ea-92d3-6030d48747ec
137946228962896279
Can I get back the original UUID, that is 47702997-155d-11ea-92d3-6030d48747ec, if I know the timestamp (137946228962896279)?
I am reading about UUID version 1 and found a few programs that "kind" of tries to reverses it, but, every time, I am getting a different UUID.
The part that is always changing are the timestamp part (last 4 digits of the first block - 47702997) and the clock_sequence (92d3).
If it is possible to get back the original UUID, what would I need?
Any help/direction is greatly appreciated.
I also made a post in Security Stackexchange but later realized that this question should have been posted here.
The more I look into it, I can see that this is not at all possible since the timestamp does not contain information on the clock_sequence unless I am wrong, in which case, please correct me.
A UUIDv1 contains two main pieces: a temporally unique part (aka timestamp) and a spatially unique part. By design, the two are completely independent, so if you throw one of the pieces away, there is no way to recover it later from the other one.
More generally, notice how the entire UUID is 32 hexadecimal digits, or 128 bits of information, but the timestamp is 18 decimal digits, or only about 60 bits of information. Even without knowing much about UUIDs you could guess that some of the bits are redundant or fixed (and a few are fixed, or at least guessable), but over half of them? Not likely, which means this translation is not reversible.

What is the best way to send big arrays with AllJoyn

According to specification, AllJoyn limits the array size to ALLJOYN_MAX_ARRAY_LEN = 2^17 bytes when DBus has a greater limit of 2^26.
I have a case where I need to send images as array of bytes ('ay' signature) that, even compressed, can be bigger than 2^17 bytes.
There are different ways to achieve this, for example splitting the array and adding an array ID and number of chunk, rebuilding the correct array on the other side.
I was wondering if anyone came into a similar problem and what approach would work best.
It looks like a Bulk Data Transfer interface has been specified and is being reviewed by the AllSeen Alliance interface review board. It does not look like it has been approved yet. Also it does not look like anyone has created an implementation of the interface at this time.
https://git.allseenalliance.org/gerrit/#/c/5977

Combining two GUID/UUIDs with MD5, any reasons this is a bad idea?

I am faced with the need of deriving a single ID from N IDs and at first a i had a complex table in my database with FirstID, SecondID, and a varbinary(MAX) with remaining IDs, and while this technically works its painful, slow, and centralized so i came up with this:
simple version in C#:
Guid idA = Guid.NewGuid();
Guid idB = Guid.NewGuid();
byte[] data = new byte[32];
idA.ToByteArray().CopyTo(data, 0);
idB.ToByteArray().CopyTo(data, 16);
byte[] hash = MD5.Create().ComputeHash(data);
Guid newID = new Guid(hash);
now a proper version will sort the IDs and support more than two, and probably reuse the MD5 object, but this should be faster to understand.
Now security is not a factor in this, none of the IDs are secret, just saying this 'cause everyone i talk to react badly when you say MD5, and MD5 is particularly useful for this as it outputs 128 bits and thus can be converted directly to a new Guid.
now it seems to me that this should be just dandy, while i may increase the odds of a collision of Guids it still seems like i could do this till the sun burns out and be no where near running into a practical issue.
However i have no clue how MD5 is actually implemented and may have overlooked something significant, so my question is this: is there any reason this should cause problems? (assume sub trillion records and ideally the output IDs should be just as global/universal as the other IDs)
My first thought is that you would not be generating a true UUID. You would end up with an arbitrary set of 128-bits. But a UUID is not an arbitrary set of bits. See the 'M' and 'N' callouts in the Wikipedia page. I don't know if this is a concern in practice or not. Perhaps you could manipulate a few bits (the 13th and 17th hex digits) inside your MD5 output to transform the hash outbut to a true UUID, as mentioned in this description of Version 4 UUIDs.
Another issue… MD5 does not do a great job of distributing generated values across the range of possible outputs. In other words, some possible values are more likely to be generated more often than other values. Or as the Wikipedia article puts it, MD5 is not collision resistant.
Nevertheless, as you pointed out, probably the chance of a collision is unrealistic.
I might be tempted to try to increase the entropy by repeating your combined value to create a much longer input to the MD5 function. In your example code, take that 32 octet value and use it repeatedly to create a value 10 or 1,000 times longer (320 octects, 32,000 or whatever).
In other words, if working with hex strings for my own convenience here instead of the octets of your example, given these two UUIDs:
78BC2A6B-4F03-48D0-BB74-051A6A75CCA1
FCF1B8E4-5548-4C43-995A-8DA2555459C8
…instead of feeding this to the MD5 function:
78BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C8
…feed this:
78BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C8
…or something repeated even longer.

CSV String vs Arrays: Is this too stringly typed?

I came across some existing code in our production environment given to us by our vendor. They use a string to store comma seperated values to store filtered results from a DB. Keep in mind that this is for a proprietary scripting language called PowerOn that interfaces with a database residing on an AIX system, but it's a language that supports strings, integers, and arrays.
For example, we have;
Account
----------------
123
234
3456
28390
The psuedo code might look like;
Define accounts As String
For Each Account
accounts=accounts + CharCast(Account) + ","
End
as opposed to something I would expect to see like
Define accounts As Integer Array(99)
Define index as Integer=0
For Each Account
accounts(index)=Account
index=index+1
End
By the time the loop is done, accounts will look like; 123,234,3456,28390,. The string is later used to test if a specific instance exists like so
If CharSearch("28390", accounts) > 0 Then Call DoSomething
In the example, the statement evaluates to true and DoSomething gets called. Given the option of arrays, why would want to store integer values whithin a string of comma seperated values? Every language I've come across, it's almost always more expensive to perform string based operations than integer based operations.
Considering I haven't seen this technique before and my experience is somewhat limitted, is there a name for this? Is this common practice or is this just another example of being too stringly typed? To extend the existing code, should I continue using string method? Did we get cruddy code from our vendor?
What I put in the comment still holds but my real answer is: It's probably a design decision with respect to compatibility/portability. In your integer-array case (and a low enough level of the API) you'd typically find yourself asking questions like, what's a safe guess of the size of an integer on "today"'s machines. What about endianness.
The most portable and most flexible of all data formats always has been and always will be printed representation. It may not be as fast to process that but that's where adapters/converters or so kick in. I wouldn't be surprised to find (human-readable) printed representation of something especially in database APIs like you describe.
If you want something fast, just take whatever is given to you, convert it to a more efficient internal format, do you processing and convert it back.
There's nothing inherently wrong with using comma-separated strings instead of arrays. Sure you can't readily access a random n's element of such a collection, but if such random access is not needed then there's no penalty for it, right?
As far as I know Oracle DB stores NUMBER values as strings (and if my memory is correct - for DATEs as well) for very practical reasons.
In your specific example looks like using strings is an overkill when dealing with passing data around without crossing the process boundaries. But could it be that the choice of string data type makes more sense when sending data over wire or storing on disk?

Phone Number Columns in a Database

In the last 3 companies I've worked at, the phone number columns are of type varchar(n). The reason being that they might want to store extensions (ext. 333). But in every case, the "-" characters are stripped out when inserting and updating. I don't understand why the ".ext" characters are okay to store but not the "-" character. Has any one else seen this and what explanation can you think of for doing it this way? If all you want to store is the numbers, then aren't you better off using an int field? Conversely, if you want to store the number as a string/varchar, then why not keep all the characters and not bother with formatting on display and cleaning on write?
I'm also interested in hearing about other ways in which phone number storage is implemented in other places.
Quick test: are you going to add/subtract/multiply/divide Phone Numbers? Nope. Similarly to SSNs, Phone Numbers are discrete pieces of data that can contain actual numbers, so a string type is probably most appropriate.
one point with storing phone numbers is a leading 0.
eg: 01202 8765432
in an int column, the 0 will be stripped of, which makes the phone number invalid.
I would hazard a guess at the - being swapped for spaces is because they dont actually mean anything
eg: 123-456-789 = 123 456 789 = 123456789
Personally, I wouldn't strip out any characters, as depending on where the phone number is from, it could mean different things. Leave the phone number in the exact format it was entered, as obviously that's the way the person who typed it in is used to seeing it.
It doesn't really matter how you store it, as long as it's consistent. The norm is to strip out formatting characters, but you can also store country code, area code, exchange, and extension separately if you have a need to query on those values. Again, the requirement is that it's consistent - otherwise querying it is a PITA.
Another reason I can think of not to store phone numbers as 'numbers' but as strings of characters, is that often enough part of the software stack you'd use to access the database (PHP, I am looking at you) wouldn't support big enough integers (natively) to be able to store some of the longer and/or exotic phone numbers.
Largest number that 32-bits can carry, without sign, is 4294967295. That wouldn't work for just any Russian mobile phone number, take, for instance, the number 4959261234.
So you have yourself an extra inconvenience of finding a way to carry more than 32-bits worth of number data. Even though databases have long supported very large integers, you only need one bad link in the chain for a showstopper. Like PHP, again.
Stripping some characters and allowing others may have an impact if the database table is going to drive another system, e.g. IP Telephony of some sort. Depending on the systems involved, it may be legitimate to have etc.333 as a suffix, whereas the developers may not have accounted for "-" in the string (and yes, I am guessing here...)
As for storing as a varchar rather than an int, this is just plain-ole common sense to me. As mentioned before, leading zeros may be stripped in an int field, the query on an int field may perform implicit math functions (which could also explain stripping "-" from the text, you don't want to enter 555-1234 and have it stored as -679 do you?)
In short, I don't know the exact reasoning, but can deduce some possibilities.
I'd opt to store the digits as a string and add the various "()" and "-" in my display code. It does get more difficult with international numbers. We handle it by having various "internationalized" display formats depending on country.
What I like to do if I know the phone numbers are only going to be within a specific region, such as North America, is to change the entry into 4 fields. 3 for area code, 3 for prefix, 3 for line, and maybe 5 for extension. I then insert these as 1 field with '-' and maybe an 'e' to designate extension. Any searching of course also needs to follow the same process. This ensures I get more regular data and even allows for the number to be used for actually making a phone call, once the - and the extension are removed. I can also get back to original 4 fields easily.
Good stuff! It seems that the main point is that the formatting of the phone number is not actually part of the data but is instead an aspect of the source country. Still, by keeping the extension part of the number as is, one might be breaking the model of separating the formatting from the data. I doubt that all countries use the same syntax/format to describe an extension. Additionally, if integrating with a phone system is a (possible) requirement, then it might be better to store the extension separately and build the message as it is expected. But Mark also makes a good point that if you are consistent, then it probably won't matter how you store it since you can query and process it consistently as well.
Thank you Eric for the link to the other question.
When an automated telephone system uses a field to make a phone call it may not be able to tell what characters it should use and which it should ignore in dialing. A human being may see a "(" or ")" or "-" character and know these are considered delimiters separating the area code, npa, and nxx of the phone number. Remember though that each character represents a binary pattern that, unless pre-programmed to ignore, would be entered by an automated dialer. To account for this it is better to store the equivalent of only the characters a user would press on the phone handset and even better that the individual values be stored in separate columns so the dialer can use individual fields without having to parse the string.
Even if not using dialing automation it is a good practice to store things you dont need to update in the future. It is much easier to add characters between fields than strip them out of strings.
In comment of using a string vs. integer datatype as noted above strings are the proper way to store phone numbers based on variations between countries. There is an important caveat to that though in that while aggregating statistics for reporting (i.e. SUM of how many numbers or calls) character strings are MUCH slower to count than integers. To account for this its important to add an integer as an identity column that you can use for counting instead of the varchar or char field datatype.

Resources