Indexing an encrypted column in SQL Server - sql-server

I have patient health information stored in a SQL Server 2012 database. The patient names are encrypted, so a search on a patient's name is very slow. How can I add an index on an encrypted column?
I am using Symmetric Key encryption (256-bit AES) on varbinary fields.
There are separate encrypted fields for the patient's first name, last name, address, phone number, DOB, and SSN. All of these are searchable (including partial matches) except SSN.

To build on the answer that @PhillipH provided: if you are performing an exact search on (say) last name, you can include a computed column defined as CHECKSUM(encrypt(last_name)) (where encrypt is your encryption operation). This assumes encrypt is deterministic, that is, the same plaintext always produces the same ciphertext. The checksum itself is safe to store: a checksum on the encrypted value does not reveal anything about the plaintext.
Create an index on this computed column. To search on the name, instead of just doing WHERE encrypted_last_name = encrypt(last_name), add a search on the hash: WHERE encrypted_last_name = encrypt(last_name) AND CHECKSUM(encrypt(last_name)) = hashed_encrypted_last_name. This is much faster because SQL Server only has to search an index for a small integer value, then verify that the name in fact matches, reducing the amount of data to check considerably. Note that no data is decrypted in this scheme, with or without the CHECKSUM -- we search for the encrypted value only. The speedup does not come from reducing the amount of data that is encrypted/decrypted (only the data you pass in is encrypted) but the amount of data that needs to be indexed and compared for equality.
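The lookup pattern above can be sketched outside the database. This is a minimal Python illustration, not T-SQL, and every name in it is hypothetical: `toy_encrypt` is a deterministic stand-in for the real encryption routine (it is a keyed digest, not a cipher), `zlib.crc32` plays the role of CHECKSUM, and a dictionary plays the role of the index on the computed column.

```python
import hashlib
import hmac
import zlib
from collections import defaultdict

KEY = b"demo-key"  # hypothetical key, for illustration only


def toy_encrypt(plaintext: str) -> bytes:
    """Deterministic stand-in for the real encryption (NOT a real cipher)."""
    return hmac.new(KEY, plaintext.encode("utf-8"), hashlib.sha256).digest()


def checksum(ciphertext: bytes) -> int:
    """Small integer derived from the ciphertext, in the spirit of CHECKSUM()."""
    return zlib.crc32(ciphertext)


# "Table": row id -> encrypted last name.
rows = {
    row_id: toy_encrypt(name)
    for row_id, name in enumerate(["Smith", "Jones", "Smith", "Nguyen"])
}

# "Index" on the computed column: checksum -> row ids.
index = defaultdict(list)
for row_id, ciphertext in rows.items():
    index[checksum(ciphertext)].append(row_id)


def find_by_last_name(name: str) -> list[int]:
    """Look up the small checksum first, then verify the full encrypted value."""
    target = toy_encrypt(name)
    candidates = index.get(checksum(target), [])
    # The equality check weeds out rows that only share a checksum.
    return [row_id for row_id in candidates if rows[row_id] == target]
```

Nothing is decrypted during the search; only the value passed in is encrypted, which matches the point made above.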
The only drawback is that this does not allow partial searches, or even case variation, and indeed, doing that securely is not trivial. Case is relatively simple (hash encrypt(UPPER(name)), making sure you use a different key to avoid correlation), but partial matches require specialized indexes. The simplest approach I can think of is to use a separate service like Lucene to do the indexing, but make it use secure storage for its files (e.g. Encrypting File System (EFS) in Windows). Of course, that does mean a separate system that needs to be certified -- but I can't think of any convenient solution that remains entirely in SQL Server and does not require additional code.
If you can still change the database design/storage, you may wish to consider Transparent Data Encryption (TDE) which has the huge advantage that it's, well, transparent and integrated in SQL Server at the engine level. Not only should partial matching be much faster since individual rows don't need decrypting (just whole pages), if it's not fast enough you can create a full-text index which will also be encrypted. I don't know if TDE works with your security requirements, though.

As a programmatic solution, if you don't need partial matches, you could store a hash in the clear in another field, apply the same hashing algorithm on the client/app server, and match on the hash. This allows for the possibility of a false positive match, but it removes the need to decrypt the data in order to search it.

If you are using SQL Server's built-in ENCRYPTBYKEY function, there is no benefit to indexing that column: ENCRYPTBYKEY produces a different output every time for the same input, because SQL Server generates a random IV for each call.
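The effect of a random IV can be shown with a toy example. The sketch below is not ENCRYPTBYKEY and not AES; `toy_encrypt` and `toy_decrypt` are hypothetical helpers built on a SHA-256 keystream purely to show why two encryptions of the same value never compare equal, which is what defeats an equality lookup on the ciphertext.

```python
import hashlib
import os


def _keystream(key: bytes, iv: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from key and IV (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + iv + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt with a fresh random IV; the IV is stored in front of the body."""
    iv = os.urandom(16)
    stream = _keystream(key, iv, len(plaintext))
    return iv + bytes(p ^ s for p, s in zip(plaintext, stream))


def toy_decrypt(key: bytes, blob: bytes) -> bytes:
    """Split off the IV, rebuild the keystream, and undo the XOR."""
    iv, body = blob[:16], blob[16:]
    stream = _keystream(key, iv, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


key = b"k" * 32
first = toy_encrypt(key, b"Smith")
second = toy_encrypt(key, b"Smith")
# Same plaintext, same key, different stored bytes:
# WHERE encrypted_col = <freshly encrypted value> would find no rows.
```

Both values still decrypt to the same name, so the data is intact; it is only the ciphertext comparison that stops working.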

Related

Encrypted Data only accessible for user as data owner and algorithm

Which method would you choose to make encrypted data accessible only to the user and to an algorithm that processes and evaluates the data? In this case the user would be one of n service users, who would add sensitive data (mostly answers to questions) about themselves to the database. The company providing the database shouldn't have any access to the sensitive data, only to the results of the data processing. The results wouldn't allow any conclusions about the sensitive data.
What you are looking for is Fully Homomorphic Encryption (FHE). FHE operates on encrypted data. This can be achieved by an encryption scheme that supports two operations on encrypted data. RSA and others only supported one operation until Gentry's work.
With FHE libraries like HElib (there are many now), you can upload your data to the server and give it a function (circuit) to evaluate. FHE schemes, in general, have semantic security (randomized encryption). The semi-honest server sees only encrypted data and returns the encrypted result to you.
Note: They are not practical, yet.
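To make "supports one operation on encrypted data" concrete, here is a small sketch using textbook RSA with tiny primes. It is an illustration only (no padding, trivially breakable, and not an FHE scheme); the function names are hypothetical. It shows that multiplying two ciphertexts yields a ciphertext of the product, which is the single operation unpadded RSA supports.

```python
# Textbook RSA with toy parameters (for illustration only).
P, Q = 61, 53
N = P * Q   # modulus, 3233
E = 17      # public exponent
D = 2753    # private exponent: E * D = 1 mod (P - 1) * (Q - 1)


def encrypt(message: int) -> int:
    return pow(message, E, N)


def decrypt(ciphertext: int) -> int:
    return pow(ciphertext, D, N)


def multiply_encrypted(c1: int, c2: int) -> int:
    """Combine two ciphertexts without ever seeing the plaintexts."""
    return (c1 * c2) % N


# The "server" multiplies ciphertexts; only the key holder can read the result.
product_ciphertext = multiply_encrypted(encrypt(7), encrypt(3))
```

Decrypting `product_ciphertext` gives 7 * 3. A fully homomorphic scheme needs a second operation (addition as well as multiplication) so that arbitrary circuits can be evaluated.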
I think the best way to do that is to save only the result. But if you want to save the user's answers, you could use AES with a key derived from the user's password; the user will then have to enter the password every time to decrypt the data.
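An AES-256 key must be exactly 32 bytes, so a password is normally run through a key derivation function rather than used directly. The sketch below uses PBKDF2 from Python's standard library; the function name, the per-user salt, and the iteration count are assumptions for illustration, not part of the answer above.

```python
import hashlib
import os


def derive_key(password: str, salt: bytes, iterations: int = 200_000) -> bytes:
    """Stretch a password into a 32-byte key suitable for AES-256."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )


# The salt is not secret; it would be stored next to the user's encrypted data.
salt = os.urandom(16)
key = derive_key("correct horse battery staple", salt)
```

Because the key is recomputed from the password on every login and never stored, the company operating the database cannot decrypt the answers on its own.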

T-SQL/CLR function for deterministic encryption

I have a User Agent Strings table with the following structure:
UserAgentStringID INT
UserAgentStringValue VARBINARY(8000)
The [UserAgentStringValue] field is encrypted with a symmetric key. The previous version of the table structure was:
UserAgentStringID INT
UserAgentStringValue NVARCHAR(4000)
UserAgentStringHASH BINARY(32)
and I had an index on the [UserAgentStringHASH] column in order to optimize searches.
With the new format, such an index is not effective, because the encryption function uses an initialization vector and so generates a different value each time it is called with the same input:
Initialization vectors are used to initialize the block algorithm. It
is not intended to be a secret, but must be unique for every call to
the encryption function in order to avoid revealing patterns.
So, I can create index on my encrypted field, but if I try to search by encrypted value, I will not be able to find anything.
I do not want to use a hash, because a plain hash function is not a secure technique. If someone has my table data and a table with all or a large number of user agents, they will be able to join on the hash and reveal my data.
In SQL Server 2016 SP standard edition we have Always Encrypted, which allows using Deterministic Encryption for column values - this means equality comparisons work and indexes can be created.
I am looking for a way to optimize the search with another technique, or a way to implement deterministic encryption, using CLR for example.
Knowing there is no work around is OK for me, too. I guess I will pay the data protection with performance.
I am posting a workaround for this - it's not the ideal solution, but it is a compromise between speed and security.
The details
a column must be encrypted (let's say an email address)
fast search must be implemented (let's say the email is used for login and we need to locate the record as fast as possible)
we are not able to use Always Encrypted deterministic encryption (for various reasons)
we don't want to use a hash function with a salt - if one has the salt for each user, ze might be able to read the hashes using a large sample database
The security hierarchy
There are various ways of implementing the security hierarchy. The following schema from the MSDN describes it very well.
In our environment we are using the Database Master Key -> Certificate -> Symmetric Key hierarchy. Only DBAs know the DMK password and have access to the certificate and symmetric keys. Some developers can encrypt/decrypt data (using plain T-SQL) and others cannot.
Note that with Always Encrypted you can have role separation - the people who work with the data do not have access to the keys, and the people who have access to the keys do not have access to the data. In our case, we want to protect our data from outsiders, and we have other techniques for granting/logging data access internally.
Developers with access to encrypted data
The developers who can access the protected data are able to encrypt and decrypt it. They do not have access to the symmetric key values. If one has access to the symmetric key values, ze is able to decrypt the data even without having the certificates used for protecting the symmetric keys. Basically, only sysadmins and db_owners have access to the symmetric key values.
How to hash
We need a hash to get fast searches, but we cannot use a salt that is not encrypted, and a hash without a salt is like plain text from a security perspective. So, we've decided to use the symmetric key value as the salt. It is retrieved like this:
SELECT @SymmetricKeyValue = CONVERT(VARCHAR(128), DECRYPTBYCERT(C.[certificate_id], KE.[crypt_property]), 1)
FROM [sys].[symmetric_keys] SK
INNER JOIN [sys].[key_encryptions] KE
    ON SK.[symmetric_key_id] = KE.[key_id]
INNER JOIN [sys].[certificates] C
    ON KE.[thumbprint] = C.[thumbprint]
WHERE SK.[name] = @SymmetricKeyName;
This value is concatenated to the email address and then the hash is calculated. It works well for us, because it binds the hash to the security hierarchy. It is not a different salt for each record, it is the same one - but if one knows the symmetric key value, ze is able to decrypt the data directly anyway.
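The hashing step can be sketched in a few lines of Python (the real implementation lives in T-SQL routines). Everything named here is hypothetical: `search_hash`, the sample key string, and in particular the trimming and lower-casing of the email, which is an added assumption so that the same address always maps to the same hash. SHA-256 is chosen because its 32-byte output fits a BINARY(32) column.

```python
import hashlib


def search_hash(email: str, symmetric_key_value: str) -> bytes:
    """Hash the email with the secret key value appended as a shared salt."""
    normalized = email.strip().lower()  # assumption: case-insensitive logins
    return hashlib.sha256(
        (normalized + symmetric_key_value).encode("utf-8")
    ).digest()


# Hypothetical key value; in the scheme above it comes from the key hierarchy.
KEY_VALUE = "0x0123456789ABCDEF"

# "Index" from hash to record id, standing in for an indexed hash column.
users_by_hash = {search_hash("alice@example.com", KEY_VALUE): 1}


def find_user(email: str):
    """Return the record id for an email, or None if it is not present."""
    return users_by_hash.get(search_hash(email, KEY_VALUE))
```

Without the key value an attacker cannot recompute the hashes from a list of known email addresses, which is the property the plain unsalted hash lacked.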
Considerations
The routines (stored procedures, triggers) that search by hash value or compute hashes need to be created with the EXECUTE AS OWNER clause. Otherwise, developers will not be able to execute them, as only sysadmins and db_owners have access to the symmetric key value.

Fuzzy Logic Lookup - How to use calculated columns

We're starting to implement Unicode as we've added some international customers. There are some issues comparing character data in SSIS because of capitals, accents, and other data problems.
I've thought that the Fuzzy logic lookup could be a good solution. However, when testing this solution out, I realized that in a lot of our existing code we limit what data to process, and send in those values by parameters.
I've noticed that in the Fuzzy Lookup, I can specify the name of the table, but I can't make changes such as removing a % from a field and turning it into a decimal. Any ideas how we can set up the lookup with calculated fields?
Thanks!
Create a view in your database that applies the transformations you require using a SQL query, and point the Fuzzy Lookup at the view instead of the table.

Creating an efficient search capability using SQL Server (and/or coldfusion)

I am trying to visualize how to create a search for an application that we are building. I would like a suggestion on how to approach 'searching' through large sets of data.
For instance, this particular search would run against a table of at least 750k records, holding product SKUs, sizing, material type, create date, etc.
Is anyone aware of a 'plugin' solution for ColdFusion to do this? I envision a Google-like single-entry search where a customer can type in the part number, or the sizing, etc., and get hits on any or all relevant results.
Currently if I run a 'LIKE' comparison query, it seems to take ages (ok a few seconds, but still), and it is too long. At times making a user sit there and wait up to 10 seconds for queries & page loads.
Or are there any SQL formulas to help accomplish this? I want to use a proven method to search the data, not just a simple SQL like or = comparison operation.
So this is a multi-approach question, should I attack this at the SQL level (as it ultimately looks to be) or is there a plug in/module for ColdFusion that I can grab that will give me speedy, advanced search capability.
You could try indexing your db records with a Verity (or Solr, if CF9) search.
I'm not sure it would be faster, and whether it is even worth trying depends a lot on how often you update the records you need to search. If you update them rarely, you could run a Verity index update whenever you update them. If you update the records constantly, that's going to be a drag on the web server, and will certainly negate any possible gains in search speed.
I've never indexed a database via Verity, but I've indexed large collections of PDFs, Word Docs, etc, and I recall the search being pretty fast. I don't know if it will help your current situation, but it might be worth further research.
If your slowdown is specifically the search of textual fields (as I surmise from your mention of LIKE), the best solution is building an index table (not to be confused with DB table indexes, which are also part of the answer).
Build an index table mapping the unique ID of your records from main table to a set of words (1 word per row) of the textual field. If it matters, add the field of origin as a 3rd column in the index table, and if you want "relevance" features you may want to consider word count.
Populate the index table with either a trigger (using splitting) or from your app - the latter might be better, simply call a stored proc with both the actual data to insert/update and the list of words already split up.
This will immediately drastically speed up textual search as it will no longer do "LIKE", AND will be able to use indexes on index table (no pun intended) without interfering with indexing on SKU and the like on the main table.
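The index-table idea can be sketched in memory as follows. This is a minimal Python illustration under assumptions: the `WordIndex` class, the word-splitting rule, and the sample records are all hypothetical, and in the database the dictionary would be an ordinary table with columns (word, record id, field) and an index on the word column.

```python
import re
from collections import defaultdict


def split_words(text: str) -> set[str]:
    """Lower-case the text and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class WordIndex:
    """Maps each word to the (record id, field of origin) pairs containing it."""

    def __init__(self) -> None:
        self._rows = defaultdict(set)

    def add(self, record_id: int, field: str, text: str) -> None:
        for word in split_words(text):
            self._rows[word].add((record_id, field))

    def search(self, query: str) -> set[int]:
        """Return ids of records that contain every word in the query."""
        result = None
        for word in split_words(query):
            ids = {record_id for record_id, _ in self._rows.get(word, set())}
            result = ids if result is None else result & ids
        return result or set()


index = WordIndex()
index.add(1, "description", "Stainless steel bolt 10mm")
index.add(2, "description", "Brass bolt 12mm")
index.add(3, "material", "Stainless steel")
```

Each query word becomes an exact-match lookup, which is why no LIKE scan is needed; the trade-off is that only whole words match.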
Also, ensure that all the relevant fields are indexed fully - not necessarily in the same compound index (SKU, sizing etc...), and any field that is searched as a range field (sizing or date) is a good candidate for a clustered index (as long as the records are inserted in approximate order of that field's increase or you don't care about insert/update speed as much).
For anything more detailed, you will need to post your table structure, existing indexes, the queries that are slow and the query plans you have now for those slow queries.
Another item is to ensure that as few of the fields as possible are textual, especially ones that are "decodable" - your comment mentioned "is it boxed" in the text fields set. If so, I assume the values are "yes"/"no" or some other very limited data set. In that case, simply store a numeric code for the valid values, do the en/de-coding in your app, and search by the numeric code. Not a tremendous speed improvement but still an improvement.
I've done this using SQL Server's full text indexes. This will require very few application changes and no changes to the database schema except for the addition of the full text index.
First, add the Full Text index to the table. Include in the full text index all of the columns the search should perform against. I'd also recommend having the index auto update; this shouldn't be a problem unless your SQL Server is already being highly taxed.
Second, to do the actual search, you need to convert your query to use a full text search. The first step is to convert the search string into a full text search string. I do this by splitting the search string into words (using the Split method) and then building a search string formatted as:
"Word1*" AND "Word2*" AND "Word3*"
The double-quotes are critical; they tell the full text index where the words begin and end.
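The string-building step can be sketched like this. The function name is hypothetical, and stripping embedded double quotes from the user's input is an added assumption so that a stray quote cannot break the quoting of a term.

```python
def build_contains_query(search: str) -> str:
    """Turn free text into a prefix-term full text search condition."""
    words = [word.replace('"', "") for word in search.split()]
    words = [word for word in words if word]  # drop terms that were only quotes
    return " AND ".join(f'"{word}*"' for word in words)


condition = build_contains_query("Word1 Word2 Word3")
```

The resulting string is best passed to the query as a parameter rather than concatenated into the SQL text.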
Next, to actually execute the full text search, use the ContainsTable command in your query:
SELECT *
from containstable(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"')
This will return two columns:
Key - The column identified as the primary key of the full text search
Rank - A relative rank of the match (1 - 1000 with a higher ranking meaning a better match).
I've used approaches similar to this many times and I've had good luck with it.
If you want a truly plug-in solution then you should just go with Google itself. It sounds like you're doing some kind of e-commerce or commercial site (given the use of the term 'SKU'), so you probably have a catalog of some kind with product pages. If you have consistent markup then you can configure a Google appliance or service to do exactly what you want. It will send a bot in to index your pages and find your fields. No SQL, little coding, and it will not be dependent on your database, or even ColdFusion. It will also be quite fast and familiar to customers.
I was able to do this with a ColdFusion site in about 6 hours, done! The only thing to watch out for is that Google's index is limited to what the bot can see, so if you have a situation where you want to limit access based on a user's role or permissions or group, then it may not be the solution for you (although you can configure a permission service for Google to check with).
Because SQL Server is where your data is, that is where your search performance is most likely to be an issue. Make sure you have indexes on the columns you are searching on. If you are using LIKE, SQL Server can't seek on an index if you write it this way: SELECT * FROM TABLEX WHERE last_name LIKE '%FR%'
But it can use an index if you write it like this: SELECT * FROM TABLEX WHERE last_name LIKE 'FR%'. The key is to have as many leading characters as possible that are not wildcards.
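The reason a leading wildcard defeats the index can be seen with a sorted list, which is a rough stand-in for an ordered index. This is an illustrative Python sketch with made-up names and data, not a model of SQL Server's actual B-tree: a prefix maps to one contiguous range that binary search can locate, whereas a fragment in the middle can occur anywhere, so every entry has to be inspected.

```python
import bisect

# A sorted list plays the role of the index on last_name.
last_names = sorted(["FRANK", "FRASER", "FREEMAN", "GRAF", "SAFRAN", "SMITH"])


def prefix_search(sorted_names: list[str], prefix: str) -> list[str]:
    """LIKE 'FR%': two binary searches bound one contiguous slice."""
    low = bisect.bisect_left(sorted_names, prefix)
    high = bisect.bisect_left(sorted_names, prefix + "\uffff")
    return sorted_names[low:high]


def contains_search(all_names: list[str], fragment: str) -> list[str]:
    """LIKE '%FR%': no ordering helps, so every name is checked."""
    return [name for name in all_names if fragment in name]
```

The prefix search touches only the matching range, while the contains search costs time proportional to the whole table regardless of how few rows match.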
Here is a link to a site with some general tips. https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=173

Encrypted Fields & Full Text Search, Best Approach?

I've got some fields that store notes and sensitive information that I'd like to encrypt before it makes its way into the database.
Right now, I use a SQL Full-Text Search to search these fields. Obviously encrypting this data is going to throw off my search results.
What's the best way to encrypt these fields, but still allow searching?
It's not going to be easy. What you're describing is rarely implemented in commercial databases, although there are some theoretical results in the field. I'd suggest that you go to Google Scholar and start looking for papers on the subject.
Here are a few references to get you started:
Dawn Xiaodong Song, David Wagner, and Adrian Perrig. Practical techniques for searches on encrypted data.
R. Brinkman, L. Feng, J. Doumen, P.H. Hartel, and W. Jonke. Efficient Tree Search in Encrypted Data. In Security In Information Systems, pages 126-135, 2004.
D Boneh, G Di Crescenzo, R Ostrovsky, G Persiano. Public Key Encryption with keyword Search
P Golle, J Staddon, B Waters. Secure Conjunctive Keyword Search over Encrypted Data.
There is NO database supporting an encrypted index, so you have to sacrifice some security to achieve this.
You can index partial data in the clear and find the real data from your application. For example, suppose you want to store credit card numbers. You can have an index on the last 4 digits. The number of cards sharing the same last 4 digits is limited, so you can afford to decrypt each one and check the whole number.
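A sketch of the last-4-digits approach follows. All names and card numbers are made up, and the `encrypt`/`decrypt` pair is a reversible placeholder (a fixed XOR mask) standing in for real encryption; it is not secure and is there only so the example runs.

```python
import hashlib
from collections import defaultdict

KEY = b"demo-key"  # hypothetical key, for illustration only


def _xor(data: bytes) -> bytes:
    """Reversible placeholder standing in for real encryption (NOT secure)."""
    stream = hashlib.sha256(KEY).digest()
    return bytes(b ^ stream[i % len(stream)] for i, b in enumerate(data))


def encrypt(card_number: str) -> bytes:
    return _xor(card_number.encode("ascii"))


def decrypt(blob: bytes) -> str:
    return _xor(blob).decode("ascii")


cards = ["4111111111111111", "5500000000001111", "340000000000009"]
table = [encrypt(card) for card in cards]  # encrypted column

# Clear-text partial index: last 4 digits -> row ids.
last4_index = defaultdict(list)
for row_id, card in enumerate(cards):
    last4_index[card[-4:]].append(row_id)


def find_card(card_number: str) -> list[int]:
    """Narrow by the clear last 4 digits, then decrypt only those candidates."""
    candidates = last4_index.get(card_number[-4:], [])
    return [row for row in candidates if decrypt(table[row]) == card_number]
```

The cost is that the last four digits of every card are stored unencrypted, which is the security trade-off the answer refers to.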
Another option is to store the Soundex code of the plaintext alongside the encrypted data. You can then search on the Soundex value and get close without decrypting the data.
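For reference, a compact Python version of the classic American Soundex rules is sketched below. It is an illustration of the idea only; SQL Server's own SOUNDEX function may differ in edge cases, so codes stored in the database should be produced by the same implementation that is used at search time.

```python
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code for a name (empty if no letters)."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset this, so repeats across a vowel count
    return (result + "000")[:4]
```

Names that sound alike, such as Smith and Smyth, map to the same code, so an index on the stored code finds near matches. Note that the code reveals the first letter and rough shape of the name, which is part of the security being given up.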
Oracle's 10g Release 2 (or later) may support this functionality. From their website here:
http://www.oracle.com/technology/oramag/oracle/05-sep/o55security.html
"A new feature in Oracle Database 10g Release 2 lets you do just that: You can declare a column as encrypted without writing a single line of code. When users insert the data, the database transparently encrypts it and stores it in the column. Similarly, when users select the column, the database automatically decrypts it. Since all this is done transparently without any change to the application code, the feature has an appropriate name: Transparent Data Encryption (TDE)."
The idea is that no one can see the clear text in the database, but a select statement would work as normal. This might help with your searching if Oracle is an option?
Update: there is another option here:
http://www.critotech.com/index.htm
for MySQL databases, but it seems quite expensive.
I know this is an old answer, but both SQL Server and Oracle now have (expensive) offerings for Transparent Data Encryption, which basically allows your app to search with no changes, but the actual data at rest is encrypted. More info here:
SQL Server:
https://msdn.microsoft.com/en-us/library/bb934049%28v=sql.120%29.aspx
Oracle:
http://www.oracle.com/technetwork/database/options/advanced-security/index-099011.html

Resources