Encrypted Fields & Full Text Search, Best Approach? - sql-server

I've got some fields that store notes and sensitive information that I'd like to encrypt before it makes its way into the database.
Right now, I use a SQL Full-Text Search to search these fields. Obviously encrypting this data is going to throw off my search results.
What's the best way to encrypt these fields, but still allow searching?

It's not going to be easy. What you're describing is rarely implemented in commercial databases, although there are some theoretical results in the field. I'd suggest that you go to Google Scholar and start looking for papers on the subject.
Here are a few references to get you started:
Dawn Xiaodong Song, David Wagner, and Adrian Perrig. Practical Techniques for Searches on Encrypted Data.
R. Brinkman, L. Feng, J. Doumen, P.H. Hartel, and W. Jonker. Efficient Tree Search in Encrypted Data. In Security In Information Systems, pages 126-135, 2004.
D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano. Public Key Encryption with Keyword Search.
P. Golle, J. Staddon, and B. Waters. Secure Conjunctive Keyword Search over Encrypted Data.

No database supports an encrypted index, so you have to sacrifice some security to achieve this.
You can index part of the data in the clear and find the real data from your application. For example, if you want to store credit-card numbers, you can keep an index of the last 4 digits. The number of cards sharing the same last 4 digits is limited, so you can afford to decrypt each candidate and check the whole number.
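A minimal sketch of that last-4-digits idea, in Python. The `CardStore` class and the `toy_encrypt`/`toy_decrypt` helpers are made up for illustration; in particular, the "cipher" here is just base64 so the example runs with the standard library, and it would have to be replaced by real encryption.

```python
import base64
from collections import defaultdict


# Stand-ins for a real cipher. base64 is NOT encryption; it only keeps
# this sketch runnable without third-party packages.
def toy_encrypt(plaintext: str) -> bytes:
    return base64.b64encode(plaintext.encode("utf-8"))


def toy_decrypt(ciphertext: bytes) -> str:
    return base64.b64decode(ciphertext).decode("utf-8")


class CardStore:
    """Hypothetical store: ciphertext rows plus a clear-text last-4 index."""

    def __init__(self):
        self.rows = {}                        # row id -> ciphertext
        self.last4_index = defaultdict(list)  # last 4 digits -> row ids
        self._next_id = 1

    def add(self, card_number: str) -> int:
        row_id = self._next_id
        self._next_id += 1
        self.rows[row_id] = toy_encrypt(card_number)
        self.last4_index[card_number[-4:]].append(row_id)
        return row_id

    def find(self, card_number: str) -> list:
        # Narrow down with the clear-text index, then decrypt only the
        # few candidates and compare the full number.
        candidates = self.last4_index.get(card_number[-4:], [])
        return [row_id for row_id in candidates
                if toy_decrypt(self.rows[row_id]) == card_number]


store = CardStore()
first = store.add("4111111111111111")
second = store.add("5500000000001111")  # same last 4 digits, different card
print(store.find("4111111111111111") == [first])
```

The trade-off is the one described above: anyone who can read the index learns the last four digits of every card.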

Another option is to store the Soundex code of the plaintext next to the encrypted value. You can then search on the Soundex code and get close matches without decrypting the data, at the cost of revealing roughly how each value sounds.
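For reference, here is a small Python sketch of the standard American Soundex rules, so you can see what would end up stored in the clear. It is written from the textbook description, not taken from any database; a built-in such as SQL Server's SOUNDEX may differ in edge cases. The code has to be computed from the plaintext before encryption.

```python
def soundex(name: str) -> str:
    """Return the 4-character American Soundex code for a name."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters_only = [ch for ch in name.upper() if ch.isalpha()]
    if not letters_only:
        return ""

    first = letters_only[0]
    result = first
    prev = codes.get(first, "")
    for ch in letters_only[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]  # pad or truncate to letter + 3 digits


print(soundex("Robert"), soundex("Rupert"))
```

Names that sound alike map to the same code, which is what makes the search "close" and also what leaks information about the plaintext.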

Oracle's 10g Release 2 (or later) may support this functionality. From their website here:
http://www.oracle.com/technology/oramag/oracle/05-sep/o55security.html
"A new feature in Oracle Database 10g Release 2 lets you do just that: You can declare a column as encrypted without writing a single line of code. When users insert the data, the database transparently encrypts it and stores it in the column. Similarly, when users select the column, the database automatically decrypts it. Since all this is done transparently without any change to the application code, the feature has an appropriate name: Transparent Data Encryption (TDE)."
The idea is that no one can see the clear text in the database, but a select statement would work as normal. This might help with your searching if Oracle is an option?
Update: there is another option here:
http://www.critotech.com/index.htm
for MySQL databases, but it seems quite expensive.

I know this is an old answer, but both SQL Server and Oracle now have (expensive) offerings for Transparent Data Encryption, which basically allows your app to search with no changes, but the actual data at rest is encrypted. More info here:
SQL Server:
https://msdn.microsoft.com/en-us/library/bb934049%28v=sql.120%29.aspx
Oracle:
http://www.oracle.com/technetwork/database/options/advanced-security/index-099011.html

Related

Encrypted Data only accessible for user as data owner and algorithm

Which method would you choose to make encrypted data accessible only to the user and to an algorithm that processes and evaluates the data? In this case the user would be one of n service users, who would add sensitive data (mostly answers to questions) about himself to the database. The company providing the database shouldn't have any access to the sensitive data, only to the results of the data processing. The results wouldn't allow any conclusions about the sensitive data.
What you are looking for is Fully Homomorphic Encryption (FHE). FHE operates on encrypted data. This can be achieved by an encryption scheme that supports two operations on encrypted data. RSA and others only supported one operation until Gentry's work.
With FHE implementations like HElib (there are many now), you can upload your data to the server and give it a function (circuit) to evaluate. FHE schemes in general have semantic security (randomized encryption). The semi-honest server can only see encrypted data and can return the result back to you.
Note: They are not practical, yet.
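To illustrate the "only one operation" point about RSA, here is a toy Python sketch showing that textbook RSA is homomorphic under multiplication. The primes and messages are arbitrary tiny values chosen for the example; unpadded RSA with numbers this small is completely insecure and this is not FHE, just the one-operation case.

```python
# Textbook RSA with tiny primes: insecure, for illustration only.
p, q = 61, 53
n = p * q                    # public modulus
phi = (p - 1) * (q - 1)
e = 17                       # public exponent
d = pow(e, -1, phi)          # private exponent (needs Python 3.8+)


def encrypt(m: int) -> int:
    return pow(m, e, n)


def decrypt(c: int) -> int:
    return pow(c, d, n)


m1, m2 = 7, 3
# The "server" multiplies ciphertexts without ever seeing m1 or m2.
c_product = (encrypt(m1) * encrypt(m2)) % n
# Decrypting the product of ciphertexts yields the product of plaintexts.
assert decrypt(c_product) == (m1 * m2) % n
print(decrypt(c_product))
```

A fully homomorphic scheme supports both addition and multiplication on ciphertexts, which is what allows arbitrary circuits to be evaluated.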
I think the best way to do that is to save only the result. But if you want to save the user's answers, you could encrypt them with AES using a key derived from the user's password. The user will then have to enter his password every time to decrypt the data.
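A sketch of the key-derivation half of that suggestion, using only the Python standard library. The function name, the salt size and the iteration count are assumptions for the example and should be tuned; the AES step itself is left out because the standard library has no AES, so you would pass the derived key to whatever crypto library you use.

```python
import hashlib
import os


def derive_key(password: str, salt: bytes, iterations: int = 600_000) -> bytes:
    """Derive a 32-byte key (suitable for AES-256) from a password.

    The iteration count is an assumed value, not a recommendation
    taken from the answer above.
    """
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )


# A random salt is generated once per user and stored next to the
# ciphertext. It is not secret; the password is never stored.
salt = os.urandom(16)
key = derive_key("correct horse battery staple", salt)
print(len(key))
```

Because the key exists only while the user is logged in, the company running the database cannot decrypt the answers on its own, which is the property the question asks for.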

Indexing an encrypted column in sql server

I have patient health information stored in a SQL Server 2012 database. The patients' names are encrypted, so searching on a name is very slow. How can I add an index on an encrypted column?
I am using Symmetric Key encryption (256-bit AES) on varbinary fields.
There are separate encrypted fields for Patient's first name, last name, address, phone number, DOB, SSN. All of these are searchable (partial also) except SSN.
To build on the answer that @PhillipH provided: if you are performing an exact search on (say) last name, you can include a computed column defined as CHECKSUM(encrypt(last_name)) (where encrypt is your encryption operation). This is secure in that it does not divulge any information -- a checksum on the encrypted value does not reveal anything about the plaintext.
Create an index on this computed column. To search on the name, instead of just doing WHERE encrypted_last_name = encrypt(last_name), add a search on the hash: WHERE encrypted_last_name = encrypt(last_name) AND CHECKSUM(encrypt(last_name)) = hashed_encrypted_last_name. This is much faster because SQL Server only has to search an index for a small integer value, then verify that the name in fact matches, reducing the amount of data to check considerably. Note that no data is decrypted in this scheme, with or without the CHECKSUM -- we search for the encrypted value only. The speedup does not come from reducing the amount of data that is encrypted/decrypted (only the data you pass in is encrypted) but the amount of data that needs to be indexed and compared for equality.
The only drawback is that this does not allow partial searches, or even case variation, and indeed, doing that securely is not trivial. Case is relatively simple (hash encrypt(UPPER(name)), making sure you use a different key to avoid correlation), but partial matches require specialized indexes. The simplest approach I can think of is to use a separate service like Lucene to do the indexing, but make it use secure storage for its files (e.g. Encrypting File System (EFS) in Windows). Of course, that does mean a separate system that needs to be certified -- but I can't think of any convenient solution that remains entirely in SQL Server and does not require additional code.
If you can still change the database design/storage, you may wish to consider Transparent Data Encryption (TDE) which has the huge advantage that it's, well, transparent and integrated in SQL Server at the engine level. Not only should partial matching be much faster since individual rows don't need decrypting (just whole pages), if it's not fast enough you can create a full-text index which will also be encrypted. I don't know if TDE works with your security requirements, though.
As a programmatic solution, if you don't need a partial match, you could store a hash in the clear in another field, apply the same hashing algorithm on the client/app server, and match on the hash. This could produce a false positive match, but it removes the need to decrypt the data.
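A rough Python sketch of that hash-column idea. Two things here are my own substitutions and not part of the answer: it uses a keyed hash (HMAC-SHA256) instead of a plain hash, so that someone with only the database cannot brute-force common names, and it uses an in-memory SQLite table instead of SQL Server so it runs standalone. The key, table and function names are hypothetical, and the ciphertext is passed in as opaque bytes.

```python
import hashlib
import hmac
import sqlite3

# Hypothetical key, held by the app server and never stored in the database.
INDEX_KEY = b"hypothetical-index-key"


def blind_index(value: str) -> str:
    """Keyed hash of the normalized plaintext, safe to store and index."""
    normalized = value.strip().upper()  # gives case-insensitive exact match
    return hmac.new(INDEX_KEY, normalized.encode("utf-8"),
                    hashlib.sha256).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient ("
             "id INTEGER PRIMARY KEY, last_name_enc BLOB, last_name_idx TEXT)")
conn.execute("CREATE INDEX ix_patient_last_name ON patient (last_name_idx)")


def add_patient(last_name: str, ciphertext: bytes) -> int:
    cur = conn.execute(
        "INSERT INTO patient (last_name_enc, last_name_idx) VALUES (?, ?)",
        (ciphertext, blind_index(last_name)))
    return cur.lastrowid


def find_by_last_name(last_name: str) -> list:
    # Only the indexed hash column is compared; nothing is decrypted.
    rows = conn.execute(
        "SELECT id FROM patient WHERE last_name_idx = ? ORDER BY id",
        (blind_index(last_name),))
    return [row[0] for row in rows]


smith = add_patient("Smith", b"<ciphertext from your encryption layer>")
add_patient("Jones", b"<ciphertext from your encryption layer>")
print(find_by_last_name("smith") == [smith])
```

As noted above, this only supports exact matches. Equal names still produce equal index values, so the column does reveal which rows share a name.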
If you are using SQL Server's ENCRYPTBYKEY function, there is no benefit to indexing that column: ENCRYPTBYKEY produces different output every time for the same input, because SQL Server itself uses a random IV.

Algorithms for key value pair, where key is string

I have a problem where there is a huge list of strings or phrases; it might scale from 100,000 to 100 million. When I search for a phrase and it is found, it gives me the ID or index into the database for further operations. I know a hash table can be used for this, but I am looking for another algorithm that could generate an index based on strings and would also be useful for other features, like autocomplete.
I read about suffix trees/arrays in some SO threads. They serve the purpose but consume a lot more memory than I can afford. Any alternatives to this?
My search is only over a huge list of millions of strings. There are no docs or web pages, and I'm not interested in a search engine like Lucene.
I also read about inverted indexes, which sound helpful, but which algorithm do I need to study for them?
If this Database index is within MS SQL Server you may get good results with SQL Full Text Indexing. Other SQL providers may have a similar function but I would not be able to help with those.
Check out: http://www.simple-talk.com/sql/learn-sql-server/understanding-full-text-indexing-in-sql-server/
and
http://msdn.microsoft.com/en-us/library/ms142571.aspx
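On the inverted index mentioned in the question: the basic structure is just a map from each word to the set of phrase IDs that contain it. Below is a small in-memory Python sketch, with made-up function names, plus a prefix lookup over the sorted vocabulary as one possible way to get autocomplete. It says nothing about how SQL Server's full-text index is implemented, and at 100 million phrases you would need an on-disk or compressed variant.

```python
import bisect
from collections import defaultdict


def build_inverted_index(phrases):
    """Map each lower-cased word to the set of phrase ids containing it."""
    index = defaultdict(set)
    for phrase_id, phrase in enumerate(phrases):
        for word in phrase.lower().split():
            index[word].add(phrase_id)
    return index


def search(index, query):
    """Return sorted ids of phrases containing every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    result = None
    for word in words:
        ids = index.get(word, set())
        result = ids if result is None else result & ids
    return sorted(result)


def complete(sorted_terms, prefix):
    """Autocomplete: all indexed words starting with the prefix."""
    matches = []
    i = bisect.bisect_left(sorted_terms, prefix)  # binary search to first hit
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches


phrases = ["red apple pie", "green apple", "red wine"]
index = build_inverted_index(phrases)
terms = sorted(index)
print(search(index, "red apple"), complete(terms, "gr"))
```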

Wildcard search in cassandra database

I want to know if there is any way to perform wildcard searches in cassandra database.
e.g.
select KEY,username,password from User where username='*hello*';
Or
select KEY,username,password from User where username='%hello%';
something like this.
There is no native way to perform such queries in Cassandra. Typical options to achieve the same are
a) Maintain an index yourself on likely search terms. For example, whenever you are inserting an entry which has hello in the username, insert an entry in the index column family with hello as the key and the column value as the key of your data entry. While querying, query the index CF and then fetch data from your data CF. Of course, this is pretty restrictive in nature but can be useful for some basic needs.
b) A better bet is to use a full-text search engine. Take a look at Solandra, https://github.com/tjake/Solandra or DataStax Enterprise http://www.datastax.com/products/enterprise
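Option (a) can be sketched as follows. This is plain Python with dictionaries standing in for the two column families, not Cassandra client code; the term list and function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical list of search terms you expect users to look for.
LIKELY_TERMS = ["hello", "admin"]

users = {}                               # data CF: row key -> columns
username_term_index = defaultdict(set)   # index CF: term -> data row keys


def insert_user(key, username, password_hash):
    users[key] = {"username": username, "password": password_hash}
    lowered = username.lower()
    # Write to the index CF at insert time for every term that occurs.
    for term in LIKELY_TERMS:
        if term in lowered:
            username_term_index[term].add(key)


def find_users_containing(term):
    # First read the index CF, then fetch the rows from the data CF.
    keys = sorted(username_term_index.get(term.lower(), ()))
    return [(key, users[key]["username"]) for key in keys]


insert_user("u1", "say_hello_42", "hash1")
insert_user("u2", "HelloWorld", "hash2")
insert_user("u3", "bob", "hash3")
print(find_users_containing("hello"))
```

The restriction mentioned above shows up immediately: a search for "bob" returns nothing, because "bob" was never one of the indexed terms.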
This project also looks promising
http://tuplejump.github.io/stargate/
I have not looked deeply at it recently, but when I last evaluated it, it looked promising.

How to implement an Enterprise Search

We are searching disparate data sources in our company. We have information in multiple databases that need to be searched from our Intranet. Initial experiments with Full Text Search (FTS) proved disappointing. We've implemented a custom search engine that works very well for our purposes. However, we want to make sure we are doing "the right thing" and aren't missing any great tools that would make our job easier.
What we need:
Column search
ability to search by column
we flag which columns in a table are searchable
Keep some relation between db column and data
we provide advanced filtering on the results
facilitates (amazon style) filtering
filter provided by grouping of results and allowing user to filter them via a checkbox
this is a great feature, users like it very much
Partial Word Match
we have a lot of unique identifiers (product id, etc).
the unique id's can have sub parts with meaning (location, etc)
or only a portion may be available (when the user is searching)
or (by a decidedly poor design decision) there may be white space in the id
this is a major feature that we've implemented now via CHARINDEX (MSSQL) and INSTR (ORACLE)
using the char index functions turned out to be equivalent performance(+/-) on MSSQL compared to full text
didn't test on Oracle
however searches against both types of db are very fast
We take advantage of Indexed (MSSQL) and Materialized (Oracle) views to increase speed
this is a huge win, Oracle Materialized views are better than MSSQL Indexed views
both provide speedups in read-only join situations (like a search combining company and product)
A search that matches user expectations of the paradigm CTRL-f -> enter text -> find matches
Full Text Search is not the best in this area (slow and inconsistent matching)
partial matching (see "Partial Word Match")
Nice to have:
Search database in real time
skip the indexing step, this is not a hard requirement
Spelling suggestion
Xapian has this http://xapian.org/docs/spelling.html
Similar to google's "Did you mean:"
What we don't need:
We don't need to index documents
at this point searching our data sources is the most important thing
even when we do search documents, we will be looking for partial word matching, etc
Ranking
Our own simple ranking algorithm has proven much better than an FTS equivalent.
Users understand it, we understand it, it's almost always relevant.
Stemming
Just don't need to get [run|ran|running]
Advanced search operators
phrase matching, or/and, etc
according to Jakob Nielsen http://www.useit.com/alertbox/20010513.html
most users are using simple search phrases
very few use advanced searches (when it's available)
also in Information Architecture 3rd edition Page 185
"few users take advantage of them [advanced search functions]"
http://oreilly.com/catalog/9780596000356
our Amazon like filtering allows better filtering anyway (via user testing)
Full Text Search
We've found that results don't always "make sense" to the user
Searching with FTS is hard to tune (which set of operators match the users expectations)
Advanced search operators are a no go
we don't need them because
users don't understand them
Performance has been very close (+/-) to the char index functions
but the results are sometimes just "weird"
The question:
Is there a solution that allows us to keep the key value pair "filtering feature", offers the column specific matching, partial word matching and the rest of the features, without the pain of full text search?
I'm open to any suggestion. I've wondered if a document/hash table nosql data store (MongoDB, et al) might be of use? ( http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo ). Any experience with these is appreciated.
Again, just making sure we aren't missing something with our in-house customized version. If there is something "off the shelf" I would be interested in it. Or if you've built something from some components, what components (search engines, data stores, etc) did you use and why?
You can also make your point for FTS. Just make sure it meets the requirements above before you say "just use Full Text Search because that's the only tool we have."
I ended up coding my own.
The results are fantastic. Users like it, it works well with our existing technologies.
It really wasn't that hard. Just took some time.
Features:
Faceted search (amazon, walmart, etc)
Partial word search (the real stuff not full text)
Search databases (oracle, sql server, etc) and non database sources
Integrates well with our existing environment
Maintains relations, so I can have a n to n search and display
--> this means I can display child records of a master record in search results
--> also I can search any child field and return the master record
It's really amazing what you can do with dictionaries and a lot of memory.
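To give an idea of how little code the core of this takes, here is a toy Python sketch of partial-word matching plus facet counts over in-memory dictionaries. It is not the code described above; the records, column names and the whitespace-stripping rule are all invented for illustration, and a real version would build lookup structures instead of scanning every record.

```python
from collections import Counter

# Hypothetical records loaded from the databases into memory.
RECORDS = [
    {"id": 1, "product_id": "NY-4471 A", "category": "valve", "site": "New York"},
    {"id": 2, "product_id": "NY-4480 B", "category": "pump", "site": "New York"},
    {"id": 3, "product_id": "TX-4471 C", "category": "valve", "site": "Texas"},
]
SEARCHABLE = ("product_id",)   # columns flagged as searchable
FACETS = ("category", "site")  # columns offered as checkbox filters


def normalize(text):
    # Lower-case and drop whitespace, so ids containing spaces still match.
    return "".join(text.lower().split())


def search(query, filters=None):
    """Substring match on searchable columns, then count facet values."""
    needle = normalize(query)
    filters = filters or {}
    hits = [
        record for record in RECORDS
        if any(needle in normalize(str(record[col])) for col in SEARCHABLE)
        and all(record[col] == value for col, value in filters.items())
    ]
    facets = {col: Counter(record[col] for record in hits) for col in FACETS}
    return hits, facets


hits, facets = search("4471")
print([record["id"] for record in hits], dict(facets["category"]))
```

Passing `filters={"site": "New York"}` narrows the hits the same way ticking a facet checkbox would.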
I recommend looking into Solr; I believe it will meet your needs:
http://lucene.apache.org/solr/
For an off-the-shelf solution: Have you checked out the Google Search Appliance?
Quote from the Google Mini/GSA site:
... If direct database indexing is a requirement for you, we encourage you to consider the Google Search Appliance, which has direct database connectivity.
And of course it indexes everything else in the Googly manner you'd expect it to.
Apache Solr is a good way to start your project, and it is open source. You can also try Elasticsearch, and there are a lot of off-the-shelf products that offer good customization and search features, such as Coveo, SharePoint FAST, Google ...
