Database anonymization: using additive noise

I want to do an experiment involving the use of additive noise for protecting a database from inference attacks.
My experiment begins by generating a list of values with a mean of 25. Then I will anonymize these values by adding a random noise value, which is designed to have an expected value of 0.
For example:
I can use uniformly distributed noise in the range [-1, 1], or normally distributed (Gaussian) noise with mean 0.
I will test this anonymization method on databases of 100, 1,000, and 10,000 values with different noise distributions.
I am not sure which platform to use or how to use it, so I started with 10 values in Excel. For the uniformly distributed noise I use RAND() and add it to the actual value; for the normal noise I use NORM.INV with mean 0 and add that to the actual value.
But I don't know how to interpret the data from the attacker's side. When I add noise to the dataset, how can I interpret its effect on privacy as the dataset becomes larger?
Also, should I use a database tool to handle this problem?
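In other words, the experiment described above boils down to something like the following; a minimal sketch in Python/NumPy rather than Excel (the spread of 5 for the original values, the seed, and the noise widths are arbitrary choices, not part of the question). Note that Excel's RAND() is uniform on [0, 1), so 2*RAND()-1 gives the [-1, 1] range mentioned above.

import numpy as np

rng = np.random.default_rng(seed=42)

n = 1000                                             # also try 100 and 10000
true_values = rng.normal(loc=25, scale=5, size=n)    # original data with mean 25 (the spread of 5 is arbitrary)

uniform_noise = rng.uniform(-1, 1, size=n)           # expected value 0, like 2*RAND()-1 in Excel
gaussian_noise = rng.normal(loc=0, scale=1, size=n)  # expected value 0, like NORM.INV(RAND(), 0, 1)

print(true_values.mean(),
      (true_values + uniform_noise).mean(),
      (true_values + gaussian_noise).mean())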

From what I understand, you're trying to protect your "experimental" database from inference attacks.
Attackers try to extract information from a database using queries that are already allowed for public use. First, try to decide on your identifiers, quasi-identifiers, and sensitive values.
Consider a student management system that holds the GPA of each student. We know that the GPA is sensitive information. The identifier is "student_id", and the quasi-identifiers are "standing" and, let's say, "gender". In most cases, the administrator of the RDBMS allows aggregate queries such as "Get the average GPA of all students" or "Get the average GPA of senior students", etc. Attackers try to infer from these aggregate queries. If, somehow, there is only one student who is a senior, then the query "Get the average GPA of senior students" would return the GPA of that one specific person.
There are two main ways to protect the database from this kind of attack: de-identification and anonymization. De-identification means removing any identifier and quasi-identifier from the database. But this does not work in some cases. Consider one student who takes a make-up exam after grades are announced. If you get the average GPA of all students before and after he takes the exam and compare the results of the queries, you'd see a small change (say, from 2.891 to 2.893). The attacker can infer the make-up exam score of that one particular student from this 0.002 difference in the aggregate GPA.
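The arithmetic behind that differencing attack is straightforward; as a sketch (assuming the attacker knows the number of students n, and using the numbers from the example above):

n = 1000                      # number of students (assumed known to the attacker)
avg_before = 2.891            # average GPA before the make-up exam
avg_after = 2.893             # average GPA after it

# Only one student's GPA changed, so the difference in totals is his alone.
gpa_change = n * (avg_after - avg_before)
print(round(gpa_change, 3))   # 2.0 grade points attributed to that single student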
The other approach is anonymization. With k-anonymity, you divide the database into groups that have at least k entities. For example, 2-anonymity ensures that there are no groups with a single entity in them, so aggregate queries on single-entity groups no longer leak private information.
Unless you are one of the two entities in a group.
If there are 2 senior students in a class and you want to know the average grade of seniors, 2-anonymity allows you to have that information. But if you are a senior, and you already know your own grade, you can infer the other student's grade.
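In that two-senior case the inference is a single line of arithmetic (the published average and your own grade are all you need; the values here are made up):

avg_senior_gpa = 3.2          # result of the allowed aggregate query (hypothetical)
my_gpa = 3.5                  # the attacker already knows their own grade
other_gpa = 2 * avg_senior_gpa - my_gpa
print(round(other_gpa, 2))    # 2.9 -- the other senior's GPA is fully disclosed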
Adding noise to sensitive values is a way to deal with those attacks, but if the noise is too low, it has almost no effect on the information leaked (e.g. for grades, knowing someone has 57 out of 100 instead of 58 makes almost no difference). If it's too high, it results in a loss of functionality.
You've also asked how you can interpret the effect on privacy as the dataset becomes larger. Because the noise has zero mean, it tends to cancel out in averages: take the average of an extremely large dataset and the result converges to the true aggregate of everyone's sensitive values (this could be a little complex, but think of the dataset as infinite and the set of values the sensitive information can take as finite, then calculate the probabilities). Adding noise with zero mean works, but the range of the noise should get wider as the dataset gets larger.
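One way to quantify this averaging effect in the experiment from the question is to watch how close a noisy average gets to the true average as n grows; a sketch under the same assumptions as before (NumPy, arbitrary spread of 5 for the original data):

import numpy as np

rng = np.random.default_rng(seed=0)

for n in (100, 1000, 10000):
    true_values = rng.normal(loc=25, scale=5, size=n)
    noisy_values = true_values + rng.uniform(-1, 1, size=n)
    # The error of the noisy average shrinks roughly like sigma_noise / sqrt(n),
    # so a fixed noise range protects aggregates less and less as the dataset grows.
    print(n, abs(noisy_values.mean() - true_values.mean()))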
Lastly, if you are using Excel, which is a spreadsheet rather than an RDBMS, I suggest you come up with a way to use equivalents of SQL queries, set the identifiers and quasi-identifiers of the dataset, and define the public queries that can be performed by anyone.
Also, in addition to k-anonymity, take a look at l-diversity and t-closeness and their use in database anonymization.
I hope that answers your question.
If you have further questions, please ask.

Related

Simple database setup for MALDI peaks

I have a very simple problem that I could come up with a crude solution to, but it seems to me that there is probably some off-the-shelf answer.
Problem: I have a list of discrete values (these are mass units) that I want to find within a database of discrete values (known mass units) and their identities, allowing for some inexact matching. Example: if I am looking for 500.23 in the database, then anything +/- 0.025 would be considered a match (50 ppm or 0.005%). This tolerance should be adjustable. So in this example, 500.23 may return the database entry 500.25, which is Compound A.
I could also make this tool myself if someone would like to suggest the most straightforward approach. I am competent in Matlab, somewhat in R, good in Excel, poor in Access, and don't know anything about SQL. The best case would be for this tool to be usable by non-coders.
Background: The real background of this problem is that I have MALDI TOF data where I have identified peaks of interest from an experiment (masses; m/z). These masses correspond to molecules that were released after enzymatic digestion. This class of molecule has reported masses with known identities, but unlike peptide mass fingerprinting, or metabolomic databases, these known masses are mostly unpublished and/or uncollated, so I would like to cross-reference them with a database of my own making. Each mass corresponds to one identity. The masses will not match exactly, and being able to search with a specified mass tolerance is key.
There are plenty of mass spectrometer data solutions you may want to look at. For example: http://www.ionsource.com/links/programs.htm
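If you do end up building it yourself, the tolerance lookup is only a few lines. Here is a sketch in Python (purely an illustration of the approach, since you mentioned Matlab/R/Excel: keep the reference masses sorted and binary-search a +/- tolerance window; the compounds listed are made up):

import bisect

# Hypothetical reference database: (mass, identity), kept sorted by mass.
reference = sorted([(500.25, "Compound A"), (612.30, "Compound B"), (733.41, "Compound C")])
masses = [m for m, _ in reference]

def lookup(query_mass, tolerance=0.025):
    lo = bisect.bisect_left(masses, query_mass - tolerance)
    hi = bisect.bisect_right(masses, query_mass + tolerance)
    return reference[lo:hi]

print(lookup(500.23))   # [(500.25, 'Compound A')]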

Efficiently search large DB for similar records

Suppose the following: as input, one gets a record consisting of N numbers and booleans. This vector has to be compared to a database of vectors, which include M additional "result" elements. That means the database holds P vectors, each of size N+M.
Each vector in the database holds a boolean as its last element. The aim of the exercise is to find, as fast as possible, the record(s) which are the closest match to the input vector AND whose result vector ends with a TRUE boolean.
To make the above a bit more comprehensible, consider the following example:
A database with personal health information, consisting of records holding:
age
gender
weight
length
heart issues (boolean)
lung issues (boolean)
residence
alternative plan chosen (if any)
accepted offer
The program would then get an input like
36 Male 185pound 68in FALSE FALSE NYC
It would then find out which plan would be the best to offer the client, based on what's in the database.
I know of a few methods which would help to do this, e.g. the Levenshtein distance method. However, most methods would involve searching the entire database for the best matches.
Are there any algorithms or methods which would cut back on the processing power/time required? I can't imagine that, e.g., insurance agencies don't use more efficient methods to search their databases...
Any insights into this area would be greatly appreciated!
Assumption: this is a relational database. If instead it were NoSQL, then please provide more info on which DB.
Do you have the option to create bitmap indexes? They can cut down the number of records returned. That is useful for almost all of the columns, since their cardinalities are low.
After that, the only column left is residence, and you should use a geographic distance for that.
If you are unable to create bitmap indexes, then what are your filtering options? If there are none, you have to do a full table scan.
For each of the components, e.g. age, gender, etc., you need to
(a) determine a distance metric, and
(b) determine how to compute that metric between different records.
I'm not sure Levenshtein distance would work here - you need to take each field separately and find its contribution to the overall distance measure.
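As a sketch of what (a) and (b) could look like, with each field contributing separately to a weighted sum (the weights and normalization ranges below are made-up assumptions, and a real implementation would use a proper geographic distance for residence):

def field_distance(a, b, kind):
    # Per-field distance, each normalized roughly to [0, 1].
    if kind == "numeric_0_120":       # e.g. age in years
        return abs(a - b) / 120.0
    if kind == "numeric_0_400":       # e.g. weight in pounds
        return abs(a - b) / 400.0
    if kind in ("boolean", "category"):
        return 0.0 if a == b else 1.0
    raise ValueError(kind)

WEIGHTS = {"age": 1.0, "gender": 0.5, "weight": 1.0, "heart": 2.0, "lung": 2.0, "residence": 0.5}

def record_distance(x, y):
    d = 0.0
    d += WEIGHTS["age"]       * field_distance(x["age"], y["age"], "numeric_0_120")
    d += WEIGHTS["gender"]    * field_distance(x["gender"], y["gender"], "category")
    d += WEIGHTS["weight"]    * field_distance(x["weight"], y["weight"], "numeric_0_400")
    d += WEIGHTS["heart"]     * field_distance(x["heart"], y["heart"], "boolean")
    d += WEIGHTS["lung"]      * field_distance(x["lung"], y["lung"], "boolean")
    d += WEIGHTS["residence"] * field_distance(x["residence"], y["residence"], "category")
    return d

query     = {"age": 36, "gender": "M", "weight": 185, "heart": False, "lung": False, "residence": "NYC"}
candidate = {"age": 40, "gender": "M", "weight": 190, "heart": False, "lung": True,  "residence": "NYC"}
print(record_distance(query, candidate))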

Database Normalization and User Defined Data Storage

I am looking to let the users of my web application define their own attributes for products and then enter data for those products. I have found out that this technique is related to n-th normal form (it is often described as an entity-attribute-value, or EAV, model).
The following is the DB structure I am currently considering deploying, and I was wondering what the positives and negatives would be with regard to integrity and scalability (and any other -ities you can think of).
EDIT
(Sorry, this is more what I mean.)
I have been staring at this for the last 15 minutes, and I know that the part where the red arrow is induces duplication, and hence you would have to have integrity checks. But I just don't understand how else what I want could be done.
The products would number no more than 10. The variables would number no more than 200 (max 20 per product). The number of product instances would not exceed 100,000, so the maximum size of pVariable_data would not exceed 2 million rows.
This model is called a "database in a database" and it is not nice. Sometimes, though, it is impossible to avoid; first check whether you really need it and whether your database is really the right tool for the job.
With PostgreSQL you could use hstore: http://www.postgresql.org/docs/8.4/static/hstore.html which is a standard solution for this kind of issue.
Assuming that pVariable is really a pVariable type, drop the reference to product_fk; otherwise you would need a new entry in that table for every Product record. Maybe try something like this:
Product(id, active, allow_new)
pVariable_type(id, name)
pVariable_data(id, product_fk, pvariable_fk, non_typed_value, bool, int, etc)
I would use non_typed_value as your text value and (unless you are storing streams) write the value into that field along with the typed value. It means keeping each value twice (and more of a pain on updates, etc.), but it will make querying easier, along with reporting (anything where you just need to display the value).
Note: it would also be ideal to pull out anything that is common to all products and put it in the product table. For example, all products will most likely have a name, a suggested price, etc.
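As a rough sketch of how the three tables above hang together and get queried (SQLite used here purely for illustration; the typed columns are reduced to a single int_value standing in for the bool/int/etc. columns):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product        (id INTEGER PRIMARY KEY, active INTEGER, allow_new INTEGER);
CREATE TABLE pVariable_type (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE pVariable_data (id INTEGER PRIMARY KEY,
                             product_fk INTEGER REFERENCES Product(id),
                             pvariable_fk INTEGER REFERENCES pVariable_type(id),
                             non_typed_value TEXT, int_value INTEGER);
""")
conn.execute("INSERT INTO Product VALUES (1, 1, 1)")
conn.execute("INSERT INTO pVariable_type VALUES (1, 'colour'), (2, 'weight_grams')")
conn.execute("INSERT INTO pVariable_data VALUES (1, 1, 1, 'red', NULL), (2, 1, 2, '250', 250)")

# 'Give me all user-defined attributes of product 1' -- one row per attribute.
for row in conn.execute("""
    SELECT t.name, d.non_typed_value
    FROM pVariable_data d JOIN pVariable_type t ON t.id = d.pvariable_fk
    WHERE d.product_fk = 1"""):
    print(row)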

How to make the database structure extensible for a turn based game?

I'm working on a fantasy turn-based game.
I now have to create the database structure for my spells. The problem is that I don't really have a good idea of how to design it. Maybe the effects of those spells should not be stored in a database at all?
For instance, effects could be: increase attack, pull an enemy, heal, teleport, hide, place a mine, and so on... Effects are pretty different, and I would like the database structure to be extensible.
Edit:
It's a turn-based game: time is measured in turns and distance in squares.
Some examples of what I mean below.
Let's say we have Incinerate:
it can target only 1 enemy (not ally)
it can be cast at a distance of 3 squares
it deals 5 damage per turn
it lasts 3 turns
Now we can take Shock Wave:
it travels in a line for 4 squares
it starts from a square near the caster
it damages the first target it hits (ally or enemy)
it deals 5 damage to the target and knocks it back 1 square
And the last one Rain Call:
it can be cast at any distance
it's a cloud the size of a 5x5 square
it can target both ally and enemies
only fire creatures take damage
while casting the caster is immobilized and it loses 5 mana/turn
As you can see there are a lot of possible columns: the distance it travels, turns, casting distance, type (damage, heal, armor, etc), value (+2), target (enemy, ally, both), size, etc.
I would not use a relational database for storing spells. Relational databases are good in cases when most of the following conditions apply:
you have a very large amount of data,
the data can logically be organized as n-ary relations (tables, rows, columns),
you have many users that access the data concurrently,
you need ACID properties,
et cetera
Databases are like trucks. They are big. They are difficult to use. They are expensive (in terms of needed expertise, maintenance time, run-time efficiency, etc., if not monetarily). They are very good at what they are good at, but not at anything else. Don't use a truck when a bicycle would suffice.
Let's come to your problem. The number of different types of spells is surely bounded and known at compile time, so why don't you define an interface ISpell and let each spell type be a class that implements ISpell? (You can also define an abstract class for common code.) Then a SpellFactory can construct and provide access to all the spells when the program starts. Do you really need the spells to be accessible from outside, independently of your code?
If hard-coding a SpellFactory is not flexible enough for your purposes, you can use XML configuration files, e.g. <spell type="blind" description="bla bla" picture="file.jpg"> <effects> <effect .. /> .. </effects> <range>5</range> etc. I don't know much about computer games, but this is what they did in Sid Meier's Civilization, for example. Then, instead of hard-coding the different spells in the SpellFactory, you can let it read them from the configuration file at start-up.
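A sketch of that start-up loading step (Python's standard xml.etree here; the element and attribute names are illustrative assumptions, not a fixed schema):

import xml.etree.ElementTree as ET

SPELLS_XML = """
<spells>
  <spell type="incinerate" target="enemy" range="3" damage="5" duration="3"/>
  <spell type="shockwave" target="any" range="4" damage="5" knockback="1"/>
</spells>
"""

def load_spells(xml_text):
    # Build the SpellFactory's lookup table from the configuration file.
    spells = {}
    for node in ET.fromstring(xml_text):
        attrs = dict(node.attrib)
        spells[attrs.pop("type")] = attrs   # every remaining attribute is an effect/parameter
    return spells

print(load_spells(SPELLS_XML)["incinerate"])   # {'target': 'enemy', 'range': '3', ...}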
As far as I can see, using configuration files instead of a database has the following advantages:
It is a fast, easy, lightweight solution,
It is much more flexible than having all the spells share the same set of columns (most of which will not make sense for any specific spell),
It is much easier to have more than one version of the set of spells at the same time, for experiments, variations, etc.,
You can let end users access and manipulate xml files for customizing the game without letting them access the database that would also contain sensitive data,
et cetera.
The disadvantages:
More people know about relational databases than about the XML format, so you might need a couple of hours to learn how to read and manipulate XML elements.
Your question is pretty broad. It depends on a lot of things: are you going to load the spells at runtime, or will you load them at the beginning of the game? What database will you be using?
Amit Bhargava's suggestion is good and has the advantage of being user-understandable. However, string comparisons are pretty slow, so what you could do is use flags in your spell table. Then, based on the flags, you know which type of spell it is.
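For the flag idea, one lightweight option is to store a single integer column and decode it with bitwise tests; a sketch (the flag names below are made up):

from enum import IntFlag

class SpellFlags(IntFlag):
    DAMAGE             = 1
    HEAL               = 2
    TARGETS_ALLY       = 4
    TARGETS_ENEMY      = 8
    IMMOBILIZES_CASTER = 16

# The integer stored in the spell table's 'flags' column:
incinerate_flags = int(SpellFlags.DAMAGE | SpellFlags.TARGETS_ENEMY)   # 9

# Decoding when the row is loaded:
flags = SpellFlags(incinerate_flags)
print(bool(flags & SpellFlags.DAMAGE), bool(flags & SpellFlags.HEAL))  # True False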

Determining the Similarity Between Items in a Database

We have a database with hundreds of millions of records of log data. We're attempting to 'group' this log data as being likely to be of the same nature as other entries in the log database. For instance:
Record X may contain a log entry like:
Change Transaction ABC123 Assigned To Server US91
And Record Y may contain a log entry like:
Change Transaction XYZ789 Assigned To Server GB47
To us humans those two log entries are easily recognizable as being likely related in some way. Now, there may be 10 million rows between Record X and Record Y. And there may be thousands of other entries that are similar to X and Y, and some that are totally different but that have other records they are similar to.
What I'm trying to determine is the best way to group the similar items together and say with XX% certainty that Record X and Record Y are probably of the same nature. Or perhaps a better way of saying it would be that the system would look at Record Y and say, based on its content, it is most like Record X as opposed to all other records.
I've seen some mentions of Natural Language Processing and other ways to find similarity between strings (like just brute-forcing some Levenshtein calculations) - however for us we have these two additional challenges:
The content is machine generated - not human generated
As opposed to a search engine approach where we determine results for a given query - we're trying to classify a giant repository and group them by how alike they are to one another.
Thanks for your input!
Interesting problem. Obviously, there's a scale issue here because you don't really want to start comparing each record to every other record in the DB. I believe I'd look at growing a list of "known types" and scoring records against the types in that list to see if each record has a match in that list.
The "scoring" part will hopefully draw some good answers here -- your ability to score against known types is key to getting this to work well, and I have a feeling you're in a better position than we are to get that right. Some sort of soundex match, maybe? Or if you can figure out how to "discover" which parts of new records change, you could define your known types as regex expressions.
At that point, for each record, you can hopefully determine that you've got a match (with high confidence) or a match (with lower confidence) or very likely no match at all. In this last case, it's likely that you've found a new "type" that should be added to your "known types" list. If you keep track of the score for each record you matched, you could also go back for low-scoring matches and see if a better match showed up later in your processing.
I would suggest indexing your data using a text search engine like Lucene to split your log entries into terms. As your data is machine-generated, also use word bigrams and trigrams, or even higher-order n-grams. A bigram is just a pair of consecutive words; in your example you would have the following bigrams:
Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47
For each log entry, prepare queries in a similar way; the search engine may give you the most similar results. You may need to tweak the similarity function a bit to obtain the best results, but I believe this is a good start.
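Generating those n-grams yourself is trivial if you want to experiment before setting up Lucene; a small sketch (plain whitespace tokenization, which is an assumption about the log format):

def ngrams(text, n):
    tokens = text.split()
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

line = "Change Transaction XYZ789 Assigned To Server GB47"
print(ngrams(line, 2))   # ['Change_Transaction', 'Transaction_XYZ789', ..., 'Server_GB47']
print(ngrams(line, 3))   # trigrams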
Two main strategies come to my mind here:
the ad-hoc one: use an information retrieval approach. Build an index of the log entries, possibly using a specialized tokenizer/parser, by feeding them into a regular text search engine. I've heard of people doing this with Xapian and Lucene. Then you can "search" for a new log record, and the text search engine will (hopefully) return some related log entries to compare it with. Usually, however, the information retrieval approach is only interested in finding the 10 most similar results.
the clustering approach: you will usually need to turn the data into numerical vectors (which may, however, be sparse), e.g. as TF-IDF. Then you can apply a clustering algorithm to find groups of closely related lines (such as the example you gave above) and investigate their nature. You might need to tweak this a little so that it doesn't, for example, cluster on the server ID. (A short sketch of this strategy follows below.)
Both strategies have their ups and downs. The first one is quite fast; however, it will only ever return some similar existing log lines, without much quantitative information on how common such a line is. It's mostly useful for human inspection.
The second strategy is more computationally intensive, and depending on your parameters could fail completely (so maybe test it on a subset first), but could also give more useful results by actually building large groups of log entries that are very closely related.
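A sketch of that second, clustering strategy using scikit-learn (assumed available; the number of clusters, the n-gram range, and the toy log lines are arbitrary and would need tuning, e.g. so transaction IDs don't dominate the clusters):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

logs = [
    "Change Transaction ABC123 Assigned To Server US91",
    "Change Transaction XYZ789 Assigned To Server GB47",
    "User jsmith Logged In From 10.0.0.7",
    "User akhan Logged In From 10.0.0.9",
]

# Word unigrams + bigrams as TF-IDF features; the matrix is sparse, so it scales to large corpora.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(logs)

# MiniBatchKMeans copes with very large datasets; 2 clusters just for this toy example.
labels = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # records sharing a label are grouped as "probably of the same nature"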
It sounds like you could take the Lucene approach mentioned above, then use that as a source of input vectors for the machine learning library Mahout (http://mahout.apache.org/). From there you can train a classifier, or just use one of their clustering algorithms.
If your DBMS has it, take a look at SOUNDEX().
