GDPR Pseudonymisation [closed] - database

GDPR specifies the following about the processing of personal data:
Those measures may include pseudonymisation provided that those purposes can be fulfilled in that manner. Where those purposes can be fulfilled by further processing which does not permit or no longer permits the identification of data subjects, those purposes shall be fulfilled in that manner.
In a normal workflow this data is already pseudonymised in a sense: there is a table in the database holding the personal data, with an ID that is used as a foreign key in the other tables. But in case of a security breach, if the database is stolen, the personal data is no longer pseudonymised.
Does this mean that we need to have another database with the personal data?
EDIT
Added Article 32:
Taking into account the state of the art, the costs of implementation and the nature, scope, context and purposes of processing as well as the risk of varying likelihood and severity for the rights and freedoms of natural persons, the controller and the processor shall implement appropriate technical and organisational measures to ensure a level of security appropriate to the risk, including inter alia as appropriate:
(a) the pseudonymisation and encryption of personal data;
...
[including inter alia as appropriate]

Disclaimer:
I'm not a lawyer or authority on this topic, just sharing my thinking on this from the perspective of a developer who has worked with 'pseudonymised' user databases.
The Oxford English Dictionary definition of pseudonym is:
A fictitious name, especially one used by an author.
‘I wrote under the pseudonym of Evelyn Hervey’
So in the context of GDPR, a pseudonym seems likely to mean some made-up name for an individual that doesn't identify the individual unless combined with some other information. A tangible example might be, as you suggest, a user ID which indexes that individual's personal data in some table.
OK, so to your question: should this table be isolated in its own database?
The regulation provides its own definition of pseudonymisation which provides some clarification here:
(5) ‘pseudonymisation’ means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person;
Why the emphasis on separation?
We know that GDPR is concerned with protecting user privacy.
If pseudonyms are only used in a context that also allows correspondence to be drawn between pseudonyms and the individuals they reference, then no privacy has been provided.
So some separation is needed. My reading is that the degree of separation required, and the level of security necessary to enforce it, should be a function of the sensitivity of the data you're holding and the fallout mitigations afforded in case some isolated part of your system is compromised.
So for your example: if storing personal data in a separate database, for whatever reason, allows you to limit some discrete part of your system to only accessing user IDs, then if that part of the system were compromised you've only exposed user IDs, and we might expect that to be viewed more favourably in the eyes of GDPR.
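As an illustration only (not legal advice), here is a minimal sketch of that kind of separation, assuming Python with SQLite and two physically separate database files; the table and column names are made up for the example:

import sqlite3
import uuid

# Identifying data lives in its own store, keyed by an opaque pseudonym.
identity_db = sqlite3.connect("identity.db")
identity_db.execute("""
    CREATE TABLE IF NOT EXISTS person (
        pseudonym TEXT PRIMARY KEY,   -- opaque ID, e.g. a random UUID
        full_name TEXT NOT NULL,
        email     TEXT NOT NULL
    )""")

# The operational store only ever sees the pseudonym.
app_db = sqlite3.connect("app.db")
app_db.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id  INTEGER PRIMARY KEY,
        pseudonym TEXT NOT NULL,      -- no personal data in this database
        total     REAL NOT NULL
    )""")

pseudonym = str(uuid.uuid4())
identity_db.execute("INSERT INTO person VALUES (?, ?, ?)",
                    (pseudonym, "Jane Doe", "jane@example.com"))
app_db.execute("INSERT INTO orders (pseudonym, total) VALUES (?, ?)",
               (pseudonym, 42.0))
identity_db.commit()
app_db.commit()

If app.db alone is compromised, the attacker only sees opaque pseudonyms; re-identification requires identity.db as well, which plays the role of the "additional information kept separately" from the definition quoted above.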

Related

Design - Dynamic data mapping [closed]

I am working on an online tool that serves a number of merchants (let's say retail merchants). The application takes data from different merchants and provides some analysis of their retail shop. The idea is that any merchant can sign up for the tool, send their transaction and inventory data (perhaps uploaded through Excel, or posted as a JSON object), and get the results back.
My application has a domain model that is intrinsic to the application and contains all the data points that merchants can use, e.g.:
Product {
productId,
productName,
...
}
But the problem I am facing is that each merchant has their own way of representing data: e.g. merchant X may call a product prod while merchant Y may call it proddt.
Now I need a way to convert data represented in a merchant's format into a form the application understands, i.e. each time there is a request from merchant X, the application should map prod to product, and so on.
At first I thought of hand-coding these mappers, but that is not viable, as I can't really code these mappings for the thousands of merchants that may join my application.
Another solution I was thinking of was to let the merchant map fields from their domain to the application domain through the UI, save this mapping in the database, and on each request from the merchant look up the mapping and apply it to the incoming data (though I am still unsure how this can be done).
Has anyone faced a similar design issue before and knows of a better way of solving this problem?
If you can fix the order of fields then you can easily map the data sent by your client and return the result. For example, in Excel your client can supply data in this format:
product | name | quantity | cost
Condition: ALL your clients should send data in this format.
Then it will be easy for you to map these fields, read them into the correct DTO, and later save and process the data.
I appreciate this "language" concern, and -in fact- multi-lingual applications do it the way you describe. You need to standardize your terminology at your end, so that each term has only one meaning and only one word/term to describe it. You could even use mnemonics for that, e.g. for "favourite product" you use "Fav_Prod" in your app and in your DB. Then, when you present data to you customer, your app looks-up their preferred term for it in a look-up-table, and uses "favourite product" for customer one, and perhaps the admin, and then "favr prod" for customer two, etc...
Look at SQL and DB design, you'll find that this is a form of normalization.
Are you dealing with legacy systems and/or APIs at the customer end? If so, someone will indeed have to type in the data.
If you have 1000s of customers, but there are only 10..50 terms, it may best to let the customer, not you, set the terms.
You might be lucky, and be able to cluster customers together who use similar or close enough terminology. For new customers you could offer them a menu of terms that they can choose from.
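A minimal Python sketch of that per-customer look-up-table idea (the customer IDs and terms are invented for illustration):

# Internal, standardized terms mapped to each customer's preferred label.
# In practice this would live in a DB table keyed by (customer_id, internal_term).
CUSTOMER_TERMS = {
    "customer_one": {"Fav_Prod": "favourite product"},
    "customer_two": {"Fav_Prod": "favr prod"},
}

def display_label(customer_id: str, internal_term: str) -> str:
    """Return the customer's preferred label, falling back to the internal term."""
    return CUSTOMER_TERMS.get(customer_id, {}).get(internal_term, internal_term)

print(display_label("customer_two", "Fav_Prod"))  # -> "favr prod"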
If merchants were required to input their mapping with their data, your tool would not require a DB. In JSON, the input could be like the following:
input = {mapping: {ID: "productId", name: "productName"}, data: {productId: 0, productName: ""}}
Then, you could convert data represented in any merchant's format to your tool's format as follows:
ID = input.data[input.mapping.ID]
name = input.data[input.mapping.name]
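Generalizing that a little, a hedged Python sketch of applying a stored mapping to an incoming record might look like this (the field names are just examples):

def to_canonical(mapping: dict, record: dict) -> dict:
    """Translate one merchant record into the application's field names.

    mapping: canonical field -> merchant field, e.g. {"productId": "prod"}
    record:  raw data as sent by the merchant
    """
    return {canonical: record.get(merchant_field)
            for canonical, merchant_field in mapping.items()}

# Example: merchant X calls the product fields "prod" and "prod_name".
mapping_x = {"productId": "prod", "productName": "prod_name"}
raw = {"prod": 17, "prod_name": "Blue mug"}
print(to_canonical(mapping_x, raw))  # {'productId': 17, 'productName': 'Blue mug'}

The same function works whether the mapping comes with the request, as in the JSON example above, or is loaded from the database per merchant.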
To recap:
You have an application
You want to load client data (merchants in this case) into your application
Your clients “own” and manage this data
There are N ways in which such client data can be managed, where N <= the number of possible clients
You will run out of money and your business will close before you can build support for all N models
Unless you are Microsoft, or Amazon, or Facebook, or otherwise have access to significant resources (time, people, money)
This may seem harsh, but it is pragmatic. You should assume NOTHING about how potential clients will be storing their data. Get anything wrong, and your product will process and return bad results, and you will lose that client. Unless your clients are using the same data management tools, and possibly even then, their data structures and formats will differ, and could differ significantly.
Based on my not entirely limited experience, I see three possible ways to handle this.
1) Define your own way of modeling data. Require your customers to provide data in this format. Accept that this will limit your customer base.
2) Identify the most likely ways (models) your potential clients will be storing data (e.g. the most common existing software systems they might be using for this). Build import structures and formats to support these models. This, too, will limit your customer base.
3) Start with either of the above. Then, as part of your business model, agree to build out your system to support clients who sign up. If you already support their data model, great! If not, you will have to build it out. Maybe you can work the expense of this into what you charge them, maybe not. Your customer base will be limited by how efficiently you can add new and functional means of loading data to your system.

Auto incrementing from two different points on the same table?

Each customer has an ID. For every new customer it's incremented by 1.
My company wants to start selling to a new type of customer (let's call it customer type B), and have their IDs start at 30k.
Is this a poor design? Our system has an option to set the customer type, so there is no need for this. We're not expected to reach this number for our other type of customers until 2050 (it's currently at 10k).
I'm trying to convince them that this is a bad idea... It also doesn't help that the third party implementing this is OK with it.
Confusing technical and functional keys is the really bad idea here. Technical keys are designed to remain unchanged, to maintain referential integrity and to speed up joins. Technical keys should also not be exposed at the application level (i.e. in screen forms).
Functional keys are visible to application users and may be subject to change according to new requirements.
Another bad idea is to use parts of the number to separate data partitions (customer groups in your case). A character code with a structured format, like a social security number, is a stronger and more scalable solution (though it's a sort of denormalization, too).
The recommended solution is to have a technical ID as an auto-incremented number, but also add a functional customer code.
If the identifier is customer-facing or business-user-facing then it makes sense to design it with readability and usability in mind. Most people find it very natural to work with identification schemes that have information encoded within them: think of phone numbers, vehicle licence plate numbers, airline flight numbers. There is evidence that a well-structured identification scheme can improve usability and reduce the rate of errors in day-to-day use. You could expect that a well-designed "meaningful" identification scheme will result in fewer errors than an arbitrary incrementing number with no identifiable features.
There's nothing inherently wrong with doing this. Whether it makes sense in your business context depends a lot on how the number will be used. Technically speaking it isn't difficult to build multiple independent sequences, e.g. using the SEQUENCE feature in SQL. You may want to consider including other features as well, like check digits.
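To make the "technical ID plus functional customer code" suggestion above concrete, here is a minimal sketch assuming Python with SQLite; the code format (type prefix plus zero-padded number) is invented for the example:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE customer (
        id            INTEGER PRIMARY KEY AUTOINCREMENT,  -- technical key, never exposed
        customer_type TEXT NOT NULL CHECK (customer_type IN ('A', 'B')),
        customer_code TEXT UNIQUE                         -- functional key, user-facing
    )""")

def add_customer(customer_type: str) -> str:
    """Insert a customer and derive a readable functional code such as 'B-000001'."""
    cur = db.execute("INSERT INTO customer (customer_type) VALUES (?)",
                     (customer_type,))
    code = f"{customer_type}-{cur.lastrowid:06d}"
    db.execute("UPDATE customer SET customer_code = ? WHERE id = ?",
               (code, cur.lastrowid))
    return code

print(add_customer("B"))  # e.g. 'B-000001'

The surrogate id keeps joins and referential integrity simple, while the functional code is the only identifier shown to users and can be reformatted later without touching foreign keys.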

A topic to experienced database architects [closed]

I face the following problem.
I'm creating a database for (say) human beings' info. All human beings can be classified into one of three categories: adult female, adult male, child. It is clear that parameters like "height" and "weight" are applicable to all of the categories. The parameter "number of children" is applicable only to adults, while the parameter "number of pregnancies" is applicable to females only. Also, each parameter may be classified as mandatory or optional depending on the category (for example, for adults the parameter "number of ex-partners" is optional).
When I load (say) "height" and "weight", I check whether the info in these two fields is self-consistent. I.e., I mark as a mistake the record which has height=6'4'' and weight=10 lb (obviously, this is physically impossible). I have several similar verification rules.
When I insert a record about a human being, I need to reflect the following characteristics of the info:
the maximum possible info for the category of this particular human being (including all the optional parameters).
the required minimum of information for the category (i.e., mandatory fields only)
what has actually been inserted for this particular human being (i.e., it is possible to insert whatever I have for this person, no matter whether it is smaller than the required minimum of info or not). The non-trivial issue here is that a field "XXX" may have a NULL value because I have never inserted anything there OR because I have intentionally inserted exactly a NULL value. The same logic applies to fields that have a default value. So somewhere it should be recorded that I have processed this particular field.
what amount of inserted information has been verified (i.e., even if I load some 5 fields, I may check only 3 fields for self-consistency while ignoring the other 2).
So my question is how to technically organize it. Currently, all these required features are either hardcoded with no unified logic or broken into completely independent blocks. I need to create a unified approach.
I have some naive ideas in this regard. For example, for each category of human beings, I can create and store a list of possible fields (I call it a "template"). I can mark those fields that are mandatory.
When I insert a record about a human being, I copy the template and mark which fields from this template have actually been processed. At the next stage, I can mark in this copy of the template those fields that will currently be verified.
The verification module is adjusted in the following way: for each verification procedure I create a list of the fields used by that procedure. Then I call only those verification procedures whose fields are actually marked "to be verified" in the copy of the template for the particular human being that is to be verified (see the previous paragraph).
As you see, this is the most straightforward way to solve the problem. But my guess is that there are a lot of quite standardized approaches that I'm not aware of. I really doubt that I'm the first in the world to solve such a problem. I don't like my solution because it is really painful to write code that correctly reflects in this copied template all the "updates" happening to a record.
So I ask you to share your opinion: how would you solve this problem?
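For what it's worth, here is a bare-bones Python sketch of the "template plus marking" idea described above (the field names, units and rule are invented):

# Per-category template: field -> is_mandatory
TEMPLATES = {
    "adult_female": {"height": True, "weight": True,
                     "number_of_children": False, "number_of_pregnancies": False},
}

# Each verification rule declares which fields it needs.
def check_height_weight(rec):
    # Crude plausibility check: height in metres, weight in kilograms.
    return not (rec["height"] > 1.9 and rec["weight"] < 10)

RULES = [({"height", "weight"}, check_height_weight)]

def verify(category, record, to_verify):
    """Run only the rules whose required fields are all marked for verification."""
    template = TEMPLATES[category]
    missing_mandatory = [f for f, mandatory in template.items()
                         if mandatory and f not in record]
    failures = [rule.__name__ for fields, rule in RULES
                if fields <= to_verify and fields <= record.keys()
                and not rule({f: record[f] for f in fields})]
    return missing_mandatory, failures

print(verify("adult_female", {"height": 1.93, "weight": 4.5}, {"height", "weight"}))
# -> ([], ['check_height_weight'])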
I think there are two questions here:
how do I store polymorphic data in a database?
how do I validate complex business rules?
You should address them separately - trying to solve both at once is probably too hard.
There are a few approaches to polymorphic data in RDBMSes - ORMs use the term inheritance mapping, for instance. The three solutions here - table per class hierarchy, table per subclass and table per concrete class - are "pure" relational solutions. You can also use the "Entity-Attribute-Value" design, or use a document approach (storing data in structured formats such as XML or JSON) - these are not "pure" relational options, but have their place.
Validating complex business rules is often done using rule engines - these are super cool bits of technology, but you have to be sure that your problem really fits with their solution - deciding to invest in a rules engine means your project changes into a rules engine project, not a "humans" project. Alternatively, most mainstream solutions to this embody the business logic about the entities in the application's business logic layer. It sounds like you're outgrowing this.
This exact problem, both in health terms and in terms of a financial instrument, is used as a primary example in Martin Fowler's book Analysis Patterns. It is an extensive topic. As @NevilleK says, you are trying to deal with two questions, and it is best to deal with them separately. One ultra-simplified way of approaching these problems is:
1. Storage of polymorphic data - only put mandatory data that is common to the category in the category table. Put optional data in separate tables in a 1-1 relationship to the category table. Entries are made in these optional tables only if there is a value to be recorded. The record of the verification of the data can also be put in these additional tables.
2. Validation of complex business rules - it is useful to consider the types of error that can arise. There are a number of ways of classifying the errors, but the one I have found most useful is: (a) type errors, where one can tell that the value is in error just by looking at the data - e.g. 1980-02-30; (b) context errors, where one can detect the error only by reference to previously captured data - e.g. DoB 1995-03-15, date of marriage 1996-08-26; and (c) lies to the system - where the data type is OK and the context is OK, but the information can only be detected as incorrect at a later date when more information comes to light, e.g. if I register my DoB as 1990-12-31 when it is something different. This latter class of error typically has to be dealt with by procedures outside the system being developed.
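A tiny Python sketch of the first two error classes (the field names are invented; the third class, lies to the system, cannot be caught in code at entry time):

from datetime import date

def type_errors(rec):
    """Errors visible from the value alone, e.g. an impossible calendar date."""
    errs = []
    try:
        date.fromisoformat(rec["date_of_birth"])   # rejects e.g. '1980-02-30'
    except ValueError:
        errs.append("date_of_birth is not a valid date")
    return errs

def context_errors(rec):
    """Errors visible only by comparing fields, e.g. marriage implausibly soon after birth."""
    errs = []
    dob = date.fromisoformat(rec["date_of_birth"])
    marriage = date.fromisoformat(rec["date_of_marriage"])
    if marriage < dob.replace(year=dob.year + 16):
        errs.append("date_of_marriage is implausibly close to date_of_birth")
    return errs

rec = {"date_of_birth": "1995-03-15", "date_of_marriage": "1996-08-26"}
print(type_errors(rec), context_errors(rec))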
I would use the Party Role pattern (Silverston):
Party
id
name
Individual : Party
current_weight
current_height
PartyRole
id
party_id
from_date
to_date (nullable)
AdultRole : PartyRole
number_of_children
FemaleAdultRole : AdultRole
number_of_pregnancies
Postgres has a temporal extension such that you could enforce that a party could only play one role at a time (yet maintain their role histories).
Use table inheritance. For simplicity use Single Table Inheritance (has nulls), for no nulls use Class Table Inheritance.
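As a rough illustration, a Class Table Inheritance layout for the categories in the question might look like this in SQLite (via Python); the column names are just examples:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Base table: fields common to every category, so no NULL-heavy columns.
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        category  TEXT NOT NULL CHECK (category IN ('adult_female', 'adult_male', 'child')),
        height_cm REAL,
        weight_kg REAL
    );

    -- Subclass tables: one row per person of that (sub)category, 1-1 with person.
    CREATE TABLE adult (
        person_id          INTEGER PRIMARY KEY REFERENCES person(person_id),
        number_of_children INTEGER
    );

    CREATE TABLE adult_female (
        person_id             INTEGER PRIMARY KEY REFERENCES person(person_id),
        number_of_pregnancies INTEGER
    );
""")

db.execute("INSERT INTO person VALUES (1, 'adult_female', 163.0, 58.0)")
db.execute("INSERT INTO adult VALUES (1, 2)")
db.execute("INSERT INTO adult_female VALUES (1, 2)")

Single Table Inheritance would instead put number_of_children and number_of_pregnancies directly on person and leave them NULL for the categories where they don't apply.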

Is it too extreme of a measure to reduce redundancy in Microsoft's SQL Server 2012 by making tons of unique constraints? [closed]

I have recently been tasked with making an account system for all of our products, much like that of a Windows account across Microsoft's products. However, one of the requirements is that we are able to easily check for accounts with the same information across them so that we can detect traded accounts, sudden and possibly fraudulent changes of information, etc.
When thinking of a way to solve this problem, I thought we could reduce redundancy of our data while we're at it. I thought it might help us save some storage space and processing time, since in the end we're going to just be processing the data set into what I explain below.
A bit of background on how this is set up right now:
An account table just contains an id and a username
A profile table contains a reference to an account and references to separate pieces of profile data: names, mailing addresses, email addresses
A name table contains an id and the first, last, and middle name of an individual
An address table contains data about an address
An email address table contains an id and the mailbox and domain of an email address
A profile record is what relates the unique pieces of profile data (shared across many accounts) to a specific account. If there are fifty people named "John Smith", then there is only one "John Smith" record in the names table. If a user changes any piece of their information, the profile record is soft deleted and a new one is created. This is to facilitate change tracking.
After profiling, I have noticed that creating constraints like UNIQUE(FirstName, MiddleName, LastName) is pretty painful in terms of record insertion. Is that simply the price we're going to have to pay or is there a better approach?
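For reference, a minimal sketch of the shared-record scheme described above (SQLite via Python, column names invented), including the get-or-create lookup that the composite UNIQUE constraint implies for every insert:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE name (
        id          INTEGER PRIMARY KEY,
        first_name  TEXT NOT NULL,
        middle_name TEXT NOT NULL DEFAULT '',
        last_name   TEXT NOT NULL,
        UNIQUE (first_name, middle_name, last_name)
    )""")

def get_or_create_name(first, middle, last):
    """Reuse an existing 'John Smith' row if present, otherwise insert one."""
    row = db.execute(
        "SELECT id FROM name WHERE first_name=? AND middle_name=? AND last_name=?",
        (first, middle, last)).fetchone()
    if row:
        return row[0]
    return db.execute(
        "INSERT INTO name (first_name, middle_name, last_name) VALUES (?, ?, ?)",
        (first, middle, last)).lastrowid

a = get_or_create_name("John", "", "Smith")
b = get_or_create_name("John", "", "Smith")
assert a == b  # fifty John Smiths share one row

The extra unique index and the lookup before every insert are roughly where the insertion cost mentioned above comes from.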
Having records about 2 persons named John Smith is not a redundancy, but a necessity.
The approach you've suggested is far from being optimal. "The profile record is soft deleted and a new one is created. This is to facilitate change tracking." Deleting and re-inserting will cause problems with dependent records in other tables.
There are much easier methods to track changes - search for some 3rd-party tools.
As for the tables created, it's not necessary to split the data into so many tables. Why don't you merge the Name and Account tables? Are both the Address and Email address tables necessary?
I have concluded my research and decided that this approach is fine if insert performance is not critical. In cases where it is critical, increasing data redundancy within reason is an acceptable trade off.
The solution described in my question is adequate for my performance needs. Storage is considered more expensive than insertion time, in our model.

Best practices for consistent and comprehensive address storage in a database [closed]

Are there any best practices (or even standards) for storing addresses in a consistent and comprehensive way in a database?
To be more specific, I believe at this stage that there are two cases for address storage:
you just need to associate an address to a person, a building or any item (the most common case). Then a flat table with text columns (address1, address2, zip, city) is probably enough. This is not the case I'm interested in.
you want to run statistics on your addresses: how many items in a specific street, or city, or... Then you want to avoid misspellings of any sort, and ensure consistency. My question is about best practices in this specific case: what are the best ways to model a consistent address database?
A country-specific design/solution would be an excellent start.
ANSWER: There does not seem to be a perfect answer to this question yet, but:
xAL, as suggested by Hank, is the closest thing to a global standard that popped up. It seems to be quite overkill, though, and I am not sure many people would want to implement it in their database...
To start one's own design (for a specific country), Dave's link to the Universal Postal Union (UPU) site is a very good starting point.
As for France, there is a norm (not official, but a de facto standard) for addresses, which bears the lovely name of AFNOR XP Z10-011 (French only), and has to be paid for. The UPU description for France is based on this norm.
I happened to find the equivalent norm for Sweden: SS 613401.
At European level, some effort has been made, resulting in the norm EN 14142-1. It is obtainable via CEN national members.
I've been thinking about this myself as well. Here are my loose thoughts so far, and I'm wondering what other people think.
xAL (and its sister that includes personal names, xNAL) is used by both Google's and Yahoo's geocoding services, giving it some weight. But since the same address can be described in xAL in many different ways (some more specific than others), I don't see how xAL itself is an acceptable format for data storage. Some of its field names could be used, however; in reality the only basic format that can be used across the 16 countries that my company ships to is the following:
enum address-fields
{
name,
company-name,
street-lines[], // up to 4 free-type street lines
county/sublocality,
city/town/district,
state/province/region/territory,
postal-code,
country
}
That's easy enough to map into a single database table, just allowing for NULLs on most of the columns. And it seems that this is how Amazon and a lot of organizations actually store address data. So the question that remains is how I should model this in an object model that is easily used by programmers and by any GUI code. Do we have a base Address type with subclasses for each type of address, such as AmericanAddress, CanadianAddress, GermanAddress, and so forth? Each of these address types would know how to format itself and optionally would know a little bit about validating its fields.
They could also return some type of metadata about each of the fields, such as the following pseudocode data structure:
structure address-field-metadata
{
field-number, // corresponds to the enumeration above
field-index, // the order in which the field is usually displayed
field-name, // a "localized" name; US == "State", CA == "Province", etc
is-applicable, // whether or not the field is even looked at / valid
is-required, // whether or not the field is required
validation-regex, // an optional regex to apply against the field
allowed-values[] // an optional array of specific values the field can be set to
}
In fact, instead of having individual address objects for each country, we could take the slightly less object-oriented approach of having an Address object that eschews .NET properties and uses an AddressStrategy to determine formatting and validation rules:
object address
{
set-field(field-number, field-value),
address-strategy
}
object address-strategy
{
validate-field(field-number, field-value),
cleanse-address(address),
format-address(address, formatting-options)
}
When setting a field, that Address object would invoke the appropriate method on its internal AddressStrategy object.
The reason for using a SetField() method approach rather than properties with getters and setters is so that it is easier for code to actually set these fields in a generic way without resorting to reflection or switch statements.
You can imagine the process going something like this:
GUI code calls a factory method or some such to create an address based on a country. (The country dropdown, then, is the first thing that the customer selects, or has a good guess pre-selected for them based on culture info or IP address.)
GUI calls address.GetMetadata() or a similar method and receives a list of the AddressFieldMetadata structures as described above. It can use this metadata to determine what fields to display (ignoring those with is-applicable set to false), what to label those fields (using the field-name member), display those fields in a particular order, and perform cursory, presentation-level validation on that data (using the is-required, validation-regex, and allowed-values members).
GUI calls the address.SetField() method using the field-number (which corresponds to the enumeration above) and its given values. The Address object or its strategy can then perform some advanced address validation on those fields, invoke address cleaners, etc.
There could be slight variations on the above if we want to make the Address object itself behave like an immutable object once it is created. (Which I will probably try to do, since the Address object is really more like a data structure, and probably will never have any true behavior associated with itself.)
Does any of this make sense? Am I straying too far off the OOP path? To me, this represents a pretty sensible compromise between being so abstract that implementation is nigh impossible (xAL) and being strictly US-biased.
Update 2 years later: I eventually ended up with a system similar to this and wrote about it at my defunct blog.
I feel like this solution is the right balance between legacy data and relational data storage, at least for the e-commerce world.
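A compressed Python sketch of the metadata-driven approach sketched above (the field names, country rules and regexes are invented for the example):

import re

# Per-country field metadata: (field, localized label, required, validation regex).
ADDRESS_FIELDS = {
    "US": [("state", "State", True, r"^[A-Z]{2}$"),
           ("postal-code", "ZIP code", True, r"^\d{5}(-\d{4})?$")],
    "CA": [("state", "Province", True, r"^[A-Z]{2}$"),
           ("postal-code", "Postal code", True, r"^[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d$")],
}

class Address:
    def __init__(self, country):
        self.country = country
        self.fields = {}

    def metadata(self):
        return ADDRESS_FIELDS[self.country]

    def set_field(self, name, value):
        for field, label, required, pattern in self.metadata():
            if field == name:
                if pattern and not re.match(pattern, value):
                    raise ValueError(f"{label} looks invalid: {value!r}")
                self.fields[name] = value
                return
        raise KeyError(f"{name} is not applicable for {self.country}")

addr = Address("CA")
addr.set_field("postal-code", "K1A 0B1")   # accepted
addr.set_field("state", "ON")              # accepted, labelled 'Province' in the UI

The GUI only ever talks to metadata() and set_field(), which is the generic, reflection-free access the SetField() approach above is aiming for.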
I'd use an Address table, as you've suggested, and I'd base it on the data tracked by xAL.
In the UK there is a product called PAF from Royal Mail
This gives you a unique key per address - there are hoops to jump through, though.
I basically see 2 choices if you want consistency:
Data cleansing
Basic data table look ups
Ad 1. I work with the SAS System, and SAS Institute offers a tool for data cleansing - this basically performs some checks and validations on your data, and suggests that "Abram Lincoln Road" and "Abraham Lincoln Road" be merged into the same street. I also think it draws on national databases containing city-postal code matches and so on.
Ad 2. You build up a multiple-choice list (i.e. basic data), and people adding new entries pick from existing entries in your basic data. In your fact table, you store keys to street names instead of the street names themselves. If you detect a spelling error, you just correct it in your basic data, and all instances are corrected with it, through the key relation.
Note that these options don't rule out each other, you can use both approaches at the same time.
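A minimal sketch of option 2 (SQLite via Python, table and column names invented): the fact table stores a key into the street look-up table, so a spelling fix in one place corrects every reference:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE street (
        street_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL UNIQUE
    );
    CREATE TABLE address (
        address_id   INTEGER PRIMARY KEY,
        street_id    INTEGER NOT NULL REFERENCES street(street_id),
        house_number TEXT
    );
""")

db.execute("INSERT INTO street (street_id, name) VALUES (1, 'Abram Lincoln Road')")
db.execute("INSERT INTO address (street_id, house_number) VALUES (1, '12')")
db.execute("INSERT INTO address (street_id, house_number) VALUES (1, '14b')")

# Correct the misspelling once; every address row now resolves to the fixed name.
db.execute("UPDATE street SET name = 'Abraham Lincoln Road' WHERE street_id = 1")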
In the US, I'd suggest choosing a National Change of Address vendor and model the DB after what they return.
The authorities on how addresses are constructed are generally the postal services, so for a start I would examine the data elements used by the postal services for the major markets you operate in.
See the website of the Universal Postal Union for very specific and detailed information on international postal address formats: http://www.upu.int/post_code/en/postal_addressing_systems_member_countries.shtml
"xAl is the closest thing to a global standard that popped up. It seems to be quite an overkill though, and I am not sure many people would want to implement it in their database..."
This is not a relevant argument. Implementing addresses is not a trivial task if the system needs to be "comprehensive and consistent" (i.e. worldwide). Implementing such a standard is indeed time-consuming, but nevertheless mandatory to meet the specified requirement.
Normalize your database schema and you'll have the perfect structure for correct consistency. Here is why:
http://weblogs.sqlteam.com/mladenp/archive/2008/09/17/Normalization-for-databases-is-like-Dependency-Injection-for-code.aspx
I asked something quite similar earlier: Dynamic contact information data/design pattern: Is this in any way feasible?.
The short answer: storing addresses, or any kind of contact information, in a database is complex. The eXtensible Address Language (xAL) link above has some interesting information that is the closest to a standard/best practice that I've come across...
