Big Data Algorithm Validation legal issues - database

I am at the beginning of a personal big data project.
At the end, after I have implemented a specific algorithm, I want to validate it.
Unfortunately, validating the algorithm's output requires personal information such as:
Name, Surname
Occupation/Job
Organization name
Age
Location
Years of working experience
Income per month/year
So my question is:
How can I collect data without violating laws?
My thoughts on the question:
I know there are business-oriented social networks like LinkedIn or Xing, but I need additional information such as income, which is either not available through those networks' APIs or would be illegal to store and use.

Related

Design - Dynamic data mapping [closed]

I am working on an online tool that serves a number of merchants (e.g., retail merchants). The application takes data from different merchants and returns some analysis of their retail shop. The idea is that any merchant can sign up for the tool, send their transaction and inventory data (perhaps uploaded through Excel, or posted to my application as a JSON object), and get the results back.
My application has a domain model that is intrinsic to the application and contains all the data points merchants can use, e.g.:
Product {
productId,
productName,
...
}
But the problem I am facing is that each merchant has their own way of representing data: merchant X may call a product prod, while merchant Y may call it proddt.
I need a way to convert data represented in a merchant's format into one the application understands, i.e., each time there is a request from merchant X, the application should map prod to product, and so on.
At first I thought of hand-coding these mappers, but that is not a viable solution, as I can't realistically code these mappings for the thousands of merchants that may join my application.
Another solution I was considering is to let each merchant map fields from their domain to the application domain through the UI, save that mapping in the database, and then, on each request from the merchant, look up the mapping and apply it to the incoming payload (though I am still unsure how this would be done; see the sketch below).
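A rough sketch of how such a saved mapping could be applied to an incoming request, in Python; a hard-coded dictionary stands in for the database lookup, and the merchant and field names are made up for illustration:

# Hypothetical sketch: translate an incoming merchant payload into the
# application's domain fields using a mapping saved for that merchant.

# Mapping as it might be saved from the UI (merchant field -> domain field).
# In a real system this would be loaded from the database by merchant id.
SAVED_MAPPINGS = {
    "merchant_x": {"prod": "productId", "prodName": "productName"},
    "merchant_y": {"proddt": "productId", "name": "productName"},
}

def translate(merchant_id: str, payload: dict) -> dict:
    """Rename the merchant's fields to the application's domain fields."""
    mapping = SAVED_MAPPINGS[merchant_id]  # a DB lookup in practice
    return {mapping.get(key, key): value for key, value in payload.items()}

print(translate("merchant_x", {"prod": 42, "prodName": "Socks"}))
# -> {'productId': 42, 'productName': 'Socks'}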
Has anyone faced a similar design issue before and found a good way of solving this problem?
If you can fix the order of the fields, then you can easily map the data sent by your clients and return a result. For example, in Excel your clients could provide data in this format:
product | name | quantity | cost
Condition: ALL your clients must send data in this format.
It is then easy to map these fields onto the correct DTO and later save and process the data.
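As an illustration only, here is a minimal Python sketch that assumes the sheet is exported as pipe-delimited text with no header row; the DTO fields simply follow the agreed column order:

import csv
from dataclasses import dataclass

# Agreed column order every client must follow: product | name | quantity | cost
@dataclass
class ProductDTO:
    product: str
    name: str
    quantity: int
    cost: float

def load_rows(path: str) -> list[ProductDTO]:
    """Parse a pipe-delimited export into DTOs, relying on the fixed column order."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="|")
        return [ProductDTO(p.strip(), n.strip(), int(q), float(c))
                for p, n, q, c in reader]

If a header row is present, it would need to be skipped before parsing.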
I appreciate this "language" concern, and, in fact, multi-lingual applications do it the way you describe. You need to standardize your terminology at your end, so that each term has only one meaning and only one word/term to describe it. You could even use mnemonics for that, e.g. for "favourite product" you use "Fav_Prod" in your app and in your DB. Then, when you present data to your customer, your app looks up their preferred term in a look-up table, and uses "favourite product" for customer one (and perhaps the admin), "favr prod" for customer two, and so on (a sketch follows at the end of this answer).
Look at SQL and DB design; you'll find that this is a form of normalization.
Are you dealing with legacy systems and/or APIs at the customer end? If so, someone will indeed have to type in the data.
If you have 1000s of customers but only 10..50 terms, it may be best to let the customer, not you, set the terms.
You might be lucky, and be able to cluster customers together who use similar or close enough terminology. For new customers you could offer them a menu of terms that they can choose from.
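A minimal sketch of the look-up-table idea, with an in-memory dictionary standing in for the real table; the customer labels are invented for illustration:

# Hypothetical look-up table: one standardized internal term per concept,
# plus each customer's preferred label for it.
CUSTOMER_LABELS = {
    "customer_1": {"Fav_Prod": "favourite product"},
    "customer_2": {"Fav_Prod": "favr prod", "Qty": "amount"},
}

def label_for(customer: str, internal_term: str) -> str:
    """Return the customer's preferred label, falling back to the internal term."""
    return CUSTOMER_LABELS.get(customer, {}).get(internal_term, internal_term)

print(label_for("customer_1", "Fav_Prod"))  # -> favourite product
print(label_for("customer_2", "Qty"))       # -> amount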
If merchants were required to input their mapping with their data, your tool would not require a DB. In JSON, the input could be like the following:
input = {mapping: {ID: "productId", name: "productName"}, data: {productId: 0, productName: ""}}
Then, you could convert data represented in any merchant's format to your tool's format as follows:
ID = input.data[input.mapping.ID]
name = input.data[input.mapping.name]
To recap:
You have an application
You want to load client data (merchants in this case) into your application
Your clients “own” and manage this data
There are N ways in which such client data can be managed, where N <= the number of possible clients
You will run out of money and your business will close before you can build support for all N models
Unless you are Microsoft, or Amazon, or Facebook, or otherwise have access to significant resources (time, people, money)
This may seem harsh, but it is pragmatic. You should assume NOTHING about how potential clients will be storing their data. Get anything wrong and your product will process and return bad results, and you will lose that client. Unless your clients are all using the same data management tools (and possibly even then), their data structures and formats will differ, and could differ significantly.
Based on my not entirely limited experience, I see three possible ways to handle this.
1) Define your own way of modeling data. Require your customers to provide data in this format. Accept that this will limit your customer base.
2) Identify the most likely ways (models) your potential clients will be storing data (e.g. the most common existing software systems they might be using for this). Build import structures and formats to support these models. This, too, will limit your customer base.
3) Start with either of the above. Then, as part of your business model, agree to build out your system to support clients who sign up. If you already support their data model, great! If not, you will have to build it out. Maybe you can work the expense of this into what you charge them, maybe not. Your customer base will be limited by how efficiently you can add new and functional means of loading data to your system.

How do I structure multiple Identity Data in a database

I am designing a database for a credit bureau and am seeking some guidance.
The data they receive from banks, MFIs, SACCOs, utility companies, etc. comes with various types of IDs. For example, it is perfectly legal to open a bank account with a National ID and also with a Passport. One scenario that has me banging my head is this: Customer1 takes a credit facility (call it a loan for now) at Bank1 with their passport, then goes to Bank2 and takes another loan with their National ID, and to Bank3 with their Military ID. When this data eventually comes from the banks to the bureau, it is seen as 3 different people, while we know it is actually 1 person. At this point, there is nothing we can do as a bureau.
However, one way out (for now) is the government registry, which provides a repository holding both passports and IDs. Once we query for this information and get a response, how do I show in the DB that Passport_X is related to NationalID_Y and MilitaryNumber_Z?
Again, a person's name could be captured in various orders. Bank1 could do FName, LName, OName while Bank3 does LName, FName only. How do I store these names?
Even against one ID type, e.g. National ID, you will often find misspelled or missing names. So one National ID in our database could end up with about 6 different names, because the person's name was captured differently by the various banks where he has transacted.
And that is just the tip of the iceberg. We have issues with addresses, telephone numbers, etc.
Do you have any insight into how I should structure my database to ensure we capture all data from all banks and provide the most accurate information possible about an individual? Better yet, do you have experience with this type of setup?
Thanks.
how do I show in the DB that Passport_X is related to NationalID_Y and MilitaryNumber_Z?
Trivial.
You have an identity table with an AlternateId field that is set when the identity is linked to another one. Use the first identity you created as the master; any alternative will have AlternateId pointing to it.
You need to separate the identity from the data in it, so you can have alternate versions of it, possibly with an origin and a timestamp. You will likely need to fully support versioning and tying different identities to each other as alternates, including generating a "master identity", possibly by algorithm, that holds the "official" (i.e. consolidated) version of your data.
The details are complex; mostly you have to make a LOT of compromises without killing performance, so in the end: HIRE A SPECIALIST. There is a reason people work as senior database designers or architects with 20+ years of experience finding the optimal solution given constraints you may not even be aware of (application-wise).
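For illustration only, here is a minimal sketch of the AlternateId idea using SQLite from Python; the table and column names are assumptions, not a prescribed design:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE identity (
    identity_id   INTEGER PRIMARY KEY,
    id_type       TEXT NOT NULL,      -- 'passport', 'national_id', 'military_id'
    id_number     TEXT NOT NULL,
    alternate_id  INTEGER REFERENCES identity(identity_id)  -- NULL for the master
);
CREATE TABLE identity_name (
    identity_id   INTEGER REFERENCES identity(identity_id),
    full_name     TEXT NOT NULL,
    source        TEXT NOT NULL,      -- which bank reported it
    reported_at   TEXT NOT NULL       -- when it was reported
);
""")

# Passport_X is the master; NationalID_Y and MilitaryNumber_Z point to it.
conn.execute("INSERT INTO identity VALUES (1, 'passport',    'X', NULL)")
conn.execute("INSERT INTO identity VALUES (2, 'national_id', 'Y', 1)")
conn.execute("INSERT INTO identity VALUES (3, 'military_id', 'Z', 1)")

# All identities that resolve to the same person:
rows = conn.execute(
    "SELECT id_type, id_number FROM identity "
    "WHERE identity_id = 1 OR alternate_id = 1").fetchall()
print(rows)  # [('passport', 'X'), ('national_id', 'Y'), ('military_id', 'Z')]

In practice you would also keep the alternate name/address versions (here, identity_name rows with origin and timestamp) and derive a consolidated "master" record from them, as described above.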
Better yet, do you have experience with this type of setup?
Yes. Try financial information. Stock symbols / feeds / definitions are not necessarily compatible and vary by whom you get them from. Any non-trivial setup has different data feeds that may show the same item slightly differently, sometimes in error: a different name, sometimes a different price (example: ES on CME Group is 50 USD per point, but on TT FIX it is 5; to make up for it, the price is multiplied by 10, so instead of 1000.25 you get 10002.5). This is the same kind of consolidation problem, and it STINKS.
Tons of code, tons of proper database design, redoing it half a dozen times to get proper performance. This is tricky, sadly.

Is there a public geolocation system that I can query for names of local cities?

I'm looking to simplify address entry into a system where the city textbox has autosuggest, initially populated based on the user's geolocation. In the past, autosuggesting the city name has seemed prohibitively costly without knowing the province/state/country first, but it doesn't make sense to require the user to enter the address backwards, as we don't think about address information this way. On the other hand, not autosuggesting the city name means we end up with all sorts of weird and wonderful entries for misspelled cities from around the world.
I was wondering if there's a service that I can query that would automatically respond with the most appropriate city names according to not only what the user enters in the textbox, but the location of that user based on the country and political boundary they fall within?
For instance, if I am in Canada [as I am] and I enter 'Mi' then I'd be presented with all cities within Canada starting with 'Mi' until it was determined that the information I was entering wasn't Canadian at which point, it would use the next most likely configured country based on our usage pattern - i.e. it would check the U.S. next, followed by Mexico and then other less likely destinations. I can write all this myself if I had the database but I don't know where I can find one and my suspicion is that it would be less scalable than querying a pre-existing service on the web.
Looks as though MaxMind offers a free database that you could download in CSV:
There's an online demo to test it a bit if you'd like, but no way to query it through a web service.
IPInfoDB also has their database available for download. They have an XML API, but it only supports looking up the city/country for a particular IP. You're trying to do something a little wider than that: looking up every city in a particular country, with the country selected based on IP. I wouldn't expect there to be a web service for that; it's a pretty specific requirement.
Edited to add: You could use the IPInfoDB API to look up the country, though, and then generate the autocomplete suggestions from a local country/city database. That way the IP geolocation wouldn't need to be done locally. There are various places where you can get a list of cities in a particular country; for example, there are comprehensive lists maintained by the National Geospatial-Intelligence Agency.
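A rough sketch of that split, assuming the country has already been resolved from the user's IP and that the per-country city lists come from a downloaded dataset (the data below is made up):

# Sketch: country is resolved elsewhere (e.g. via an IP-geolocation API);
# autocomplete suggestions come from a local, per-country city list.
CITIES_BY_COUNTRY = {               # would be loaded from a downloaded dataset
    "CA": ["Milton", "Mississauga", "Mirabel", "Montreal"],
    "US": ["Miami", "Milwaukee", "Minneapolis"],
}

def suggest(prefix: str, countries: list[str], limit: int = 10) -> list[str]:
    """Return city names starting with `prefix`, preferring earlier countries."""
    prefix = prefix.lower()
    hits = []
    for country in countries:                      # e.g. ["CA", "US", "MX"]
        for city in CITIES_BY_COUNTRY.get(country, []):
            if city.lower().startswith(prefix):
                hits.append(city)
            if len(hits) >= limit:
                return hits
    return hits

print(suggest("Mi", ["CA", "US"]))   # Canadian matches first, then US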

Best practices for consistent and comprehensive address storage in a database [closed]

Are there any best practices (or even standards) for storing addresses in a consistent and comprehensive way in a database?
To be more specific, I believe at this stage that there are two cases for address storage:
you just need to associate an address with a person, a building or any other item (the most common case). Then a flat table with text columns (address1, address2, zip, city) is probably enough. This is not the case I'm interested in.
you want to run statistics on your addresses: how many items in a specific street, or city, or... Then you want to avoid misspellings of any sort and ensure consistency. My question is about best practices in this specific case: what are the best ways to model a consistent address database?
A country-specific design/solution would be an excellent start.
ANSWER: There does not seem to be a perfect answer to this question yet, but:
xAL, as suggested by Hank, is the closest thing to a global standard that has popped up. It seems to be quite overkill, though, and I am not sure many people would want to implement it in their database...
To start one's own design (for a specific country), Dave's link to the Universal Postal Union (UPU) site is a very good starting point.
As for France, there is a norm (not official, but a de facto standard) for addresses, which bears the lovely name AFNOR XP Z10-011 (French only) and has to be paid for. The UPU description for France is based on this norm.
I happened to find the equivalent norm for Sweden: SS 613401.
At the European level, some effort has been made, resulting in the norm EN 14142-1. It is obtainable via CEN national members.
I've been thinking about this myself as well. Here are my loose thoughts so far, and I'm wondering what other people think.
xAL (and its sister that includes personal names, xNAL) is used by both Google's and Yahoo's geocoding services, giving it some weight. But since the same address can be described in xAL in many different ways, some more specific than others, I don't see how xAL itself is an acceptable format for data storage. Some of its field names could be used, however, but in reality the only basic format that can be used among the 16 countries that my company ships to is the following:
enum address-fields
{
name,
company-name,
street-lines[], // up to 4 free-type street lines
county/sublocality,
city/town/district,
state/province/region/territory,
postal-code,
country
}
That's easy enough to map into a single database table, just allowing for NULLs on most of the columns. And it seems that this is how Amazon and a lot of organizations actually store address data. So the question that remains is how should I model this in an object model that is easily used by programmers and by any GUI code. Do we have a base Address type with subclasses for each type of address, such as AmericanAddress, CanadianAddress, GermanAddress, and so forth? Each of these address types would know how to format themselves and optionally would know a little bit about the validation of the fields.
They could also return some type of metadata about each of the fields, such as the following pseudocode data structure:
structure address-field-metadata
{
field-number, // corresponds to the enumeration above
field-index, // the order in which the field is usually displayed
field-name, // a "localized" name; US == "State", CA == "Province", etc
is-applicable, // whether or not the field is even looked at / valid
is-required, // whether or not the field is required
validation-regex, // an optional regex to apply against the field
allowed-values[] // an optional array of specific values the field can be set to
}
In fact, instead of having individual address objects for each country, we could take the slightly less object-oriented approach of having an Address object that eschews .NET properties and uses an AddressStrategy to determine formatting and validation rules:
object address
{
set-field(field-number, field-value),
address-strategy
}
object address-strategy
{
validate-field(field-number, field-value),
cleanse-address(address),
format-address(address, formatting-options)
}
When setting a field, that Address object would invoke the appropriate method on its internal AddressStrategy object.
The reason for using a SetField() method approach rather than properties with getters and setters is so that it is easier for code to actually set these fields in a generic way without resorting to reflection or switch statements.
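A compact Python sketch of that Address/AddressStrategy split; the field numbers and the single "required postal code" rule are illustrative assumptions, not part of the design above:

from dataclasses import dataclass, field

# Illustrative field numbers; a real system would use the full enumeration above.
POSTAL_CODE, COUNTRY = 1, 2

@dataclass
class AddressStrategy:
    """Country-specific validation rules (a stand-in for per-country strategies)."""
    required: set = field(default_factory=set)

    def validate_field(self, number: int, value: str) -> bool:
        # A field passes if it has a value, or if it is not required at all.
        return bool(value) or number not in self.required

@dataclass
class Address:
    strategy: AddressStrategy
    fields: dict = field(default_factory=dict)

    def set_field(self, number: int, value: str) -> None:
        # The Address delegates validation to its strategy, as described above.
        if not self.strategy.validate_field(number, value):
            raise ValueError(f"field {number} is required")
        self.fields[number] = value

# Usage: the GUI picks a strategy based on the selected country.
us_strategy = AddressStrategy(required={POSTAL_CODE})
addr = Address(strategy=us_strategy)
addr.set_field(COUNTRY, "US")
addr.set_field(POSTAL_CODE, "90210")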
You can imagine the process going something like this:
GUI code calls a factory method or some such to create an address based on a country. (The country dropdown, then, is the first thing that the customer selects, or has a good guess pre-selected for them based on culture info or IP address.)
GUI calls address.GetMetadata() or a similar method and receives a list of the AddressFieldMetadata structures as described above. It can use this metadata to determine what fields to display (ignoring those with is-applicable set to false), what to label those fields (using the field-name member), display those fields in a particular order, and perform cursory, presentation-level validation on that data (using the is-required, validation-regex, and allowed-values members).
GUI calls the address.SetField() method using the field-number (which corresponds to the enumeration above) and its given values. The Address object or its strategy can then perform some advanced address validation on those fields, invoke address cleaners, etc.
There could be slight variations on the above if we want to make the Address object itself behave like an immutable object once it is created. (Which I will probably try to do, since the Address object is really more like a data structure, and probably will never have any true behavior associated with itself.)
Does any of this make sense? Am I straying too far off the OOP path? To me, this represents a pretty sensible compromise between being so abstract that implementation is nigh impossible (xAL) and being strictly US-biased.
Update 2 years later: I eventually ended up with a system similar to this and wrote about it at my defunct blog.
I feel like this solution is the right balance between legacy data and relational data storage, at least for the e-commerce world.
I'd use an Address table, as you've suggested, and I'd base it on the data tracked by xAL.
In the UK there is a product called PAF from Royal Mail
This gives you a unique key per address - there are hoops to jump through, though.
I basically see 2 choices if you want consistency:
Data cleansing
Basic data table look ups
Ad 1. I work with the SAS System, and SAS Institute offers a tool for data cleansing. It basically performs checks and validations on your data, and suggests, for example, that "Abram Lincoln Road" and "Abraham Lincoln Road" be merged into the same street. I also think it draws on national databases containing city/postal-code matches and so on.
Ad 2. You build up a multiple-choice list (i.e. basic data), and people adding new entries pick from existing entries in your basic data. In your fact table, you store keys to street names instead of the street names themselves. If you detect a spelling error, you just correct it in your basic data, and all instances are corrected with it through the key relation (see the sketch below).
Note that these options don't rule out each other, you can use both approaches at the same time.
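A tiny sketch of option 2, with Python dictionaries standing in for the basic-data and fact tables:

# Sketch of option 2: the fact table stores a key to the street name,
# not the name itself, so a correction in the basic data fixes every row.
streets = {1: "Abram Lincoln Road"}                  # misspelled basic-data entry
addresses = [{"street_id": 1, "house_no": "12"},
             {"street_id": 1, "house_no": "14"}]

streets[1] = "Abraham Lincoln Road"                  # correct it once...
for a in addresses:                                  # ...all rows now read correctly
    print(streets[a["street_id"]], a["house_no"])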
In the US, I'd suggest choosing a National Change of Address vendor and model the DB after what they return.
The authorities on how addresses are constructed are generally the postal services, so for a start I would examine the data elements used by the postal services for the major markets you operate in.
See the website of the Universal Postal Union for very specific and detailed information on international postal address formats: http://www.upu.int/post_code/en/postal_addressing_systems_member_countries.shtml
"xAl is the closest thing to a global standard that popped up. It seems to be quite an overkill though, and I am not sure many people would want to implement it in their database..."
This is not a relevant argument. Implementing addresses is not a trivial task if the system needs to be "comprehensive and consistent" (i.e. worldwide). Implementing such a standard is indeed time-consuming, but it is nevertheless mandatory to meet the specified requirement.
Normalize your database schema and you'll have the perfect structure for correct consistency. Here is why:
http://weblogs.sqlteam.com/mladenp/archive/2008/09/17/Normalization-for-databases-is-like-Dependency-Injection-for-code.aspx
I asked something quite similar earlier: Dynamic contact information data/design pattern: Is this in any way feasible?
The short answer: storing addresses, or any kind of contact information, in a database is complex. The extensible Address Language (xAL) link above has some interesting information and is the closest thing to a standard/best practice that I've come across...

What is the "best" way to store international addresses in a database?

What is the "best" way to store international addresses in a database? Answer in the form of a schema and an explanation of the reasons why you chose to normalize (or not) the way you did. Also explain why you chose the type and length of each field.
Note: You decide what fields you think are necessary.
Plain freeform text.
Validating all the world's post/zip codes is too hard; a fixed list of countries is too politically sensitive; a mandatory state/region/other administrative subdivision is just plain inappropriate (all too often I'm asked which county I live in, when I don't live in one, because Greater London is not a county at all).
More to the point, it's simply unnecessary. Your application is highly unlikely to be modelling addresses in any serious way. If you want a postal address, ask for the postal address. Most people aren't so stupid as to put in something other than a postal address, and if they do, they can kiss their newly purchased item bye-bye.
The exception to this is if you're doing something that's naturally constrained to one country anyway. In this situation, you should ask for, say, the { postcode, house number } pair, which is enough to identify a postal address. I imagine you could achieve similar things with the extended zip code in the US.
In the past, I've modeled forms that needed to be international after the UPS/FedEx shipping-address forms on their websites (I figured that if they don't know how to handle an international order, we are all hosed). The fields they use can be used as a reference for setting up your schema.
In general, you need to understand why you want an address. Is it for shipping/mailing? Then there is really only one requirement: have the country separate. The other lines are freeform, to be filled in by the user. The reason for this is the common forwarding strategy for mail: any incoming mail for a foreign country is forwarded without looking at the other address lines. Hence, the detailed information is parsed only by the mail sorter located in the destination country itself. Like the receiver, they'll be familiar with the national conventions.
(UPS may bunch together some small European countries, e.g. all the Low Countries are probably served from Belgium; the idea still holds.)
I think adding country/city plus free-text address lines will be fine. Country and city should be separate columns for reporting. Managers always ask for these kinds of reports, which you do not expect, and I don't like running a LIKE query over a large database.
Not to give Facebook undue respect, but the overall structure of the database seems to be overlooked in many web applications launching every day. Obviously I don't think there is a perfect solution that covers all the potential variables of address structure without some hard work. That said, combined with autocomplete, Facebook manages to take location input data and eliminate the majority of redundant entries. They do this by organizing their database well enough to provide autocomplete information in a low-cost, low-error way to the client in real time, allowing users to more or less choose the correct location from an existing list.
I think the best solution is to access a third-party database that covers your desired geographic scope and use it to initially seed your user location information. This lets you avoid the groundwork of creating your own. With any luck you can reduce the load on your server by letting new users receive the correct autocomplete information directly from your third-party supplier. Eventually you will be able to fill most autocomplete for location information, such as city and country, from information contained in your own database, built up from user input.
You need to provide a bit more details about how you are planning to use the data. For example, fields like City, State, Country can either be text in the single table, or be codes which are linked to a separate table with a Foreign Key.
The simplest would be the following (a table sketch follows the list):
Address_Line_01 (Required, Non blank)
Address_Line_02
Address_Line_03
Landmark
City (Required)
Pin (Required)
Province_District
State (Required)
Country (Required)
All the above can be Text/Unicode with appropriate field lengths.
Phone Numbers as applicable.
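For illustration, a minimal sketch of that table using SQLite from Python; the names, types and constraints are placeholders rather than recommendations:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE address (
    address_id        INTEGER PRIMARY KEY,
    address_line_01   TEXT NOT NULL CHECK (address_line_01 <> ''),
    address_line_02   TEXT,
    address_line_03   TEXT,
    landmark          TEXT,
    city              TEXT NOT NULL,
    pin               TEXT NOT NULL,      -- postal/ZIP code, kept as text
    province_district TEXT,
    state             TEXT NOT NULL,
    country           TEXT NOT NULL,
    phone             TEXT
)
""")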
