Which bank was used for the data for this dataset? - dataset

In the dataset linked here: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing , it uses the data from a Portuguese banking institution. However it does not say which exact bank this is or the name of the bank. How do I found out the exact bank/s that was used for the data in this dataset?
I read this from the source: https://www.sciencedirect.com/science/article/abs/pii/S016792361400061X however it also did not name the exact bank but just said a Portuguese banking institution.

Related

Big Data Algorithm Validation legal issues

I am at the beginning of my personal big data project.
In the end -after I have implemented a specific algorithm- I want to validate my algorithm.
Unfortunately in order to validate the output of my algorithm, I am in need of personal information, such as:
Name, Surname
Occupation/Job
Organization name
Age
Location
Years of working experience
Income per month/year
So my question is:
How can I collect data without violating laws?
My thoughts on the question:
I know there are business-oriented social networks like LinkedIn or Xing, but I need additional information like income, which is not available through the social networks APIs or is illegal to store and use.

Datasets for clustering algorithms

I am asked to give a lecture on clustering algorithms for an audience that is not very technical. With that in mind, I wanted to do a simple exercise where I will ask the audience to identify groups from a dataset. However, I cannot find good datasets that could be usable for this purpose.
Is there a dataset of customers and some products they have bought that I can use for this purpose? Or any other dataset that might look suitable!
I can suggest a simple geo location database for example all cities in germany. I think you can find it for free. Or you can look for the NASA sky data. Would be nice to cluster too.
Here is the Ta-Feng dataset containing 4 months of transactions. Got it from Prof. Chun Nan himself. It is now stored in my dropbox folder: https://www.dropbox.com/s/tsd5zd8a7afmzs7/D11-02.ZIP?dl=0 The first line of each file shows the column names in Chinese. In English is:
Date; Membership Card ID; Product Category; Product Code; Quantity; Total Transaction Amount (in TWD)

UK Zip Code List - Nearest Five Zip List

Does anyone know of a free postcode CSV for UK? I have one for the USA but can't fined a UK equivalent.
I want to implement nearest Five kind of logic based on long, Lat.
Kindly let me know if any one is aware.
Regards,
Jigar
It looks like you need to explore the Ordnance Survey OpenData dataset.
The UK doesn't have zip codes; rather, it has postcodes, which are much more precise. For example the building where I work has a postcode to itself, BS1 2PH.
Commercial organisations that validate postcodes use the Postcode Address File, which is commercially licensed (and expensive!), mostly as maintaining a per-premise address format is very time-consuming.
In practice most Web sites. etc. use Web services such as Postcode Anywhere which aggregate and resell the data.

Is there a public geolocation system that I can query for names of local cities?

I'm looking to attempt to simplify address entry into a system where the city textbox has autosuggest initially populated by the user's geolocation. In the past it has seemed that autosuggesting the city name is prohibitively costly without knowing the province/state/country first but it doesn't make sense to require the user to enter the address backwards as we don't think about address information this way. On the other hand, not autosuggesting the city name means we end up with all sorts of weird and wonderful entries for mis-spelled cities from around the world.
I was wondering if there's a service that I can query that would automatically respond with the most appropriate city names according to not only what the user enters in the textbox, but the location of that user based on the country and political boundary they fall within?
For instance, if I am in Canada [as I am] and I enter 'Mi' then I'd be presented with all cities within Canada starting with 'Mi' until it was determined that the information I was entering wasn't Canadian at which point, it would use the next most likely configured country based on our usage pattern - i.e. it would check the U.S. next, followed by Mexico and then other less likely destinations. I can write all this myself if I had the database but I don't know where I can find one and my suspicion is that it would be less scalable than querying a pre-existing service on the web.
Looks as though MaxMind offers a free database that you could download in CSV:
There's an online demo to test it a bit if you'd like, but no way to query it through a web service.
IPInfoDB also has their database available for download - they have an XML API, but it only supports looking up the city/country for a particular IP. You're trying to do something a little more wide than that, looking for every city in a particular country, with country selected based on IP. I wouldn't expect that there's a web service for that, it's a pretty specific requirement.
Edited to add: You could use the IPInfoDB API to look up the country though, and then generate the autocomplete suggestions from a local country/city database. That way all the IP-geolocation wouldn't need to be done locally. There are various places that you can get a list of cities in a particular country. For example, here's some comprehensive lists maintained by the National Geospatial-Intelligence Agency

Best practices for storing postal addresses in a database (RDBMS)?

Are there any good references for best practices for storing postal addresses in an RDBMS? It seems there are lots of tradeoffs that can be made and lots of pros and cons to each to be evaluated -- surely this has been done time and time again? Maybe someone has at least written done some lessons learned somewhere?
Examples of the tradeoffs I am talking about are storing the zipcode as an integer vs a char field, should house number be stored as a separate field or part of address line 1, should suite/apartment/etc numbers be normalized or just stored as a chunk of text in address line 2, how do you handle zip +4 (separate fields or one big field, integer vs text)? etc.
I'm primarily concerned with U.S. addresses at this point but I imagine there are some best practices in regards to preparing yourself for the eventuality of going global as well (e.g. naming fields appropriately like region instead of state or postal code instead of zip code, etc.
For more international use, one schema to consider is the one used by Drupal Address Field. It's based on the xNAL standard, and seems to cover most international cases. A bit of digging into that module will reveal some nice pearls for interpreting and validating addresses internationally. It also has a nice set of administrative areas ( province, state, oblast, etc ) with ISO codes.
Here's the gist of the schema, copied from the module page:
country => Country (always required, 2 character ISO code)
name_line => Full name (default name entry)
first_name => First name
last_name => Last name
organisation_name => Company
administrative_area => State / Province / Region (ISO code when available)
sub_administrative_area => County / District (unused)
locality => City / Town
dependent_locality => Dependent locality (unused)
postal_code => Postal code / ZIP Code
thoroughfare => Street address
premise => Apartment, Suite, Box number, etc.
sub_premise => Sub premise (unused)
A lessons I've learned:
Don't store anything numerically.
Store country and administrative area as ISO codes where possible.
When you don't know, be lax about requiring fields. Some country may not use fields you take for granted, even basic things like locality & thoroughfare.
As an 'international' user, there is nothing more frustrating than dealing with a website that is oriented around only US-format addresses. It's a little rude at first, but becomes a serious problem when the validation is also over-zealous.
If you are concerned with going global, the only advice I have is to keep things free-form. Different countries have different conventions - in some, the house number comes before the street name, in some it comes after. Some have states, some regions, some counties, some combinations of those. Here in the UK, the zipcode is not a zipcode, it's a postcode containing both letters and numbers.
I'd advise simply ~10 lines of variable-length strings, together with a separate field for a postcode (and be careful how you describe that to cope with national sensibilities). Let the user/customer decide how to write their addresses.
If you need comprehensive information about how other countries use postal addresses, here's a very good reference link (Columbia University):
Frank's Compulsive Guide to Postal Addresses
Effective Addressing for International Mail
You should definitely consider storing house number as a character field rather than a number, because of special cases such as "half-numbers", or my current address, which is something like "129A" — but the A is not considered as an apartment number for delivery services.
I've done this (rigorously model address structures in a database), and I would never do it again. You can't imagine how crazy the exceptions are that you'll have to take into account as a rule.
I vaguely recall some issue with Norwegian postal codes (I think), which were all 4 positions, except Oslo, which had 18 or so.
I'm positively sure that from the moment we started using the geographically correct ZIP codes for all of our own national addresses, quite a few people started complaining that their mail arrived too late. Turned out those people were living near a borderline between postal areas, and despite the fact that someone really lived in postal area, say, 1600, in reality his mail should be addressed to postal area 1610, because in reality it was that neighbouring postal area that actually served him, so sending his mail to his correct postal area would take that mail a couple of days longer to arrive, because of the unwanted intervention that was required in the correct postal office to forward it to the incorrect postal area ...
(We ended up registering those people with an address abroad in the country with ISO-code 'ZZ'.)
Unless you are going to do maths on the street numbers or zip / postal codes, you are just inviting future pain by storing them as numerics.
You might save a few bytes here and there, and maybe get a faster index, but what do you when US postal, or whatever other country you are dealing with, decides the introduce alphas into the codes?
The cost of disk space is going to be a lot cheaper than the cost of fixing it later on... y2k anybody?
Adding to what #Jonathan Leffler and #Paul Fisher have said
If you ever anticipate having postal addresses for Canada or Mexico added to your requirements, storing postal-code as a string is a must. Canada has alpha-numeric postal codes and I don't remember what Mexico's look like off the top of my head.
You should certainly consult "Is this a good way to model address information in a relational database", but your question is not a direct duplicate of that.
There are surely a lot of pre-existing answers (check out the example data models at DatabaseAnswers, for example). Many of the pre-existing answers are defective under some circumstances (not picking on DB Answers at all).
One major issue to consider is the scope of the addresses. If your database must deal with international addresses, you have to be more flexible than if you only have to deal with addresses in one country.
In my view, it is often (which does not mean always) sensible to both record the 'address label image' of the address and separately analyze the content. This allows you to deal with differences between the placement of postal codes, for example, between different countries. Sure, you can write an analyzer and a formatter that handle the eccentricities of different countries (for instance, US addresses have 2 or 3 lines; by contrast, British addresses can have considerably more; one address I write to periodically has 9 lines). But it can be easier to have the humans do the analysis and formatting and let the DBMS just store the data.
Ive found that listing all possible fields from smallest discrete unit to largest is the easiest way. Users will fill in the fields they see fit. My address table looks like this:
*********************************
Field Type
*********************************
address_id (PK) int
unit string
building string
street string
city string
region string
country string
address_code string
*********************************
Where's the "trade off" in storing the ZIP as a NUMBER or VARCHAR? That's just a choice -- it's not a trade off unless there are benefits to both and you have to give up some benefits to get others.
Unless the sum of zips has any meaning at all, Zips as number is not useful.
This might be an overkill, but if you need a solution that would work with multiple countries and you need to programmatically process parts of the address:
you could have country specific address handling using two tables: One generic table with 10 VARCHAR2 columns, 10 Number columns, another table which maps these fields to prompts and has a country column tying an address structure to a country.
If you ever have to verify an address or use it to process credit card payments, you'll at least need a little structure. A free-form block of text does not work very well for that.
Zip code is a common optional field for validating payment card transactions without using the whole address. So have a separate and generously sized field for that (at least 10 chars).
Inspired by Database Answers
Line1
Line2
Line3
City
Country_Province
PostalCode
CountryId
OtherDetails
At the moment, I'm developing an international ecommerce website.
It should cover almost all addresses in this world as shown below:
*****************************************************************
Type Field name Displayed name in your form
*****************************************************************
INT id (PK)
VARCHAR(100) building Apt, office, suite, etc. (Optional)
VARCHAR(100) street Street address
VARCHAR(100) city City
VARCHAR(100) state State, province or prefecture
VARCHAR(100) zip_code Zip code
VARCHAR(100) country Country
*****************************************************************
I would just put all the fields together in a large NVARCHAR(1000) field, with a textarea element for the user to enter the value for (unless you want to perform analysis on eg. zip codes). All those address line 1, address line 2, etc. inputs are just so annoying if you have an address that doesn't fit well with that format (and, you know, there are other countries than the US).

Resources