Datasets for clustering algorithms - dataset

I have been asked to give a lecture on clustering algorithms to an audience that is not very technical. With that in mind, I want to run a simple exercise in which I ask the audience to identify groups in a dataset. However, I cannot find a good dataset for this purpose.
Is there a dataset of customers and the products they have bought that I could use? Or any other dataset that might be suitable?

I can suggest a simple geolocation dataset, for example all cities in Germany. I think you can find one for free. You could also look at the NASA sky data, which would be nice to cluster too.

Here is the Ta-Feng dataset, containing 4 months of transactions. I got it from Prof. Chun Nan himself. It is now stored in my Dropbox folder: https://www.dropbox.com/s/tsd5zd8a7afmzs7/D11-02.ZIP?dl=0 The first line of each file gives the column names in Chinese. In English, they are:
Date; Membership Card ID; Product Category; Product Code; Quantity; Total Transaction Amount (in TWD)
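For a clustering exercise, the transaction rows usually need to be rolled up to one row per customer first. The following is a minimal sketch of that step, not a tested loader for the real files: the semicolon separator, the short field names in COLUMNS, and the sample rows are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows in the column order listed above (not real Ta-Feng data).
# A real file would also need its first (header) line skipped.
SAMPLE = """2000-11-01;C001;100205;P4710;2;60
2000-11-01;C002;110117;P4712;1;150
2000-11-03;C001;100205;P4710;1;30
2000-11-05;C002;110117;P4713;3;450
"""

# Assumed short names for the six columns; the separator is also an assumption.
COLUMNS = ["date", "card_id", "category", "product", "quantity", "amount"]

def customer_features(lines):
    """Aggregate transactions into one (visit days, total spend) pair per customer."""
    days = defaultdict(set)
    spend = defaultdict(float)
    for row in csv.DictReader(lines, fieldnames=COLUMNS, delimiter=";"):
        days[row["card_id"]].add(row["date"])
        spend[row["card_id"]] += float(row["amount"])
    return {cid: (len(days[cid]), spend[cid]) for cid in days}

features = customer_features(io.StringIO(SAMPLE))
for card_id, (visits, total) in sorted(features.items()):
    print(card_id, visits, total)
```

Two numbers per customer can be drawn on a scatter plot, which lets a non-technical audience point at groups by eye before any algorithm is introduced.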

Related

I'd like to hear thoughts about using time-series databases for this project:

The project is to collect longitudinal data on inmates in a state prison system with the goals of recognizing time-based patterns and empowering prison justice advocates. The question is what time-series DB should I use?
My starting point is this article:
https://medium.com/schkn/4-best-time-series-databases-to-watch-in-2019-ef1e89a72377
and it's looking like the first 3 (InfluxDB, TimescaleDB, OpenTSDB) are on the table, but not so much the last one (I'm dealing with much more than strictly numerical data)
Project details:
Currently I'm using Postgres and plan to update the schema to look like (in broad strokes):
low-volatility fields like: id number, name, race, gender, date-of-birth
higher-volatility fields like: current facility, release date, parole eligibility date, etc
time-series admin data: begin current, end current, period checked. These show the time period during which the two groups of fields above were current and how often they were checked for changes.
I'm thinking it would be better to move to a time-series db and keep track of each individual update instead of having some descriptive info associated with a start date, end date, and period checked field. (like valid 2020-01-01 to 2021-08-25, checked every 14 days)
What I want to prioritize is speed of pulling reports (like what percentage of inmates grouped by certain demographics have exited the system before serving 90% of their sentence?) over insert throughput and storage space. I'm also interested in hearing opinions on ease of learning, prominence in the industry, etc.
My background:
I'm a bootcamp-educated generalist in data science with a background in CS. I've worked with SQL (Postgres, SQLite) and NoSQL (Mongo) databases in the past, and my DB-modeling ability is from an undergrad databases class. I'm most familiar with Java and Python (and many of the data science python packages), but learning a new language isn't a huge hurdle.
Thanks for your time!
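To make the two layouts in the question concrete: a log with one row per check can always be collapsed into "valid from / valid to" periods, so keeping every update does not rule out the summary form. A minimal sketch, assuming a hypothetical check log of (inmate id, date checked, facility) tuples; the field choice and sample values are invented for illustration.

```python
from itertools import groupby

# Hypothetical check log: (inmate_id, date checked, facility seen on that date).
observations = [
    (1, "2020-01-01", "Facility A"),
    (1, "2020-01-15", "Facility A"),
    (1, "2020-01-29", "Facility B"),
    (2, "2020-01-01", "Facility C"),
    (2, "2020-01-15", "Facility C"),
]

def validity_periods(rows):
    """Collapse consecutive identical observations into
    (inmate_id, value, first_seen, last_seen) tuples."""
    periods = []
    ordered = sorted(rows)  # by inmate_id, then ISO date string
    for (inmate_id, value), run in groupby(ordered, key=lambda r: (r[0], r[2])):
        dates = [r[1] for r in run]
        periods.append((inmate_id, value, dates[0], dates[-1]))
    return periods

periods = validity_periods(observations)
for period in periods:
    print(period)
```

The reverse direction is lossy: once only the periods are stored, the individual check dates cannot be recovered, which is one argument for storing the raw updates.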

Database schema naming conventions and common mistakes?

Image: https://i.stack.imgur.com/VYkV6.png
I'm asked to design a relational database to keep data to answer clinic operation queries such as:
● List the patient appointments for each doctor for a given date.
● When a patient rings to make an appointment, give the available time slots for a given date.
● Retrieve the address of patients to send notices via mail services.
I have one database schema of one relation as shown below, but I was wondering whether there were any mistakes I've made?
ABC(doc-name, doc-gender, registration_num, qualification, pat-name, pat-gender, DOB, address, phone-num, appoint-date, appoint-time, type)
Is the use of words such as date and the use of hyphens generally discouraged? Are there any other weaknesses in my design?
Thank you
So, that's not a schema or a design. Not for a relational database, which, based on the tags for the question, is what you're looking for. That's the storage definition for an ID/Value style of database. If you're looking for actual relational storage, you should be building out those relationships through the process of normalization.
For example, let's start at the beginning with doc-name. (I am personally not crazy about using hyphens, but it's not a showstopper; just be sure whichever RDBMS you're working with supports them in names and you're good to go.) If we think about this just from a data-entry standpoint, we don't want to have to type in the name of the doctor every time we use that doctor. Instead, we'd want to pull that from a list. So, clearly, we can break that apart from the rest of the information. That is the beginning of our normalization process. We can also easily note that a patient is likely to have more than one appointment. Under the current structure, we'd have to re-enter every bit of patient information for each appointment. That's another place where we'd break this apart.
There is a lot more in this simple example that could be split out and normalized.
I'd suggest you read up on data normalization. My favorite teacher on the subject is Louis Davidson. Here's his book on the topic. Read that and then try to readdress the situation you're facing.
I'm assuming this isn't just homework. If it is, currently, I'd give you an "F". If it isn't, you should track down someone to give you a hand with this database design. You won't be able to quickly read Louis' book on the topic and turn around even a rough working design in any reasonable period of time.
I have to second what Grant said, this is not a relational design at all.
Stop and ask yourself for example what happens if Steven Arrow has to take an afternoon off and update his schedule. You need to be very careful updating the database lest you reassign all his patients.
Spending a total of 5 minutes on this, I see at the very least:
A Doctors table, a Patients table, and probably a table of open appointment times (which, by the way, is a bit harder than you think, so you'll have to give some thought to how to handle it and do some reading up on tables for scheduling).
That's for starters. I might break out patients' phone numbers into their own table. Why? Well, how many columns do you want to have for phone numbers? One? What if they have a work AND a home number? Or work and cell and home? And more.
The concept you're looking for is normal forms. You don't need to go overboard, but generally 3NF is about right.
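As a rough illustration of where that normalization might land, here is a minimal sketch using SQLite from Python. The table names, the surrogate patient_id key, the kind column for phone numbers, and the sample rows are assumptions for illustration, not the one correct design; the other column names are taken from the ABC relation in the question with hyphens replaced by underscores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE doctor (
    registration_num TEXT PRIMARY KEY,
    name             TEXT NOT NULL,
    gender           TEXT,
    qualification    TEXT
);
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    gender     TEXT,
    dob        TEXT,
    address    TEXT
);
-- One row per phone number, so a patient can have home, work, cell, ...
CREATE TABLE patient_phone (
    patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
    kind       TEXT NOT NULL,
    phone_num  TEXT NOT NULL,
    PRIMARY KEY (patient_id, kind)
);
-- The primary key stops a doctor from being double-booked for one slot.
CREATE TABLE appointment (
    registration_num TEXT NOT NULL REFERENCES doctor(registration_num),
    patient_id       INTEGER NOT NULL REFERENCES patient(patient_id),
    appoint_date     TEXT NOT NULL,
    appoint_time     TEXT NOT NULL,
    appoint_type     TEXT,
    PRIMARY KEY (registration_num, appoint_date, appoint_time)
);
""")

# Hypothetical sample data.
conn.execute("INSERT INTO doctor VALUES ('R100', 'Dr. Example', 'F', 'MBBS')")
conn.execute("INSERT INTO patient VALUES (1, 'Pat Sample', 'M', '1980-02-03', '1 Sample St')")
conn.execute("INSERT INTO patient_phone VALUES (1, 'home', '555-0100')")
conn.execute("INSERT INTO appointment VALUES ('R100', 1, '2024-05-01', '09:00', 'checkup')")

# "List the patient appointments for each doctor for a given date."
rows = conn.execute("""
    SELECT a.appoint_time, p.name
    FROM appointment AS a
    JOIN patient AS p ON p.patient_id = a.patient_id
    WHERE a.registration_num = ? AND a.appoint_date = ?
    ORDER BY a.appoint_time
""", ("R100", "2024-05-01")).fetchall()
print(rows)
```

With this split, changing a doctor's details touches one row in doctor, and patient data is entered once no matter how many appointments follow. Finding open time slots still needs a separate model of each doctor's schedule, which this sketch leaves out.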

sql, product inventory module

I have been tasked with creating a product inventory module. After reading all the posts I can find on Stack Overflow, I have decided the best way is to not keep a separate, running ‘balance’, but to create one on the fly. I have attached a representation of the tables involved.
Actually, it seems like I don't have enough reputation points to include a picture, so here is a link to a dropbox file:
So I have two questions, which are somewhat related, so it seems like I should include them in the same posting, though I am not a frequent poster and I am an SQL noob. So please excuse me if I am displaying my ignorance of posting or of SQL.
First, does this look correct (I named all the columns as non-opaque as possible)? I have to create reports that show the current inventory balance for all the products and for products individually as well as a ‘Transaction Register’ with running balance.
Second, provided the first answer is yes, is this a good candidate for creating a view?
Complex question. Difficult to answer without understanding the full scope of the project. One point: I see there is no Current On-hand table. I agree that the running balance at any point in time is best produced by a calculated table. It is, however, common practice to also keep a current on-hand table. This gives you the on-hand inventory and values without having to sum up the transactions. This is the approach in Microsoft Axapta and other products I have worked with.
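Since the linked table diagram is not available here, the following is only a sketch of the "calculate it on the fly" approach under assumed names: a hypothetical inventory_txn table with a signed qty_change column, an on_hand view that sums it per product, and a running balance for the transaction register computed in Python. It uses SQLite purely for illustration.

```python
import sqlite3
from itertools import accumulate

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventory_txn (
    txn_id     INTEGER PRIMARY KEY,
    product_id TEXT NOT NULL,
    txn_date   TEXT NOT NULL,
    qty_change INTEGER NOT NULL  -- positive = receipt, negative = issue
);
-- Current balance per product, derived from the transactions on demand.
CREATE VIEW on_hand AS
    SELECT product_id, SUM(qty_change) AS balance
    FROM inventory_txn
    GROUP BY product_id;
""")

# Hypothetical sample transactions.
conn.executemany(
    "INSERT INTO inventory_txn (product_id, txn_date, qty_change) VALUES (?, ?, ?)",
    [("widget", "2024-01-01", 10), ("widget", "2024-01-05", -3),
     ("gadget", "2024-01-02", 5), ("widget", "2024-01-09", 4)],
)

def register(conn, product_id):
    """Transaction register for one product: (date, change, running balance)."""
    rows = conn.execute(
        "SELECT txn_date, qty_change FROM inventory_txn "
        "WHERE product_id = ? ORDER BY txn_date, txn_id",
        (product_id,),
    ).fetchall()
    running = accumulate(qty for _, qty in rows)
    return [(date, qty, balance) for (date, qty), balance in zip(rows, running)]

balances = dict(conn.execute("SELECT product_id, balance FROM on_hand"))
widget_register = register(conn, "widget")
print(balances)
print(widget_register)
```

A view like this keeps a single source of truth; a stored on-hand table, as described above, trades that for faster reads and has to be kept in step with the transactions.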

comparing similar words and phrases

I have 2 databases in excel. In database A, I have the names of various companies, cities, and charities. Database B is the same. However Database B is filled out by the customer. As such, I get a lot of random mistakes and/or variations on the legal name.
What is the best way to match the names?
Here are some sample differences:
City of ABC might show up as Corporation of the City of ABC
ABC Corporation might show up as ABCcorporation (they forgot a space)
University of ABC may be abbreviated as Univ of ABC
Canadian Tire might show up as Canadian Tire Store #503
Canadian Tire might be spelt wrong like Canadia Tire
ABC Corp might show up as ABC Inc
Is there a good solution to this? I know this question is a bit of a long shot, but if I can do this I will have saved people in my company like thousands of hours each year...
Any advice will be greatly appreciated
This is a very complex problem. Look up "master data management" and "dedup". This wikipedia article is a good starting point.
The problem is best solved in small chunks. My recommendation is to read up a little and implement a tool that lists potential duplicates and some easy way to merge them. The keyword here is potential; you don't want to do wrong merges and false positives are very likely and very harmful.
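A tool that lists potential duplicates could start as small as the sketch below. It normalizes each name and then scores similarity with Python's standard difflib; the noise-word list, the abbreviation table, and the 0.85 threshold are assumptions picked for the sample names in the question and would need tuning against real data. The output is meant for a person to review, not for automatic merging.

```python
import re
from difflib import SequenceMatcher

# Hypothetical noise words and abbreviations; real lists come from inspecting the data.
NOISE = {"corporation", "corp", "inc", "ltd", "the", "of", "store"}
ABBREVIATIONS = {"univ": "university"}

def normalize(name):
    """Lowercase, drop store numbers and punctuation, expand abbreviations, drop noise words."""
    text = name.lower()
    text = re.sub(r"#\d+", " ", text)
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]
    return " ".join(w for w in words if w not in NOISE)

def potential_matches(official, submitted, threshold=0.85):
    """Return (submitted name, best official name, score) for likely matches only."""
    results = []
    for raw in submitted:
        scored = [
            (SequenceMatcher(None, normalize(raw), normalize(o)).ratio(), o)
            for o in official
        ]
        score, best = max(scored)
        if score >= threshold:
            results.append((raw, best, round(score, 2)))
    return results

official = ["Canadian Tire", "University of ABC", "City of ABC"]
submitted = [
    "Canadia Tire",
    "Univ of ABC",
    "Canadian Tire Store #503",
    "Corporation of the City of ABC",
]
matches = potential_matches(official, submitted)
for match in matches:
    print(match)
```

Names that fall below the threshold are simply not listed, which errs on the side of missing a match rather than proposing a wrong merge.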
You could use regular expressions to filter these databases.
http://en.wikipedia.org/wiki/Regular_expression
http://www.zytrax.com/tech/web/regex.htm
You can have a program pattern match based on the relevant part of a company name.
For example, if someone puts in Microsoft Corporation of Redmond, and your program pattern matches against 'Microsoft', you'd get a hit.
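A minimal sketch of that idea, assuming a hypothetical hand-maintained table that maps each canonical name to a regular expression for its key fragment:

```python
import re

# Hypothetical table: canonical name -> pattern for the part of the name that matters.
KEY_PATTERNS = {
    "Microsoft": re.compile(r"\bmicrosoft\b", re.IGNORECASE),
    "Canadian Tire": re.compile(r"\bcanadian\s+tire\b", re.IGNORECASE),
}

def canonical_name(submitted):
    """Return the first canonical name whose pattern occurs in the submitted text."""
    for canonical, pattern in KEY_PATTERNS.items():
        if pattern.search(submitted):
            return canonical
    return None

print(canonical_name("Microsoft Corporation of Redmond"))
print(canonical_name("Canadian Tire Store #503"))
print(canonical_name("Canadia Tire"))
```

The last call returns None: a plain pattern handles extra words around the name but not misspellings, so those cases still need some form of approximate matching.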

"Parametrized" database model & backend storage system as well as data mining manipulation

I have implicitly made this a community wiki seeing that the answers can be quite broad.
I'm working with a start-up company to accomplish the following goal.
In medical research, a patient's medical record can hold an unbounded amount of data about the patient for a specific diagnosis, e.g. a smoker has a higher chance of developing lung cancer, but that doesn't necessarily mean that a non-smoker can't develop lung cancer. My goal is to create or use a database model that can deal with such parameters.
I also have to come up with ways to mine this parametrized data to produce statistics, e.g. to see the trends for all 40-year-old females who have suffered from lung cancer. The report can be generic (graph, tabular, etc.), so that doctors can see trends or analyse possible solutions that could work.
My questions are:
1) Which database systems allow for parametrized backend storage (e.g. Cassandra), can easily be used from Java, and are very efficient at data retrieval, linkage, etc.? We are dealing with a large number of patient records per state.
2) What algorithms or AI techniques can I use for data mining? Is there any mining techniques out there that can help me do this?
PS How does Google Analytics deal with parametrised data?
PPS Parametrized data is data that has a key and a value, where the value can be a plain value, another key-value pair, a list of values, or a set of parametrized data (organized or unorganized).
I'm looking forward to your suggestions! :-D
I'll try to answer your first question only.
Cassandra is a key-value datastore (in your terms, parametrized). If you use Cassandra, you will need more computation time to derive complex reports, because it stores data in raw format. NoSQL databases like Cassandra are good if you want to scale very, very big. They are eventually consistent and compromise on data replication and latency.
In your case, since a patient's data can take practically any form, try to fit the model of a triple store (Semantic Web frameworks like Jena, Sesame, etc.). They allow you to have loose data structures that can be molded at runtime. Also, their query engines (SPARQL, SeRQL) give you more power than NoSQL stores (like Cassandra), although these querying capabilities are obviously weaker than those of an RDBMS.
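To show what the triple model looks like in practice, here is a plain-Python sketch with no triple-store library involved: every fact is a (subject, predicate, object) tuple, new kinds of facts need no schema change, and a cohort query is an intersection of simple lookups. The patient identifiers, predicates, and values are invented for illustration; a real triple store would express the same query in SPARQL.

```python
# Hypothetical patient facts as (subject, predicate, object) triples.
triples = [
    ("patient:1", "age", 40),
    ("patient:1", "sex", "female"),
    ("patient:1", "diagnosis", "lung cancer"),
    ("patient:1", "smoker", True),
    ("patient:2", "age", 40),
    ("patient:2", "sex", "female"),
    ("patient:2", "diagnosis", "asthma"),
    ("patient:3", "age", 62),
    ("patient:3", "sex", "male"),
    ("patient:3", "diagnosis", "lung cancer"),
]

def subjects_with(triples, predicate, value):
    """All subjects that have the given predicate with the given value."""
    return {s for s, p, o in triples if p == predicate and o == value}

def match_all(triples, **conditions):
    """Subjects that satisfy every predicate=value condition."""
    sets = [subjects_with(triples, p, v) for p, v in conditions.items()]
    return set.intersection(*sets) if sets else set()

# "All 40-year-old females who suffered from lung cancer."
cohort = match_all(triples, age=40, sex="female", diagnosis="lung cancer")
print(cohort)
```

Note that patient:1 carries a smoker fact the others lack; nothing had to be declared in advance for that to be stored.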
For this question, this is how we have implemented this.
We created a keyspace called medical and a supercolumn family called patient.
Under the supercolumn family, we have a general supercolumn, which basically stores the patient details, and another supercolumn called operation to keep a record of the user occupation.
Don't forget that the general supercolumn keeps record of the patient as he/she comes to the doctor. That way, we know exactly the patient's exact condition before, during and after operation.
I know some data can be duplicated, but no two supercolumns can be identical, as there is no way to have two different patients with exactly identical attributes and sickness.
So basically, Cassandra allows 3 layers of abstraction, Keyspace, Column/Supercolumn family, Column/Supercolumn.
Hope this can help somebody.
