How to sample data that has to be distributed across different criteria - sql-server

I am looking for a way to sample data using two different criteria. Is there anyone who can assist?
I have a cleaned data set with 2000 records. I would like to sample 100 clients, distributed as 80% employed and 20% self-employed. Furthermore, I have to apply another criterion: each of the employed and self-employed samples has to be further distributed by profession: 20% lawyers, 10% doctors, 50% engineers and 20% accountants.
This is what the data looks like:
Client ID | self employed | Profession
123456 | yes |lawyer
123457 | no |doctor
123458 | yes |accountant
123459 | yes |accountant
123460 | yes |engineer
123461 | yes |lawyer
123462 | no |engineer
123456 | yes |doctor
123456 | yes |lawyer
123456 | yes |engineer

I can't help with the SQL, but the basic idea is straightforward. You need to cross the categories of employment by the professions, with the desired percentages in the margins. Then fill out the table by multiplying the row and column percentages:
           | employed | self-employed | total
---------- | -------- | ------------- | -----
Lawyer     | 16%      | 4%            | 20%
Doctor     | 8%       | 2%            | 10%
Engineer   | 40%      | 10%           | 50%
Accountant | 16%      | 4%            | 20%
---------- | -------- | ------------- | -----
total      | 80%      | 20%           | 100%
The entries in the table are the percentage of each crossed category you want in your sample. Since you want a total sample size of 100, multiply each fraction by 100 to get the desired sample size for that category. Given your stated proportions, you want 16 employed lawyers, 4 self-employed lawyers, 8 employed doctors, etc.
Divide your data into subsets corresponding to the 8 categories, and randomly select the appropriate number from each subset. I don't know if SQL provides a random shuffling capability, but if so that's an easy way to select the sample without replacement. Shuffle the employed lawyers and take the first 16, shuffle the self-employed lawyers and take the first 4, and so on. Note that this presumes that each category has enough elements to supply the desired sample size.
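The procedure above can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `EMPLOYMENT`, `PROFESSION`, `quotas` and `stratified_sample` are made up for the example, and the rows are assumed to arrive as (client id, self employed, profession) tuples with lower-case values, as in the question's sample data.

```python
import random
from collections import defaultdict

# Target margins from the question, as fractions of the final sample.
# Keys of EMPLOYMENT are values of the "self employed" column (assumption).
EMPLOYMENT = {"no": 0.80, "yes": 0.20}
PROFESSION = {"lawyer": 0.20, "doctor": 0.10, "engineer": 0.50, "accountant": 0.20}


def quotas(sample_size):
    """Cross the two margins: quota = size * employment share * profession share."""
    return {
        (emp, prof): round(sample_size * e_share * p_share)
        for emp, e_share in EMPLOYMENT.items()
        for prof, p_share in PROFESSION.items()
    }


def stratified_sample(clients, sample_size, seed=None):
    """clients: iterable of (client_id, self_employed, profession) tuples."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for client_id, self_employed, profession in clients:
        strata[(self_employed, profession)].append(client_id)

    sample = {}
    for key, wanted in quotas(sample_size).items():
        pool = strata[key]
        if len(pool) < wanted:
            raise ValueError(f"stratum {key} has {len(pool)} rows, need {wanted}")
        rng.shuffle(pool)            # shuffle, then take the first `wanted` ids
        sample[key] = pool[:wanted]  # sampling without replacement
    return sample
```

With a sample size of 100, `quotas(100)` gives 16 employed lawyers, 4 self-employed lawyers, and so on, matching the table. In SQL Server itself, the same shuffle-and-take-first step is commonly written as `SELECT TOP (n) ... ORDER BY NEWID()` per category.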

Related

Need help creating a simple form for reviewing a (very) large number of diagnosis codes

OK, been lurking here for a long time, but never asked a question before. Apologies for the long and complicated question. I have a very large Excel sheet with nearly 40,000 unique codes from the ICD-10 classification system, which classifies essentially all known diseases. This is a hierarchical classification system where codes are organized in 20-something chapters and gradually more specific codes, with 3 or more positions. For example, the code A22 is anthrax, with a number of sub-codes: A22.0=Cutaneous anthrax, A22.1=Pulmonary anthrax, etc. However, for some diseases, there are no 4-digit codes under the 3-digit codes (e.g. C01, below) or only one 4-digit code that is meaningful for us to recognize (e.g. C00, below). For other diseases, we want full precision (e.g. G23).
Example table
| 3-digit code | Specific code | Description |
| -------- | -------- |-------- |
| C00 | C00.0 | External upper lip |
| C00 | C00.1 | External lower lip |
| C00 | C00.2 | External lip, unspecified |
| C00 | C00.3 | Upper lip, inner aspect |
| C01 | C01 | Malignant neoplasm of base of tongue |
| G23 | G23 | Other degenerative diseases of basal ganglia |
| G23 | G23.0 | Hallervorden-Spatz disease |
| G23 | G23.1 | Progressive supranuclear ophthalmoplegia [Steele-Richardson-Olszewski] |
| G23 | G23.2 | Multiple system atrophy, parkinsonian type [MSA-P] |
| G23 | G23.3 | Multiple system atrophy, cerebellar type [MSA-C] |
The issue at hand is that I'm conducting a large-scale research study based on a health register where diagnoses are coded using this system. Due to a policy of information minimization/data privacy, we need to decide for which of these 40,000 codes we need full precision (i.e. at the 4-digit level) and for which the 3-digit code is sufficient. This is a very tedious task and I need to make it as efficient as possible. My idea is to create a simple form that links to my large table (which has the exact format as above, only longer) and presents each 3-digit code one by one, with a simple checkbox or something similar that allows me to select whether this group should have full precision.
Sorry for the stupidly long prelude, but my question is much simpler: what would be a simple way to achieve this? I don't "know" any graphical programming languages, but have used SAS, R and statistical programming systems for about 20 years, so I really just need a push in the right direction. Could it, for example, be done using Access form? Any help would be much appreciated!
Thanks,
Gustaf
So, I haven't really tried anything yet as I don't even know where to start.

SSAS - MDX calculated member

I've a fact table that details individual line amounts for orders placed by my organisation. In this fact, at line level, I've included the total order amount to be used, as it's possible we might need that level of detail at some point.
Here's an example of what I've got:-
+------------+------------+---------------+------------+---------------------+
| BookingKey | Booking_ID | Category_FKey | Line_Value | Total_Booking_Value |
+------------+------------+---------------+------------+---------------------+
| 1 | 12 | 8 | 150 | 700 |
| 2 | 12 | 4 | 150 | 700 |
| 3 | 12 | 5 | 300 | 700 |
| 4 | 12 | 4 | 100 | 700 |
+------------+------------+---------------+------------+---------------------+
As you can see, the Total_Booking_Value here is the sum of the Line_Value for the booking in the example (Booking_ID = 12).
The Category_FKey looks up to a Categories dimension.
Using this structure I've created a simple cube and this works fine, mainly.
The issue I have is that I'd like to be able to view the Total Line_Value amount, and somehow include the Total_Booking_Value alongside it.
So, for example I might add the Categories dimension as a filter and want to filter by say Category_FKey = 4.
If this was the case I'd want the aggregates to tell me that the total Line_Value was 250 (for BookingKeys 2 and 4), and the Total_Booking_Value should be 700. Using normal aggregation (ie SUM) I'm getting the Total_Booking_Value as 1400 (obviously - because it's adding 700 * 2 for the two rows the cube would return).
So, the way I see it I'd like to create an MDX calculation that somehow takes the Total_Booking_Value and gives just the value for the Booking in question.
Should this be done using some kind of average, or division by the Distinct number of items? I can't figure this out. I tried something like this:-
create member currentcube.measures.[Calculated Booking Value]
as
[Measures].[Total_Booking_Value] / count(Measures.Booking_ID);
But this isn't working.
Hopefully this makes sense and you can point me in the right direction.
I find it strange that booking_ID is a measure - intuitively it strikes me as something that would be an attribute and therefore a hierarchy - in which case you'd be able to do the count like this:
[Measures].[Total_Booking_Value]
/
COUNT(EXISTING [Booking].[Booking_ID].[Booking_ID].members)
A straightforward solution would be to have two fact tables: one with granularity booking key and one with granularity booking id. The first would contain all columns except total booking value, and the second would contain the columns booking id and total booking value.
Then both measures would be easily summable.
The relationship between the second fact table and the category dimension could be configured as many-to-many via the first fact table. Thus, you would see the full values of the involved bookings for each selected category, automatically eliminating double counting.
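To make the double-counting issue concrete, here is a small Python sketch (not MDX) of the arithmetic the two-fact-table design is meant to produce: line values are summed per row, while the booking total is counted once per booking. The function name `totals_for_category` is made up for the illustration; the rows are the ones from the question.

```python
# Rows copied from the example fact table in the question.
FACT_ROWS = [
    # (BookingKey, Booking_ID, Category_FKey, Line_Value, Total_Booking_Value)
    (1, 12, 8, 150, 700),
    (2, 12, 4, 150, 700),
    (3, 12, 5, 300, 700),
    (4, 12, 4, 100, 700),
]


def totals_for_category(rows, category):
    """Return (sum of line values, booking totals counted once per booking)."""
    line_total = 0
    booking_totals = {}  # Booking_ID -> Total_Booking_Value, one entry per booking
    for _key, booking_id, cat, line_value, booking_value in rows:
        if cat == category:
            line_total += line_value
            booking_totals[booking_id] = booking_value  # same value on every line
    return line_total, sum(booking_totals.values())


print(totals_for_category(FACT_ROWS, 4))  # (250, 700)
```

A plain SUM over the two matching rows would give 1400 for the booking total; keying by Booking_ID is what removes the duplication.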

Database Design - how to store quantities that are measured in different ways

I would like to know if the database design I have in mind for an online food store is good according to the usually followed standards and conventions.
Basically, my confusion is how to store items whose quantity is measured in different ways.
For example, there are items that are measured in kilograms and then there are items measured in number of packets.
Rice, for example, is measured in kilograms, while something like noodles would be measured in number of packets.
So the tables are planned to have the fields below:
An items table with the fields category, name, company, variant and a boolean variable named measured_in_packets.
For items where measured_in_packets is set to true, an entry in another table will hold the available packet sizes:
A packet_sizes table with item_id and packet_size.
So if one product is available in multiple packet sizes (250 gm, 500 gm etc.), a row would be made for each available size against the item id.
Does this sound like a good database design?
In a nutshell, you have items which have a quantity value, but that quantity value can be measured in different kinds of measurement types. You gave examples such as kilograms, packages, and we can perhaps add others such as litres for liquids, etc.
One of the problems with the current solution is that it doesn't allow for any easy alteration or expansion. It also relies on checking a boolean field in order to make decisions (such as which table to join, I believe, based on your description).
Instead, a better approach would be to create a table containing the possible measurement types, such as kilograms or packets. Your items then simply have a foreign key to this table, and that tells you how the item is measured. This allows you to expand the types in the future, and no need to maintain a boolean flag, or do any other manual work.
This diagram illustrates what I'm referring to:
So if the data in these tables looked like this:
items
+----+---------+----------+----------------------+
| id | name | quantity | measurement_types_id |
+----+---------+----------+----------------------+
| 1 | Rice | 50 | 1 |
| 2 | Noodles | 75 | 2 |
+----+---------+----------+----------------------+
measurement_types
+----+-----------+--------------------+
| id | name | measurement_symbol |
+----+-----------+--------------------+
| 1 | Kilograms | kg |
| 2 | Packets | packets |
+----+-----------+--------------------+
A practical example of this data using the following query:
SELECT items.name, items.quantity, measurement_types.measurement_symbol
FROM items
INNER JOIN measurement_types
ON measurement_types.id = items.measurement_types_id;
would yield this result:
+---------+----------+--------------------+
| name | quantity | measurement_symbol |
+---------+----------+--------------------+
| Rice | 50 | kg |
| Noodles | 75 | packets |
+---------+----------+--------------------+
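The tables and query above can be tried out with Python's built-in sqlite3 module. This is a sketch only: the column types are assumptions, and the third measurement type (litres) and the Milk item are hypothetical rows added to show that a new unit needs no schema change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measurement_types (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        measurement_symbol TEXT NOT NULL
    );
    CREATE TABLE items (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        quantity INTEGER NOT NULL,
        measurement_types_id INTEGER NOT NULL REFERENCES measurement_types(id)
    );
    INSERT INTO measurement_types VALUES
        (1, 'Kilograms', 'kg'),
        (2, 'Packets', 'packets');
    INSERT INTO items VALUES (1, 'Rice', 50, 1), (2, 'Noodles', 75, 2);

    -- Hypothetical extension: a new unit is just a new row, not a new column.
    INSERT INTO measurement_types VALUES (3, 'Litres', 'l');
    INSERT INTO items VALUES (3, 'Milk', 20, 3);
""")

# Same join as in the answer, with an ORDER BY so the row order is defined.
rows = conn.execute("""
    SELECT items.name, items.quantity, measurement_types.measurement_symbol
    FROM items
    INNER JOIN measurement_types
        ON measurement_types.id = items.measurement_types_id
    ORDER BY items.id
""").fetchall()

for name, quantity, symbol in rows:
    print(name, quantity, symbol)
```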

Which is a better database schema for a tracking tool?

I have to generate a view that shows tracking across each month. The ultimate view will be something like this:
| Person | Task | Jan | Feb | Mar| Apr | May | June . . .
| Joe | Roof Work | 100% | 50% | 50% | 25% |
| Joe | Basement Work | 0% | 50% | 50% | 75% |
| Tom | Basement Work | 100% | 100% | 100% | 100% |
I already have the following tables:
Person
Task
I am now creating a new table with foreign keys into the above 2 tables, and I am trying to figure out the pros and cons of creating 1 or 2 tables.
Option 1:
Create a new table with the following Columns:
Id
PersonId
TaskId
Jan2012
Feb2012
Mar2012
Apr2013
or
Option 2:
Have 2 separate tables
One table for just
Id
PersonId
TaskId
and another table for just the following columns
Id
PersonTaskId (the id from table above)
MonthYearKey
MonthYearValue
So an example record would be
| 1 | 13 | Jan2011 | 100% |
where 13 would represent a specific unique Person and Task combination. This second way would avoid having to create new columns as time goes on (which seems right), but I also want to avoid overkill.
Which would be the more scalable way to design this schema? Any other suggestions or more elegant ways of doing this would be great as well.
You can have a m2m table with data columns. I don't see a reason why you can't just put MonthYearKey, MonthYearValue on the same table with PersonId and TaskId
Id
TaskId
PersonId
MonthYearKey
MonthYearValue
It's possible too that you would want to move the MonthYearKey out into their own table, it really just comes down to common queries and what this data is used for.
I would note, you never want to design a schema where you are adding columns due to time. The first option would require maintenance all the time, and would become very difficult to query also.
Option 2 is definitely more scalable and is not overkill.
Option 1 would require you to add a new column every month and simple date based queries of your data would not be possible, e.g. Show me all people who worked at least 90% in any month last year.
The ultimate view would be generated from a particular query or view of your data.
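As a rough sketch of how Option 2 can still produce the month-per-column view, the example below uses Python's sqlite3 module. The table and column names (`person_task`, `allocation`, `month`, `percent`) are illustrative, the person and task are stored as text instead of foreign keys to keep it short, and the month key is stored as an ISO 'YYYY-MM' string (an assumption) so that date-range filters work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person_task (
        id INTEGER PRIMARY KEY,
        person TEXT NOT NULL,      -- stands in for PersonId
        task TEXT NOT NULL         -- stands in for TaskId
    );
    CREATE TABLE allocation (
        id INTEGER PRIMARY KEY,
        person_task_id INTEGER NOT NULL REFERENCES person_task(id),
        month TEXT NOT NULL,       -- 'YYYY-MM', plays the role of MonthYearKey
        percent INTEGER NOT NULL   -- plays the role of MonthYearValue
    );
    INSERT INTO person_task VALUES (1, 'Joe', 'Roof Work'), (2, 'Joe', 'Basement Work');
    INSERT INTO allocation (person_task_id, month, percent) VALUES
        (1, '2012-01', 100), (1, '2012-02', 50),
        (2, '2012-01', 0),   (2, '2012-02', 50);
""")

# Pivot rows into one column per month with conditional aggregation.
pivot = conn.execute("""
    SELECT pt.person, pt.task,
           MAX(CASE WHEN a.month = '2012-01' THEN a.percent END) AS jan,
           MAX(CASE WHEN a.month = '2012-02' THEN a.percent END) AS feb
    FROM person_task AS pt
    JOIN allocation AS a ON a.person_task_id = pt.id
    GROUP BY pt.id
    ORDER BY pt.id
""").fetchall()

# "People who worked at least 90% in any month of 2012" is a plain filter.
busy = conn.execute("""
    SELECT DISTINCT pt.person
    FROM person_task AS pt
    JOIN allocation AS a ON a.person_task_id = pt.id
    WHERE a.percent >= 90 AND a.month BETWEEN '2012-01' AND '2012-12'
""").fetchall()
```

The second query is the kind of date-based question that Option 1 makes awkward, because there the months are column names and not data.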

Checking for overlapping car reservations

I'm writing a simple booking program for a car rental (a school assignment). Me and my buddy are trying to make the system a little more advanced than the assignment dictates, but we're having some problems we hoped you could help us with.
The idea is that you can reserve a certain car type, and when you get the car it will be one of that type (you don't reserve a specific car, as our assignment dictates, but only a type). Only one customer can have the car on a specific date. As the reservations tick in we have to make sure, that we don't hire out more cars of each type than we've got. The reservations are basically stored with a start date, an end date, and a car type.
If we ignore the car type for now (lets say we only have one type) then the reservations could graphically look something like this:
1/12 2/12 3/12 4/12 5/12 6/12 7/12
|-------------------|
|-----------------|
|-----|
|-------|
|-----------|
|-------------|
If the rental only has three cars it would be possible to rent a car from 3/12 to 5/12 since all days only have 2 car reservations. But how do we know this? Do we have to check each date and count() the number of reservations that spans over that date?
And what if somebody had reserved a car on 4/12, then 3/12 and 5/12 would still only have 2 reservations, but 4/12 would have 3.
Would it be possible to do this with a query somehow, or do we have to step through each date in the program to check that the number of reservations doesn't exceed the number of cars?
(This is easy enough with only full dates, but consider the scenario where you could rent the cars on an hourly basis (not only on a daily basis as here). Then it could be a tough one to step through each hour if we have a lot of reservations and cars and the timespan is long...)
Hope you have some nice ideas that will help us along. Thanks for taking the time to read the question :)
Mikkel, Denmark
Assume you have the following reservation situation in real life:
1/12 2/12 3/12 4/12 5/12 6/12 7/12
Car1: |-------------------|
Car2: |-----------------|
Car3: |-------| |-----------| |-----|
Car4: |-------------|
Table car
| id | type | registration |
| 1 | 1 | HH1111 |
| 2 | 1 | HH3333 |
| 3 | 2 | HH77 |
| 4 | 3 | DD999 |
Table reservation
| car_id | date_from | date_to |
| 1 | 2013-12-01 | 2013-12-04 |
| 2 | 2013-12-04 | 2013-12-07 |
| 3 | 2013-12-01 | 2013-12-02 |
| 3 | 2013-12-03 | 2013-12-05 |
| 3 | 2013-12-06 | 2013-12-07 |
| 4 | 2013-12-01 | 2013-12-03 |
Now, with really simple logic, you can select all available cars for the period
from 2013-12-05 to 2013-12-06:
"Select all cars that do not have any reservation whose dates block them from being used"
with this MySQL select:
-- A reservation blocks the car if it overlaps the requested period.
-- The strict comparisons let a new rental start on the day another one ends.
select * from car where not exists ( select * from reservation
where car.id = reservation.car_id AND
date_from < '2013-12-06' AND
date_to > '2013-12-05' )
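To check what the query returns for the sample tables above, here is a sketch that loads them into an in-memory SQLite database from Python. SQLite stands in for MySQL here, dates are stored as ISO text (which compares correctly as strings), and the helper name `available_cars` is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE car (id INTEGER PRIMARY KEY, type INTEGER, registration TEXT);
    CREATE TABLE reservation (car_id INTEGER, date_from TEXT, date_to TEXT);
    INSERT INTO car VALUES
        (1, 1, 'HH1111'), (2, 1, 'HH3333'), (3, 2, 'HH77'), (4, 3, 'DD999');
    INSERT INTO reservation VALUES
        (1, '2013-12-01', '2013-12-04'),
        (2, '2013-12-04', '2013-12-07'),
        (3, '2013-12-01', '2013-12-02'),
        (3, '2013-12-03', '2013-12-05'),
        (3, '2013-12-06', '2013-12-07'),
        (4, '2013-12-01', '2013-12-03');
""")


def available_cars(conn, date_from, date_to):
    """Ids of cars with no reservation overlapping the requested period."""
    rows = conn.execute("""
        SELECT id FROM car
        WHERE NOT EXISTS (
            SELECT 1 FROM reservation
            WHERE reservation.car_id = car.id
              AND reservation.date_from < :date_to
              AND reservation.date_to > :date_from
        )
        ORDER BY id
    """, {"date_from": date_from, "date_to": date_to})
    return [row[0] for row in rows]


print(available_cars(conn, "2013-12-05", "2013-12-06"))  # [1, 3, 4]
```

Car 3 counts as available only because the comparisons are strict: its reservations end on 5/12 and start on 6/12, exactly at the edges of the requested period. If a car cannot be handed over on the same day, `<=` and `>=` would be needed.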
"Would it be possible to do with a query some how, or do we have to step through each date in the program to check the number of reservations didn't exceed the number of cars? (This is easy enough with only full dates,"
The nature of your problem is that a violation of the constraint could appear on any individual date. So logically speaking, it is indeed necessary to do the check for each individual date covered by a new reservation. The only optimisation possible would be to do the check at the level of "smallest intervals". To do that, you must first compute all the intervals that already appear in the database and that overlap with your new reservation.
For example, a new reservation for 4/12-6/12 would have to be split into 4/12-5/12 (second line) and 5/12-6/12 (third line). Those individual intervals might be longer than one single day, and you can do the checks on the level of those individual intervals. (They are the same as individual days in this particular example, but a reservation 7/12-19/12 would not have to be split at all.)
However, computing this might prove difficult, and there's another caveat: when you're looking at multi-row inserts, you should also be splitting over the other rows to be inserted (and that requires you to record all the inserted rows in a temporary table, otherwise you won't be able to access them).
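A minimal Python sketch of the "smallest intervals" check for a single car type follows. It rests on assumptions that are not in the thread: reservations are (start, end) pairs treated as half-open intervals, `capacity` is the number of cars of that type, and the function name `fits` and the sample dates are hypothetical.

```python
from datetime import date


def fits(existing, new, capacity):
    """True if adding `new` never needs more than `capacity` cars at once.

    Reservations are (start, end) pairs, treated as half-open: [start, end).
    """
    new_start, new_end = new
    overlapping = [(s, e) for s, e in existing if s < new_end and e > new_start]

    # Cut the new reservation at every start or end that falls inside it.
    cuts = {new_start, new_end}
    for s, e in overlapping:
        cuts.update(p for p in (s, e) if new_start < p < new_end)
    cuts = sorted(cuts)

    # The number of active reservations is constant within each piece,
    # so one count per piece replaces one count per day (or per hour).
    for lo, hi in zip(cuts, cuts[1:]):
        active = sum(1 for s, e in overlapping if s < hi and e > lo)
        if active + 1 > capacity:
            return False
    return True


# Hypothetical reservations for one car type.
existing = [
    (date(2013, 12, 1), date(2013, 12, 4)),
    (date(2013, 12, 2), date(2013, 12, 5)),
    (date(2013, 12, 4), date(2013, 12, 6)),
]
print(fits(existing, (date(2013, 12, 3), date(2013, 12, 5)), capacity=3))  # True
```

Because the pieces are bounded by existing start and end points and not by calendar units, the same function works unchanged with `datetime` values for hourly rentals.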
