How can I search for a specific data set - database

This is my first time in the forum so if I made any mistakes please let me know.
So I have to do a research project for school and I am having a bit of a problem finding a data set that meets all my requirements. This is for my database class. I have been looking the whole weekend and have not been able to identify any valid source.
I was wondering if you could help me; perhaps there is a more flexible website that would help me narrow the search, or a specific website with the information.
Here are all the requirements:
The data set must be from a legitimate source (e.g., the U.S. government, a state agency, a university).
The data set must measure something by date and by ZIP code. In essence, the data set can contain just 3 fields (date, ZIP code, measure).
The date range must span at least 10 years.
The granularity of the date must be at least monthly, in the format YYYYMM or YYYYMMDD.
The ZIP codes must cover all fifty (not just 48) states.
Anything except weather.
Thanks a lot in advance for the help

You can try the U.S. Census Bureau; there are several datasets there that should match your needs, although there will probably be more information than the minimum you need - that is likely to be true of most real-world examples.
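If it helps to see the target shape, here is a minimal sketch of the kind of table your three fields could load into (generic SQL; the table and column names are made up for illustration):

    -- One measure per ZIP code per month; names are illustrative only.
    CREATE TABLE zip_measure (
        obs_month CHAR(6)       NOT NULL,  -- YYYYMM, e.g. '201101'
        zip_code  CHAR(5)       NOT NULL,  -- stored as text to keep leading zeros
        measure   DECIMAL(12,2) NOT NULL,  -- whatever the dataset measures
        PRIMARY KEY (obs_month, zip_code)
    );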

Related

Strategies for UK Postal Address Matching

I have 2 tables of UK postal addresses (around 300,000 rows each) and need to match one set to the other in order to return, for each address, a unique ID contained in the first set.
The problem is there's a lot of variation in the formats of the addresses and in the spellings.
I've written a lot of T-SQL scripts to pick off the easy matches (exact postcode + house number + street name, etc.) but there are many unmatched records left that are proving difficult to handle. I might end up with as many SQL scripts as there are exceptions!
I've looked at the Levenshtein function and word-by-word ranking, but these methods are unreliable and problematic too.
Does anyone have any experience of doing similar work, and if so, what was your approach and success rate?
Thank you!
I agree with the commenters that this is largely a business rule thing rather than a programming question, but for what it's worth...
I had a somewhat similar problem with a catalogue many years ago. Entries weren't always as consistent as we'd hoped, and different editions turned up with a wide variety of variations. All had to be linked.
What I did in the end was build a fuzzy matcher. I broke each item down into components and normalised the data where I could - removing spaces from fields that didn't always have them and could live without them, for example. I worked out the edit distance between near misses ("bar" and "car" being 1 apart, for example), and I stemmed words - see http://snowball.tartarus.org/algorithms/english/stemmer.html for more info. I think I even played with SQL Server's SOUNDEX matching.
I then went through and scripted the job to produce a list of candidate matches. Anything above a certain level got presented to an administrator, who was shown what the program thought was the best match along with other likely matches. They picked the one that looked best, ticked it and went on to the next one.
When the list was first produced, everyone thought the job was far too huge to be manageable. Then they started going through it and found it was much faster than they'd expected, and much easier than they'd feared to stay on top of the new data as it came in.
A script to do it all programmatically will never be perfect, and will end up nearly as long as the source list, with as many special cases as it has entries. Don't try to automate it perfectly; automate the easy stuff and put a human in the loop for the uncertain cases. Much easier and safer.
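For what it's worth, here is a rough T-SQL sketch of the candidate-list step using SQL Server's SOUNDEX-based DIFFERENCE() function, which returns 0 (no phonetic match) through 4 (strong match). The src/tgt table and column names are hypothetical:

    -- Hypothetical tables: src(id, postcode, house_no, street), tgt(postcode, house_no, street).
    -- Exact matches are assumed to be handled already; this lists plausible near misses
    -- for a human reviewer, best candidates first.
    SELECT s.id,
           s.street                        AS source_street,
           t.street                        AS candidate_street,
           DIFFERENCE(s.street, t.street)  AS street_score   -- 0..4, 4 = strong match
    FROM   src AS s
    JOIN   tgt AS t
           ON  t.postcode = s.postcode
           AND t.house_no = s.house_no
    WHERE  s.street <> t.street
      AND  DIFFERENCE(s.street, t.street) >= 3
    ORDER  BY s.id, street_score DESC;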

Database schema for calendar application

I'm creating a calendar application in which each date has one of 3 states: available, maybe available, and unavailable. Trying to figure out the best schema for this situation.
One thought might be to have a UserDate model with a state field. The problem with this is that the DB will have (number of users) × 365 rows for each year - it seems like that would grow too quickly for a modestly sized app.
Another thought might be to have a default state, and only create a UserDate object when the user has signified that their availability on that date is different from the default. This seems convoluted though.
Has anyone dealt with this situation before? Any suggestions on the best way to go about this?
When you create a new user, you do not want to be inserting records for the next 50 years of their life. Creating a UserDate row only when there is a non-default value is what you should do.
You could consider storing a range of dates for a user if you are likely to have lots of consecutive dates with the same status. For example, if they are unavailable for all of December, then this could be represented as a single row.
Think about the sort of information you want to extract from your database, and how difficult this will be with each of your possible designs.
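To make the range idea concrete, here is a sketch in generic SQL (the names and single-character state codes are illustrative); any date not covered by a row takes the default state, "available":

    -- One row per contiguous run of non-default days.
    CREATE TABLE user_availability (
        user_id    INT     NOT NULL,
        start_date DATE    NOT NULL,
        end_date   DATE    NOT NULL,            -- inclusive
        state      CHAR(1) NOT NULL,            -- 'M' = maybe available, 'U' = unavailable
        PRIMARY KEY (user_id, start_date),
        CHECK (end_date >= start_date)
    );

    -- "Unavailable for all of December" is a single row:
    INSERT INTO user_availability VALUES (42, '2014-12-01', '2014-12-31', 'U');

    -- Look up one user's state on one date; no row found means the default ('available'):
    SELECT state
    FROM   user_availability
    WHERE  user_id = 42
      AND  '2014-12-25' BETWEEN start_date AND end_date;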

Data quality database model

I need an example of a database model for data quality, to be attached to an existing database. The best form of answer would, at the very least, be DDL that's executable in MySQL; DDL for other RDBMSs is okay, I'll just post another question asking for a port of the code.
A good explanation would be a huge plus.
Questions, comments, feedback, etc. -- just comment, thanks!!
The biggest problem is identifying meaningful measures of quality. That's so highly application-dependent, I doubt that anybody will be able to help you very much. (At least not without a lot more information--perhaps more than you're allowed to give.)
But let's say your application records observations of birds by individuals. (I'm just throwing this together off the top of my head. Read it for the gist, and expect the details to crumble under scrutiny.) Under average field conditions,
some species are hard for even a beginner to get wrong
some species are hard for an expert to get right
a specific individual's ability varies irregularly over time (good days, bad days)
individuals usually become more skilled over time
you might be highly skilled at identifying hawks, and totally suck at identifying gulls
individuals are prone to suggestion (who they're with makes a difference in their reliability)
So, to take a shot at assessing the quality of an identification, you might try to record a lot of information besides the observation "3 red-tailed hawks at Cape May on 05-Feb-2011 at 4:30 pm". You might try to record
weather
lighting
temperature (some birders suck in the cold)
hours afield (some birders suck after 3 hours, or after 20 cold minutes)
names of others present
average difficulty of correctly identifying red-tailed hawks
probability that this individual could correctly identify red-tails under these field conditions
alcohol intake
Although this might be "meta" to field birders, to the database designer it's just data. And you'd design the tables just like you'd design them for any other application. (That's what I did, anyway.)
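Since you asked for executable MySQL, here is a cut-down sketch of the birding example above. Read it the same way: for the gist, with every table and column name invented for illustration.

    CREATE TABLE observer (
        observer_id INT AUTO_INCREMENT PRIMARY KEY,
        name        VARCHAR(100) NOT NULL
    );

    CREATE TABLE species (
        species_id    INT AUTO_INCREMENT PRIMARY KEY,
        common_name   VARCHAR(100) NOT NULL,
        id_difficulty TINYINT NOT NULL      -- avg difficulty of a correct ID, 1 (easy) to 5 (hard)
    );

    -- The "meta" that feeds the quality assessment is just more columns.
    CREATE TABLE observation (
        observation_id INT AUTO_INCREMENT PRIMARY KEY,
        observer_id    INT NOT NULL,
        species_id     INT NOT NULL,
        observed_at    DATETIME NOT NULL,   -- e.g. '2011-02-05 16:30' at Cape May
        location       VARCHAR(100) NOT NULL,
        bird_count     INT NOT NULL,
        weather        VARCHAR(50),
        temperature_c  DECIMAL(4,1),
        hours_afield   DECIMAL(4,1),
        companions     VARCHAR(200),
        FOREIGN KEY (observer_id) REFERENCES observer (observer_id),
        FOREIGN KEY (species_id)  REFERENCES species (species_id)
    );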

Is it really necessary to record the registration date of new website users?

What are the advantages and disadvantages?
That depends on what your site is, and how you use that information. On StackOverflow, you are awarded a "yearling" badge once a full year elapses from the time you registered. Clearly here that information is necessary.
If I were you, I'd save it. It's a small piece of information that may become useful eventually. It's better to have it and not need it than to need it and not have it. It would be rather difficult to extrapolate an accurate registration date retrospectively if you don't store it to begin with.
Advantage:
You avoid a migration headache when you need it at some point; for a lot of data, you cannot reconstruct this information after the fact. You could fake it with MODIFICATION_DATE, but this is often inaccurate and later than the real registration date (e.g. when the profile can be edited by the user).
Disadvantage:
If you never need this information, you have wasted space (though one more small column shouldn't be a problem). Furthermore, you have a permanently unused field, which can be confusing to new developers ("what is this column for? I cannot see where it is used...").
As mentioned, the registration date is most likely valuable information, so I would add it from the start. When designing a persistent data model, you sometimes have to think ahead.
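In practice this is a single column with a default, so application code cannot even forget to set it. A sketch in MySQL syntax (table and column names are illustrative):

    CREATE TABLE site_user (
        user_id       INT AUTO_INCREMENT PRIMARY KEY,
        email         VARCHAR(255) NOT NULL UNIQUE,
        -- filled in by the database at INSERT time; the app never needs to set it
        registered_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    );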

Advantages of keeping to a protocol for a data model

The question title is probably not quite right, because part of my question is trying to get a better understanding of the problem.
I am looking for the advantages of making sure that data imported into a database (simple example: an Excel table into an Access database) conforms to a single agreed schema and is valid against the business requirements.
I have an Excel table containing non-normalised data and an Access database with normalised tables.
The Excel table comes from multiple third parties, none of which stick to the same format as each other or the database.
Some of the sources also do not supply all the relevant data.
Example of what could be supplied:
contact_key, date, contact_title, reject_name, reject_cost, count_of_unique_contact
count_of_unique_contact is derived from distinct contact_titles and should not be imported.
contact_key is sometimes not supplied.
contact_title is sometimes unknown and passed in as "n/a", "name = ??1342", "#N/A", etc., fairly randomly.
reject_name is often misspelled.
Some fields are sometimes not supplied at all, e.g. date and contact_key are missing.
I am trying to find information to help explain the issues with the above.
The issues are only those related to incorrect data or missing fields that make it difficult to have useful data in the database, such as not being able to report a trend in reject costs by month when the date is not supplied. Normalising the Excel file is not an option available to me.
What I want is to request that the values and fields in the Excel files match the business requirements, and that the format be the same for every third party that sends them, but the request is falling on deaf ears.
I want to explain to the client that constantly inputting fake data and checking for invalid or pre-existing rejects/contacts is wrong, and that doing it this way is going to fail, or at best be difficult, without constant maintenance of a poor system.
Does anyone have any information on this problem?
Thanks
This is a common problem; this gets referred to in data processing circles as "garbage in, garbage out". Essentially, what you're running up against is that the data as given is of poor quality; you're correct to recognize that the problem is that it will be hard (if not impossible) to use this data to extract any useful information.
To some extent, this is a problem that should be fixed at the source; whatever your source of your data is, they need to be convinced that the data quality must improve. In the short term, you can sanitize your data; the term refers to removing or cleaning the bad entries to make the remainder of the data (the "good" data) importable into your database. Depending on just what percentage of your data is bad, you may or may not be able to do useful things with the sanitized data once you import it.
At some point, since you're not getting traction with management about the quality of the data, you will simply have to show them that the system is not working as intended because the data quality is bad. They will then need to improve their processes to raise the quality of the data coming in. Until then, keep pressing for better data; investigate the process of sanitizing the data and see what you can do with what remains. Good luck!
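One common way to sanitize on import is a staging table: load everything as-is, then move across only the rows that pass your rules. A rough sketch using the fields from your example; the destination table (contact) and the rules shown are illustrative, not exhaustive:

    -- Everything lands here first, as raw text, so nothing is rejected at load time.
    CREATE TABLE staging_contact (
        contact_key   VARCHAR(20),
        contact_date  VARCHAR(20),
        contact_title VARCHAR(100),
        reject_name   VARCHAR(100),
        reject_cost   VARCHAR(20)
    );

    -- Only rows that pass the business rules make it into the real table.
    INSERT INTO contact (contact_key, contact_date, contact_title, reject_name, reject_cost)
    SELECT contact_key, contact_date, contact_title, reject_name, reject_cost
    FROM   staging_contact
    WHERE  contact_key  IS NOT NULL
      AND  contact_date IS NOT NULL
      AND  contact_title NOT IN ('n/a', '#N/A', '');  -- plus whatever other junk turns up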
