I am working on a requirement to match people from different databases. One tricky problem is variance in names like Bob - Robert, Jim - James, Lizzy - Elizabeth etc across databases.
Is there a lookup/translation available for this kind of a requirement.
Take a look at my answer (as well as the others) here:
Tools for matching name/address data
You'd need to implement a lookup table with the alternate names in it:
Base | Alternate
----------------
Robert | Bob
Elizabeth | Liz
Elizabeth | Lizzy
Elizabeth | Beth
Then search the database for the base name and all alternates. You'll end up with a number of multiple matches which will then need to be checked to see if they really match based on a comparison of whatever other data you have in the two databases. Maybe the dates of the records in each database could be used - records entered close in time indicate the same person.
Related
I want to show the results of a MySQL query on my website using angularjs. For now, I'm showing them using a simple table with ng-repeat and it works with no problem. But because the data is a lot, I wanted to ask if it is possible to create multiple panels or tables per specific field.
To be more specific, I have 4 fields returned from the query: name, address, occupation, department. Right now I have a table such as:
George Smith Nikis 10 Project Manager Finance
Maria Bexley Lincoln 20 Project Manager Research
Chris Liggs Forks 123 Programmer Computer Science
etc. I want to know if I can create as many panels or tables as the unique values of the "occupation" field are and then show the results per that unique value inside each panel/table. So instead of the above table I would have something like:
Project Manager
George Smith Nikis 10 Finance
Maria Bexley Lincoln 20 Research
Programmer
Chris Liggs Forks 123 Computer Science
I think you need to use groupby filter
Check this fiddle by Darryl Rubarth, it contains the answer you need
http://jsfiddle.net/drubarth/R8YZh/
<div ng-repeat="item in MyList | orderBy:'groupfield' | groupBy:'groupfield'" >
You can use group by filter in ng-repeat with which you need to group
I'm trying to design a precise database following the normal standards of database design taught, and I'm stuck on the idea of a ZIP Code table. I've put a table together for zip code information with the following specs:
-PK | zip_code_id | NN UQ - BIGINT
- | zip_code | NN - STRING(5)
- | city | NN - STRING(40)
- | state | NN - STRING(2)
however I'm stuck if I should include the ZIP+4 in the table or as a field for whatever record I use. My question is this: According to best practices should the ZIP+4 be included in the separate table used for zip codes or should that be used in as a field in whatever record this table is used with? This will only be using american addresses.
I'm open to any redesign if my design is to restrictive ie allowing non-american addresses.
I'm looking for an efficient way of storing sets of objects that have occurred together during events, in such a way that I can generate aggregate stats on them on a day-by-day basis.
To make up an example, let's imagine a system that keeps track of meetings in an office. For every meeting we record how many minutes long it was and in which room it took place.
I want to get stats broken down both by person as well as by room. I do not need to keep track of the individual meetings (so no meeting_id or anything like that), all I want to know is daily aggregate information. In my real application there are hundreds of thousands of events per day so storing each one individually is not feasible.
I'd like to be able to answer questions like:
In 2012, how many minutes did Bob, Sam, and Julie spend in each conference room (not necessarily together)?
Probably fine to do this with 3 queries:
>>> query(dates=2012, people=[Bob])
{Board-Room: 35, Auditorium: 279}
>>> query(dates=2012, people=[Sam])
{Board-Room: 790, Auditorium: 277, Broom-Closet: 71}
>>> query(dates=2012, people=[Julie])
{Board-Room: 190, Broom-Closet: 55}
In 2012, how many minutes did Sam and Julie spend MEETING TOGETHER in each conference room? What about Bob, Sam, and Julie all together?
>>> query(dates=2012, people=[Sam, Julie])
{Board-Room: 128, Broom-Closet: 55}
>>> query(dates=2012, people=[Bob, Sam, Julie])
{Board-Room: 22}
In 2012, how many minutes did each person spend in the Board-Room?
>>> query(dates=2012, rooms=[Board-Room])
{Bob: 35, Sam: 790, Julie: 190}
In 2012, how many minutes was the Board-Room in use?
This is actually pretty difficult since the naive strategy of summing up the number of minutes each person spent will result in serious over-counting. But we can probably solve this by storing the number separately as the meta-person Anyone:
>>> query(dates=2012, rooms=[Board-Room], people=[Anyone])
865
What are some good data structures or databases that I can use to enable this kind of querying? Since the rest of my application uses MySQL, I'm tempted to define a string column that holds the (sorted) ids of each person in the meeting, but the size of this table will grow pretty quickly:
2012-01-01 | "Bob" | "Board-Room" | 2
2012-01-01 | "Julie" | "Board-Room" | 4
2012-01-01 | "Sam" | "Board-Room" | 6
2012-01-01 | "Bob,Julie" | "Board-Room" | 2
2012-01-01 | "Bob,Sam" | "Board-Room" | 2
2012-01-01 | "Julie,Sam" | "Board-Room" | 3
2012-01-01 | "Bob,Julie,Sam" | "Board-Room" | 2
2012-01-01 | "Anyone" | "Board-Room" | 7
What else can I do?
Your question is a little unclear because you say you don't want to store each individual meeting, but then how are you getting the current meeting stats (dates)? In addition any table given the right indexes can be very fast even with alot of records.
You should be able to use a table like log_meeting. I imagine it could contain something like:
employee_id, room_id, date (as timestamp), time_in_meeting
Where foreign keys to employee id to employee table, and room id key to room table
If you index employee id, room id, and date you should have a pretty quick lookup as mysql multiple-column indexes go left to right such that you gain index on (employee id, employee id + room id, and employee id + room id + timestamp) when do searches. This is explained more in the multi-index part of:
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
By refusing to store meetings (and related objects) individually, you are loosing the original source of information.
You will not be able to compensate for this loss of data, unless you memorize on a regular basis the extensive list of all potential daily (or monthly or weekly or ...) aggregates that you might need to question later on!
Believe me, it's going to be a nightmare ...
If the number of people are constant and not very large you can then assign a column to each person for present or not and store the room, date and time in 3 more columns this can remove the string splitting problems.
Also by the nature of your question I feel first of all you need to assign Ids to everything rooms,people, etc. No need for long repetitive string in DB. Also try reducing any string operation and work using individual data in each column for better intersection performance. Also you can store a permutation all the people in a table and assign a id for them then use one of those ids in the actual date and time table. But all techniques will require that something be constant either people or rooms.
I do not understand whether you know all "questions" in design time or it's possible to add new ones during development/production time - this approach would require to keep all data all the time.
Well if you would know all your questions it seems like classic "banking system" which recalculates data on daily basis.
How I think about it.
Seems like you have limited number of rooms, people, days etc.
Gather logging data on daily basis, one table per day. Just one event, one database row, all information (field) what you need.
Start to analyse data using some crone script at "midnight".
Update stats for people, rooms, etc. Just increment number of hours spent by Bob in xyz room etc. All what your requirements need.
As analyzed data are limited and relatively small as you analyzed (compress) them, your system can contain also various queries as indexes would be relatively small etc.
You could be able to use scalable map/reduce algorithm.
You can't avoid storing the atomic facts as follows: (the meeting room, the people, the duration, the day), which is probably only a weak consolidation when the same people meet multiple times in the same room on the same day. Maybe that happens a lot in your office :).
Making groups comparable is an interesting problem, but as long as you always compose the member strings the same, you can probably do it with string comparisons. This is not "normal" however. To normalise you'll need a relation table (many to many) and compose a temporary table out of your query set so it joins quickly, or use an "IN" clause and a count aggregate to ensure everyone is there (you'll see what I mean when you try it).
I think you can derive the minutes the board room was in use as meetings shouldn't overlap, so a sum will work.
For storage efficiency, use integer keys for everything with lookup tables. Dereference the integers during the query parsing, or just use good old joins if you are feeling traditional.
That's how I would do it anyway :).
You'll probably have to store individual meetings to get the data you need anyway.
However you'll have to make sure you aggregate and anonymise it properly before creating your reports. Make sure to separate concerns and access levels to stay within the proper legal limits on data.
This is a system architecture question on designing full-text search with (relational) database. The specific software I'm using are Solr and PostgreSQL, just FYI.
Suppose we are building a forum with two users Andy and Betty --
Post ID | User | Title | Content
--------|-------|-------------------|---------------------------
1 | Andy | Dark Knight rocks | Dark Knight rocks blah
2 | Betty | I love Twilight | Twilight blah blah
3 | Andy | Twilight sucks | Twilight sucks blah
4 | Betty | Andy sucks | Twilight rocks, Andy sucks
When the posts table is indexed in Solr, we can easily return the posts sorted by their relevancy to "?q=twilight" or "?q=dark+night".
Now we want to add a new feature to search for users instead of posts. A naive implementation would simply index user name and return "Andy" to "?q=a" and "Betty" to "?q=b", but what if we want to make our system smarter to also take into account of the user posts and return "Betty" before "Andy" to "?q=twilight" because Betty mentions Twilight more than Andy does.
How would you design the system to efficiently handle the user-search function for hundreds of thousands of users and millions of posts?
Faceting on User would return number of results per user. If Andy wrote 15 posts that match Twilight while Betty wrote 10, the faceting will return them as such.
But it wont help if both wrote 15 posts about Twilight, but Andy's was supposed to be more relevant; you will see all facet counts (15, 15 in this case) even if you are paginating to see only (say,) top 5 results and Andy made 4 of them.
If above solution is not good enough, consider a background job that writes documents of
type: suggest_user_type (so you can distinguish them by a `fq`)
user: Andy (the user)
concatted_posts: "I think Twilight.." (concatenate the users latest 50 posts)
once a week. And if you
fq=type:suggest_user_type&
q=concatted_posts:twilight&
fl=user
you get a sorted list of users based on relevance of concatted_posts with respect to twilight.
I believe term frequency is included in full text search ranking. It's part of a research area called information retrieval. There's also another value called the inverse document frequency, which filters out common terms.
There are other steps common to ranking text, you may want to have a look at the OpenNLP project if you're interested.
In terms of database design, there's too much to cover in a post and I'm not the one to write it. The general consensus seems to be for very large systems they key is building an efficient index, then distributing this over a number machines to scale performance. I would recommend reading up on Page Rank and how Google developed its systems as a starting point.
Consider this table:
+-------+-------+-------+-------+
| name |hobby1 |hobby2 |hobby3 |
+-------+-------+-------+-------+
| kris | ball | swim | dance |
| james | eat | sing | sleep |
| amy | swim | eat | watch |
+-------+-------+-------+-------+
There is no priority on the types of hobbies, thus all the hobbies belong to the same domain. That is, the hobbies in the table can be moved on any hobby# column. It doesn't matter on which column, a particular hobby can be in any column.
Which database normalization rule does this table violate?
Edit
Q. Is "the list of hobbies [...] in an arbitrary order"?
A. Yes.
Q. Does the table have a primary key?
A. Yes, suppose the key is an AUTO_INCREMENT column type named user_id.
The question is if the columns hobby# are repeating groups or not.
Sidenote: This is not a homework. It's kind of a debate, which started in the comments of the question SQL - match records from one table to another table based on several columns. I believe this question is a clear example of the 1NF violation.
However, the other guy believes that I "have fallen fowl of one of the fallacies of 1NF." That argument is based on the section "The ambiguity of Repeating Groups" of the article Facts and Fallacies about First Normal Form.
I am not writing this to humiliate him, me, or whomever. I am writing this, because I might be wrong, and there is something I am clearly missing and maybe this guy is not explaining it good enough to me.
You say that the hobbies belong to the same domain and that they can move around in the columns. If by this you mean that for any specific name the list of hobbies is in an arbitrary order and kriss could just as easily have dance, ball, swim as ball, swim, dance, then I would say you have a repeating group and the table violates 1NF.
If, on the other hand, there is some fundamental semantic difference between a particular person's first and second hobbies, then there may be an argument for saying that the hobbies are not repeating groups and the table may be in 3NF (assuming that hobby columns are FK to a hobby table). I would suggest that this argument, if it exists, is weak.
One other factor to consider is why there are precisely 3 hobbies and whether more or fewer hobbies are a potential concern. This factor is important not so much for normalization as for flexibility of design. This is one reason I would split the hobbies into rows, even if they are semantically different from one-another.
Your three-hobby table design probably violates what I usually call the spirit of the original 1NF (probably for the reasons given by dportas and others).
It turns out however, that it is extremely difficult to find [a set of] formal and precise "measurable" criteria that accurately express that original "spirit". That's what your other guy was trying to explain talking about "the ambiguity of repeating groups".
Stress "formal", "precise" and "measurable" here. Definitions for all other normal forms exist that satisfy "formal", "precise" and "measurable" (i.e. objectively observable). For 1NF it's just hard (/impossible ???) to do. If you want to see why, try this :
You stated that the question was "whether those three hobby columns constitute a repeating group". Answer this question with "yes", and then provide a rigorous formal underpinning for your answer.
You cannot just say "the column names are the same, except for the numbered suffix". Making a violation of such a rule objectively observable/measurable would require to enumerate all the possible ways of suffixing.
You cannot just say "swim, tennis" could equally well be "tennis, swim", because getting to know that for sure requires inspecting the external predicate of the table. If that is just "person <name> has hobby <hobby1> and also has <hobby2>" , then indeed both are equally valid (aside : and due to the closed world assumption it would in fact require all possible permutations of the hobbies to be present in the table !!!). However, if that external predicate is "person <name> spends the most time on <hobby1> and the least on <hobby2>", then "swim, tennis" could NOT equally well be "tennis,swim". But how do you make such interpretations of the external predicate of the table objective (for ALL POSSIBLE PREDICATES) ???
etc. etc.
This clearly "looks" like a design error.
It's not not a design error when this data is simply stored and retrieved. You need only 3 of the hobbies and you don't intend to use this data in any other way than retrieve.
Let's consider this relationship:
Hobby1 is the main hobby at some point in a person's life (before 18 years of age for example)
Hobby2 is the hobby at another point (19-30)
Hobby3 is her hobby at a another one.
Then this table seems definitely well designed and while the 1NF convention is respected the naming arguably "sucks".
In the case of an indiscriminate storage of hobbies this is clearly wrong in most if not all cases I can think of right now. Your table has duplicate rows which goes against the 1NF principles.
Let's not consider the reduced efficiency of SQL requests to access data from this table when you need to sort the results for paging or any other practical reason.
Let's take into consideration the effort required to work with your data when your database will be used by another developer or team:
The data here is "scattered". You have to look in multiple columns to aggregate related data.
You are limited to only 3 of the hobbies.
You can't use simple rules to establish unicity (same hobby only once per user).
You basically create frustration, anger and hatred and the Force is disturbed.
Well,
The point is that, as long as all hobby1, hobby2 and hobby3 values are not null, AND names are unique, this table could be considered more or less as abbiding by 1NF rules (see here for example ...)
But does everybody has 3 hobbies? Of course not! Do not forget that databases are basically supposed to hold data as a representation of reality! So, away of all theories, one cannot say that everybody has 3 hobbies, except if ... our table is done to hold data related to people that have three hobbies without any preference between them!
This said, and supposing that we are in the general case, the correct model could be
+------------+-------+
| id_person |name |
+------------+-------+
for the persons (do not forget to unique key. I do not think 'name' is a good one
+------------+-------+
| id_hobby |name |
+------------+-------+
for the hobbies. id_hobby key is theoretically not mandatory, as hobby name can be the key ...
+------------+-----------+
| id_person |id_hobby |
+------------+-----------+
for the link between persons and hobbies, as the physical representation of the many-to-many link that exists between persons and their hobbies.
My proposal is basic, and satisfies theory. It can be improved in many ways ...
Without knowing what keys exist and what dependencies the table is supposed to satisfy it's impossible to determine for sure what Normal Form it satisfies. All we can do is make guesses based on your attribute names.
Does the table have a key? Suppose for the sake of example that Name is a candidate key. If there is exactly one value permitted for each of the other attributes for each tuple (which means that no attribute can be null) then the table is in at least First Normal Form.
If any of the columns in the table accept nulls then then the table violates first normal form. Assuming no nulls, #dportas has already provided the correct answer.
The table does not violate first normal form.
First normal form does not have any prohibition against multiple columns of the same type. As long as they have distinct column names, it is fine.
The prohibition against "Repeating Groups" concerns nested records - a structure which is common in hierarchical databases, but typically not possible in relational databases.
The table using repeating groups would look something like this:
+-------+--------+
| name |hobbies |
+-------+--------+
| kris |+-----+ |
| ||ball | |
| |+-----+ |
| ||swim | |
| |+-----+ |
| ||dance| |
| |+-----+ |
+-------+--------+
| james |+-----+ |
| ||eat | |
| |+-----+ |
| ||sing | |
| |+-----+ |
| ||sleep| |
| |+-----+ |
+-------+--------+
| amy |+-----+ |
| ||swim | |
| |+-----+ |
| ||eat | |
| |+-----+ |
| ||watch| |
| |+-----+ |
+-------+--------+
In a table conforming to 1NF all values can be located though table name, primary key, and column name. But this is not possible with repeated groups, which require further navigation.