(Full-Text) Search And Database Design

This is a system architecture question on designing full-text search with a (relational) database. The specific software I'm using is Solr and PostgreSQL, just FYI.
Suppose we are building a forum with two users, Andy and Betty:
Post ID | User | Title | Content
--------|-------|-------------------|---------------------------
1 | Andy | Dark Knight rocks | Dark Knight rocks blah
2 | Betty | I love Twilight | Twilight blah blah
3 | Andy | Twilight sucks | Twilight sucks blah
4 | Betty | Andy sucks | Twilight rocks, Andy sucks
When the posts table is indexed in Solr, we can easily return the posts sorted by their relevancy to "?q=twilight" or "?q=dark+knight".
Now we want to add a new feature to search for users instead of posts. A naive implementation would simply index the user name and return "Andy" for "?q=a" and "Betty" for "?q=b". But what if we want to make our system smarter, so that it also takes the users' posts into account and returns "Betty" before "Andy" for "?q=twilight", because Betty mentions Twilight more than Andy does?
How would you design the system to efficiently handle the user-search function for hundreds of thousands of users and millions of posts?

Faceting on User would return the number of results per user. If Andy wrote 15 posts that match Twilight while Betty wrote 10, the faceting will return them as such.
But it won't help if both wrote 15 posts about Twilight and Andy's were supposed to be more relevant; you will see all facet counts (15 and 15 in this case) even if you are paginating to see only, say, the top 5 results and Andy wrote 4 of them.
If the above solution is not good enough, consider a background job that, once a week, writes one document per user of the form
type: suggest_user_type (so you can distinguish them with an `fq`)
user: Andy (the user)
concatted_posts: "I think Twilight.." (the user's latest 50 posts, concatenated)
If you then query with
fq=type:suggest_user_type&
q=concatted_posts:twilight&
fl=user
you get a list of users sorted by the relevance of concatted_posts to twilight.
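The background job described above could be sketched roughly as follows. This is only an illustration in Python: the helper names `build_user_docs` and `user_search_params` are made up for this example, the posts come from the sample table in the question, and actually sending the documents to Solr is left out.

```python
from urllib.parse import urlencode

MAX_POSTS = 50  # the answer suggests concatenating each user's latest 50 posts


def build_user_docs(posts, max_posts=MAX_POSTS):
    """Build one 'suggest_user_type' document per user.

    `posts` is an iterable of (post_id, user, content) tuples,
    assumed to be ordered from oldest to newest.
    """
    by_user = {}
    for _post_id, user, content in posts:
        by_user.setdefault(user, []).append(content)
    return [
        {
            "type": "suggest_user_type",
            "user": user,
            # keep only the most recent `max_posts` posts
            "concatted_posts": " ".join(contents[-max_posts:]),
        }
        for user, contents in by_user.items()
    ]


def user_search_params(term):
    """URL-encode the fq/q/fl parameters from the answer for one search term."""
    return urlencode({
        "fq": "type:suggest_user_type",
        "q": f"concatted_posts:{term}",
        "fl": "user",
    })


posts = [
    (1, "Andy", "Dark Knight rocks blah"),
    (2, "Betty", "Twilight blah blah"),
    (3, "Andy", "Twilight sucks blah"),
    (4, "Betty", "Twilight rocks, Andy sucks"),
]
docs = build_user_docs(posts)
query_string = user_search_params("twilight")
```

Each entry in `docs` would be indexed as a separate Solr document, and `query_string` is what gets appended to the select URL.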

I believe term frequency is included in full-text search ranking. It's part of a research area called information retrieval. There's also another value called the inverse document frequency, which down-weights common terms.
There are other steps common to ranking text; you may want to have a look at the OpenNLP project if you're interested.
In terms of database design, there's too much to cover in a post and I'm not the one to write it. The general consensus seems to be that for very large systems the key is building an efficient index, then distributing it over a number of machines to scale performance. I would recommend reading up on PageRank and how Google developed its systems as a starting point.
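To make the term frequency and inverse document frequency idea concrete, here is a small hand-rolled sketch in Python. It is not Solr's actual scoring formula; the smoothed IDF variant and the function names are assumptions chosen for illustration, and each "document" is simply one user's posts from the question, lower-cased and stripped of punctuation.

```python
import math
from collections import Counter

# One pseudo-document per user: that user's post contents joined together.
docs = {
    "Andy": "dark knight rocks blah twilight sucks blah",
    "Betty": "twilight blah blah twilight rocks andy sucks",
}


def tf(term, text):
    """Term frequency: share of the words in `text` that equal `term`."""
    words = text.split()
    return Counter(words)[term] / len(words)


def idf(term, corpus):
    """Smoothed inverse document frequency: rarer terms get a larger weight."""
    containing = sum(1 for text in corpus.values() if term in text.split())
    return math.log((1 + len(corpus)) / (1 + containing)) + 1


def score(term, corpus):
    """TF-IDF score of `term` for every document in the corpus."""
    weight = idf(term, corpus)
    return {name: tf(term, text) * weight for name, text in corpus.items()}


scores = score("twilight", docs)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here `ranked` puts Betty ahead of Andy for "twilight" because the term makes up a larger share of her text, which is the behaviour the question asks for.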

Related

Cassandra/Solr data model improvement

I have the following table:
CREATE TABLE videos_tags (
id text,
tag text,
video text,
someotherfield bigint,
PRIMARY KEY (id)
) WITH gc_grace_seconds = 1296000
AND compaction={'class': 'LeveledCompactionStrategy'}
AND compression={'sstable_compression': 'LZ4Compressor'};
The table stores a list of tags and videos. A video can have one or more tags; and a tag can be attributed to more than one video. Example:
id | tag | video
------------------------------------------
1 | dancing | video1
2 | singing | video2
3 | prank | video3
4 | prank | video4
5 | funny | video3
6 | cover | video2
I want to show my users a list of related videos based on tag assignment: the more tags a certain video has in common with the user's video, the more "related" it is. The approach I currently use consists of 2 steps:
Get a list of the user's video's tags
q=*:*&fq=video:video1&fl=tag
Identify the videos that use the same tags as the user's video and select the top 10 (result set slicing is done on the application side)
q=*:*&fq=tag:tag1 AND tag:tag2 AND tag:tag3 AND !video:video1&fl=video&stats=true&stats.field=someotherfield&stats.facet=video
Note: I used stats instead of plain facet because I also need the sum of someotherfield.
This approach yields an average execution time of 30 seconds. Unfortunately, the maximum acceptable query time for my app is 10 seconds.
Is there a better approach to tackling this data requirement? I'm open to:
Alternative query approach (minor tweaks are preferred; but I can accept something as drastic as replacing my 2-step approach completely)
Alternative schema
Notes:
The actual schema has several other fields that I removed from this post for brevity
I do all read operations via Solr (Datastax Enterprise 4.6.0). Nothing fancy in the Solr schema
The table currently holds 1.5 billion rows, but could double or triple within a few years (so the solution must take the table/index size into account)
No fulltext search - only exact string filters
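For reference, the relatedness rule described in the question (count how many tags another video shares with the user's video, then take the top 10) can be written down in a few lines of Python over the sample rows. This is only an in-memory illustration of the intent, with a made-up function name `related_videos`; it says nothing about how to make the query fast at 1.5 billion rows.

```python
from collections import Counter

# (id, tag, video) rows from the example table in the question
rows = [
    ("1", "dancing", "video1"),
    ("2", "singing", "video2"),
    ("3", "prank", "video3"),
    ("4", "prank", "video4"),
    ("5", "funny", "video3"),
    ("6", "cover", "video2"),
]


def related_videos(rows, video, limit=10):
    """Rank other videos by the number of tags they share with `video`."""
    # Step 1: tags of the user's video
    tags = {tag for _id, tag, v in rows if v == video}
    # Step 2: count shared tags per other video
    shared = Counter(v for _id, tag, v in rows if tag in tags and v != video)
    return shared.most_common(limit)


related = related_videos(rows, "video3")
```

With the sample data, `video3` has the tags `prank` and `funny`, and only `video4` shares one of them.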

Can a value in AWS DynamoDB point to value in different table?

First off, I have very minimal experience with servers and databases (I have only used one once in my entire life and am only beginning to learn), and strictly speaking this is not exactly a "code" question because it concerns a concept in DynamoDB. But here it is, because I cannot find an answer to it no matter how much I search!
I am trying to make an application where users can see if their friends are "online" or not. There will be a table that keeps track of the users who are online and offline like this:
user_id | online
1 | O
2 | X
3 | O
and when user_id 1 who has friends 2 & 3 "refreshes", 1 would be able to see that 2 is offline and 3 is online. This would normally be done by batch_get in dynamodb, but each item I read would count as one unit, meaning if user1 had 20 friends, one refresh would use up 20 read units. To me, that would cost too much, and I thought that if I made a table for each user that would hold list of their friends that shows whether they are online or not, each refresh would cost only one read unit.
user_id | friends_on_off_line
1 | {2:X, 3:O}
2 | {1:O}
3 | {1:O}
However, the values in the list would have to be "pointers" to the first table, because I cannot update the values every time someone goes online or offline (if 1 went offline, I would have to write 1 as offline to both tables, and in the second table write it twice, using 3 write units, which would end up costing even more).
So I am trying to make it so that the values in the second table point to the first table, which would be read to determine whether each friend is online or offline, returning the values as a list using only 1 read unit, like this:
user_id | friends_on_off_line
1 | {pointer_to_2.online , pointer_to_3.online}
2 | {pointer_to_1.online}
3 | {pointer_to_1.online}
Is this possible in DynamoDB? If not, which service should I use and how can I make it possible?
Thanks in advance!

I don't think DynamoDB is the right tool for this kind of job.
SQL databases (MySQL/PostgreSQL) make this design easy: just use joins (the "pointers" you describe).
You can also look at this question regarding this area for MongoDB.
What you should ask yourself is what the most common questions the database needs to answer are, and what the update and read rates are. These questions usually point you in the right direction when picking a database.
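As a rough illustration of the join-based design, here is a sketch using Python's built-in `sqlite3` module as a stand-in for MySQL or PostgreSQL. The table and column names (`users`, `friendships`, `online`) are assumptions made for this example; online status is stored once per user, and the friend list only references user ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        online  INTEGER NOT NULL          -- 1 = online, 0 = offline
    );
    CREATE TABLE friendships (
        user_id   INTEGER NOT NULL REFERENCES users(user_id),
        friend_id INTEGER NOT NULL REFERENCES users(user_id),
        PRIMARY KEY (user_id, friend_id)
    );
    INSERT INTO users VALUES (1, 1), (2, 0), (3, 1);
    INSERT INTO friendships VALUES (1, 2), (1, 3), (2, 1), (3, 1);
""")


def friends_status(conn, user_id):
    """Return {friend_id: is_online} for all friends of `user_id` in one query."""
    rows = conn.execute(
        """
        SELECT f.friend_id, u.online
        FROM friendships AS f
        JOIN users AS u ON u.user_id = f.friend_id
        WHERE f.user_id = ?
        ORDER BY f.friend_id
        """,
        (user_id,),
    ).fetchall()
    return {friend_id: bool(online) for friend_id, online in rows}


status_before = friends_status(conn, 1)

# Going offline touches exactly one row, however many friends the user has.
conn.execute("UPDATE users SET online = 0 WHERE user_id = 3")
status_after = friends_status(conn, 1)
```

The join plays the role of the "pointer": the friend list never stores a copy of the status, so there is nothing to keep in sync.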

How to store sets of objects that have occurred together during events?

I'm looking for an efficient way of storing sets of objects that have occurred together during events, in such a way that I can generate aggregate stats on them on a day-by-day basis.
To make up an example, let's imagine a system that keeps track of meetings in an office. For every meeting we record how many minutes long it was and in which room it took place.
I want to get stats broken down both by person as well as by room. I do not need to keep track of the individual meetings (so no meeting_id or anything like that), all I want to know is daily aggregate information. In my real application there are hundreds of thousands of events per day so storing each one individually is not feasible.
I'd like to be able to answer questions like:
In 2012, how many minutes did Bob, Sam, and Julie spend in each conference room (not necessarily together)?
Probably fine to do this with 3 queries:
>>> query(dates=2012, people=[Bob])
{Board-Room: 35, Auditorium: 279}
>>> query(dates=2012, people=[Sam])
{Board-Room: 790, Auditorium: 277, Broom-Closet: 71}
>>> query(dates=2012, people=[Julie])
{Board-Room: 190, Broom-Closet: 55}
In 2012, how many minutes did Sam and Julie spend MEETING TOGETHER in each conference room? What about Bob, Sam, and Julie all together?
>>> query(dates=2012, people=[Sam, Julie])
{Board-Room: 128, Broom-Closet: 55}
>>> query(dates=2012, people=[Bob, Sam, Julie])
{Board-Room: 22}
In 2012, how many minutes did each person spend in the Board-Room?
>>> query(dates=2012, rooms=[Board-Room])
{Bob: 35, Sam: 790, Julie: 190}
In 2012, how many minutes was the Board-Room in use?
This is actually pretty difficult since the naive strategy of summing up the number of minutes each person spent will result in serious over-counting. But we can probably solve this by storing the number separately as the meta-person Anyone:
>>> query(dates=2012, rooms=[Board-Room], people=[Anyone])
865
What are some good data structures or databases that I can use to enable this kind of querying? Since the rest of my application uses MySQL, I'm tempted to define a string column that holds the (sorted) ids of each person in the meeting, but the size of this table will grow pretty quickly:
2012-01-01 | "Bob" | "Board-Room" | 2
2012-01-01 | "Julie" | "Board-Room" | 4
2012-01-01 | "Sam" | "Board-Room" | 6
2012-01-01 | "Bob,Julie" | "Board-Room" | 2
2012-01-01 | "Bob,Sam" | "Board-Room" | 2
2012-01-01 | "Julie,Sam" | "Board-Room" | 3
2012-01-01 | "Bob,Julie,Sam" | "Board-Room" | 2
2012-01-01 | "Anyone" | "Board-Room" | 7
What else can I do?

Your question is a little unclear, because you say you don't want to store each individual meeting, but then how are you getting the current meeting stats (dates)? In addition, any table with the right indexes can be very fast even with a lot of records.
You should be able to use a table like log_meeting. I imagine it could contain something like:
employee_id, room_id, date (as timestamp), time_in_meeting
where employee_id is a foreign key to the employee table and room_id is a foreign key to the room table.
If you index employee id, room id, and date (in that order), lookups should be pretty quick: MySQL multiple-column indexes are used left to right, so a single index serves searches on (employee id), (employee id + room id), and (employee id + room id + timestamp). This is explained further in the multiple-column index part of:
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
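A minimal sketch of such a `log_meeting` table with its composite index, using Python's built-in `sqlite3` as a stand-in for MySQL. The index name, the `meeting_date` column name, the helper `minutes_per_room`, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE room (room_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE log_meeting (
        employee_id     INTEGER NOT NULL REFERENCES employee(employee_id),
        room_id         INTEGER NOT NULL REFERENCES room(room_id),
        meeting_date    TEXT    NOT NULL,   -- ISO date, e.g. '2012-01-01'
        time_in_meeting INTEGER NOT NULL    -- minutes
    );
    -- One composite index; its leftmost prefixes also serve narrower lookups.
    CREATE INDEX idx_log_meeting
        ON log_meeting (employee_id, room_id, meeting_date);

    INSERT INTO employee VALUES (1, 'Bob'), (2, 'Sam');
    INSERT INTO room VALUES (1, 'Board-Room'), (2, 'Auditorium');
    INSERT INTO log_meeting VALUES
        (1, 1, '2012-01-01', 20),
        (1, 1, '2012-03-15', 15),
        (1, 2, '2012-06-01', 279),
        (2, 1, '2012-01-01', 790),
        (1, 1, '2013-01-01', 99);
""")


def minutes_per_room(conn, employee_name, year):
    """Total minutes one employee spent in each room during `year`."""
    rows = conn.execute(
        """
        SELECT r.name, SUM(l.time_in_meeting)
        FROM log_meeting AS l
        JOIN employee AS e ON e.employee_id = l.employee_id
        JOIN room AS r ON r.room_id = l.room_id
        WHERE e.name = ?
          AND l.meeting_date >= ?
          AND l.meeting_date < ?
        GROUP BY r.name
        """,
        (employee_name, f"{year}-01-01", f"{year + 1}-01-01"),
    ).fetchall()
    return dict(rows)


bob_2012 = minutes_per_room(conn, "Bob", 2012)
```

The sample rows were chosen so that Bob's 2012 totals match the first example query in the question.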

By refusing to store meetings (and related objects) individually, you are losing the original source of information.
You will not be able to compensate for this loss of data unless you regularly store the full list of all potential daily (or weekly or monthly or ...) aggregates that you might need to query later on!
Believe me, it's going to be a nightmare ...

If the number of people is constant and not very large, you can assign a column to each person (present or not) and store the room, date and time in 3 more columns; this removes the string-splitting problems.
Also, given the nature of your question, I feel you first need to assign IDs to everything: rooms, people, etc. There is no need for long repetitive strings in the DB. Try to reduce string operations and work with individual data in each column for better intersection performance. You can also store each combination of people in a table, assign it an id, and then use that id in the actual date and time table. But all these techniques require that something be constant, either the people or the rooms.

It is not clear to me whether you know all the "questions" at design time or whether new ones can be added during development or in production; the latter would require keeping all the data all the time.
If you do know all your questions, this looks like a classic "banking system" that recalculates its data on a daily basis.
Here is how I think about it.
It seems you have a limited number of rooms, people, days, etc.
Gather the logging data on a daily basis, one table per day: one event per database row, with every field you need.
Analyse the data with a cron script at "midnight".
Update the stats for people, rooms, etc.: just increment the time spent by Bob in room xyz and so on, for everything your requirements need.
Because the analysed (compressed) data is limited and relatively small, your system can also support a variety of queries, since the indexes stay relatively small.
You could also use a scalable map/reduce algorithm.
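The nightly "analyse and increment" step might look something like the following Python sketch. The event layout `(person, room, minutes)` and the function name `summarize_day` are assumptions for illustration; in practice the running totals would live in a stats table rather than in memory.

```python
from collections import Counter


def summarize_day(events):
    """Collapse one day's raw events into minutes per (person, room)."""
    totals = Counter()
    for person, room, minutes in events:
        totals[(person, room)] += minutes
    return totals


# Running statistics, updated once per day by the scheduled job.
running = Counter()

day1 = [
    ("Bob", "Board-Room", 20),
    ("Sam", "Board-Room", 20),
    ("Bob", "Auditorium", 30),
]
day2 = [("Bob", "Board-Room", 15)]

for day_events in (day1, day2):
    # Counter.update adds the day's totals to the running totals.
    running.update(summarize_day(day_events))
```

After both days, the raw per-event rows could be dropped and only `running` kept, which is what keeps the stored data small.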

You can't avoid storing the atomic facts as follows: (the meeting room, the people, the duration, the day), which is probably only a weak consolidation when the same people meet multiple times in the same room on the same day. Maybe that happens a lot in your office :).
Making groups comparable is an interesting problem, but as long as you always compose the member strings the same, you can probably do it with string comparisons. This is not "normal" however. To normalise you'll need a relation table (many to many) and compose a temporary table out of your query set so it joins quickly, or use an "IN" clause and a count aggregate to ensure everyone is there (you'll see what I mean when you try it).
I think you can derive the minutes the board room was in use as meetings shouldn't overlap, so a sum will work.
For storage efficiency, use integer keys for everything with lookup tables. Dereference the integers during the query parsing, or just use good old joins if you are feeling traditional.
That's how I would do it anyway :).
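The "IN clause plus count aggregate" idea can be sketched as follows, again with Python's `sqlite3` standing in for MySQL. The `meeting` and `attendance` tables, the integer ids, and the helper `minutes_together` are all hypothetical names for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE meeting (
        meeting_id  INTEGER PRIMARY KEY,
        room_id     INTEGER NOT NULL,
        meeting_day TEXT    NOT NULL,
        minutes     INTEGER NOT NULL
    );
    -- many-to-many relation between meetings and people
    CREATE TABLE attendance (
        meeting_id INTEGER NOT NULL REFERENCES meeting(meeting_id),
        person_id  INTEGER NOT NULL,
        PRIMARY KEY (meeting_id, person_id)
    );
    -- people: 1 = Bob, 2 = Sam, 3 = Julie
    -- rooms:  1 = Board-Room, 2 = Broom-Closet
    INSERT INTO meeting VALUES
        (1, 1, '2012-01-01', 22),
        (2, 1, '2012-01-02', 106),
        (3, 2, '2012-01-03', 55);
    INSERT INTO attendance VALUES
        (1, 1), (1, 2), (1, 3),
        (2, 2), (2, 3),
        (3, 2), (3, 3);
""")


def minutes_together(conn, person_ids):
    """Minutes per room for meetings attended by ALL of `person_ids`."""
    ids = sorted(set(person_ids))
    placeholders = ", ".join("?" for _ in ids)
    sql = f"""
        SELECT m.room_id, SUM(m.minutes)
        FROM meeting AS m
        WHERE m.meeting_id IN (
            SELECT a.meeting_id
            FROM attendance AS a
            WHERE a.person_id IN ({placeholders})
            GROUP BY a.meeting_id
            -- the count check ensures everyone in the set was present
            HAVING COUNT(DISTINCT a.person_id) = ?
        )
        GROUP BY m.room_id
    """
    return dict(conn.execute(sql, (*ids, len(ids))).fetchall())


sam_and_julie = minutes_together(conn, [2, 3])
all_three = minutes_together(conn, [1, 2, 3])
```

The sample rows are chosen so the results line up with the "meeting together" examples in the question: 128 and 55 minutes for Sam and Julie, 22 minutes for all three.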

You'll probably have to store individual meetings to get the data you need anyway.
However you'll have to make sure you aggregate and anonymise it properly before creating your reports. Make sure to separate concerns and access levels to stay within the proper legal limits on data.

relational database design for multidimensional matrix questions

I am designing a relational DB for an online survey.
However, I am not sure what is the best relational database design for storing multidimensional matrix questions.
Let's say, I have the following question (sorry, it does not allow me to insert HTML table):
What was your experience of...
----------| Not friendly| (2) |Very friendly|Length of stay|Visited in the last year?|
Sydney |radio button | rb | rb | drop down | check box |
--------------------------------------------------------------------------------------
New York | rb | rb | rb | drop down | check box |
--------------------------------------------------------------------------------------
London | rb | rb | rb | drop down | check box |
--------------------------------------------------------------------------------------
Do you think I should do something along the following lines or is there a better way?
To hold all the questions:
Question
    questionID
    question
QuestionMatrix2d
    matrix2dID
    questionID
    subquestionID
    subquestion
QuestionMatrix
    questionID
    matrix2dID
    question_parentID
And to hold all the responses:
QuestionResponse
    questionID
    response_code
QuestionMatrix2dResponse
    questionID
    subquestionID
    response_code
Thank you for your help.

I disagree with ryan1234. This totally is a relational problem, and there is very little reason not to put it into a database.
I have to do a bit of guesswork, though, as to what you're trying to achieve here. You have an online survey, so I assume it will be used by more than one person. Your database will need to accommodate that by having a session or user table; I'll go with the latter since it is clearer to read.
Secondly, you have a list of locations (Sydney, New York, London). I assume this list can either change over time or even from one questionnaire to the next.
Then you have a set of questions. You don't explicitly state that these would be variable or fixed. Since you designed a set of tables for that, I assume it's supposed to be variable. Please note that your questions are not a matrix, but a list. Even if they are hierarchical, they still do not compose a matrix.
Last but not least you've got answers to those questions.
Let's create a users table:
user_id user_name
1 me
2 somebody else
Second table is as simple: locations
location_id location_name
1 Sydney
2 New York
3 London
Third table is a bit more complicated - and to be honest: just plain ugly. But this is what you get if you design a database in a database, and the alternatives (using DDL or storing that information in XML/JSON or even outside the database) are not pretty either. If there is a hierarchical question (your examples don't show them), you could add a "parent_question_id" column.
question_id question_text question_type question_type_info
1 How do you rate RADIO 0 to 5
2 Length of stay COMBOBOX 1 day, 2 days, whatever
3 Visited last year CHECKBOX
Finally you need a fourth table to store all the answers
user_id location_id question_id value
1 1 1 2 <-- value here means "rating of 2"
1 1 2 5 <-- value here means "5 days"
1 1 3 1 <-- value here means "yes, visited last year"
Yep. ugly as well. If you had a fixed list of questions I could provide you with a pretty database :)
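As a rough sketch, the four tables above could be created and queried like this, using Python's built-in `sqlite3`. The table names `questions` and `answers`, the column types, and the constraints are guesses made for illustration; the answer only gives the column names and sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_name TEXT NOT NULL);
    CREATE TABLE locations (
        location_id INTEGER PRIMARY KEY, location_name TEXT NOT NULL
    );
    CREATE TABLE questions (
        question_id        INTEGER PRIMARY KEY,
        question_text      TEXT NOT NULL,
        question_type      TEXT NOT NULL,
        question_type_info TEXT
    );
    CREATE TABLE answers (
        user_id     INTEGER NOT NULL REFERENCES users(user_id),
        location_id INTEGER NOT NULL REFERENCES locations(location_id),
        question_id INTEGER NOT NULL REFERENCES questions(question_id),
        value       INTEGER NOT NULL,
        PRIMARY KEY (user_id, location_id, question_id)
    );

    INSERT INTO users VALUES (1, 'me'), (2, 'somebody else');
    INSERT INTO locations VALUES (1, 'Sydney'), (2, 'New York'), (3, 'London');
    INSERT INTO questions VALUES
        (1, 'How do you rate', 'RADIO', '0 to 5'),
        (2, 'Length of stay', 'COMBOBOX', '1 day, 2 days, whatever'),
        (3, 'Visited last year', 'CHECKBOX', NULL);
    INSERT INTO answers VALUES (1, 1, 1, 2), (1, 1, 2, 5), (1, 1, 3, 1);
""")


def answers_for_user(conn, user_id):
    """Return (location, question, value) rows for one respondent."""
    return conn.execute(
        """
        SELECT l.location_name, q.question_text, a.value
        FROM answers AS a
        JOIN locations AS l ON l.location_id = a.location_id
        JOIN questions AS q ON q.question_id = a.question_id
        WHERE a.user_id = ?
        ORDER BY a.location_id, a.question_id
        """,
        (user_id,),
    ).fetchall()


my_answers = answers_for_user(conn, 1)
```

The composite primary key on `answers` is one possible way to ensure each respondent answers each question at most once per location.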
Edit: In answer to your comments: to link your questions to a survey, you'll need a few more tables, starting with a surveys table, that define which questions are going to be asked for which locations. The following database layout lets you specify a list of locations and a list of questions asked, as well as a survey name.
Table surveys:
survey_id survey_name
1 Spring 2013 London Travel Survey
2 Spring 2013 Northern Hemisphere Short Survey
Table survey_questions:
survey_id question_id
1 1
1 2
1 3
2 1
Table survey_locations:
survey_id location_id
1 3
2 2
2 3
The contents I put in here give you two surveys. Survey #1 will ask all three questions about just one location: 'London'. Survey #2 will ask just one question about both London and New York. If you want to ask different questions about different locations, your table layout will have to accommodate that, but such a system won't fit into your original table-like layout.

Having done things similar to this, I would recommend considering not turning this into a relational problem. What if you have objects and just serialize them to something like JSON and store that?
Doing this relationally you'll end up spending quite a bit of time making tables and wiring together complex drawing code in your application to make sure the questions/answers draw in the right order etc.
Otherwise I think you can make your approach work. There is no silver bullet for designing survey stuff in an RDBMS.
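For comparison, the serialize-to-JSON idea could look roughly like this in Python. The dictionary layout (keys such as `rows`, `columns`, `type`) is invented for this example and is only one of many possible shapes; the labels come from the matrix question above.

```python
import json

# Hypothetical in-memory representation of the matrix question
question = {
    "text": "What was your experience of...",
    "rows": ["Sydney", "New York", "London"],
    "columns": [
        {
            "label": "Friendliness",
            "type": "radio",
            "options": ["Not friendly", "(2)", "Very friendly"],
        },
        {"label": "Length of stay", "type": "dropdown"},
        {"label": "Visited in the last year?", "type": "checkbox"},
    ],
}

# Store the whole question as one text value (e.g. in a single column)...
stored = json.dumps(question)
# ...and rebuild the object when the survey is rendered.
restored = json.loads(stored)
```

The trade-off is that the database can no longer query individual sub-questions directly; the application has to load and interpret the JSON.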

Human Name lookup / translation

I am working on a requirement to match people from different databases. One tricky problem is the variance in names, like Bob - Robert, Jim - James, Lizzy - Elizabeth, etc., across databases.
Is there a lookup/translation available for this kind of requirement?

Take a look at my answer (as well as the others) here:
Tools for matching name/address data

You'd need to implement a lookup table with the alternate names in it:
Base | Alternate
----------------
Robert | Bob
Elizabeth | Liz
Elizabeth | Lizzy
Elizabeth | Beth
Then search the database for the base name and all alternates. You'll end up with a number of multiple matches which will then need to be checked to see if they really match based on a comparison of whatever other data you have in the two databases. Maybe the dates of the records in each database could be used - records entered close in time indicate the same person.
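A small Python sketch of the lookup described above, using the rows of the example table. The function names are made up for illustration, matching is case-insensitive by assumption, and a name that belongs to more than one base name is not handled here.

```python
# (base, alternate) pairs from the lookup table above
ALTERNATES = [
    ("Robert", "Bob"),
    ("Elizabeth", "Liz"),
    ("Elizabeth", "Lizzy"),
    ("Elizabeth", "Beth"),
]


def build_index(pairs):
    """Map every known name (lower-cased) to the full set of its variants."""
    groups = {}
    for base, alternate in pairs:
        group = groups.setdefault(base.lower(), {base.lower()})
        group.add(alternate.lower())
    index = {}
    for names in groups.values():
        for name in names:
            index[name] = names
    return index


def name_variants(name, index):
    """All names to search for: the base name and every alternate."""
    key = name.strip().lower()
    return index.get(key, {key})


def may_match(name_a, name_b, index):
    """True if the two first names could refer to the same person."""
    return name_b.strip().lower() in name_variants(name_a, index)


INDEX = build_index(ALTERNATES)
variants = name_variants("Lizzy", INDEX)
```

A positive `may_match` result only identifies a candidate pair; as noted above, it still has to be confirmed against whatever other data the two databases share.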
