I would like to store some information as follows (note, I'm not wedded to this data structure at all, but this shows you the underlying information I want to store):
{ user_id: 12345, page_id: 2, country: 'DE' }
In these records, user_id is a unique field, but the page_id is not.
I would like to translate this into a Redis data structure, and I would like to be able to run efficient searches as follows:
For user_id 12345, find the related country.
For page_id 2, find all related user_ids and their countries.
Is it actually possible to do this in Redis? If so, what data structures should I use, and how should I avoid the possibility of duplicating records when I insert them?
It sounds like you need two key types: a HASH key to store your user's data, and a LIST for each page that contains a list of related users. Below is an example of how this could work.
Load Data:
> RPUSH page:2:users 12345
> HMSET user:12345 country DE key2 value2
Pull Data:
# All users for page 2
> LRANGE page:2:users 0 -1
# All users for page 2 and their countries
> SORT page:2:users By nosort GET # GET user:*->country GET user:*->key2
Remove User From Page:
> LREM page:2:users 0 12345
Repeat GETs in the SORT to retrieve additional values for the user.
I hope this helps; let me know if there's anything you'd like clarified or if you need further assistance. I also recommend reading the command list and documentation available on the Redis website, especially the section on the SORT operation.
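Since user_id is unique and each user has exactly one country, keep them in a simple key-value pair. Querying for a user is O(1) in that case. Then keep one Redis set per page, with the page_id in the key and the user_ids as members. Because set members are unique, adding the same user_id twice does not create a duplicate.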
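To make the key-value plus set layout concrete, here is a minimal sketch that models it with plain Python containers instead of a live Redis server. The names user_country, page_users, add_record, country_of and users_on_page are made up for illustration; the comments note which Redis commands (SET/GET, SADD/SMEMBERS) each step is assumed to correspond to.

```python
# In-memory stand-ins for the Redis keys (all names here are illustrative):
#   user_country[user_id] ~ a string key per user, written with SET, read with GET
#   page_users[page_id]   ~ a set key per page, written with SADD, read with SMEMBERS
user_country = {}
page_users = {}


def add_record(user_id, page_id, country):
    """Store one record; inserting the same record twice changes nothing."""
    user_country[user_id] = country                    # like SET: overwrites in place
    page_users.setdefault(page_id, set()).add(user_id)  # like SADD: members are unique


def country_of(user_id):
    """Lookup 1: the country for a given user_id (None if unknown)."""
    return user_country.get(user_id)


def users_on_page(page_id):
    """Lookup 2: every user_id on a page, mapped to that user's country."""
    return {uid: user_country[uid] for uid in page_users.get(page_id, set())}


add_record(12345, 2, "DE")
add_record(12345, 2, "DE")  # repeated insert, no duplicate is created
print(country_of(12345))
print(users_on_page(2))
```

The same shape should carry over to Redis itself: one string (or hash) per user and one set per page, with the second lookup done as SMEMBERS followed by a batched fetch of the user keys.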
I have an interesting problem. We receive feed files from our customers, which contain products along with their information. We log each feed request received from our customers in a database.
The problem is that, given a feed file, we need to find all the feed requests that have the same list of products as the given feed file. Every feed request has nearly 2 million candidate feeds for matching.
Let me summarize the problem, just to make sure that we are on the same page.
The application may get a Feed Request (FR), which contains a list of products. Every time that happens, you log the FR in the db, and in addition you want to check for all past FRs that contained the same product set. Is that right?
If so, an idea is to generate a hash key for a list of products within a FR. In that way every FR in db, has its own hash - which corresponds to list of products this FR contained.
E.g.:
A Feed Request comes to the app containing products 2, 1, 3. The
app sorts the product identities: [1, 2, 3], and then generates a hash:
h([1, 2, 3]) = abc. Then, to look for previous FRs with
the same product set, all you need is a query: "get all records from feed requests where
the hash is equal to 'abc'".
Such a comparison is not very expensive if you index the hash column, even if there are millions of records.
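The sort-then-hash idea can be sketched in a few lines. This is only an illustration: the function name product_set_hash is made up, SHA-256 is an arbitrary choice of hash, and de-duplicating the ids is an assumption based on the "same product set" wording above.

```python
import hashlib


def product_set_hash(product_ids):
    """Return an order-independent digest for a collection of product ids."""
    # Sort (and, by assumption, de-duplicate) so that [2, 1, 3] and
    # [1, 2, 3] produce the same canonical string before hashing.
    canonical = ",".join(str(pid) for pid in sorted(set(product_ids)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Both orderings of the same products yield the same hash value.
print(product_set_hash([2, 1, 3]) == product_set_hash([1, 2, 3]))
```

The digest would be stored in an indexed column next to each feed request, so finding earlier requests with the same products becomes an equality lookup on that column.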
I know you can pass a key or a range to return records in CouchDB, but I want to do something different: find the records that match any of a given list of values.
So for example, in regular SQL, let's say I wanted to return the records with ids 5, 7, 29 and 102. I would do something like this:
SELECT * FROM sometable WHERE id = 5 OR id = 7 OR id = 29 OR id = 102
Is it possible to do this in CouchDB, where I toss all the values I want to find in the key array, and then CouchDB searches for all of those records that could exist in the "key parameter"?
You can do a POST, as documented on the CouchDB wiki. You pass the list of keys in the body of the request.
{"keys": ["key1", "key2", ...]}
The downside is that a POST request is not cached by the browser.
Alternatively, you can obtain the same response using a GET with the keys parameter. For example, you can query the view _all_docs with:
/DB/_all_docs?keys=["ID1","ID2"]&include_docs=true
which, properly URL encoded, becomes:
/DB/_all_docs?keys=%5B%22ID1%22,%22ID2%22%5D&include_docs=true
This should give better cacheability, but keep in mind that _all_docs changes on every document update. Sometimes you can work around this by defining your own view that contains only the documents you need.
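As a small sketch of both variants, the following builds the URL-encoded GET path and the POST body for a list of document ids using only the Python standard library. The helper names all_docs_url and all_docs_post_body are made up for illustration, and actually sending the request is left out.

```python
import json
from urllib.parse import quote


def all_docs_url(db, doc_ids):
    """Build the GET path for _all_docs with a URL-encoded keys parameter."""
    # Compact JSON array, e.g. ["ID1","ID2"], without spaces after commas.
    keys_json = json.dumps(list(doc_ids), separators=(",", ":"))
    # Percent-encode brackets and quotes; leave commas readable.
    return f"/{db}/_all_docs?keys={quote(keys_json, safe=',')}&include_docs=true"


def all_docs_post_body(doc_ids):
    """Build the JSON body for the POST variant."""
    return json.dumps({"keys": list(doc_ids)})


print(all_docs_url("DB", ["ID1", "ID2"]))
print(all_docs_post_body(["key1", "key2"]))
```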
With a straight view function, this will not be possible. However, you can use a _list function to accomplish the same result.
My structure
cat:id:name -> name of category
cat:id:subcats -> set of subcategories
cat:list -> list of category ids
The following gives me a list of cat ids:
lrange cat:list 0 -1
Do I have to iterate each id from the above command to get the name field in my script? Because that seems inefficient. How can I get a list of category names from redis?
There are a couple of different approaches. You may want the values in the list to be delimited or encoded strings that contain the id, the name, and any other value you need quick access to. I recommend JSON for interoperability and reasonable string length, but there are other formats which are more performant.
Another option is to, like you said, iterate. You can make this more efficient by getting all your keys in a single request and then using MGET, pipelining, or MULTI/EXEC to fetch all the names in a single, efficient, operation.
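A minimal sketch of the iterate-then-MGET approach follows. The function category_names expects a client object exposing lrange(name, start, end) and mget(keys), which is how the redis-py client names these commands; with redis-py this assumes the client was created with decode_responses=True, so that ids come back as strings rather than bytes. FakeRedis is a hypothetical in-memory stand-in used only so the example runs without a server.

```python
def category_names(client):
    """Fetch every category name in two round trips: one LRANGE, one MGET."""
    ids = client.lrange("cat:list", 0, -1)
    if not ids:
        return []  # MGET requires at least one key
    # Assumes ids are str (e.g. decode_responses=True in redis-py).
    keys = [f"cat:{cat_id}:name" for cat_id in ids]
    return client.mget(keys)


class FakeRedis:
    """Hypothetical stand-in for a Redis client, for demonstration only."""

    def __init__(self, lists, strings):
        self.lists = lists
        self.strings = strings

    def lrange(self, name, start, end):
        items = self.lists.get(name, [])
        # Redis treats the end index as inclusive; -1 means the last element.
        return items[start:] if end == -1 else items[start:end + 1]

    def mget(self, keys):
        return [self.strings.get(key) for key in keys]


client = FakeRedis(
    {"cat:list": ["1", "2"]},
    {"cat:1:name": "Books", "cat:2:name": "Music"},
)
print(category_names(client))
```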
Consider an e-commerce application with multiple stores. Each store owner can edit the item catalog of his store.
My current database schema is as follows:
item_names: id | name | description | picture | common(BOOL)
items: id | item_name_id | picture | price | description
item_synonyms: id | item_name_id | name | error(BOOL)
Notes: error indicates a wrong spelling (eg. "Ericson"). description and picture of the item_names table are "globals" that can optionally be overridden by "local" description and picture fields of the items table (in case the store owner wants to supply a different picture for an item). common helps separate unique item names ("Jimmy Joe's Cheese Pizza" from "Cheese Pizza")
I think the bright side of this schema is:
Optimized searching & Handling Synonyms: I can query the item_names & item_synonyms tables using name LIKE '%QUERY%' and obtain the list of item_name_ids that need to be joined with the items table. (Examples of synonyms: "Sony Ericsson", "Sony Ericson", "X10", "X 10")
Autocompletion: Again, a simple query to the item_names table. I can avoid the usage of DISTINCT and it minimizes number of variations ("Sony Ericsson Xperia™ X10", "Sony Ericsson - Xperia X10", "Xperia X10, Sony Ericsson")
The down side would be:
Overhead: When inserting an item, I query item_names to see if this name already exists. If not, I create a new entry. When deleting an item, I count the number of entries with the same name. If this is the only item with that name, I delete the entry from the item_names table (just to keep things clean; accounts for possible erroneous submissions). And updating is the combination of both.
Weird Item Names: Store owners sometimes use sentences like "Harry Potter 1, 2 Books + CDs + Magic Hat". There's something off about having so much overhead to accommodate cases like this. This would perhaps be the prime reason I'm tempted to go for a schema like this:
items: id | name | picture | price | description
(... with item_names and item_synonyms as utility tables that I could query)
Is there a better schema you would suggest?
Should item names be normalized for autocomplete? Is this probably what Facebook does for "School", "City" entries?
Is the first schema or the second better/optimal for search?
Thanks in advance!
References: (1) Is normalizing a person's name going too far?, (2) Avoiding DISTINCT
EDIT: In the event of 2 items being entered with similar names, an Admin who sees this simply clicks "Make Synonym" which will convert one of the names into the synonym of the other. I don't require a way to automatically detect if an entered name is the synonym of the other. I'm hoping the autocomplete will take care of 95% of such cases. As the table set increases in size, the need to "Make Synonym" will decrease. Hope that clears the confusion.
UPDATE: To those who would like to know what I went ahead with... I've gone with the second schema but removed the item_names and item_synonyms tables in hopes that Solr will provide me with the ability to perform all the remaining tasks I need:
items: id | name | picture | price | description
Thanks everyone for the help!
The requirements you state in your question ("Optimized searching", "Handling Synonyms" and "Autocompletion") are not things that are generally associated with an RDBMS. It sounds like what you're trying to solve is a searching problem, not a data storage and normalization problem. You might want to start looking at search architectures like Solr.
Excerpted from the solr feature list:
Faceted Searching based on unique field values, explicit queries, or date ranges
Spelling suggestions for user queries
More Like This suggestions for given document
Auto-suggest functionality
Performance Optimizations
If there were more attributes exposed for mapping, I would suggest using a fast search index system. No need to set aliases up as the records are added, the attributes simply get indexed and each search issued returns matches with a relevance score. Take the top X% as valid matches and display those.
Creating and storing aliases seems like a brute-force, labor intensive approach that probably won't be able to adjust to the needs of your users.
Just an idea.
One thing that comes to mind is sorting the characters in the name and in each synonym, throwing away all whitespace. This is similar to the solution for finding all anagrams of a word. The end result is the ability to quickly find similar entries. As you pointed out, all synonyms should converge into one single term, or name. The search is then performed against the synonyms, again using the sorted input string.
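A rough sketch of that sorted-character key is below. The function name sorted_key is made up, and lowercasing the input is an added assumption that the answer above does not mention.

```python
def sorted_key(name):
    """Build an anagram-style lookup key: drop whitespace, then sort the characters."""
    # Lowercasing is an assumption added here so that case differences
    # do not produce different keys.
    compact = "".join(name.split()).lower()
    return "".join(sorted(compact))


# Variants that differ only in spacing or character order share a key.
print(sorted_key("X10") == sorted_key("X 10"))
# Variants with a missing letter do not.
print(sorted_key("Sony Ericsson") == sorted_key("Sony Ericson"))
```

Note that this only catches variants made of exactly the same characters, so a misspelling that drops or adds a letter still needs an explicit synonym entry.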
The title is a bit awkward but I couldn't find a better one. My problem is as follows:
I have several users stored as documents, and for each document I am storing several key-value pairs, or items (each of which has an id). Now, if I apply highlighting with hl.snippets=5 I can get the first 5 items. But every user could have several hundred items, so
you will not get the most relevant 5 items. You will get the first 5 items ...
Another problem is that
the highlighted text won't contain the id, so retrieving additional information about the highlighted item is ugly.
Example where items are emails:
user1 has item1 { text:"developers developers developers", id:1, title:"ms" }
item2 { text:"c# development", id:2, title:"nice!" }
...
item77 ...
user2 has item1 { text:"nice restaurant", id:3, title:"bla"}
item2 { text:"best cafe", id:4, title:"blup"}
...
item223 ...
Now if I use highlighting for the text field and query against "restaurant" I get user2 and the text nice <b>restaurant</b>. But how can I determine the id of the highlighted text to display e.g. the title of this item? And what happens if more relevant items are listed at the end of the item-list? Highlighting won't display those ...
So how can I find the best items of a document that contains many such items?
I added my two findings as answers, but as I will point out each of them has its own drawbacks.
Could anyone point me to a better solution?
One of my rules of thumb for designing Solr schemas is: the document is what you will search for.
If you want to search for 'items', then these 'items' are your documents. How you store other stuff, like 'users', is secondary. So 'users' could be in another index like you mentioned, they could be "denormalized" (e.g. their information duplicated in each document), in a relational database, etc. depending on RDBMS availability, how many 'users' there are, how many fields these 'users' have, etc.
EDIT: now you explain that the 'items' are emails, and a possible search is 'restaurant X' and you want to find the best 'items' (emails). Therefore, the document is the email. The schema could be as simple as this: (id, title, text, user).
You could enable highlighting to get snippets of the 'text' or 'title' fields matching the 'restaurant X' query.
If you want to give the end-user information about the users that wrote about 'restaurant X', you could facet on the 'user' field. Then the end-user would see that John wrote 10 emails about 'restaurant X' and Robert wrote 6. The end-user thinks "This John dude must know a lot about this restaurant", so he drills down into a search for 'restaurant X' with a filter query user:John.
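To illustrate, here is a sketch that builds the query string for such a search with the standard Solr request parameters q, hl, hl.fl, facet, facet.field and fq. The helper name email_search_params is made up, and the field names come from the example schema (id, title, text, user); the core URL and the request itself are left out.

```python
from urllib.parse import urlencode


def email_search_params(query, user=None):
    """Build a Solr query string: highlight text/title and facet on user."""
    params = [
        ("q", query),
        ("hl", "true"),           # enable highlighting
        ("hl.fl", "text,title"),  # fields to produce snippets for
        ("facet", "true"),        # enable faceting
        ("facet.field", "user"),  # count matching emails per user
    ]
    if user is not None:
        # Drill down: restrict the results to one user with a filter query.
        params.append(("fq", f"user:{user}"))
    return urlencode(params)


print(email_search_params("restaurant X"))
print(email_search_params("restaurant X", user="John"))
```

The first call corresponds to the initial search with per-user counts; the second is the drill-down once the end-user has picked John.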
You could use two indices: users->items as described in the question, and an index with 'pure items' referencing back to the user.
Then you will need 2 queries (that's the reason I called the question '2d Search in Solr'):
query the user index => list of e.g. 10 users
query the items index for each user from step 1 => best items
Assume the following example:
userA emails are "restaurant X is bad but restaurant X is cheap", "different topic", "different topicB" and
userB emails are "restaurant X is not nice", "revisited restaurant X and it was ok now", "again in restaurant X and I think it is the best".
Now I query the user index for "restaurant X" and the first user will be userB, which is what I want. If I queried only the item index, I would get item1 of the less relevant userA.
Drawbacks:
bad performance, because you will need one query against the user index and e.g. 10 more to get the most relevant items for each user.
maintaining two indices.
Update: to avoid many queries I will try the following: use the user index to get some highlighted snippets, and then offer a 'get relevant items' button for every user, which triggers a query against the item index.
You can use the collapse patch and store each item as a separate document linking back to the user.
The problem with that approach is that you won't get the most relevant user, i.e. the most relevant item is not necessarily from the most relevant user (because that user can have several slightly less relevant items).
See the "Assume the following example:" part in my second answer.