"2d Search" in Solr or how to get the best item of the multivalued field 'items'? - solr

The title is a bit awkward but I couldn't found a better one. My problem is as follows:
I have several users stored as documents and I am storing several key-value-pairs or items (which have an id) for each document. Now, if I apply highlighting with hl.snippets=5 I can get the first 5 items. But every user could have several hundreds items, so
you will not get the most relevant 5 items. You will get the first 5 items ...
Another problem is that
the highlighted text won't contain the id and so retrieving additional information of the highlighted item text is ugly.
Example where items are emails:
user1 has item1 { text:"developers developers developers", id:1, title:"ms" }
item2 { text:"c# development", id:2, title:"nice!" }
...
item77 ...
user2 has item1 { text:"nice restaurant", id:3, title:"bla"}
item2 { text:"best cafe", id:4, title:"blup"}
...
item223 ...
Now if I use highlighting for the text field and query against "restaurant" I get user2 and the text nice <b>restaurant</b>. But how can I determine the id of the highlighted text to display e.g. the title of this item? And what happens if more relevant items are listed at the end of the item-list? Highlighting won't display those ...
So how can I find the best items of a documents with multiple such items?
I added my two findings as answers, but as I will point out each of them has its own drawbacks.
Could anyone point me to a better solution?

One of my rules of thumb for designing Solr schemas is: the document is what you will search for.
If you want to search for 'items', then these 'items' are your documents. How you store other stuff, like 'users', is secondary. So 'users' could be in another index like you mentioned, they could be "denormalized" (e.g. their information duplicated in each document), in a relational database, etc. depending on RDBMS availability, how many 'users' there are, how many fields these 'users' have, etc.
EDIT: now you explain that the 'items' are emails, and a possible search is 'restaurant X' and you want to find the best 'items' (emails). Therefore, the document is the email. The schema could be as simple as this: (id, title, text, user).
You could enable highlighting to get snippets of the 'text' or 'title' fields matching the 'restaurant X' query.
If you want to give the end-user information about the users that wrote about 'restaurant X', you could facet the 'user' field. Then the end-user would see that John wrote 10 emails about 'restaurant X' and Robert wrote 6. The end-user thinks "This John dude must know a lot about this restaurant" so he drills down into a search by 'restaurant x' with a filter query user:John

You could use use two indices: users->items as described in the question and an index with 'pure items' referencing back to the user.
Then you will need 2 queries (thats the reason I called the question '2d Search in Solr'):
query the user index => list of e.g. 10 users
query the items index for each user of the 1. step => best items
Assume the following example:
userA emails are "restaurant X is bad but restaurant X is cheap", "different topic", "different topicB" and
userB emails are "restaurant X is not nice", "revisited restaurant X and it was ok now", "again in restaurant X and I think it is the best".
Now I query the user index for "restaurant X" and the first user will be userB, which is what I want. If I would query only the item-index I would get the item1 of less relevant userA.
Drawbacks:
bad performance, because you will need one query against the user index and e.g. 10 more to get the most relevant items for each user.
maintaining two indices.
Update to avoid many queries I will try the following: using the user index to get some highlighted snippets and then offering a 'get relevant items'-button for every user which then triggers a query against the item index.

You can use the collapse patch and store each item as separate document linking back to the user.
The problem of that approach is that you won't get the most relevant user. Ie. the most relevant item is not necessarily from the most relevant user (because he can have several slightly less relevant items)
See the "Assume the following example:" part in my second answer.

Related

How to find the pointer node of a relationship using full text search on the edge node in neo4j

This question is related to Neo4j databases. Suppose I have a relationship (employee)-[WORKS-IN]->(company).. Imagine an employee works in multiple companies. I should be able to find the companies that a specific employee is working using full text search in neo4j. I'll be searching from the users name and I should be able to return company nodes..how to do that??
Full text search must be used.
So you want to search for a Person by name with full text and then retrieve the companies he worked for.
Compare this easily with the default Movies graph in Neo4j, you want to search for a Person by name with full text and then retrieve the movies the person acted in .
CALL db.index.fulltext.queryNodes('Person', 'kea*')
YIELD node
MATCH (node)-[:ACTED_IN]->(movie)
RETURN node.name, movie.title
This is an example when I created this node:
CREATE (e:Employee {name: 'Nirmana Testing'})
Then create the full text index on Employee.name
CREATE FULLTEXT INDEX employeeNameIdx FOR (e:Employee) ON EACH [e.name]
Then run a query using this full text index. Noted that the keyword 'nirmana' can be upper case or any case.
CALL db.index.fulltext.queryNodes("employeeNameIdx", "nirmana") YIELD node as employee
MATCH (employee)-[:WORKS-IN]->(company:Company)
RETURN employee, company
reference:
https://neo4j.com/docs/cypher-manual/current/indexes-for-full-text-search/
Thank you very much. Sorted it out. And one more thing. Suppose for a particular worker there can be various relationships except [WORKS-IN] , such as [PART TIME WORKER] , [FREELANCER], [PROJECT MANAGER] and so on. So for a particular user, If we want to find the place or company that he is working, freelancing, managing projects by searching the relationship type how could it be done using full text search.

How do i properly design a data model for Recipe / Ingredient using MongoDB

Recently i have designed a database model or ERD using Hackalode.
So the problem I'm currently facing is that base on my current design, i can't query it correctly as I wanted. I studied ERD with MYSQL and do know that Mongo doesn't work the same
The idea was simple, I want a recipe that has a array list of ingredients, and the ingredients are from separate collection.
The recipe also consist of measurement of the ingredient ie. (1 tbps sugar)
Can also query from list of ingredients and find the recipe that contains the ingredients
I wanted this collections to be in Many to Many relationship and the recipe can use the ingredients that are already in the database.
I just don't know how to query the data
I have tried a lot of ways by using $elemMatch and populate and all i get is empty array list as a result.
Im expecting two types of query where i can query by name of ingredients or by the recipe
My expectation result would be like this
[{
id: ...,
name: ....,
description: ...,
macros: [...],
ingredients: [
{
id,
amount: ....,
unit: ....
ingredient: {
id: ....,
name: ....
}
}
}, { ... }]
But instead of getting
[]
Imho, your design is utterly wrong. You over normalized your data. I would do something much simpler and use embedding. The reasoning behind that is that you define your use cases first and then you model your data to answer the question arising from your use cases in the most efficient way.
Assumed use cases
As a user, I want a list of all recipes.
As a user, I want a list of all recipes by ingredient.
As a designer, I want to be able to show a list of all ingredients.
As a user, I want to be able to link to recipes for compound ingredients, should it be present on the site.
Surely, this is just a small excerpt, but it is sufficient for this example.
How to answer the questions
Ok, the first one is extremely simple:
db.recipes.find()[.limit()[.skip()]]
Now, how could we find by ingredient? Simple answer: do a text index on ingredient names (and probably some other fields, as you can only have one text index per collection. Then, the query is equally simple:
db.recipes.find({$text:{$search:"ingredient name"}})
"Hey, wait a moment! How do I get a list of all ingredients?" Let us assume we want a simple list of ingredients, with a number on how often they are actually used:
db.recipes.aggregate([
// We want all ingredients as single values
{$unwind:"$Ingredients"},
// We want the response to be "Ingredient"
{$project:{_id:0,"Ingredient":"$Ingredients.Name"}
// We count the occurrence of each ingredient
// in the recipes
{$group:{_id:"$Ingredient",count:{$sum:1}}}
])
This would actually be sufficient, unless you have a database of gazillions of recipes. In that case, you might want to have a deep look into incremental map/reduce instead of an aggregation. Hint: You should add a timestamp to the recipes to be able to use incremental map/reduce.
If you have a couple of hundred K to a couple of million recipes, you can also add an $out stage to preaggregate your data.
On measurements
Imho, it makes no sense to have defined measurements. There are teaspoons, tablespoons, metric and imperial measurements, groupings like "dozen" or specifications like "clove". Which you really do not want to convert to each other or even set to a limited number of measurements. How many ounces is a clove of garlic? ;)
Bottom line: Make it a free text field, maybe with some autocomplete suggestions.
Revised data model
Recipe
{
_id: new ObjectId(),
Name: "Surf & Turf Kebap",
Ingredients: [
{
Name: "Flunk Steak",
Measurement: "200 g"
},
{
Name: "Prawns",
Measurement: "300g",
Note: "Fresh ones!"
},
{
Name: "Garlic Oil",
Measurement: "1 Tablespoon",
Link: "/recipes/5c2cc4acd98df737db7c5401"
}
]
}
And the example of the text index:
db.recipes.createIndex({Name:"text","Ingredients.Name":"text"})
The theory behind it
A recipe is you basic data structure, as your application is supposed to store and provide them, potentially based on certain criteria. Ingredients and measurements (to the extend where it makes sense) can easily be derived from the recipes. So why bother to store ingredients and measurements independently. It only makes your data model unnecessarily complicated, while not providing any advantage.
hth

How do travel websites implement the sorting of search results?

For example you make a search for a hotel in London and get 250 hotels out of which 25 hotels are shown on first page. On each page user has an option to sort the hotels based on price, name, user-reviews etc. Now the intelligent thing to do will be to only get the first 25 hotels on the first page from the database. When user moves to page 2, make another database query for next 25 hotels and keep the previous results in cache.
Now consider this, user is on page 1 and sees 25 hotels sorted by price and now he sorts them based on user-ratings, in this case, we should keep the hotels we already got in cache and only request for additional hotels. How is that implemented? Is there something built in any language (preferably php) or we have to implement it from scratch using multiple queries?
This is usually done as follows:
The query is executed with order by the required field, and with a top (in some databases limit) set to (page_index + 1) * entries_per_page results. The query returns a random-access rowset (you might also hear of this referred to as a resultset or a recordset depending on the database library you are using) which supports methods such as MoveTo( row_index ) and MoveNext(). So, we execute MoveTo( page_index * entries_per_page ) and then we read and display entries_per_page results. The rowset generally also offers a Count property which we invoke to get the total number of rows that would be fetched by the query if we ever let it run to the end (which of course we don't) so that we can compute and show the user how many pages exist.

Simple search by value?

I would like to store some information as follows (note, I'm not wedded to this data structure at all, but this shows you the underlying information I want to store):
{ user_id: 12345, page_id: 2, country: 'DE' }
In these records, user_id is a unique field, but the page_id is not.
I would like to translate this into a Redis data structure, and I would like to be able to run efficient searches as follows:
For user_id 12345, find the related country.
For page_id 2, find all related user_ids and their countries.
Is it actually possible to do this in Redis? If so, what data structures should I use, and how should I avoid the possibility of duplicating records when I insert them?
It sounds like you need two key types: a HASH key to store your user's data, and a LIST for each page that contains a list of related users. Below is an example of how this could work.
Load Data:
> RPUSH page:2:users 12345
> HMSET user:12345 country DE key2 value2
Pull Data:
# All users for page 2
> LRANGE page:2:users 0 -1
# All users for page 2 and their countries
> SORT page:2:users By nosort GET # GET user:*->country GET user:*->key2
Remove User From Page:
> LREM page:2:users 0 12345
Repeat GETs in the SORT to retrieve additional values for the user.
I hope this helps, let me know if there's anything you'd like clarified or if you need further assistance. I also recommend reading the commands list and documentation available at the redis web site, especially concerning the SORT operation.
Since user_id is unique and so does country, keep them in a simple key-value pair. Quering for a user is O(1) in such a case... Then, keep some Redis sets, with key the page_id and members all the user_ids..

Best way to store user-submitted item names (and their synonyms)

Consider an e-commerce application with multiple stores. Each store owner can edit the item catalog of his store.
My current database schema is as follows:
item_names: id | name | description | picture | common(BOOL)
items: id | item_name_id | picture | price | description | picture
item_synonyms: id | item_name_id | name | error(BOOL)
Notes: error indicates a wrong spelling (eg. "Ericson"). description and picture of the item_names table are "globals" that can optionally be overridden by "local" description and picture fields of the items table (in case the store owner wants to supply a different picture for an item). common helps separate unique item names ("Jimmy Joe's Cheese Pizza" from "Cheese Pizza")
I think the bright side of this schema is:
Optimized searching & Handling Synonyms: I can query the item_names & item_synonyms tables using name LIKE %QUERY% and obtain the list of item_name_ids that need to be joined with the items table. (Examples of synonyms: "Sony Ericsson", "Sony Ericson", "X10", "X 10")
Autocompletion: Again, a simple query to the item_names table. I can avoid the usage of DISTINCT and it minimizes number of variations ("Sony Ericsson Xperia™ X10", "Sony Ericsson - Xperia X10", "Xperia X10, Sony Ericsson")
The down side would be:
Overhead: When inserting an item, I query item_names to see if this name already exists. If not, I create a new entry. When deleting an item, I count the number of entries with the same name. If this is the only item with that name, I delete the entry from the item_names table (just to keep things clean; accounts for possible erroneous submissions). And updating is the combination of both.
Weird Item Names: Store owners sometimes use sentences like "Harry Potter 1, 2 Books + CDs + Magic Hat". There's something off about having so much overhead to accommodate cases like this. This would perhaps be the prime reason I'm tempted to go for a schema like this:
items: id | name | picture | price | description | picture
(... with item_names and item_synonyms as utility tables that I could query)
Is there a better schema you would suggested?
Should item names be normalized for autocomplete? Is this probably what Facebook does for "School", "City" entries?
Is the first schema or the second better/optimal for search?
Thanks in advance!
References: (1) Is normalizing a person's name going too far?, (2) Avoiding DISTINCT
EDIT: In the event of 2 items being entered with similar names, an Admin who sees this simply clicks "Make Synonym" which will convert one of the names into the synonym of the other. I don't require a way to automatically detect if an entered name is the synonym of the other. I'm hoping the autocomplete will take care of 95% of such cases. As the table set increases in size, the need to "Make Synonym" will decrease. Hope that clears the confusion.
UPDATE: To those who would like to know what I went ahead with... I've gone with the second schema but removed the item_names and item_synonyms tables in hopes that Solr will provide me with the ability to perform all the remaining tasks I need:
items: id | name | picture | price | description | picture
Thanks everyone for the help!
The requirements you state in your comment ("Optimized searching", "Handling Synonyms" and "Autocomplete") are not things that are generally associated with an RDBMS. It sounds like what you're trying to solve is a searching problem, not a data storage and normalization problem. You might want to start looking at some search architectures like Solr
Excerpted from the solr feature list:
Faceted Searching based on unique field values, explicit queries, or date ranges
Spelling suggestions for user queries
More Like This suggestions for given document
Auto-suggest functionality
Performance Optimizations
If there were more attributes exposed for mapping, I would suggest using a fast search index system. No need to set aliases up as the records are added, the attributes simply get indexed and each search issued returns matches with a relevance score. Take the top X% as valid matches and display those.
Creating and storing aliases seems like a brute-force, labor intensive approach that probably won't be able to adjust to the needs of your users.
Just an idea.
One thing that comes to my mind is sorting the characters in the name and synonym throwing away all white space. This is similar to the solution of finding all anagrams for a word. The end result is ability to quickly find similar entries. As you pointed out, all synonyms should converge into one single term, or name. The search is performed against synonyms using again sorted input string.

Resources