Best way to store user-submitted item names (and their synonyms) - database

Consider an e-commerce application with multiple stores. Each store owner can edit the item catalog of his store.
My current database schema is as follows:
item_names: id | name | description | picture | common(BOOL)
items: id | item_name_id | picture | price | description
item_synonyms: id | item_name_id | name | error(BOOL)
Notes: error indicates a wrong spelling (eg. "Ericson"). description and picture of the item_names table are "globals" that can optionally be overridden by "local" description and picture fields of the items table (in case the store owner wants to supply a different picture for an item). common helps separate unique item names ("Jimmy Joe's Cheese Pizza" from "Cheese Pizza")
I think the bright side of this schema is:
Optimized searching & Handling Synonyms: I can query the item_names & item_synonyms tables using name LIKE '%QUERY%' and obtain the list of item_name_ids that need to be joined with the items table. (Examples of synonyms: "Sony Ericsson", "Sony Ericson", "X10", "X 10")
Autocompletion: Again, a simple query to the item_names table. I can avoid the usage of DISTINCT and it minimizes number of variations ("Sony Ericsson Xperia™ X10", "Sony Ericsson - Xperia X10", "Xperia X10, Sony Ericsson")
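For illustration, the search described above could be sketched roughly as follows, using Python's built-in sqlite3 module. The column subset, the sample rows and the search_items helper are assumptions made for this example, not part of the actual application.

```python
import sqlite3

# In-memory database with a reduced version of the first schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_names (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE item_synonyms (
    id INTEGER PRIMARY KEY,
    item_name_id INTEGER NOT NULL REFERENCES item_names(id),
    name TEXT NOT NULL,
    error INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    item_name_id INTEGER NOT NULL REFERENCES item_names(id),
    price REAL
);
-- Sample data (made up for the example).
INSERT INTO item_names (id, name) VALUES
    (1, 'Sony Ericsson Xperia X10'), (2, 'Cheese Pizza');
INSERT INTO item_synonyms (item_name_id, name, error) VALUES
    (1, 'Sony Ericson', 1), (1, 'X 10', 0);
INSERT INTO items (id, item_name_id, price) VALUES
    (10, 1, 499.0), (11, 1, 479.0), (12, 2, 8.5);
""")


def search_items(query):
    """Return ids of items whose name or any synonym contains the query."""
    pattern = f"%{query}%"
    rows = conn.execute(
        """
        SELECT i.id
        FROM items AS i
        WHERE i.item_name_id IN (
            SELECT id FROM item_names WHERE name LIKE ?
            UNION
            SELECT item_name_id FROM item_synonyms WHERE name LIKE ?
        )
        ORDER BY i.id
        """,
        (pattern, pattern),
    ).fetchall()
    return [row[0] for row in rows]
```

Here search_items("Ericson") finds both Xperia items only through the misspelled synonym, since "Ericsson" does not contain "Ericson". Note that a leading % wildcard generally prevents the use of an ordinary index, and that % or _ typed by the user would act as wildcards unless escaped.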
The down side would be:
Overhead: When inserting an item, I query item_names to see if this name already exists. If not, I create a new entry. When deleting an item, I count the number of entries with the same name. If this is the only item with that name, I delete the entry from the item_names table (just to keep things clean; accounts for possible erroneous submissions). And updating is the combination of both.
Weird Item Names: Store owners sometimes use sentences like "Harry Potter 1, 2 Books + CDs + Magic Hat". There's something off about having so much overhead to accommodate cases like this. This would perhaps be the prime reason I'm tempted to go for a schema like this:
items: id | name | picture | price | description
(... with item_names and item_synonyms as utility tables that I could query)
Is there a better schema you would suggest?
Should item names be normalized for autocomplete? Is this probably what Facebook does for "School", "City" entries?
Is the first schema or the second better/optimal for search?
Thanks in advance!
References: (1) Is normalizing a person's name going too far?, (2) Avoiding DISTINCT
EDIT: In the event of 2 items being entered with similar names, an Admin who sees this simply clicks "Make Synonym" which will convert one of the names into the synonym of the other. I don't require a way to automatically detect if an entered name is the synonym of the other. I'm hoping the autocomplete will take care of 95% of such cases. As the table set increases in size, the need to "Make Synonym" will decrease. Hope that clears the confusion.
UPDATE: To those who would like to know what I went ahead with... I've gone with the second schema but removed the item_names and item_synonyms tables in hopes that Solr will provide me with the ability to perform all the remaining tasks I need:
items: id | name | picture | price | description
Thanks everyone for the help!

The requirements you state in your comment ("Optimized searching", "Handling Synonyms" and "Autocomplete") are not things that are generally associated with an RDBMS. It sounds like what you're trying to solve is a searching problem, not a data storage and normalization problem. You might want to start looking at some search architectures like Solr.
Excerpted from the Solr feature list:
Faceted Searching based on unique field values, explicit queries, or date ranges
Spelling suggestions for user queries
More Like This suggestions for given document
Auto-suggest functionality
Performance Optimizations

If there were more attributes exposed for mapping, I would suggest using a fast search index system. No need to set aliases up as the records are added, the attributes simply get indexed and each search issued returns matches with a relevance score. Take the top X% as valid matches and display those.
Creating and storing aliases seems like a brute-force, labor intensive approach that probably won't be able to adjust to the needs of your users.

Just an idea.
One thing that comes to my mind is sorting the characters in the name and in each synonym, after throwing away all white space. This is similar to the solution for finding all anagrams of a word. The end result is the ability to quickly find similar entries. As you pointed out, all synonyms should converge into one single term, or name. The search is then performed against the synonyms, again using the sorted form of the input string.
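A minimal sketch of that idea in Python, assuming the synonyms are available as a mapping from synonym to canonical name. The function names and the lowercasing step are my own additions for illustration.

```python
from collections import defaultdict


def anagram_key(name):
    """Drop all whitespace, lowercase (an assumption), and sort the characters."""
    return "".join(sorted("".join(name.split()).lower()))


def build_index(synonyms):
    """Map each sorted-character key to the canonical names it can refer to."""
    index = defaultdict(set)
    for synonym, canonical in synonyms.items():
        index[anagram_key(synonym)].add(canonical)
    return index


def lookup(index, query):
    """Find canonical names whose synonyms share the query's sorted key."""
    return sorted(index.get(anagram_key(query), set()))
```

With this, "X10" and "X 10" produce the same key, so either one finds the same canonical name. The limits are worth keeping in mind: true anagrams such as "1X0" also collide, and a misspelling with a missing letter ("Ericson" vs. "Ericsson") yields a different key, so it would still have to be stored as its own synonym.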


How to find the pointer node of a relationship using full text search on the edge node in neo4j

This question is about Neo4j. Suppose I have a relationship (employee)-[WORKS-IN]->(company), and an employee can work in multiple companies. I want to find the companies a specific employee works in using full text search in Neo4j: I'll be searching by the employee's name, and the query should return the company nodes. How can I do that?
Full text search must be used.
So you want to search for a Person by name with full text and then retrieve the companies he worked for.
This compares directly to the default Movies graph in Neo4j, where you search for a Person by name with full text and then retrieve the movies the person acted in:
CALL db.index.fulltext.queryNodes('Person', 'kea*')
YIELD node
MATCH (node)-[:ACTED_IN]->(movie)
RETURN node.name, movie.title
Here is an example, starting from this node:
CREATE (e:Employee {name: 'Nirmana Testing'})
Then create the full text index on Employee.name
CREATE FULLTEXT INDEX employeeNameIdx FOR (e:Employee) ON EACH [e.name]
Then run a query using this full text index. Note that the keyword 'nirmana' can be in upper case, lower case, or any mix of the two.
CALL db.index.fulltext.queryNodes("employeeNameIdx", "nirmana") YIELD node as employee
MATCH (employee)-[:`WORKS-IN`]->(company:Company)
RETURN employee, company
reference:
https://neo4j.com/docs/cypher-manual/current/indexes-for-full-text-search/
Thank you very much, that sorted it out. One more thing: suppose a particular worker can have various relationships besides [WORKS-IN], such as [PART TIME WORKER], [FREELANCER], [PROJECT MANAGER] and so on. If we want to find the company where a particular user is working, freelancing, or managing projects by searching on the relationship type, how could that be done using full text search?

In cucumber background steps are passed for first scenario outline but failing for the second scenario outline

Feature: search by customer
  Background:
    Given user selects search type as customer
  Scenario Outline: search customer
    When slects customer type as customer
    Then enter the customer id as "<customer>" in search
    And clicks on search icon to search
    Examples:
      | customer |
      | 248069   |
  Scenario Outline: Search hierarchy
    When slects customer type as hierarchy
    Then enter the hierarchy id as "<hierarchy>" in search
    And clicks on search icon to search
    Examples:
      | hierarchy |
      | 3779213   |
If the second scenario results in an error when executed, I would modify the first scenario outline so that it could run both scenarios. You will need to parameterize the step definition for the first scenario outline something like this:
  Scenario Outline: search user types
    When selects customer type as <type>
    Then enter the customer id as <id> in search
    And clicks on search icon to search
    Examples:
      | type      | id      |
      | customer  | 248069  |
      | hierarchy | 3779213 |
You will need to modify the step definitions that work (the ones used by the first scenario outline). My suspicion is that the step definition for the first step in the second scenario (slects customer type as hierarchy) is broken and causing your issue. Even if it is not defective, there is no good reason to have two step definitions that basically do the same thing. Pass a parameter and, if you need to execute an alternate path, make that decision inside the method body based on the parameter passed.
If you make these changes and the second scenario example fails, you can assume that it is due to a bad parameter being passed. In this case, the id parameter is one character longer in the second scenario example. It is possible this could be the problem.
Since you haven't provided a specific description of the error you are getting, it is impossible to say for certain what solution will work for you. That said, this is my best guess.
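To make the "one parameterized step definition" suggestion concrete, here is a rough sketch in plain Python. It is not tied to any particular Cucumber binding: the state dictionary and the two handler functions are hypothetical stand-ins for whatever your real step code does against the UI.

```python
# Hypothetical handlers standing in for the real UI automation code.
def select_customer(state):
    state["search_type"] = "customer"


def select_hierarchy(state):
    state["search_type"] = "hierarchy"


# One lookup table instead of two near-identical step definitions.
HANDLERS = {"customer": select_customer, "hierarchy": select_hierarchy}


def selects_customer_type(state, customer_type):
    """Single step body: choose the path based on the parameter passed in."""
    try:
        handler = HANDLERS[customer_type]
    except KeyError:
        raise ValueError(f"unsupported customer type: {customer_type!r}") from None
    handler(state)
    return state
```

Failing loudly on an unknown type also makes it easier to tell a bad parameter apart from a broken step.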

Show UniData SELECT results that are not record keys

I'm looking over some UniData fields for distinct values but I'm hoping to find a simpler way of doing it. The values aren't keys to anything so right now I'm selecting the records I'm interested in and selecting the data I need with SAVING UNIQUE. The problem is, in order to see what I have all I know to do is save it out to a savedlist and then read through the savedlist file I created.
Is there a way to see the contents of a select without running it against a file?
If you just want to visually look over the data, use LIST instead of SELECT.
The general syntax of the command is something like:
LIST filename WITH [criteria] [sort] [attributes | ALL]
So let's say you have a table called questions and want to look over all the authors of questions that used the tag unidata. Your query might look something like:
LIST questions WITH tag = "unidata" BY author author
Note: The second author isn't a mistake, it's the start of the list of attributes you want displayed - in this case just author, but you might want the record id as well, so you could do #ID author instead. Or just do ALL to display everything in each record.
I did BY author here as it will make spotting uniques easier, but you can also use other query features like BREAK.ON to help here as well.
I don't know why I didn't think of it at the time, but I basically needed something like SQL's DISTINCT keyword, since I just needed to view the unique values. Replicating DISTINCT in UniData is explained here: https://forum.precisonline.com/index.php?topic=318.0.
The trick is to sort on the values using BY, get a single unique value of each using BREAK-ON, and then suppress everything except those unique values using DET-SUP.
LIST BUILDINGS BY CITY BREAK-ON CITY DET-SUP
CITY.............
Albuquerque
Arlington
Ashland
Clinton
Franklin
Greenville
Madison
Milton
Springfield
Washington
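For readers more at home outside UniData, the same sort, break, and suppress logic can be mimicked in Python. This is only an analogy to show why the trick yields distinct values; the records and field name below are made up.

```python
from itertools import groupby


def distinct_values(records, field):
    """Sort on the field, then keep one value per run of equal values."""
    ordered = sorted(record[field] for record in records)  # like BY
    # groupby yields one key per run of equal values (like BREAK-ON);
    # discarding the grouped rows mirrors DET-SUP.
    return [value for value, _rows in groupby(ordered)]


# Made-up sample records.
buildings = [
    {"id": 1, "city": "Madison"},
    {"id": 2, "city": "Ashland"},
    {"id": 3, "city": "Madison"},
    {"id": 4, "city": "Clinton"},
]
```

The sort matters in both worlds: without it, equal values that are not adjacent would each start their own break group.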

"2d Search" in Solr or how to get the best item of the multivalued field 'items'?

The title is a bit awkward but I couldn't find a better one. My problem is as follows:
I have several users stored as documents and I am storing several key-value-pairs or items (which have an id) for each document. Now, if I apply highlighting with hl.snippets=5 I can get the first 5 items. But every user could have several hundred items, so
you will not get the most relevant 5 items. You will get the first 5 items ...
Another problem is that
the highlighted text won't contain the id and so retrieving additional information of the highlighted item text is ugly.
Example where items are emails:
user1 has item1 { text:"developers developers developers", id:1, title:"ms" }
item2 { text:"c# development", id:2, title:"nice!" }
...
item77 ...
user2 has item1 { text:"nice restaurant", id:3, title:"bla"}
item2 { text:"best cafe", id:4, title:"blup"}
...
item223 ...
Now if I use highlighting for the text field and query against "restaurant" I get user2 and the text nice <b>restaurant</b>. But how can I determine the id of the highlighted text to display e.g. the title of this item? And what happens if more relevant items are listed at the end of the item-list? Highlighting won't display those ...
So how can I find the best items of a document with multiple such items?
I added my two findings as answers, but as I will point out each of them has its own drawbacks.
Could anyone point me to a better solution?
One of my rules of thumb for designing Solr schemas is: the document is what you will search for.
If you want to search for 'items', then these 'items' are your documents. How you store other stuff, like 'users', is secondary. So 'users' could be in another index like you mentioned, they could be "denormalized" (e.g. their information duplicated in each document), in a relational database, etc. depending on RDBMS availability, how many 'users' there are, how many fields these 'users' have, etc.
EDIT: now you explain that the 'items' are emails, and a possible search is 'restaurant X' and you want to find the best 'items' (emails). Therefore, the document is the email. The schema could be as simple as this: (id, title, text, user).
You could enable highlighting to get snippets of the 'text' or 'title' fields matching the 'restaurant X' query.
If you want to give the end-user information about the users that wrote about 'restaurant X', you could facet the 'user' field. Then the end-user would see that John wrote 10 emails about 'restaurant X' and Robert wrote 6. The end-user thinks "This John dude must know a lot about this restaurant" so he drills down into a search by 'restaurant x' with a filter query user:John
You could use two indices: users->items as described in the question, and an index with 'pure items' referencing back to the user.
Then you will need 2 queries (that's the reason I called the question '2d Search in Solr'):
1. query the user index => list of e.g. 10 users
2. query the items index for each user from step 1 => best items
Assume the following example:
userA emails are "restaurant X is bad but restaurant X is cheap", "different topic", "different topicB" and
userB emails are "restaurant X is not nice", "revisited restaurant X and it was ok now", "again in restaurant X and I think it is the best".
Now I query the user index for "restaurant X" and the first user will be userB, which is what I want. If I queried only the item index, I would get item1 of the less relevant userA.
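The example can be reproduced with a toy in-memory sketch in Python. The scoring here is a naive phrase count that I am using as a stand-in for Solr's relevance ranking (it is not how Solr scores documents), and the ids and function names are assumptions for illustration only.

```python
# Toy data taken from the example; ids are made up.
emails = [
    {"id": 1, "user": "userA", "text": "restaurant X is bad but restaurant X is cheap"},
    {"id": 2, "user": "userA", "text": "different topic"},
    {"id": 3, "user": "userA", "text": "different topicB"},
    {"id": 4, "user": "userB", "text": "restaurant X is not nice"},
    {"id": 5, "user": "userB", "text": "revisited restaurant X and it was ok now"},
    {"id": 6, "user": "userB", "text": "again in restaurant X and I think it is the best"},
]


def score(text, query):
    """Naive stand-in for relevance: how often the phrase occurs."""
    return text.lower().count(query.lower())


def rank_users(emails, query):
    """Step 1: the 'user index', scoring all of a user's items together."""
    totals = {}
    for email in emails:
        totals[email["user"]] = totals.get(email["user"], 0) + score(email["text"], query)
    matching = [user for user, total in totals.items() if total > 0]
    return sorted(matching, key=lambda user: -totals[user])


def best_items(emails, query, user=None, limit=5):
    """Step 2: the 'item index', optionally filtered to a single user."""
    hits = [
        email for email in emails
        if (user is None or email["user"] == user) and score(email["text"], query) > 0
    ]
    hits.sort(key=lambda email: -score(email["text"], query))
    return hits[:limit]
```

With this data, rank_users puts userB first because three of userB's emails match, while best_items without a user filter returns userA's first email on top because it mentions the phrase twice. That difference is exactly why the second query is restricted to the users found in the first one.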
Drawbacks:
bad performance, because you will need one query against the user index and e.g. 10 more to get the most relevant items for each user.
maintaining two indices.
Update: to avoid many queries, I will try the following: using the user index to get some highlighted snippets and then offering a 'get relevant items' button for every user, which then triggers a query against the item index.
You can use the collapse patch and store each item as separate document linking back to the user.
The problem with that approach is that you won't get the most relevant user, i.e. the most relevant item is not necessarily from the most relevant user (who may instead have several slightly less relevant items).
See the "Assume the following example:" part in my second answer.

Questions about DB modelling

How would you model these relationships in a db?
You have a Page entity that can contain PageElements.
A PageElement can for instance be an Article, or a Picture. An Article table obviously has other members / columns than a Picture. An article could have e.g. "Title", "Lead", "Body" columns that are all of type nvarchar, while a Picture might have something like "AltText", "Path", "Width", "Height". I'd like this to be extensible; who knows what PageElements I might need in 3 months? So I guess I'd need a PageElementTypes table.
For the relationships, what about tables like these:
Pages with an Id, and other mumbo jumbo. (Create Date, Visible, what not)
Pages_PageElements with PageId and PageElementId.
PageElements with an Id and a PageElementTypeId and more mumbojumbo (SortOrder, Visibility etc.).
PageElementTypes with an Id and a Name (for instance "Article", "Picture", "AddressBlock")
Now, should I create a PageElementId column in every Articles, Pictures, AddressBlocks table to finish things up? That's where I'm a bit stuck, it's a simple 1:1 relationship so this should work, but somehow I might miss something.
Follow up:
The recommended solutions below with separate attributes would force me to store all attributes as the same type, or not? What If one PageElement has attributes that are nvarchar(255) and some are nvarchar(1000), what if some are integers?
If I go the EAV way I would have to create tons of tables for holding the attribute values for all the different data types out there.
The two common choices are Single Table Inheritance and Multi Table Inheritance. Other approaches include having tables for each concrete class which I've never used, and what I'd call a meta-table implementation, where the attribute definitions are moved into data rather than any sort of schema.
I've had generally good experiences with STI, and provided you don't expect a plethora of classes and attributes it's the simplest solution. Simple is very good in my book.
Unless new page element types need to be created by users at runtime, I'd avoid the meta-tables approach and anything that begins to look like it. In my experience such code quickly becomes a quagmire and rarely delivers much value compared to a more concrete implementation updated at regular intervals by developers.
Just as you have configured Page Elements, you need to configure the Attributes associated with the Page Elements.
So we have two items that are extensible: Page Elements & their Attributes.
I suggest the following tables:
Page : Page ID | ...
Page Elements : Page Element ID | Element Type ID | Page ID | ...
Page Element Type : Element Type ID | Page Element Type Label
Page Element Attribute Type : Attribute Type ID | Element Type ID | Attribute Label
Page Element Attributes : Page Element ID | Attribute Type ID | Attribute Value
The Page Element Attribute Type table will contain the list of attributes associated with an element. Example :
Attribute Type ID 1 | Article | "Title"
Attribute Type ID 2 | Article | "Lead"
Attribute Type ID 3 | Picture | "AltText"
The Page Element Attributes table will store the actual value for the attributes associated with a page element. Example :
Page Element ID 1 | Attribute Type ID 1 | "Everybody Loves Raymond"
Page Element ID 2 | Attribute Type ID 3 | "World Map"
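A rough sketch of the two attribute tables using Python's sqlite3 module, to show how the values are read back. The snake_case names, the text element_type column (simplified from Element Type ID) and the extra "Width" attribute are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page_element_attribute_type (
    attribute_type_id INTEGER PRIMARY KEY,
    element_type TEXT NOT NULL,      -- simplified: label instead of a type id
    attribute_label TEXT NOT NULL
);
CREATE TABLE page_element_attributes (
    page_element_id INTEGER NOT NULL,
    attribute_type_id INTEGER NOT NULL
        REFERENCES page_element_attribute_type(attribute_type_id),
    attribute_value TEXT,            -- every value shares one column type
    PRIMARY KEY (page_element_id, attribute_type_id)
);
INSERT INTO page_element_attribute_type VALUES
    (1, 'Article', 'Title'), (2, 'Article', 'Lead'),
    (3, 'Picture', 'AltText'), (4, 'Picture', 'Width');
INSERT INTO page_element_attributes VALUES
    (1, 1, 'Everybody Loves Raymond'),
    (2, 3, 'World Map'),
    (2, 4, '640');
""")


def load_attributes(page_element_id):
    """Collect one element's attributes into a {label: value} dict."""
    rows = conn.execute(
        """
        SELECT t.attribute_label, a.attribute_value
        FROM page_element_attributes AS a
        JOIN page_element_attribute_type AS t
          ON t.attribute_type_id = a.attribute_type_id
        WHERE a.page_element_id = ?
        """,
        (page_element_id,),
    ).fetchall()
    return dict(rows)
```

Because all values live in a single text column, the width comes back as the string '640' and has to be converted by the caller, so any per-attribute typing or length limits end up being enforced in application code rather than by the schema.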
The universal solution would be:
PageElementType: ID, Name, [Mumbo Jumbo]
PageElementTypeParameter: ID, PageElementTypeID, [Mumbo Jumbo]
Page: ID, [Mumbo Jumbo]
PageElement: ID, PageElementTypeID, [Mumbo Jumbo]
PageElementParameters: ID, PageElementID, PageElementTypeParameterID, Value, [Mumbo Jumbo]
In human words: There is a table for page element types, and an associated table, which lists possible parameters for each page element (like SRC and ALT for an image; TEXT for an article, etc).
Then there is a table with all the pages; an associated table which lists elements in each page; and a table which lists parameter values for each element.
I use a different naming convention than you but this is essentially what I would do:
PageElementType(PageElementTypeID, PageElementTypeName)
PageElement(PageElementID, PageElementTypeID)
Article(ArticleID, PageElementID, ...)
Picture(PictureID, PageElementID, ...)
Page(PageID, ...)
PageHasPageElement(PageHasPageElementID, PageID, PageElementID) => {PageID, PageElementID} are unique
This is what I do, and it seems to be fairly well normalized and performs fine.
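As a sketch only, the tables listed above could be tried out with Python's sqlite3 module like this. The concrete columns on Article and Picture come from the question, while the sample rows and the elements_for_page helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PageElementType (
    PageElementTypeID INTEGER PRIMARY KEY,
    PageElementTypeName TEXT NOT NULL UNIQUE
);
CREATE TABLE PageElement (
    PageElementID INTEGER PRIMARY KEY,
    PageElementTypeID INTEGER NOT NULL
        REFERENCES PageElementType(PageElementTypeID)
);
-- One table per concrete type, each with its own typed columns.
CREATE TABLE Article (
    ArticleID INTEGER PRIMARY KEY,
    PageElementID INTEGER NOT NULL UNIQUE REFERENCES PageElement(PageElementID),
    Title TEXT, Body TEXT
);
CREATE TABLE Picture (
    PictureID INTEGER PRIMARY KEY,
    PageElementID INTEGER NOT NULL UNIQUE REFERENCES PageElement(PageElementID),
    AltText TEXT, Path TEXT, Width INTEGER, Height INTEGER
);
CREATE TABLE Page (PageID INTEGER PRIMARY KEY);
CREATE TABLE PageHasPageElement (
    PageHasPageElementID INTEGER PRIMARY KEY,
    PageID INTEGER NOT NULL REFERENCES Page(PageID),
    PageElementID INTEGER NOT NULL REFERENCES PageElement(PageElementID),
    UNIQUE (PageID, PageElementID)
);
-- Sample data (made up for the example).
INSERT INTO PageElementType VALUES (1, 'Article'), (2, 'Picture');
INSERT INTO PageElement VALUES (1, 1), (2, 2);
INSERT INTO Article (PageElementID, Title, Body) VALUES (1, 'Hello', 'Some text');
INSERT INTO Picture (PageElementID, AltText, Path, Width, Height)
    VALUES (2, 'World Map', '/img/map.png', 640, 480);
INSERT INTO Page VALUES (1);
INSERT INTO PageHasPageElement (PageID, PageElementID) VALUES (1, 1), (1, 2);
""")


def elements_for_page(page_id):
    """List a page's elements with their type and a few type-specific columns."""
    return conn.execute(
        """
        SELECT pe.PageElementID, t.PageElementTypeName, a.Title, p.Width
        FROM PageHasPageElement AS link
        JOIN PageElement AS pe ON pe.PageElementID = link.PageElementID
        JOIN PageElementType AS t ON t.PageElementTypeID = pe.PageElementTypeID
        LEFT JOIN Article AS a ON a.PageElementID = pe.PageElementID
        LEFT JOIN Picture AS p ON p.PageElementID = pe.PageElementID
        WHERE link.PageID = ?
        ORDER BY pe.PageElementID
        """,
        (page_id,),
    ).fetchall()
```

Each type keeps properly typed columns (Width comes back as an integer), at the cost of one extra LEFT JOIN per element type whenever a whole page is loaded in a single query.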
I guess I'll just go with what I've got; EAV is no option for me. What I have now is a somewhat hybrid approach.
