identify product after web crawling, price comparison - solr

I am currently developing a price comparison site, for which I crawl some e-commerce websites and extract data such as price, title and metadata from their HTML pages. I am now at the point where I need to identify whether two products crawled from different websites are actually the same, and assign a common label to both of them.
For example, let's say site 1 has the following string as the title of a product:
"Smartphone Samsung Galaxy S6 4G 32GB"
and site 2 has as a title for the same product this string:
"Samsung Galaxy S6 White"
How can I identify if these two products are actually the same product, which I want to label in my site as "Samsung Galaxy S6"?
I have thought of using machine learning techniques such as classification or clustering. However, classification will probably require a large, frequently updated set of well-formatted product labels to act as the possible classes (e.g. a class "Samsung Galaxy S6"). Does such a thing exist? Also, with such a huge number of classes, classification may not be feasible.
I am using Apache Nutch for crawling and Solr for indexing and search. If there is any specific library or tool for those it will be very helpful, but my question is not specifically for those and I will be very happy to read any suggestion.
Thanks

I have done something similar for my project, where we tag people's names with their IDs. The same person can have their name listed as the full name, as initials, or as only the first name, and we tag every variant with the same ID.
So for your case this will basically entail building an inverted index for your products and then scanning the title field for the product names and tagging them to a particular product ID. This way all Samsung Galaxy S6s get mapped to the same product.
This does not require any learning to be performed. You just need a database from which to pick up all unique products, and you need to keep updating your index as your product database changes.
All of this can be done at index time by writing an update processor for Solr.
The implementation is a bit too complex to put here in full, so I've just outlined the basic idea that could help you out.
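To make the idea concrete, here is a minimal sketch in plain Python rather than a Solr update processor. The catalog contents, the product IDs and the helper names (`tokenize`, `build_index`, `tag_title`) are all assumptions made up for illustration, and the matching rule (every token of the product name must appear in the title, longest name wins) is just one simple choice.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_index(catalog):
    """Inverted index: token -> set of product IDs whose name contains it."""
    index = defaultdict(set)
    for product_id, name in catalog.items():
        for token in tokenize(name):
            index[token].add(product_id)
    return index

def tag_title(title, catalog, index):
    """Return the ID of the product whose name tokens all occur in the title.

    If several products match, the one with the most name tokens wins.
    Returns None when no product matches.
    """
    title_tokens = set(tokenize(title))
    # Collect every product that shares at least one token with the title.
    candidates = set()
    for token in title_tokens:
        candidates |= index.get(token, set())
    best_id, best_len = None, 0
    for product_id in sorted(candidates):
        name_tokens = set(tokenize(catalog[product_id]))
        if name_tokens <= title_tokens and len(name_tokens) > best_len:
            best_id, best_len = product_id, len(name_tokens)
    return best_id

# Hypothetical product database: ID -> canonical label.
catalog = {
    "P1": "Samsung Galaxy S6",
    "P2": "Samsung Galaxy S6 Edge",
    "P3": "Apple iPhone 6",
}
index = build_index(catalog)

for title in ("Smartphone Samsung Galaxy S6 4G 32GB", "Samsung Galaxy S6 White"):
    print(title, "->", tag_title(title, catalog, index))
```

Both example titles from the question end up tagged with the same ID, so they can share the label stored in the catalog. In a real setup the same lookup would run inside the indexing pipeline and the catalog would come from your product database.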

Related

Best way to create product with “specifications” field?

I am using Ubercart for products on a Drupal site. I want to extend this product to show a wide range of different information on products. One of the bits of info I want to show is "specifications".
Take this product as an example. It has various categories of specs, e.g:
Attachment and Capacities
General Specifications
Function and Size
But each of those categories has actual values underneath it. E.g. under "Attachment and Capacities" you have:
Citrus press: Yes
Dough tool: Yes
Mini bowl: Yes
Etc
When I create a new product, I want to be able to add an unlimited number of specifications, each of which falls under a group.
The best way I have found to do this is using Inline Entities, but this is very slow. Is there no better "field type" I could use to model this relationship, i.e. one where I can add any number of categories of values to a specific node?
You can use the Product Specification module to achieve this.
This module allows you to attach product specification data to any
entity in Drupal 7. Mostly we can use this module to store product
specifications on product display page.

Multiple datasources for one model in CakePHP

I have a very interesting problem in relation to CakePHP. I have a client that is currently selling products on eBay, and wants to start selling products on their own website as well. There would, then, be two separate sales avenues: (1) eBay, and (2) website.
However, they do want to have a seamless website experience for their customers. Basically, they want their current eBay sites to be categories within their current website, and their current eBay auction items to be searchable on their website.
A simple CakePHP website would have two tables: products and categories, with the simple table relation of "products belongsTo categories" and "categories hasMany products". How would I then add in the eBay categories and products? Basically, I want the http://site/products/index to return a list of ALL products, both in the products table and on eBay. I want http://site/categories/index to return a list of all defined categories in the categories table plus the categories items are listed in on eBay.
eBay has a very good, near real-time query API, so I've been thinking about an option to do this, but I am wondering if there is a better way. I don't think this option would work very well with PaginatorComponent.
Method:
(1) In beforeFind, capture the request parameters and save to a persistent variable
(2) In afterFind, make a request to the eBay API based on the request parameters, then manually add the results to the $results array.
Again, I think this would work for basic find operations, but I'm not sure this would work with pagination because I'm not sure how to deal with, say, a 20 item page limit (i.e. How do I deal with a page 2 when only 18 items from the database were on page 1, and now on page 2 I need to start at 19 instead of 21 from the database?)
Is there a CakePHP syntax that I'm overlooking here, or do I just start working on coding for all of these eventualities?
I'm coding on a CakePHP 2.6.0 platform.
Thanks so much for your help!
One possible approach would be to read all the products from each datasource, sort them in memory and paginate from there. The number of products would of course play a major role here.
A second one, if the number of products does not change too often, would involve a background process that retrieves the IDs and the most essential data from eBay and writes them to the common Product table, flagging the rows it creates.
Paginator can then show your results as usual, while other fast-changing data can be retrieved on the fly from eBay for the flagged products when composing the page.
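As a rough illustration of why a merged list sidesteps the page-offset problem from the question, here is a language-neutral sketch in Python (not CakePHP code). The `source` key plays the role of the flag mentioned above; the function name, the field names and the sample data are all assumptions.

```python
def paginate_merged(local_products, ebay_products, page, page_size=20):
    """Merge both sources, sort by title, and return (page_items, total_count)."""
    merged = sorted(local_products + ebay_products,
                    key=lambda p: p["title"].lower())
    start = (page - 1) * page_size
    return merged[start:start + page_size], len(merged)

# Hypothetical data: 18 rows from the local table, 5 items fetched from eBay.
local_products = [{"title": f"Local item {i:02d}", "source": "local"}
                  for i in range(1, 19)]
ebay_products = [{"title": f"eBay item {i}", "source": "ebay"}
                 for i in range(1, 6)]

page1, total = paginate_merged(local_products, ebay_products, page=1)
page2, _ = paginate_merged(local_products, ebay_products, page=2)
print(len(page1), len(page2), total)
```

Because the offset is computed over the combined list, page 2 simply continues where page 1 stopped, regardless of how many items on page 1 came from each source. With the background-sync approach the same effect comes for free, since both kinds of rows live in one table and a normal LIMIT/OFFSET query applies.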

Detect the category of a product from product attributes using AI

I have a list of products (700,000+) in an Excel file and I want to import this data into my database.
On my system, each product should be assigned a category. I want to develop an importer desktop application that loops over all the products and detects the category of each product based on the values in the Excel sheet.
The Excel sheet product details are: Title, Description, Brand, Manufacturer, Features
I am looking for an AI solution that can learn progressively by experience, where I can teach it the keywords for each category to use in the matching decision.
It is difficult to say if Neural Networks would be the ideal approach for the data given, or the number of categories that you are trying to break it down to.
I suspect that you would need to do a lot of pre-processing to get the data into a format that would be trainable by a neural network.
If you are looking for an AI-type approach that can categorise the information, perhaps some kind of rule-based approach (like Decision Trees) may help, since it builds the rules from training data:
If Features Include X Then
    If Title Contains Y Then
        Category is A
    Else
        Category is B
    End If
Else
    Category is C
End If
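The same rule can be written as a small Python function. This is only a hand-coded sketch of the rule structure above; a decision tree learner would derive such rules from labelled examples instead. The keywords "dough tool" and "mixer" are made-up stand-ins for X and Y, and the column names follow the Excel sheet described in the question.

```python
def categorize(product, feature_keyword, title_keyword):
    """Apply the nested rule: check Features for X first, then Title for Y."""
    features = product.get("Features", "").lower()
    title = product.get("Title", "").lower()
    if feature_keyword in features:
        if title_keyword in title:
            return "A"
        return "B"
    return "C"

# Hypothetical rows from the spreadsheet.
rows = [
    {"Title": "Kitchen Mixer 500W", "Features": "Dough tool, mini bowl"},
    {"Title": "Food Processor", "Features": "Dough tool, citrus press"},
    {"Title": "Electric Kettle", "Features": "Auto shut-off"},
]
for row in rows:
    print(row["Title"], "->", categorize(row, "dough tool", "mixer"))
```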

Solr customizing with Websphere Commerce

I have a strange requirement in Solr.
The business model is as follows: for each store in a state (say Victoria), we have different sales catalogs (such as Richmond, Brunswick, etc.), which in turn act as fulfillment centers in their own right.
So a URL with storeId vic and catalogId Richmond will retrieve the catalogue for the Richmond store.
Now the requirement is I need to filter out the products based on the inventory for each of these sales catalogues.
I constructed a TI table which has the following structure
catentry_id    QUANTITY_RICFUL    QUANTITY_BrunFUL
1234           0                  20
I have incorporated the changes in the Solr query so that these columns are included in the final result too.
But I do not know how to filter out the products in the front end during catalogue navigation or during search.
Any help would be much appreciated!!!
So basically you want to tie the returned catalog entries in a list with inventory? For instance, when they click on a category you do not want to display products with no inventory?
This is a customization you can do either at the Solr level or at the JSP level. You should probably import the inventory into Commerce and index it in a field Solr can key off, then return only items whose value is greater than zero. I am not sure whether you need the actual inventory or just a boolean. Are you using a single fulfillment center or multiple ones? Multiple centers are trickier: customers would most likely have to log in, and fulfillment would then be determined by the ship-to address.
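At the Solr level, the "greater than zero" condition can be expressed as a filter query with a range. The sketch below only builds plain Solr request parameters in Python; it assumes the QUANTITY_RICFUL column from the question is indexed as a numeric Solr field of the same name, and it says nothing about where WebSphere Commerce would need the filter to be plugged in. The function name and the search term are made up.

```python
from urllib.parse import urlencode

def in_stock_params(user_query, quantity_field):
    """Build Solr parameters that keep only documents with quantity >= 1."""
    return {
        "q": user_query,
        # Range filter: matches documents whose quantity is 1 or more.
        "fq": f"{quantity_field}:[1 TO *]",
        "wt": "json",
    }

params = in_stock_params("kettle", "QUANTITY_RICFUL")
query_string = urlencode(params)  # ready to append to a /select URL
print(query_string)
```

With one quantity field per fulfillment center, the catalog being browsed decides which field name goes into the filter.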
If the store is set up with ATP inventory then you should just get this for free, as products not in stock will simply not be displayed. Check out this page in the infocenter - http://publib.boulder.ibm.com/infocenter/wchelp/v6r0m0/index.jsp?topic=%2Fcom.ibm.commerce.user.doc%2Fconcepts%2Fcosatpatpandnonatp.htm
I am not sure what you are trying to ask here, but it seems you are trying to display a Quantity dropdown or a Quantity field under each product on a search page, which to me makes no sense from a UI perspective. Also keep in mind whether you have integrated with a third-party inventory model that runs every few minutes or hours. How often do you plan to run indexing?
I would rather leave such complexity to a Product Detail page. If you do need to show a Quantity field on the search page, I would prefer a QuickView popup/modal that displays the color/size attributes with the quantity dropdown and lets the user add an item to his/her shopping cart.

Data mapping across websites?

I am starting work on a project which has multiple websites for a client. For analytics purposes I need to measure metrics across websites over a period of time. This means I need to centralize the data model and all the possible questions/answers/lookup values so they can be used across websites. The question I have is:
Example: the age range of a user visiting website 1 is, say, 30-39 years old. (We ask for the age range when they enter.) So in the data model I have a lookup table for answers which holds all possible answers used across all websites, and (30-39) has a PK ID of, say, 102. On website 2 it is the same thing, so again (30-39) has a PK of 102. This way I can measure the same age range across websites. But the problem is where or how to store the user's answer and map it to this ID.
If I have a table called, say, UserAnswers, it has an AgeRange column. Do I make this an FK to the Answer table at PK 102 to store (30-39) for the user? If yes, then what value gets written in the UserAnswers table? Would it be 102?
Secondly, I need to measure text fields as well, for example how many are completed across websites. Take the "email address" field. I give this text field a field ID of 10. When I write the consumer's email of, say, xyz#abc.com in the 'email' column of the answers table, how will I link this to the field ID 10?
I'd probably make an API for each website that dumps the data I want in a consistent format, which I could then load into a separate database and run the metrics on.
This becomes infeasible if the data is so large that it takes hours every night to migrate it.
But otherwise it's a solution that has the benefit that you don't have to touch the current schemas.
If the sites don't exist yet, then I'd just force a consistent data model on all sites, which is trivial. It doesn't matter where the mapping between "30-39" and 102 is defined; the only thing that matters is that it's the same everywhere, which you set up when you create the databases.
If you expect the values and schemas to change a lot, then using just one database for all sites would probably be better, but if you can't do this, then make an export for each site.
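To make the lookup idea from the question concrete, here is a small sketch using Python's built-in sqlite3 module. The table and column names (`answer`, `user_answer`, `age_range_id`, `website_id`) are assumptions for illustration, not an existing schema. The point is that the user row stores the key 102 as a foreign key, and the label "30-39" lives only in the lookup table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the FK constraint
conn.executescript("""
CREATE TABLE answer (
    id    INTEGER PRIMARY KEY,
    label TEXT NOT NULL
);
CREATE TABLE user_answer (
    user_id      INTEGER NOT NULL,
    website_id   INTEGER NOT NULL,
    age_range_id INTEGER NOT NULL REFERENCES answer(id)
);
-- The same ID means the same age range on every website.
INSERT INTO answer (id, label) VALUES (102, '30-39');
-- One user on website 1 and two users on website 2 picked 30-39.
INSERT INTO user_answer VALUES (1, 1, 102), (2, 2, 102), (3, 2, 102);
""")

# Count users per website and age range by joining back to the lookup table.
rows = conn.execute("""
SELECT ua.website_id, a.label, COUNT(*)
FROM user_answer ua
JOIN answer a ON a.id = ua.age_range_id
GROUP BY ua.website_id, a.label
ORDER BY ua.website_id
""").fetchall()
print(rows)
```

Whether this lives in one shared database or in a separate reporting database fed by per-site exports, the query stays the same as long as 102 means "30-39" everywhere.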

Resources