I was wondering if there's a way in Solr 4 or higher to merge facets together based on a mapper. I have a lot of products inside my database with terms (inside the "color" taxonomy, for example) which actually mean the same thing. For example, the color red is described as:
red
bordeaux
light red
dark red
I would like to merge such terms, because I don't want to bother the user with hundreds of choices when I can reduce that number to dozens. I'm trying to figure out the best way to do this: create a separate table inside my database (used to map the ids of the terms together), or use functionality in Solr to do this at index time. I read something about pivot facets, but I guess that's only the way to go if there is already a hierarchy? Those different terms for red are just flat (not grouped together, yet). Any advice?
EDIT:
I think I found a solution: http://www.wunderkraut.com/blog/how-to-combine-two-facet-items-in-facet-api/2015-03-26. Any thoughts about this? It looks good to me.
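As an illustration of the mapping idea, here is a minimal query-time sketch: fetch the raw facet counts from Solr and collapse synonymous terms through a small map in the application. The core URL, the "color" field name, and the synonym map are placeholders, not part of any existing setup.

```python
# A minimal sketch: merge facet values through a mapping table at query time.
# The Solr core URL, the "color" field, and COLOR_MAP are placeholders.
from collections import defaultdict
import requests

COLOR_MAP = {
    "bordeaux": "red",
    "light red": "red",
    "dark red": "red",
}

def merged_color_facets(solr_url="http://localhost:8983/solr/products/select"):
    params = {
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.field": "color",
        "wt": "json",
    }
    raw = requests.get(solr_url, params=params).json()
    # Solr returns facet counts as a flat [value, count, value, count, ...] list.
    pairs = raw["facet_counts"]["facet_fields"]["color"]
    merged = defaultdict(int)
    for value, count in zip(pairs[::2], pairs[1::2]):
        merged[COLOR_MAP.get(value, value)] += count
    return dict(merged)

print(merged_color_facets())
```

The alternative is to normalize the values at index time (for example with a synonym-mapped copy of the field), so that Solr itself returns the merged counts and the mapping lives in the schema rather than in application code.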
Related
I'm developing a project where I want to have hierarchical facets.
I have an index with a complex structure, like:
Index
- field1
- List<othercomplexfield>
And othercomplexfield contains another list with anothercomplexfield inside.
I'd like to be able to give users the possibility to:
Have the facets of field1.
When one is selected, I'd like to give the user the possibility to select one of the values of a certain field of "othercomplexfield" while filtering by the selected field1.
I can do that.
I'd then like to give the user the possibility to select one of the possible values of "anothercomplexfield" while filtering by field1 AND by the selected othercomplexfield.
The difficulty here is that I don't want every possible facet value, but only the ones CONTAINED by the othercomplexfield that I'm filtering for.
So far I have had to do this in C#, and I did not find a way to write a query that gives me back the distinct values I want from Azure Search.
Does someone have a similar problem?
Did I explain the problem well enough?
I saw no clear guidance online; everything is easy if you only have level-1 facets, but when you get into nested objects it's not so clear anymore.
I'm not sure I fully understand the context of your question. What I can tell you is that filters only apply at the document level, not at the complex-collection level. What I mean by that is that if a filter matches an item in a complex collection, the entire document will be returned, not just the item in the complex collection that matched. The same is true for facets: facets will count all documents in the result set that match the filter and can't be scoped down to just parts of documents. With that, it seems like having this logic in your application, as you mentioned, might be the best approach for your current index schema.
We do have this old blog post that talks about one way to implement hierarchical facets with Azure Cognitive Search which may give you some other ideas on how you could implement the functionality you're looking for: https://learn.microsoft.com/en-us/archive/blogs/onsearch/multi-level-taxonomy-facets-in-azure-search
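To make the suggested application-side approach concrete, here is a rough sketch against the Azure Cognitive Search REST API. The endpoint, index, key, API version, and the field names field1 / othercomplexfield / anothercomplexfield (plus a nested name subfield) are placeholders based on the question, not a real schema.

```python
# A rough sketch of the application-side approach: filter at the document level
# in Azure Cognitive Search, then narrow down to the nested collection items in
# the application. All names below are placeholders from the question.
import requests

ENDPOINT = "https://<service>.search.windows.net"   # placeholder service
INDEX = "<index-name>"                               # placeholder index
API_KEY = "<query-key>"                              # placeholder key

def nested_facet_values(field1_value, othercomplex_value):
    url = f"{ENDPOINT}/indexes/{INDEX}/docs/search?api-version=2023-11-01"
    body = {
        "search": "*",
        # Filters match whole documents, so this only narrows the result set.
        "filter": f"field1 eq '{field1_value}'",
        "select": "othercomplexfield",
        "top": 1000,
    }
    docs = requests.post(url, json=body, headers={"api-key": API_KEY}).json()["value"]

    # Scope down to collection items in the application, since filters and
    # facets cannot do this below the document level.
    values = set()
    for doc in docs:
        for item in doc.get("othercomplexfield", []):
            if item.get("name") == othercomplex_value:
                for inner in item.get("anothercomplexfield", []):
                    if inner.get("name"):
                        values.add(inner["name"])
    return sorted(values)
```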
I am trying to create different STATISTICS objects using different attributes in a database.
Problem - 1
My aim is to find the error in selectivity estimates for different attribute combinations. I want to compare these results with some other experiments. Here is what I have done:
I generated every attribute combination (nC1, nC2, ..., nCn, where n is the number of attributes): one-attribute combinations, two-attribute combinations, and so on. For example (name), (name, age), (name, age, zip), (age, zip), etc.
I created a STATISTICS object for each combination using CREATE STATISTICS <name> ON <attribute_combination> FROM <table_name>.
I ran ANALYZE on the table <table_name>.
Now I want to run a set of queries against each of these STATISTICS objects and get the selectivity for each one.
How can I go about this problem? I am using PostgreSQL 10. Any ideas?
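As a rough sketch of the first steps (assuming the psycopg2 driver and a hypothetical people table with name, age and zip columns), the combinations and statistics objects could be generated like this; note that CREATE STATISTICS requires at least two columns, since single-column statistics are already gathered by plain ANALYZE.

```python
# A rough sketch: create one extended-statistics object per column combination,
# then ANALYZE. Table name, columns, and connection string are placeholders.
import itertools
import psycopg2

TABLE = "people"                      # hypothetical table
COLUMNS = ["name", "age", "zip"]      # hypothetical columns

conn = psycopg2.connect("dbname=mydb")   # placeholder connection string
cur = conn.cursor()

# Extended statistics need at least two columns; plain ANALYZE already covers
# the single-column case.
for r in range(2, len(COLUMNS) + 1):
    for combo in itertools.combinations(COLUMNS, r):
        stat_name = "stat_" + "_".join(combo)
        cur.execute(
            f"CREATE STATISTICS IF NOT EXISTS {stat_name} "
            f"ON {', '.join(combo)} FROM {TABLE}"
        )

cur.execute(f"ANALYZE {TABLE}")
conn.commit()
```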
Problem - 2
The second problem is that I want to know the size of each of these STATISTICS objects. How can I find the size of each of the STATISTICS objects I created above?
Thanks in advance for answering my queries.
The purpose of CREATE STATISTICS is different. You can create extended statistics so that the planner is aware of relationships between columns, functions and so on. That way the DBA can provide a better, dynamic hint for the planner. The docs for CREATE STATISTICS have a nice explanation of that.
To see information about such an object there is a dedicated catalog, pg_statistic_ext.
To get selectivity numbers you can use EXPLAIN ANALYZE, but I would say this is a dead end and you should choose another path... Sorry for the bad news.
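For what it's worth, here is a rough sketch of that EXPLAIN ANALYZE route, comparing the planner's estimated rows with the actual rows, plus a look at the size of the extended-statistics data in pg_statistic_ext (PostgreSQL 10 layout). The test query and connection string are placeholders.

```python
# A rough sketch: estimated vs. actual row counts via EXPLAIN (ANALYZE, FORMAT
# JSON), and an approximate per-object size from pg_statistic_ext (PG 10).
import json
import psycopg2

conn = psycopg2.connect("dbname=mydb")   # placeholder connection string
cur = conn.cursor()

def estimate_vs_actual(query):
    cur.execute("EXPLAIN (ANALYZE, FORMAT JSON) " + query)
    raw = cur.fetchone()[0]
    doc = raw if isinstance(raw, list) else json.loads(raw)
    plan = doc[0]["Plan"]
    return plan["Plan Rows"], plan["Actual Rows"]

# Placeholder query against the hypothetical table from the question.
est, act = estimate_vs_actual(
    "SELECT * FROM people WHERE name = 'Ann' AND zip = '12345'")
print(f"estimated={est} actual={act} ratio={est / max(act, 1):.2f}")

# In PostgreSQL 10 the serialized ndistinct / dependencies data lives directly
# in pg_statistic_ext, so pg_column_size gives a rough per-object size.
cur.execute("""
    SELECT stxname,
           pg_column_size(stxndistinct)    AS ndistinct_bytes,
           pg_column_size(stxdependencies) AS dependencies_bytes
    FROM pg_statistic_ext
""")
for row in cur.fetchall():
    print(row)
```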
Scenario.
I have a document in the database which has thousands of items in 'productList', as below.
All the objects in the 'productList' array have the same shape and the same fields, with different values.
Now I want to search in the following way.
When a user types 'c' in the 'Ingrediants' field, the list should show all 'Ingrediants' values that start with the letter 'c'.
When a user types 'A' in the 'brandName' field, the list should show all 'brandName' values that start with the letter 'A'.
Please give an example of how to search for this, either by:
creating an index (json, text),
creating a search index (design document), or
using views, etc.
Note: I don't want to create an index at run time (I mean the index can be defined from the Cloudant dashboard); I just want to query it from this library in the application.
I have read the documentation and I get the concepts.
Now I want to implement it with the best approach.
I will use this approach to handle all such scenarios in the future.
Sorry if the question is stupid :)
thanks.
CouchDB isn't designed to do exactly what you're asking. You'd need one index for Ingredient, and another for Brand Name - and it isn't particularly performant to do both at once. The best approach I think would be to check out the Mango query feature http://docs.couchdb.org/en/2.0.0/api/database/find.html, try the queries you're interested in and then add indexes as required (it has the explain plan to help make this more efficient).
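For illustration, here is a rough Mango /_find sketch over plain HTTP, assuming the documents really do look like {"productList": [{"Ingrediants": ..., "brandName": ...}]} as in the question; the database URL and credentials are placeholders. Note that Mango returns whole documents, so the matching item values still have to be picked out in the application.

```python
# A rough sketch of a prefix lookup with CouchDB/Cloudant's /_find (Mango).
# Database URL, credentials, and the document shape are assumptions.
import requests

BASE = "https://<account>.cloudant.com/products"   # placeholder database URL
AUTH = ("<user>", "<password>")                     # placeholder credentials

def ingredients_starting_with(prefix):
    selector = {
        "productList": {
            "$elemMatch": {
                # Prefix match expressed as a range: prefix <= value < prefix + high char.
                "Ingrediants": {"$gte": prefix, "$lt": prefix + "\ufff0"}
            }
        }
    }
    body = {"selector": selector, "fields": ["productList"], "limit": 25}
    docs = requests.post(f"{BASE}/_find", json=body, auth=AUTH).json()["docs"]

    # Mango matches whole documents, so pull the matching items out here.
    results = set()
    for doc in docs:
        for item in doc.get("productList", []):
            value = item.get("Ingrediants", "")
            if value.startswith(prefix):
                results.add(value)
    return sorted(results)

print(ingredients_starting_with("c"))
```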
I have a very large dataset of HTML tables (originally extracted from Wikipedia). I want to extract a meaningful set of triples from each of these tables (this is not to be confused with extracting triples from Wikipedia infoboxes, which is a relatively much easier task).
The triples have to be semantically meaningful to humans, unlike DBpedia, where triples are extracted as URIs and other formats. So I am OK with just extracting the table text values.
Keep in mind the variety of table orientations and shapes.
The main task, as I see it, is to extract the main entity of the table records (the student name in a school record, for example), so that it can be used as the triple's "Subject".
Example
For a table like this, we should detect that the main entity is "Server" and the others are only objects, so the relations should look like:
<AOLserver> <Developed by> <NaviSoft>.
<AOLserver> <Open Source> <Yes>.
<AOLserver> <Software license> <Mozilla>.
<AOLserver> <Last stable version> <4.5.1>.
<AOLserver> <Release date> <2009-02-02>.
Also, keep in mind that the main entity does not always lie in the first column of the table; there are even tables whose records do not all talk about the same subject.
This is a table where the main entity is in the last column, not the first:
This table should generate relations like:
<Arsène Wenger> <Position> <Manager>.
<Steve Bould> <Position> <Assistant manager>.
Questions
My first question: can this be done using rule-based methods, crafting some rules around examples and trying to generalize, so that I can detect the right entity? Can you suggest example rules?
My second question is about evaluation: how can I evaluate such a system? How can I measure its performance so that I can improve it?
Fantastic project!! If you get it to work, definitely try to get it incorporated into DBpedia's crawlers/extractors - http://wiki.dbpedia.org/Documentation.
For reference - http://en.wikipedia.org/wiki/Comparison_of_web_server_software
If you look at the HTML, the column titles are in a thead element, while the rows are all contained in tr elements inside tbody elements, with the title of the entity (/rdfs:label) in a th element - this should go a long way to solving your problem without getting too dirty and imprecise.
I suppose that checking the HTML structure to see how many rows have th elements would be worthwhile to evaluate this approach.
In the second example (http://en.wikipedia.org/wiki/Arsenal_F.C.), does the fact that it doesn't have a thead element help, i.e. allow us to assume that the page itself (i.e. Arsenal) is the subject of the data in the table?
There are also microformats like vCard scattered about Wikipedia that might help elucidate the relationships.
I'm not sure how generalisable it is across all of the tables in Wikipedia, but it should be a good start. I would imagine that it's vastly superior to stick to HTML structure and microformats as much as possible rather than getting into anything too tricky.
Also, each link has a DBpedia URI to identify it, which is very useful in these circumstances, e.g.:
http://example.com/resource/AOLserver http://example.com/property/Server http://dbpedia.org/resource/AOLserver .
http://example.com/resource/AOLserver http://example.com/property/Developed_by http://dbpedia.org/resource/NaviSoft .
http://example.com/property/Developed_by a rdf:Property .
http://example.com/property/Developed_by rdfs:label "Developed by"@en .
Have you seen http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/ ? It could be worthwhile for generating mappings.
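As a small proof-of-concept of the thead/th heuristic (not a full solution), a BeautifulSoup sketch along these lines could pull subject/predicate/object text out of a 'wikitable'-style table. The CSS class, the assumption that each data row keeps its entity name in a th cell, and the comparison-page URL above are all assumptions about the page markup.

```python
# A proof-of-concept: use column headers as predicates and the per-row th cell
# as the subject, per the thead/th heuristic discussed above.
import requests
from bs4 import BeautifulSoup

def table_triples(url):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    triples = []
    for table in soup.select("table.wikitable"):
        rows = table.find_all("tr")
        if not rows:
            continue
        # Column titles: from thead if present, otherwise from the first row.
        thead = table.find("thead")
        header_row = thead.find("tr") if thead else rows[0]
        predicates = [th.get_text(strip=True) for th in header_row.find_all("th")]
        for row in rows:
            if row is header_row:
                continue
            subject_cell = row.find("th")    # entity title assumed to sit in a th
            data_cells = row.find_all("td")
            if subject_cell is None or not data_cells:
                continue                     # header-only or malformed row
            subject = subject_cell.get_text(strip=True)
            # Pair each td with the column header that follows the subject column.
            for predicate, obj_cell in zip(predicates[1:], data_cells):
                obj = obj_cell.get_text(strip=True)
                if obj:
                    triples.append((subject, predicate, obj))
    return triples

for s, p, o in table_triples(
        "http://en.wikipedia.org/wiki/Comparison_of_web_server_software")[:5]:
    print(f"<{s}> <{p}> <{o}>")
```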
So, finally, I've been able to achieve the goal of my project; it required a lot of work and testing, but it was achieved.
The idea rested mainly in a pipeline like the following:
1. A component to extract the tables and import them into an in-memory object.
2. A component to exclude bad tables; these are things that use table tags but are not really tables (sometimes the writers of a page want to organize how the data appears, so they put it in a table).
3. A component to strip the styling off the tables and to resolve column/row spans by repeating the data by the number of the span.
4. A machine-learning-based classifier to classify the orientation of the table (horizontal/vertical) and the header row/column of that table.
5. A machine-learning-based classifier to classify the rows/columns that should be the "subject" of the relationship triple < subject > < predicate > < object >.
The first classifier is a support vector machine that takes features like character count, table/row cell-count ratio, numbers-to-text ratio, capitalization, etc.
We achieved about 80-85% on both precision and recall.
The second classifier is a random forest that takes features more related to the relevance of cells inside one row/column. We also achieved about 85% on both precision and recall.
Some other refinement components and heuristics were involved in the process to make the output cleaner and more related to the context of the table.
Generally, no additional Wikipedia-specific data was used, so that the tool stays general enough for any HTML table on the web, but the training data of the classifiers was mainly biased towards Wikipedia content!
I'll be updating the question code with the source code once it's finalized.
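For readers who want a concrete starting point, the two classifiers from steps 4 and 5 could be wired up with scikit-learn roughly as below. This is an illustrative sketch with made-up feature values and labels, not the author's actual code; only the model choices follow the description above.

```python
# Illustrative sketch of the two-classifier setup: an SVM for table
# orientation and a random forest for the subject row/column.
# Feature values and labels below are toy data, not real measurements.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Step 4: orientation classifier. One row per table, with features such as
# average character count, numbers-to-text ratio, and capitalization ratio.
orientation_X = np.array([
    [12.3, 0.10, 0.50],
    [45.0, 0.60, 0.10],
    [20.1, 0.05, 0.70],
    [38.7, 0.55, 0.15],
])
orientation_y = ["horizontal", "vertical", "horizontal", "vertical"]
orientation_clf = SVC(kernel="rbf").fit(orientation_X, orientation_y)

# Step 5: subject classifier. One row per candidate row/column, with features
# describing how entity-like its cells are (distinct-value ratio, link ratio,
# average token count, ...).
subject_X = np.array([
    [0.95, 0.90, 1.2],
    [0.30, 0.05, 3.4],
    [0.98, 0.85, 1.1],
    [0.20, 0.00, 5.0],
])
subject_y = [1, 0, 1, 0]   # 1 = this row/column holds the triple subject
subject_clf = RandomForestClassifier(n_estimators=100).fit(subject_X, subject_y)

print(orientation_clf.predict([[15.0, 0.08, 0.65]]))
print(subject_clf.predict([[0.90, 0.80, 1.3]]))
```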
We're reimplementing a search that includes locations that need to be clustered on a map. I've been searching without luck for an implementation in Solr.
The current search with map clustering implemented is at http://www.uship.com/find
Has anyone seen similar or have ideas about how to best do this?
Regards,
Nick
If the requirement is to cluster a fairly small number of points, perhaps less than 1000, then Solr needn't be involved. Grab the points and plot them using something like HeatmapJS.
I presume the requirement is to cluster all results in a search, which may potentially be many thousands or even millions of documents. I suggest starting with generating a heatmap of the densities over a grid of the search area.
You can do this by indexing each point encoded in geohash form at each length (e.g. D2RY, D2R, D2, D), but precede each value with its length: 4_D2RY, 3_D2R, 2_D2, 1_D. These little strings go into a multi-valued "string" type field in Solr that you will then facet on. When faceting, you come up with a suitable grid resolution (i.e. geohash prefix length) and use it as a prefix query, like facet.prefix=4_. You can index the point in a LatLonType field separately and do a standard bounding-box query there. At that point, your faceted search results give you the information to fill in a grid of numbers.
The beauty of this scheme is that it is fast: you could generate such heatmaps on the fly. It will use a fair amount of RAM, though, since this is faceting on a multi-valued field that will have a ton of values. This is something I want to add to the new Lucene spatial module (or perhaps at the Solr layer) in a way that won't need extra memory and will be easy to use. It won't make it into Solr 4.0, but maybe 4.1.
At this stage, perhaps a heatmap is fine as-is. But you may want to apply clustering on top of this, as your question states. Someone tipped me off to some interesting geo clustering algorithms that can be applied to heatmaps.
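Here is a minimal sketch of the length-prefixed geohash scheme described above: one helper builds the multi-valued field values at index time, another runs the faceted query at a chosen resolution. The core name, the field names (geohash_prefixes, location), and the URLs are placeholders, and computing the geohash string itself is left to whatever geohash library you already use.

```python
# A minimal sketch of length-prefixed geohash faceting in Solr.
# Core name, field names, and URLs are placeholders.
import requests

def geohash_prefix_values(geohash):
    """Turn e.g. 'D2RY' into ['1_D', '2_D2', '3_D2R', '4_D2RY'] for a
    multi-valued string field."""
    return [f"{i}_{geohash[:i]}" for i in range(1, len(geohash) + 1)]

def index_point(doc_id, geohash,
                update_url="http://localhost:8983/solr/places/update"):
    doc = {"id": doc_id, "geohash_prefixes": geohash_prefix_values(geohash)}
    requests.post(update_url, json=[doc], params={"commit": "true"})

def heatmap_counts(bbox_filter, precision=4,
                   select_url="http://localhost:8983/solr/places/select"):
    """Facet at one grid resolution; bbox_filter is a standard fq on a separate
    LatLonType field, e.g. 'location:[45,-94 TO 46,-93]'."""
    params = {
        "q": "*:*",
        "fq": bbox_filter,
        "rows": 0,
        "facet": "true",
        "facet.field": "geohash_prefixes",
        "facet.prefix": f"{precision}_",
        "facet.limit": -1,
        "wt": "json",
    }
    raw = requests.get(select_url, params=params).json()
    counts = raw["facet_counts"]["facet_fields"]["geohash_prefixes"]
    # Flat [value, count, value, count, ...] list -> {geohash_cell: count}
    return {v.split("_", 1)[1]: c for v, c in zip(counts[::2], counts[1::2])}
```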
I don't know whether you searched lucidworks, but there are many interesting resources there:
Search with Polygons: Another Approach to Solr Geospatial Search
Go through these:
http://www.lucidimagination.com/search/?q=geospatial#%2Fn
Already implemented in Solr:
http://wiki.apache.org/solr/SpatialSearch/ (what's wrong with this approach?)
http://wiki.apache.org/solr/SpatialSearchDev
https://issues.apache.org/jira/browse/SOLR-3304