Doing complex multi-table search queries using Solr - solr

I just started learning Solr because I want to run advanced search queries (the kind considered simple in SQL) on a large amount of data. From what I have read so far (using Solarium), I can index, update, select and delete, but only on a single kind of data (relation/table). What I would like is to perform operations across tables, as SQL would in its own way. Here is an example scenario of what I could be working on.
Here are samples of data based on the relation above.
<root>
<!-- ID for Solr -->
<id>some_id</id>
<table>house</table>
<house_id>1</house_id>
<house_name>Gryffindor</house_name>
</root>
<root>
<!-- ID for Solr -->
<id>some_other_id</id>
<table>student</table>
<student_id>1</student_id>
<firstname>Albus</firstname>
<lastname>Dumbledore</lastname>
<house_id>1</house_id>
</root>
<root>
<!-- ID for Solr -->
<id>some_different_id</id>
<table>battle</table>
<student_id_1>1</student_id_1>
<student_id_2>3</student_id_2>
</root>
An example search query would be "full name of students from different houses who fought each other and the name of their respective houses".
In SQL I would do something like:
SELECT * FROM houses housA, students studA, houses housB, students studB, battles
WHERE studA.house_id = housA.house_id AND studB.house_id = housB.house_id AND
housA.house_id <> housB.house_id AND
((studA.student_id = battles.student_id_1 AND studB.student_id = battles.student_id_2) OR (studA.student_id = battles.student_id_2 AND studB.student_id = battles.student_id_1));
And the solution would be every field (all three tables) for Dumbledore vs Snape and Potter vs Who.
Can it be done with Solr?

With Solr you have to think backwards, starting from the queries you want to run, and then flatten the information to match them. In your case, it seems the entity in Solr would be a fight, with all the other information (house, name, etc.) flattened into that record. That allows queries like "which house had the most matches", etc.
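As a rough illustration of that flattening, here is a minimal Python sketch that turns the three tables from the question into one denormalized document per battle. The output field names (`student_1_name`, `student_1_house`, `different_houses`) and the second student and house are assumptions made up for the example, not anything Solr requires.

```python
# Hypothetical in-memory copies of the three "tables" from the question.
# Student 3 and the Slytherin house are made-up sample data.
houses = {1: "Gryffindor", 2: "Slytherin"}
students = {
    1: {"firstname": "Albus", "lastname": "Dumbledore", "house_id": 1},
    3: {"firstname": "Severus", "lastname": "Snape", "house_id": 2},
}
battles = [{"student_id_1": 1, "student_id_2": 3}]


def flatten_battle(battle, students, houses, doc_id):
    """Build one denormalized document per battle (the 'join' happens here)."""
    doc = {"id": doc_id, "table": "battle"}
    for n in (1, 2):
        student = students[battle[f"student_id_{n}"]]
        doc[f"student_{n}_name"] = f"{student['firstname']} {student['lastname']}"
        doc[f"student_{n}_house"] = houses[student["house_id"]]
    # Precompute the condition the query needs, so it becomes a simple filter.
    doc["different_houses"] = doc["student_1_house"] != doc["student_2_house"]
    return doc


docs = [flatten_battle(b, students, houses, f"battle_{i}") for i, b in enumerate(battles)]
```

With documents shaped like this, the original question reduces to a single-field filter such as `different_houses:true`, assuming the index schema defines that field.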
Solr also supports nested documents, but their use is not quite the same as database joins and does not seem to match your use case. I am just mentioning it here so you are aware of it.

Related

Solr spatial search advanced! Solr field value in the query? Solr 4.10

Let's say we have a Solr document representing a building with multiple location fields. Every building document has at least one location, which indicates the building's own location. All other location fields are dynamic and represent facilities around the building.
Let's say that these facilities are type-based, for example: 1 - schools, 2 - parks, 3 - parking lots.
Each building may therefore have a variety of these facilities. Some buildings may point to a facility of the same type at the same location, while others may point to the same type at a different location.
In essence we have:
building: {
...
main_location: "lat:long",
facility_1_location: "lat:long",
facility_2_location: "lat:long",
...
}
How do we construct a query to find all buildings that have a facility of type "schools" (type "1") within a 5 kilometer radius?
One potential solution is to use sub-queries, where each sub-query takes the main_location of a building and queries against facility_1_location. However, the query will grow very quickly if we have a lot of buildings stored.
Another solution would be to use the document's own main_location field when constructing the query, but I am not sure whether that is possible in Solr. I tried and searched for it, but couldn't find a solution.
Are there any experts on this? I am using Solr 4.10.

how to index xml in solr with same tags but different values

An XML document has two sets of similar tags with different data.
<address>
<door_num>100</door_num>
<street>hundred street</street>
<city>XYZ</city>
</address>
<address>
<door_num>200</door_num>
<street>two hundred street</street>
<city>ABC</city>
<active>1</active>
</address>
What is the best way to index this? A search by door_num 100 and city XYZ must return the document, whereas a search by door_num 100 and city ABC must not return any document. Storing the fields as multivalued does not help here. Also note that the second address, with door_num 200, may or may not be present in the XML. Please suggest.
Model this data as nested documents: the address info would be stored in nested docs, and you can then query them so that both door_num and city need to match on the same nested doc.
Regarding how to actually get them into the index, you have several options:
Write some Java (or Groovy, or any other JVM language) code with SolrJ, build your docs on the client side, and index them.
If you don't like Java, you can write client-side code in any other language, build your docs as XML/JSON that Solr can ingest, and index them.
If you don't want to write any code at all, try DIH with XPathEntityProcessor; you might achieve all you need.
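To make the second option more concrete, here is a minimal Python sketch that builds a nested document as JSON and a block join query string. It assumes a Solr version that accepts the `_childDocuments_` key in JSON updates. The `doc_type` field, its `parent`/`address` values, and the id scheme are hypothetical choices for this example.

```python
import json


def nested_doc(doc_id, addresses):
    """Wrap each address as a child document of one parent document."""
    children = []
    for i, address in enumerate(addresses):
        # Every child needs its own unique id; doc_type is an assumed marker field.
        child = {"id": f"{doc_id}_addr_{i}", "doc_type": "address"}
        child.update(address)
        children.append(child)
    return {"id": doc_id, "doc_type": "parent", "_childDocuments_": children}


def parent_query(door_num, city):
    """Block join: return parents having ONE child that matches both clauses."""
    return f'{{!parent which="doc_type:parent"}}+door_num:{door_num} +city:{city}'


doc = nested_doc("doc1", [
    {"door_num": 100, "street": "hundred street", "city": "XYZ"},
    {"door_num": 200, "street": "two hundred street", "city": "ABC", "active": 1},
])
payload = json.dumps([doc])  # body to send to Solr's JSON update handler
```

Because both clauses are evaluated against the same child document, `parent_query(100, "XYZ")` should find the parent while `parent_query(100, "ABC")` should not, which is the behavior flat multivalued fields cannot give.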

How do I create a Solr query that returns results even if one field in my query has no matches?

Suppose I want to create a recommendation system to suggest people you should connect with based off of certain attributes that I know about you and attributes I have about other people that are stored in a Solr index. Is it possible to query the index with a list of attributes (along with boosts for each attribute) and have Solr return scored results even if some of my fields return no matches? The way that I understand that Solr works is that if one of your fields doesn't contain a match in any documents found in your index, you get zero results for the entire query (even if other fields in the query matched) - is that right? What I would hope is that I could query the index and get a list of results back in order of a score given based on how many (and which) fields matched to something, even if some fields have no matches, for example:
Say that there are 2 people documents stored in the index as follows (figuratively):
Person 1:
Industry: Manufacturing
City: Oakland
Person 2:
Industry: Manufacturing
City: San Jose
And say that I perform a pseudo-Solr query that basically says "Search for everyone whose industry is equal to manufacturing and whose city is equal to Oakland". What I would like is to receive both results back in the result set, even though one of the "Persons" does not reside in Oakland. I just want that person to come back as a result with a lower score than Person1. Is this possible? What might a solr query look like to handle this? Assume that I have many more than 2 attributes for each person (so saying that I can use "And" and "Or" in my solr query isn't really feasible.. or is it?) Thanks in advance for your helpful input! (PS I'm using Solr 3.6)
You mention using the AND operator, which is likely your problem.
The default behavior of Lucene, and Solr, query syntax is exactly what you are asking for. A query like:
industry:manufacturing city:oakland
will match either, with a scoring preference for documents that match both. See the Lucene query syntax documentation.
You can also use the bq (boost query) parameter of the dismax and edismax query parsers, which does not affect matching, only the scores:
http://localhost:8983/solr/persons/select?defType=edismax&q=industry:manufacturing&bq=city:oakland^2
Play with the boosting factor at the end to get the right balance between the matching score and the boosting score.
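For many attributes, the query string can be generated rather than written by hand. The following is a minimal Python sketch, assuming the default operator is OR and that the values contain no spaces or special query characters. The `optional_clauses` helper and the boost values are made up for illustration.

```python
from urllib.parse import urlencode


def optional_clauses(attributes):
    """Join field:value^boost clauses with spaces so that each one is optional."""
    return " ".join(
        f"{field}:{value}^{boost}" for field, (value, boost) in attributes.items()
    )


# Attribute -> (value, boost); a higher boost makes a match on that field count more.
attributes = {"industry": ("manufacturing", 1), "city": ("oakland", 2)}

q = optional_clauses(attributes)
url = "http://localhost:8983/solr/persons/select?" + urlencode({"q": q, "fl": "*,score"})
```

Requesting `fl=*,score` returns the score with each document, which helps when tuning the boosts.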

SOLR: Is it possible to index multiple timestamp:value pairs per document?

Is it possible in solr to index key-value pairs for a single document, like:
Document ID: 100
2011-05-01,20
2011-08-23,200
2011-08-30,1000
Document ID: 200
2011-04-23,10
2011-04-24,100
and then querying for documents with a specific value aggregation in a specific time range, i.e. "give me documents with sum(value) > 0 between 2011-08-01 and 2011-09-01" would return the document with id 100 in the example data above.
Here is a post from the Solr User Mailing List where a couple of approaches for dealing with fields as key/value pairs are discussed.
1) encode the "id" and the "label" in the field value; facet on it;
require clients to know how to decode. This works really well for simple
things where the id=>label mappings don't ever change, and are
easy to encode (ie "01234:Chris Hostetter"). This is a horrible approach
when id=>label mappings do change with any frequency.
2) have a separate type of "metadata" document, one per "thing" that you
are faceting on containing fields for id and the label (and probably a
doc_type field so you can tell it apart from your main docs) then once
you've done your main query and gotten the results back faceted on id,
you can query for those ids to get the corresponding labels. this works
really well if the labels ever change (just reindex the corresponding
metadata document) and has the added bonus that you can store additional
metadata in each of those docs, and in many use cases for presenting an
initial "browse" interface, you can sometimes get away with a cheap
search for all metadata docs (or all metadata docs meeting a certain
criteria) instead of an expensive facet query across all of your main
documents.
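The first approach in the quoted post can be sketched in a few lines of Python. The function names are hypothetical. The only convention assumed is the "id:label" format shown in the post, with the id itself containing no colon.

```python
def encode_facet_value(item_id, label):
    """Pack id and label into one field value, e.g. '01234:Chris Hostetter'."""
    return f"{item_id}:{label}"


def decode_facet_value(value):
    """Split on the first colon only, so labels may themselves contain colons."""
    item_id, label = value.split(":", 1)
    return item_id, label
```

Clients that receive facet values in this form call `decode_facet_value` to recover the id and the display label.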

How can I find a city and country based on a user search?

I am trying to search a SQL Server 2008 table (containing about 7 million records) for cities and countries based on free-text user input. The search string that I get from the user can be anything like:
"Hotels in San Francisco, US" or "New York, NY" or "Paris sddgdfgxx" or "Toronto Canada". Terms are not always separated by commas, are not in a specific order, and there might be useless data.
This is what I tried:
Method 1: FTS with contains:
ex: select * from cityNames where contains(cityname,'word1 and word2') -- with AND
select * from cityNames where contains(cityname,'word1 or word2') -- with OR
This didn't work very well because a term like 'sddgdfgxx' would return nothing if used with 'AND'. Using OR works for one-word cities like 'Paris' but not for 'San Diego' or 'San Francisco'.
Method 2: this is actually a reverse search. The logic is to check whether the user input string contains any of the cities or countries from my table. This way I'll know for sure that 'Aix en Provence' or 'New York' was searched for.
ex: select * from cityCountryNames where 'Ontario, Canada, Toronto' like cityCountryNames
Notes: I wasn't able to get results for two-word cities, and the query was slow.
Any help is appreciated.
I would strongly recommend using a 3rd-party API like the Google Geocoding API to take such input and parse it into a location with discrete parts (street address, city, state, country, etc.) Then you could use those discrete parts to search your database if necessary.
Map services like Google and Bing have solved this problem way better than you or I ever would, so why not leverage all the work they've done?
SQL isn't designed for the kinds of queries you are performing, certainly not at scale.
My recommendation would be as follows:
Index all your places (cities + countries) into a Solr index. Solr is a FOSS search server built on Lucene and can easily query an index of 7 million records in milliseconds.
Query Solr with the string the user typed, and the first match should be the best match.
So even if the user typed "Paris sddgdfgxx", Paris should be your first hit. If you want to get really sophisticated, use a word n-gram approach (known in Lucene as shingles).
Since Solr offers a RESTful (HTTP) API, it should integrate easily into whatever platform you are on.
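To show what the shingle idea means for this problem, here is a small pure-Python sketch. It does not call Solr. It only illustrates splitting the user input into word n-grams and checking them against known place names. The `known_places` set and both function names are made up for the example.

```python
import re


def shingles(text, max_size=3):
    """Yield word n-grams of the input, longest first."""
    words = re.findall(r"\w+", text.lower())
    for size in range(max_size, 0, -1):
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])


def find_places(text, known_places, max_size=3):
    """Return every shingle of the input that is a known place name."""
    return [s for s in shingles(text, max_size) if s in known_places]


# Hypothetical stand-in for the 7 million row table.
known_places = {"san francisco", "paris", "toronto", "canada", "new york"}

print(find_places("Hotels in San Francisco, US", known_places))
print(find_places("Paris sddgdfgxx", known_places))
```

In Solr the equivalent splitting is done at analysis time by a shingle filter, so multi-word names like "San Francisco" can match as a unit while junk terms simply fail to match anything.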

Resources