Splitting data, indexing and querying - Solr

I have a table in the DB with 2 columns, id and detail.
The id column has unique ids and the detail column has data like below -
A 20% B 30% C 50%
B 50% D 50%
X 10% A 40% Z 50%
I can do nothing about the way it is in the DB.
I want to let my users search using the following queries -
A < 20, meaning all documents where A is less than 20%.
B > 30, X > 5%, meaning all documents where B is greater than 30 and X is greater than 5.
I am unable to figure out the combination of tokenizer and filters to get this going.
What I have done is find the total number of unique types (A, B, C, ...) and create that many fields in the Solr schema: typeCode1 for A, typeCode2 for B, etc., with corresponding value fields typeValue1, typeValue2, etc. If A is not present in a document, then its typeCode1 is null and so is its typeValue1 field. I also have a mapping table in the DB where I look up which type the user entered, find the corresponding Solr field, and then search on it.
EDIT - Adding a few more details
The data is fetched from the DB. Let us say it is A 20% D 30% C 50%.
Then I split on %<space> (String.split("% ")), so I have 3 entries in my array.
Then I check the type mapping in the DB to find out which Solr field name corresponds to which type.
Once I have the field, I submit A to typeCode1 and 20 to typeValue1, D to typeCode4 and 30 to typeValue4, and so on.
Currently the total number of unique types I have is 45; however, it can increase, and my current approach is not scalable.

One possible solution is to add a dynamic field for each typeCode, such as A_code with 20 as the value. That will allow you to use the field as you'd use any other field in Solr: query it with intervals and above/under ranges, facet on it, etc.
<dynamicField name="*_code" type="int" indexed="true" stored="true" />
The only "real" downside is that your cache size will grow, since you'll get one internal cache per field. This cache will be sized according to the total number of documents in the index. For a small index like the one you describe and with only 45 different field names that shouldn't be an issue.
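With that dynamic field in place, a row like A 20% B 30% C 50% becomes a document with one integer field per type, and the user queries translate directly into range queries. A minimal sketch, assuming the *_code naming from above (square brackets are inclusive bounds, curly braces exclusive):

{ "id": "1", "A_code": 20, "B_code": 30, "C_code": 50 }

"A < 20" then becomes:
q=A_code:[* TO 20}

"B > 30, X > 5%" becomes:
q=B_code:{30 TO *] AND X_code:{5 TO *]

New types need no schema change; each one simply becomes another *_code field at index time.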

Related

Solr field for storing range

Is there a solr field type that would work well storing a range of two values?
For example, I'm trying to store a min and max cost for each document, e.g. $0 to $100, or $50 to $100.
I'd then want to be able to query with a single value to see if it falls in the range, e.g. which documents' ranges allow $25?
I realize a workaround would be to store min and max separately, but I'm wondering if any native field type supports this, to simplify querying.
There is no field type which stores a range as a single value and answers containment queries against it. You can have a look at the Solr field types here.
As you said, you can keep min and max as separate fields, and it will not make your query complicated: you only need the condition value > field_min && value < field_max in your Solr query.
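For the $25 example from the question, assuming the two fields are named cost_min and cost_max (hypothetical names), the query would be:

q=cost_min:[* TO 25] AND cost_max:[25 TO *]

The square brackets make the bounds inclusive, so a range whose endpoint is exactly $25 still matches; use curly braces for exclusive bounds.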

Can someone explain to me the meaning of field SELECTIVITY in relation to cardinality?

I read this: http://www.programmerinterview.com/index.php/database-sql/cardinality-versus-selectivity/
but it still doesn't really sink in.
So let's say we have 993 records and a cardinality of 13; that means there are 13 unique/possible values across the 993 records. Its selectivity is 0.013, or 1.3%, right?
Now, what does 1.3% mean? All I know is that lower is worse and higher selectivity is better, meaning more unique values, and the SQL engine optimizer is happy. But how can I explain 1.3%?
1.3% of what?
When I select a row, is the variability only 1.3% of the 13 possible records?
Sorry, it has been 20+ years since I had my stats classes.
The 1.3% is measured against all the rows in the table, but you are confusing yourself by dwelling on the percentage itself.
When you query a table, you want to get to the relevant rows as quickly as possible. The database has to choose which index to search first, and you want this index to return as small a set of rows as possible, with the relevant rows inside.
Imagine that you are looking for John Smith the guitar repairer in the Yellow Pages. There are 10,000 names and you have 2 choices:
Browse through the Last Name index, where all last names are grouped by their first character. This gives you a cardinality of 26, selectivity = 0.26%.
Browse through the Guitar Repair category. There are 500 business categories in your city so cardinality = 500, selectivity = 5%.
If you choose the first index, you then have to search through the "S" group, which contains on average 10,000 / 26 = 384.6 names.
If you choose the second index, you will have to search through the Guitar Repairers, which contains on average 10,000 / 500 = 20 names.
Clearly, the Business Category is a better index than the Last Name because you can narrow down your search range a lot faster. That's all it means by selectivity: you can get to the rows you want as quickly as possible.
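To restate the arithmetic from the question and from the example above, one line each (plain division, nothing more):

selectivity = cardinality / total rows
13 / 993 ≈ 0.013 = 1.3% (the question's table)
26 / 10,000 = 0.26% (the Last Name index)
500 / 10,000 = 5% (the Business Category index)

The higher figure wins because higher selectivity means each index entry narrows the search down to fewer rows on average.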

How to apply boosting in solr

I am new to Solr; please help me with boosting fields.
I have a query like this:
q=name:test* OR description:test*
I want to apply a boost/weight to each field: 500 for name and 50 for description.
For example:
Let's say the term "test" appears 1 time in the name field of one record and 20 times in the description field of another record; then the boost calculation should work like this:
for name: 1 x 500 = 500
for description: 20 x 50 = 1000
As a result, the records with the highest boost value should come out on top.
So, based on the above calculation, the record with 20 matches in its description field should come first, followed by the record with 1 match in its name field.
If anyone has a solution for this, please share it.
Thanks in advance.
You can boost a field at index time with the boost attribute, or you can apply a boost in the query, such as q=name:test*^50 OR description:test* (and there are some more advanced features here as well).
It bears noting, though, that Lucene by default applies a length normalization that effectively weighs matches on shorter fields more heavily than matches on longer fields. It sounds a bit like that is what you are trying to recreate.
If you need the scoring calculation to be as simple as what you have provided, you would need to write your own Similarity class, I believe.
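For the exact weights from the question, the query-time form of that suggestion would be:

q=name:test*^500 OR description:test*^50

Note that Lucene's own scoring factors still apply on top of these boosts, so the final order will not reduce to the simple matches x weight arithmetic shown in the question; for that, the custom Similarity mentioned above would be needed.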

SOLR faceting slower than manual count?

I'm trying to get a Solr range query working. I have a database with over 12 million documents, and I am filtering by a few parameters, for example:
product_category:"category1" AND product_group:"group1" AND product_manu:"manufacturer1"
The query itself returns about 700 documents and executes in two to three seconds on average.
But when I add a date range facet to that query (I want to see how many products were added each day for the past x years), it executes in 50 seconds or more. So it seems that it would be faster to just retrieve all matching documents and count them manually in Java.
So I guess I must be doing something wrong with faceting?
Here is an example faceted query:
start=0&rows=0&facet.query=productDate%3A[0999-12-26T23%3A36%3A00.000Z+TO+2012-05-22T15%3A58%3A05.232Z]&q=source%3A%22source1%22+AND+productCategory%3A%22category1%22+AND+type%3A%22type1%22&facet=true&facet.limit=-1&facet.sort=count&facet.range=productDate&facet.range.start=NOW%2FDAY-5000DAYS&facet.range.end=NOW%2FDAY%2B1DAY&facet.range.gap=%2B1DAY
My only explanation is that Solr is counting fields over some larger document pool than the 700 documents resulting from the q= parameter. Or maybe I should filter the documents in another way?
I have tried changing the filterCache size and it works, but it seems to be a waste of memory for queries like these. After all, aggregating over 700 documents should be very fast, shouldn't it?
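For what it's worth, the usual way to express constant filters like these is as separate fq parameters rather than inside q, so that each filter is cached in the filterCache independently of the facet computation. The same request rewritten that way would look like this (shown unencoded for readability; not a verified fix for this particular index, but the standard first step):

q=*:*&rows=0&fq=source:"source1"&fq=productCategory:"category1"&fq=type:"type1"&facet=true&facet.range=productDate&facet.range.start=NOW/DAY-5000DAYS&facet.range.end=NOW/DAY+1DAY&facet.range.gap=+1DAY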

How can accents be handled more accurately with bf and q in Solr

I work with Solr and I can't fix a problem with result accuracy (q vs. bf, taking accents into account).
I have a Solr index with 2 indexed fields (this is simplified):
town, population
Félines, 100
Ferrand, 10000
When I query q=Fé&qf=town town_ascii&bf=population^2&defType=dismax,
I'd like this order in my results: Félines > Ferrand.
When I query q=Fe&qf=town town_ascii&bf=population^2&defType=dismax, I'd like this order in my results: Ferrand > Félines.
The trouble is that Ferrand beats Félines every time because its population is bigger. How can I solve that? I couldn't find out how to take the score of the query itself and use it in bf to balance population.
You didn't post your schema.xml but I suppose you're using the ASCIIFoldingFilterFactory for the town_ascii field. It means that if you're indexing the word Félines the following are the indexed terms:
town: Félines
town_ascii: Felines
Therefore, what you're saying is that a match on the town field should count for more than a match on town_ascii. You should change the qf parameter to something like qf=town^3 town_ascii to give more weight to the town field. You can then adjust that weight depending on how much a town match should count compared to population.
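Putting that together, the adjusted query would look like this, where ^3 is just a starting point to tune against the bf weight:

q=Fé&qf=town^3 town_ascii&bf=population^2&defType=dismax

For reference, since the schema wasn't posted, a typical town_ascii field type built on the ASCIIFoldingFilterFactory would look something like:

<fieldType name="text_ascii" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>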
