Indexing EAV model using Solr

The database I have at hand uses the EAV model to describe all objects one can find in a house. Good or bad isn't the question; there is no choice but to keep and use this model. 6,000+ items point to 3,000+ attributes and 150,000+ attribute-values.
My task is to get this data into a Solr index for quick searching/sorting/faceting.
In Solr, using the DataImportHandler (DIH), a regular SQL query is used to extract data. Each column name returned by the query is a 'field' (whether or not it is defined in a schema), and each row of the query's result set is a 'document'.
Because the EAV model uses rows for attributes instead of columns, a simple query will not work; I need to flatten each item's rows. What should my SQL query look like in order to extract all items from the DB? Is there a special Solr/DIH configuration I should consider?
There are some similar questions on SO, but none really helped.
Any pointers are much appreciated!
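For illustration, one common way to flatten an EAV layout into one row per item, so that DIH can consume it, is string aggregation with GROUP_CONCAT (MySQL syntax; the table and column names below are assumptions, not the actual schema):

-- A sketch, assuming tables item(id, name), attribute(id, name)
-- and item_attribute(item_id, attribute_id, value).
-- Each item becomes one row (= one Solr document), with all of its
-- attribute/value pairs collapsed into a single searchable column.
-- (group_concat_max_len may need raising for items with many values.)
SELECT i.id,
       i.name,
       GROUP_CONCAT(CONCAT(a.name, '=', ia.value) SEPARATOR ' ') AS attributes
FROM item i
LEFT JOIN item_attribute ia ON ia.item_id = i.id
LEFT JOIN attribute a ON a.id = ia.attribute_id
GROUP BY i.id, i.name;

An alternative that keeps attribute values separate (useful for faceting) is a DIH child entity that selects the attribute rows per item into a multivalued field.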

Related

Apache Solr Querying by search term from multiple tables and in all columns

I am new to Apache Solr; so far I have worked with a single table, importing it into Solr and querying the data.
Now I want to do the following:
Query across multiple tables: if I search for a word, it should return all occurrences of it across those tables.
Search in all fields of a table: the word should be matched against every field, just as in my single-table setup.
Do I need to create a single document by importing data from multiple tables using joins in data-config.xml, and then query over it?
Any leads and guidance is welcome.
TIA.
> Do I need to create a single document by importing data from multiple tables using joins in data-config.xml, and then query over it?
Yes. Solr uses a document model (rather than a relational model) and the general approach is to index a single document with the fields that you need for searching.
From the Apache Solr guide:
> Solr’s basic unit of information is a document, which is a set of data that describes something. A recipe document would contain the ingredients, the instructions, the preparation time, the cooking time, the tools needed, and so on. A document about a person, for example, might contain the person’s name, biography, favorite color, and shoe size. A document about a book could contain the title, author, year of publication, number of pages, and so on.
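As a concrete illustration, DIH's data-config.xml can join a second table into the same document with a child entity. A minimal sketch (the JDBC URL, table and column names are assumptions):

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="root" password=""/>
  <document>
    <!-- Each row of the outer query becomes one Solr document. -->
    <entity name="book" query="SELECT id, title FROM book">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <!-- The inner query runs once per book row; its columns are
           added to the same document. -->
      <entity name="author"
              query="SELECT name FROM author WHERE book_id = '${book.id}'">
        <field column="name" name="author_name"/>
      </entity>
    </entity>
  </document>
</dataConfig>

With everything in one document, a single query can match on any of the fields, which is what the multi-table word search needs.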

Cassandra, how to filter and update a big table dynamically?

I'm trying to find the best data model to adapt a very big MySQL table to Cassandra.
This table is structured like this:
CREATE TABLE big_table (
    social_id,
    remote_id,
    timestamp,
    visibility,
    type,
    title,
    description,
    other_field,
    other_field,
    ...
)
A page (not represented in this table) can contain many socials, and each social can contain many remote_ids.
social_id is the partition key; remote_id and timestamp are the clustering columns: remote_id gives uniqueness, and timestamp is used to order the results. So far so good.
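In CQL terms, that layout would be roughly the following (a sketch; the column types are assumptions):

CREATE TABLE big_table (
    social_id  text,
    remote_id  text,
    timestamp  timestamp,
    visibility int,
    type       text,
    title      text,
    -- ... other fields ...
    PRIMARY KEY ((social_id), remote_id, timestamp)
);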
The problem is that users can also search on their page contents, filtering by one or more socials, one or more types, visibility (could be 0,1,2), a range of dates or even nothing at all.
Plus, based on the filters, users should be able to set visibility.
I tried to handle this case, but I really can't find a sustainable solution.
The best I've got is to create another table, which I would need to keep in sync with the original one.
This table will have:
page_id: partition key
timestamp, social_id, type, remote_id: clustering key
Plus, create a Materialized View for each combination of filters, which is madness.
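Spelled out in CQL, that second table would be something like this (a sketch; types are assumptions):

CREATE TABLE page_contents (
    page_id   text,
    timestamp timestamp,
    social_id text,
    type      text,
    remote_id text,
    PRIMARY KEY ((page_id), timestamp, social_id, type, remote_id)
);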
Can I avoid creating the second table? What would be the best Cassandra model in this case? Should I consider switching to other technologies?
I'll start from the last questions.
> What would be the best Cassandra model in this case?
As stated in Cassandra: The Definitive Guide, 2nd edition (which I highly recommend to read before choosing or using Cassandra),
In Cassandra you don’t start with the data model; you start with the query model.
You may want to read the available chapter about data design at Safaribooksonline.com. Basically, Cassandra wants you to think only about your queries and not to worry about normalization.
So the answer to
> Can I avoid creating the second table?
is: you shouldn't avoid it.
> Should I consider switching to other technologies?
That depends on what you need in terms of replication and partitioning. You may end up building master-master synchronization on top of an RDBMS, or something else entirely. In Cassandra, you'll end up with data duplicated between tables, and that's perfectly normal for it: you trade disk space for read/write speed.
> how to filter and update a big table dynamically?
If after all of the above you still want to use a normalized data model in Cassandra, I suggest you look at secondary indexes first and then move on to custom indexes such as a Lucene index.
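For example, a plain secondary index on the table above (a sketch; keep in mind that secondary-index queries fan out across nodes unless the partition key is also constrained):

-- Index a non-key column, then filter on it within a partition.
CREATE INDEX big_table_type_idx ON big_table (type);

SELECT *
FROM big_table
WHERE social_id = 'some_social' AND type = 'photo';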

SQL with NoSQL?

I'm designing an application where one table would really benefit from being NoSQL, while the others are fine as SQL.
So I have a table where I need to store multiple, unknown attributes and then be able to search on them. The rest of the DB tables are simple relational ones.
Example:
item
    id
    name
One item can have the attributes color and shape; another item can have height and width, but not any other ones.
So it smells like NoSQL, but I do a lot more development with SQL and I always want to choose the technology that I know best.
I won't be needing a lot of selects by those attributes at the moment, so I will just add an "attributes" field where I will keep all attributes as a JSON-encoded string.
And if I need to select anything by attributes, I will write a script for that.
But maybe there's an extra feature of MySQL (this is what I'm using as RDBMS) that I don't know of? Or any better ideas?
I was also thinking of keeping a parallel MongoDB just for 'items', but I generally detest having the same data in two places.
Maybe someone knows of a technology that is a relational DB with a NoSQL extension like this?
I ended up using an RDBMS with an attributes table (id, name, value), and it's doing a great job.
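A sketch of that layout in MySQL (table and column names are assumptions):

CREATE TABLE item (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL
);

CREATE TABLE item_attribute (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    item_id INT NOT NULL,
    name    VARCHAR(255) NOT NULL,
    value   VARCHAR(255),
    FOREIGN KEY (item_id) REFERENCES item(id),
    KEY idx_name_value (name, value)   -- makes attribute lookups cheap
);

-- Finding items by an arbitrary attribute:
SELECT i.*
FROM item i
JOIN item_attribute a ON a.item_id = i.id
WHERE a.name = 'color' AND a.value = 'red';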

Database mapping to Solr

I'm building a Java app using a relational database and I wish to map its primary data to a Solr index (or indexes). However, I'm not sure how to map the components of a database. At the moment I've mapped a single row cell to a Solr/Lucene Document.
A doc would be something like this (each line is a field):
schema: "schemaName"
table: "tableName"
column: "columnName"
row: "rowNumber"
data: "data on schemaName.tableName.columnName.row"
This allows me to have a "fixed" Solr schema.xml (as far as I know, it has to be defined before creating indexes). Dynamic fields also don't seem to serve my purpose.
What I've found while searching is that a single row is usually mapped to a Solr Document, and each column is mapped as a Field. But how can I add the column names as fields in schema.xml when I don't know which columns a table has? Also, I would need to query the info as if it were SQL, i.e., search for all rows of a column in a table, etc.
With my current "solution" I can do those kinds of queries, but I'm worried about performance, as I'm new to Solr and don't know the implications it may have.
So, what do you say about my "solution"? Is there another way to map a database to a Solr index, given that the schema.xml fields should be set before indexing? I've also "heard" that a table is usually mapped to an index: how could I achieve that?
Maybe I'm just being a noob, but from the research I did I don't see how I can map a database row to a Solr Document without messing with the schema.xml fields.
I would appreciate any thoughts :) Regards.
You can specify your table columns in the schema beforehand, or use dynamic fields, and then use the Solr DIH to import the data into Solr from the database. Alias the columns in your DIH queries to match your dynamic field names.
Please go through the Solr DIH documentation for database integration.
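A minimal sketch of the dynamic-field route (the *_s suffix and the column aliases are assumptions):

<!-- In schema.xml: one dynamicField catches every column aliased to a
     *_s name, so the schema needn't list columns you don't know yet. -->
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>

<!-- In the DIH entity, alias each column to match the pattern:
     query="SELECT id, author AS author_s, title AS title_s FROM book" -->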

Full-text Search on Joined, Hierarchical Records in SQL Server 2008

Probably a noob question, but I'll go for it nevertheless.
For sake of example, I have a Person table, a Tag table and a ContactMethod table. A Person will have multiple Tag records and multiple ContactMethod records associated with them.
I'd like to have a forgiving search which will search among several fields from each table. So I can find a person by their email (via ContactMethod), their name (via Person) or a tag assigned to them.
As a complete noob to FTS, two approaches come to mind:
1. Build some complex query which addresses each field individually.
2. Build some sort of lookup table which concatenates the fields I want to index, and just do a full-text query on that derived table.
(Feel free to edit for clarity; I'm not in it for the rep points.)
If your SQL Server supports it, you can create an indexed view and full-text search that; you can use containstable(*, '"chris"') to search across all the columns.
If it doesn't support it, then since the fields all come from different tables, I think for scalability you should, if you can, populate the fields into a single row per record in a separate table and full-text search that rather than the individual tables. You will end up with a less complex FTS catalog, and your queries will not need to run four full-text searches at a time. Running lots of separate FTS queries over different tables at the same time is a ticket to query performance issues, in my experience. The downside of doing this is that you lose the ability to search for surname on its own; if that is something you need, you might have to look at an alternative.
In our app we found that the single table was quicker (we can't rely on customers having enterprise SQL Server at hand), so we populate the space-concatenated data into an FTS table through an update stored procedure, and our main contact lookup then runs a search over that list. We have two separate searches to handle finding things with precision (i.e. names or phone numbers) or just as free text. The other nice thing about the table is that it is relatively easy and low-cost to add further columns to the lookup (we have been asked for social security number, for example; to do it, we just added the column to the update SP and we were away with little or no impact).
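A sketch of that single-lookup-table approach in T-SQL (table, column and catalog names are assumptions):

-- One row per person; SearchText holds name, emails and tags,
-- space-concatenated by the update stored procedure.
CREATE TABLE PersonSearch (
    PersonID   INT NOT NULL CONSTRAINT PK_PersonSearch PRIMARY KEY,
    SearchText NVARCHAR(MAX) NOT NULL
);

CREATE FULLTEXT CATALOG PersonSearchCatalog;

CREATE FULLTEXT INDEX ON PersonSearch (SearchText)
    KEY INDEX PK_PersonSearch ON PersonSearchCatalog;

-- One FTS query covers name, email and tags at once.
SELECT p.PersonID, ft.RANK
FROM CONTAINSTABLE(PersonSearch, SearchText, '"chris"') AS ft
JOIN PersonSearch AS p ON p.PersonID = ft.[KEY];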
One possibility is to make a view which has these columns: PersonID, ContentType, Content. ContentType would be something like "Email", "PhoneNumber", etc., and Content would hold the value. You'd search on the Content column and still be able to see which person it belongs to. I'm not 100% sure how full-text search works with views, though, so I don't know whether you could use it there.
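A sketch of such a view (table and column names are assumptions):

CREATE VIEW PersonContent AS
SELECT PersonID, 'Name'  AS ContentType, FullName AS Content FROM Person
UNION ALL
SELECT PersonID, 'Email' AS ContentType, Address  AS Content FROM ContactMethod
UNION ALL
SELECT PersonID, 'Tag'   AS ContentType, TagName  AS Content FROM Tag;

The catch is that SQL Server full-text indexes can only be created on tables or indexed views, and an indexed view cannot contain UNION ALL, so in practice this view would have to be materialized into a table before it could be full-text indexed.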
FTS can search multiple fields out of the box: the CONTAINS predicate accepts a list of columns to search, as does CONTAINSTABLE.
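For example (column names are assumptions; the columns must be part of the full-text index):

SELECT PersonID
FROM Person
WHERE CONTAINS((FirstName, LastName), 'chris');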
