Indexing a huge table in Apache Solr Cloud

I have a Cassandra table with 9 million records; the data size is 500 MB. I have a SolrCloud cluster with 3 nodes (3 shards, 2 replicas) and an external three-node ZooKeeper ensemble. My Cassandra is a single-node cluster. I am trying to index this table using Apache Solr, but my query times out as soon as I start the full import.
I can connect with cqlsh and fetch records, but indexing fails.
Here is the relevant part of my solr.log:
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: SELECT * from counter.series Processing Document # 1
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:318)
at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:279)
at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:54)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
... 5 more
Caused by: java.sql.SQLTransientConnectionException: TimedOutException()
at org.apache.cassandra.cql.jdbc.CassandraStatement.doExecute(CassandraStatement.java:189)
at org.apache.cassandra.cql.jdbc.CassandraStatement.execute(CassandraStatement.java:205)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.executeStatement(JdbcDataSource.java:338)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:313)
... 12 more
Caused by: TimedOutException()
at org.apache.cassandra.thrift.Cassandra$execute_cql3_query_result.read(Cassandra.java:37865)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_cql3_query(Cassandra.java:1562)
at org.apache.cassandra.thrift.Cassandra$Client.execute_cql3_query(Cassandra.java:1547)
at org.apache.cassandra.cql.jdbc.CassandraConnection.execute(CassandraConnection.java:468)
at org.apache.cassandra.cql.jdbc.CassandraConnection.execute(CassandraConnection.java:494)
at org.apache.cassandra.cql.jdbc.CassandraStatement.doExecute(CassandraStatement.java:164)
... 15 more
I would like help indexing this table either in batches or with multiple threads. Any help or suggestion is welcome.
db-data-config.xml:
<dataConfig>
  <dataSource type="JdbcDataSource" driver="org.apache.cassandra.cql.jdbc.CassandraDriver" url="jdbc:cassandra://192.168.0.7:9160/counter" user="cassandra" password="cassandra" autoCommit="true" />
  <document>
    <entity name="counter" query="SELECT * from counter.series;" autoCommit="true">
      <field column="serial" name="serial" />
      <field column="random" name="random" />
      <field column="remarks" name="remarks" />
      <field column="timestamp" name="timestamp" />
    </entity>
  </document>
</dataConfig>
solrconfig.xml
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>
schema.xml
<field name="remarks" type="string" indexed="false" stored="false" required="false" />
<field name="serial" type="string" indexed="true" stored="true" required="true" />
<field name="random" type="string" indexed="false" stored="true" required="true" />
<field name="timestamp" type="string" indexed="false" stored="false" required="false" />

The problem is most likely the size of the data payload fetched per batch. When no batchSize is specified on a JdbcDataSource, it defaults to 500, which appears to be too much in your case. Use a smaller number, or increase the relevant timeout settings.
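As a sketch of that first option, based on the data-config in the question (the value 100 is only an assumed starting point to tune against your timeout, not a recommendation):

```xml
<dataSource type="JdbcDataSource"
            driver="org.apache.cassandra.cql.jdbc.CassandraDriver"
            url="jdbc:cassandra://192.168.0.7:9160/counter"
            user="cassandra" password="cassandra"
            autoCommit="true"
            batchSize="100" />
```

Note that a batchSize of -1 is a MySQL-driver trick that switches to row-by-row streaming; whether the Cassandra JDBC driver honors it is an assumption you would need to verify.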

Related

Indexing joined records in Solr

I am new to Solr and stuck on something basic (I think), which is probably a lack of understanding on my part. I've read the documentation on DIH and spent a lot of time searching for this issue without finding a solution.
My use case is a messaging/email system, where users can message each other and start a thread, to which they can reply (so it's more like email than direct messages on a user base).
The question is simple: I have one table, threads, that is the base for this and contains searchable data like user info and subject. Joined to that is the emails table, with the html column searchable.
When I run the collection below in Solr and do a search, it will only pick up a single email per thread and search that, as opposed to what I'm hoping for: getting all emails belonging to that thread. So if I have 10 threads but 100 messages, it says Fetched: 100 but Processed: 10.
How do I get Solr to index all of this content properly and allow a search on it? In this particular use case, I have also created a reversed example, getting messages first, then the thread each belongs to, and then de-duplicating the results (which works to some extent), but the next step is that there is also a left join for email attachments. So I am looking for a solution with this setup.
Using Solr 6.6
<dataConfig>
<dataSource name="ds-db" type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="${dataimporter.request.url}"
user="${dataimporter.request.user}"
password="${dataimporter.request.password}"/>
<document name="threads">
<entity name="thread" dataSource="ds-db"
query="
SELECT threads.id
, threads.user_id
, threads.subject
, users.first_name
, users.last_name
, users.email
FROM threads
LEFT JOIN users ON users.user_id=threads.user_id
">
<field column="id" name="thread_id"/>
<field column="user_id" name="user_id"/>
<field column="subject" name="subject"/>
<field column="first_name" name="first_name"/>
<field column="last_name" name="last_name"/>
<field column="email" name="email"/>
<entity name="message" dataSource="ds-db" transformer="HTMLStripTransformer"
query="
SELECT id
, html
FROM emails
WHERE thread_id = ${thread.id}
">
<field column="id" name="id"/>
<field column="html" name="html" stripHTML="true"/>
</entity>
</entity>
</document>
</dataConfig>
managed-schema
<schema name="example-data-driven-schema" version="1.6">
...
<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="thread_id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="first_name" type="string_lowercase" indexed="true" stored="true"/>
<field name="last_name" type="string_lowercase" indexed="true" stored="true"/>
<field name="email" type="string_lowercase" indexed="true" stored="true"/>
<field name="subject" type="string_lowercase" indexed="true" stored="true"/>
<field name="html" type="string_lowercase" indexed="true" stored="true"/>
...
<copyField source="first_name" dest="_text_"/>
<copyField source="last_name" dest="_text_"/>
<copyField source="email" dest="_text_"/>
<copyField source="subject" dest="_text_"/>
<copyField source="html" dest="_text_"/>
...
</schema>
If you want all the emails indexed into a single document, the field that holds them has to be declared multiValued="true"; otherwise only one of the dependent (child) entities will be indexed.
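A minimal sketch of that change in the managed-schema from the question (assuming the html field is the one that should collect every message of a thread):

```xml
<field name="html" type="string_lowercase" multiValued="true" indexed="true" stored="true"/>
```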

Query multiple collections with different fields in solr

Given the following (single-core) queries:
http://localhost/solr/a/select?indent=true&q=*:*&rows=100&start=0&wt=json
http://localhost/solr/b/select?indent=true&q=*:*&rows=100&start=0&wt=json
The first query returns "numFound":40000
The second query returns "numFound":10000
I tried putting these together by:
http://localhost/solr/a/select?indent=true&shards=localhost/solr/a,localhost/solr/b&q=*:*&rows=100&start=0&wt=json
Now I get "numFound":50000.
The only problem is that "a" has more fields than "b", so the combined request only returns the values of "a".
Is it possible to query multiple collections with different fields, or do they have to be the same? And how should I change my third URL to get this result?
What you need is what I call a unification core. That core's schema itself holds no content; it is only used as a wrapper to unify the fields you want to display from both cores. In it you will need
a schema.xml that wraps up all the fields that you want to have in your unified result
a query handler that combines the two different cores for you
An important restriction beforehand, taken from the Solr wiki page about DistributedSearch:
Documents must have a unique key, and the unique key must be stored (stored="true" in schema.xml). The unique key must be unique across all shards. If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic.
As an example, I have shard-1 with the fields id, title, description and shard-2 with the fields id, title, abstractText. So I have these schemas:
schema of shard-1
<schema name="shard-1" version="1.5">
<fields>
<field name="id"
type="int" indexed="true" stored="true" multiValued="false" />
<field name="title"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="description"
type="text" indexed="true" stored="true" multiValued="false" />
</fields>
<!-- type definition left out, have a look in github -->
</schema>
schema of shard-2
<schema name="shard-2" version="1.5">
<fields>
<field name="id"
type="int" indexed="true" stored="true" multiValued="false" />
<field name="title"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="abstractText"
type="text" indexed="true" stored="true" multiValued="false" />
</fields>
<!-- type definition left out, have a look in github -->
</schema>
To unify these schemas I create a third schema that I call shard-unification, which contains all four fields.
<schema name="shard-unification" version="1.5">
<fields>
<field name="id"
type="int" indexed="true" stored="true" multiValued="false" />
<field name="title"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="abstractText"
type="text" indexed="true" stored="true" multiValued="false" />
<field name="description"
type="text" indexed="true" stored="true" multiValued="false" />
</fields>
<!-- type definition left out, have a look in github -->
</schema>
Now I need to make use of this combined schema, so I create a query handler in the solrconfig.xml of the shard-unification core:
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="q.alt">*:*</str>
<str name="qf">id title description abstractText</str>
<str name="fl">*,score</str>
<str name="mm">100%</str>
</lst>
</requestHandler>
<queryParser name="edismax" class="org.apache.solr.search.ExtendedDismaxQParserPlugin" />
That's it. Now some index data is required in shard-1 and shard-2. To query for a unified result, just query shard-unification with an appropriate shards param.
http://localhost/solr/shard-unification/select?q=*:*&rows=100&start=0&wt=json&shards=localhost/solr/shard-1,localhost/solr/shard-2
This will return a result like:
{
"responseHeader":{
"status":0,
"QTime":10},
"response":{"numFound":2,"start":0,"maxScore":1.0,"docs":[
{
"id":1,
"title":"title 1",
"description":"description 1",
"score":1.0},
{
"id":2,
"title":"title 2",
"abstractText":"abstract 2",
"score":1.0}]
}}
Fetch the origin shard of a document
If you want to fetch the originating shard into each document, you just need to add [shard] to fl, either as a parameter on the query or within the request handler's defaults (see below). The brackets are mandatory, and they will also appear in the resulting response.
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="q.alt">*:*</str>
<str name="qf">id title description abstractText</str>
<str name="fl">*,score,[shard]</str>
<str name="mm">100%</str>
</lst>
</requestHandler>
<queryParser name="edismax" class="org.apache.solr.search.ExtendedDismaxQParserPlugin" />
Working Sample
If you want to see a running example, check out my solrsample project on GitHub and execute the ShardUnificationTest. I have also included the shard-fetching by now.
Shards should be used in Solr "when an index becomes too large to fit on a single system, or when a single query takes too long to execute", so the number and names of the columns should always be the same. This is specified in this document (where the previous quote also comes from):
http://wiki.apache.org/solr/DistributedSearch
If you leave your query as it is and give the two shards the same fields, this should just work as expected.
If you want more info about how shards work in SolrCloud, have a look at this document also:
http://wiki.apache.org/solr/SolrCloud
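One way to satisfy that restriction, sketched here as an assumption about your setup (the field names are hypothetical), is simply to declare the missing fields in each core so both schemas become identical; fields a core never populates just stay empty:

```xml
<!-- added to the schema of core "b", which lacks these columns that core "a" has -->
<field name="extra_column_1" type="string" indexed="true" stored="true"/>
<field name="extra_column_2" type="string" indexed="true" stored="true"/>
```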

Tika fetches the binary content stored in database but does not index it

I am trying to parse binary content stored in the database table document_attachment, in the column file_data, and to index it so that its content becomes searchable through Solr.
When I run the indexer it fetches twice as many rows as the query in the entity named "dcs" returns, and throws no errors or exceptions. It does not, however, index the binary content (the field I associate with Tika), despite fetching it from the table.
I am using apache-solr-3.6.1 and Tika 1.0
My configuration files look something like :
data-config.xml
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
<dataSource
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/espritkm_1?zeroDateTimeBehavior=convertToNull"
user="root"
password=""
autoCommit="true" batchSize="-1"
convertType="false"
name="test"
/>
<dataSource name="fieldReader" type="FieldStreamDataSource" />
<document name="items">
<entity name="dcs"
query="SELECT 222000000000000000+d.id AS common_id_attr,d.id AS id,UNIX_TIMESTAMP(d.created_at) AS date_added,d.file_name as common1, d.description as common2, d.file_mime_type as common3, 72 as common4,(Select group_concat(trim(tags) ORDER BY trim(tags) SEPARATOR ' | ') from tags t where t.type_id = 72 AND t.feature_id = d.id group by t.feature_id) as common5,d.created_by as common6, df.name as common7,CONCAT(d.file_name,'.',d.file_mime_type) as common8,'' as common9,(Select da.file_data from document_attachment da where da.document_id = d.id) as text FROM document d LEFT JOIN document_folder df ON df.id = d.document_folder_id WHERE d.is_deleted = 0 and d.parent_id = 0 " dataSource="test" transformer="TemplateTransformer">
<field column="common_id_attr" name="common_id_attr" />
<field column="id" name="id" />
<entity dataSource="fieldReader" processor="TikaEntityProcessor" dataField="dcs.text" format="text" pk="dcs.id" >
<field column="text" name="text" />
</entity>
</entity>
</document>
</dataConfig>
schema.xml
<schema>
<fields>
<field name="common_id_attr" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="id" type="string" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="true" multiValued="true"/>
</fields>
<uniqueKey>common_id_attr</uniqueKey>
<solrQueryParser defaultOperator="OR"/>
<defaultSearchField>text</defaultSearchField>
</schema>
The number of rows it fetches is double the number of documents, counting the Tika rows separately (I don't understand why). It does not index the binary content.
I have been stuck on this problem for a long time. Can someone please help?
I was able to index the documents using Apache Solr version 3.6.2. I have described the steps here:
http://tuxdna.wordpress.com/2013/02/04/indexing-the-documents-stored-in-a-database-using-apache-solr-and-apache-tika/
I think it should be doable in 3.6.1 as well; I simply did not search long for a 3.6.1 tarball, since only 3.6.2 was available from the official site.
I hope that helps.
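For reference, the essential shape of a DIH setup for BLOB columns is a FieldStreamDataSource feeding a child entity with a TikaEntityProcessor, whose dataField points at the parent entity's blob column. A sketch adapted from the config in the question (not a verified fix, just the pattern the linked post describes):

```xml
<dataSource name="fieldReader" type="FieldStreamDataSource"/>
<!-- child of the "dcs" entity; "dcs.text" must name the column that
     holds the binary data in the parent entity's SELECT -->
<entity dataSource="fieldReader"
        processor="TikaEntityProcessor"
        dataField="dcs.text"
        format="text">
  <!-- "text" here is the Tika-extracted plain text, mapped to the schema's text field -->
  <field column="text" name="text"/>
</entity>
```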

solr indexes documents but does not search in them

I am a novice with Solr and I was trying the example that comes in the example folder of the Solr (3.6) package (apache-solr-3.6.0.tgz). I started the server and posted the sample XML files in example/exampledocs; then I could search for stuff, Solr would return matches, and it was all good.
But then I posted another XML file with more than 10,000 documents. I modified the example/solr/conf/schema.xml file to add the fields of my XML file, then restarted the server and posted the file. The statistics in the Solr admin panel (http://localhost:8983/solr/admin/stats.jsp) show numDocs: 10020, which means the documents were successfully posted. But when I search for anything present in my posted documents (from the 10,000-document XML file), it returns 0 results. Solr is still able to return results for searches that match content in the documents that come by default in the example/exampledocs folder. I am clueless about what has happened here; the value of numDocs clearly suggests that the documents I posted were indexed.
Is there anything else I can inspect to see what's wrong?
The schema which comes in the example with the Solr package is like this
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
<field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="features" type="text_en_splitting" indexed="true" stored="true" multiValued="true"/>
<field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
<field name="weight" type="float" indexed="true" stored="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="popularity" type="int" indexed="true" stored="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="true"/>
and more....
The schema of the XML file I posted had some fields in common with the above schema, like title, description, and price, so I added the remaining fields to schema.xml like this:
<field name="cid" type="int" indexed="false" stored="false"/>
<field name="discount" type="float" indexed="true" stored="true"/>
<field name="link" type="string" indexed="true" stored="true"/>
<field name="status" type="string" indexed="true" stored="true"/>
<field name="pubDate" type="string" indexed="true" stored="true"/>
<field name="image" type="string" indexed="false" stored="false"/>
If you are using the default settings from the Solr example setup, then by virtue of the df setting in the solrconfig.xml file for the /select request handler, the default search field is set to the text field.
<requestHandler name="/select" class="solr.SearchHandler">
<!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
-->
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">text</str>
</lst>
....
</requestHandler>
If you look in the schema.xml file just below the field definitions, you will see multiple copyField settings that move the values of certain fields into the text field, thereby making them searchable via the default field setting. In your example of searching for Sony in the title field, if you look at the copyField statements you will see that the title field is not copied to the text default search field. Therefore the documents with the Sony title value are not returned by your query.
I would suggest the following:
Try a query specifying title:Sony; that should return what you are expecting.
If you want the title field to be included in the default query field, then add the following copyField statement to the schema.xml file and reload your 10,000-document file:
<copyField source="title" dest="text"/>
I hope this helps.

solr dataimport error: Indexing failed. Rolled back all changes

When I run the "Full import with cleaning" command, the error is "Indexing failed. Rolled back all changes".
My dataimport config file:
<dataConfig>
<dataSource type="JdbcDataSource" name="ds-1" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://my.ip/my_db" user="my_db_user" password="my_password" readOnly="True"/>
<document>
<entity name="videos" pk="ID" transformer="TemplateTransformer" dataSource="ds-1"
query="SELECT * FROM videos LIMIT 100">
<field column="id" name="unid" indexed="true" stored="true" />
<field column="title" name="baslik" indexed="true" stored="true" />
<field column="video_img" name="img" indexed="true" stored="true" />
</entity>
</document>
</dataConfig>
I kept receiving the same error message at some point. For me, the cause was one of the following:
a bad connection string
a bad driver (com.mysql.jdbc.Driver)
a bad query
a bad mapping of columns to Solr fields (I think this might be your problem too)
Make sure the column names in the database match, case-sensitively, the field names in Solr. If they don't, rename the columns in the query:
select id as uniqueid, title as Title
or using the field element in the entity you defined like this:
<field column="ID" name="id" />
You are using the field element incorrectly. See here for how to use this element: http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml
If you can share other relevant data and logs we can give you more specific information.
Good luck.
