Solr Cloud: How to disable document (pdf, office) metadata as fields

I am new to Solr, running Solr 7.3.1 in SolrCloud mode, and I am trying to index PDF and Office documents using Solr's content extraction (Solr Cell).
I created a collection with
bin\solr create -c tsindex -s 2 -rf 2
In SolrJ my code looks like this:
public static void main(String[] args) {
    System.out.println("Solr Indexer");
    final String solrUrl = "http://localhost:8983/solr/tsindex/";
    HttpSolrClient solr = new HttpSolrClient.Builder(solrUrl).build();
    String filename = "C:\\iSampleDocs\\doc-file.doc";
    ContentStreamUpdateRequest solrRequest = new ContentStreamUpdateRequest("/update/extract");
    try {
        solrRequest.addFile(new File(filename), "application/msword");
        solrRequest.setParam("litral.ts_ref", "ts-456123");
        //solrRequest.setParam("defaultField", "text");
        solrRequest.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        NamedList<Object> result = solr.request(solrRequest);
        System.out.println(result);
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SolrServerException e) {
        e.printStackTrace();
    }
}
I am running into multiple issues:
Although I have created the field ts_ref as text_general in the Solr Admin UI, this field does not get set at all.
My goal is to index the complete document, including its metadata, in one field, and then set a couple more fields referencing the document in another system (e.g. the ts_ref field). What actually happens is that Solr extracts the metadata of the files and creates a separate field for each metadata value.
I have tried disabling the data-driven schema functionality with bin\solr config -c tsindex -zkHost localhost:9983 -property update.autoCreateFields -value false
When the line solrRequest.setParam("defaultField", "text"); is uncommented from the beginning, no separate fields are created for the extracted metadata; but as soon as I comment this line out and upload the files, the metadata ends up in separate fields again (even if I uncomment it again afterwards).

"litral.ts_ref" there is a typo here, missing an e
you can achieve ignoring all metadata fields by using uprefix field, and a dynamic field that goes with it. See the doc that shows exactly that case.
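For example, a minimal sketch of that approach (the ignored_ prefix is an assumption; an "ignored" field type ships with the default configsets):
solrRequest.setParam("uprefix", "ignored_");         // any extracted field with no schema match becomes ignored_*
solrRequest.setParam("fmap.content", "text");        // map the extracted document body to the "text" field
solrRequest.setParam("literal.ts_ref", "ts-456123"); // note: "literal", not "litral"
together with a catch-all dynamic field in the schema that silently drops those fields:
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>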


SOLRJ giving me strange error when trying to add a pdf to a new core. "You must type correct path"

I am starting to update an ancient Solr app to 9.1, along with its SolrJ indexer. When I try to add a document, I get:
Exception in thread "main" org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://my.host:8983/solr/qmap:
Searching for Solr
You must type the correct path
Solr will respond
I can see the qmap core in the Solr admin UI, and Solr is running.
The code is:
public class DocumentIndexer {

    private final String fileToIndex;
    private final ConcurrentUpdateHttp2SolrClient solrClient;
    private final Http2SolrClient http2Client;

    public DocumentIndexer(String solrUrl, String fileToIndex) {
        this.fileToIndex = fileToIndex;
        http2Client = new Http2SolrClient.Builder().build();
        solrClient = new ConcurrentUpdateHttp2SolrClient.Builder(solrUrl, http2Client).build();
    }

    public void indexDocuments() throws IOException, SolrServerException {
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        req.addFile(new File(fileToIndex), "application/xml");
        req.setParam("id", fileToIndex);
        req.process(solrClient);
        solrClient.commit(true, true);
    }
}
Simple enough: /update/extract was not defined in the solrconfig. Recreating the core using the sample_techproducts_configs configset as a template supplies this; alternatively, set up solrconfig.xml with the /update/extract path defined.
Also, req.setParam("id", fileToIndex) needs to be changed to req.setParam("literal.id", fileToIndex).
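For reference, a sketch of the relevant solrconfig.xml entries (the <lib> path is an assumption for a Solr 9.x install, where extraction lives under modules/; adjust it to your layout):
<lib dir="${solr.install.dir:../../../..}/modules/extraction/lib" regex=".*\.jar"/>
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>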

trouble in solr connect java

I am trying to connect to Solr 6.5.0 from Java. I have added the following .jar files to the library:
commons-io-2.5
httpclient-4.4.1
httpcore-4.4.1
httpmime-4.4.1
jcl-over-slf4j-1.7.7
noggit-0.6
slf4j-api-1.7.7
stax2-api-3.1.4
woodstox-core-asl-4.4.1
zookeeper-3.4.6
solr-solrj-6.5.0
but when I try to use the following code to connect to Solr:
import org.apache.http.impl.bootstrap.HttpServer;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class SolrQuery {
    public static void main(String[] args) throws SolrServerException {
        HttpSolrServer solr = new HttpServer("http://localhost:8983/solr/collection1");
        SolrQuery query = new SolrQuery();
        query.setQuery("*");
        QueryResponse response = solr.query(query);
        SolrDocumentList results = response.getResults();
        for (int i = 0; i < results.size(); ++i) {
            System.out.println(results.get(i));
        }
    }
}
Before compiling it, I get errors on these lines:
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.SolrQuery;
HttpSolrServer solr = new HttpServer("http://localhost:8983/solr/collection1");
Can anyone help me solve this?
The piece of code in your question was written for an old version of Solr, before 5.0. You'll find many sources and examples around that were written for old Solr versions, but in most cases all you have to do is replace the old SolrServer class with the new (and now correct) SolrClient class.
Both are representations of the Solr instance you want to use.
Read the Solr Documentation - Using SolrJ.
I also strongly suggest not giving your classes the same name as an already existing class (in your example, your class is named SolrQuery, which collides with org.apache.solr.client.solrj.SolrQuery).
The catch-all string for Solr queries is *:*, which means: match any value in all available fields. So change the query.setQuery statement to:
query.setQuery("*:*");
I suppose you're using a Solr client for a standalone instance, so, as you're already aware, the correct way to instantiate a SolrClient is:
String urlString = "http://localhost:8983/solr/gettingstarted";
SolrClient solr = new HttpSolrClient.Builder(urlString).build();
And here is an easier way I suggest to iterate through all the returned documents:
for (SolrDocument doc : response.getResults()) {
    System.out.println(doc);
}
Have a look at the documentation of the SolrDocument class, which explains how to use it and how to read field values correctly.
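For example, reading a single field value (the field name here is just an assumption):
String name = (String) doc.getFieldValue("name");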
I found that I needed to add a .jar file that is not contained in the /dist library, named slf4j-simple-1.7.25, and also that
HttpSolrServer solr = new HttpServer("http://localhost:8983/solr/gettingstarted");
SolrQuery query = new SolrQuery();
needed to be changed to
String urlString = "http://localhost:8983/solr/gettingstarted";
SolrClient solr = new HttpSolrClient.Builder(urlString).build();
After that, it finally runs!
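Putting those fixes together, here is a minimal runnable sketch (the class is renamed to avoid the collision with org.apache.solr.client.solrj.SolrQuery; the gettingstarted core name is an assumption):
import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrQueryExample {
    public static void main(String[] args) throws SolrServerException, IOException {
        String urlString = "http://localhost:8983/solr/gettingstarted";
        SolrClient solr = new HttpSolrClient.Builder(urlString).build();

        SolrQuery query = new SolrQuery();
        query.setQuery("*:*"); // match all documents

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc);
        }
        solr.close(); // SolrClient is Closeable
    }
}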

How to index data in a specific shard using solrj

I am using SolrJ as a client to index documents into SolrCloud (using Solr 4.5).
I have a requirement to save documents based on tenant_id, so I am trying to do document routing, which is possible only if the collection was created with the numShards parameter (http://searchhub.org/2013/06/13/solr-cloud-document-routing/).
I have two instances of Solr in SolrCloud (example1/solr and example2/solr) and an external ZooKeeper running on port 2181.
Both instances contain a collection called collection1.
I created one more collection called newCollection (with two shards and two replicas) using
http://localhost:8501/solr/admin/collections?action=CREATE&name=newCollection&numShards=2&replicationFactor=2&maxShardsPerNode=2&router.field=id
So in example1/solr-> I have newCollection_shard1_replica1 & newCollection_shard2_replica1,
In example2/solr -> I have newCollection_shard1_replica2 & newCollection_shard2_replica2
I copied example1/solr/collection1/conf to all shards and replicas
I restarted the ZooKeeper server as well as the Solr instances:
zookeeper->zkServer.cmd
example1/solr-> java -Dbootstrap_confdir=./solr/newCollection_shard1_replica1/conf -Dcollection.configName=myconf -DzkHost=localhost:2181 -jar start.jar
example2/solr->java -DzkHost=localhost:2181 -jar start.jar
(Both instances are running on different ports: one at 8081 and the other at 8051.)
I am using the SolrJ client to index documents.
Here is my sample code:
String url = "http://localhost:8081/solr";
ConcurrentUpdateSolrServer solrServer = new ConcurrentUpdateSolrServer(url, 10000, 4);
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "shard1!513");
doc.addField("name", "Santhosh");
solrServer.add(doc);
solrServer.commit();
But it saves the document in collection1 with id shard1!513. Are any configuration changes required in solrconfig.xml? (I am using the default solrconfig.xml that came with Solr 4.5.)
How do I save documents in my newCollection, and how do I do document routing?
Please help me out with this issue.
Thanks!
You can use CloudSolrServer and UpdateRequest:
SolrServer solrServer = new CloudSolrServer(zkHost); // zkHost is your Solr ZooKeeper host string
SolrInputDocument doc = new SolrInputDocument();
UpdateRequest add = new UpdateRequest();
add.add(doc);
add.setParam("collection", "newCollection");
add.process(solrServer);
UpdateRequest commit = new UpdateRequest();
commit.setAction(UpdateRequest.ACTION.COMMIT, true, true);
commit.setParam("collection", "newCollection");
commit.process(solrServer);
I appended the core name of the new collection to the URL, so it is working fine now.
Instead of:
String url = "http://localhost:8081/solr";
I used:
String url = "http://localhost:8081/solr/newCollection_shard1_replica1";
ConcurrentUpdateSolrServer solrServer = new ConcurrentUpdateSolrServer(url, 10000, 4);
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "shard1!513");
doc.addField("name", "Santhosh");
solrServer.add(doc);
solrServer.commit();
You should use CloudSolrServer: http://lucene.apache.org/solr/4_2_1/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrServer.html
In SolrCloud, updates must be routed via ZooKeeper, because ZooKeeper knows the status of the leaders in the cloud. One more thing: you need not append the collection name to the URL; just use the setDefaultCollection(collectionName) method of CloudSolrServer to send your updates to the 'collectionName' collection.
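A minimal sketch of that approach against the Solr 4.x API (the ZooKeeper address is an assumption; CloudSolrServer was replaced by CloudSolrClient in later releases):
String zkHost = "localhost:2181"; // your external ZooKeeper
CloudSolrServer solrServer = new CloudSolrServer(zkHost);
solrServer.setDefaultCollection("newCollection");

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "shard1!513"); // composite-id routing: "shardKey!docId"
doc.addField("name", "Santhosh");
solrServer.add(doc);
solrServer.commit();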

Solr - Multiple attachments under one Data Import Handler record

I'm using the Data Import Handler (DIH) to create documents in Solr. Each document will have zero or more attachments. The content of the attachments (e.g. PDFs, Word docs, etc.) is parsed (via Tika) and stored along with a path to each attachment. The attachments' content and paths are not stored in the database (and I prefer not to do that).
I currently have a schema with all the fields needed by DIH. I then also added an attachmentContent and an attachmentPath field, both multiValued. However, when I use SolrJ to add the documents, only one attachment (the last one added) is stored and indexed by Solr. Here's the code:
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
up.setParam("literal.id", id);
for (MultipartFile file : files) {
    // skip over files where the client didn't provide a filename
    if (file.getOriginalFilename().equals("")) {
        continue;
    }
    File destFile = new File(destPath, file.getOriginalFilename());
    try {
        file.transferTo(destFile);
        up.setParam("literal.attachmentPath", documentWebPath + acquisition.getId() + "/" + file.getOriginalFilename());
        up.addFile(destFile);
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
}
try {
    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    solrServer.request(up);
} catch (SolrServerException sse) {
    sse.printStackTrace();
} catch (IOException ioe) {
    ioe.printStackTrace();
}
How can I get multiple attachments (content and paths) to be stored by Solr? Or is there a better way to accomplish this?
Solr has a limitation of indexing only one document (content stream) per request with this API.
If you want multiple documents indexed together, you can bundle them into a zip file (and apply the relevant patch) and have that indexed.
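As an alternative sketch of my own (not part of the answer above; the destFiles list is an assumption, and the exact shape of the extract response is worth verifying): run each attachment through /update/extract with extractOnly=true, collect the extracted text client-side, and send a single document whose multiValued fields hold every attachment:
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", id);
for (File destFile : destFiles) {
    ContentStreamUpdateRequest extract = new ContentStreamUpdateRequest("/update/extract");
    extract.setParam("extractOnly", "true"); // return the extracted text instead of indexing it
    extract.addFile(destFile);
    NamedList<Object> result = solrServer.request(extract);
    // the extracted body is keyed by the content stream (file) name
    String content = (String) result.get(destFile.getName());
    doc.addField("attachmentContent", content); // multiValued fields accumulate values
    doc.addField("attachmentPath", destFile.getPath());
}
solrServer.add(doc);
solrServer.commit();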

Logback dbAppender Custom SQL

Is there a way to change the tables that logback writes its data to when using DBAppender? It has three default tables that must be created before using DBAppender, but I want to customise it to write to one table of my choosing, similar to Log4J, where I can specify the SQL that gets executed when inserting the log entry into the database.
Tomasz, maybe I'm missing something, but I don't see how just using a custom DBNameResolver answers what Magezy asked. DBNameResolver is used by DBAppender via SQLBuilder to construct three SQL insert queries; via DBNameResolver one can only affect the names of the tables and columns where data will be inserted, but cannot limit inserting to just one table, not to mention that by just implementing DBNameResolver there is no way to control what actually gets inserted.
To match log4j's JDBCAppender, IMO one has to extend logback's DBAppender or DBAppenderBase, or maybe even implement a completely new custom Appender.
The easiest way for me was to make an appender from scratch. I'm appending to a single table, using Spring JDBC. It works something like this:
public class MyAppender extends AppenderBase<ILoggingEvent>
{
    private String _jndiLocation;
    private JdbcTemplate _jt;

    public void setJndiLocation(String jndiLocation)
    {
        _jndiLocation = jndiLocation;
    }

    @Override
    public void start()
    {
        super.start();
        if (_jndiLocation == null)
        {
            throw new IllegalStateException("Must have the JNDI location");
        }
        DataSource ds;
        Context ctx;
        try
        {
            ctx = new InitialContext();
            Object obj = ctx.lookup(_jndiLocation);
            ds = (DataSource) obj;
            if (ds == null)
            {
                throw new IllegalStateException("Failed to obtain data source");
            }
            _jt = new JdbcTemplate(ds);
        }
        catch (Exception ex)
        {
            throw new IllegalStateException("Unable to obtain data source", ex);
        }
    }

    @Override
    protected void append(ILoggingEvent e)
    {
        // log to database here using my JdbcTemplate instance
    }
}
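For completeness, a hedged sketch of what append() might do with that JdbcTemplate (the table and column names are purely assumptions):
@Override
protected void append(ILoggingEvent e)
{
    // hypothetical single-table schema: app_log(logged_at, level, logger, message)
    _jt.update(
        "INSERT INTO app_log (logged_at, level, logger, message) VALUES (?, ?, ?, ?)",
        new java.util.Date(e.getTimeStamp()),
        e.getLevel().toString(),
        e.getLoggerName(),
        e.getFormattedMessage());
}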
I ran into trouble with SLF4J - the substitute logger error described here:
http://www.slf4j.org/codes.html#substituteLogger
This thread on multi-step configuration enabled me to work around that issue.
You need to implement ch.qos.logback.classic.db.names.DBNameResolver and use it in the configuration:
<appender name="DB" class="ch.qos.logback.classic.db.DBAppender">
<dbNameResolver class="com.example.MyDBNameResolver"/>
<!-- ... -->
</appender>
<appender name="CUSTOM_DB_APPENDER" class="com.....MyDbAppender">
<filter class="com......MyFilter"/>
<param name="jndiLocation" value="java:/comp/env/jdbc/....MyPath"/>
</appender>
And your java MyDbAppender should have a string jndiLocation with setter.
Now do a jndi lookup (see the solution answered in Oct 17 '11 at 16:03)
