Lily with Morphline and HBase - solr

I'm trying to follow a tutorial from Cloudera. (http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/search_hbase_batch_indexer.html)
I have code that inserts objects in Avro format into HBase, and I want to index them in Solr, but nothing shows up.
I have been taking a look at the logs:
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={0Name178721/data:avroUser/1434094131495/Put/vlen=237/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
15/06/12 00:45:00 DEBUG indexer.Indexer$RowBasedIndexer: Indexer _default_ will send to Solr 0 adds and 0 deletes
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeNotify: {lifecycle=[START_SESSION]}
15/06/12 00:45:00 TRACE morphline.ExtractHBaseCellsBuilder$ExtractHBaseCells: beforeProcess: {_attachment_body=[keyvalues={1Name134339/data:avroUser/1434094131495/Put/vlen=237/seqid=0}], _attachment_mimetype=[application/java-hbase-result]}
So the rows are being read, but I don't know why nothing gets indexed in Solr.
I guess that my morphline.conf is wrong.
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**", "com.ngdata.**"]

    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "data:avroUser"
              outputField : "_attachment_body"
              type : "byte[]"
              source : value
            }
          ]
        }
      }

      # for avro use with type : "byte[]" in extractHBaseCells mapping above
      { readAvroContainer {} }

      {
        extractAvroPaths {
          flatten : true
          paths : {
            name : /name
          }
        }
      }

      { logTrace { format : "output record: {}", args : ["#{}"] } }
    ]
  }
]
I wasn't sure whether I had to have an "_attachment_body" field in Solr, but it seems that it isn't necessary, so I guess that readAvroContainer or extractAvroPaths is wrong.
I have a "name" field in Solr, and my avroUser has a "name" field as well.
{"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}

I have all of this working well here.
These are the steps I followed:
1) Install hbase-solr-indexer as a service:
First of all, you have to install hbase-solr-indexer.
installing hbase-solr-indexing as a service
Add the Cloudera repos to your yum repos for this.
After that, type:
sudo yum install hbase-solr-indexer
2) Create the morphline files:
OK, you already did that.
3) Set the replication scope for every column family and register an hbase-indexer configuration
Using the Lily HBase NRT Indexer Service
$ hbase shell
hbase shell> disable 'record'
hbase shell> alter 'record', {NAME => 'data', REPLICATION_SCOPE => 1}
hbase shell> enable 'record'
Try to follow the other tutorials above. ;)
I had problems with the NRT solution, but when I followed that whole tutorial step by step it worked.
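One more thing worth double-checking (a hedged aside, since your writer code isn't shown): readAvroContainer parses Avro binary container file data, so the bytes you store in the data:avroUser cell need to be a complete Avro container file (with the schema header), not a single serialized datum; for raw datum bytes Kite provides a separate readAvro command that takes an explicit reader schema. A minimal sketch of producing such a cell value with the generic Avro API (the class name and field values below are only illustrative):
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroUserCellValue {

    // Builds the byte[] stored in the data:avroUser cell as an Avro container file
    // (schema header + data block), which is the format readAvroContainer parses.
    public static byte[] toContainerBytes(Schema schema, String name) throws IOException {
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", name);
        user.put("favorite_number", 7);     // nullable union fields may also be null
        user.put("favorite_color", "blue");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, out);         // writes the header with the embedded schema
        writer.append(user);
        writer.close();
        return out.toByteArray();           // put this value into the data:avroUser column
    }

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(new File("user.avsc"));
        byte[] value = toContainerBytes(schema, "Name178721");
        System.out.println("container file size: " + value.length + " bytes");
    }
}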
I hope this helps someone.

Related

SolrCloud Autoscaling Automatically Adding Replicas

I'm using version 7.7 of Solr in a cluster. I have three pre-scale AWS
EC2 servers in a target group with an NLB attached, and autoscaling
turned on just for the Solr cluster.
Using the command below, I created a Solr collection (collection name: abc) with 1 shard and 3 replicas. The cluster now also includes autoscaled servers (4 replicas for collection abc).
There are currently 4 active nodes in that cluster.
The application team will create another collection (collection name: xyz) from code. They don't know the number of active nodes in the Solr cluster. In the ZooKeeper properties, the replicationFactor value is set to 3.
The collection was created from the application side with a replication factor of 3 (one of the replicas landed on an autoscaled server). Now that the load is decreasing, the autoscaled server will shut down.
At that point only 2 replicas remain for collection xyz, and the pre-scale servers do not automatically get a replica of the new collection xyz.
Please provide a solution for this.
Create collection command:
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=test_50&numShards=1&replicationFactor=3&collection.configName=conf_product&autoAddReplicas=true"
Below is the autoscaling policy I applied for Solr:
curl http://localhost:8983/solr/admin/autoscaling -H 'Content-type:application/json' -d '{ "set-cluster-policy" : [{ "replica" : "1", "shard" : "#EACH", "node" : "#ANY" }] }'
curl http://localhost:8983/solr/admin/autoscaling -H 'Content-type:application/json' -d '{ "set-trigger": { "name" : "node_added_trigger", "event" : "nodeAdded", "waitFor" : "5s", "preferredOperation": "ADDREPLICA", "enabled" : true, "actions" : [ { "name" : "compute_plan", "class": "solr.ComputePlanAction" }, { "name" : "execute_plan", "class": "solr.ExecutePlanAction" } ] }}'
curl http://localhost:8983/solr/admin/autoscaling -H 'Content-type:application/json' -d '{ "set-trigger": { "name" : "node_lost_trigger", "event" : "nodeLost", "waitFor" : "5s", "preferredOperation": "DELETENODE", "enabled" : true, "actions" : [ { "name" : "compute_plan", "class": "solr.ComputePlanAction" }, { "name" : "execute_plan", "class": "solr.ExecutePlanAction" } ] }}'

Solr COLSTATUS and LIST show deleted collection that cannot be deleted

For some of our collections, when we run a Collections API DELETE synchronously, followed immediately by a Configset API DELETE for the underlying configset, we end up with a messed-up collection state.
I have been unable to reproduce this issue in a test environment, it only happens on the live production instances inconsistently, so it may be load/race condition related.
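To make the sequence concrete, a rough SolrJ equivalent of the two back-to-back deletes looks like this (the configset name and ZooKeeper address are only illustrative, not taken from our setup):
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.client.solrj.request.ConfigSetAdminRequest;

public class DeleteCollectionThenConfigset {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient solr = new CloudSolrClient.Builder(
                Collections.singletonList("zk1.prod-internal:2181"), Optional.empty()).build()) {

            // 1) Synchronous Collections API DELETE of the collection
            CollectionAdminRequest.deleteCollection("collection_19744").process(solr);

            // 2) Immediately afterwards, Configset API DELETE of the underlying configset
            ConfigSetAdminRequest.Delete deleteConfigset = new ConfigSetAdminRequest.Delete();
            deleteConfigset.setConfigSetName("configset_19744");   // illustrative name
            deleteConfigset.process(solr);
        }
    }
}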
Running COLSTATUS against the broken collection returns the following response:
{
  "responseHeader": {
    "status": 404,
    "QTime": 33
  },
  "collection_19744": {
    "stateFormat": 2,
    "znodeVersion": 51,
    "properties": {
      "autoAddReplicas": "false",
      "maxShardsPerNode": "1",
      "nrtReplicas": "3",
      "pullReplicas": "0",
      "replicationFactor": "3",
      "router": {
        "name": "compositeId"
      },
      "tlogReplicas": "0"
    },
    "activeShards": 1,
    "inactiveShards": 0
  },
  "error": {
    "metadata": [
      "error-class",
      "org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException",
      "root-error-class",
      "org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException"
    ],
    "msg": "Error from server at http://solr1.prod-internal:8983/solr/collection_19744_shard1_replica_n4: Expected mime type application/octet-stream but got text/html. <html>\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\"/>\n<title>Error 404 Not Found</title>\n</head>\n<body><h2>HTTP ERROR 404</h2>\n<p>Problem accessing /solr/collection_19744_shard1_replica_n4/admin/segments. Reason:\n<pre> Not Found</pre></p>\n</body>\n</html>\n",
    "code": 404
  }
}
The underlying shard data for the collection has been successfully removed from disk and is not present on any of the Solr nodes. The /collections/collection_19744 node has also been successfully deleted from ZooKeeper, which I verified using the zkcli script: it returns a "NoNode for /collections/collection_19744" message.
Because COLSTATUS is broken, we cannot delete the associated configset; doing so results in a "Can not delete ConfigSet as it is currently being used by collection [collection_19744]" message, which is false.
Where exactly does COLSTATUS get its collection meta information, given that the /collections/collection_19744 node is absent from ZooKeeper?
I want to remove the broken collection metadata so I can then remove the configset and recreate the collection with the original naming.

How do I properly extract/convert JSON data into objects?

I have the structure below coming in via a webhook, and I'm having trouble working out whether there is a built-in action in Logic Apps that gives me a nicely formatted object array from the rows inside table, where each item has a name and its associated value.
"SearchResults": {
"tables": [
{
"name": "PrimaryResult",
"columns": [
{
"name": "TimeGenerated",
"type": "datetime"
},
{
"name": "ResourceGroup",
"type": "string"
},
{
"name": "ActivityStatusValue",
"type": "string"
},
{
"name": "d_resource",
"type": "dynamic"
},
{
"name": "c_title",
"type": "dynamic"
},
{
"name": "c_details",
"type": "dynamic"
}
],
"rows": [
[
"2020-06-18T16:30:07.89Z",
"USEASTPROD",
"Updated",
"aueglbwvhypap07",
"Remote disk disconnected",
"We're sorry, your virtual machine is unavailable because of connectivity loss to the remote disk. An unexpected problem is preventing us from automatically recovering your virtual machine."
],
[
"2020-06-18T16:30:07.89Z",
"USEASTPROD",
"Updated1",
"agggggypap07",
"Remote disk disconnected",
"We're sorry, your virtual machine is unavailable because of connectivity loss to the remote disk. An unexpected problem is preventing us from automatically recovering your virtual machine."
]
]
}
]
}
I'd like it to be an array where each column name is the property name and each value from rows is its value, like below:
"rows": [
[
"timeGenerated" :"2020-06-18T16:30:07.89Z",
"ResourceGroup": "USEASTPROD",
"ActivityStatusValue":"Updated",
"d_resource" : "aueglbwvhypap07",
"c_title" : "Remote disk disconnected",
"c_details": "We're sorry, your virtual machine is unavailable because of connectivity loss to the remote disk. An unexpected problem is preventing us from automatically recovering your virtual machine."
],
[
"timeGenerated" :"2020-06-18T16:30:07.89Z",
"ResourceGroup": "USEASTPROD",
"ActivityStatusValue":"Updated1",
"d_resource" : "agggggypap07",
"c_title" : "Remote disk disconnected",
"c_details": "We're sorry, your virtual machine is unavailable because of connectivity loss to the remote disk. An unexpected problem is preventing us from automatically recovering your virtual machine."
]
]
For this requirement, we can use Liquid with an integration account in the logic app. Please refer to the steps below:
1. We need to create an integration account in the Azure portal first and link it to your logic app. You can refer to this tutorial.
2. In my logic app, I initialized a variable and stored your data (adding {} around the data) to simulate your situation.
3. Then use the "Parse JSON" action to parse the string above.
4. Create a Liquid map locally (I named it testRow.liquid) with the code below:
{% assign rows = content.SearchResults.tables[0].rows %}
{
  "rows": [
    {% for item in rows %}
    {
      "timeGenerated": "{{item[0]}}",
      "ResourceGroup": "{{item[1]}}",
      "ActivityStatusValue": "{{item[2]}}",
      "d_resource": "{{item[3]}}",
      "c_title": "{{item[4]}}",
      "c_details": "{{item[5]}}"
    },
    {% endfor %}
  ]
}
Upload the Liquid map (testRow.liquid) to your integration account; for this step you can refer to this tutorial.
5. Then use the "Transform JSON to JSON" action, choose the Body from the "Parse JSON" action above, and use the map which you uploaded.
6. After running the logic app, we can get the result as below:
The whole result json is:
{
  "rows": [
    {
      "timeGenerated": "6/18/2020 4:30:07 PM",
      "ResourceGroup": "USEASTPROD",
      "ActivityStatusValue": "Updated",
      "d_resource": "aueglbwvhypap07",
      "c_title": "Remote disk disconnected",
      "c_details": "We're sorry, your virtual machine is unavailable because of connectivity loss to the remote disk. An unexpected problem is preventing us from automatically recovering your virtual machine."
    },
    {
      "timeGenerated": "6/18/2020 4:30:07 PM",
      "ResourceGroup": "USEASTPROD",
      "ActivityStatusValue": "Updated1",
      "d_resource": "agggggypap07",
      "c_title": "Remote disk disconnected",
      "c_details": "We're sorry, your virtual machine is unavailable because of connectivity loss to the remote disk. An unexpected problem is preventing us from automatically recovering your virtual machine."
    }
  ]
}
By the way:
The format of the sample you provided is not valid JSON; we must use {} instead of [] for each item.
In the Liquid map, the datetime is automatically converted to another format.
Hope it helps~

How to add a new query listener via the Solr Config API?

I'm trying to add a new query listener with the Solr Config API.
To do that, I POST the following JSON to the Config API endpoint of my Solr core:
{
  "add-listener": {
    "name": "listener1",
    "class": "solr.QuerySenderListener",
    "event": "newSearcher",
    "queries": [
      {
        "q": "any:whatever",
        "rows": "10",
        "start": "0"
      }
    ]
  }
}
This action populates the "configoverlay.json" file, which overrides the standard solrconfig.xml. That file now looks like the following:
{
  "listener": {
    "listener1": {
      "name": "listener1",
      "class": "solr.QuerySenderListener",
      "event": "newSearcher",
      "queries": [
        {
          "q": "any:whatever",
          "rows": "10",
          "start": "0"
        }
      ]
    }
  }
}
For some reason, this structure doesn't seem to be OK for Solr. I get the following error in the Solr logs:
null:java.lang.ClassCastException: java.util.LinkedHashMap cannot be cast to org.apache.solr.common.util.NamedList
at org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:47)
at org.apache.solr.core.SolrCore$5.call(SolrCore.java:1835)
at java.util.concurrent.FutureTask.run(Unknown Source)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:210)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
And of course, the query associated with this event is therefore not executed as intended.
Apparently, the query format doesn't fit what Solr expects... I didn't find any example of updating or creating events in the Solr Config API documentation.
The problem must be related to the syntax of the parameters of the queries objects.
If you know how to format the JSON message to produce a proper listener, let me know!
In case it matters, I'm using Solr 5.3.1.
P.S.: The point of doing this via the Config API instead of solrconfig.xml is to be able to change the warming queries dynamically (to reflect configuration changes in our product, which uses Solr for searching purposes).
Guillaume

How to query Index and selector using ektorp Java API

I am using the ektorp 1.4.1 JAR to connect to a Cloudant database. I am able to write map and reduce functions using a class EventRepository extends CouchDbRepositorySupport, but my problem is: how can I query an index and a selector using the ektorp Java API? Please help me out here. Thanks in advance.
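(For context, the kind of repository I mean is sketched below; the view name and field are only illustrative, and Event is the POJO we persist. Map/reduce views like this work fine, but I don't see anything similar for index/selector queries.)
import java.util.List;

import org.ektorp.CouchDbConnector;
import org.ektorp.support.CouchDbRepositorySupport;
import org.ektorp.support.View;

// Illustrative sketch only: map/reduce views are easy through CouchDbRepositorySupport,
// but it exposes nothing for Cloudant Query (_find) style selectors.
@View(name = "by_userName", map = "function(doc) { if (doc.userName) { emit(doc.userName, null); } }")
public class EventRepository extends CouchDbRepositorySupport<Event> {

    public EventRepository(CouchDbConnector db) {
        super(Event.class, db);
    }

    public List<Event> findByUserName(String userName) {
        // queryView is inherited from CouchDbRepositorySupport
        return queryView("by_userName", userName);
    }
}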
This is my query index:
{
  "index": {
    "fields": [
      { "name": "userName", "type": "string" }
    ]
  },
  "type": "text"
}
and here is my selector code for getting all events from Cloudant by user name, in descending order by startDate:
{
  "selector": {
    "userName": "vekusuma#in.ibm.com"
  },
  "fields": [
    "userName",
    "startDate",
    "days",
    "_id",
    "_rev"
  ],
  "sort": [
    {
      "userName": "desc"
    }
  ]
}
I am using the code below to connect to Cloudant using the Cloudant Java API...
CloudantClient client = ClientBuilder.url(new URL("https://userName:password#*****.cloudant.com")).username("*******")
        .password("*******").build();
List<String> dbsList = client.getAllDbs();
System.out.println("...dbsList size is :: " + dbsList.size());
I have also tried it a different way:
CloudantClient client = ClientBuilder.account("username").username("username").password("password")
        .build();
but still the same issue...
... and I am getting the error below while running locally on WebSphere Server 7.0 from Eclipse...
***********Error*************
[3/9/16 23:53:43:547 IST] 00000031 SystemErr R com.cloudant.client.org.lightcouch.CouchDbException: 400 Bad request: 400 Bad request
Your browser sent an invalid request.
[3/9/16 23:53:43:548 IST] 00000031 SystemErr R at com.cloudant.client.org.lightcouch.CouchDbClient.execute(CouchDbClient.java:501)
Please help me out here... Thanks in advance :)
From my (quick) reading of the ektorp API, it doesn't support Cloudant Query (the product name for the query/selector feature). However, there is an official Cloudant Java library which does support the Cloudant Query endpoints.
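For example, with the java-cloudant client, the index and selector from the question translate to roughly the following. This is a sketch based on the 2.x API (check the Javadoc of your version for the exact findByIndex signature); the account, database name, and Event POJO are only illustrative:
import java.util.List;

import com.cloudant.client.api.ClientBuilder;
import com.cloudant.client.api.CloudantClient;
import com.cloudant.client.api.Database;
import com.cloudant.client.api.model.FindByIndexOptions;
import com.cloudant.client.api.model.IndexField;
import com.cloudant.client.api.model.IndexField.SortOrder;

public class EventQuery {

    // Minimal illustrative POJO; field names match the "fields" list of the selector.
    public static class Event {
        public String userName;
        public String startDate;
        public Integer days;
        public String _id;
        public String _rev;
    }

    public static void main(String[] args) {
        CloudantClient client = ClientBuilder.account("myaccount")
                .username("myusername")
                .password("mypassword")
                .build();
        Database db = client.database("eventsdb", false);   // database name is illustrative

        // Create the text index (same JSON as the query index above).
        db.createIndex("{ \"index\": { \"fields\": [ { \"name\": \"userName\", \"type\": \"string\" } ] }, \"type\": \"text\" }");

        // Run the selector, sorted by userName descending.
        List<Event> events = db.findByIndex(
                "\"selector\": { \"userName\": \"vekusuma#in.ibm.com\" }",
                Event.class,
                new FindByIndexOptions()
                        .fields("userName").fields("startDate").fields("days")
                        .fields("_id").fields("_rev")
                        .sort(new IndexField("userName", SortOrder.desc)));
        System.out.println("found " + events.size() + " events");
    }
}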
