hbase execute batch statement - database

I am using Lucene 3.0.1 to index a column in HBase. After running a query in Lucene I get back an array of keys (in the same format as my HBase row keys) in Java. For each of these keys I now want to query HBase and fetch the corresponding rows. I am not able to find an IN operator in the HBase documentation. The other option is to loop over the set of keys and query HBase for each one, but that would mean a lot of separate calls to HBase. Is there any other option? Any help is much appreciated. Thanks

The get method of the HTable class can accept a list of Get objects and fetch them all in one batch; see the documentation.
You essentially need to do something like this:
// build one Get per Lucene hit, then fetch them all in a single batched call
List<Get> rowsToGet = new ArrayList<Get>();
for (String id : resultsFromLucene) {
    rowsToGet.add(new Get(Bytes.toBytes(id)));
}
Result[] results = htable.get(rowsToGet);
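To read the matched rows back out, something like the following works; the "cf" column family and "col" qualifier are placeholders for whatever your schema actually uses, and keys that were not found come back as empty Results:
for (Result r : results) {
    if (r.isEmpty()) {
        continue; // no row stored under that key
    }
    byte[] value = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
    System.out.println(Bytes.toString(r.getRow()) + " -> " + Bytes.toString(value));
}
Since the whole list goes through one batched get call, you avoid issuing one round trip per key.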

Related

Watson Discovery - Delete documents

Is there any possibility to delete documents from a Watson Discovery collection by date?
In a SQL database I would do something like this:
DELETE FROM collection_name
WHERE publication_date < '2018-01-01';
I know I can delete single documents by name, and I could query the documents with a publication_date filter and then iterate over the document names and delete every single document, but this seems a rather tedious approach for a quite simple task.
@user3609367 there is no way to delete multiple documents with a single API call. The approach you have mentioned in your post is the best way to do what you are asking.
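For what it's worth, the query-then-delete loop is only two kinds of REST calls. Below is a rough sketch using Java's built-in HttpClient; the base URL, version date, filter/return parameter names and API-key auth are assumptions about the Discovery v1 API (and YOUR_ENV_ID, YOUR_COLLECTION_ID, YOUR_APIKEY are placeholders), so check them against your service credentials and the current docs:
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.List;

String base = "https://gateway.watsonplatform.net/discovery/api/v1"
        + "/environments/YOUR_ENV_ID/collections/YOUR_COLLECTION_ID";
String auth = "Basic " + Base64.getEncoder()
        .encodeToString("apikey:YOUR_APIKEY".getBytes(StandardCharsets.UTF_8));
HttpClient http = HttpClient.newHttpClient();

// 1) ask for the ids of documents older than the cut-off date
String filter = URLEncoder.encode("publication_date<2018-01-01", StandardCharsets.UTF_8);
HttpRequest query = HttpRequest.newBuilder(URI.create(base
        + "/query?version=2018-03-05&filter=" + filter + "&return=id&count=50"))
        .header("Authorization", auth).build();
String json = http.send(query, HttpResponse.BodyHandlers.ofString()).body();

// 2) extract results[*].id from the JSON with your JSON library of choice,
//    then delete each document individually -- there is no bulk delete
List<String> docIds = List.of(); // populate from the query response
for (String docId : docIds) {
    HttpRequest del = HttpRequest.newBuilder(URI.create(base
            + "/documents/" + docId + "?version=2018-03-05"))
            .DELETE().header("Authorization", auth).build();
    http.send(del, HttpResponse.BodyHandlers.ofString());
}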

MongoDB: Does findOne retrieve the whole collection from the database to the server

I am trying to make a website and I'm using MongoDB as the database. I have a question about the performance of the findOne query, which I use widely. Does this query pull the whole collection from the database to the application server and then iterate over it there, or does the iteration happen on the database, with only the matching document returned to the server? Pulling the whole collection over would be an issue, because transferring such a huge chunk of data takes time.
Understanding how MongoDB uses indexes will help answer this question. If the parameters you pass to findOne match an index on the collection, then MongoDB will use that index to find your results. Without an index, MongoDB will need to scan the collection until it finds a match.
For example, if you run a query like:
db.coll.findOne({"_id": ObjectId("5a0a0e6f29642fd7a970420c")})
then MongoDB knows exactly which document you want, since the _id field is unique and automatically indexed. If you query on another field which isn't indexed, then MongoDB will need to do a COLLSCAN (a full collection scan) to find the document(s) to return.
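The same behaviour applies from a driver. Here is a minimal sketch with the MongoDB Java sync driver (the database, collection and field names are placeholders), including an index on a second field so lookups on it avoid a COLLSCAN:
import static com.mongodb.client.model.Filters.eq;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;
import org.bson.types.ObjectId;

MongoClient client = MongoClients.create("mongodb://localhost:27017");
MongoCollection<Document> coll = client.getDatabase("mydb").getCollection("coll");

// _id lookup: served by the default unique index, only one document crosses the wire
Document byId = coll.find(eq("_id", new ObjectId("5a0a0e6f29642fd7a970420c"))).first();

// index another field you query often, so findOne on it is an IXSCAN instead of a COLLSCAN
coll.createIndex(Indexes.ascending("username"));
Document byName = coll.find(eq("username", "steve")).first();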
Quoting the official MongoDB documentation:
findOne - Returns one document that satisfies the specified query criteria on the collection or view. If multiple documents satisfy the query, this method returns the first document according to the natural order, which reflects the order of documents on the disk.
The implication is that the database itself returns only the one matching document, not the whole collection. And if you're ever unsure, you can always use Postman or console.log to check what the server actually returns.

DIH in SOLR based on primary key of a table

Currently I am using DIH to pull data from an MSSQL server into Solr, where I use dataimporter.last_index_time to pull only the records that were added to the database after last_index_time. I was exploring whether there is any other option for DIH to use instead of last_index_time, maybe something like a last_pk_id.
Is such an option available? Could anyone let me know?
This is not provided by Solr itself.
But nothing prevents you from doing this:
Set your DIH SQL for the delta like this:
WHERE (last_pk_id > '${dataimporter.request.LAST_PK_ID}')
When you run an indexing pass, store the last last_pk_id value you indexed (say 333) somewhere outside Solr.
The next time you need to delta index, add it to your request:
...&clean=false&LAST_PK_ID=333
Afterwards, store your new LAST_PK_ID (you can query Solr for it).
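As a minimal sketch of triggering such a run from Java (the core name mycore and the /dataimport handler path are assumptions about your setup, adjust them to match your config):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

HttpClient http = HttpClient.newHttpClient();
// the LAST_PK_ID request parameter is what ${dataimporter.request.LAST_PK_ID} resolves to
String url = "http://localhost:8983/solr/mycore/dataimport"
        + "?command=full-import&clean=false&LAST_PK_ID=333";
HttpResponse<String> resp = http.send(
        HttpRequest.newBuilder(URI.create(url)).build(),
        HttpResponse.BodyHandlers.ofString());
System.out.println(resp.body());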

Querying Twitter JSON File in HBase

I have successfully downloaded Twitter data through Flume directly into an HBase table containing one column family, and all of the data is stored in one column, like this:
hbase(main):005:0> scan 'tweet'
ROW
default00fbf898-6f6e-4b41-aee8-646efadfba46
COLUMN+CELL
column=data:pCol, timestamp=1454394077534, value={"extended_entities":{"media":[{"display_url":"pic.twitter.com/a7Mjq2daKZ","source_user_id":2987221847,"type":"photo"....
Now I want to access structs and arrays through HBase like we can access them in Hive. I have tried googling the issue but am still clueless. Kindly help.
You can't query display_url, source_user_id or other JSON fields in HBase directly; HBase just stores the value as an opaque byte array. You should use a document-store NoSQL database like MongoDB.
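That said, if you just need to read a field such as display_url out of a stored tweet from Java, a workaround is to fetch the cell and parse the JSON on the client. A rough sketch with the HBase client and Jackson, using the table, family, qualifier and row key from the scan output above:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "tweet");

Get get = new Get(Bytes.toBytes("default00fbf898-6f6e-4b41-aee8-646efadfba46"));
Result row = table.get(get);
byte[] rawJson = row.getValue(Bytes.toBytes("data"), Bytes.toBytes("pCol"));

// HBase only sees an opaque byte[]; the JSON structure is parsed client-side
JsonNode tweet = new ObjectMapper().readTree(rawJson);
String displayUrl = tweet.path("extended_entities").path("media").path(0)
        .path("display_url").asText();
System.out.println(displayUrl);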

Can SOLR perform an UPSERT?

I've been attempting to do the equivalent of an UPSERT (insert, or update if the document already exists) in Solr. I only know what does not work, and the Solr/Lucene documentation I have read has not been helpful. Here's what I have tried:
curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"1","name":{"set":"steve"}}]'
{"responseHeader":{"status":409,"QTime":2},"error":{"msg":"Document not found for update. id=1","code":409}}
I do up to 50 updates in one request, and a request may contain the same id with mutually exclusive fields (title_en and title_es, for example). If there were a way of querying whether or not a list of ids exists, I could split the data and perform separate insert and update commands... This would be an acceptable alternative, but is there already a handler that does this? I would like to avoid writing any in-house routines at this point.
Thanks.
With Solr 4.0 you can do a partial update of all those documents, sending just the fields that have changed while keeping the rest of the document the same. The id has to match an existing document.
Solr does not support UPSERT mechanics out of the box. You can create a record or you can update a record, and the syntax is different.
And if you update a record, you must make sure all of its other previously inserted fields are stored (not just indexed). Under the covers, an update creates a completely new record pre-populated with the previously stored values. But that functionality is very deep in (probably in Lucene itself).
Have you looked at DataImportHandler? You reverse the control flow (start from Solr), but it does have support for checking which records need to be updated and which records need to be created.
Or you can just run a Solr query like http://solr.example.com:8983/solr/select?q=id%3A(ID1+ID2+ID3)&fl=id&wt=csv where you ask Solr to look for your ID records and return only the IDs of the records it does find. Then you could post-process that to split your batch into updates and inserts.
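A rough sketch of that existence check with a recent SolrJ client, assuming a core at http://localhost:8983/solr/mycore (the id values are placeholders):
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

List<String> batchIds = Arrays.asList("ID1", "ID2", "ID3");
SolrQuery q = new SolrQuery("id:(" + String.join(" OR ", batchIds) + ")");
q.setFields("id");
q.setRows(batchIds.size());

QueryResponse resp = solr.query(q);
Set<String> existing = new HashSet<>();
for (SolrDocument doc : resp.getResults()) {
    existing.add((String) doc.getFieldValue("id"));
}
// ids in `existing` get the {"set": ...} atomic-update syntax, the rest get plain inserts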
