I have successfully downloaded Twitter data through Flume directly into an HBase table with one column family, and all of the data is stored in a single column, like this:
hbase(main):005:0> scan 'tweet'
ROW                                           COLUMN+CELL
 default00fbf898-6f6e-4b41-aee8-646efadfba46  column=data:pCol, timestamp=1454394077534, value={"extended_entities":{"media":[{"display_url":"pic.twitter.com/a7Mjq2daKZ","source_user_id":2987221847,"type":"photo"....
Now I want to access structs and arrays through HBase the way we can access them in Hive. I have tried googling the issue but am still clueless. Kindly help.
You can't query display_url, source_user_id, or other JSON fields in HBase directly. You should use a document-store NoSQL database like MongoDB instead.
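If you have to stay on HBase, the usual workaround is to fetch the whole cell and parse the JSON on the client. A minimal sketch, assuming the 'tweet' table and data:pCol layout from the scan above, with the example row key filled in and Jackson (an assumption, any JSON library works) doing the parsing:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// HBase stores the value as opaque bytes, so the JSON structure
// has to be parsed client-side after the Get
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "tweet");
Get get = new Get(Bytes.toBytes("default00fbf898-6f6e-4b41-aee8-646efadfba46"));
Result result = table.get(get);
byte[] raw = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("pCol"));

// Walk into the struct/array the way Hive's dot/index syntax would
JsonNode tweet = new ObjectMapper().readTree(Bytes.toString(raw));
JsonNode media = tweet.path("extended_entities").path("media").path(0);
System.out.println(media.path("display_url").asText());
table.close();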
I am building an Angular 8 application and am storing JSON data in a List data type in DynamoDB. I can insert the records just fine and can query the table for the data, but I'm having issues grabbing the data in the List data type.
Here is how it looks in a console log
I don't have any issues grabbing the String data values, only the nested data in the List data type.
If your issue is related to parsing the objects returned from DynamoDB, you can use the DynamoDB Converter:
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/DynamoDB/Converter.html#unmarshall-property
This will convert the returned DynamoDB record into a plain JSON record.
Also, if you're using the SDK, consider using the DocumentClient (https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/DynamoDB/DocumentClient.html), which will automatically convert DynamoDB records into JSON records.
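For illustration, here is what that conversion does with a hypothetical record (the field names are made up). A raw item comes back in DynamoDB's attribute-value form, with the List under an "L" key:
{"id": {"S": "123"}, "tags": {"L": [{"S": "a"}, {"S": "b"}]}}
and Converter.unmarshall (or the DocumentClient) turns it into the plain object you can use directly in Angular, nested list included:
{"id": "123", "tags": ["a", "b"]}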
I have a table in Cassandra in which one column is a BLOB.
I wish to update only some values in that blob. Is that possible?
Example:
The string form of the BLOB is, let's say:
{"name":"ABC","rollNum": "1234"}
I want to make it:
{"name":"ABC","rollNum": "1333"} with a CQL update query.
Originally this column gets updated from my Java code, where I send a byte[] to be inserted into this BLOB column.
Now I want to update just some fields without doing any kind of select on this row.
You can't do this in general.
Cassandra, like any other database, does not know how to interpret your blob. Your options are:
read, parse, update, and save your blob again;
use a map instead (see the CQL sketch below);
use single fields, which will give the best performance.
Apart from that, updates like the one you want can be achieved in document databases like MongoDB.
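A sketch of the map alternative, with a hypothetical table and column names (student, name, attrs) standing in for yours:
CREATE TABLE student (
    name text PRIMARY KEY,
    attrs map<text, text>
);

-- a single map entry can be set without selecting the row first
UPDATE student SET attrs['rollNum'] = '1333' WHERE name = 'ABC';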
I want to return all the documents in a Cloudant database but only include some of the fields. I know that you can make a GET request against https://$USERNAME.cloudant.com/$DATABASE/_all_docs, but there doesn't seem to be a way to select only certain fields.
Alternatively, you can POST to /db/_find and include selector and fields in the JSON body. However, is there a universal selector, similar to SELECT * in SQL databases?
You can use {"_id":{"$gt":0}} as your selector to match all the documents, although you should note that it will not be performant on large data sets.
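For example, a full _find request body using that catch-all selector (the entries in fields are hypothetical):
{
  "selector": {"_id": {"$gt": 0}},
  "fields": ["_id", "name", "email"]
}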
Say I have around 10-20 GB of data in HDFS as a Hive table. This has been obtained after several MapReduce jobs and a JOIN over two separate datasets. I need to make this queryable by the user. What options do I have?
Use Sqoop to transfer data from HDFS to an RDS like PostgreSQL. But I want to avoid spending so much time on data transfer. I just tested HDFS -> RDS in the same AWS region using Sqoop, and 800 MB of data takes 4-8 minutes. So you can imagine that ~60 GB of data would be pretty unmanageable. This would be my last resort.
Query Hive directly from my web server per user request. I haven't ever heard of Hive being used like this, so I'm skeptical about it. This struck me because I just found out you can query Hive tables remotely after some port forwarding on the EMR cluster. But being new to big(ish) data, I'm not quite sure about the risks associated with this. Is it commonplace to do this?
Some other solution - how do people usually do this kind of thing? It seems like a pretty common task.
Just for completeness' sake, my data looks like this:
id time cat1 cat2 cat3 metrics[200]
A123 1234212133 12 ABC 24 4,55,231,34,556,123....(~200)
.
.
.
(time is epoch)
And my queries look like this:
select cat1, corr(metrics[2],metrics[3]),corr(metrics[2],metrics[4]),corr(metrics[2],metrics[5]),corr(metrics[2],metrics[6]) from tablename group by cat1;
I need the correlation function, which is why I've chosen PostgreSQL over MySQL.
You have a correlation function in Hive:
corr(col1, col2)
Returns the Pearson coefficient of correlation of a pair of numeric columns in the group.
You can simply connect to a HiveServer port via ODBC and execute queries.
Here is an example:
http://www.cloudera.com/content/cloudera/en/downloads/connectors/hive/odbc/hive-odbc-v2-5-10.html
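If your web server is Java, the JDBC route is a close alternative to the ODBC connector linked above. A rough sketch; the host, port, and credentials are placeholders, and the Hive JDBC driver must be on the classpath:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// register the HiveServer2 JDBC driver
Class.forName("org.apache.hive.jdbc.HiveDriver");
Connection conn = DriverManager.getConnection(
        "jdbc:hive2://emr-master:10000/default", "hive", "");
Statement stmt = conn.createStatement();

// the same kind of aggregate query you want to expose to users
ResultSet rs = stmt.executeQuery(
        "select cat1, corr(metrics[2], metrics[3]) from tablename group by cat1");
while (rs.next()) {
    System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
}
conn.close();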
Hue (Hadoop User Experience) has a Beeswax query editor designed specifically to expose Hive to end users who are comfortable with SQL. This way they can run ad-hoc queries against the data residing in Hive without needing to move it elsewhere. You can see an example of the Beeswax Query Editor here: http://demo.gethue.com/beeswax/#query
Will that work for you?
What I understand from the question posted above is that you have some data (~20 GB) stored in HDFS via Hive, and you now want to access that data to run statistical functions like correlation over it.
Hive has built-in functions that perform correlation.
Otherwise you can connect R directly to Hive using RHive, or even Excel to Hive using a data source.
Another solution is installing Hue, which comes with Hive editors from which you can query Hive directly.
I am using Lucene 3.0.1 to index a column in HBase. After running a Lucene query I get back an array of keys (in the same format as my HBase keys) in Java, and for each of these keys I want to query HBase and fetch the corresponding row. I am not able to find an IN operator in the HBase documentation. The other option is to loop over the set of keys and query HBase once per key, but in that case I will be making a lot of HBase calls. Is there any other option? Any help is much appreciated. Thanks
The get method of the HTable class can accept a list of Get objects and fetch them all in one batch; see the documentation.
You essentially need to do something like:
// build one Get per row key returned by the Lucene query
List<Get> rowsToGet = new ArrayList<Get>();
for (String id : resultsFromLucene) {
    rowsToGet.add(new Get(Bytes.toBytes(id)));
}
// a single batched round trip instead of one RPC per key
Result[] results = htable.get(rowsToGet);