HiveQL to HBase

I am using Hive 0.14 and HBase 0.98.8.
I would like to use HiveQL to access an HBase "table".
I created a table with a complex composite rowkey:
CREATE EXTERNAL TABLE db.hive_hbase (rowkey struct<p1:string, p2:string, p3:string>, column1 string, column2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ';'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,cf:c1,cf:c2")
TBLPROPERTIES("hbase.table.name"="hbase_table");
The table is created successfully, but the following HiveQL query takes forever:
SELECT * from db.hive_hbase WHERE rowkey.p1 = 'xyz';
Queries that don't use the rowkey are fine, and filtering via the HBase shell works as well.
I can't find anything in the logs, but I suspect there could be an issue with complex composite keys and performance.
Has anybody faced the same issue? Any hints on how to solve it, or other ideas I could try?
Thank you
Update 16.07.15:
I changed the log4j level to DEBUG and found some interesting information.
The log says:
2015-07-15 15:56:41,232 INFO ppd.OpProcFactory (OpProcFactory.java:logExpr(823)) - Pushdown Predicates of FIL For Alias : hive_hbase
2015-07-15 15:56:41,232 INFO ppd.OpProcFactory (OpProcFactory.java:logExpr(826)) - (rowkey.p1 = 'xyz')
But some lines later:
2015-07-15 15:56:41,430 DEBUG ppd.OpProcFactory (OpProcFactory.java:pushFilterToStorageHandler(1051)) - No pushdown possible for predicate: (rowkey.p1 = 'xyz')
So my guess is: HiveQL over HBase does not push the predicate down into HBase but instead starts a plain MapReduce job.
Could there be a bug with the predicate pushdown?

I tried a similar setup using Hive 0.13 and it works fine; I got the result. What version of Hive are you working on?
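One thing that might be worth trying (just a sketch, not verified on Hive 0.14): map the same HBase table a second time with the row key as a plain string, so the filter no longer involves the struct, and check in the DEBUG log whether the key predicate gets pushed down then:
CREATE EXTERNAL TABLE db.hive_hbase_flatkey (rowkey string, column1 string, column2 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:c1,cf:c2")
TBLPROPERTIES ("hbase.table.name" = "hbase_table");

-- hypothetical prefix-range scan over the composite key, with 'xyz' as the first key part
SELECT * FROM db.hive_hbase_flatkey WHERE rowkey >= 'xyz' AND rowkey < 'xyz~';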

Related

Is there any way we can parse a string expression in the Apache Flink Table API?

I am trying to perform an aggregation using the Flink Table API, accepting the group-by field and the aggregation expressions as string parameters from the user.
Input
GroupBy field = department
Aggregation field expression = count(employeeId), max(salary)
Is there any way we can do this using the Flink Table API? I tried the following, but it didn't help. Does Flink have anything equivalent to the selectExpr function in Spark?
https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.selectExpr.html
employeeTable
.groupBy($("department"))
.select(
$("department"),
$("count(employeeId)").as("numberOfEmployees"),
$("max(salary)").as("maxSalary")
)
It throws the following exception:
Exception in thread "main" org.apache.flink.table.api.ValidationException: Cannot resolve field [count(employeeId)], input field list:[department].
No, I don't believe this will work. Flink's SQL planner wants to know what the query is doing at compile time.
What you can do is construct a SQL query and create a new job to run that query. The SQL gateway that's coming in Flink 1.16 (see FLIP-91) should make this easier.
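For illustration, here is a rough sketch of that approach (it assumes a StreamTableEnvironment named tEnv with the employee table registered as "employees"; the helper and all names are placeholders, not part of your code):
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

// Hypothetical helper: splice the user-supplied strings into a SQL query
// and let Flink's SQL planner parse them, instead of using the Table API.
public static Table aggregate(StreamTableEnvironment tEnv, String groupByField, String aggExpressions) {
    return tEnv.sqlQuery(
        "SELECT " + groupByField + ", " + aggExpressions +
        " FROM employees GROUP BY " + groupByField);
}
// e.g. aggregate(tEnv, "department",
//      "count(employeeId) AS numberOfEmployees, max(salary) AS maxSalary");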
I think you have the wrong syntax here:
.select(
$("department"),
$("count(employeeId)").as("numberOfEmployees"),
$("max(salary)").as("maxSalary")
)
count and max should be called like this:
$("employeeId").count().as("numberOfEmployees"),
$("salary").max().as("maxSalary")
You can check the built-in functions here.

Error launching query in GAE Firestore DatastoreException: no matching index found

I have a problem executing a query on Firestore from Google App Engine. Insertion is successful, but when I try to run a simple query I get the following error:
com.google.cloud.datastore.DatastoreException: no matching index found.
at com.google.cloud.datastore.spi.v1.HttpDatastoreRpc.translate(HttpDatastoreRpc.java:128)
at com.google.cloud.datastore.spi.v1.HttpDatastoreRpc.translate(HttpDatastoreRpc.java:113)
at com.google.cloud.datastore.spi.v1.HttpDatastoreRpc.runQuery(HttpDatastoreRpc.java:181)
at com.google.cloud.datastore.DatastoreImpl$1.call(DatastoreImpl.java:180)
at com.google.cloud.datastore.DatastoreImpl$1.call(DatastoreImpl.java:177)
at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
at com.google.cloud.datastore.DatastoreImpl.runQuery(DatastoreImpl.java:176)
at com.google.cloud.datastore.QueryResultsImpl.sendRequest(QueryResultsImpl.java:73)
at com.google.cloud.datastore.QueryResultsImpl.<init>(QueryResultsImpl.java:57)
at com.google.cloud.datastore.DatastoreImpl.run(DatastoreImpl.java:170)
at com.google.cloud.datastore.DatastoreImpl.run(DatastoreImpl.java:161)
...
Caused by:
com.google.datastore.v1.client.DatastoreException: no matching index found., code=FAILED_PRECONDITION
at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:136)
at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:185)
at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:96)
at com.google.datastore.v1.client.Datastore.runQuery(Datastore.java:119)
at com.google.cloud.datastore.spi.v1.HttpDatastoreRpc.runQuery(HttpDatastoreRpc.java:179)
at com.google.cloud.datastore.DatastoreImpl$1.call(DatastoreImpl.java:180)
at com.google.cloud.datastore.DatastoreImpl$1.call(DatastoreImpl.java:177)
at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
at com.google.cloud.datastore.DatastoreImpl.runQuery(DatastoreImpl.java:176)
at com.google.cloud.datastore.QueryResultsImpl.sendRequest(QueryResultsImpl.java:73)
at com.google.cloud.datastore.QueryResultsImpl.<init>(QueryResultsImpl.java:57)
at com.google.cloud.datastore.DatastoreImpl.run(DatastoreImpl.java:170)
at com.google.cloud.datastore.DatastoreImpl.run(DatastoreImpl.java:161)
The query I run is as follows:
Query<Entity> q = Query.newEntityQueryBuilder()
.setKind(tableName)
.setOrderBy(OrderBy.asc("t"))
.setFilter(PropertyFilter.le("t", 1000))
.build();
QueryResults<Entity> result = datastore.run(q);
It doesn't seem to me like a query that needs an index; I read that for a single property the index is created automatically. I nevertheless created a single-field index in Firebase, but I still get the same error.
Can someone help me?
Thanks
The problem in your Datastore query is that the order of the calls is switched: first filter, then order by.
Query<Entity> q = Query.newEntityQueryBuilder()
.setKind(tableName)
.setFilter(PropertyFilter.le("t", 1000))
.setOrderBy(OrderBy.asc("t"))
.build();
QueryResults<Entity> result = datastore.run(q);
The built-in indexes can be used only in very specific cases. Some cases always require a composite index, and yours appears to be one of them. From Index configuration:
For more complex queries, an application must define composite, or
manual, indexes. Composite indexes are required for queries of the
following form:
...
Queries with one or more filters and one or more sort orders
You have both a filter and a sort order in your query.

How to indicate the database in SparkSQL over Hive in Spark 1.3

I have a simple piece of Scala code that retrieves data from a Hive database and creates an RDD out of the result set. It works fine with HiveContext. The code is similar to this:
val hc = new HiveContext(sc)
val mySql = "select PRODUCT_CODE, DATA_UNIT from account"
hc.sql("use myDatabase")
val rdd = hc.sql(mySql).rdd
The version of Spark I'm using is 1.3. The problem is that the default setting for hive.execution.engine is 'mr', which makes Hive use MapReduce, and that is slow. Unfortunately I can't force it to use "spark".
I tried to use SQLContext instead, replacing the context with hc = new SQLContext(sc), to see if performance would improve. With this change the line
hc.sql("use myDatabase")
throws the following exception:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier use found
use myDatabase
^
The Spark 1.3 documentation says that SparkSQL can work with Hive tables. My question is how to indicate that I want to use a certain database instead of the default one.
use database is supported in later Spark versions:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/use-database.html
You need to put the statement in two separate spark.sql calls like this:
spark.sql("use mydb")
spark.sql("select * from mytab_in_mydb").show
Go back to creating the HiveContext. The HiveContext gives you the ability to create a DataFrame using Hive's metastore. Spark only uses the metastore from Hive and doesn't use Hive as a processing engine to retrieve the data. So when you create the DataFrame from your SQL query, it's really just asking Hive's metastore "Where is the data, and what's the format of the data?"
Spark takes that information and runs its own processing against the underlying data on HDFS. So Spark is executing the query, not Hive.
When you create the SQLContext instead, you remove the link between Spark and the Hive metastore, so the error is saying it doesn't understand what you want to do.
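A minimal sketch of that approach, assuming Spark 1.3 with Hive support on the classpath (the database and table names are taken from the question):
import org.apache.spark.sql.hive.HiveContext

// HiveContext understands "use <db>" as well as db-qualified table names
val hc = new HiveContext(sc)
hc.sql("use myDatabase")
val rdd = hc.sql("select PRODUCT_CODE, DATA_UNIT from account").rdd
// or, equivalently, qualify the table name directly:
val rdd2 = hc.sql("select PRODUCT_CODE, DATA_UNIT from myDatabase.account").rdd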
I have not been able to get the use database command to work, but here is a workaround to use the desired database:
spark-shell --queue QUEUENAME;
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val res2 = sqlContext.sql("select count(1) from DB_NAME.TABLE_NAME")
res2.collect()

Performance issue with django exclude

I have a Django 1.8 application, and I am using an MS SQL Server database with pyodbc as the database backend (via the django-pyodbc-azure module).
I have the following models:
class Branch(models.Model):
    name = models.CharField(max_length=30)
    startTime = models.DateTimeField()

class Device(models.Model):
    uid = models.CharField(max_length=100, primary_key=True)
    type = models.CharField(max_length=20)
    firstSeen = models.DateTimeField()
    lastSeen = models.DateTimeField()

class Session(models.Model):
    device = models.ForeignKey(Device)
    branch = models.ForeignKey(Branch)
    start = models.DateTimeField()
    end = models.DateTimeField(null=True, blank=True)
I need to query the session model, and I want to exclude some records with specific device values. So I issue the following query:
sessionCount = (Session.objects.filter(branch=branch)
                .exclude(device__in=badDevices)
                .filter(end__gte=F('start') + timedelta(minutes=30))
                .count())
badDevices is a pre-filled list of device ids with around 60 items.
badDevices = ['id-1', 'id-2', ...]
This query takes around 1.5 seconds to complete. If I remove the exclude from the query, it takes around 250 milliseconds.
I printed the generated SQL for this queryset and tried it in my database client. There, both versions executed in around 250 milliseconds.
This is the generated SQL:
SELECT [session].[id], [session].[device_id], [session].[branch_id], [session].[start], [session].[end]
FROM [session]
WHERE ([session].[branch_id] = my-branch-id AND
NOT ([session].[device_id] IN ('id-1', 'id-2', 'id-3',...)) AND
DATEPART(dw, [session].[start]) = 1
AND [session].[end] IS NOT NULL AND
[session].[end] >= ((DATEADD(second, 600, CAST([session].[start] AS datetime)))))
So using the exclude at the database level doesn't seem to affect the query performance, but in Django the query runs 6 times slower when I add the exclude part. What could be causing this?
The general issue seems to be that Django is doing some extra work to prepare the exclude clause. After that step, by the time the SQL has been generated and sent to the database, there isn't anything interesting happening on the Django side that could cause such a significant delay.
In your case, one thing that might be causing this is some kind of pre-processing of badDevices. If, for instance, badDevices is a QuerySet, then Django might be executing the badDevices query just to prepare the actual query's SQL. Something similar might be happening in the case where device has a non-default primary key.
The other thing that might delay the SQL preparation is, of course, django-pyodbc-azure. Maybe it's doing something strange while compiling the query and it becomes a bottleneck.
This is all wild speculation though, so if you're still having this issue, post the Device and Branch models as well, the exact content of badDevices and the SQL generated from the queries. Then maybe some scenarios can at least be eliminated.
EDIT: I think it must be the Device.uid field. Possibly Django or pyodbc is getting confused by the non-default primary key and is fetching all the devices while generating the query. Try two things:
Replace device__in with device_id__in, device__pk__in and device__uid__in and check each one again. Maybe a more explicit query will be easier for Django to translate into SQL. You can even try replacing branch with branch_id, just in case.
If the above doesn't work, try replacing the exclude expression with a raw SQL where clause:
# add quotes (because of the hyphens) & join
badDevicesIdString = ", ".join(["'%s'" % id for id in badDevices])
# Replaces .exclude()
... .extra(where=['device_id NOT IN (%s)' % badDevicesIdString])
If neither works, then most likely the problem is with the whole query and not just exclude. There are some more options in that case but try the above first and I will update my answer later if necessary.
I just want to share a similar problem I had with MySQL and exclude clause performance, and how it was fixed.
When running the exclude clause, the list used in the "in" lookup was actually a QuerySet that I had obtained with the values_list method. Looking at the exclude query executed by MySQL, the "in" arguments were not values but another query. This behaviour was hurting performance on certain large queries.
To fix that, instead of passing the queryset, I flattened it into a plain Python list of values. That way each value is passed as a literal argument inside the in lookup, and the performance improved considerably.
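As a rough sketch of that fix applied to the models above (the Device filter is hypothetical; the point is forcing evaluation with list() before the exclude):
from datetime import timedelta
from django.db.models import F

# Evaluate the "bad devices" queryset up front so Django passes literal
# values into the NOT IN (...) clause instead of nesting another query.
badDevices = list(Device.objects.filter(type='bad').values_list('uid', flat=True))

sessionCount = (Session.objects.filter(branch=branch)
                .exclude(device__in=badDevices)
                .filter(end__gte=F('start') + timedelta(minutes=30))
                .count())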

How to use indexed properties of NodeModels in cypher queries of Neo4django?

I'm a newbie to Django as well as Neo4j. I'm using Django 1.4.5, Neo4j 1.9.2 and neo4django 0.1.8.
I've created a NodeModel for a person node and indexed it on the 'owner' and 'name' properties. Here is my models.py:
from neo4django.db import models as models2

class person_conns(models2.NodeModel):
    owner = models2.StringProperty(max_length=30, indexed=True)
    name = models2.StringProperty(max_length=30, indexed=True)
    gender = models2.StringProperty(max_length=1)
    parent = models2.Relationship('self', rel_type='parent_of', related_name='parents')
    child = models2.Relationship('self', rel_type='child_of', related_name='children')

    def __unicode__(self):
        return self.name
Before connecting to the Neo4j server, I enabled auto-indexing and set the indexable keys in the conf/neo4j.properties file as follows:
# Autoindexing
# Enable auto-indexing for nodes, default is false
node_auto_indexing=true
# The node property keys to be auto-indexed, if enabled
node_keys_indexable=owner,name
# Enable auto-indexing for relationships, default is false
relationship_auto_indexing=true
# The relationship property keys to be auto-indexed, if enabled
relationship_keys_indexable=child_of,parent_of
I followed "Neo4j: Step by Step to create an automatic index" to update the above file and manually create node_auto_index on the Neo4j server.
Below are the indexes that exist on the Neo4j server after running Django's syncdb against the Neo4j database and manually creating the auto indexes:
graph-person_conns lucene
{"to_lower_case":"true", "_blueprints:type":"MANUAL","type":"fulltext"}
node_auto_index lucene
{"_blueprints:type":"MANUAL", "type":"exact"}
As suggested in https://github.com/scholrly/neo4django/issues/123, I used connection.cypher(queries) to query the Neo4j database.
For example:
listpar = connection.cypher("START no=node(*) RETURN no.owner?, no.name?",raw=True)
The above returns the owner and name of all nodes correctly. But when I try to query on the indexed properties instead of a node number or '*', as in:
listpar = connection.cypher("START no=node:node_auto_index(name='s2') RETURN no.owner?, no.name?",raw=True)
The above gives 0 rows.
listpar = connection.cypher("START no=node:graph-person_conns(name='s2') RETURN no.owner?, no.name?",raw=True)
The above gives:
Exception Value:
Error [400]: Bad Request. Bad request syntax or unsupported method.
Invalid data sent: (' expected but-' found after graph
I tried other strings like name and person_conns instead of graph-person_conns, but each time it gives an error that the particular index does not exist. Am I making a mistake while adding the indexes?
My project mainly depends on filtering the nodes based on properties, so this part is really essential. Any pointers or suggestions would be appreciated. Thank you.
This is my first post on Stack Overflow, so in case of any missing information or confusing statements, please be patient. Thank you.
UPDATE:
Thank you for the help. For the benefit of others I would like to give an example of how to use Cypher queries to traverse/find the shortest path between two nodes.
from neo4django.db import connection
results = connection.cypher("START source=node:`graph-person_conns`(person_name='s2sp1'),dest=node:`graph-person_conns`(person_name='s2c1') MATCH p=ShortestPath(source-[*]->dest) RETURN extract(i in nodes(p) : i.person_name), extract(j in rels(p) : type(j))")
This finds the shortest path between the nodes named s2sp1 and s2c1 in the graph. Cypher queries are really cool and let you traverse nodes while limiting the hops, relationship types, etc.
Can someone comment on the performance of this method? Also, please suggest if there are any other efficient methods to access Neo4j from Django. Thank you :)
Hm, why are you using Cypher? neo4django QuerySets work just fine for the above if you set the properties to indexed=True (or not, it'll just be slower for those).
people = person_conns.objects.filter(name='n2')
The neo4django docs have some other querying examples, as do the Django docs. Neo4django executes those queries as Cypher on the backend - you really shouldn't need to drop down to writing the Cypher yourself unless you have a very particular traversal pattern or a performance issue.
Anyway, to tackle your question more directly - the last example you used needs backticks to escape the index name, like
listpar = connection.cypher("START no=node:`graph-person_conns`(name='s2') RETURN no.owner?, no.name?",raw=True)
The first example should work. One thought - did you flip the autoindexing on before or after saving the nodes you're searching for? If after, note that you'll have to manually reindex the nodes, either using the Java API or by re-setting properties on the nodes, since they won't have been autoindexed.
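A rough sketch of the re-setting approach (this assumes the person_conns model from the question; re-saving each node rewrites its properties so the auto-indexer picks them up):
# Hypothetical one-off reindexing pass for nodes created before
# auto-indexing was enabled: re-assign the indexed properties and save.
for p in person_conns.objects.all():
    p.owner = p.owner
    p.name = p.name
    p.save()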
HTH, and welcome to StackOverflow!
