How to indicate the database in SparkSQL over Hive in Spark 1.3

I have a simple Scala code that retrieves data from the Hive database and creates an RDD out of the result set. It works fine with HiveContext. The code is similar to this:
val hc = new HiveContext(sc)
val mySql = "select PRODUCT_CODE, DATA_UNIT from account"
hc.sql("use myDatabase")
val rdd = hc.sql(mySql).rdd
The version of Spark I'm using is 1.3. The problem is that the default setting for hive.execution.engine is 'mr', which makes Hive use MapReduce, and that is slow. Unfortunately I can't force it to use "spark".
I tried to use SQLContext instead, replacing the above with hc = new SQLContext(sc), to see if performance would improve. With this change, the line
hc.sql("use myDatabase")
is throwing the following exception:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier use found
use myDatabase
^
The Spark 1.3 documentation says that SparkSQL can work with Hive tables. My question is how to indicate that I want to use a certain database instead of the default one.

use database is supported in later Spark versions:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/use-database.html
You need to put the statement in two separate spark.sql calls like this:
spark.sql("use mydb")
spark.sql("select * from mytab_in_mydb").show

Go back to creating the HiveContext. The HiveContext gives you the ability to create a DataFrame using Hive's metastore. Spark only uses the metastore from Hive; it doesn't use Hive as a processing engine to retrieve the data. So when you create the DataFrame from your SQL query, it's really just asking Hive's metastore "Where is the data, and what is its format?"
Spark takes that information and runs the processing against the underlying data on HDFS. So Spark is executing the query, not Hive.
When you create the SQLContext instead, you remove the link between Spark and the Hive metastore, so the error is saying it doesn't understand what you want to do.
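A minimal sketch of what that looks like in practice, reusing the database and table names from the question: keep the HiveContext and either switch the current database or qualify the table name directly.
val hc = new HiveContext(sc)
// Option 1: switch the current database first
hc.sql("use myDatabase")
val rdd1 = hc.sql("select PRODUCT_CODE, DATA_UNIT from account").rdd
// Option 2: qualify the table with the database name
val rdd2 = hc.sql("select PRODUCT_CODE, DATA_UNIT from myDatabase.account").rdd
In both cases Spark only reads the table's location and format from the Hive metastore and executes the query itself.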

I have not been able to implement the use database command, but here is a workaround to use the desired database:
spark-shell --queue QUEUENAME;
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val res2 = sqlContext.sql("select count(1) from DB_NAME.TABLE_NAME")
res2.collect()

Related

Is there any way we can parse a string expression in the Apache Flink Table API?

I am trying to perform aggregation with the Flink Table API by accepting the group-by field and the field aggregation expressions as string parameters from the user.
Input:
GroupBy field = department
Aggregation field expressions = count(employeeId), max(salary)
Is there any way we can do this using the Flink Table API? I tried the following, but it didn't help. Does Flink have anything equivalent to the selectExpr function in Spark?
https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.selectExpr.html
employeeTable
    .groupBy($("department"))
    .select(
        $("department"),
        $("count(employeeId)").as("numberOfEmployees"),
        $("max(salary)").as("maxSalary")
    )
It is throwing the following exception
Exception in thread "main" org.apache.flink.table.api.ValidationException: Cannot resolve field [count(employeeId)], input field list:[department].
No, I don't believe this will work. Flink's SQL planner wants to know what the query is doing at compile time.
What you can do is construct a SQL query and create a new job to run that query. The SQL gateway that's coming in Flink 1.16 (see FLIP-91) should make this easier.
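As a rough sketch of that approach (assuming a TableEnvironment named tableEnv with a registered employee table; the parameter names are made up for illustration), you can build the SQL string from the user-supplied parts and hand it to the SQL planner:
// Hypothetical user-supplied parameters
val groupByField = "department"
val aggExpressions = "count(employeeId) AS numberOfEmployees, max(salary) AS maxSalary"
// Build the query string and let Flink's SQL planner parse it
val query = s"SELECT $groupByField, $aggExpressions FROM employee GROUP BY $groupByField"
val result = tableEnv.sqlQuery(query)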
I think you have the wrong syntax here:
.select(
    $("department"),
    $("count(employeeId)").as("numberOfEmployees"),
    $("max(salary)").as("maxSalary")
)
You should call count and max like this:
$("employeeId").count().as("numberOfEmployees"),
$("salary").max().as("maxSalary")
You can check the list of built-in functions in the Flink Table API documentation.
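Putting the corrected calls together, the pipeline from the question would then look roughly like this (same employeeTable as above):
employeeTable
    .groupBy($("department"))
    .select(
        $("department"),
        $("employeeId").count().as("numberOfEmployees"),
        $("salary").max().as("maxSalary")
    )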

Is there a way to find the forced execution context on datastore objects somewhere in the ODI metadata database?

I have an ODI 12c project with 30 mappings. I need to check that every "Component context" on every datastore object (source or target) is set to "Execution context" (not forced).
Is there a way to achieve this by querying the underlying ODI database, so I don't have to do it manually and can avoid possible mistakes?
I have a list of ODI 12c repository tables and comments on the table columns, which I got from the Oracle support website, and after hours of digging through the database I still can't find this information stored in any table.
My package is located in SNP_PACKAGE, SNP_MAPPING has info about mappings, and SNP_MAP_COMP describes the objects in a mapping.
I have searched through many different tables as well.
A bit late, but for anyone else looking: messing with the tables directly is a no-no; the APIs are better, especially if you are going to modify anything.
https://docs.oracle.com/en/middleware/data-integrator/12.2.1.3/odija/index.html
Run the following Groovy script in ODI (Tools/Groovy/New Script). It should be simple enough to modify. Using the SDK gets a lot easier if you manage to set up a complete development environment in IntelliJ or another Java IDE. Groovy in ODI opens up a whole new world.
//Created by DI Studio
import oracle.odi.domain.mapping.Mapping
import oracle.odi.domain.mapping.finder.IMappingFinder
tme = odiInstance.getTransactionalEntityManager()
IMappingFinder mapf = (IMappingFinder) tme.getFinder(Mapping.class)
Collection<Mapping> mappings = mapf.findByProject("PROJECT","FOLDER")
println("Found ${mappings.size()} mappings")
mappings.each { map ->
    map.physicalDesigns.each { phys ->
        phys.physicalNodes.each { node ->
            println("${map.project.name}...${map.parentFolder.parentFolder?.name}.${map.parentFolder.name}.${map.name}.${phys.name}.${node.name}.defaultContext=${(node.context.defaultContext) ? "default" : node.context.name}")
        }
    }
}
It prints default or the set (forced) context. Seems forced context has been deprecated in 12c. Physical.node.context.defaultContext seems to mirror Component Context (Forced) in ODI Studio 12.2.1.3.
Update 2019-12-20 - including getExecutionContextName
The following script lists the same information in a hierarchical manner and may be easier to read. I'm not sure it gives you exactly what you were originally after without having a mapping with your exact setup.
//Created by DI Studio
import oracle.odi.domain.mapping.Mapping
import oracle.odi.domain.mapping.finder.IMappingFinder
import oracle.odi.domain.mapping.component.DatastoreComponent
tme = odiInstance.getTransactionalEntityManager()
String project = "PROJECT"
String parentFolder = "PARENT_FOLDER"
IMappingFinder mapf = (IMappingFinder) tme.getFinder(Mapping.class)
Collection<Mapping> mappings = mapf.findByProject(project, parentFolder)
println("Found ${mappings.size()} mappings")
println "Project: ${project}"
mappings.each { map ->
    println "\tMapping: ..${map.parentFolder.parentFolder?.name}/${map.parentFolder.name}/${map.name}"
    map.physicalDesigns.each { phys ->
        println "\t\tPhysical: ${phys.name}"
        phys.physicalNodes.each { node ->
            println "\t\t\tNode: ${node.name}"
            println "\t\t\t\tdefaultContext: ${(node.context.defaultContext)}"
            println "\t\t\t\tNode context name: ${node.context.name}"
            println "\t\t\t\tDatastoreComponent ExecutionContextName: ${DatastoreComponent.getDatastoreComponent(node)?.getExecutionContextName(node).toString()}"
        }
    }
}
The tables and columns referenced below might hold the value you are looking for. They are from ODI 12.1.2; depending on the exact ODI version you are using, the structure could be a little different.
Here is a query to retrieve this information directly from the database.
-- Forced contexts on datastores in a mapping
SELECT MAPP.NAME MAP_NAME,
       MAPP_COMP.NAME DATASTORE_NAME,
       MAPP_REF.QUALIFIED_NAME FORCE_CONTEXT
FROM SNP_MAPPING MAPP
INNER JOIN SNP_MAP_REF MAPP_REF
        ON MAPP_REF.I_OWNER_MAPPING = MAPP.I_MAPPING
INNER JOIN SNP_MAP_PROP MAPP_PROP
        ON MAPP_REF.I_MAP_REF = MAPP_PROP.I_PROP_XREF_VALUE
INNER JOIN SNP_MAP_COMP MAPP_COMP
        ON MAPP_COMP.I_MAP_COMP = MAPP_PROP.I_MAP_COMP
WHERE MAPP_REF.ADAPTER_INTF_TYPE = 'IContext'
  AND MAPP.NAME LIKE '%yourMapping%'

How to look at the SQL generated by Entity Framework when Count is used

When I have a query generated like this:
var query = from x in Entities.SomeTable
select x;
I can set a breakpoint and, after hovering the cursor over query, see the SQL command that will be sent to the database. Unfortunately I cannot do that when I use Count:
var query = (from x in Entities.SomeTable
             select x).Count();
Of course I could see what reaches SQL Server using the profiler, but maybe someone has an idea how to do it (if it is possible) in VS.
You can use ToTraceString() on the underlying ObjectQuery. Note that Count() executes immediately and returns an int, so the original assignment won't compile; trace the query before applying Count() (the trace shows the inner SELECT, the COUNT is composed when Count() runs):
var query = from x in Entities.SomeTable
            select x;
Console.WriteLine(((ObjectQuery<SomeTable>)query).ToTraceString());
You can use Database.Log to log any query made, like this:
using (var context = new MyContext())
{
    context.Database.Log = Console.Write;
    // Your code here...
}
Usually, in my context's constructor, I set that to my logger (whether it is NLog, log4net, or the stock .NET loggers) rather than the console, but the actual logging tool is irrelevant. See the Entity Framework documentation on Database.Log for more information.
In EF6 and above, you can use the following before your query:
context.Database.Log = s => System.Diagnostics.Debug.WriteLine(s);
I've found this to be quicker than pulling up SQL Profiler and running a trace.
Also, this post talks more about this topic:
How do I view the SQL generated by the Entity Framework?

HiveQL to HBase

I am using Hive 0.14 and HBase 0.98.8.
I would like to use HiveQL to access an HBase "table".
I created a table with a complex composite rowkey:
CREATE EXTERNAL TABLE db.hive_hbase (rowkey struct<p1:string, p2:string, p3:string>, column1 string, column2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ';'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,cf:c1,cf:c2")
TBLPROPERTIES("hbase.table.name"="hbase_table");
The table is created successfully, but the HiveQL query takes forever:
SELECT * from db.hive_hbase WHERE rowkey.p1 = 'xyz';
Queries that don't use the rowkey are fine, and using the HBase shell with filters also works.
I can't find anything in the logs, but I assume there could be an issue with complex composite keys and performance.
Did anybody face the same issue? Any hints to solve it, or other ideas on what I could try?
Thank you
Update 16.07.15:
I changed the log4j properties to 'DEBUG' and found some interesting information. It says:
2015-07-15 15:56:41,232 INFO ppd.OpProcFactory (OpProcFactory.java:logExpr(823)) - Pushdown Predicates of FIL For Alias : hive_hbase
2015-07-15 15:56:41,232 INFO ppd.OpProcFactory (OpProcFactory.java:logExpr(826)) - (rowkey.p1 = 'xyz')
But some lines later:
2015-07-15 15:56:41,430 DEBUG ppd.OpProcFactory (OpProcFactory.java:pushFilterToStorageHandler(1051)) - No pushdown possible for predicate: (rowkey.p1 = 'xyz')
So my guess is: HiveQL over HBase does not do any predicate pushdown to HBase but rather starts a MapReduce job.
Could there be a bug with the predicate pushdown?
I tried a similar situation using Hive 0.13 and it works fine; I got the result. What version of Hive are you working on?

Lightweight Groovy persistence

What are some lightweight options for persistence in Groovy? I've considered serialization and XML so far but I want something a bit more robust than those, at least so I don't have to rewrite the entire file every time. Ideally, it would:
Require no JARs in classpath, using Grapes instead
Require no external processes, administration, or authentication (so all embedded)
Support locking
I plan on using it to cache some information between runs of a standalone Groovy script. I imagine responses will focus around SQL and NoSQL databases. Links to pages demonstrating this usage would be appreciated. Thanks!
Full SQL Database
The H2 in-process SQL database is very easy to use. It is the same database engine Grails uses by default, but it's simple to use from a Groovy script as well:
@GrabConfig(systemClassLoader=true)
@Grab(group='com.h2database', module='h2', version='1.3.167')
import groovy.sql.Sql
def sql = Sql.newInstance("jdbc:h2:hello", "sa", "sa", "org.h2.Driver")
sql.execute("create table test (id int, value text)")
sql.execute("insert into test values(:id, :value)", [id: 1, value: 'hello'])
println sql.rows("select * from test")
In this case the database will be saved to a file called hello.h2.db.
Simple Persistent Maps
Another alternative is jdbm, which provides disk-backed persistent maps. Internally, it uses Java's serialization. The programming interface is much simpler, but it's also much less powerful than a full-blown SQL db. There's no support for concurrent access, but it is synchronized and thread safe, which may be enough depending on your locking requirements. Here's a simple example:
@Grab(group='org.fusesource.jdbm', module='jdbm', version='2.0.1')
import jdbm.*
def recMan = RecordManagerFactory.createRecordManager('hello')
def treeMap = recMan.treeMap("test")
treeMap[1] = 'hello'
treeMap[100] = 'goodbye'
recMan.commit()
println treeMap
This will save the map to a set of files.
Just a little Groovy update on simple persistence using JDBM: concurrent access is supported now, and the name has changed from JDBM4 to MapDB.
@Grab(group='org.mapdb', module='mapdb', version='0.9.3')
import java.util.concurrent.ConcurrentNavigableMap
import org.mapdb.*
DB db = DBMaker.newFileDB(new File("myDB.file"))
    .closeOnJvmShutdown()
    .make()
ConcurrentNavigableMap<String,String> map = db.getTreeMap("myMap")
map.put("1", "one")
map.put("2", "two")
db.commit()
println "keySet "+map.keySet()
assert map.get("1") == "one"
assert map.get("2") == "two"
db.close()
Chronicle Map is a persisted ConcurrentMap implementation for the JVM.
Usage example:
import java.util.concurrent.ConcurrentMap
// requires the Chronicle Map dependency (net.openhft:chronicle-map)
import net.openhft.chronicle.map.ChronicleMap

ConcurrentMap<String, String> store = ChronicleMap
.of(String.class, String.class)
.averageKey("cachedKey").averageValue("cachedValue")
.entries(10_000)
.createPersistedTo(new File("cacheFile"))
store.put("foo", "bar")
store.close()
I am a little late to the party, but for the sake of posterity, here is one more option:
gstorm
A simple ORM for databases and CSV files, intended to be used in Groovy scripts and small projects.
Disclosure: author here :)
