Query in batch? - google-app-engine

Using gcloud-node is it possible to query in batches (multiple queries in 1 network call)? I know it is possible to get and delete in batches, but can I do the same with queries somehow?

No, Cloud Datastore doesn't support query batches, so none of the client libraries can do this.
The alternative is to send the queries in parallel instead.
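A minimal sketch of the "send queries in parallel" idea, written in Python for illustration. Here `run_query` is a hypothetical stand-in that only simulates a network round trip; it is not part of any Datastore client library. In Node.js (which gcloud-node targets) the same pattern would typically be expressed with `Promise.all`.

```python
import asyncio


async def run_query(kind):
    """Hypothetical stand-in for one Datastore query; replace with a real client call."""
    await asyncio.sleep(0.01)  # simulates network latency
    return [f"{kind}-entity"]


async def run_queries_in_parallel(kinds):
    # gather() starts every query at once and returns the results in input order.
    return await asyncio.gather(*(run_query(kind) for kind in kinds))


results = asyncio.run(run_queries_in_parallel(["User", "Order", "Product"]))
```

Each query still costs its own network call; the gain is that the calls overlap instead of running one after another.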

Related

Flink SQL behavior

I want to execute Flink SQL on batch data. (CSVs in S3)
However, I explicitly want Flink to execute my query in a streaming fashion because I think it will be faster than the batch mode.
For example, my query filters two tables and joins the filtered results. I don't want Flink to materialize the two tables in a blocking batch fashion and then pipe the results through the join; I want it to use a streaming hash join operator, as in the DataStream API.
How do I make this happen? I am using PyFlink.
See https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution_mode/ for how to set the execution mode of a Flink application. Combine this with https://nightlies.apache.org/flink/flink-docs-master/docs/dev/python/python_config/, which explains how to specify configuration options in Python applications.
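As a rough sketch of what that could look like in PyFlink, assuming a recent PyFlink version in which `EnvironmentSettings.in_streaming_mode()` is available. The function name `make_streaming_table_env` is made up for this example, and the import sits inside the function only so that the snippet loads on machines without PyFlink installed.

```python
def make_streaming_table_env():
    """Create a PyFlink TableEnvironment that runs in streaming mode."""
    # Imported lazily (an assumption of this sketch) so the module loads
    # even where PyFlink is not installed.
    from pyflink.table import EnvironmentSettings, TableEnvironment

    # Streaming mode is chosen when the environment is created.
    settings = EnvironmentSettings.in_streaming_mode()
    return TableEnvironment.create(settings)
```

SQL submitted through the returned environment is planned in streaming mode even when the sources are bounded files. Whether that is actually faster than batch mode for a given query is something to measure rather than assume.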

Set up a Flink application with multiple different SQL queries (aggregations)

I need to build a pipeline with different column aggregations from the same source: say, one by userId, another by productId, and so on. I also want aggregations at different granularities, say hourly and daily. Each aggregation will have a different sink, say a different NoSQL table.
Building a SQL query with the Table API seems simple, but I would like to reduce the operational overhead of managing too many Flink apps. So I am thinking of putting all the different SQL queries in one PyFlink app.
This is the first time I am building a Flink app, so I am not sure how feasible this is. In particular, I'd like to know:
1. Reading the Flink docs, I see there are the concepts of application and job. Is each SQL aggregation query a single Flink job?
2. Will overall performance degrade because of too many queries in one Flink app?
3. Since the queries share the same source (Kinesis), will each query get a copy of the source? Basically, I want to make sure each event is processed by every SQL aggregation query.
Thanks!
You can put multiple queries into one job if you use a statement set:
https://docs.ververica.com/user_guide/sql_development/sql_scripts.html
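A hedged sketch of the statement-set approach in PyFlink. The table names (`events`, `user_counts`, `product_counts`) and both helper functions are hypothetical; the source and sink tables are assumed to be already registered in `t_env`, and the hourly/daily windowing from the question is left out for brevity.

```python
def build_insert_statements(source, sinks):
    """Build one INSERT statement per (sink table, grouping column) pair."""
    return [
        f"INSERT INTO {sink} SELECT {col}, COUNT(*) AS cnt FROM {source} GROUP BY {col}"
        for sink, col in sinks.items()
    ]


def submit_as_one_job(t_env, statements):
    """Add every INSERT to a single statement set and submit the set once."""
    stmt_set = t_env.create_statement_set()
    for sql in statements:
        stmt_set.add_insert_sql(sql)
    return stmt_set.execute()


# Hypothetical table and column names, for illustration only.
statements = build_insert_statements(
    "events", {"user_counts": "userId", "product_counts": "productId"}
)
```

Calling `submit_as_one_job(t_env, statements)` with a real `TableEnvironment` submits all the INSERTs together, which is what lets several aggregations run as one job instead of one job per query.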

Synchronizing data from MSSQL to Elasticsearch using Apache Kafka

I'm currently running a text search in SQL Server, which is becoming a bottleneck, and I'd like to move things to Elasticsearch. However, I know that I have to denormalize the data for best performance and scalability.
Currently, my text search includes some aggregation and joins multiple tables to get the final output. The joined tables aren't that big (up to 20 GB per table), but they change (inserts, updates, deletes) irregularly: two of them once a week, the other one on demand, several times a day.
My plan is to use Apache Kafka together with Kafka Connect to read CDC from my SQL Server, join this data in Kafka, and persist it in Elasticsearch. However, I cannot find any material telling me how deletes are handled when the data is persisted to Elasticsearch.
Is this even supported by the default driver? If not, what are the possibilities? Apache Spark, Logstash?
I am not sure whether this is possible in Kafka Connect yet, but it seems it can be solved with NiFi.
If I understand the need correctly, here is the documentation for deleting Elasticsearch records with one of the standard NiFi processors:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-elasticsearch-5-nar/1.5.0/org.apache.nifi.processors.elasticsearch.DeleteElasticsearch5/

Need Suggestions: Utilizing columnar database

I am working on a project for a high-performance dashboard where results are mostly aggregated, mixed with non-aggregated data. The first page is loaded by 8 different complex queries that return mixed data. The dashboard is served by a centralized database (Oracle 11g), which receives data from many systems in real time (using a replication tool). The data shown is produced by very complex queries (multiple joins, counts, group by, and many where conditions).
The issue is that as the data grows, DB queries are taking more time than defined/agreed. I am thinking of moving the aggregation functionality (all the counts) to a columnar database, say HBase, while the remaining linear data would still be fetched from Oracle. Both sets of data would be merged on a key in the app layer. I need expert opinions on whether this is the correct approach.
There are a few things which are not clear to me:
1. Will Sqoop be able to load data based on a query/view, or only from tables? On a continuous basis or only one time?
2. If a record is modified (e.g. its status is changed), how will HBase get to know?
My two cents: HBase is a NoSQL database built for fast lookup queries, not for aggregated, ad-hoc queries.
If you are planning to use a Hadoop cluster, you can try Hive with the Parquet storage format. If you need near real-time queries, you can go with an MPP database. Commercial options are Vertica or maybe Redshift from Amazon. For an open-source solution, you can use Infobright.
These columnar options are going to give you great aggregate query performance.

Can I run queries in parallel in SQL?

I know that .NET provides parallel programming, but I don't know whether it is possible to run queries in parallel in SQL Server. If it is possible, please give me an example of a parallel query or a link describing the technology.
What do you mean by parallel?
Multiple queries at the same time? How do you think SQL Server handles multiple users? Open separate connections and run the queries on them.
One query? Let SQL Server parallelize it, as it does automatically and as described in the documentation.
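To illustrate the "separate connections" option, here is a small Python sketch in which each query runs on its own connection in its own thread. It uses the standard-library `sqlite3` module with in-memory databases purely as a stand-in so the example runs anywhere; with SQL Server you would open the connections through a SQL Server driver instead. The function name and the queries are made up for the example.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def run_on_own_connection(sql):
    """Open a dedicated connection, run one query, and return its rows."""
    conn = sqlite3.connect(":memory:")  # stand-in for a SQL Server connection
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


queries = ["SELECT 1 + 1", "SELECT 2 * 3", "SELECT 10 - 4"]

# One worker thread per query; map() returns results in the order of `queries`.
with ThreadPoolExecutor(max_workers=len(queries)) as pool:
    results = list(pool.map(run_on_own_connection, queries))
```

Because every query has its own connection, the server sees them as independent sessions, just as it would see several users.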
This may help you or not, but opening two instances of SSMS works too.
This has been covered a number of times. Try this.
Parallel execution of multiple SQL Select queries
If you're using .NET 4.5, then the new async methods would be a cleaner approach.
Using SqlDataReader’s new async methods in .Net 4.5
Remember that doing this will make your client more responsive at the cost of placing more load on your SQL Server. Rather than each client sending a single SQL command at a time, it will send several, so to SQL Server it will appear as though there are many more clients accessing it.
The extra load on the client will be minimal: it will use more threads, but most of the time these will simply be waiting for results to return from SQL Server.
Short answer: yes. Here's a technical link: http://technet.microsoft.com/en-us/library/ms178065(v=sql.105).aspx
Parallelism is baked into SQL Server; whenever a query exceeds a certain cost, the optimizer will generate a parallel execution plan.
