Pervasive Data Integrator v10 - Transformation statistics (insert/update count)

I am trying to find a way to return statistics from transformations in Pervasive Data Integrator v10. The documentation here (http://help.pervasive.com/display/DI1025/Transformation+Object) shows that WrittenCount can be returned, and when I run the process with the debug error log I can see that the counts are tracked:
Execution Statistics: [Map] Total records read: 128
Execution Statistics: [Map] Total records written: 128
Execution Statistics: [Map] Total records inserted: 128
Execution Statistics: [Map] Total records updated: 0
I am trying to find a way to reference this data via a macro or some other method so that I can pass it back to the database.

Project("Map").OutputRecordCount ==> return Count of rows output.
Other Objects are ReturnCode and status; however, OutputRecordCount is what you want.

I don't understand how some parameters of Query Store in MS SQL Server work

According to the official documentation, sys.database_query_store_options exposes options that adjust Query Store behavior and performance.
From documentation:
"flush_interval_seconds - The period for regular flushing of Query Store data to disk in seconds. Default value is 900 (15 min)"
"interval_length_minutes - The statistics aggregation interval in minutes. Arbitrary values are not allowed. Use one of the following: 1, 5, 10, 15, 30, 60, and 1440 minutes. The default value is 60 minutes."
And now I have a problem:
If Query Store flushes data to disk every 15 minutes, why do I see a query in the QS tables within seconds of its execution?
As I understand it, the QS tables are 'permanent' and stored in the database (on disk), so how does this parameter (flush_interval_seconds) work?
The same goes for interval_length_minutes: when I saved the QS output 1 minute after the last query execution and again after 61 minutes, I realised they were more or less the same, so what does this aggregation actually do?
flush_interval_seconds - the period for regular flushing of Query Store data to disk, in seconds. It governs flushing from memory to disk so that the information isn't lost after a server restart; until the flush happens, you simply read the info from memory.
interval_length_minutes - the aggregation interval for query runtime statistics. The lower it is, the finer the granularity of the runtime statistics becomes.
Neither option sets a period after which the info becomes available.
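To make this concrete, here is a minimal T-SQL sketch (the database name MyDb is hypothetical) showing where the two settings live, how they map to the ALTER DATABASE options DATA_FLUSH_INTERVAL_SECONDS and INTERVAL_LENGTH_MINUTES, and how the runtime statistics are bucketed per interval:

-- Inspect the current Query Store settings for the active database.
SELECT actual_state_desc, flush_interval_seconds, interval_length_minutes
FROM sys.database_query_store_options;

-- Change both settings (MyDb is a placeholder database name).
ALTER DATABASE [MyDb]
SET QUERY_STORE (DATA_FLUSH_INTERVAL_SECONDS = 900, INTERVAL_LENGTH_MINUTES = 60);

-- Force the in-memory Query Store data to disk right away,
-- instead of waiting for the next regular flush.
EXEC sys.sp_query_store_flush_db;

-- Runtime statistics are aggregated per interval; each row covers one bucket
-- whose length is interval_length_minutes.
SELECT i.start_time, i.end_time, rs.plan_id, rs.count_executions, rs.avg_duration
FROM sys.query_store_runtime_stats AS rs
JOIN sys.query_store_runtime_stats_interval AS i
  ON rs.runtime_stats_interval_id = i.runtime_stats_interval_id;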

Why does the log always say "No Data Available" while the cube is being built?

In the sample case from the official Kylin website, when I build the cube, the first step (Create Intermediate Flat Hive Table) always shows No Data Available in the log and its status is always running.
The cube build has been running for more than three hours.
I checked the Hive table kylin_sales and there is data in it.
I also found that the intermediate flat Hive table kylin_intermediate_kylin_sales_cube_402e3eaa_dfb2_7e3e_04f3_07248c04c10c
has been created successfully in Hive, but there is no data in it.
hive> show tables;
OK
...
kylin_intermediate_kylin_sales_cube_402e3eaa_dfb2_7e3e_04f3_07248c04c10c
kylin_sales
...
Time taken: 9.816 seconds, Fetched: 10000 row(s)
hive> select * from kylin_sales;
OK
...
8992 2012-04-17 ABIN 15687 0 13 95.5336 17 10000975 10000507 ADMIN Shanghai
8993 2013-02-02 FP-non GTC 67698 0 13 85.7528 6 10000856 10004882 MODELER Hongkong
...
Time taken: 3.759 seconds, Fetched: 10000 row(s)
The deploy environment is as follows:
 
zookeeper-3.4.14
hadoop-3.2.0
hbase-1.4.9
apache-hive-2.3.4-bin
apache-kylin-2.6.1-bin-hbase1x
openssh5.3
jdk1.8.0_144
I deployed the cluster through Docker and created 3 containers: one master and two slaves.
The Create Intermediate Flat Hive Table step is still running.
No Data Available means this step's log has not yet been captured by Kylin. Usually the log is only recorded once the step has exited (succeeded or failed); then you will see the data.
In this case it usually indicates the job is pending in Hive, which can happen for many reasons. The simplest way to diagnose it is to watch Kylin's log: you will see the Hive command that Kylin executes, and you can then run it manually in a console to reproduce the problem. Also check whether your Hive/Hadoop cluster has enough resources (CPU, memory) to execute such a query.
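As a quick sanity check (a sketch that reuses the intermediate table name from the question), you can also confirm in the Hive CLI whether the flat table has received any rows yet; if it still returns 0 while the step shows running, the statement that populates it is still pending in Hive:

hive> SELECT COUNT(*) FROM kylin_intermediate_kylin_sales_cube_402e3eaa_dfb2_7e3e_04f3_07248c04c10c;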

Presto running slower than SQL Server

I configured the SQL Server connector in Presto and tried a few simple queries like:
Select count(0) from table_name
or,
Select sum(column_name) from table_name
Both of the above queries run in SQL Server in about 300 ms, but in Presto they take over 3 minutes.
This is the EXPLAIN ANALYZE of the second query (it seems to do a table scan and fetch a huge amount of data before doing the sum). Why can't the sum operator be pushed down to SQL Server itself?
Query Plan
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Fragment 1 [SINGLE]
Cost: CPU 2.98ms, Input: 1 row (9B), Output: 1 row (9B)
Output layout: [sum]
Output partitioning: SINGLE []
- Aggregate(FINAL) => [sum:double]
Cost: ?%, Output: 1 row (9B)
Input avg.: 1.00 lines, Input std.dev.: 0.00%
sum := "sum"("sum_4")
- LocalExchange[SINGLE] () => sum_4:double
Cost: ?%, Output: 1 row (9B)
Input avg.: 0.06 lines, Input std.dev.: 387.30%
- RemoteSource[2] => [sum_4:double]
Cost: ?%, Output: 1 row (9B)
Input avg.: 0.06 lines, Input std.dev.: 387.30%
Fragment 2 [SOURCE]
Cost: CPU 1.67m, Input: 220770667 rows (1.85GB), Output: 1 row (9B)
Output layout: [sum_4]
Output partitioning: SINGLE []
- Aggregate(PARTIAL) => [sum_4:double]
Cost: 0.21%, Output: 1 row (9B)
Input avg.: 220770667.00 lines, Input std.dev.: 0.00%
sum_4 := "sum"("total_base_dtd")
- TableScan[sqlserver:sqlserver:table_name:ivpSQLDatabase:table_name ..
Cost: 99.79%, Output: 220770667 rows (1.85GB)
Input avg.: 220770667.00 lines, Input std.dev.: 0.00%
total_base_dtd := JdbcColumnHandle{connectorId=sqlserver, columnName=total_base_dtd, columnType=double}
Both example queries are aggregate queries that produce a single-row result.
Currently, Presto cannot push an aggregation down to the underlying data store. Conditions and column selection (narrowing projections) are pushed down, but aggregations are not.
As a result, when you query SQL Server from Presto, Presto needs to read all the data (from the given column) to do the aggregation, so there is a lot of disk and network traffic. It may also be that SQL Server can optimize away certain aggregations and skip reading the data entirely (I am guessing here).
Presto is not suited to be a frontend to some other database. It can be used as such, but this has some implications. Presto shines when it is put to work as a big data query engine (over S3, HDFS or other object stores) or as a federated query engine, where you combine data from multiple data stores / connectors.
Edit: there is ongoing work in Presto to improve pushdown, including aggregate pushdown. You can track it at https://github.com/prestosql/presto/issues/18
Presto doesn't support aggregate pushdown, but as a workaround you can create views in the source database (SQL Server in your case) and query those views from Presto, as sketched below.
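A minimal sketch of that workaround, assuming the column total_base_dtd from the plan above, a hypothetical view name v_total_base_dtd_sum, the dbo schema, and a Presto catalog named sqlserver:

-- In SQL Server: create a view that performs the aggregation server-side.
CREATE VIEW dbo.v_total_base_dtd_sum AS
SELECT SUM(total_base_dtd) AS total_base_dtd_sum
FROM dbo.table_name;

-- In Presto: query the view; the SUM now executes inside SQL Server,
-- so only a single row crosses the network.
SELECT total_base_dtd_sum
FROM sqlserver.dbo.v_total_base_dtd_sum;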

Slow search response in Solr

I have a collection with 3 shards containing 5M records with 10 fields each; the index size on disk is less than 1 GB. Each document has one long-valued field that needs to be sorted on in every query.
All the queries are filter queries with one range-query filter, and sorting on the long value has to be applied.
I am expected to get the response in under 50 milliseconds (including elapsed time). However, the actual QTime ranges from 50-100 ms, while the elapsed time varies from 200-350 ms.
Note: I have used docValues for all the fields and configured newSearcher/firstSearcher. Still, I do not see any improvement in response time.
What are the possible tuning options?
Try indexing those values as well; that may help. I am not quite sure, but you can give it a try.

Lucene query performance

I have a Lucene index with close to 480M documents. The size of the index is 36 GB. I ran around 10,000 queries against the index; each query is a boolean AND query with 3 term queries inside, i.e. it has 3 operands which MUST occur. Executing such 3-word queries gives the following latency percentiles:
50th = 16 ms
75th = 52 ms
90th = 121 ms
95th = 262 ms
99th = 76010 ms
99.9th = 76037 ms
Is latency expected to degrade when the number of documents is as high as 480M? All the segments in the index are merged into one segment, and even when they are not merged, the latencies are not very different. Each document has 5-6 stored fields, but as mentioned above, these latencies are for boolean queries that don't access any stored fields and just do a postings-list lookup on 3 tokens.
Any ideas on what could be wrong here?
