Inefficiency in a Flink job like: INSERT INTO hive_table SELECT orgId, 2.0, pdate, '02' FROM users LIMIT 10000, where users is a Kafka table - apache-flink

The job should just pick 10000 messages and finish. Instead it runs forever: it has taken in 78 GB of data and keeps going. I don't know if this is the default behavior. Also, it never commits anything in the sink.
The above is running on Flink 1.12 (Scala 2.12), built against Hive 3.1.2.

The streaming file sink only commits while checkpointing. Perhaps you need to enable and configure checkpointing.
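For example, if the INSERT is submitted through the Table API, a checkpoint interval can be set on the table configuration. The following is a minimal sketch in Scala, assuming the Hive catalog containing hive_table and the Kafka table users is already registered:

import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

val settings = EnvironmentSettings.newInstance().inStreamingMode().build()
val tenv = TableEnvironment.create(settings)

// Without a checkpoint interval, the streaming file/Hive sink never finalizes its files.
tenv.getConfig.getConfiguration.setString("execution.checkpointing.interval", "60 s")

// The statement from the question.
tenv.executeSql(
  "INSERT INTO hive_table SELECT orgId, 2.0, pdate, '02' FROM users LIMIT 10000")

With checkpointing enabled, the sink finalizes its in-progress files on each successful checkpoint, which is when the written data becomes visible downstream.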

Related

How does the file system connector sink work

I am using the following simple code to illustrate the behavior of the file system connector.
I have a few observations that I want to ask about and confirm.
If I don't enable checkpointing, all of the generated part-XXX files contain "inprogress" in the file name. Does that mean these files are not committed? Also, does it mean that if I want to use the file system connector sink, I always need to enable checkpointing so that the generated files can be committed and downstream systems (like Hive or Flink) can discover and read them?
When are the in-progress files moved to normal files in the partition? Does it happen when a new partition is created, or when a checkpoint runs, which then turns the files in the previous partition from in-progress into finished? If so, there may be a delay (the checkpoint interval) before the partition becomes visible.
I have set the rolling interval to 20 seconds in the code, but when I look at the generated part-XXX files, the difference between the creation times of consecutive files is 25 seconds. I thought it should be 20 seconds,
e.g.,
part-90e63e04-466f-45ce-94d4-9781065a8a8a-0-10 2021-01-03 12:39:04
part-90e63e04-466f-45ce-94d4-9781065a8a8a-0-11 2021-01-03 12:39:29
The code is:
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

// MyEvent and InfiniteEventSource are the question's own event class and test source (not shown).
object FileSystemConnectorDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.enableCheckpointing(10 * 1000)
    env.setStateBackend(new FsStateBackend("file:///d:/flink-checkpoints"))

    val ds: DataStream[MyEvent] = env.addSource(new InfiniteEventSource(emitInterval = 5 * 1000))
    val tenv = StreamTableEnvironment.create(env)
    tenv.createTemporaryView("sourceTable", ds)
    ds.print()

    val ddl =
      s"""
      create table sinkTable(
        id string,
        p_day STRING,
        p_hour STRING,
        p_min STRING
      ) partitioned by(p_day, p_hour, p_min) with (
        'connector' = 'filesystem',
        'path' = 'D:/csv-${System.currentTimeMillis()}',
        'format' = 'csv',
        'sink.rolling-policy.check-interval' = '5 s',
        'sink.rolling-policy.rollover-interval' = '20 s',
        'sink.partition-commit.trigger' = 'process-time',
        'sink.partition-commit.policy.kind' = 'success-file',
        'sink.partition-commit.delay' = '0 s'
      )
      """.stripMargin(' ')
    tenv.executeSql(ddl)

    // executeSql submits the INSERT job; env.execute() runs the separate ds.print() pipeline.
    tenv.executeSql(
      """
      insert into sinkTable
      select id, date_format(occurrenceTime, 'yyyy-MM-dd'), date_format(occurrenceTime, 'HH'), date_format(occurrenceTime, 'mm') from sourceTable
      """.stripMargin(' '))

    env.execute()
  }
}
Point 1 is covered in the StreamingFileSink docs:
IMPORTANT: Checkpointing needs to be enabled when using the StreamingFileSink. Part files can only be finalized on successful checkpoints. If checkpointing is disabled, part files will forever stay in the in-progress or the pending state, and cannot be safely read by downstream systems.
For point 2, the part file lifecycle is documented here, which explains that in-progress files transition to pending based on the rolling policy, and only become finished when a checkpoint is completed. Thus, depending on the rolling policy and the checkpoint interval, some files could be pending for quite some time.
For point 3, with a rollover-interval of 20 seconds and a check-interval of 5 seconds, the rollover will occur somewhere between 20 and 25 seconds after the part file was created: the check that runs just before the file turns 20 seconds old does nothing, and the file is only rolled at the next check, up to 5 seconds later. See the Rolling Policy docs for the explanation of check-interval:
The interval for checking time based rolling policies. This controls the frequency to check whether a part file should rollover based on 'sink.rolling-policy.rollover-interval'.

Flink or Kafka Streams for a case where any change in the stream results in reprocessing all data

I have a use case where I get balances based on date, and I want to show the correct balance for each day. If I get an update for an older date, all balances of that account from that date onwards change.
For example:
Account  Date   Balance  Total balance
IBM      1Jun   100      100
IBM      2Jun   50       150
IBM      10Jun  200      350
IBM      12Jun  200      550
Now I get a message for 4 Jun (this is the scenario where some transaction is done back-dated, or some correction is made; it is a frequent scenario):
Account  Date   Balance  Total balance
IBM      1Jun   100      100
IBM      2Jun   50       150
IBM      4Jun   300      450   ----------- all data from this point changes
IBM      10Jun  200      650
IBM      12Jun  200      850
It is streaming data, and at any point I want the correct balance to be shown for each account.
I know Flink and Kafka are good for streaming use cases where an update for a particular date does not trigger updates to all data from that point onwards. But can this scenario also be handled efficiently, or is this NOT a use case for these streaming technologies at all?
Please help
You can't modify a past message in the queue, so you should introduce a new message that invalidates the previous one. For instance, you can use an ID for each transaction (and repeat it if you need to modify it). If you have two or more messages with the same ID, you keep only the last one.
Take a look at KTable from Kafka Streams. It can help you aggregate data using that ID (or any other aggregation key) and produce, as a result, a table with the valid state up to now. If a new message arrives, table updates will be emitted. A rough sketch of the idea follows.
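To make the "last message per transaction ID wins, then recompute" idea concrete outside any particular framework, here is a small Scala sketch; Transaction and its field names are hypothetical, not anything taken from Kafka Streams or Flink:

// Hypothetical record type for illustration only.
case class Transaction(id: String, account: String, date: java.time.LocalDate, balance: Double)

// Keep only the latest version of each transaction ID (last write wins),
// then recompute the running total per account in date order.
def runningTotals(messages: Seq[Transaction]): Map[String, Seq[(Transaction, Double)]] = {
  val latest = messages
    .groupBy(_.id)
    .map { case (_, versions) => versions.last } // later corrections replace earlier ones
    .toSeq

  latest.groupBy(_.account).map { case (account, txs) =>
    val sorted = txs.sortBy(_.date.toEpochDay)
    var total = 0.0
    account -> sorted.map { tx => total += tx.balance; (tx, total) }
  }
}

A KTable over a topic keyed by transaction ID gives you the first step (latest value per ID) out of the box; the per-account running totals would then be a downstream aggregation over that table.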

Why does the log always say "No Data Available" when the cube is built?

In the sample case on the Kylin official website, when I was building the cube, the log for the first step (Create Intermediate Flat Hive Table) always shows "No Data Available" and the status is always "running".
The cube build has been executed for more than three hours.
I checked the Hive table kylin_sales and there is data in it.
I also found that the intermediate flat Hive table kylin_intermediate_kylin_sales_cube_402e3eaa_dfb2_7e3e_04f3_07248c04c10c
has been created successfully in Hive, but there is no data in it.
hive> show tables;
OK
...
kylin_intermediate_kylin_sales_cube_402e3eaa_dfb2_7e3e_04f3_07248c04c10c
kylin_sales
...
Time taken: 9.816 seconds, Fetched: 10000 row(s)
hive> select * from kylin_sales;
OK
...
8992 2012-04-17 ABIN 15687 0 13 95.5336 17 10000975 10000507 ADMIN Shanghai
8993 2013-02-02 FP-non GTC 67698 0 13 85.7528 6 10000856 10004882 MODELER Hongkong
...
Time taken: 3.759 seconds, Fetched: 10000 row(s)
The deploy environment is as follows:
 
zookeeper-3.4.14
hadoop-3.2.0
hbase-1.4.9
apache-hive-2.3.4-bin
apache-kylin-2.6.1-bin-hbase1x
openssh5.3
jdk1.8.0_144
I deployed the cluster through Docker and created 3 containers: one master and two slaves.
The Create Intermediate Flat Hive Table step is running.
"No Data Available" means this step's log has not been captured by Kylin. Usually the log is only recorded once the step has exited (successfully or with a failure); then you will see the data.
In this case it usually indicates the job is pending in Hive, which can happen for many reasons. The simplest approach is to watch Kylin's log: you will see the Hive command that Kylin executes, and you can then run it manually in a console to reproduce the problem. Please check whether your Hive/Hadoop cluster has enough resources (CPU, memory) to execute such a query.

Schedule a job on JBoss EAP

I have an old application that was running on WebSphere and using an old cron job scheduling library that was written in house a long time ago.
I am trying to convert it to JBoss EAP 6.4 and I cannot determine a good way to convert the job scheduler.
Basically, in the old app we were using a config file that lists the jobs and their frequencies.
This is an example of the config file:
year mo dom dow hr mn prio persist package.class parms
# ==== == === === == == ======= ======= ============================================== ============================
* * * * * 15,45 norm false com.shaw.CronClass1 O
* * * 1,2,3,4,5,6 0-17,19-23 00,30 norm false com.CronClass2 B
* * * 0 1-23 00,30 norm false com.CronClass3 B
The format is messy, but basically the first line says: run this job twice every hour, at minute 15 and minute 45.
The second line says: run this job Mon-Sat, between 12 AM-5 PM and then 7 PM-11 PM, every 30 minutes.
I want to do something similar with JBoss. I saw the Timer service:
http://docs.oracle.com/javaee/6/tutorial/doc/bnboy.html
But I don't think it has all those options, and I cannot use those settings in annotations because they can change; that's why we put them in an external file that is loaded when the app starts.
Is there any library, tool, or way to do this easily?
You can use the Quartz job scheduler API. It supports scheduling both simple timers and cron timers. An example of setting it up with JBoss/WildFly is provided here: http://www.mastertheboss.com/jboss-frameworks/jboss-quartz/quartz-2-tutorial-on-jboss-as-7
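As a rough illustration (in Scala, like the other snippets on this page, though the same calls work from Java), here is a minimal sketch of how the first line of the config file above could be expressed as a Quartz cron trigger. CronClass1 stands in for com.shaw.CronClass1 from the question, and the job/trigger names are made up:

import org.quartz.{CronScheduleBuilder, Job, JobBuilder, JobExecutionContext, TriggerBuilder}
import org.quartz.impl.StdSchedulerFactory

// Stand-in for com.shaw.CronClass1; the real class would do the actual work.
class CronClass1 extends Job {
  override def execute(ctx: JobExecutionContext): Unit = println("CronClass1 fired")
}

object QuartzSketch {
  def main(args: Array[String]): Unit = {
    val scheduler = StdSchedulerFactory.getDefaultScheduler()
    scheduler.start()

    val job = JobBuilder.newJob(classOf[CronClass1]).withIdentity("cronClass1").build()

    // Quartz cron format: sec min hour day-of-month month day-of-week.
    // "0 15,45 * * * ?" matches the config line "* * * * * 15,45": every hour at minutes 15 and 45.
    val trigger = TriggerBuilder.newTrigger()
      .withIdentity("cronClass1Trigger")
      .withSchedule(CronScheduleBuilder.cronSchedule("0 15,45 * * * ?"))
      .build()

    scheduler.scheduleJob(job, trigger)
  }
}

The remaining config lines could be parsed at application startup and turned into cron expressions the same way, so the schedules stay in an external file as before.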

SQL Server 2012 AlwaysOn has a high logcapture wait type

From reading, I can see the Work_Queue wait can safely be ignored, but I don't find much about logcapture_wait. This is from BOL: "Waiting for log records to become available. Can occur either when waiting for new log records to be generated by connections or for I/O completion when reading log not in the cache. This is an expected wait if the log scan is caught up to the end of log or is reading from disk."
Average disk sec/write is basically 0 for both SQL Servers, so I'm guessing this wait type can safely be ignored?
Here are the top 10 waits from the primary:
wait_type                  pct    running_pct
HADR_LOGCAPTURE_WAIT       45.98  45.98
HADR_WORK_QUEUE            44.89  90.87
HADR_NOTIFICATION_DEQUEUE   1.53  92.40
BROKER_TRANSMITTER          1.53  93.93
CXPACKET                    1.42  95.35
REDO_THREAD_PENDING_WORK    1.36  96.71
HADR_CLUSAPI_CALL           0.78  97.49
HADR_TIMER_TASK             0.77  98.26
PAGEIOLATCH_SH              0.66  98.92
OLEDB                       0.53  99.45
Here are the top 10 waits from the secondary:
wait_type                  pct    running_pct
REDO_THREAD_PENDING_WORK   66.43  66.43
HADR_WORK_QUEUE            31.06  97.49
BROKER_TRANSMITTER          0.79  98.28
HADR_NOTIFICATION_DEQUEUE   0.79  99.07
Don't troubleshoot problems on your server by looking at total waits. If you want to troubleshoot what is causing you problems, then you need to look at current waits. You can do that by either querying sys.dm_os_waiting_tasks or by grabbing all waits (like you did above), waiting for 1 minute, grabbing all waits again, and subtracting them to see what waits actually occurred over that minute.
See the webcast I did for more info: Troubleshooting with DMVs
That aside, HADR_LOGCAPTURE_WAIT is a background wait type and does not affect any running queries. You can ignore it.
No, you can't simply ignore "HADR_LOGCAPTURE_WAIT". This wait type occurs when SQL Server is either waiting for new log data to be generated or experiencing latency while reading data from the log file. Internal and external fragmentation of the log file, or slow storage, could contribute to this wait type as well.
