Azkaban solo - log retention period of 12 weeks?

The azkaban-solo-2.5.0 execution log seems to be deleted after 1 day.
But its documentation says Azkaban has a log retention period of 12 weeks.
Which part am I misunderstanding?
How can I keep execution logs for longer?

The execution log of Azkaban solo is saved in two places.
One copy is saved to the H2 database, and retention of that H2 'execution_logs' table is controlled by the property 'azkaban.execution.logs.retention.ms' in 'conf/azkaban.properties'. azkaban.execution.logs.retention.ms was introduced in this commit.
The other copy is saved to the filesystem, and retention of those execution directory logs is controlled by the property 'execution.dir.retention'. (See FlowRunnerManager in Azkaban.)
So setting the property 'execution.dir.retention' to 7776000000 (== 3 * 30 * 24 * 60 * 60 * 1000) should keep the execution directories for about 3 months.
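For example, the relevant lines might look like this (a minimal sketch: the answer places the first property in conf/azkaban.properties, and I'm assuming the solo server reads execution.dir.retention from the same file; both values are in milliseconds):
azkaban.execution.logs.retention.ms=7776000000
execution.dir.retention=7776000000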


What is the result of setting the 'creationTime' metadata field of a transaction?

The pact-lang-api library accepts metadata for transactions that it formats to be sent off to a node. One of the metadata fields is creationTime. The closest description of this field I can find is in the YAML section of the Pact docs, which states this metadata field denotes:
optional integer tx execution time after offset
In the pact-lang-api code I've seen in the wild, this value is usually set to the current time or a delayed time (for example, in the deploy-contract section of the pact-lang-api cookbook):
Math.round(new Date().getTime() / 1000) - 15
This indicates the current time in seconds from the Unix epoch, minus 15 seconds. In other words, the "optional integer tx execution time after offset" is fifteen seconds before this transaction is sent to a node.
What does this mean? Specifically, I'm trying to understand what the effect of setting this metadata field is, and what might happen if I set it to different values (what if I set it to 15 seconds after the transaction was constructed, instead of 15 seconds before the transaction as the cookbook does?).

Two questions about Time Travel storage costs in Snowflake

I have read the Snowflake documentation extensively. Snowflake incurs additional storage costs when data is updated.
The "tables-storage-considerations.html" page mentions:
As an extreme example, consider a table with rows associated with
every micro-partition within the table (consisting of 200 GB of
physical storage). If every row is updated 20 times a day, the table
would consume the following storage:
Active 200 GB | Time Travel 4 TB | Fail-safe 28 TB | Total Storage 32.2 TB
My first question: if a periodic task runs 20 times a day, and each run updates exactly one row in every micro-partition, does the table still consume 32.2 TB of total storage?
"data-time-travel.html" mentioned that:
Once the defined period of time has elapsed, the data is moved into
Snowflake Fail-safe and these actions can no longer be performed.
So my second question is: why does Fail-safe cost 28 TB rather than 24 TB (i.e., reduced by the Time Travel amount)?
https://docs.snowflake.com/en/user-guide/data-cdp-storage-costs.html
https://docs.snowflake.com/en/user-guide/tables-storage-considerations.html
https://docs.snowflake.com/en/user-guide/data-time-travel.html
First question: yes, it's the fact that the micro-partition is changing that matters, not how many rows within it change.
Second question: Fail-safe keeps 7 days of data. 4 TB x 7 = 28 TB.
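As a back-of-the-envelope check of the example's figures, here is a sketch in Python (it assumes the default 1-day Time Travel retention, which is what makes the documented numbers add up):
# Rough reconstruction of the documentation example's storage figures.
active_gb = 200            # physical size of the table
updates_per_day = 20       # every micro-partition is rewritten 20 times a day
time_travel_days = 1       # assumed default Time Travel retention
fail_safe_days = 7         # Fail-safe always keeps 7 days of changed data

churn_gb_per_day = active_gb * updates_per_day          # 4,000 GB = 4 TB of rewritten partitions per day
time_travel_gb = churn_gb_per_day * time_travel_days    # 4 TB
fail_safe_gb = churn_gb_per_day * fail_safe_days        # 28 TB
total_gb = active_gb + time_travel_gb + fail_safe_gb    # 32,200 GB = 32.2 TB
print(active_gb, time_travel_gb, fail_safe_gb, total_gb)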

Why does the log always say "No Data Available" while the cube is building?

In the sample case from the official Kylin website, while building the cube, the log of the first step, Create Intermediate Flat Hive Table, always shows "No Data Available" and the status stays "running".
The cube build has been running for more than three hours.
I checked the Hive table kylin_sales and there is data in it.
I also found that the intermediate flat Hive table kylin_intermediate_kylin_sales_cube_402e3eaa_dfb2_7e3e_04f3_07248c04c10c has been created successfully in Hive, but there is no data in it.
hive> show tables;
OK
...
kylin_intermediate_kylin_sales_cube_402e3eaa_dfb2_7e3e_04f3_07248c04c10c
kylin_sales
...
Time taken: 9.816 seconds, Fetched: 10000 row(s)
hive> select * from kylin_sales;
OK
...
8992 2012-04-17 ABIN 15687 0 13 95.5336 17 10000975 10000507 ADMIN Shanghai
8993 2013-02-02 FP-non GTC 67698 0 13 85.7528 6 10000856 10004882 MODELER Hongkong
...
Time taken: 3.759 seconds, Fetched: 10000 row(s)
The deployment environment is as follows:
zookeeper-3.4.14
hadoop-3.2.0
hbase-1.4.9
apache-hive-2.3.4-bin
apache-kylin-2.6.1-bin-hbase1x
openssh5.3
jdk1.8.0_144
I deployed the cluster with Docker and created 3 containers: one master and two slaves.
The Create Intermediate Flat Hive Table step is still running.
"No Data Available" means this step's log has not yet been captured by Kylin. Usually the log is recorded only once the step has exited (succeeded or failed); then you will see the data.
In this case it usually indicates that the job is pending in Hive, which can happen for many reasons. The simplest way to debug is to watch Kylin's log: you will see the Hive command that Kylin executes, and you can then run it manually in a console to reproduce the problem. Please also check whether your Hive/Hadoop cluster has enough resources (CPU, memory) to execute such a query.

Why is my Google Cloud SQL instance being billed every hour?

It seems like I'm being overbilled, but I want to make sure I am not misunderstanding how Per Use billing works. Here are the details:
I'm running a small test PHP application on Google App Engine with no visitors other than myself every once in a while.
I periodically reset the database via cron: originally every hour, then every 3 hours last month, now every 6 hours.
Pricing plan: Per Use
Storage Used: 0.1% of 250 GB
Type: First Generation
IPv4 address: None
File system replication: Synchronous
Tier: D0
Activation Policy: On demand
Here's the billing through the first 16 days of this month:
Google SQL Service D0 usage - hour 383 hour(s) $9.57
16 days * 24 hours = 384 hours; 384 hours * $0.025 = $9.60. So it appears I've been charged for every hour this month. This also happened last month.
I understand that I am charged the full hour for every part of an hour that the SQL instance is active.
Still, with the minimal app usage and the database reset 4 times a day, I would expect the charges (even allowing for a couple extra hours of usage each day) to be closer to:
16 days * 6 hours = 96 hours; 96 hours * $0.025 = $2.40.
Any explanation for the discrepancy?
The logs are usually the source of truth. Check them to see if you are being visited by an aggressive crawler, a stuck task that keeps retrying, etc.
Or you may have a cron job that is running and performing work. You can view that in the "task queue/cron jobs" section of the control panel.
You might have assigned an IPv4 address to your instance, and the Google Developer Console clearly states:
You will be charged $0.01 each hour the instance is inactive and has an IPv4 address assigned.
This might be the reason for your extra bill.
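As a quick sketch in Python of the two rates mentioned in this thread (the $0.025/hour active D0 rate implied by the question's own math and the $0.01/hour idle-with-IPv4 charge quoted above), the billed amount lines up with the instance being active nearly every hour:
hours_billed = 383
active_d0_rate = 0.025     # per-hour rate implied by the question's "384 hours * $0.025 = $9.60"
idle_ipv4_rate = 0.01      # per-hour charge quoted from the Developer Console
print(hours_billed * active_d0_rate)   # ~9.575 -- essentially the $9.57 on the bill
print(hours_billed * idle_ipv4_rate)   # ~3.83  -- what the idle-with-IPv4 charge alone would add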

Datastore Read Operations Calculation

So I am currently performing a test to estimate how much work my Google App Engine app can do without going over quotas.
This is my test:
I have an entity in the Datastore that, according to my local dashboard, needs 18 write operations. I have 5 entries of this kind in a table.
Every 30 seconds, I fetch those 5 entities mentioned above. I do NOT use memcache for these!
That means 5 * 18 = 90 read operations per fetch, right?
In 1 minute that means 180; in 1 hour that means 10,800 read operations, which is ~20% of the daily limit quota.
However, after 1 hour of running my test, I noticed on my online dashboard that only 2% of the read operations quota had been used. My question is: why is that? Where is the flaw in my calculations?
Also, where can I see in the online dashboard how many read/write operations an entity needs?
Thanks
A write on your entity may need 18 write operations, but a get on your entity will cost you only 1 read.
So if you get 5 entities every 30 seconds for one hour, you'll have about 5 reads * 120 = 600 reads.
That is the case if you do a get on your 5 entries (fetching each entry by its id).
If you run a query to fetch them, the cost is "1 read + 1 read per entity retrieved". That means 1 + 5 = 6 reads per query, so around 6 * 120 = 720 reads in one hour.
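Here is that arithmetic as a small sketch in Python (the 5 entities, the 30-second interval, and the "1 read + 1 read per entity retrieved" query rule are all taken from the thread above):
entities = 5
fetches_per_hour = 3600 // 30                        # one fetch every 30 seconds -> 120 per hour
reads_by_get = entities * fetches_per_hour           # 1 read per entity fetched by key -> 600
reads_by_query = (1 + entities) * fetches_per_hour   # 1 read per query + 1 per entity -> 720
print(reads_by_get, reads_by_query)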
For more detailed information, here is the documentation for estimating costs.
You can't see on the dashboard how many write/read operations an entity needs, but I invite you to check Appstats for that.
