StackOverflowError in Pentaho - loops

I have an issue that I can't seem to resolve. I have a job (A) that calls another job (B). Iteration takes place in job B. There is a loop inside job B which reads one row at a time from a source file and writes to a text file.
The problem is that the source file contains 37,000 rows, but execution stops at around row 27,000. It crashes with:
"ERROR (version 6.0.0.0-353, build 1 from 2015-10-07 13.27.43 by buildguy) : java.lang.StackOverflowError"
I have tried gradually increasing the memory settings in spoon.bat from 1g up to "-Xms7g" "-Xmx12g" "-XX:MaxPermSize=256m", but it still crashes. Any idea how I can solve this problem?

I finally solved the problem by adding one parameter ("-Xss512m") to my spoon.bat. The -Xms/-Xmx options only size the heap, while a java.lang.StackOverflowError is about the per-thread stack, which is controlled by -Xss. The relevant line now reads:
if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS="-Xms2048m" "-Xmx4096m" "-Xss512m" "-XX:MaxPermSize=256m"

Related

Spark alternatives for reading incomplete files

I am using Spark 2.3 to read Parquet files. The files average about 20 MB each, and there are currently about 20,000 of them. The directories are partitioned down to the day level across 24 months.
The issue I am facing is that, on an occasional basis, most of the files are rewritten by an HBase application. This does not happen every day, but when it does it takes several days for that process to complete across all files. My process needs to look at the data in ALL files each day to scan for changes, based on an update date stored in each record. But I am finding that if I read the HDFS directory during the period of the mass rewrite, my Spark program fails with errors like these:
java.io.EOFException: Cannot seek after EOF
OR
java.io.IOException: Could not read footer for file: FileStatus{path=hdfs://project/data/files/year=2020/month=02/day=10/file_20200210_11.parquet
OR
Caused by: java.lang.RuntimeException: hdfs://project/data/files/year=2020/month=02/day=10/file_20200210_11.parquet is not a Parquet file. expected magic number at tail
I assume the errors are because it is trying to read a file that is not finalized (still being written to). But from Spark's perspective it looks like a corrupt file. So my process would fail and not be able to read the files for a couple days until all files are stable.
Code is pretty basic:
val readFilePath = "/project/data/files/"
val df = spark.read
  .option("spark.sql.parquet.mergeSchema", "true")
  .parquet(readFilePath)
  .select(
    $"sourcetimestamp",
    $"url",
    $"recordtype",
    $"updatetimestamp"
  )
  .withColumn("end_date", to_date(date_format(lit("2021-08-23"), "yyyy-MM-dd")))
  .withColumn("start_date", date_sub($"end_date", 2))
  .withColumn("update_date", to_date($"updatetimestamp", "yyyy-MM-dd"))
  .filter($"update_date" >= $"start_date" and $"update_date" <= $"end_date")
Is there anything I can do programmatically to work around a problem like this? It doesn't seem like I can trap an error like that and make the read continue. Spark 3 has an option "spark.sql.files.ignoreCorruptFiles" that I think would help, but it is not available in version 2.3, so that doesn't help me.
I considered reading a single file at a time and looping through all the files, but that would take forever (about 12 hours, based on a test of just a single month). Apart from asking the application owner to write changed files to a temp directory and then move each one into the main directory as it is completed, I don't see any other alternatives so far. I'm not even sure that would help, or whether I would run into collisions during the small window while a file is being moved.
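One programmatic workaround is to pre-filter the file list and skip anything whose Parquet footer cannot be read yet, then pass only the surviving paths to the main read. The sketch below is only illustrative and rests on assumptions from the question (the /project/data/files/ layout and a SparkSession named spark); names such as isReadable and goodFiles are made up here. The per-file check runs on the driver and only touches footers, so it is far cheaper than fully reading every file, but with ~20,000 files it still adds planning time.
import scala.util.Try
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: keep only files whose Parquet footer is currently readable.
val basePath = "/project/data/files/"
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val allFiles = fs
  .globStatus(new Path(basePath + "year=*/month=*/day=*/*.parquet"))
  .map(_.getPath.toString)

// Schema inference on a single file forces a footer read; a file that is
// mid-rewrite throws here and is skipped instead of failing the whole job.
def isReadable(path: String): Boolean =
  Try(spark.read.parquet(path).schema).isSuccess

val goodFiles = allFiles.filter(isReadable)

val df = spark.read
  .option("basePath", basePath)      // keep year/month/day partition discovery
  .option("mergeSchema", "true")
  .parquet(goodFiles: _*)
This does not remove the underlying race (a file can still be rewritten between the check and the read), so the temp-directory-then-rename arrangement mentioned above remains the cleaner long-term fix; the pre-filter just lets the daily scan get through the rewrite window.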

Using camel-smb, SMB picks up (large) files while they are still being written to

While trying to set up cyclic moving of files, I encountered strange behavior with readLock. I create a large file (a few hundred MB) and transfer it using SMB from the out folder to the in folder.
FROM:
smb2://smbuser:****#localhost:4455/user/out?antInclude=FILENAME*&consumer.bridgeErrorHandler=true&delay=10000&inProgressRepository=%23inProgressRepository&readLock=changed&readLockMinLength=1&readLockCheckInterval=1000&readLockTimeout=5000&streamDownload=true&username=smbuser&delete=true
TO:
smb2://smbuser:****#localhost:4455/user/in?username=smbuser
I create another flow to move the file back from the IN folder to the OUT folder. After some transfers, the file gets picked up while it is still being written by the other route, and the transfer completes with a much smaller file, resulting in a partial file at the destination.
FROM:
smb2://smbuser:****#localhost:4455/user/in?antInclude=FILENAME*&delete=true&readLock=changed&readLockMinLength=1&readLockCheckInterval=1000&readLockTimeout=5000&streamDownload=false&delay=10000
TO:
smb2://smbuser:****#localhost:4455/user/out
The question is: why is my readLock not working properly? (Note: streamDownload is required.)
UPDATE: it turns out this only happens with a Windows Samba share, and with streamDownload=true. So it is something to do with stream chunking. Any advice welcome.
The solution is to prevent the polling strategy from automatically picking up a file, and to make it aware of the in-progress readLock on the other side. So I lowered delay to 5 seconds and, in the FROM part on both sides, added readLockMinAge=5s, which checks the file's modification time.
Since the streaming transfer touches the file at least once per second, a 5-second minimum age is enough to stop the premature pickup.
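For illustration only, reusing the exact options quoted above, the consuming FROM endpoint then looks roughly like this (lowered delay plus the added readLockMinAge):
smb2://smbuser:****#localhost:4455/user/out?antInclude=FILENAME*&consumer.bridgeErrorHandler=true&delay=5000&inProgressRepository=%23inProgressRepository&readLock=changed&readLockMinLength=1&readLockCheckInterval=1000&readLockTimeout=5000&readLockMinAge=5s&streamDownload=true&username=smbuser&delete=true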
An explanation of why the situation described above happens: while one route prepares to pick up from the out folder, a large file (1 GB) is still in progress, being streamed chunk by chunk to the in folder. At the end of the streaming the file is marked for removal by camel-smbj and receives the status STATUS_DELETE_PENDING.
Now another part of the process starts to send a newly arrived file to the out folder and finds that this file already exists. Because of the default fileExists=Override strategy, it tries to delete (and afterwards store) the existing file, which has still not been deleted from the previous step, and receives an exception, which causes some InputStream chunks to be lost.

TCL Opaque Handle C lib

I am confused by the following code from Tcl wiki page 1089:
#define TEMPBUFSIZE 256 /* usually enough space! */
char buf[[TEMPBUFSIZE]];
I was curious and tried to compile the above syntax with gcc and armcc; both fail. I was looking to understand how Tcl's handle-to-file-pointer mechanism works, in order to sort out the chaos in data logging when multiple jobs run in the same folder (the log files are unique to each job).
I have multiple Tcl scripts running in parallel as LSF jobs, each using its own log file.
For example,
Job1 -> log1.txt
Job2 -> log2.txt
(file write in both case is "intermittent" over the entire job execution)
Some of the text that I expect to be part of log1.txt is written to log2.txt, and vice versa, at random. I have tried "fconfigure $fp -buffering none", but the behaviour still persists. One important note: all the LSF jobs are submitted from the same folder, and if I submit the jobs from individual folders the log files do not contain writes from other jobs. I would like the jobs to be executed from the same folder to reduce the space consumed by repeating the resources in different folders.
Question1:
Can anyone advise me on how the Tcl "handle" is mapped to a pointer to the memory allocated for the log file? I said "intermittent" because of the following, from wiki 1089: "Tcl maps this string internally to an open file pointer when it is time for the interpreter to do some file I/O against that particular file."
Question2:
Is there a possibility that two different "open" calls can end up with the same file?
Somewhere along the line, the code has been mangled; it looks like it happened when I converted the syntax from one type of highlighting scheme to another in 2011. Oops! My original content used:
char buf[TEMPBUFSIZE];
and that's what you should use. (I've updated the wiki page to fix this.)

SSIS truncation error only in control flow

I have a package that is giving me a very confusing "Text was truncated or one or more characters had no match in the target code page" error, but only when I run the full package in the control flow, not when I run just the task by itself.
The first task takes CSV files and combines them into one file. The next task reads the output of that first task and begins to process the records. What is really odd is that the truncation error is thrown by the flat file source in the second step, and it is the exact same flat file that was the destination in the previous step.
If there were a truncation problem, wouldn't it have been thrown by the previous step that created the file? Since the first step created the file without truncation, why can't I just read that same file in the very next task?
Note: the only thing that makes this package different from the others I have worked on is that I am dealing with special characters and using code page 65001 (UTF-8) to capture the fields that contain them. My other packages all referenced flat file connection managers with code page 1252.
The problem was caused by the foreach loop and the ColumnNamesInFirstDataRow expression, where I have the formula "@[User::count_raw_input_rows] < 0". I initialize a variable to -1 and assign the expression to ColumnNamesInFirstDataRow for the flat file. Inside the loop I update the variable with a row counter on each read of a CSV file. This writes the header the first time through (-1) and then avoids repeating it for all the other CSV files. When I exit the loop and try to read the combined input file, it treats the header row as data and blows up. I only avoided this in my last package because I didn't tighten the column definitions for the flat file the way I did in this package. Thanks for the help.

copy lines in range using Python/Jython

I have a server log file which is constantly updated.
After a script execution, I want to collect the relevant part of my run from the server log into an execution log.
Basically, I need to capture the last line number of the server log before the test started (which I already know how to do), say X, and after execution copy the lines from X to the new end of the server log, now X+1000.
How do I copy only lines X to X+1000?
Thanks, Assaf
Try this:
with open("server.log") as src, open("execution.log", "w") as dst:
    dst.writelines(src.readlines()[X:X+1000])
