I am performing a scatter-gather operation in Nextflow.
It looks like the following:
reads = PATH+"test_1.fq"
outdir = "results"
split_read_ch = channel.fromFilePairs(reads, checkIfExists: true, flat:true ).splitFastq( by: 10, file:"test_split" )
process Scatter_fastP {
tag 'Scatter_fastP'
publishDir outdir
input:
tuple val(name), path(reads) from split_read_ch
output:
file "${reads}.trimmed.fastq" into gather_fatsp_ch
script:
"""
fastp -i ${reads} -o ${reads}.trimmed.fastq
"""
}
gather_fastp_ch.collectFile().view().println{ it.text }
I run this code with all the benchmarking options proposed by Nextflow (https://www.nextflow.io/docs/latest/tracing.html):
nextflow run main.nf -with-report nextflow_report -with-trace nextflow_trace -with-timeline nextflow_timeline -with-dag nextflow_dag.html
In these tracing files, I can find the resources and speed of the 10 Scatter_fastP processes.
But I would like to also measure the resources and speed of the creation of the split_read_ch and the gather_fastp_ch channels.
I have tried to include the channels' creation in processes but I cannot find a solution to make it work.
Is there a way to include the channel creation in the tracing files? Or is there a way I have not found to create these channels inside processes?
Thank you in advance for your help.
Although Nextflow can parse FASTQ files and split them into smaller files and so on, it's generally better to hand these operations off to another process or set of processes, especially if your input FASTQ files are large. This is beneficial in two ways: (1) your main Nextflow process doesn't need to work as hard, and (2) you get granular per-task stats in your Nextflow reports.
The following example uses GNU split to split the input FASTQ files, and gathers the outputs using the groupTuple() operator and the groupKey() built-in to stream the collected values as soon as possible. You'll need to adapt it for your non-gzipped inputs:
nextflow.enable.dsl=2
params.num_lines = 40000
params.suffix_length = 5
process split_fastq {
input:
tuple val(name), path(fastq)
output:
tuple val(name), path("${name}-${/[0-9]/*params.suffix_length}.fastq.gz")
shell:
'''
zcat "!{fastq}" | split \\
-a "!{params.suffix_length}" \\
-d \\
-l "!{params.num_lines}" \\
--filter='gzip > ${FILE}.fastq.gz' \\
- \\
"!{name}-"
'''
}
process fastp {
input:
tuple val(name), path(fastq)
output:
tuple val(name), path("${fastq.getBaseName(2)}.trimmed.fastq.gz")
"""
fastp -i "${fastq}" -o "${fastq.getBaseName(2)}.trimmed.fastq.gz"
"""
}
workflow {
Channel.fromFilePairs( './data/*.fastq.gz', size: 1 ) \
| split_fastq \
| map { name, fastq -> tuple( groupKey(name, fastq.size()), fastq ) } \
| transpose() \
| fastp \
| groupTuple() \
| map { key, fastqs -> tuple( key.toString(), fastqs ) } \
| view()
}
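With the splitting done in its own process, rerunning with the same tracing options as before will give the split_fastq and fastp tasks their own rows in the trace, report and timeline:
nextflow run main.nf -with-report nextflow_report -with-trace nextflow_trace -with-timeline nextflow_timeline -with-dag nextflow_dag.html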
I'd like to add a new option to ntpd. However, I couldn't find out how to generate ntpd/ntpd-opts{.c, .h} after adding some lines to ntpd/ntpdbase-opts.def, e.g.:
$ git diff ntpd/ntpdbase-opts.def
diff --git a/ntpd/ntpdbase-opts.def b/ntpd/ntpdbase-opts.def
index 66b953528..a790cbd51 100644
--- a/ntpd/ntpdbase-opts.def
+++ b/ntpd/ntpdbase-opts.def
@@ -479,3 +479,13 @@ flag = {
the server to be discovered via mDNS client lookup.
_EndOfDoc_;
};
+
+flag = {
+ name = foo;
+ value = F;
+ arg-type = number;
+ descrip = "Some new option";
+ doc = <<- _EndOfDoc_
+ For testing purpose only.
+ _EndOfDoc_;
+};
Do you have any ideas?
how to generate ntpd/ntpd-opts{.c, .h} after adding some lines to ntpd/ntpdbase-opts.def
It is handled by the build scripts. Just compile it normally (https://github.com/ntp-project/ntp/blob/master-no-authorname/INSTALL#L30) and make will pick it up.
https://github.com/ntp-project/ntp/blob/master-no-authorname/ntpd/Makefile.am#L304
https://github.com/ntp-project/ntp/blob/master-no-authorname/ntpd/Makefile.am#L183
In addition to @KamilCuk's answer, we need to do the following to add custom options:
Edit *.def file
Run bootstrap script
Run configure script with --disable-local-libopts option
Run make
For example,
$ git diff ntpd/ntpdbase-opts.def
diff --git a/ntpd/ntpdbase-opts.def b/ntpd/ntpdbase-opts.def
index 66b953528..a790cbd51 100644
--- a/ntpd/ntpdbase-opts.def
+++ b/ntpd/ntpdbase-opts.def
@@ -479,3 +479,13 @@ flag = {
the server to be discovered via mDNS client lookup.
_EndOfDoc_;
};
+
+flag = {
+ name = foo;
+ value = F;
+ arg-type = number;
+ descrip = "Some new option";
+ doc = <<- _EndOfDoc_
+ For testing purpose only.
+ _EndOfDoc_;
+};
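After editing the .def file, the remaining steps from the list above, as shell commands (a sketch; the script names assume a standard ntp source checkout):
$ ./bootstrap
$ ./configure --disable-local-libopts
$ make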
This change yields:
$ ./ntpd --help
ntpd - NTP daemon program - Ver. 4.2.8p15
Usage: ntpd [ -<flag> [<val>] | --<name>[{=| }<val>] ]... \
[ <server1> ... <serverN> ]
Flg Arg Option-Name Description
-4 no ipv4 Force IPv4 DNS name resolution
- prohibits the option 'ipv6'
...
-F Num foo Some new option
opt version output version information and exit
-? no help display extended usage information and exit
-! no more-help extended usage information passed thru pager
Options are specified by doubled hyphens and their name or by a single
hyphen and the flag character.
...
I want to use Flink to read from an input file, do some aggregation, and write the result to an output file. The job is in batch mode. See wordcount.py below:
from pyflink.table import EnvironmentSettings, BatchTableEnvironment
# https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/python/table_api_tutorial.html
env_settings = EnvironmentSettings.new_instance().in_batch_mode().build()
table_env = BatchTableEnvironment.create(environment_settings=env_settings)
my_source_ddl = """
create table mySource (
word VARCHAR
) with (
'connector' = 'filesystem',
'format' = 'csv',
'path' = '/tmp/input'
)
"""
my_sink_ddl = """
create table mySink (
word VARCHAR,
`count` BIGINT
) with (
'connector' = 'filesystem',
'format' = 'csv',
'path' = '/tmp/output'
)
"""
transform_dml = """
INSERT INTO mySink
SELECT word, COUNT(1) FROM mySource GROUP BY word
"""
table_env.execute_sql(my_source_ddl)
table_env.execute_sql(my_sink_ddl)
table_env.execute_sql(transform_dml).wait()
# before run: echo -e "flink\npyflink\nflink" > /tmp/input
# after run: cat /tmp/output
Before running python wordcount.py, I run echo -e "flink\npyflink\nflink" > /tmp/input to make sure data exist in /tmp/input. However, after the run, there are two files in /tmp/output:
> ls /tmp/output
part-305680d0-e680-420f-ab17-3e558ceaeba3-cp-0-task-6-file-0 part-305680d0-e680-420f-ab17-3e558ceaeba3-cp-0-task-7-file-0
> cat /tmp/output/part-305680d0-e680-420f-ab17-3e558ceaeba3-cp-0-task-6-file-0
pyflink,1
> cat /tmp/output/part-305680d0-e680-420f-ab17-3e558ceaeba3-cp-0-task-7-file-0
flink,2
While I expect a single file /tmp/output with content:
pyflink,1
flink,2
Actually, I got the above Python program by adjusting the one below, which produces the single file /tmp/output.
from pyflink.dataset import ExecutionEnvironment
from pyflink.table import TableConfig, DataTypes, BatchTableEnvironment
from pyflink.table.descriptors import Schema, OldCsv, FileSystem
from pyflink.table.expressions import lit
# https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/python/table_api_tutorial.html
exec_env = ExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
t_config = TableConfig()
t_env = BatchTableEnvironment.create(exec_env, t_config)
t_env.connect(FileSystem().path('/tmp/input')) \
.with_format(OldCsv()
.field('word', DataTypes.STRING())) \
.with_schema(Schema()
.field('word', DataTypes.STRING())) \
.create_temporary_table('mySource')
t_env.connect(FileSystem().path('/tmp/output')) \
.with_format(OldCsv()
.field_delimiter('\t')
.field('word', DataTypes.STRING())
.field('count', DataTypes.BIGINT())) \
.with_schema(Schema()
.field('word', DataTypes.STRING())
.field('count', DataTypes.BIGINT())) \
.create_temporary_table('mySink')
tab = t_env.from_path('mySource')
tab.group_by(tab.word) \
.select(tab.word, lit(1).count) \
.execute_insert('mySink').wait()
Running this version will generate a single file /tmp/output. Note that it doesn't use a comma delimiter.
> cat /tmp/output
flink 2
pyflink 1
Any idea why? Thanks!
The first time, you ran it without specifying the parallelism, so you got the default parallelism, which is greater than 1 (probably 4 or 8, depending on how many cores your computer has).
Flink is designed to be scalable, and to achieve that, parallel instances of an operator, such as a sink, are decoupled from one another. Imagine, for example, a large cluster with 100s or 1000s of nodes. For this to work well, each instance needs to write to its own file.
The commas were changed to tabs because you specified .field_delimiter('\t').
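If you want a single output file from the DDL-based version as well, one option (a sketch, assuming the Flink 1.12 Table API configuration key table.exec.resource.default-parallelism) is to force the parallelism to 1 before executing the INSERT:
from pyflink.table import EnvironmentSettings, BatchTableEnvironment

env_settings = EnvironmentSettings.new_instance().in_batch_mode().build()
table_env = BatchTableEnvironment.create(environment_settings=env_settings)

# Run every operator, including the filesystem sink, with a single parallel
# instance so that only one part file is written.
table_env.get_config().get_configuration().set_string(
    "table.exec.resource.default-parallelism", "1")
Even then, the new filesystem connector treats /tmp/output as a directory, so expect one part file inside it rather than a single flat /tmp/output file.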
I'd like to create an array like this
@exclude = ("[INFO] Reading file", "[INFO] All file(s) read");
which contains items that I would like to ignore while looping through another array.
The other array is @nyuulog, which I've read into from a file and which looks similar to this:
[INFO] Uploading 37 article(s) from 3 file(s) totalling 23.98 MiB
[INFO] Reading file 157.1.1.par2...
[INFO] Reading file 159.1.1.rar...
[INFO] Reading file 159.1.1.vol0+1.par2...
[INFO] All file(s) read...
[INFO] Finished uploading 23.98 MiB in 00:00:16.083 (1527.03 KiB/s). Raw upload: 2613.34 KiB/s
So I'm using this:
foreach $line(@nyuulog) {print $txtfile("$line\n");}
which writes all of the lines, but I don't want to write out lines to the filehandle that contain an element of the @exclude array.
Is there an easy way to do this? I've made numerous attempts at using grep or the Perl ~~ smartmatch operator (which I don't think applies in this situation) and can't get the right combination of commands.
Any help or pointing me in the right direction - would be greatly appreciated.
Thank you
Try this:
my $FilterRe = join("|", map({"(^\Q$_\E)"} @exclude));
my @Filtered = grep({!/$FilterRe/} @nyuulog);
Inspired by a question on perlmonks.
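Putting the regex filter together with your write loop, a minimal self-contained sketch (the sample data and the filtered.log filename are only stand-ins for your @nyuulog and $txtfile):
use strict;
use warnings;

# Stand-in for @nyuulog, which you would normally read from your log file.
my @nyuulog = (
    '[INFO] Uploading 37 article(s) from 3 file(s) totalling 23.98 MiB',
    '[INFO] Reading file 157.1.1.par2...',
    '[INFO] All file(s) read...',
    '[INFO] Finished uploading 23.98 MiB in 00:00:16.083',
);
my @exclude = ('[INFO] Reading file', '[INFO] All file(s) read');

# One alternation regex of the excluded prefixes, with metacharacters quoted.
my $FilterRe = join('|', map { "(^\Q$_\E)" } @exclude);

open my $txtfile, '>', 'filtered.log' or die "open filtered.log: $!";
print $txtfile "$_\n" for grep { !/$FilterRe/ } @nyuulog;
close $txtfile;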
Construct a look-up hash for what's to be excluded, and filter the array with it
my %excl = map { $_ => 1 } @exclude;
my @filtered = grep { not $excl{$_} } @original;
This is about as efficient as list processing can be, O(N), and hopefully clear and easy.
You can also have it in a do block to avoid an extra variable (%excl) floating around:
my @filtered = do {
my %excl = map { $_ => 1 } @exclude;
grep { not $excl{$_} } @original;
};
I am currently experiencing a really strange bug when compiling for u-boot:
The array
init_fnc_t *init_sequence[] = {...}
is filled with 0s/NULLs instead of function pointers. I thought I had outsmarted the compiler by calling all those functions 'by hand'. However, this bug has even more ramifications, as the driver struct also gets 0/NULL pointers:
static struct serial_device my_serial_drv = {
.name = "my_serial",
.start = my_serial_init,
.stop = NULL,
.setbrg = my_serial_setbrg,
.putc = my_serial_putc,
.puts = my_serial_puts,
.getc = my_serial_getc,
.tstc = my_serial_tstc,
};
which, of course, when I call
my_serial_drv->start();
sets the PC to 0 and subsequently crashes everything.
Fun fact: the .name reaches the binary, so the .data sections are probably fine once they are set.
I have tested this with aarch64-linux-gnu-*-4.7 and aarch64-linux-gnu-*-4.9 binaries.
You can find 4.9 from:
http://releases.linaro.org/latest/components/toolchain/binaries.
Any help would be greatly appreciated :)
The u-boot.bin file seems all right. After investigating why this works, I saw that u-boot's make runs a separate command to fix the bin:
start=$(aarch64-linux-gnu-nm u-boot | grep __rel_dyn_start | cut -f 1 -d ' ');
end=$(aarch64-linux-gnu-nm u-boot | grep __rel_dyn_end | cut -f 1 -d ' ');
tools/relocate-rela u-boot.bin 0x3e900000 $start $end
(these were a single line, but I split it for readability)
I have TWO large CSV files (around 1 GB each). They share a relation with each other (the ID is, let's say, like a foreign key). The structure is simple, line by line, but CSV cells with a line break inside the value string can appear:
37373;"SOMETXT-CRLF or other line break-";3838383;"sasa ssss"
One file is the P file and the other is the T file. T is about 70% of the size of the P file (P > T). I must cut them into smaller parts since they are too big for the program I have to import them into... I cannot simply use split -l 100000 since I will lose the ID=ID relations, which must be preserved! The relation can be 1:1, 2:3, 4:6 or 1:5, so blind file splitting is not an option; we must check where we create a new file. Below is an example with a simplified CSV structure and the places where I want the files to be cut (the lines above the cut go to a separate P|T__00x file and we continue until P or T ends). Lines are sorted in both files, so there is no need to search for IDs across the whole file!
File "P" (empty lines for clearness):
CSV_FILE_P;HEADER;GOES;HERE
564788402;1;1;"^";"01"
564788402;2;1;"^";"01"
564788402;3;1;"^";"01"
575438286;1;1;"^";"01"
575438286;1;1;"^";"01"
575438286;2;1;"37145859"
575438286;2;1;"37145859"
575438286;3;1;"37145859"
575438286;3;1;"37145859"
575439636;1;1;"^"
575439636;1;1;"^"
# let's say the ~100k line limit of file P is somewhere here and there are no more lines with ID 575439636, so we cut.
575440718;1;1;"^"
575440718;1;1;"^"
575440718;2;1;"10943890"
575440718;2;1;"10943890"
575440718;3;1;"10943890"
575440718;3;1;"10943890"
575441229;1;1;"^";"01"
575441229;1;1;"^";"01"
575441229;2;1;"4146986"
575441229;2;1;"4146986"
575441229;3;1;"4146986"
575441229;3;1;"4146986"
File T (empty lines for clarity):
CSV_FILE_T;HEADER;GOES;HERE
564788402;4030000;1;"0204"
575438286;6102000;1;"0408"
575438286;6102000;0;"0408"
575439636;7044010;1;"0408"
575439636;7044010;0;"0408"
# we must cut here since the bigger file "P" has reached its 100k limit
# and we end here because the lines with ID 575439636 are over.
575440718;6063000;1;"0408"
575440718;6063000;0;"0408"
575441229;8001001;0;"0408"
575441229;8001001;1;"0408"
Can you please help me split these two files into many separate files of about 100,000 lines each, T_001 and the corresponding P_001 file, and so on, so that IDs match between file parts? I believe awk will be the best tool, but I don't have much experience in this field. One last thing: the CSV header should be preserved in each of the files.
I have a powerful AIX machine to cope with this (Linux is also possible, since AIX commands are sometimes limited).
You can parse the leading IDs with awk and then check whether the current ID is the same as the last one. Only when it is different are you allowed to close the current output file and open a new one. At that point, record the ID for tracking the next file. You can track this ID in a text file or in memory. I've done it in memory, but with big files like this you could run into trouble. It's easier to keep track in memory than to open multiple files and read from them.
Then you just need to distinguish between the first file (output and recording) and the second file (output and using the prerecorded data).
The code does a very brute-force check on the possibility of a CRLF in a field: if the line does not begin with what looks like an ID, it outputs the line and does no further testing on it. That is a problem if the CRLF is followed immediately by a number and a semicolon! This might be unlikely, though...
Run with: gawk -f parser.awk P T
I don't promise this works!
BEGIN {
MAXLINES = 100000
block = 0
trackprevious = 0
}
FNR == 1 {
# First line is CSV header
csvheader = $0
if (FILENAME == "-")
{
_error = 1
print "Error: Need filename on command line"
exit 1
}
if (trackprevious)
{
_error = 1
print "Only one file can track another"
exit 1
}
if (block >= 1)
{
# New file - track previous output...
close(outputname)
Tracking[block] = idval
print "Output for " FILENAME " tracks previous file"
trackprevious = 1
}
else
{
print "Chunking output (" MAXLINES ") for " FILENAME
}
linecount = 0
idval = 0
block = 1
outputprefix = FILENAME "_block"
outputname = sprintf("%s_%03d", outputprefix, block)
print csvheader > outputname
next
}
/^[0-9]+;/ {
linecount++
newidval = $0
sub(/;.*$/, "", newidval)
newidval = newidval + 0 # make a number
startnewfile = 0
if (trackprevious && (idval != newidval) && (idval == Tracking[block]))
{
startnewfile = 1
}
else if (!trackprevious && (idval != newidval) && (linecount > MAXLINES))
{
# Last ID value found before new file:
Tracking[block] = idval
startnewfile = 1
}
if (startnewfile)
{
close(outputname)
block++
outputname = sprintf("%s_%03d", outputprefix, block)
print csvheader > outputname
linecount = 1
}
print $0 > outputname
idval = newidval
next
}
{
linecount++
print $0 > outputname
}
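After a run, a quick sanity check is to confirm that every ID in a T chunk also appears in the matching P chunk (assuming, as in your samples, that every T ID also occurs in P). A rough sketch using the P_block_NNN/T_block_NNN names produced above; it assumes bash for process substitution and ignores the multi-line-cell caveat mentioned earlier:
for t in T_block_*; do
  p="P_block_${t#T_block_}"
  # Print any ID present in the T chunk but missing from the matching P chunk,
  # skipping each file's header line; no output means the chunks agree.
  comm -23 <(tail -n +2 "$t" | cut -d';' -f1 | sort -u) \
           <(tail -n +2 "$p" | cut -d';' -f1 | sort -u)
done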