Pattern matching data sets without regexes

I have a large data set: a tcpdump capture file exported to text.
I want to find any unknown pattern in the data set and return the fields
and data sorted; basically, we want to look for unknown "patterns" in the data set.
I know we can use a regex with tools like awk and grep, but from a pure research perspective, I'm given 5 GB and 10 GB data sets in which I want to find unknown matching patterns.
They could come from any field or data structure. I've used Wireshark in various modes to sort structures.
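One regex-free approach that fits the research angle is frequency mining: count token n-grams across the dump and sort by count, so repeated but unknown structures float to the top. Below is a minimal sketch, assuming a whitespace-separated text export; the file name capture.txt and the n-gram length are placeholders, and for 5-10 GB inputs the in-memory map would need to spill to disk or be replaced by streaming counts.

import scala.io.Source

object NgramMiner {
  def main(args: Array[String]): Unit = {
    val n = 3 // length of the token n-grams to count
    val counts = scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)
    val src = Source.fromFile("capture.txt") // hypothetical tcpdump text export
    try {
      for {
        line <- src.getLines()
        gram <- line.split(' ').filter(_.nonEmpty).sliding(n)
      } counts(gram.mkString(" ")) += 1
    } finally src.close()
    // Most frequent n-grams first: these are candidate "unknown patterns"
    counts.toSeq.sortBy(-_._2).take(50).foreach {
      case (gram, c) => println(f"$c%10d  $gram")
    }
  }
}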

Related

ADF - Azure Data Factory multiple wildcard filtering

I have a condition where I have more than 2 types of files which I have to filter out. I can filter out one type using a wildcard, something like *.csv, but can't do something like *.xls, *.zip.
I have a pipeline which should convert csv, avro, and dat files into .parquet format, but the folder also has .zip, Excel, and PowerPoint files, and I want them to be filtered out. Instead of using 3-4 activities, is there any way I can use an (or) condition to filter out multiple extensions using the wildcard option of Data Factory?
Based on my test, dynamic content can't accept multiple wildcards or a regular expression.
You have to use multiple activities to match the different types of your files, or you could consider a workaround using a Lookup activity + ForEach activity:
1. The Lookup activity loads all the file names from the specific folder (Child Items).
2. Check the file format in the ForEach activity's condition, using the endswith built-in function (see the expression sketch below).
3. If the file format matches the filter condition, go into the True branch and configure it as the dynamic path of the dataset in the Copy activity.
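For step 2, the If Condition expression can combine several endswith checks with or (ADF's or() takes exactly two arguments, so the calls nest). A sketch, assuming the ForEach iterates over the Lookup's childItems and the extensions shown are examples:

@or(or(endswith(item().name, '.csv'), endswith(item().name, '.avro')), endswith(item().name, '.dat'))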

How does one split a large result set from a Group By into multiple flat files?

I'm far from an SSIS expert, and I'm attempting to correct an error (unspecified in the messages) that began once I modified a variable to increase the amount of data accumulated and exported into a flat file. (Note: the variable was a date in the WHERE clause that limited the data returned by the SELECT.)
So in the data flow there's a GROUP BY component, and I'm trying to find the appropriate component to put between it and the flat file destination component to chop up the results. I figured there'd be something to export, say, flatFile1.csv, flatFile2.csv, etc., based on a number of lines (so if I set a limit of 1 million lines and the results returned 3.5 million, I'd get 4 files with the last one containing half a million lines), or perhaps a max file size with similar results.
Which component should I use from the toolbox to guarantee a manageable file size?
Is a script component the only way to handle output of any size? If so, would it sit between the Group By and the Flat File destination components, or would the script completely obviate the need for the Flat File destination?
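A script destination can handle output of any size by rolling to a new file every N rows, in which case it replaces the Flat File destination entirely. SSIS script components are written in C# or VB.NET; the sketch below shows only the roll-over logic (in Scala purely for illustration), with the base name, the row limit, and the CSV formatting all as placeholders.

import java.io.{File, PrintWriter}

// Writes rows to baseName001.csv, baseName002.csv, ... switching files
// once maxRows rows have been written to the current one.
class RollingFlatFileWriter(baseName: String, maxRows: Long) {
  private var writer: PrintWriter = _
  private var rowsInFile = 0L
  private var fileIndex = 0

  def writeRow(row: String): Unit = {
    if (writer == null || rowsInFile >= maxRows) rollOver()
    writer.println(row)
    rowsInFile += 1
  }

  private def rollOver(): Unit = {
    if (writer != null) writer.close()
    fileIndex += 1
    writer = new PrintWriter(new File(f"$baseName$fileIndex%03d.csv"))
    rowsInFile = 0
  }

  def close(): Unit = if (writer != null) writer.close()
}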

Find matching patterns in a list of log files where the patterns are stored in array elements

I have a list of log files which all have the same pattern, for example:
"http://textuploader.com/d02at!"
As you can see, it's divided into various columns. I want to extract certain information from each column, i.e. the EBCDIC, BINARY, and TRACE HEADERS, and display it column-wise for each sequence.
I have already written a working script to do so:
http://textuploader.com/d0z0u
which generates the desired output in the following format:
EBCDIC Header info
"http://textuploader.com/d02aq!"
Binary Header Info
"http://textuploader.com/d02am!"
Trace Header parts
"http://textuploader.com/d02ap!"
Similar extraction happens for the other headers, based on the first column in the log file.
What I want to do is get away from so much "grep" work in the script and use some sort of array method to store all the attributes that I want to grep for,
then iterate over those array elements to extract the information.
Thanks
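One way to get rid of the repeated grep calls is exactly what you describe: keep the header names in an array and loop over it, printing the matching lines per pattern. A minimal sketch (in Scala here, though the same shape works as a shell array loop); the pattern list and the log file name are placeholders taken from the question:

import scala.io.Source

object HeaderExtractor {
  // The attributes to "grep" for, stored once instead of one call per header
  val patterns = Seq("EBCDIC", "BINARY", "TRACE")

  def main(args: Array[String]): Unit = {
    val src = Source.fromFile("sequence.log") // hypothetical log file name
    val lines = try src.getLines().toVector finally src.close()
    for (p <- patterns) {
      println(s"$p Header info")
      lines.filter(_.contains(p)).foreach(println)
    }
  }
}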

Flink's "TypeSerializerOutputFormat" writes weird binary data together

I use Flink to generate array data to be used by other applications. (I don't need any meta info for the array.)
I compared the binary data and the text data generated by Flink, and found weird data in the binary output.
import org.apache.flink.api.java.io.TypeSerializerOutputFormat
import org.apache.flink.core.fs.FileSystem.WriteMode
import org.apache.flink.core.fs.Path

val bin_output_format = new TypeSerializerOutputFormat[(Long, Long)]
bin_output_format.setWriteMode(WriteMode.OVERWRITE)
bin_output_format.setOutputFilePath(new Path(s"${outDir}/NAME_Binary"))

tuple_pair_list.map { tuple => tuple._1 + "\t" + tuple._2 }.writeAsText(s"${outDir}/NAME_TXT", WriteMode.OVERWRITE)
tuple_pair_list.output(bin_output_format)
How can I remove the meta info appended at the end of the binary file?
(It looks like the number of entries.)
Why is there some wrong data in it? Can I remove it? You can see the difference between the two in the following figure (two (127, -1, -1) tuples and one run of NULLs).
Am I missing something here?
Flink's TypeSerializerOutputFormat is designed to work together with the TypeSerializerInputFormat and allow for parallelized file scans. Flink uses its internal serializers for the binary encoding. Some of these serializers are based on external libraries such as Avro and Kryo. The encoding might change whenever the internal implementation of Flink's serializers (or the libraries they use) changes. Moreover, the output format aligns data at fixed block boundaries and uses padding if a record would span a boundary.
Hence, the output of the TypeSerializerOutputFormat is not meant to be consumed by other tools or readers. I would implement a custom OutputFormat instead.
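A minimal sketch of such a custom OutputFormat, assuming the consumers just want raw big-endian long pairs with no headers, padding, or trailing counts (the class name is made up; FileOutputFormat opens its protected stream field in super.open):

import java.io.DataOutputStream
import org.apache.flink.api.common.io.FileOutputFormat
import org.apache.flink.core.fs.Path

// Writes each (Long, Long) record as two raw big-endian longs: no block
// boundaries, no padding, and no record count at the end of the file.
class RawLongPairOutputFormat(path: Path) extends FileOutputFormat[(Long, Long)](path) {
  private var dataOut: DataOutputStream = _

  override def open(taskNumber: Int, numTasks: Int): Unit = {
    super.open(taskNumber, numTasks) // opens the underlying `stream`
    dataOut = new DataOutputStream(stream)
  }

  override def writeRecord(record: (Long, Long)): Unit = {
    dataOut.writeLong(record._1)
    dataOut.writeLong(record._2)
  }

  override def close(): Unit = {
    if (dataOut != null) dataOut.flush()
    super.close()
  }
}

It would be used in place of the TypeSerializerOutputFormat above, e.g. tuple_pair_list.output(new RawLongPairOutputFormat(new Path(s"${outDir}/NAME_Raw"))).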

Reading an edge list data set in Apache Giraph?

I'm using a SNAP dataset for social network analysis. SNAP uses a simple edge list as its data format. How do I read a SNAP dataset in Apache Giraph?
As far as I know, SNAP has various data formats depending on which dataset you are looking at. If the dataset you are looking at has the format sourceid destinationid on each line, then you might want to use IntNullTextEdgeInputFormat (it's in giraph-core/src/main/java/org/apache/giraph/io/formats).
Also take a look at the various predefined formats available in the same folder. If none of those fit your dataset's format, you can write your own input format class (it will be really simple if you start from the predefined formats and edit as you need).
use -eif org.apache.giraph.io.formats.IntNullTextEdgeInputFormat
Yes, SNAP uses the simple edge list format for representing graph data. You can use this code to convert it to a JSON format which is accepted by Apache Giraph.
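A minimal sketch of such a conversion, assuming the target is the line format read by Giraph's JsonLongDoubleFloatDoubleVertexInputFormat ([vertexId, vertexValue, [[targetId, edgeWeight], ...]]); the file names and the zero vertex/edge values are placeholders:

import java.io.PrintWriter
import scala.io.Source

object SnapToGiraphJson {
  def main(args: Array[String]): Unit = {
    val src = Source.fromFile("snap-edges.txt") // lines of "sourceid destinationid"
    val edges = try {
      src.getLines()
        .filterNot(l => l.startsWith("#") || l.trim.isEmpty) // skip SNAP comments
        .map { l => val Array(s, d) = l.trim.split("\\s+"); (s.toLong, d.toLong) }
        .toVector
    } finally src.close()

    val out = new PrintWriter("giraph-vertices.json")
    try {
      // One JSON line per source vertex; vertices without outgoing
      // edges are omitted in this sketch.
      for ((srcId, dsts) <- edges.groupBy(_._1)) {
        val edgeList = dsts.map { case (_, d) => s"[$d,0]" }.mkString(",")
        out.println(s"[$srcId,0,[$edgeList]]")
      }
    } finally out.close()
  }
}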
