How to search for a unique sequence in binary data?


I am trying to read a binary file with a header. I know that certain information is stored after a unique sequence 02 06 08 22 02 02 08 00. How can I find the position of this sequence?
I can use
String StreamReadAsText( ScriptObject stream, Number encoding, Number count )
to read the binary file one character at a time, but I guess that is rather clumsy and slow.
Besides, how do I compare the result from StreamReadAsText() when the output is not actual text (values between 00 and 1F in the ASCII table)?
Also, how do I read the binary file as int8 (the same size as a character in a string)? For example, read 02, then 06, then 08, etc.
Any help is welcome and appreciated.
Regards,
Roger

You are already on the right track with reading the file with the streaming commands. However, why would you want to read the stream as text? You can read the stream as any (supported) number type, using a TagGroup object as a proxy with TagGroupReadTagDataFromStream().
There is actually an example in the F1 help-section where the streaming commands are listed, which I'm just copying here.
Object stream = NewStreamFromBuffer( NewMemoryBuffer( 256 ) )
TagGroup tg = NewTagGroup();
Number stream_byte_order = 1; // 1 == bigendian, 2 == littleendian
Number v_uint32_0, v_uint32_1, v_sint32_0, v_uint16_0, v_uint16_1
// Create the tags and initialize with default values
tg.TagGroupSetTagAsUInt32( "UInt32_0", 0 )
tg.TagGroupSetTagAsUInt32( "UInt32_1", 0 )
tg.TagGroupSetTagAsLong( "SInt32_0", 0 )
tg.TagGroupSetTagAsUInt16( "UInt16_0", 0 )
tg.TagGroupSetTagAsUInt16( "UInt16_1", 0 )
// Stream the data into the tags
TagGroupReadTagDataFromStream( tg, "UInt32_0", stream, stream_byte_order );
TagGroupReadTagDataFromStream( tg, "UInt32_1", stream, stream_byte_order );
TagGroupReadTagDataFromStream( tg, "SInt32_0", stream, stream_byte_order );
TagGroupReadTagDataFromStream( tg, "UInt16_0", stream, stream_byte_order );
TagGroupReadTagDataFromStream( tg, "UInt16_1", stream, stream_byte_order );
// Show the taggroup, if you want
// tg.TagGroupOpenBrowserWindow("AuxTags",0)
// Get the data from the tags
tg.TagGroupGetTagAsUInt32( "UInt32_0", v_uint32_0 )
tg.TagGroupGetTagAsUInt32( "UInt32_1", v_uint32_1 )
tg.TagGroupGetTagAsLong( "SInt32_0", v_sint32_0 )
tg.TagGroupGetTagAsUInt16( "UInt16_0", v_uint16_0 )
tg.TagGroupGetTagAsUInt16( "UInt16_1", v_uint16_1 )
There is already a post on this site about searching for a pattern within a stream: Find a pattern image (binary file)
This shows how you would use a stream to look in an image, but you can of course use the file stream directly.
As an alternative, you can read a whole array from the stream with ImageReadImageDataFromStream after preparing a suitable image beforehand.
You can then use image operations to search for the location. This would be an example:
// Example of reading the first X bytes of a file
// as uInt16 data
image ReadHeaderAsUint16( string filepath, number nBytes )
{
number kEndianness = 0 // Default byte order of the current platform
if ( !DoesFileExist( filePath ) )
Throw( "File '" + filePath + "' not found." )
number fileID = OpenFileForReading( filePath )
object fStream = NewStreamFromFileReference( fileID, 1 )
if ( nBytes > fStream.StreamGetSize() )
Throw( "File '" + filePath + "' has fewer than " + nBytes + " bytes." )
image buff := IntegerImage( "Header", 2, 0, nBytes/2 ) // UINT16 array of suitable size
ImageReadImageDataFromStream( buff, fStream, kEndianness )
return buff
}
number FindSignature( image header, image search )
{
// 1D images only
if ( ( header.ImageGetNumDimensions() != 1 ) \
|| ( search.ImageGetNumDimensions() != 1 ) )
Throw( "Only 1D images supported" )
number sx = search.ImageGetDimensionSize( 0 )
number hx = header.ImageGetDimensionSize( 0 )
if ( hx < sx )
return -1
// Create a mask of possible start locations
number startV = search.getPixel( 0, 0 )
image mask = (header == startV) ? 1 : 0
// Check every occurrence of the first value
number mx, my
while( max( mask, mx, my ) )
{
// abs() so that positive and negative differences cannot cancel out
if ( 0 == sum( abs( header[0,mx,1,mx+sx] - search ) ) )
return mx
else
mask.SetPixel( mx, 0, 0)
}
return -1
}
// Example
// 1) Load file header as image (up to the size you want )
string path = GetApplicationDirectory( "open_save", 0 )
number maxHeaderSize = 200
if ( !OpenDialog( NULL, "Select file to open", path, path ) ) Exit(0)
image headerImg := ReadHeaderAsUint16( path, maxHeaderSize )
headerImg.ShowImage()
// 2) define search-header as image (one value per UInt16 element of the header image)
image search := [8]: { 02, 06, 08, 22, 02, 02, 08, 00 }
// MatrixPrint( search )
// 3) search for it in the header
number foundAt = FindSignature( headerImg, search )
if ( -1 == foundAt )
Throw( "The file header does not contain the search pattern." )
else
OKDialog( "Found the search pattern at offset: " + foundAt * 2 + " bytes" ) // 2 bytes per UInt16 element

If you're on a modern machine, just load the file into memory, then scan for the sequence using a memory comparison function and a travelling index.
It's not the most memory efficient way of doing things or even the fastest, but it's easy and fast enough, assuming you have resources to burn.
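As a rough illustration of this "load it all, then scan" idea, here is a minimal Python sketch (not DM-script). It assumes the byte values in the question are hexadecimal; `find_signature` and `find_signature_manual` are hypothetical helper names, and the `sample` buffer stands in for the real file contents.

```python
def find_signature(data, signature, start=0):
    """Return the byte offset of the first match at or after start, or -1."""
    return data.find(signature, start)


def find_signature_manual(data, signature):
    """Same search written out as a travelling index plus a slice comparison."""
    n = len(signature)
    for i in range(len(data) - n + 1):
        if data[i:i + n] == signature:
            return i
    return -1


# The sequence from the question, assumed here to be hexadecimal byte values.
SIGNATURE = bytes.fromhex("02 06 08 22 02 02 08 00")

# In-memory stand-in for the file; a real file could be loaded with
# open(path, "rb").read() instead.
sample = b"\x00\x01HEADER" + SIGNATURE + b"payload"

offset = find_signature(sample, SIGNATURE)
print(offset)
```

The data of interest would then start at `offset + len(SIGNATURE)`, provided the search did not return -1.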

Related

Fast Cumulative Sum?

The menu command "Volume > Projection > Project Along Z" is really fast compared to scripting (even with intrinsic variables). The cumulative sum (projection) of a 3D image volume of 512x512x200 in the z-direction takes <0.5 s, compared to >8 s using a script. Is there direct access to this function from script other than using ChooseMenuItem()?
Script example showing the difference:
// create an image of 512x512x200, assign random numbers
image img := exprsize(512, 512, 200, random());
img.SetName( "test" );
img.ShowImage();
//
number start_tick, end_tick, calc_time;
// do volume projection with menu command: Volume > Project > Project Along Z
start_tick = GetHighResTickCount();
ChooseMenuItem( "Volume", "Project", "Project Along Z"); // do z-projection
end_tick = GetHighResTickCount();
// calculate execution time
calc_time = CalcHighResSecondsBetween( start_tick, end_tick );
// display result image
Image img_projZ1 := GetFrontImage();
img_projZ1.SetName( "Z-proj.#1 (" + calc_time.format("%.2fs") + ")");
img_projZ1.ShowImage();
// do volume projection in z-direction (using intrinsic variable iplane)
image img_projZ2 := exprsize(512, 512, 0.0);
start_tick = GetHighResTickCount();
img_projZ2[icol, irow, iplane] += img; // do z-projection
end_tick = GetHighResTickCount();
// calculate execution time
calc_time = CalcHighResSecondsBetween( start_tick, end_tick );
// display result image
img_projZ2.SetName( "Z-projection#2 (" + calc_time.format("%.2fs") + ")");
img_projZ2.ShowImage();

Using intrinsic variables is not the fastest way to go about this in scripting. (It used to be in GMS 1 a long time ago.)
In fact, if you do it as a for loop over slices, you are a bit faster than with the menu command, most likely due to the overhead of calling that command and tagging the results.
// create an image of 512x512x200, assign random numbers
number sx = 512
number sy = 512
number sz = 200
image img := RealImage("",8,sx,sy,sz)
img = random();
img.SetName( "test" );
img.ShowImage();
number start_tick, end_tick, calc_time;
// do volume projection with menu command: Volume > Project > Project Along Z
start_tick = GetHighResTickCount();
ChooseMenuItem( "Volume", "Project", "Project Along Z"); // do z-projection
end_tick = GetHighResTickCount();
// calculate execution time
calc_time = CalcHighResSecondsBetween( start_tick, end_tick );
// display result image
Image img_projZ1 := GetFrontImage();
img_projZ1.SetName( "Z-proj.#1 (" + calc_time.format("%.2fs") + ")");
img_projZ1.ShowImage();
// do volume projection in z-direction (using a for loop over slices)
image img_projZ2 := img_projZ1.ImageClone()
img_projZ2 = 0
start_tick = GetHighResTickCount();
for(number i=0; i<sz; i++)
img_projZ2 += img.slice2(0,0,i,0,sx,1,1,sy,1)
end_tick = GetHighResTickCount();
// calculate execution time
calc_time = CalcHighResSecondsBetween( start_tick, end_tick );
// display result image
img_projZ2.SetName( "Z-projection#2 (" + calc_time.format("%.2fs") + ")");
img_projZ2.ShowImage();
However, to answer your question about a direct command: GMS 3.4 has a command called Project:
RealImage Project( BasicImage img, Number axis )
RealImage Project( BasicImage img, Number axis, Boolean rescale )
void Project( BasicImage img, BasicImage dst, Number axis )
void Project( BasicImage img, BasicImage dst, Number axis, Boolean rescale )
But this command is not officially documented, so it might be renamed/removed at any time.
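To make explicit what the slice-by-slice loop computes, here is a small pure-Python sketch of a z-projection. It only illustrates the arithmetic and says nothing about DigitalMicrograph performance; `project_along_z` is a hypothetical helper and the volume is assumed to be a list of equally sized 2D slices.

```python
def project_along_z(volume):
    """Sum a volume, given as a list of 2D slices (lists of rows), into one 2D image."""
    rows = len(volume[0])
    cols = len(volume[0][0])
    projection = [[0.0] * cols for _ in range(rows)]
    for plane in volume:  # one pass per z-slice, like the for loop over slices
        for y in range(rows):
            row_out = projection[y]
            row_in = plane[y]
            for x in range(cols):
                row_out[x] += row_in[x]
    return projection


# Tiny 2x2x3 volume so the result can be checked by hand.
volume = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
    [[100, 200], [300, 400]],
]
result = project_along_z(volume)
print(result)
```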

Inserting multiple (2) digital signatures: I found 3 info dictionaries in the PDF

I have the same problem (developing with PoDoFo in C++). After inserting multiple (two) digital signatures, I found that there are three info dictionaries in the PDF file:
How can I add two digital signatures without invalidating the previous one?
Thanks!
I opened the file in Notepad++ and found these differences:
the first: 97 0 obj<</Title(? G I S e r C l u bThR\n 2 0 1 4 0 7 2 0) /Author(edison qian) /Keywords(GISerClub) /Creator(? M i c r o s o f t ? W o r d 2 0 1 3)
/CreationDate(D:20150601200942+08'00') /ModDate(D:20150601200942+08'00') /Producer(? M i c r o s o f t ? W o r d 2 0 1 3) >>
the second: 97 0 obj<</Author(edison qian)/CreationDate(D:20150601200942+08'00')/Creator(? M i c r o s o f t ? W o r d 2 0 1 3)/Keywords(GISerClub)
/ModDate(D:20190426155330+08'00')/Producer(? M i c r o s o f t ? W o r d 2 0 1 3)/Title(? G I S e r C l u bThR\n 2 0 1 4 0 7 2 0)>>
the third: 97 0 obj<</Author(edison qian)/CreationDate(D:20150601200942+08'00')/Creator(? M i c r o s o f t ? W o r d 2 0 1 3)/Keywords(GISerClub)
/ModDate(D:20190426155428+08'00')/Producer(? M i c r o s o f t ? W o r d 2 0 1 3)/Title(? G I S e r C l u bThR\n 2 0 1 4 0 7 2 0)>>
My code:
bool pdfSign(PdfMemDocument* document,PdfOutputDevice* outputDevice,PKCS12* p12,RSA* rsa,int npage,PdfRect rect,int min_signature_size,const char* ImgFile/*,PdfDate& sigData*/)
{
PdfInfo* pInfo = document->GetInfo();
TKeyMap itm = pInfo->GetObject()->GetDictionary().GetKeys();
PdfObject* pobj = pInfo->GetObject()->GetDictionary().GetKey(PdfName("ModDate"));
PdfString modDate = pobj->GetString();
string sDate = modDate.GetString();
string sutf8Date = modDate.GetStringUtf8();
PdfOutlines* pOutLine = document->GetOutlines();
TKeyMap itm2 = pOutLine->GetObject()->GetDictionary().GetKeys();
const char *field_name = NULL;
bool field_use_existing = false;
int annot_page = npage;
//double annot_left = 80.0, annot_top = 70.0, annot_width = 150.0, annot_height = 150.0;
bool annot_print = true;
const char *reason = "I agree";
int result = 0;
PdfSignatureField *pSignField = NULL;
try
{
PdfSignOutputDevice signer( outputDevice );
PdfAcroForm* pAcroForm = document->GetAcroForm();
if( !pAcroForm )
PODOFO_RAISE_ERROR_INFO( ePdfError_InvalidHandle, "acroForm == NULL" );
if( !pAcroForm->GetObject()->GetDictionary().HasKey( PdfName( "SigFlags" ) ) ||
!pAcroForm->GetObject()->GetDictionary().GetKey( PdfName( "SigFlags" ) )->IsNumber() ||
pAcroForm->GetObject()->GetDictionary().GetKeyAsLong( PdfName( "SigFlags" ) ) != 3 )
{
if( pAcroForm->GetObject()->GetDictionary().HasKey( PdfName( "SigFlags" ) ) )
pAcroForm->GetObject()->GetDictionary().RemoveKey( PdfName( "SigFlags" ) );
pdf_int64 val = 3;
pAcroForm->GetObject()->GetDictionary().AddKey( PdfName( "SigFlags" ), PdfObject( val ) );
}
if( pAcroForm->GetNeedAppearances() )
{
#if 0 /* TODO */
update_default_appearance_streams( pAcroForm );
#endif
pAcroForm->SetNeedAppearances( false );
}
PdfString name;
PdfObject* pExistingSigField = NULL;
PdfImage image( document );
image.LoadFromFile( ImgFile );
double dimgWidth = image.GetWidth();
double dimgHeight = image.GetHeight();
char fldName[96]; // use bigger buffer to make sure sprintf does not overflow
sprintf( fldName, "PodofoSignatureField%" PDF_FORMAT_INT64, static_cast<pdf_int64>( document->GetObjects().GetObjectCount() ) );
name = PdfString( fldName );
PdfPage* pPage = document->GetPage( annot_page );
if( !pPage )
PODOFO_RAISE_ERROR( ePdfError_PageNotFound );
double dPageHeight = pPage->GetPageSize().GetHeight();
double dPageWidth = pPage->GetPageSize().GetWidth();
PdfRect annot_rect;
annot_rect = PdfRect( rect.GetLeft(),
pPage->GetPageSize().GetHeight() - rect.GetBottom() - rect.GetHeight(),
dimgWidth,
dimgHeight );
PdfAnnotation* pAnnot = pPage->CreateAnnotation( ePdfAnnotation_Widget, annot_rect );
if( !pAnnot )
PODOFO_RAISE_ERROR_INFO( ePdfError_OutOfMemory, "Cannot allocate annotation object" );
if( annot_print )
pAnnot->SetFlags( ePdfAnnotationFlags_Print );
else if( !field_name || !field_use_existing )
pAnnot->SetFlags( ePdfAnnotationFlags_Invisible | ePdfAnnotationFlags_Hidden );
PdfPainter painter;
try
{
painter.SetPage( /*&sigXObject*/pPage );
/* Workaround Adobe's reader error 'Expected a dict object.' when the stream
contains only one object which does Save()/Restore() on its own, like
the image XObject. */
painter.Save();
painter.Restore();
draw_annotation( *document, painter, image, annot_rect );
}
catch( PdfError & e )
{
if( painter.GetPage() )
{
try
{
painter.FinishPage();
}
catch( ... )
{
}
}
}
painter.FinishPage();
//pSignField = new PdfSignatureField( pAnnot, pAcroForm, document );
pSignField = new PdfSignatureField( pPage, annot_rect, document );
if( !pSignField )
PODOFO_RAISE_ERROR_INFO( ePdfError_OutOfMemory, "Cannot allocate signature field object" );
PdfRect annotSize( 0.0, 0.0, dimgWidth, dimgHeight );
PdfXObject sigXObject( annotSize, document );
pSignField->SetAppearanceStream( &sigXObject );
// use large-enough buffer to hold the signature with the certificate
signer.SetSignatureSize( min_signature_size );
pSignField->SetFieldName( name );
pSignField->SetSignatureReason( PdfString( reinterpret_cast<const pdf_utf8 *>( reason ) ) );
pSignField->SetSignatureDate( /*sigData*/PdfDate() );
pSignField->SetSignature( *signer.GetSignatureBeacon() );
pSignField->SetBackgroundColorTransparent();
pSignField->SetBorderColorTransparent();
// The outPdfFile != NULL means that the write happens to a new file,
// which will be truncated first and then the content of the srcPdfFile
// will be copied into the document, followed by the changes.
//signer.Seek(0);
document->WriteUpdate( &signer, true );
if( !signer.HasSignaturePosition() )
PODOFO_RAISE_ERROR_INFO( ePdfError_SignatureError, "Cannot find signature position in the document data" );
// Adjust ByteRange for signature
signer.AdjustByteRange();
// Read data for signature and count it
// We seek at the beginning of the file
signer.Seek( 0 );
sign_with_signer( signer, g_x509, g_pKey );
signer.Flush();
}
catch( PdfError & e )
{
}
if( pSignField )
delete pSignField;
}
I use the code above twice, and the first signature becomes invalid.
How can I add two digital signatures without invalidating the previous one?

Painting on the right PdfCanvas
After analyzing the example pdf the reason why your second signature invalidated your first one became clear: In the course of signing you change the page content of the page with the widget annotation of the signature.
But changing the content of any page invalidates any previous signature! Cf. this answer for details on allowed and disallowed changes of signed documents.
Indeed:
PdfPainter painter;
try
{
painter.SetPage( /*&sigXObject*/pPage );
/* Workaround Adobe's reader error 'Expected a dict object.' when the stream
contains only one object which does Save()/Restore() on its own, like
the image XObject. */
painter.Save();
painter.Restore();
draw_annotation( *document, painter, image, annot_rect );
}
Apparently you change something in the page content itself here. When this code is executed while applying the second signature, the first signature is invalidated.
You confirmed in a comment:
i use '&sigXObject' instead of 'pPage ',All two signatures are working! but the red seal disappeared
Using the right coordinates
Concerning your observation that the red seal disappeared: You use the wrong coordinates for drawing the image on the annotation appearance!
You use coordinates for the page coordinate system, but you have to use the coordinates in the coordinate system given by the appearance's bounding box.
Thus, your
painter.DrawImage( annot_rect.GetLeft(), annot_rect.GetBottom(), &image );
is wrong, instead try something like
painter.DrawImage( 0, 0, &image );
as the bounding box of your appearance is
[ 0 0 151 151 ]
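The difference between the two coordinate systems can be sketched in a few lines of Python. This is only an illustration, not PoDoFo code: `Rect`, `appearance_bbox` and `page_to_appearance` are hypothetical names, the placement values are made up, and the sketch assumes the appearance bounding box starts at (0, 0) with no additional transformation matrix.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    left: float
    bottom: float
    width: float
    height: float


def appearance_bbox(annot_rect):
    """Bounding box of the appearance: same size as the annotation, origin at (0, 0)."""
    return Rect(0.0, 0.0, annot_rect.width, annot_rect.height)


def page_to_appearance(x, y, annot_rect):
    """Convert a point given in page coordinates to appearance-local coordinates."""
    return (x - annot_rect.left, y - annot_rect.bottom)


# Hypothetical placement of the widget annotation on the page.
annot_rect = Rect(left=80.0, bottom=621.0, width=151.0, height=151.0)

bbox = appearance_bbox(annot_rect)
origin = page_to_appearance(annot_rect.left, annot_rect.bottom, annot_rect)
print(bbox, origin)
```

The lower-left corner of the annotation on the page maps to (0, 0) inside the appearance, which is why the image has to be drawn at (0, 0) there.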

Unable to add this custom metric to Amibroker backtest report

I would like to add an extra column to indicate volatility in the backtest report.
Here is my code. The extra column volatility_recent appears but no value appears in the column. However, if I were to use the commented line trade.AddCustomMetric( "proceeds", trade.Shares*trade.ExitPrice );, some numerical value appears in the column.
What is wrong with the code?
if ( Status( "action" ) == actionPortfolio )
{
bo = GetBacktesterObject();
// run default backtest procedure without generating the trade list
bo.Backtest( True );
volatility_recent = ATR(30);
// iterate through closed trades
for ( trade = bo.GetFirstTrade( ); trade; trade = bo.GetNextTrade( ) )
{
trade.AddCustomMetric( "volatility_recent", volatility_recent );
//trade.AddCustomMetric( "proceeds", trade.Shares*trade.ExitPrice );
}
// iterate through open positions
for ( trade = bo.GetFirstOpenPos( ); trade; trade = bo.GetNextOpenPos( ) )
{
trade.AddCustomMetric( "volatility_recent", volatility_recent );
//trade.AddCustomMetric( "proceeds", trade.Shares*trade.ExitPrice );
}
// generate trade list
bo.ListTrades( );
}

I'm finding it really interesting that you copy text and code solution by others line by line without giving reference.
Your second post here at stackoverflow is line by line copy of responses by Tomasz and me to you at forum.amibroker.com
https://forum.amibroker.com/t/unable-to-add-this-custom-metric-to-backtest-report/7153

Custom metric needs to be scalar (number), not array. ATR(30) is an array. So, use LastValue to get last value of array or Lookup to get the value at specified bar. Pass ATR array of symbol from 1st phase to 2nd phase of backtest via static variables. Then in custom metric line use lookup to extract array element at certain date time (trade.EntryDateTime or trade.ExitDateTime).
StaticVarSet( "CBT_ATR_" + Name(), ATR(30) );
if ( Status( "action" ) == actionPortfolio )
{
bo = GetBacktesterObject();
// run default backtest procedure without generating the trade list
bo.Backtest( True );
// iterate through closed trades
for ( trade = bo.GetFirstTrade( ); trade; trade = bo.GetNextTrade( ) )
{
trade.AddCustomMetric( "volatility_recent", Lookup( StaticVarGet( "CBT_ATR_" + trade.Symbol ), trade.ExitDateTime ) );
//trade.AddCustomMetric( "proceeds", trade.Shares*trade.EntryPrice );
}
// iterate through open positions
for ( trade = bo.GetFirstOpenPos( ); trade; trade = bo.GetNextOpenPos( ) )
{
trade.AddCustomMetric( "volatility_recent", Lookup( StaticVarGet( "CBT_ATR_" + trade.Symbol ), trade.ExitDateTime ) );
//trade.AddCustomMetric( "proceeds", trade.Shares*trade.EntryPrice );
}
// generate trade list
bo.ListTrades( );
}
EDIT: The credit goes to fxshrat who posted the answer at https://forum.amibroker.com/t/unable-to-add-this-custom-metric-to-backtest-report/7153/2
His answer was posted here and it was rude to post without references. Apologies to fxshrat and Tomasz.
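The underlying idea (store a whole per-symbol series in the first phase, then reduce it to a single number per trade in the second phase) can be sketched in Python. This is not AFL and is not a claim about how AmiBroker's Lookup matches dates: `static_var_set`, `static_var_get` and `value_at` are hypothetical stand-ins, `value_at` is assumed to pick the last bar at or before the requested date, and the data is made up.

```python
from bisect import bisect_right

static_vars = {}  # stand-in for the static variable storage shared between phases


def static_var_set(name, series):
    static_vars[name] = series


def static_var_get(name):
    return static_vars[name]


def value_at(series, when):
    """Return the value of the last bar at or before `when`, or None if there is none."""
    dates = [d for d, _ in series]
    i = bisect_right(dates, when)
    return series[i - 1][1] if i else None


# Phase 1: store the whole series (an "array") per symbol.
static_var_set("CBT_ATR_ABC", [("2019-01-02", 1.5), ("2019-01-03", 1.7), ("2019-01-04", 1.6)])

# Phase 2: each trade needs one scalar, taken at its exit date.
trade = {"symbol": "ABC", "exit_date": "2019-01-03"}
metric = value_at(static_var_get("CBT_ATR_" + trade["symbol"]), trade["exit_date"])
print(metric)
```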

NAudio 4000Hz WAV?

I am attempting to use the NAudio library as shown below. When I have a WAV file saved as mono, 4 kHz, the AudioBytesOriginal array contains all zeroes. The file does play when double-clicked in Windows, so the data is there. It also plays in Audacity.
using ( var waveFileReader = new WaveFileReader( FileNameIn ) )
{
var thisIsWhat = waveFileReader.WaveFormat; // reports as 8KHz
AudioBytesOriginal = new byte[waveFileReader.Length];
int read = waveFileReader.Read( AudioBytesOriginal , 0 , AudioBytesOriginal.Length );
short[] sampleBuffer = new short[read/2];
Buffer.BlockCopy( AudioBytesOriginal , 0 , sampleBuffer , 0 , read );
}
I need the extremely low sample rate for playback on a limited device, but am using .NET Framework 4.6.1 with NAudio to handle the byte work.
Thanks.

A couple of things to check:
1) What is the value of read? Is it 0?
2) How far into sampleBuffer did you check? Even half a second of silence at the start of an audio file will result in several thousand samples with a value of 0.
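Both checks can be tried outside NAudio. The following Python sketch uses the standard `wave` module and assumes 16-bit mono PCM; `first_nonzero_sample` is a hypothetical helper, and the demo file (half a second of silence at 4 kHz followed by a short tone level) is generated on the fly rather than taken from the question.

```python
import os
import struct
import tempfile
import wave


def first_nonzero_sample(path):
    """Return (sample rate, bytes read, index of first non-zero 16-bit sample or -1)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    count = len(raw) // 2
    samples = struct.unpack("<%dh" % count, raw[:count * 2])
    for i, value in enumerate(samples):
        if value != 0:
            return rate, len(raw), i
    return rate, len(raw), -1


# Build a small 4 kHz mono test file: 2000 silent samples, then 100 samples of 1000.
path = os.path.join(tempfile.mkdtemp(), "demo_4khz.wav")
with wave.open(path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(4000)
    out.writeframes(struct.pack("<2000h", *([0] * 2000)))
    out.writeframes(struct.pack("<100h", *([1000] * 100)))

rate, bytes_read, first = first_nonzero_sample(path)
print(rate, bytes_read, first)
```

If the byte count is non-zero but the first non-zero sample sits far into the buffer, the file simply starts with silence.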

Splitting two large CSV files preserving relations between file A and B across the resulting files

I have TWO large CSV files (around 1 GB). They are related to each other (ID is, let's say, like a foreign key). The structure is simple, line by line, but CSV cells with a line break in the value string can appear:
37373;"SOMETXT-CRCF or other other line break-";3838383;"sasa ssss"
One file is the P file and the other is the T file. T is about 70% of the size of P (P > T). I must cut them into smaller parts since they are too big for the program I have to import them into. I cannot simply use split -l 100000 since I would lose the ID=ID relations, which must be preserved! The relation can be 1:1, 2:3, 4:6 or 1:5. So naive file splitting is not an option; we must check the place where we create a new file. Below is an example with a simplified CSV structure and the place where I want the files to be cut (the lines above go to a separate P|T__00x file, and we continue until P or T ends). Lines are sorted in both files, so there is no need to search for IDs across the whole file!
File "P" (empty lines for clearness):
CSV_FILE_P;HEADER;GOES;HERE
564788402;1;1;"^";"01"
564788402;2;1;"^";"01"
564788402;3;1;"^";"01"
575438286;1;1;"^";"01"
575438286;1;1;"^";"01"
575438286;2;1;"37145859"
575438286;2;1;"37145859"
575438286;3;1;"37145859"
575438286;3;1;"37145859"
575439636;1;1;"^"
575439636;1;1;"^"
# lets say ~100k line limit of file P is somewhere here and no more 575439636 ID lines , so we cut.
575440718;1;1;"^"
575440718;1;1;"^"
575440718;2;1;"10943890"
575440718;2;1;"10943890"
575440718;3;1;"10943890"
575440718;3;1;"10943890"
575441229;1;1;"^";"01"
575441229;1;1;"^";"01"
575441229;2;1;"4146986"
575441229;2;1;"4146986"
575441229;3;1;"4146986"
575441229;3;1;"4146986"
File T (empty lines for clearness)
CSV_FILE_T;HEADER;GOES;HERE
564788402;4030000;1;"0204"
575438286;6102000;1;"0408"
575438286;6102000;0;"0408"
575439636;7044010;1;"0408"
575439636;7044010;0;"0408"
# we must cut here since bigger file "P" 100k limit has been reached
# and we end here because 575439636 ID lines are over.
575440718;6063000;1;"0408"
575440718;6063000;0;"0408"
575441229;8001001;0;"0408"
575441229;8001001;1;"0408"
Can you please help me split those two files into many separate files of 100 000 (or so) lines each, T_001 with a corresponding P_001 and so on, so that the IDs match between file parts? I believe awk will be the best tool, but I do not have much experience in this field. One last thing: the CSV header should be preserved in each of the files.
I have a powerful AIX machine to cope with that (Linux is also possible, since AIX commands are sometimes limited).

You can parse the leading IDs with awk and then check whether the current ID is the same as the last one. Only when it is different are you allowed to close the current output file and open a new one. At that point, record the ID for tracking the next file. You can track this ID in a text file or in memory. I've done it in memory, but with big files like this you could run into trouble. It's easier to keep track in memory than to open multiple files and read from them.
Then you just need to distinguish between the first file (output and recording) and the second file (output and using the prerecorded data).
The code does a very brute force check on the possibility of a CRLF in a field - if the line does not begin with what looks like an ID, then it outputs the line and does no further testing on it. Which is a problem if the CRLF is followed immediately by a number and semicolon! This might be unlikely though...
Run with: gawk -f parser.awk P T
I don't promise this works!
BEGIN {
MAXLINES = 100000
block = 0
trackprevious = 0
}
FNR == 1 {
# First line is CSV header
csvheader = $0
if (FILENAME == "-")
{
_error = 1
print "Error: Need filename on command line"
exit 1
}
if (trackprevious)
{
_error = 1
print "Only one file can track another"
exit 1
}
if (block >= 1)
{
# New file - track previous output...
close(outputname)
Tracking[block] = idval
print "Output for " FILENAME " tracks previous file"
trackprevious = 1
}
else
{
print "Chunking output (" MAXLINES ") for " FILENAME
}
linecount = 0
idval = 0
block = 1
outputprefix = FILENAME "_block"
outputname = sprintf("%s_%03d", outputprefix, block)
print csvheader > outputname
next
}
/^[0-9]+;/ {
linecount++
newidval = $0
sub(/;.*$/, "", newidval)
newidval = newidval + 0 # make a number
startnewfile = 0
if (trackprevious && (idval != newidval) && (idval == Tracking[block]))
{
startnewfile = 1
}
else if (!trackprevious && (idval != newidval) && (linecount > MAXLINES))
{
# Last ID value found before new file:
Tracking[block] = idval
startnewfile = 1
}
if (startnewfile)
{
close(outputname)
block++
outputname = sprintf("%s_%03d", outputprefix, block)
print csvheader > outputname
linecount = 1
}
print $0 > outputname
idval = newidval
next
}
{
linecount++
print $0 > outputname
}
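The same two-pass idea can be written as a short Python sketch for experimenting with the cut logic on small inputs. It is a simplification under stated assumptions: it works on lists of record lines already in memory (header line removed), it ignores line breaks inside quoted fields, and it relies on both files being sorted by ID as described in the question. `record_id`, `split_primary` and `split_secondary` are hypothetical names; writing each chunk to its own file with the CSV header on top is left out.

```python
def record_id(line):
    """Leading ID of a record line such as '575438286;1;1;...'."""
    return int(line.split(";", 1)[0])


def split_primary(lines, max_lines):
    """Split the first file into chunks, cutting only where the ID changes.

    Returns (chunks, last_ids), where last_ids[i] is the last ID in chunks[i].
    """
    chunks, current = [], []
    for line in lines:
        limit_reached = len(current) >= max_lines
        if current and limit_reached and record_id(line) != record_id(current[-1]):
            chunks.append(current)
            current = []
        current.append(line)
    if current:
        chunks.append(current)
    return chunks, [record_id(chunk[-1]) for chunk in chunks]


def split_secondary(lines, last_ids):
    """Split the second file so chunk i holds the IDs up to and including last_ids[i]."""
    chunks = [[] for _ in last_ids] or [[]]
    block = 0
    for line in lines:
        # Move on to the next chunk once the ID passes the recorded cut point.
        while block < len(last_ids) - 1 and record_id(line) > last_ids[block]:
            block += 1
        chunks[block].append(line)
    return chunks


p_lines = ["1;a", "1;b", "2;a", "2;b", "2;c", "3;a", "4;a"]
t_lines = ["1;x", "2;x", "2;y", "3;x", "4;x", "4;y"]

p_chunks, cut_ids = split_primary(p_lines, max_lines=3)
t_chunks = split_secondary(t_lines, cut_ids)
print(cut_ids, p_chunks, t_chunks)
```

With a limit of 3 lines, the first P chunk still grows to 5 lines because ID 2 must not be split, and the T file is then cut after the same ID.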
