Parsing through many files and extracting data to a new file?


OK! My grey hairs have started popping out because of this.
I have 400 PDF files from which I want to extract a line. The line starts with DIR, followed by a number. I also need the file name!
Does anyone know a way to parse through the PDFs (or I can convert them to txt), search for a term, expand the match, append the file name to it, and save the result into a new file?
Any help will be greatly appreciated!
Thanks,
Tor

You can use the iText library to open the PDFs.
Then you will need to scan each PDF for your pattern.
The library is available at www.itextpdf.com
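If the PDFs are first converted to txt, as the question suggests is possible, the scanning step can be sketched without any PDF library. This is only a minimal Python sketch, not iText code: the function name `extract_dir_lines`, the tab-separated output layout, and the assumption that a matching line starts with `DIR` followed by digits (with optional whitespace in between) are all illustrative choices.

```python
import re
from pathlib import Path

# Assumption: a matching line starts with "DIR", optional whitespace, then digits.
DIR_LINE = re.compile(r"^DIR\s*\d+")


def extract_dir_lines(folder, out_file):
    """Collect 'file name<TAB>line' for every DIR line found in folder/*.txt."""
    out_path = Path(out_file)
    rows = []
    for txt in sorted(Path(folder).glob("*.txt")):
        # Skip the output file in case it lives in the scanned folder.
        if txt.resolve() == out_path.resolve():
            continue
        with txt.open(encoding="utf-8", errors="replace") as fh:
            for line in fh:
                line = line.strip()
                if DIR_LINE.match(line):
                    rows.append(f"{txt.name}\t{line}")
    # Write one result per line; the file is left empty if nothing matched.
    out_path.write_text("\n".join(rows) + ("\n" if rows else ""), encoding="utf-8")
    return rows
```

A call such as `extract_dir_lines("converted_txt", "dir_lines.tsv")` (both names hypothetical) would then produce one line per match, each prefixed with the name of the file it came from.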

Related

Search bulk files list

I need your help with a problem that I believe could be solved with a batch script in PowerShell. I hope my explanation is clear enough.
I have a text file (list.txt) containing a list of more than a hundred file names (company.doc, goods.xls, prices.xls and so on).
Now I need to check whether these files exist on drive L and write the search results to a file named search.txt, like this:
company.doc FOUND L:\documents\december
goods.xls NOT FOUND
prices.xls FOUND L:\budget\2022
Thank you in advance for your help
MM
I tried to find a solution but could not find one that works.
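One possible approach, sketched here in Python rather than the PowerShell the question asks for: walk the drive once, remember where each file name was first seen, then look up every name from list.txt. The function name `search_files` and the case-insensitive matching are assumptions made for illustration; the output format follows the example in the question.

```python
import os


def search_files(list_path, root, out_path):
    """Report FOUND <folder> or NOT FOUND for each file name listed in list_path."""
    with open(list_path, encoding="utf-8") as fh:
        wanted = [line.strip() for line in fh if line.strip()]

    # Walk the tree once: lower-cased file name -> first folder where it was seen.
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.setdefault(name.lower(), dirpath)

    lines = []
    for name in wanted:
        where = found.get(name.lower())
        lines.append(f"{name} FOUND {where}" if where else f"{name} NOT FOUND")

    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
    return lines
```

With the names from the question, the call might look like `search_files("list.txt", "L:\\", "search.txt")`. If a file name occurs in several folders, this sketch reports only the first one it encounters.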

How to differentiate .odt and .docx files other than by magic number and extension

I need to detect the document type of a given file. I did this using magic numbers for PDF, RTF, and DOC files, but when I tried to do the same for ODT and DOCX files I could not, because the magic number is the same for both. Please help me sort out this issue. I need to do this programmatically in Java.
Thanks in advance.

What I just tried:
$ file file.docx
file.docx: Microsoft OOXML
$ file file.odt
file.odt: OpenDocument Text
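Both formats share a magic number because both are ZIP containers, so one way to tell them apart is to look inside the archive. An ODF package carries a `mimetype` entry holding its media type, while an OOXML package carries a `[Content_Types].xml` entry and, for Word documents, conventionally a `word/document.xml` part. The sketch below is in Python rather than the Java the question asks for, and the function name `sniff_zip_document` is hypothetical; the same entry checks should carry over to `java.util.zip.ZipFile`.

```python
import zipfile

# Media type stored in the "mimetype" entry of an OpenDocument Text package.
ODT_MIME = b"application/vnd.oasis.opendocument.text"


def sniff_zip_document(path):
    """Return 'odt', 'docx', or 'unknown' based on the entries in the ZIP container."""
    if not zipfile.is_zipfile(path):
        return "unknown"
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
        if "mimetype" in names and zf.read("mimetype").strip() == ODT_MIME:
            return "odt"
        # Assumption: the main Word part lives at its conventional location.
        if "[Content_Types].xml" in names and "word/document.xml" in names:
            return "docx"
    return "unknown"
```

Checking for `word/document.xml` is a heuristic; a stricter check would read the content types declared in `[Content_Types].xml`.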

PDF to PNG ghostscript Batch file issue

I am trying to drag a PDF onto a batch file which will then convert the PDF to a PNG in the same directory. It all works well for a single-page PDF: I get the right conversion to PNG format. The problem is when I convert a PDF with multiple pages. The output says "Processing pages 1, 2, 3" etc., but what I end up with is only the first page of the PDF. Could anyone steer me in the right direction? It would be much appreciated.
I have created a batch file with the code below. Thanks in advance.
path=%PATH%;C:\Program Files (x86)\gs\gs9.20\bin\
cd %~dp1
gswin32c.exe -sDEVICE=pngalpha -sCompression=lzw -r300x300 -dBATCH -sOutputFile=%1.png %1

You have to use %d as part of the output filename as a placeholder for the page number in the PNGs. Inside a batch file the percent sign must be doubled, so in this case: -sOutputFile=%1_%%d.png

What is the *.cf7 file extension?

At work we have been given a .cf7 file by a client and are expected to know how to access its contents. We suspect it is some sort of database or accounting records file, most probably proprietary, as it was contained in a folder called "books and records".
Has anyone dealt with .cf7 files before? Does anyone know how to use one such file? Opening it in a text editor reveals it is of a binary nature. Any pointers would be greatly appreciated.
Thanks!

Try a product called Cashflow Manager. It also uses the .cf7 extension.

Detecting the database a .DAT file belongs to

I have a set of .DAT files alongside a set of .IDX files with the same names.
The goal is to be able to open these files and read their contents, parsing them into a new format. The problem: I have no idea what database the data is stored in! The files contain no headers or clues, they are binary, and the source from which I received them has no idea about the storage mechanism.
So the question is: what are some common databases which store data in .DAT files and store their indexes in .IDX files with the same name? Is there an application I can use in Linux or Windows which can detect the database?
EDIT :-
File names:
price.dat
price.idx
Here is a hex dump of the beginning of the .DAT file:
030D04806420500FFE3E0500002078581001C000738054E0C0099804138100402550080442090082403C101F7406010080C0A010201002010C006FC0246C0403FE00B041C051F0091BFE042F812FE054F8177E066F81BFE078F8207E08AF824FE09CF8297E0AEF82DFE0C0F8327E0D2F836FE0E4F83B7E0F6F83FE5FEFF47C06608480FA91F003C0213101F1BFDFE804220100F500D2A00388430801E04028D4390D128B46804024010A067269FCA546003C0844060E11F084B9E1377850
Here is a hex dump of the beginning of the .IDX file:
030D04805820100FFD7E0000397FEB60050410007300246A3060068220009BE0401030088B3903F740E010C80402410281402030094004C708004DC058880FFC052F015EBFE042F812FE054F8177E066F81BFE078F8207E08AF824FE09CF8297E0AEF82DFE0C0F8327E0D2F836FE0E4F83B7E0F6F83FFE108F8447E11AF848FE12CF84D7E13EF851FE150F8567E162F85AFE174F85F7E186F863FE198F8687E1AAF86CFE1BCF8717E1CEF875FE1E0F87A7E1F2F87EF5FEFF005E30901714
Both files start with 030D0480 (the dumps diverge after that); I wonder if this is a good start?
I did a quick search on Google but it didn't return anything...
END EDIT :-
Any other ideas?
Thanks much in advance!
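The shared leading bytes of the two dumps can be checked mechanically instead of by eye. A minimal Python sketch, using only the first 20 hex digits of each dump as copied from the question; the helper name `shared_prefix` is illustrative:

```python
import os

# First 20 hex digits of each dump, copied from the question.
dat_head = "030D04806420500FFE3E"
idx_head = "030D04805820100FFD7E"


def shared_prefix(a, b):
    """Return the common leading hex digits of a and b, trimmed to whole bytes."""
    prefix = os.path.commonprefix([a, b])
    # Two hex digits make one byte, so drop a dangling half-byte if present.
    return prefix[: len(prefix) - len(prefix) % 2]


print(shared_prefix(dat_head, idx_head))  # prints 030D0480
```

Those four shared bytes are the most plausible candidate for a format signature when searching for the file type.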

There is a FairCom ODBC driver called 'ctreeODBC_RO.exe' which should be able to read them.
