Python read text file as binary?

I was trying to build an encryption program in Python 2.7. It would read the binary data from a file and then use a key to encrypt it. However, I quickly ran into a problem. Files like image files and executables read as hex values, but text files do not when using open(). Even if I run
file=open("myfile.txt", "rb")
out=file.read()
it still comes out as just text. I'm on Windows 7, not Linux, which I think may make a difference. Is there any way I could read the binary from ANY file (including text files), not just image and executable files?

Even when reading a file with the 'rb' flag, a byte such as '\x41' will still be printed as the letter 'A' in the console.
If you want the hex values, encode the file content as hex:
content = open('text.txt', 'rb').read()
# Python 3.5+:
hex = content.hex()
# Python 2:
hex = content.encode('hex')
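For instance, on Python 3.5+, and assuming text.txt contains just the three characters ABC, the result would be:
content = open('text.txt', 'rb').read()  # b'ABC'
print(content.hex())                     # prints '414243'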

Take a look at the code below; it also illustrates several points that are relevant to you.
from hashlib import md5
from Crypto.Cipher import AES
from Crypto import Random

def derive_key_and_iv(password, salt, key_length, iv_length):
    d = d_i = ''
    while len(d) < key_length + iv_length:
        d_i = md5(d_i + password + salt).digest()
        d += d_i
    return d[:key_length], d[key_length:key_length+iv_length]

def encrypt(in_file, out_file, password, key_length=32):
    bs = AES.block_size
    salt = Random.new().read(bs - len('Salted__'))
    key, iv = derive_key_and_iv(password, salt, key_length, bs)
    cipher = AES.new(key, AES.MODE_CBC, iv)
    out_file.write('Salted__' + salt)
    finished = False
    while not finished:
        chunk = in_file.read(1024 * bs)
        if len(chunk) == 0 or len(chunk) % bs != 0:
            padding_length = (bs - len(chunk) % bs) or bs
            chunk += padding_length * chr(padding_length)
            finished = True
        out_file.write(cipher.encrypt(chunk))

def decrypt(in_file, out_file, password, key_length=32):
    bs = AES.block_size
    salt = in_file.read(bs)[len('Salted__'):]
    key, iv = derive_key_and_iv(password, salt, key_length, bs)
    cipher = AES.new(key, AES.MODE_CBC, iv)
    next_chunk = ''
    finished = False
    while not finished:
        chunk, next_chunk = next_chunk, cipher.decrypt(in_file.read(1024 * bs))
        if len(next_chunk) == 0:
            padding_length = ord(chunk[-1])
            chunk = chunk[:-padding_length]
            finished = True
        out_file.write(chunk)
Usage
with open(in_filename, 'rb') as in_file, open(out_filename, 'wb') as out_file:
    encrypt(in_file, out_file, password)
with open(in_filename, 'rb') as in_file, open(out_filename, 'wb') as out_file:
    decrypt(in_file, out_file, password)

Your binary file is coming out looking like text because the file is being treated as if it were encoded in an 8-bit encoding (ASCII, Latin-1, etc.). Also, in Python 2, bytes and (text) characters are used interchangeably, i.e. a string is just an array of ASCII bytes.
You should look into the differences between Python 2 and 3 text handling, and you will quickly see why anomalies such as the one you are encountering can develop. Most Python 2 encryption modules work on Python byte strings.
Your "binary" non-text files are not really being treated any differently from the text ones; they just don't map to an intelligible encoding that you recognize, whereas the text ones do.

Related

Writing to output stream in chunks in Groovy

I'm getting Jenkins console logs and writing them into an output stream like this:
ByteArrayOutputStream stream = new ByteArrayOutputStream()
currentBuild.rawBuild.getLogText().writeLogTo(0, stream)
However, the downside of this approach is that writeLogTo() method is limited to 10000 lines:
https://github.com/jenkinsci/stapler/blob/master/core/src/main/java/org/kohsuke/stapler/framework/io/LargeText.java#L572
In this case, if Jenkins console log is more than a 10000 lines then the data from line 10000 and up is lost and not written into a buffer.
I'm trying to rewrite the above approach in the simplest possible way to account for cases when the log has more than 10000 lines.
I feel like my attempt is very complicated and error-prone. Is there an easier way to introduce a new logic?
Please note that the code below is not tested, this is just a draft of how I'm planning to implement it:
ByteArrayOutputStream stream = new ByteArrayOutputStream()
def log = currentBuild.rawBuild.getLogText()
def offset = 0
def maxNumOfLines = 10000
// get total number of lines in the log
// def totalLines = (still trying to figure out how to get it)
if (totalLines > maxNumOfLines) {
    def numOfExecutions = round(totalLines / maxNumOfLines)
}
for (int i = 0; i < numOfExecutions; i++) {
    log.writeLogTo(offset, stream)
    offset += maxNumOfLines
}
writeLogTo(long start, OutputStream out)
According to the comments, this method returns the offset to start the next write operation from. So the code could look like this:
def logFile = currentBuild.rawBuild.getLogText()
def start = 0
while (logFile.length() > start)
    start = logFile.writeLogTo(start, stream)
Here stream could be a FileOutputStream, to avoid reading the whole log into memory.
There is another method, readAll(). So the code could be as simple as this to read the whole log as text:
def logText=currentBuild.rawBuild.getLogText().readAll().getText()
Or if you want to transfer it to a local file:
new File('path/to/file.log').withWriter('UTF-8') { w ->
    w << currentBuild.rawBuild.getLogText().readAll()
}

Is there a way to make python print to file for every iteration of a for loop instead of storing all in the buffer?

I am looping over a very large document to try and lemmatise it.
Unfortunately, Python does not seem to print to the file for every line, but instead runs through the whole document before printing, which given the size of my file exceeds the memory...
Before I split my document into more bite-sized chunks, I wondered if there was a way to force Python to print to the file for every line.
So far my code reads:
import spacy
nlp = spacy.load('de_core_news_lg')

fin = "input.txt"
fout = "output.txt"

#%%
with open(fin) as f:
    corpus = f.readlines()
corpus_lemma = []
for word in corpus:
    result = ' '.join([token.lemma_ for token in nlp(word)])
    corpus_lemma.append(result)
with open(fout, 'w') as g:
    for item in corpus_lemma:
        g.write(f'{item}')
To give credit for the code, it was kindly suggested here: How to do lemmatization on German text?
As described in: How to read a large file - line by line?
If you do your lemmatisation inside the with block, Python will handle reading line by line using buffered I/O.
In your case, it would look like:
import spacy
nlp = spacy.load('de_core_news_lg')

fin = "input.txt"
fout = "output.txt"

#%%
corpus_lemma = []
with open(fin) as f:
    for line in f:
        result = " ".join(token.lemma_ for token in nlp(line))
        corpus_lemma.append(result)
with open(fout, 'w') as g:
    for item in corpus_lemma:
        g.write(f"{item}")
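If the list of results itself is also too large for memory, a further variation (a minimal sketch reusing the same file names and pipeline as above) is to write each lemmatised line out as soon as it is produced, so that nothing accumulates:

import spacy

nlp = spacy.load('de_core_news_lg')
fin = "input.txt"
fout = "output.txt"

with open(fin) as f, open(fout, 'w') as g:
    for line in f:
        # lemmatise one line and write it out immediately
        g.write(" ".join(token.lemma_ for token in nlp(line)))
        g.write("\n")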

How can I password protect a file regardless of its extension in Java 8 or Java 10

I have tried doing this by encrypting individual files, but I have a lot of data (~20 GB) and hence it would take a lot of time. In my test it took 2.28 minutes to encrypt a single file of size 80 MB.
Is there a quicker way to password protect that would apply to any file (text/binary/multimedia)?
If you are just trying to hide the file from others, you can try to encrypt the file path instead of encrypting the whole huge file.
For the path you mentioned, text/binary/multimedia, you could encrypt it with a method like:
private static String getEncryptedPath(String filePath) {
    String[] tokens = filePath.split("/");
    List<String> tList = new ArrayList<>();
    for (int i = 0; i < tokens.length; i++) {
        tList.add(Hashing.md5().newHasher() // com.google.common.hash.Hashing;
                .putString(tokens[i] + filePath, StandardCharsets.UTF_8).hash().toString()
                .substring(2 * i, 2 * i + 5)); // to make it harder to reverse, add your custom secret here
    }
    return String.join("/", tList);
}
and then it becomes an encrypted path as:
72b12/9cbb3/4a5f3
Once you know the real path text/binary/multimedia, any time you want to access the file you can just use this method to recover the hashed path 72b12/9cbb3/4a5f3 under which it is actually stored.

Reading file with "US-ASCII" encoding in Haskell: hGetContents: invalid argument (invalid byte sequence)

I'm using Haskell for programming a parser, but this error is a wall I can't pass. Here is my code:
main = do
  arguments <- getArgs
  let fileName = head arguments
  fileContents <- readFile fileName
  converter <- open "UTF-8" Nothing
  let titleLength = length fileName
      titleWithoutExtension = take (titleLength - 4) fileName
      allNonEmptyLines = unlines $ tail $ filter (/= "") $ lines fileContents
When I try to read a file with "US-ASCII" encoding I get the famous error hGetContents: invalid argument (invalid byte sequence). I've tried to change the "UTF-8" in my code to "US-ASCII", but the error persists. Is there a way to read these files, or any way of handling encoding problems?
You should use hSetEncoding to configure the file handle for a specific text encoding, e.g.:
import System.Environment
import System.IO

main = do
  (path : _) <- getArgs
  h <- openFile path ReadMode
  hSetEncoding h latin1
  contents <- hGetContents h
  -- no need to close h
  putStrLn $ show $ length contents
If your file contains non-ASCII characters and is not UTF-8 encoded, then latin1 is a good bet, although it's not the only possibility.

Ways to validate converted code from FORTRAN to C

I have converted around 90+ Fortran files into C files using a tool, and I need to validate whether the conversion is good or not.
Can you give me some ideas on how best to ensure that the functionality has been preserved through the translation?
You need verification tests that exercise those Fortran functions; then you run those tests against the C code.
You can use unit-test technology/methodology. In fact, I can't see how else you would prove that the conversion is correct.
In many unit-test methodologies you would write the tests in the same language as the code, but in this case I recommend very strongly that you pick one language and one code base to exercise both sets of functions. Also, don't worry about trying to create pure unit tests; rather, use the techniques to give you coverage of all the uses that the Fortran code was supposed to handle.
Use unit tests.
First write your unit tests on the Fortran code and check whether they all run correctly, then rewrite them in C and run those.
The problem with this approach is that you also need to rewrite your unit tests, which you normally don't do when refactoring code (except for API changes). This means that you might end up debugging your ported unit-testing code as well, besides the actual code.
Therefore, it might be better to write testing code that contains minimal logic and only writes the results of the functions to a file. Then you can rewrite this minimal testing code in C, generate the same files, and compare the files.
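As a rough illustration of that comparison step (a minimal sketch with hypothetical file names, assuming each test harness writes one numeric result per line), it could be as simple as:

# hypothetical names: results_fortran.txt holds the reference output,
# results_c.txt the output of the converted code, one float per line
TOLERANCE = 1e-8

with open('results_fortran.txt') as f_ref, open('results_c.txt') as f_new:
    for lineno, (ref_line, new_line) in enumerate(zip(f_ref, f_new), start=1):
        ref, new = float(ref_line), float(new_line)
        if abs(ref - new) > TOLERANCE:
            print("line %d: FAIL (%g vs %g)" % (lineno, ref, new))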
Here is what I did for a "similar" task (comparing Fortran 90 code to Fortran 90 + OpenACC GPU-accelerated code):
1. Analyze what the output of each Fortran module is.
2. Write these output arrays to .dat files.
3. Copy the .dat files into a reference folder.
4. Write the output of the converted modules to files (either CSV or binary). Use the same filenames for convenience.
5. Write a Python script that compares the two versions.
I used convenience functions like these in Fortran (analogous for the 1D and 2D cases):
subroutine write3DToFile(path, array, n1, n2, n3)
  use pp_vardef
  use pp_service, only: find_new_mt
  implicit none
  !input arguments
  real(kind = r_size), intent(in) :: array(n1,n2,n3)
  character(len=*), intent(in) :: path
  integer(4) :: n1
  integer(4) :: n2
  integer(4) :: n3
  !temporary
  integer(4) :: imt
  call find_new_mt(imt)
  open(imt, file = path, form = 'unformatted', status = 'replace')
  write(imt) array
  close(imt)
end subroutine write3DToFile
In Python I used the following script for reading binary Fortran data and comparing it. Note: since you want to convert to C, you would have to adapt it so that it can read the data produced by C instead of Fortran.
from optparse import OptionParser
import struct
import sys
import math

def unpackNextRecord(file, readEndianFormat, numOfBytesPerValue):
    header = file.read(4)
    if (len(header) != 4):
        #we have reached the end of the file
        return None

    headerFormat = '%si' %(readEndianFormat)
    headerUnpacked = struct.unpack(headerFormat, header)
    recordByteLength = headerUnpacked[0]
    if (recordByteLength % numOfBytesPerValue != 0):
        raise Exception, "Odd record length."
        return None
    recordLength = recordByteLength / numOfBytesPerValue

    data = file.read(recordByteLength)
    if (len(data) != recordByteLength):
        raise Exception, "Could not read %i bytes as expected. Only %i bytes read." %(recordByteLength, len(data))
        return None

    trailer = file.read(4)
    if (len(trailer) != 4):
        raise Exception, "Could not read trailer."
        return None
    trailerUnpacked = struct.unpack(headerFormat, trailer)
    redundantRecordLength = trailerUnpacked[0]
    if (recordByteLength != redundantRecordLength):
        raise Exception, "Header and trailer do not match."
        return None

    dataFormat = '%s%i%s' %(readEndianFormat, recordLength, typeSpecifier)
    return struct.unpack(dataFormat, data)

def rootMeanSquareDeviation(tup, tupRef):
    err = 0.0
    i = 0
    for val in tup:
        err = err + (val - tupRef[i])**2
        i = i + 1
    return math.sqrt(err)

##################### MAIN ##############################
#get all program arguments
parser = OptionParser()
parser.add_option("-f", "--file", dest="inFile",
                  help="read from FILE", metavar="FILE", default="in.dat")
parser.add_option("--reference", dest="refFile",
                  help="reference FILE", metavar="FILE", default="ref.dat")
parser.add_option("-b", "--bytesPerValue", dest="bytes", default="4")
parser.add_option("-r", "--readEndian", dest="readEndian", default="big")
parser.add_option("-v", action="store_true", dest="verbose")
(options, args) = parser.parse_args()

numOfBytesPerValue = int(options.bytes)
if (numOfBytesPerValue != 4 and numOfBytesPerValue != 8):
    print "Unsupported number of bytes per value specified."
    sys.exit()
typeSpecifier = 'f'
if (numOfBytesPerValue == 8):
    typeSpecifier = 'd'

readEndianFormat = '>'
if (options.readEndian == "little"):
    readEndianFormat = '<'

inFile = None
refFile = None
try:
    #prepare files
    inFile = open(str(options.inFile),'r')
    refFile = open(str(options.refFile),'r')

    i = 0
    while True:
        passedStr = "pass"
        i = i + 1
        unpackedRef = None
        try:
            unpackedRef = unpackNextRecord(refFile, readEndianFormat, numOfBytesPerValue)
        except(Exception), e:
            print "Error reading record %i from %s: %s" %(i, str(options.refFile), e)
            sys.exit()

        if (unpackedRef == None):
            break;

        unpacked = None
        try:
            unpacked = unpackNextRecord(inFile, readEndianFormat, numOfBytesPerValue)
        except(Exception), e:
            print "Error reading record %i from %s: %s" %(i, str(options.inFile), e)
            sys.exit()

        if (unpacked == None):
            print "Error in %s: Record expected, could not load record it" %(str(options.inFile))
            sys.exit()

        if (len(unpacked) != len(unpackedRef)):
            print "Error in %s: Record %i does not have same length as reference" %(str(options.inFile), i)
            sys.exit()

        #analyse unpacked data
        err = rootMeanSquareDeviation(unpacked, unpackedRef)
        if (abs(err) > 1E-08):
            passedStr = "FAIL <-------"
        print "%s, record %i: Mean square error: %e; %s" %(options.inFile, i, err, passedStr)
        if (options.verbose):
            print unpacked

except(Exception), e:
    print "Error: %s" %(e)

finally:
    #cleanup
    if inFile != None:
        inFile.close()
    if refFile != None:
        refFile.close()
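For example (assuming the script above is saved as compare.py, a hypothetical name, and the records were written as 8-byte big-endian values), one pair of files could be checked with:
python compare.py --file out.dat --reference ref.dat -b 8 -r big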
