UnicodeDecodeError when processing a CSV

Suddenly a "UnicodeDecodeError" arises in code of mine that worked yesterday.
File "D:\Anaconda\lib\site-packages\IPython\core\interactiveshell.py", line 3284, in run_code
    self.showtraceback(running_compiled_code=True)
File "D:\Anaconda\lib\site-packages\IPython\core\interactiveshell.py", line 2021, in showtraceback
    value, tb, tb_offset=tb_offset)
File "D:\Anaconda\lib\site-packages\IPython\core\ultratb.py", line 1379, in structured_traceback
    self, etype, value, tb, tb_offset, number_of_lines_of_context)
File "D:\Anaconda\lib\site-packages\IPython\core\ultratb.py", line 1291, in structured_traceback
    elist = self._extract_tb(tb)
File "D:\Anaconda\lib\site-packages\IPython\core\ultratb.py", line 1272, in _extract_tb
    return traceback.extract_tb(tb)
File "D:\Anaconda\lib\traceback.py", line 72, in extract_tb
    return StackSummary.extract(walk_tb(tb), limit=limit)
File "D:\Anaconda\lib\traceback.py", line 364, in extract
    f.line
File "D:\Anaconda\lib\traceback.py", line 286, in line
    self._line = linecache.getline(self.filename, self.lineno).strip()
File "D:\Anaconda\lib\linecache.py", line 16, in getline
    lines = getlines(filename, module_globals)
File "D:\Anaconda\lib\linecache.py", line 47, in getlines
    return updatecache(filename, module_globals)
File "D:\Anaconda\lib\linecache.py", line 137, in updatecache
    lines = fp.readlines()
File "D:\Anaconda\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 2441: invalid start byte
import csv
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

dateiname_TDM = "./TDM_example_small.csv"
dateiname_corpus = "./Topic_Modeling/Input_Data/corpus.mm"
dateiname_dictionary = "./Topic_Modeling/Input_Data/dictionary.dict"

ids = {}
corpus = []
with open(dateiname_TDM, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quotechar='|')
    documente = next(reader, None)[1:]
    for rownumber, row in enumerate(reader):
        for index, field in enumerate(row):
            if index == 0:
                if rownumber > 0:
                    ids[rownumber-1] = field
            else:
                if rownumber == 0:
                    corpus.append([])
                else:
                    try:
                        if int(field) > 0:  # field is a string, so compare it as an int
                            corpus[index-1].append((rownumber-1, int(field)))
                    except ValueError:
                        corpus[index-1].append((rownumber-1, 0))

Without seeing what's at position 2441 I'm not entirely sure, but it is probably one of the following:
- A special, non-ASCII/extended-ASCII character, in which case do the_string.encode("UTF-8"), or pass encoding="UTF-8" (or whatever encoding the file actually uses) to the open function
- You have \u or \U somewhere, which makes the characters after it be read as part of a Unicode escape sequence, so use repr(the_string) to add backslashes that neutralize the backslashes after them (probably not this one)
- You are reading a bytes object, not a str object. Try opening the file with mode "r+b" (read & write, binary) in the open function
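For the first case, a minimal sketch: byte 0xf6 is "ö" in Latin-1/Windows-1252, so the file may simply not be UTF-8 at all. The file name is taken from the question; the cp1252 guess is an assumption you should verify:
import csv

dateiname_TDM = "./TDM_example_small.csv"

# Open with an explicit encoding instead of the platform default.
# cp1252 is only a guess based on byte 0xf6 ("ö"); try "latin-1" or
# "utf-8-sig" if the text still looks wrong.
with open(dateiname_TDM, newline='', encoding='cp1252') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quotechar='|')
    for row in reader:
        print(row)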
I've more or less thrown spaghetti at a wall but I hope this helps!

Related

Is there a way to make python print to file for every iteration of a for loop instead of storing all in the buffer?

I am looping over a very large document to try and lemmatise it.
Unfortunately, Python does not seem to print to file for every line but instead runs through the whole document before printing, which given the size of my file exceeds the memory...
Before I chunk my document into more bite-sized pieces, I wondered if there was a way to force Python to print to file for every line.
So far my code reads:
import spacy
nlp = spacy.load('de_core_news_lg')
fin = "input.txt"
fout = "output.txt"
#%%
with open(fin) as f:
    corpus = f.readlines()
corpus_lemma = []
for word in corpus:
    result = ' '.join([token.lemma_ for token in nlp(word)])
    corpus_lemma.append(result)
with open(fout, 'w') as g:
    for item in corpus_lemma:
        g.write(f'{item}')
To give credit for the code, it was kindly suggested here: How to do lemmatization on German text?
As described in: How to read a large file - line by line?
If you do your lemmatisation inside the with block, Python will handle reading line by line using buffered I/O.
In your case, it would look like:
import spacy
nlp = spacy.load('de_core_news_lg')
fin = "input.txt"
fout = "output.txt"
#%%
corpus_lemma = []
with open(fin) as f:
    for line in f:
        result = " ".join(token.lemma_ for token in nlp(line))
        corpus_lemma.append(result)
with open(fout, 'w') as g:  # open for writing, not the default read mode
    for item in corpus_lemma:
        g.write(f"{item}")

The filename, directory name, or volume label syntax is incorrect - when trying to read a file in Go

I want to read the contents of a text file.
When I pass the file name as a string literal like this, it works:
stream, err = ioutil.ReadFile("sample.txt")
It even works if I do it this way:
filename := "sample.txt"
stream, err = ioutil.ReadFile(filename)
But when I get the value of filename from a string array, it fails to find the file and throws the error: The filename, directory name, or volume label syntax is incorrect
filename := lines[1] // where lines[] is an array of strings
stream, err = ioutil.ReadFile(filename)
Debug information:
fmt.Printf("%q\n", lines[1]) // output: mytext2.txt\r
The application should trim the \r from the end of the string using
strings.TrimSuffix(filename, "\r") or strings.TrimSpace(filename).
If the OP used strings.Split(s, "\n") to create lines, then the trailing \r can also be avoided by splitting on "\r\n" instead.
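A minimal sketch of the fix (the list file name filelist.txt is a made-up placeholder; the rest mirrors the question's ioutil calls):
package main

import (
	"fmt"
	"io/ioutil"
	"strings"
)

func main() {
	// Hypothetical list file containing one file name per line,
	// possibly with Windows ("\r\n") line endings.
	data, err := ioutil.ReadFile("filelist.txt")
	if err != nil {
		panic(err)
	}
	lines := strings.Split(string(data), "\n")

	// TrimSpace removes the stray "\r" (and any other whitespace)
	// left over from splitting on "\n" alone.
	filename := strings.TrimSpace(lines[1])

	stream, err := ioutil.ReadFile(filename)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes read from %q\n", len(stream), filename)
}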

Write a word at the end of a specific line in a text file using erlang

How can I write a word at the end of a specific line in a text file in Erlang? Let's say:
Line 1: "He is john"
write_word("poem.txt",1," doe.").
Line 1: "He is john doe."
This is all I can manage to do:
write_word(Filename, LineNumber, Word) ->
    {ok, Data} = file:open(Filename, [read, write]),
    % write the word at the end of the line with the specified line number
Below is kind of pseudo code (untested). Also, you will have to write the contents to a new file.
openFile(FileName, Mode, DesiredLine) ->
    {ok, FD} = file:open(FileName, Mode),
    for_each_line(FD, 0, DesiredLine).

for_each_line(FD, LineNo, DesiredLine) ->
    case io:get_line(FD, "") of
        eof ->
            file:close(FD);
        Line ->
            NewLine = case LineNo =:= DesiredLine of
                false ->
                    Line;           %% keep the line unchanged
                true ->
                    %% do your stuff (strip the trailing newline before appending)
                    Line ++ "Word"
            end,
            %% write NewLine into your new file here
            for_each_line(FD, LineNo + 1, DesiredLine)
    end.
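Fleshing that out, a minimal sketch of write_word/3 (untested assumptions: the whole file fits in memory, lines end with "\n", and line numbers are 1-based as in the question):
write_word(FileName, LineNumber, Word) ->
    {ok, Bin} = file:read_file(FileName),
    Lines = string:split(binary_to_list(Bin), "\n", all),
    Numbered = lists:zip(lists:seq(1, length(Lines)), Lines),
    NewLines = [case N of
                    LineNumber -> Line ++ Word;  %% append the word to this line
                    _          -> Line           %% leave every other line alone
                end || {N, Line} <- Numbered],
    file:write_file(FileName, lists:join("\n", NewLines)).
With the question's example, write_word("poem.txt", 1, " doe.") turns "He is john" into "He is john doe.".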

Trying to store UTF-8 data in datastore, getting UnicodeEncodeError

Trying to store UTF-8 into the datastore and getting this error:
Traceback (most recent call last):
  File "/sinfo/google_appengine/google/appengine/ext/webapp/__init__.py", line 511, in __call__
    handler.get(*groups)
  File "/sinfo/siteinfo/siteinfo.py", line 1911, in get
    seoEntity.put()
  File "/sinfo/google_appengine/google/appengine/ext/db/__init__.py", line 833, in put
    return datastore.Put(self._entity, rpc=rpc)
  File "/sinfo/google_appengine/google/appengine/api/datastore.py", line 275, in Put
    req.entity_list().extend([e._ToPb() for e in entities])
  File "/sinfo/google_appengine/google/appengine/api/datastore.py", line 680, in _ToPb
    properties = datastore_types.ToPropertyPb(name, values)
  File "/sinfo/google_appengine/google/appengine/api/datastore_types.py", line 1499, in ToPropertyPb
    pbvalue = pack_prop(name, v, pb.mutable_value())
  File "/sinfo/google_appengine/google/appengine/api/datastore_types.py", line 1322, in PackString
    pbvalue.set_stringvalue(unicode(value).encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
How do I solve this? The data is already UTF-8 encoded, so why does the datastore use the ASCII codec and fail?
I use the following helper in my projects:
def force_utf8(string):
    if type(string) == str:  # Python 2: already a byte string
        return string
    return string.encode('utf-8')
Use it to escape all your unicode data before passing it to GAE. You may also find the following snippet useful:
def force_unicode(string):
    if type(string) == unicode:  # Python 2: already a unicode string
        return string
    return string.decode('utf-8')
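A minimal usage sketch, assuming Python 2 and the old db API from the traceback; seoEntity comes from the question, but the title property and raw_bytes value are hypothetical stand-ins:
# raw_bytes is a UTF-8 encoded str (Python 2 bytes); decode it to
# unicode before assigning, so GAE never re-decodes it with ascii.
seoEntity.title = force_unicode(raw_bytes)
seoEntity.put()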

Matlab command to access the last line of each file?

I have 20 text files, and I want to use a MATLAB loop to get the last line of each file without reading the other lines. Is there any MATLAB command to solve this problem?
One thing you can try is to open the text file as a binary file, seek to the end of the file, and read single characters (i.e. bytes) backwards from the end of the file. This code will read characters from the end of the file until it hits a newline character (ignoring a newline if it finds it at the very end of the file):
fid = fopen('data.txt','r');        %# Open the file as a binary
lastLine = '';                      %# Initialize to empty
offset = 1;                         %# Offset from the end of file
fseek(fid,-offset,'eof');           %# Seek to the file end, minus the offset
newChar = fread(fid,1,'*char');     %# Read one character
while (~strcmp(newChar,char(10))) || (offset == 1)
    lastLine = [newChar lastLine];    %# Add the character to a string
    offset = offset+1;
    fseek(fid,-offset,'eof');         %# Seek to the file end, minus the offset
    newChar = fread(fid,1,'*char');   %# Read one character
end
fclose(fid);                        %# Close the file
On Unix, simply use:
[status result] = system('tail -n 1 file.txt');
if isstrprop(result(end), 'cntrl'), result(end) = []; end
On Windows, you can get the tail executable from the GnuWin32 or UnxUtils projects.
It may not be very efficient, but for short files it can be sufficient.
function pline = getLastTextLine(filepath)
fid = fopen(filepath);
while 1
    line = fgetl(fid);
    if ~ischar(line)
        break;
    end
    pline = line;
end
fclose(fid);
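For the 20 files from the question, a minimal loop using the helper above (the file1.txt ... file20.txt names are made-up placeholders):
% Collect the last line of each of the 20 files.
lastLines = cell(20, 1);
for k = 1:20
    lastLines{k} = getLastTextLine(sprintf('file%d.txt', k));
end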
