Can't display unicode characters from file properly - file

I'm writing a script that operates on words from a number of files. The files contain Unicode characters written as escape sequences, in the form something\u0142somethingelse.
I use Python 3, so I assumed that after reading a line, \u0142 would be replaced by the 'ł' character, but it isn't. I get "something\u0142somethingelse" in the console.
If I manually copy the "bad" output from the console and paste it into print("something\u0142somethingelse"), it is displayed correctly.
Problematic part of the script:
list_of_files = ['test/stack.txt']
for file in list_of_files:
    with open(file,'r') as fp:
        for line in fp:
            print(line)
print("something\u0142somethingelse")
stack.txt:
something\u0142somethingelse
Output:
something\u0142somethingelse
somethingłsomethingelse
I experimented with utf-8 encoding when opening this file and really I'm out of ideas...

I think you can do what you want with ast.literal_eval. This uses the same syntax as the Python interpreter to understand literals: like eval but safer. So this works, for example:
import ast
a = 'something\\u0142somethingelse'
# Wrap the text in quotes so it parses as a Python string literal.
b = ast.literal_eval('"' + a + '"')
print('"' + a + '"')
print(b)
The output should be:
"something\u0142somethingelse"
somethingłsomethingelse
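Wrapping a line in quotes for ast.literal_eval fails when the line itself contains a double quote or still ends with a real newline, as lines read from a file do. A narrower alternative, sketched below, assumes the file only uses four-digit \uXXXX escapes and replaces them with a regular expression. The helper name decode_u_escapes is hypothetical and not part of any library.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# A raw string stands in for a line read from the file.
line = r"something\u0142somethingelse"
print(decode_u_escapes(line))  # somethingłsomethingelse
```

Inside the loop from the question, this would be called as print(decode_u_escapes(line)) for each line read from fp.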

Multiple line strings in Apache Zeppelin

I have a very long string that must be broken across multiple lines. How can I do that in Zeppelin?
The error I get is "missing argument list for method + in class String".
Here is the more complete error message:
<console>:14: error: missing argument list for method + in class String
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `$plus _` or `$plus(_)` instead of `$plus`.
val q = "select count(distinct productId),count(distinct date),count(distinct instock_inStockPercent), count(distinct instock_totalOnHand)," +
In Scala (using Apache Zeppelin as well as otherwise), you can write expressions covering multiple lines by wrapping them in parentheses:
val text = ("line 1"
  + "line 2")
Using parentheses
As Theus mentioned, one way is to use parentheses.
val text = ("line 1" +
  "line 2")
In fact, any multi-line expression that would otherwise be ended early by a line break can be wrapped in parentheses, for example:
(object.function1()
  .function2())
Using """
For a multi-line string, we can use """, like this:
val s = """line 1
           line2
           line3"""
The leading spaces before line2 and line3 will be included. If we don't want the leading spaces, we can use stripMargin like this:
val s = """line 1
          |line2
          |line3""".stripMargin
Or use a different margin character:
val s = """line 1
          $line2
          $line3""".stripMargin('$')

Octave - Adding '\n' to String Array is Not Creating a New Line

I want to replace each ',' character with '\n' and save the result to a text file.
All files are in this format:
546,234,453,685,.....,234
I want to make it like:
546
234
453
685
...
234
My initial attempt at this problem looks like this:
fid=fopen(files{i});
strArr=fscanf(fid,'%s');
newstrArr=strrep(strArr,',','\n');
% Take each .txt input
for j=1:length(newstrArr)
  Array=[Array newstrArr(j)];
endfor
Let me explain step by step:
1st I open the current text file
fid=fopen(files{i});
2nd I find the strings in text file
strArr=fscanf(fid,'%s');
Please note that you can't replace %s with %d. (Correct me if I am wrong.)
3rd I replace commas with newline character
newstrArr=strrep(strArr,',','\n');
4th I add each character to a new array with for loop
for j=1:length(newstrArr)
  Array=[Array newstrArr(j)];
endfor
However, when I display the result using
disp(Array);
I have this output
How can I properly replace the commas with newlines?
Regards
The issue is that you are inserting a literal '\n' (the characters \ and n) and not a newline character. This is because in Octave, a single-quote enclosed string ignores escape sequences. If you want Octave to respect escape sequences you could use a double-quoted string which will convert \n into a newline.
strrep(strArr, ',', "\n");
Or if you want your code to be MATLAB-compatible, you'll want to instead use char(10) (an actual new-line character). This is because MATLAB does not have double-quote enclosed strings.
output = strrep(strArr, ',', char(10));
Another option would be to split your input at the , and use sprintf to add the newlines (it'll treat \n as a newline)
values = strsplit(strArr, ',');
output = sprintf('%s\n', values{:});
If you just want to save each entry to a new line in a file, you can use fprintf instead.
values = strsplit(strArr, ',');
fout = fopen('output.txt', 'w');
fprintf(fout, '%s\n', values{:});
fclose(fout);
If you really just want to replace "," with newline simply do
in = fileread ("yourfile");
out = strrep (in, ",", "\n")
out = 546
234
453
685
234
Btw, see the difference between "\n" (in GNU Octave a newline) and '\n' (literally \n)
Another option is to use regexprep(), which has the advantage of being MATLAB-compatible. Assuming that the newline convention you want is \n, then
regexprep('123,456,789',',','\n')
ans = 123
456
789
When output to a file via fprintf() the result looks like
123
456
789
provided the text editor understands the newline convention.

Ruby storing array into csv result in output with extra ""

I have a script that scrapes a website, generates a csv file and stores the raw data in it. Everything works well except when I try to store an array in the csv file:
tarif_jeune = []
tarif_adulte = []
html_doc.search("td table table table tr").each do |tr|
  unless (tr.css("td:nth-child(11)").text.squish == "") || (tr.css("td:nth-child(11)").text.squish.size > 5) || (tr.css("td:nth-child(11)").text.squish == "0,00")
    tarif_adulte << tr.css("td:nth-child(11)").text.squish
  end
  unless (tr.css("td:nth-child(12)").text.squish == "") || (tr.css("td:nth-child(12)").text.squish.size > 5) || (tr.css("td:nth-child(12)").text.squish == "0,00")
    tarif_jeune << tr.css("td:nth-child(12)").text.squish
  end
end
Then I insert tarif_jeune and tarif_adulte into the csv file:
csv << ["true", tr.css("td:nth-child(10)").text.squish, tr.css("td:nth-child(11)").text.squish, tr.css("td:nth-child(11)").text.squish, tr.css("td:nth-child(12)").text.squish, tr.css("td:nth-child(13)").text.squish, tr.css("td:nth-child(14)").text.squish, tr.css("td:nth-child(15)").text.squish, tr.css("td:nth-child(1) a").attr("href").value, tarif_jeune.uniq, tarif_adulte.uniq, cat.uniq, address]
cat, tarif_jeune and tarif_adulte are all arrays. I would expect them to look like ["poo", "faa", "foo"] in my csv, but the output is quite different: "" is inserted everywhere and I get something like this:
tarif_jeune, tarif_adulte, cat
"[""15,00""]","[""20,00""]","[""Simple Messieurs 45"", ""Simple Dames Senior"", ""Simple Messieurs Senior""]"
Can someone explain where those extra "" come from and how to get rid of them?
The double quote character (") is the default :quote_char in the CSV class.
So, if you write a string that contains double quote characters, the CSV class escapes them and writes them to the file as "".
In your case, you are writing an array of strings to CSV. When Array#to_s is called on an array of strings, the output is a String that looks like this:
ary = ["a", "b", "c"]
puts "#{ary}"
#=> ["a", "b", "c"]
The double quotes in the string representation of the array are escaped by the CSV class, so the above ary appears as "[""a"", ""b"", ""c""]" in the csv file.
To solve this, consider why you need to store the output of Array#to_s in the CSV at all. You most likely want ary.join(" ") or an equivalent, and to write that to the file instead.
This is perfectly valid csv.
Assuming you want your data to look like this (as you say in ["poo", "faa", "foo"]):
tarif_jeune, tarif_adulte, cat
["15,00"],["20,00"],["Simple Messieurs 45", "Simple Dames Senior", "Simple Messieurs Senior"]
Here, the fields contain a comma, which is also your field separator, so the fields have to be surrounded with quotes. And because your fields also contain quotes, those quotes need to be escaped with another quote:
tarif_jeune, tarif_adulte, cat
"[""15,00""]","[""20,00""]","[""Simple Messieurs 45"", ""Simple Dames Senior"", ""Simple Messieurs Senior""]"
Any decent csv parser should be able to handle those extra quotes. In fact, without them, your csv would be malformed.
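The quoting rule is part of the CSV format rather than something specific to Ruby, so it can be reproduced in other languages. As a rough illustration only, the sketch below uses Python's standard csv module (not the Ruby CSV class from the question) to write a field containing both a comma and double quotes, then reads it back. The helper name to_csv_line and the sample field are made up for this example.

```python
import csv
import io


def to_csv_line(fields):
    """Serialize one row with the csv module's default quoting rules."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue()


# Hypothetical field: the text an array-like value might be rendered as.
field = '["15,00"]'

line = to_csv_line([field])
print(line, end="")  # "[""15,00""]"

# Reading the line back restores the original text without the extra quotes.
row = next(csv.reader(io.StringIO(line)))
print(row[0])  # ["15,00"]
```

The doubled quotes exist only in the file on disk. A parser removes them again, which is why they are harmless.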

Need some help parsing a string in C

Using fgets, I have read in a line from a text file. The line may be something like this:
# O^6+ + H -> O^5+ + H^+
Or it may be this:
# Mg^12+ + H -> Mg^11+ + H^+
or this:
# Ne^10+ + He -> Ne^9+ + He^+
Or a multitude of other possibilities.
I am trying to extract the ion, the charge and the atom terms from the string.
I tried something like this:
sscanf(line,"# %2s^%d+ + %2s",cs->ION,&(cs->Z),cs->ATOM);
I also tried this:
sscanf(line,"# %[^^]s^%d+ + %2s",cs->ION,&(cs->Z),cs->ATOM);
because I was picking up the '^' character.
I just can't seem to get this to work for every case. Any suggestions are appreciated.
Your try with the format string
"# %[^^]s^%d+ + %2s"
was almost right, except that there must be no s after the %[^^] (a scanset is a complete conversion specifier on its own), i.e.
"# %[^^]^%d+ + %2s"
works.

Comparing two files for identical lines where the order doesn't matter

I have two files (which could be up to 150,000 lines long; each line is 160 bytes), which I'd like to check to see if the lines in each are the same. diff won't work for me (directly) because a small percentage of the lines occur in a different order in the two files. Typically, a pair of lines will be transposed.
What's the best way to see if the same lines appear in both files, but where order doesn't matter?
Thanks,
Chris
Although it's a slightly expensive way to do it (for anything larger I'd rethink this), I'd fire up python and do the following:
filename1 = "WHATEVER YOUR FILENAME IS"
filename2 = "WHATEVER THE OTHER ONE IS"
file1contents = set(open(filename1).readlines())
file2contents = set(open(filename2).readlines())
if file1contents == file2contents:
    print("Yup they're the same!")
else:
    print("Nope, they differ. In file2, not file1:\n\n")
    for diffLine in file2contents - file1contents:
        print("\t", diffLine)
    print("\n\nIn file1, not file2:\n\n")
    for diffLine in file1contents - file2contents:
        print("\t", diffLine)
That'll print the different lines if they differ.
For only 150k lines, just hash each line and store them ordered in a lookup table. Then for each line in file two just perform the lookup.
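The hash-lookup idea can be sketched in Python with collections.Counter, which is a dict keyed by line and therefore hashes each line. Unlike a plain set, it also notices when a line occurs a different number of times in each file. The function name compare_line_counts and the sample data are hypothetical, and this assumes both files fit in memory, which is reasonable at roughly 150,000 lines of 160 bytes each.

```python
from collections import Counter


def compare_line_counts(lines_a, lines_b):
    """Return (only_in_a, only_in_b) as Counters, ignoring line order."""
    counts_a = Counter(lines_a)
    counts_b = Counter(lines_b)
    # Counter subtraction keeps only lines with a positive surplus.
    return counts_a - counts_b, counts_b - counts_a


# Stand-ins for open(path).readlines() on the two files.
first = ["x\n", "y\n", "y\n", "z\n"]
second = ["y\n", "x\n", "z\n", "y\n"]

only_first, only_second = compare_line_counts(first, second)
print(not only_first and not only_second)  # True
```

If either returned Counter is non-empty, its keys are the lines that are missing from, or less frequent in, the other file.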
Another python script to do this:
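#!/usr/bin/env python
import sys
file1 = sys.argv[1]
file2 = sys.argv[2]
# Sort both files' lines so that ordering differences disappear.
with open(file1, 'r') as f:
    lines1 = sorted(f.readlines())
with open(file2, 'r') as f:
    lines2 = sorted(f.readlines())
# Files with different line counts cannot match.
s = '' if len(lines1) == len(lines2) else 'not '
# zip stops at the shorter file, so indexing never runs past the end.
for line1, line2 in zip(lines1, lines2):
    if line1 != line2:
        print('> %s' % line1)
        print('< %s' % line2)
        s = 'not '
print('file %s is %slike file %s' % (file1, s, file2))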
