Camel: changing stream encoding - apache-camel

I'm receiving a data stream over HTTP with this route:
from("direct:foo").
to("http://foo.com/bar.html").
to("file:///tmp/bar.html")
The HTTP stream arrives in Windows-1251 encoding. I'd like to re-encode it to UTF-8 on the fly and then store it in a file.
How can I do that the standard Camel way?

I think vikingsteve's solution misses a step. The input stream contains characters encoded as CP1251, and converting the stream contents to a string does not change that encoding. When you decode the characters, you must specify the same character encoding that the sender used to encode them; otherwise you will get garbled text. Note that the route below uses charset ISO-8859-1; for Windows-1251 data as in the question, specify windows-1251 instead.
<route id="process_umlaug_file" startupOrder="2">
<from uri="file:///home/steppra1/Downloads?fileName=input_umlauts.txt"/>
<convertBodyTo type="java.lang.String" charset="ISO-8859-1"/>
<to uri="file:///home/steppra1/Downloads?fileName=output_umlauts.txt&amp;charset=UTF-8"/>
</route>
I tested this by reading an ISO-8859-1 encoded file containing German umlauts (CP1251 is a Cyrillic code page and has no umlauts):
steppra1@steppra1-linux-mint ~/Downloads $ file input_umlauts.txt
input_umlauts.txt: ISO-8859 text, with CRLF line terminators
steppra1@steppra1-linux-mint ~/Downloads $ file output_umlauts.txt
output_umlauts.txt: UTF-8 Unicode text, with CRLF line terminators
The two steps of decoding and then re-encoding yield properly encoded German umlauts. If I change the above route to
<route id="process_umlaug_file" startupOrder="2">
<from uri="file:///home/steppra1/Downloads?fileName=input_umlauts.txt"/>
<convertBodyTo type="java.lang.String" charset="UTF-8"/>
<to uri="file:///home/steppra1/Downloads?fileName=output_umlauts.txt"/>
</route>
then the output file is still UTF-8 encoded, possibly because that is my platform default, but the umlauts are garbled.
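The decode-then-re-encode principle is independent of Camel. As a minimal sketch in Python (the sample text is a made-up assumption, not data from the question), decoding with the sender's charset preserves the text, while decoding the same bytes as UTF-8 does not:

```python
# Hypothetical sample: bytes as a Windows-1251 source would send them.
raw = "Привет".encode("cp1251")

# Step 1: decode with the charset the sender actually used.
text = raw.decode("cp1251")

# Step 2: re-encode as UTF-8 before storing.
utf8_bytes = text.encode("utf-8")

# Decoding with the wrong charset loses the text: these bytes are not
# valid UTF-8, so they come out as U+FFFD replacement characters.
garbled = raw.decode("utf-8", errors="replace")
```

In the Camel route, convertBodyTo with the source charset plays the role of step 1 and the file endpoint's charset option plays the role of step 2.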

Please have a look at .convertBodyTo() - in particular the charset argument.
from("direct:foo").
to("http://foo.com/bar.html").
convertBodyTo(String.class, "UTF-8").
to("file:///tmp/bar.html")
Reference: http://camel.apache.org/convertbodyto.html

Related

JMeter: How to decode alphanumeric characters fetched in JMETER response to its valid value?

I have a JMeter test plan which basically downloads a file by breaking it into multiple parts.
However, the parts arrive as encoded alphanumeric strings.
For instance, we have a .txt file which is broken down into 2 parts. Each part is an encoded set of characters. So far I have managed to append these characters to another file.
Is there a way of restoring the contents of this file (which holds the alphanumeric characters) into the original .txt file with its valid contents?
e.g. JMeter response: <data> aWJiZWFuLFBhbmFtYSxDb3NtZXRpY </data>
Can someone please suggest the steps to achieve this?
It looks like the data is Base64-encoded. You can use the __base64Decode() function (it can be installed as part of the Custom JMeter Functions bundle using the JMeter Plugins Manager):
${__base64Decode(aWJiZWFuLFBhbmFtYSxDb3NtZXRpY,)}
If you cannot or do not want to use JMeter Plugins, you can achieve the same with JMeter's built-in __groovy() function:
${__groovy(new String('aWJiZWFuLFBhbmFtYSxDb3NtZXRpY'.decodeBase64()),)}
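Outside JMeter, restoring the file can be sketched in Python. The part contents below are made up for illustration (the sample in the question is truncated, so it is not used here), and the sketch assumes each part was Base64-encoded separately, which the question does not confirm:

```python
import base64

# Hypothetical stand-ins for the <data> payloads of two responses.
parts = [
    base64.b64encode(b"first half, ").decode("ascii"),
    base64.b64encode(b"second half").decode("ascii"),
]

# Decode each part on its own, then join the raw bytes in order.
restored = b"".join(base64.b64decode(p) for p in parts)
```

If the file was instead encoded as a whole and then split, the parts would have to be concatenated first and decoded once.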

libxml UTF-8 encoding not displaying Spanish characters

I am programming in C using the libxml2 library. I am converting a normal XML file into valid XMLTV format, but the Spanish characters are not displayed correctly.
I am aware that the xml encoding for Spanish characters should be UTF-8.
I have this line at the top of my files:
<?xml version="1.0" encoding="utf-8"?>
And I also use the libxml function:
xmlSaveFormatFileEnc( (const xmlChar *) szFilename, doc, "utf-8", 1 );
which I believe is meant to save the file in utf-8 encoding.
When I import the XMLTV file into the EPG, the Spanish characters are displayed as different Spanish characters (e.g. é may display as Á). I have the feeling the file is still being encoded not in UTF-8 but in ISO-8859-1.

Convert WAV to base64

I have some wave files (.wav) and I need to convert them to base64-encoded strings. Could you guide me on how to do that in Python/C/C++?
@ForceBru's answer
import base64
enc=base64.b64encode(open("file.wav").read())
has one problem: for some of the WAV files I encoded, the generated string was shorter than expected.
Python docs for "open()" say
If mode is omitted, it defaults to 'r'.
The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability.
Hence, the snippet above does not read the file in binary mode. The code below opens it in binary mode and should be used instead:
import base64
enc = base64.b64encode(open("file.wav", "rb").read())
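A slightly fuller sketch of the binary-mode approach, using a with-block so the file is closed after reading. The temporary file and its bytes are hypothetical stand-ins for a real WAV file; they only serve to check that the round trip is lossless:

```python
import base64
import os
import tempfile

# Hypothetical stand-in for a WAV file: arbitrary bytes, including ones
# that text mode may translate or reject (CR LF, 0x1A, 0xFF).
payload = b"RIFF\r\n\x1a\x00\xff"
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

# Read in binary mode; the with-block closes the file afterwards.
with open(path, "rb") as f:
    enc = base64.b64encode(f.read())

os.remove(path)

# Decoding gives back exactly the bytes that were written.
roundtrip_ok = base64.b64decode(enc) == payload
```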
Python
The easiest way
from base64 import b64encode
f=open("file.wav", "rb")
enc=b64encode(f.read())
f.close()
Now enc contains the encoded value.
You can use a slightly simplified version:
import base64
enc=base64.b64encode(open("file.wav", "rb").read())
C
See this file for an example of base64 encoding of a file.
C++
Here you can see the base64 conversion of strings. I think it wouldn't be too difficult to do the same for files.

Parsing rtf file without encoded special character

I have an RTF file that contains a special character as a raw character, not escaped as a hexadecimal or Unicode value. I am using lib rtf in C.
Below is the content from the RTF file. It does not declare an encoding.
\b0\fi-380\li1800 · Going to home
While parsing, I get a junk value for the character. How can I get the proper character from the RTF file when the special character appears only in this raw form?
Regards,
Try this:
\b0\fi-380\li1800 \u183? Going to home
How to do it:
https://stackoverflow.com/a/20117501/1543816
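As a rough sketch of that replacement, the following Python helper rewrites non-ASCII characters as \uN? control words before the text is handed to the parser. The function name is hypothetical, the sketch assumes the input has already been decoded to text correctly, and it does not escape backslashes or braces:

```python
import struct

def rtf_escape_non_ascii(text):
    """Replace each non-ASCII character with RTF \\uN? control words."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # \uN takes a signed 16-bit decimal per UTF-16 code unit; the
        # trailing "?" is the fallback for readers without Unicode support.
        for (unit,) in struct.iter_unpack("<h", ch.encode("utf-16-le")):
            out.append("\\u%d?" % unit)
    return "".join(out)

escaped = rtf_escape_non_ascii("· Going to home")
```

Here the middle dot (U+00B7, decimal 183) becomes \u183?, matching the line suggested above.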

Linux & C-Programming: How can I write utf-8 encoded text to a file?

I am interested in writing utf-8 encoded strings to a file.
I did this with low level functions open() and write().
First I set the locale to a UTF-8 aware one with
setlocale(LC_ALL, "de_DE.utf8").
But the resulting file does not contain utf-8 characters, only iso8859 encoded umlauts. What am I doing wrong?
Addendum: I don't know if my strings are really utf-8 encoded in the first place. I just keep them in the source file in this form: char *msg = "Rote Grütze";
[Screenshot of the text file's content: http://img19.imageshack.us/img19/9791/picture1jh9.png]
Changing the locale won't change the actual data written to the file by write(). You have to actually produce UTF-8 encoded bytes to write them to a file. For that purpose you can use a library such as ICU.
Edit after your edit of the question: UTF-8 differs from ISO-8859 only in the "special" symbols (ümlauts, áccénts, etc.). So, for all text that doesn't contain any of these symbols, both are equivalent. However, if you include strings with those symbols in your program, you have to make sure your text editor saves the source as UTF-8. Sometimes you just have to tell it to.
To sum up, the text you produce will be in UTF-8 if the strings within the source code are in UTF-8.
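The point about "special" symbols can be checked byte by byte. A small Python sketch, using the string from the question, shows that only the umlaut is encoded differently:

```python
msg = "Rote Grütze"

latin1_bytes = msg.encode("latin-1")   # ü becomes the single byte 0xFC
utf8_bytes = msg.encode("utf-8")       # ü becomes the two bytes 0xC3 0xBC

# Pure ASCII text is byte-for-byte identical in both encodings.
ascii_same = "Rote".encode("latin-1") == "Rote".encode("utf-8")
```

Whichever of these byte sequences sits in the C string literal is what write() puts into the file, regardless of the locale.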
Another edit: Just to be sure, you can convert your source code to UTF-8 using iconv, which writes the converted text to standard output:
iconv -f latin1 -t utf8 file.c
This converts all your Latin-1 strings to UTF-8, so when you print them they will definitely be in UTF-8. Note that iconv cannot detect a wrong source encoding here: if the output shows strange characters such as Ã¼ in place of ü, then your strings were in UTF-8 already.
Regards,
Yes, you can do it with glibc. They call it multibyte instead of UTF-8, because it can handle more than one encoding type. Check out this part of the manual.
Look for functions that start with the prefix mb, and also functions with the wc prefix, for converting between multibyte and wide characters. You'll have to set the locale to a UTF-8 one first with setlocale() so that glibc chooses this implementation of multibyte support.
If you are coming from a Unicode file, I believe the function you are looking for is wcstombs().
Can you open the file in a hex editor and verify, with a simple input example, that the written bytes are not the values of the Unicode characters that you passed to write()? Sometimes a text editor has no way to determine the character set, and yours may have assumed ISO8859-1.
Once you have done this, could you edit your original post to add the pertinent information?
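If no hex editor is at hand, a few lines of Python can do a similar check by reading the raw bytes and attempting a strict UTF-8 decode. The helper name and the two temporary test files are hypothetical, a sketch rather than a definitive tool:

```python
import os
import tempfile

def looks_like_utf8(path):
    """Return True if the file's bytes decode as strict UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical test files: the same text written in two encodings.
results = {}
for name, encoding in (("utf8", "utf-8"), ("latin1", "latin-1")):
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write("Rote Grütze".encode(encoding))
    results[name] = looks_like_utf8(tmp.name)
    os.remove(tmp.name)
```

A file containing only ASCII passes this check as well, so the test input should include at least one umlaut.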
