Word2vec C training on German Wikipedia

I am using the C version of word2vec (as found at https://code.google.com/archive/p/word2vec/) and training it on a filtered dump of the German version of Wikipedia (~17 GB raw text, ~1.4 B words). I am using the following settings:
-cbow 1 -size 300 -window 5 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15 -min-count 1000
The resulting output file contains ~50k words; however, none of them contains the letters ä, ö, ü or ß. I verified that word2vec can handle these characters by building a small corpus of words containing them, and those words did appear in the output.
What could be causing the words containing these characters not to appear in the output file? Is it somehow related to the large size of the corpus or to any of the settings I am using?

It should not be related to the size of the corpus. I've trained a German model (see the project linked below) with similar settings on a Wikipedia dump plus German news articles (600k words in the vocabulary), and it produced word vectors for words with German umlauts as well.
Things you can do:
Check that the character encoding of your corpus file, as well as that of your training environment, is UTF-8
Avoid the problem altogether by converting the umlauts to their two-letter replacements (ä → ae, ß → ss, etc.) during preprocessing; a sketch of such a filter follows this list
Check out this project where word2vec was applied to a German corpus (but with gensim's port of the C implementation)
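To illustrate the preprocessing option, here is a minimal sketch of such a filter in C: it reads UTF-8 text on stdin and writes the two-letter replacements to stdout. The file names in the usage line are placeholders, not part of the original setup.

#include <stdio.h>

/* Replace German umlauts and ß in UTF-8 input with ASCII digraphs.
 * In UTF-8 each of these characters is a two-byte sequence 0xC3 0xXX. */
int main(void) {
    int c;
    while ((c = getchar()) != EOF) {
        if (c != 0xC3) {              /* ASCII and other bytes: copy through */
            putchar(c);
            continue;
        }
        int d = getchar();
        switch (d) {
            case 0xA4: fputs("ae", stdout); break;   /* ä */
            case 0xB6: fputs("oe", stdout); break;   /* ö */
            case 0xBC: fputs("ue", stdout); break;   /* ü */
            case 0x9F: fputs("ss", stdout); break;   /* ß */
            case 0x84: fputs("Ae", stdout); break;   /* Ä */
            case 0x96: fputs("Oe", stdout); break;   /* Ö */
            case 0x9C: fputs("Ue", stdout); break;   /* Ü */
            default:                                 /* some other non-ASCII character */
                putchar(c);
                if (d != EOF) putchar(d);
        }
    }
    return 0;
}

Run it over the corpus before training, e.g. gcc -O2 umlauts.c -o umlauts && ./umlauts < dewiki.txt > dewiki_ascii.txt.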

Related

Switching data in a text file to ASCII or to UTF-8

I have a text file in Unix with two columns; the first column contains strings in various languages (Chinese, Korean, Japanese, Arabic, English, French, German, etc.).
The file's current encoding is:
> file index.txt
index.txt: Non-ISO extended-ASCII English text, with LF, NEL line terminators
I've been told that this file has a subset of entries (in column 1) that use a non-ASCII, non-UTF-8 encoding, and that I should convert the data in this column preferably to ASCII or, if that is not possible, to UTF-8.
For example:
1. How the user sees it: 'Bibliothe<C3>que'.
2. Via vim: 'Bibliothèque'.
3. Via less: 'Bibliothèque'.
I have already tried many conversions and methods (for days), but none of them converted it as expected.
For example, I tried to change the encoding to UTF-8:
iconv -f CP1256 -t UTF-8 < index.txt > index.txt.2
> file index.txt.2
index.txt.2: UTF-8 Unicode English text
But the characters seem to be corrupted in the new file:
1. Via vim: 'Bibliothﺃ¨que'
2. Via less: 'Bibliothأ¨que'
I checked how many non-ASCII rows this file contains and got an output file ('index.txt.non_ascii') with hundreds of lines:
pcregrep --color='auto' -n "[\x80-\xFF]" index.txt > index.txt.non_ascii
I also tried to write a short script (in Perl) that reads the data and stores it as UTF-8, but the strings were corrupted again.
I would really appreciate it if someone could assist me with this problem.
Thanks in advance!
Mike
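Since only a subset of the entries seems to be in a non-UTF-8 encoding, a useful first step is to locate exactly where the data stops being valid UTF-8; the rows around those offsets are the ones that actually need converting. Below is a minimal C sketch of such a check (it only verifies the byte-sequence structure, ignores overlong forms, and the program name is hypothetical); run it as ./utf8check < index.txt.

#include <stdio.h>

/* Report whether stdin is structurally valid UTF-8; print the offset of
 * the first suspicious byte otherwise. */
int main(void) {
    long offset = 0;   /* byte position in the input            */
    long bad = -1;     /* offset of first invalid byte, or -1   */
    int c;
    while (bad < 0 && (c = getchar()) != EOF) {
        int follow;
        if (c < 0x80)                follow = 0;   /* ASCII                */
        else if ((c & 0xE0) == 0xC0) follow = 1;   /* start of 2-byte char */
        else if ((c & 0xF0) == 0xE0) follow = 2;   /* start of 3-byte char */
        else if ((c & 0xF8) == 0xF0) follow = 3;   /* start of 4-byte char */
        else { bad = offset; break; }              /* invalid lead byte    */
        offset++;
        while (follow-- > 0) {                     /* continuation bytes must be 10xxxxxx */
            c = getchar();
            if (c == EOF || (c & 0xC0) != 0x80) { bad = offset; break; }
            offset++;
        }
    }
    if (bad >= 0)
        printf("not valid UTF-8: first suspicious byte at offset %ld\n", bad);
    else
        printf("structurally valid UTF-8\n");
    return bad >= 0 ? 1 : 0;
}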

Is there a tool to search for differences in hex files?

I have two hex files, and want to search for values that have changed from 9C in one file to 9D in the next one.
Even a tool that synchronizes the windows would work, as I can't seem to find any that enable me to do this. All I've found are tools that let me search for the same value and same address between two files.
Thanks!
"There's no such thing as a hex file."
Oh well perhaps I should delete all my hex files.
I use the Motorola and Intel formats of hex files (they are just text files representing bytes, but include a record structure with address, checksum, etc. per line). These files need to be treated as binary (in .gitattributes) since they cannot be merged. I use KDiff3 (or any other text diff tool) to see the differences.
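For the specific 9C-to-9D search, one low-tech option is to convert both hex files to raw binary images first (for Intel hex, e.g. objcopy -I ihex -O binary old.hex old.bin) and then compare the images byte by byte. A minimal C sketch along those lines, assuming both images have the same size and base address (the file names are placeholders):

#include <stdio.h>

/* Compare two binary images and report every offset where the first
 * file has 0x9C and the second has 0x9D at the same position. */
int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s old.bin new.bin\n", argv[0]);
        return 1;
    }
    FILE *a = fopen(argv[1], "rb");
    FILE *b = fopen(argv[2], "rb");
    if (!a || !b) { perror("fopen"); return 1; }
    long offset = 0;
    int ca, cb;
    while ((ca = fgetc(a)) != EOF && (cb = fgetc(b)) != EOF) {
        if (ca == 0x9C && cb == 0x9D)
            printf("0x%08lX: 9C -> 9D\n", offset);
        offset++;
    }
    fclose(a);
    fclose(b);
    return 0;
}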

How to output special characters in the cmd window?

When I write a C program and try to output special characters (like ä ö ü ß) with printf() in the cmd window on Windows 10, it only shows something like ▒▒▒▒▒▒▒▒▒▒▒▒
But if I just type those characters in the cmd window, without a C program being executed, they are displayed properly.
When I change the console type to standard output in NetBeans, the output is correct as well.
I tried to change the code page of cmd, but it didn't fix the problem.
I use the GCC C compiler.
The reason is the use of different code pages for character encoding.
In a GUI text editor, when writing program code stored in a file in which each character is encoded with just a single byte, the code page Windows-1252 is used in Western European and North American countries.
In a console window opened when running a console application, an OEM code page is used instead: OEM 850 in Western European countries and OEM 437 in North American countries.
So for ÄÖÜäöüß you need different byte values in your code to get those characters displayed as expected in the console window, at least when the program is executed in Western European and North American countries.
Character   Windows-1252   OEM 850
Ä           \xC4           \x8E
Ö           \xD6           \x99
Ü           \xDC           \x9A
ä           \xE4           \x84
ö           \xF6           \x94
ü           \xFC           \x81
ß           \xDF           \xE1
The code page used by default in a console window can be seen by opening a command prompt window and running either chcp (change code page) or mode; both display the active code page.
The default code page for GUI applications and console applications on a computer depends on the Windows region and language settings of the user account.
Some web pages you should read to better understand character encoding:
Character encoding (English Wikipedia article)
On the Goodness of Unicode by Tim Bray
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
What's the best default new file format? (UltraEdit forum topic)
Programmers should not write non-ASCII characters literally into strings output by a compiled executable, because the result depends on which code page the compiler uses when creating the binary representation (bytes) of those characters in the executable. It is better to use hexadecimal notation when the active code page at execution time is known, or is set by the application before the string is output.
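For example, here is a short sketch that prints ÄÖÜäöüß using the OEM 850 byte values from the table above; it assumes the console's active code page really is 850.

#include <stdio.h>

int main(void) {
    /* Byte values of Ä Ö Ü ä ö ü ß in OEM code page 850 (see the table
     * above), written as hex escapes so that the code page used by the
     * compiler for the source file does not matter.                    */
    printf("\x8E \x99 \x9A \x84 \x94 \x81 \xE1\n");
    return 0;
}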
It is also possible to store the strings in the executable in Unicode, determine the encoding of the output handle before outputting any string, and convert each Unicode string to the encoding of the output handle before writing it.
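A sketch of that approach for the Windows console (assuming GCC/MinGW with <windows.h> available; the \u escapes keep the source file encoding out of the picture):

#include <windows.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    const wchar_t *s = L"\u00C4\u00D6\u00DC\u00E4\u00F6\u00FC\u00DF\n";  /* ÄÖÜäöüß */
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode, written;
    if (GetConsoleMode(out, &mode)) {
        /* stdout is a real console: write the UTF-16 string directly,
         * bypassing the code page of the C runtime entirely.          */
        WriteConsoleW(out, s, (DWORD)wcslen(s), &written, NULL);
    } else {
        /* stdout is redirected to a file or pipe: emit UTF-8 bytes instead. */
        char buf[64];
        if (WideCharToMultiByte(CP_UTF8, 0, s, -1, buf, sizeof buf, NULL, NULL) > 0)
            fputs(buf, stdout);
    }
    return 0;
}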
And of course, how the bytes in the executable's strings are finally displayed on screen also depends on the font used by the console.

What does ISO-8859 in `file` mean?

I ran the following command in a software repository I have access to:
find . -not -name ".svn" -type f -exec file "{}" \;
and saw many output lines like
./File.java: ISO-8859 C++ program text
What does that mean? ISO-8859 is a family of encodings, not one specific encoding. I expected all files to be UTF-8, but most are reported with this encoding. Is ISO-8859 a proper subset of UTF-8, too?
Is it possible for me to convert all those files safely, for example by using ISO-8859-1 as the source encoding while translating them into UTF-8 with iconv?
I am afraid the Unix file program is rather bad at this. The label just means the file is in some single-byte encoding; it does not mean it is ISO-8859-1. It might even be in a non-ISO byte encoding, although file usually figures that out.
I have a system that does much better than file, but it is trained on an English-language corpus, so it might not do as well on German.
The short answer is that the result of file is not reliable. You have to know the real encoding to up-convert it.
The charset detection used by file is rather simplistic. It recognizes UTF-8, and it distinguishes between "ISO-8859" and "non-ISO extended-ASCII" by looking for bytes in the 0x80-0x9F range, where the ISO 8859 encodings have "holes". But it makes no attempt to determine which ISO 8859 encoding is in use, which is why it just says ISO-8859 instead of ISO-8859-1 or ISO-8859-15.
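That heuristic is easy to reproduce. The following sketch scans a file and reports whether it contains bytes in the 0x80-0x9F "hole" (it leaves out the separate UTF-8 check that file performs first):

#include <stdio.h>

/* Roughly mimic file's single-byte heuristic: the ISO 8859 encodings do
 * not assign printable characters to 0x80-0x9F, so bytes in that range
 * point to Windows-125x, DOS code pages or similar encodings.           */
int main(int argc, char **argv) {
    FILE *f = (argc > 1) ? fopen(argv[1], "rb") : stdin;
    if (!f) { perror("fopen"); return 1; }
    long hole = 0, high = 0;   /* bytes in 0x80-0x9F / bytes in 0xA0-0xFF */
    int c;
    while ((c = fgetc(f)) != EOF) {
        if (c >= 0x80 && c <= 0x9F) hole++;
        else if (c >= 0xA0)         high++;
    }
    if (hole > 0)
        printf("%ld byte(s) in 0x80-0x9F: non-ISO extended-ASCII territory\n", hole);
    else if (high > 0)
        printf("%ld high byte(s), all in 0xA0-0xFF: consistent with ISO 8859\n", high);
    else
        printf("pure ASCII\n");
    if (f != stdin) fclose(f);
    return 0;
}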

Weird directory entries in FAT file system

So I'm trying to figure out how the FAT file system works and got confused by the root directory table. I have two files in the partition, test.txt and innit.eh, which results in the following table:
The entries starting with 0xE5 are deleted, so I assume these were created due to renaming. The entries for the actual files look like this:
TEST TXT *snip*
INNIT EH *snip*
What I don't understand is where the entries like
At.e.s.t......t.x.t
Ai.n.n.i.t.....e.h.
are coming from and what they are for. They do not start with 0xE5, so they should be treated as existing files.
By the way, I'm using Debian Linux to create the filesystems and files, but I noticed similar behaviour with filesystems and files created on Windows.
The ASCII parts of the name (where the letters are close to each other) are the legacy 8.3 DOS short name. You can see it only uses capital letters. In DOS, only these entries would be present.
The longer parts (with 0x00 bytes in between) are the long name (shown in Windows), which is Unicode and uses 16 bits per character.
The intervening bytes are all 0x00, which strongly suggests the names are stored in UTF-16 rather than UTF-8. Perhaps they are there as an extension for long filenames, similar to the other VFAT extensions?
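For reference, these long-name entries are indeed the VFAT long-file-name extension: each 32-byte directory entry whose attribute byte (offset 11) is 0x0F carries up to 13 UTF-16 code units of the name in three fields. A small sketch that extracts them from one such entry; the example bytes are made up to spell "test.txt".

#include <stdio.h>
#include <stdint.h>

/* Print the up-to-13 UTF-16 code units stored in one 32-byte VFAT
 * long-file-name directory entry. The three name fields sit at byte
 * offsets 1-10, 14-25 and 28-31; unused slots are 0x0000 / 0xFFFF.  */
static void print_lfn_chars(const uint8_t e[32]) {
    static const int start[] = {1, 14, 28};
    static const int count[] = {5,  6,  2};
    for (int f = 0; f < 3; f++) {
        for (int i = 0; i < count[f]; i++) {
            uint16_t u = (uint16_t)(e[start[f] + 2*i] | (e[start[f] + 2*i + 1] << 8));
            if (u == 0x0000 || u == 0xFFFF) return;  /* terminator / padding */
            if (u < 0x80) putchar((int)u);           /* print the ASCII subset only */
            else          printf("\\u%04X", u);
        }
    }
}

int main(void) {
    /* Example entry: sequence 0x41 (first and last entry of the name),
     * attribute 0x0F, dummy checksum, name fragment "test.txt".        */
    const uint8_t entry[32] = {
        0x41, 't',0, 'e',0, 's',0, 't',0, '.',0,         /* seq + first 5 chars      */
        0x0F, 0x00, 0x00,                                /* attr, type, checksum     */
        't',0, 'x',0, 't',0, 0,0, 0xFF,0xFF, 0xFF,0xFF,  /* next 6 chars + padding   */
        0x00, 0x00,                                      /* first cluster (always 0) */
        0xFF,0xFF, 0xFF,0xFF                             /* last 2 chars: padding    */
    };
    print_lfn_chars(entry);   /* prints: test.txt */
    putchar('\n');
    return 0;
}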

Resources