Change File Encoding to utf-8 via vim in a script - file

I ran into trouble after our server was upgraded from Debian 4 to 5.
We switched to UTF-8 environment and now we have problems getting the text printed correctly on the browser, because all files are in non-utf8 encodings like iso-8859-1, ascii, etc.
I tried many different scripts.
The first one I tried is "iconv". That one doesn't work, it changes the content, but the file's encoding is still non-utf8.
Same problem with enca, encamv, convmv and some other tools I installed via apt-get.
Then I found some Python code that uses the chardet Universal Detector module to detect the encoding of a file (which works fine), but using the unicode class or the codec class to save it as UTF-8 doesn't work, and it fails without any errors.
The only way I found to get the file and its content converted to UTF-8, is vi.
These are the steps I do for one file:
vi filename.php
:set bomb
:set fileencoding=utf-8
:wq
That's it. That one works perfect. But how can I get this running via a script?
I would like to write a script (Linux shell) which traverses a directory taking all php files, then converting them using vi with the commands above.
As I need to start the vi app, I do not know how to do something like this:
"vi --run-command=':set bomb, :set fileencoding=utf-8' filename.php"
Hope someone can help me.

This is the simplest way I know of to do this easily from the command line:
vim +"argdo se bomb | se fileencoding=utf-8 | w" $(find . -type f -name '*.php')
Or better yet if the number of files is expected to be pretty large:
find . -type f -name '*.php' | xargs vim +"argdo se bomb | se fileencoding=utf-8 | w"

You could put your commands in a file, let's call it script.vim:
set bomb
set fileencoding=utf-8
wq
Then you invoke Vim with the -S (source) option to execute the script on the file you wish to fix. To do this on a bunch of files you could do
find . -type f -name "*.php" -exec vim -S script.vim {} \;
You could also put the Vim commands on the command line using the + option, but I think it may be more readable like this.
Note: I have not tested this.
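In the same untested spirit, here is a minimal sketch of that loop as a shell function. The names convert_php_tree and VIM_CMD are made up for illustration, and it assumes a Vim that accepts -c commands in silent Ex mode (-es). Pointing VIM_CMD at a harmless stub lets you preview which files would be touched before running Vim for real.

```shell
#!/bin/sh
# Hypothetical helper: run the three Vim commands from script.vim on every
# *.php file below DIR (default: current directory).
# VIM_CMD is an assumed override hook, not a Vim feature.
convert_php_tree() {
    dir=${1:-.}
    find "$dir" -type f -name '*.php' \
        -exec "${VIM_CMD:-vim}" -es \
            -c 'set bomb' \
            -c 'set fileencoding=utf-8' \
            -c 'wq' {} \;
}
```

If you do not want a byte order mark, swap 'set bomb' for 'set nobomb'.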

You may actually want set nobomb (BOM = byte order mark), especially in the [not windows] world.
For example, I had a script that didn't work because there was a byte order mark at the start. It isn't usually displayed in editors (even with set list in vi) or on the console, so it's difficult to spot.
The file looked like this
#!/usr/bin/perl
...
But trying to run it, I get
./filename
./filename: line 1: #!/usr/bin/perl: No such file or directory
Not displayed, but at the start of the file, is the 3-byte BOM. So, as far as Linux is concerned, the file doesn't start with #!
The solution is
vi filename
:set nobomb
:set fileencoding=utf-8
:wq
This removes the BOM at the start of the file, making it correct utf8.
NB Windows uses the BOM to identify a text file as being utf8, rather than ANSI. Linux (and the official spec) doesn't.
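Since the BOM is invisible in most editors, it can help to check for it from the shell. The following is a small sketch; has_bom and strip_bom are names I made up. It relies only on the fact that the UTF-8 BOM is the three bytes EF BB BF.

```shell
#!/bin/sh
# has_bom FILE: succeed if FILE starts with the UTF-8 BOM (bytes EF BB BF).
has_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# strip_bom FILE: print FILE to stdout without a leading BOM, if it has one.
strip_bom() {
    if has_bom "$1"; then
        tail -c +4 "$1"    # start output at byte 4, skipping the 3 BOM bytes
    else
        cat "$1"
    fi
}
```

For example, has_bom filename && echo "BOM present" would have flagged the Perl script above.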

The accepted answer will keep the last file open in Vim. This problem can be easily resolved using the -c option of Vim,
vim +"argdo set bomb | set fileencoding=utf-8 | w" -c ":q" file1.txt file2.txt
If you need only process one file, the following will also work,
vim -c ':set bomb' -c ':set fileencoding=utf-8' -c ':wq' file1.txt

Related

Create a vim script, function or macro and run in windows by command line

I created a script that converts a text file to UTF-8 encoding. I can run it in Vim. The problem is that I need to run it from cmd in Windows and I can't figure out how. Help me.
Sorry for my English. I'm from South America; I speak Spanish.
Alternatives
Unless you really need special Vim capabilities, you're probably better off using non-interactive tools like sed, awk, or Perl / Python / Ruby / your favorite scripting language here. For simple character set conversion, look into the iconv tool in particular.
That said, you can use Vim non-interactively:
Silent Batch Mode
For very simple text processing (i.e. using Vim like an enhanced 'sed' or 'awk', maybe just benefitting from the enhanced regular expressions in a :substitute command), use Ex-mode.
REM Windows
call vim -N -u NONE -n -es -S "commands.ex" "filespec"
Note: silent batch mode (:help -s-ex) messes up the Windows console, so you may have to do a cls to clean up after the Vim run.
# Unix
vim -T dumb --noplugin -n -es -S "commands.ex" "filespec"
Attention: Vim will hang waiting for input if the "commands.ex" file doesn't exist; better check beforehand for its existence! Alternatively, Vim can read the commands from stdin. You can also fill a new buffer with text read from stdin, and read commands from stderr if you use the - argument.
Full Automation
For more advanced processing involving multiple windows, and real automation of Vim (where you might interact with the user or leave Vim running to let the user take over), use:
vim -N -u NONE -n -c "set nomore" -S "commands.vim" "filespec"
Here's a summary of the used arguments:
-T dumb          Avoids errors in case the terminal detection goes wrong.
-N -u NONE       Do not load vimrc and plugins, alternatively:
--noplugin       Do not load plugins.
-n               No swapfile.
-es              Ex mode + silent batch mode (-s-ex).
                 Attention: Must be given in that order!
-S ...           Source script.
-c 'set nomore'  Suppress the more-prompt when the screen is filled
                 with messages or output to avoid blocking.
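Because a missing "commands.ex" makes Vim wait for input, a thin wrapper that checks for the script first can be worthwhile on Unix. This is only a sketch: run_ex and the VIM_CMD override are hypothetical names, and the Vim arguments are the ones from the Unix example above.

```shell
#!/bin/sh
# run_ex SCRIPT FILE: run Vim in silent batch mode on FILE, but refuse to
# start when SCRIPT is missing or unreadable (Vim would otherwise hang).
# VIM_CMD is an assumed override hook for testing, not a Vim feature.
run_ex() {
    script=$1
    target=$2
    if [ ! -r "$script" ]; then
        echo "run_ex: cannot read script: $script" >&2
        return 2
    fi
    "${VIM_CMD:-vim}" -T dumb --noplugin -n -es -S "$script" "$target"
}
```

Usage would be run_ex commands.ex filespec; a return status of 2 means the script file was not found.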

In C, what's the typical way to handle multiple arguments that are "list"-like?

Suppose I have some program called "combine" that takes input of "red", "green" and "blue"-type files to produce an output file (let's say "color.jpg")... BUT the number of each type is arbitrary. Let's also suppose that there's no way to determine what type the file is except through how the user classifies them. What do people usually do in this case?
For instance, on the command line, some of the approaches might be:
command red1,red2,red3 green1,green2 blue1 color.jpg
This comma-approach breaks down if commas can appear in the filenames. It's the approach I like the most though. Another idea would be
command "red1 red2 red3" "green1 green2" "blue1" color.jpg
but this approach also has trouble with spaces in names.
I could also require ASCII files containing lists giving the files of each type:
command redlist greenlist bluelist color.jpg
but this requires lugging around extra files.
Further ideas? Is there a standard Linux way of doing this?
The standard way would be this:
command --red red1.jpg --red red2.jpg --blue blue1.jpg
With short options:
command -r red1.jpg -r red2.jpg -b blue1.jpg
With bash shorthand:
command -r={red1,red2}.jpg -b blue1.jpg
(The shell expands the braces before the program runs, so the program sees -r=red1.jpg -r=red2.jpg; it must therefore accept the -r=value form.)
Doing things this way avoids arbitrary limitations like "no commas in filenames" and also makes your program more interoperable with standard *nix utilities like xargs and so on.
Another way is accepting:
command -r redfile1 redfile2 -b bluefile1 blue2 blue3 -g green1
so that:
command -r red* -b blue* -g green*
is possible.
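To make the repeated-option convention concrete, here is a sketch using the shell's getopts, which follows the same conventions as getopt(3) in C. The function name parse_colors and the variables reds, greens, blues and output are my own; each list is kept newline-separated so that commas and spaces in file names need no special treatment.

```shell
#!/bin/sh
# Sketch: collect repeated -r/-g/-b options into newline-separated lists.
nl='
'
parse_colors() {
    reds= greens= blues= output=
    OPTIND=1                       # restart option parsing on every call
    while getopts "r:g:b:" opt; do
        case "$opt" in
            r) reds="${reds:+$reds$nl}$OPTARG" ;;
            g) greens="${greens:+$greens$nl}$OPTARG" ;;
            b) blues="${blues:+$blues$nl}$OPTARG" ;;
            *) return 2 ;;         # unknown option or missing argument
        esac
    done
    shift $((OPTIND - 1))
    output=${1:-}                  # first non-option argument, e.g. color.jpg
}
```

After parse_colors -r 'red 1.jpg' -r 'red,2.jpg' -b blue1.jpg color.jpg, the variable reds holds both red names, one per line.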

Moving things in terminal based on their name

Edit: I think this has been answered successfully, but I can't check 'til later. I've reformatted it as suggested though.
The question: I have a series of files, each with a name of the form XXXXNAME, where XXXX is some number. I want to move them all to separate folders called XXXX and have them called NAME. I can do this manually, but I was hoping that by naming them XXXXNAME there'd be some way I could tell Terminal (I think that's the right name, but not really sure) to move them there. Something like
mv *NAME */NAME
but where it takes whatever * was in the first case and regurgitates it to the path.
This is on some form of Linux, with a bash shell.
In the real life case, the files are 0000GNUmakefile, with sequential numbering. I'm having to make lots of similar-but-slightly-altered versions of a program to compile and run on a cluster as part of my research. It would probably have been quicker to write a program to edit all the files and put in the right place in the first place, but I didn't.
This is probably extremely simple, and I should be able to find an answer myself, if I knew the right words. Thing is, I have no formal training in programming, so I don't know what to call things to search for them. So hopefully this will result in me getting an answer, and maybe knowing how to find out the answer for similar things myself next time. With the basic programming I've picked up, I'm sure I could write a program to do this for me, but I'm hoping there's a simple way to do it just using functionality already in Terminal. I probably shouldn't be allowed to play with these things.
Thanks for any help! I can actually program in C and Python a fair amount, but that's through trial and error largely, and I still don't know what I can do and can't do in Terminal.
SO many ways to achieve this.
I find that the old standbys sed and awk are often the most powerful.
ls | sed -rne 's:^([0-9]{4})(NAME)$:mv -iv & \1/\2:p'
If you're satisfied that the commands look right, pipe the command line through a shell:
ls | sed -rne 's:^([0-9]{4})(NAME)$:mv -iv & \1/\2:p' | sh
I put NAME in brackets and used \2 so that if it varies more than your example indicates, you can come up with a regular expression to handle your filenames better.
To do the same thing in gawk (GNU awk, the variant found in most GNU/Linux distros):
ls | gawk '/^[0-9]{4}NAME$/ {printf("mv -iv %s %s/%s\n", $1, substr($0,1,4), substr($0,5))}'
As with the first sample, this produces commands which, if they make sense to you, can be piped through a shell by appending | sh to the end of the line.
Note that with all these mv commands, I've added the -i and -v options. This is for your protection. Read the man page for mv (by typing man mv in your Linux terminal) to see if you should be comfortable leaving them out.
Also, I'm assuming with these lines that all your directories already exist. You didn't mention if they do. If they don't, here's a one-liner to create the directories.
ls | sed -rne 's:^([0-9]{4})(NAME)$:mkdir -p \1:p' | sort -u
As with the others, append | sh to run the commands.
I should mention that it is generally recommended to use constructs like for (in Tim's answer) or find instead of parsing the output of ls. That said, when your filename format is as simple as /[0-9]{4}word/, I find the quick sed one-liner to be the way to go.
Lastly, if by NAME you actually mean "any string of characters" rather than the literal string "NAME", then in all my examples above, replace NAME with .*.
The following script will do this for you. Copy the script into a file on the remote machine (we'll call it sortfiles.sh).
#!/bin/bash
# Get all regular files in the current directory named XXXXsomename, where X is a digit
files=$(find . -maxdepth 1 -type f -name '[0-9][0-9][0-9][0-9]*')
# Build a list of the XXXX prefixes found in the list of files
# (characters 3-6 of each path, because find prints a leading "./")
dirs=
for name in ${files}; do
    dirs="${dirs} $(echo "${name}" | cut -c 3-6)"
done
# Remove redundant entries from the list of XXXX prefixes
dirs=$(echo ${dirs} | tr ' ' '\n' | sort -u)
# Create any XXXX directories that are not already present
for name in ${dirs}; do
    if [[ ! -d ${name} ]]; then
        mkdir "${name}"
    fi
done
# Move each XXXXsomename file to XXXX/somename
for name in ${files}; do
    mv "${name}" "$(echo "${name}" | cut -c 3-6)/$(echo "${name}" | cut -c 7-)"
done
# Return from script with normal status
exit 0
From the command line, do chmod +x sortfiles.sh
Execute the script with ./sortfiles.sh
Just open the Terminal application, cd into the directory that contains the files you want moved/renamed, and copy and paste these commands into the command line.
shopt -s extglob   # the *( ... ) patterns below need extended globbing
for file in [0-9][0-9][0-9][0-9]*; do
dirName="${file%%*([^0-9])}"
mkdir -p "$dirName"
mv "$file" "$dirName/${file##*([0-9])}"
done
This assumes all the files that you want to rename and move are in the same directory. The file globbing also assumes that there are at least four digits at the start of the filename. If there are more than four digits, the file will still be caught, but not if there are fewer than four. If there are fewer than four, take off the appropriate number of [0-9]s from the first line.
It does not handle the case where "NAME" (i.e. the name of the new file you want) starts or ends with a digit.
See this site for more information about string manipulation in bash.
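If NAME can start or end with a digit, one alternative is to cut the name at a fixed position instead. Here is a sketch in plain POSIX shell; move_by_prefix is a name I made up, and it assumes the numeric prefix is always exactly four characters long.

```shell
#!/bin/sh
# move_by_prefix [DIR]: move each XXXXNAME file in DIR to XXXX/NAME.
# Runs in a subshell so the caller's working directory is left alone.
move_by_prefix() (
    cd "${1:-.}" || exit 1
    for file in [0-9][0-9][0-9][0-9]*; do
        [ -f "$file" ] || continue   # skip directories and an unmatched glob
        rest=${file#????}            # everything after the first four characters
        dir=${file%"$rest"}          # the four-digit prefix
        [ -n "$rest" ] || continue   # nothing left to use as NAME
        mkdir -p "$dir" && mv -i "$file" "$dir/$rest"
    done
)
```

Called as move_by_prefix in the directory holding the files, it would turn 0000GNUmakefile into 0000/GNUmakefile.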

Text specification for a tree of files?

I'm looking for examples of specifying files in a tree structure, for example, for specifying the set of files to search in a grep tool. I'd like to be able to include and exclude files and directories by name matches. I'm sure there are examples out there, but I'm having a hard time finding them.
Here's an example of a possible syntax:
*.py *.html
*.txt *.js
-*.pyc
-.svn/
-*combo_*.js
(This would mean: include files with extensions .py, .html, .txt and .js; exclude .pyc files, anything under a .svn directory, and any file matching *combo_*.js.)
I know I've seen these sorts of specifications in other tools before. Is this ringing any bells for anyone?
There is no single standard format for this kind of thing, but if you want to copy something that is widely recognized, have a look at the rsync documentation. Look at the chapter on "INCLUDE/EXCLUDE PATTERN RULES."
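To give a feel for how rsync-style rules behave, here is a toy evaluator in shell. It is a sketch, not rsync's actual algorithm: decide is a made-up name, it looks only at the base name of the path, and it treats unmatched files as excluded (rsync itself includes them). What it does share with rsync is that the first matching rule wins.

```shell
#!/bin/sh
# decide PATH RULE...: print "include" or "exclude" for PATH.
# Each RULE is "+glob" or "-glob"; the first rule whose glob matches the
# base name of PATH decides. Unmatched paths are excluded (an assumption).
decide() {
    path=$1
    shift
    base=${path##*/}
    for rule in "$@"; do
        pattern=${rule#?}            # drop the leading + or -
        case "$base" in
            $pattern)                # unquoted on purpose: match as a glob
                case "$rule" in
                    -*) echo exclude ;;
                    *)  echo include ;;
                esac
                return 0 ;;
        esac
    done
    echo exclude
}
```

Because the first match wins, exclusions have to come before the broader inclusions: decide lib/combo_all.js '-*combo_*.js' '+*.js' prints exclude, while the reverse order would print include.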
Apache Ant provides "Ant globs" or patterns, where:
**/foo/**/*.java
means "any file ending in '.java' in a directory which includes a directory named 'foo' in its path" -- including ./foo/X.java
In your example syntax, is it implicitly understood that there's an escaping character so that you can explicitly include a file that begins with a dash? (The same question goes for any other wildcard characters, but I suppose I'd expect to see more files with dashes in their names than asterisks.)
Various command shells use * (and possibly ? to match a single char), as in your example, but they generally only match against a string of characters that doesn't include a path component separator (i.e. '\' on Windows systems, '/' elsewhere). I've also seen such source control apps as Perforce use additional patterns that can match against path component separators. For instance, with Perforce the pattern "foo/...ext" (without quotes) will match all files under the foo/ directory structure that end with "ext", regardless of whether they are in foo/ itself or in one of its descendant directories. This seems to be a useful pattern.
If you're using bash, you can use the extglob extension to get some nice globbing functions. Enable it as follows:
shopt -s extglob
Then you can do things like the following:
# everything but .html, .jpg or .gif files
ls -d !(*.html|*gif|*jpg)
# list file9, file22 but not fileit
ls file+([0-9])
# begins with apl or un only
ls -d +(apl*|un*)
See also this page.
How about find in unixish environments?
Find can, of course, do more than build a list of files, but that is one of the common ways it is used. From the man page:
NAME
find -- walk a file hierarchy
SYNOPSIS
find [-H | -L | -P] [-EXdsx] [-f pathname] pathname ... expression
find [-H | -L | -P] [-EXdsx] -f pathname [pathname ...] expression
DESCRIPTION
The find utility recursively descends the directory tree for each
pathname listed, evaluating an expression (composed of the
``primaries'' and ``operands'' listed below) in terms of each file in the tree.
To achieve your goal I would write something like (formatted for readability):
find . -type d -name .svn -prune -o \
    \( -name '*.py' -o -name '*.html' -o -name '*.txt' -o -name '*.js' \) \
    ! -name '*combo_*.js' \
    -type f -print
Moreover, there is an idiomatic pattern using xargs which makes find suitable for sending the whole list so constructed to an arbitrary command, as in:
find /path -type f -print0 | xargs -0 rm
find(1) is a fine tool, as described in the previous answer, but if things get more complicated you should consider either writing your own script in one of the usual suspects (Ruby, Perl, Python et al.) or using one of the more powerful shells such as zsh, which has ** globbing and lets you specify things to exclude. The latter is probably more complicated, though.
You might want to check out ack, which allows you to specify file types to search in with options like --perl, etc.
It also ignores .svn directories by default, as well as core dumps, editor cruft, binary files, and so on.

Convert DOS/Windows line endings to Linux line endings in Vim

If I open files I created in Windows, the lines all end with ^M.
How do I delete these characters all at once?
dos2unix is a commandline utility that will do this, or :%s/^M//g will if you use Ctrl-v Ctrl-m to input the ^M, or you can :set ff=unix and Vim will do it for you.
There is documentation on the fileformat setting, and the Vim wiki has a comprehensive page on line ending conversions.
Alternately, if you move files back and forth a lot, you might not want to convert them, but rather to do :set ff=dos, so Vim will know it's a DOS file and use DOS conventions for line endings.
Change the line endings in the view:
:e ++ff=dos
:e ++ff=mac
:e ++ff=unix
This can also be used as saving operation (:w alone will not save using the line endings you see on screen):
:w ++ff=dos
:w ++ff=mac
:w ++ff=unix
And you can use it from the command-line:
for file in *.cpp
do
vi +':w ++ff=unix' +':q' "$file"
done
I typically use
:%s/\r/\r/g
which seems a little odd, but works because of the way that Vim matches linefeeds. I also find it easier to remember :)
I prefer to use the following command:
:set fileformat=unix
You can also use mac or dos to respectively convert your file to Mac or MS-DOS/Windows file convention. And it does nothing if the file is already in the correct format.
For more information, see the Vim help:
:help fileformat
:set fileformat=unix to convert from DOS to Unix.
:%s/\r\+//g
In Vim, that strips all carriage returns, and leaves only newlines.
In VIM:
:e ++ff=dos | set ff=unix | w!
In shell with VIM:
vim some_file.txt +'e ++ff=dos | set ff=unix | wq!'
e ++ff=dos - force open file in dos format.
set ff=unix - convert file to unix format.
From: File format
[Esc] :%s/\r$//
dos2unix can directly modify the file contents.
You can directly use it on the file, without any need for temporary file redirection.
dos2unix input.txt input.txt
Some dos2unix implementations also accept code page options such as -437 (DOS code page 437, US) for character-set conversion; these select a code page, not a keyboard layout.
dos2unix -437 input.txt input.txt
Convert directory of files from DOS to Unix
Using command line and sed, find all files in current directory with the extension ".ext" and remove all "^M"
# https://gist.github.com/sparkida/7773170
find $(pwd) -type f -name "*.ext" | while read file; do sed -e 's/^M//g' -i "$file"; done;
Also, as mentioned in a previous answer, ^M = Ctrl+V + Ctrl+M (don't just type the caret "^" symbol and M).
tr -d '\15\32' < winfile.txt > unixfile.txt
(See: Convert between Unix and Windows text files)
To run directly in a Linux console:
vim file.txt +"set ff=unix" +wq
The following steps can convert the file format for DOS to Unix:
:e ++ff=dos Edit file again, using dos file format ('fileformats' is ignored).
:setlocal ff=unix This buffer will use LF-only line endings when written.
:w Write buffer using Unix (LF-only) line endings.
Reference: File format
I found a very easy way: Open the file with nano: nano file.txt
Press Ctrl + O to save, but before pressing Enter, press: Alt+D to toggle between DOS and Unix/Linux line-endings, or: Alt+M to toggle between Mac and Unix/Linux line-endings, and then press Enter to save and Ctrl+X to quit.
The comment about getting the ^M to appear is what worked for me. Merely typing "^M" in my vi got nothing (not found). The CTRL+V CTRL+M sequence did it perfectly though.
My working substitution command was
:%s/Ctrl-V Ctrl-M/\r/g
and it looked like this on my screen:
:%s/^M/\r/g
With the following command:
:%s/^M$//g
To get the ^M to appear, type CtrlV and then CtrlM. CtrlV tells Vim to take the next character entered literally.
:g/Ctrl-v Ctrl-m/s///
CtrlM is the character \r, or carriage return, which DOS line endings add. CtrlV tells Vim to insert a literal CtrlM character at the command line.
Taken as a whole, this command replaces all \r with nothing, removing them from the ends of lines.
You can use:
vim somefile.txt +"%s/\r/\r/g" +wq
Or the dos2unix utility.
You can use the following command:
:%s/^V^M//g
where the '^' means use CTRL key.
The command below reformats all .sh files in the current directory. I tested it on my Fedora OS.
for file in *.sh; do awk '{ sub("\r$", ""); print }' "$file" >luxubutmp; cp -f luxubutmp "$file"; rm -f luxubutmp; done
In Vim, type:
:w !dos2unix %
This will pipe the contents of your current buffer to the dos2unix command and write the results over the current contents. Vim will ask to reload the file after.
I wanted newlines in place of the ^M's. Perl to the rescue:
perl -pi.bak -e 's/\x0d/\n/g' excel_created.txt
Or to write to stdout:
perl -p -e 's/\x0d/\n/g' < excel_created.txt
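If Perl isn't available, the same CR-to-newline translation can be sketched with tr (cr_to_lf is just a name I picked). As with the Perl version, this suits files that use a bare CR as the line terminator; on a CRLF file each line end would become two newlines.

```shell
#!/bin/sh
# cr_to_lf: copy stdin to stdout, turning every carriage return (octal 015)
# into a newline (octal 012).
cr_to_lf() {
    tr '\015' '\012'
}
```

Usage: cr_to_lf < excel_created.txt > converted.txt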
Usually there is a dos2unix command you can use for this. Just make sure you read the manual as the GNU and BSD versions differ on how they deal with the arguments.
BSD version:
dos2unix $FILENAME $FILENAME_OUT
mv $FILENAME_OUT $FILENAME
GNU version:
dos2unix $FILENAME
Alternatively, you can create your own dos2unix with any of the proposed answers here, for example:
function dos2unix(){
[ "${1}" ] && [ -f "${1}" ] || return 1;
{ echo ':set ff=unix';
echo ':wq';
} | vim "${1}";
}
From Wikia:
%s/\r\+$//g
That will find all carriage return signs (one and more reps) up to the end of line and delete, so just \n will stay at EOL.
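Outside Vim, roughly the same substitution can be sketched with sed. The function name strip_trailing_cr is made up; a literal CR is built with printf because \r inside a sed pattern is not portable. Only carriage returns at the end of a line are removed, so a CR in the middle of a line survives.

```shell
#!/bin/sh
# strip_trailing_cr [FILE...]: print the input with any run of carriage
# returns at the end of each line removed. Reads stdin when no FILE is given.
strip_trailing_cr() {
    cr=$(printf '\r')
    sed "s/${cr}*\$//" "$@"
}
```

For example, strip_trailing_cr dosfile > newfile leaves just \n at each end of line.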
This is my way. I opened a file in DOS EOL and when I save the file, that will automatically convert to Unix EOL:
autocmd BufWrite * :set ff=unix
If you create a file in Notepad or Notepad++ in Windows, bring it to Linux, and open it by Vim, you will see ^M at the end of each line. To remove this,
At your Linux terminal, type
dos2unix filename.ext
This will do the required magic.
I knew I'd seen this somewhere. Here is the FreeBSD login tip:
Do you need to remove all those ^M characters from a DOS file? Try
tr -d \\r < dosfile > newfile
-- Originally by Dru <genesis#istar.ca>
This is a little more than you asked for but:
nmap <C-d> :call range(line('w0'),line('w$'))->map({_,v-> getline(v)})->map({_,v->trim(v,join(map(range(1,0x1F)+[0xa0],{n->n->nr2char()}),''),2)})->map({k,v->setline(k+1,v)})<CR>
Run this and :set ff=unix|dos and no more need for unix2dos.
the single arg form of trim() has the same default mask above, plus 0X20 (an actual space) instead of 0x1F
that default mask clears out all non-printing chars including non-breaking spaces [0xa0] that are hard to find
create a list of lines from the range of lines
map that list to the trim function with using the same mask code as the source, less spaces
map that again to setline to replace the lines.
all :set fileformat= does at this point is choose which eol to save it with, dos or unix
it should be pretty easy to change the range of characters above if you want to eliminate or add some

Resources