Problem:
Writing a bash script, i'm trying to import a list of products that are inside a csv file into an array:
#!/bin/bash
PRODUCTS=(`csvprintf -f "/home/test/data/input.csv" -x | grep "col2" | sed 's/<col2>//g' | sed 's/<\/col2>//g' | sed -n '1!p' | sed '$ d' | sed 's/ //g'`)
echo ${PRODUCTS[#]}
In the interactive shell, the result/output looks perfect as following:
burger
special fries
juice - 300ml
When i use exactly the same commands in bash script, even debugging with bash -x script.sh, in the part of echo ${PRODUCTS[#]}, the result of array is all files names located at /home/test/data/ and:
burger
special
fries
juice
-
300ml
The array is taking directory list AND messed up newlines. This don't happen in interactive shell (single command line).
Anyone know how to fix that?
Looking at the docs for csvprintf, you're converting the csv into XML and then parsing it with regular expressions. This is generally a very bad idea.
You might want to install csvkit then you can do
csvcut -c prod input.csv | sed 1d
Or you could use a language that comes with a CSV parser module. For example, ruby
ruby -rcsv -e 'CSV.read("input.csv", :headers=>true).each {|row| puts row["prod"]}'
Whichever method you use, read the results into a bash array with this construct
mapfile -t products < <(command to extract the product data)
Then, to print the array elements:
for prod in "${products[#]}"; do echo "$prod"; done
# or
printf "%s\n" "${products[#]}"
The quotes around the array expansion are critical. If missing, you'll see one word per line.
Tip: don't use ALLCAPS variable names in the shell: leave those for the shell. One day you'll write PATH=something and then wonder why your script is broken.
Despite all the sed-backslash discussions on Stackoverflow I cannot find a working solution for my specific problem. I want to precede a certain string in a file by a backslash: something -> \something.
sed -i -- 's/\(something\)/\\\1/g' file
This always returns the string \1 instead of \something, because for some reason sed thinks it should escape the third backslash. The (from my point of view more logical) behaviour can be achieved by inserting a space between \\ and \1 in the sed command, but then the result is \ something (i.e. with an inserted space in the result) which is not what I want.
I am running this command in a batch file on Windows, using sed from cygwin (I hope this does not matter as I am aiming for a cross-platform solution).
EDIT: /usr/bin/sed version 4.2.2.
In Windows cmd with Cygwin, use this sed command:
sed -e 's/\(something\)/\\\\\1/g' file
You can start your script from a batch file
myBatch.bat
#echo off
c:\cygwin64\bin\bash ./mySed
mySed
#!/bin/bash
echo asdfsomethingasdf | sed 's/\(something\)/\\\1/g'
It can be necessary to use /usr/bin/sed when your path isn't completely set
I am using CentOS. I have a file that contains information like:
100000,UniqueName1
100000,UniqueName2
100000,UniqueName4
100000,SoloName9
I want to split this out into files, one for each line, each named:
[secondvalue]_file.txt
For an example:
SoloName9_file.txt
Is it possible to split the file in this fashion using a command, or will I need to write a shell script? If the former, what's the command?
Thank you!
Here's one approach. Use the sed command to turn this file into a valid shell script that you can then execute.
sed -e 's/^/echo /g' -e 's/,/ >/g' -e 's/$/_file.txt/g' <your.textfile >your.sh
chmod +x your.sh
./your.sh
Note that trailing whitespace in the file would take some additional work.
Writing it into a shell script file gives you a chance to review it, but you can also execute it as a single line.
sed -e 's/^/echo /g' -e 's/,/ >/g' -e 's/$/_file.txt/g' <your.textfile | sh
I have one file (for example: test.txt), this file contains some lines and for example one line is: abcd=11
But it can be for example: abcd=12
Number is different but abcd= is the same in all case, so could anybody give me command for finding this line and remove it?
I have tried: sed -e \"/$abcd=/d\" /test.txt >/test.txt but it removes all lines from my file and I also have tried: sed -e \"/$abcd=/d\" /test.txt >/testNew.txt but it doesn't delete line from test.txt, it only creates new file (testNew.txt) and in this file it removes my line. But it is not what I want.
Based on your description in your text, here is a cleaned-up version of your sed script that should work.
Assuming a linux GNU sed
sed -i '/abcd=/d' /test.txt
If you're using OS-X, then you need
sed -i "" '/abcd=/d' /test.txt
If these don't work, then use old-school sed with a conditional mv to manage your tmpfiles.
sed '/abcd=/d' /test.txt > test.txt.$$ && /bin/mv test.txt.$$ test.txt
Notes:
Not sure why you're doing \"/$abcd=/d\", you don't need to escape " chars unless you're doing more with this code than you indicate (like using eval). Just write it as "/$abcd=/d".
Normally you don't need '-e'
If you really want to use '$abcd, then you need to give it a value AND as you're matching the string 'abcd=', then you can do
abcd='abcd='
sed -i "/${abcd}/d" /test.txt
I hope this helps.
Here's a solution using grep:
$ grep -v '^\$abcd=' test.txt
Proof of concept:
$ cat test.txt
a
b
ab
ac
$abcd=1
$abcd=2
$abcd
ab
a
$abcd=3
x
$ grep -v '^\$abcd=' test.txt
a
b
ab
ac
$abcd
ab
a
x
As far as I know, this command can be used to create some other file with the deleted lines. Now that we have another file we can rename that file and delete the original file if we want.
You will just have to do this
grep -v '^\$abcd=' test.txt > tmp.txt
now tmp.txt will have contents
a
b
ab
ac
$abcd
ab
a
x
If you want you may rename this to test.txt after deleting test.txt
I need to find the encoding of all files that are placed in a directory. Is there a way to find the encoding used?
The file command is not able to do this.
The encoding that is of interest to me is ISO 8859-1. If the encoding is anything else, I want to move the file to another directory.
It sounds like you're looking for enca. It can guess and even convert between encodings. Just look at the man page.
Or, failing that, use file -i (Linux) or file -I (OS X). That will output MIME-type information for the file, which will also include the character-set encoding. I found a man-page for it, too :)
file -bi <file name>
If you like to do this for a bunch of files
for f in `find | egrep -v Eliminate`; do echo "$f" ' -- ' `file -bi "$f"` ; done
uchardet - An encoding detector library ported from Mozilla.
Usage:
~> uchardet file.java
UTF-8
Various Linux distributions (Debian, Ubuntu, openSUSE, Pacman, etc.) provide binaries.
In Debian you can also use: encguess:
$ encguess test.txt
test.txt US-ASCII
As it is a perl script, it can be installed on most systems, by installing perl or the script as standalone, in case perl has already been installed.
$ dpkg -S /usr/bin/encguess
perl: /usr/bin/encguess
Here is an example script using file -I and iconv which works on Mac OS X.
For your question, you need to use mv instead of iconv:
#!/bin/bash
# 2016-02-08
# check encoding and convert files
for f in *.java
do
encoding=`file -I $f | cut -f 2 -d";" | cut -f 2 -d=`
case $encoding in
iso-8859-1)
iconv -f iso8859-1 -t utf-8 $f > $f.utf8
mv $f.utf8 $f
;;
esac
done
To convert encoding from ISO 8859-1 to ASCII:
iconv -f ISO_8859-1 -t ASCII filename.txt
It is really hard to determine if it is ISO 8859-1. If you have a text with only 7-bit characters that could also be ISO 8859-1, but you don't know. If you have 8-bit characters then the upper region characters exist in order encodings as well. Therefore you would have to use a dictionary to get a better guess which word it is and determine from there which letter it must be. Finally, if you detect that it might be UTF-8 then you are sure it is not ISO 8859-1.
Encoding is one of the hardest things to do, because you never know if nothing is telling you.
With Python, you can use the chardet module.
With this command:
for f in `find .`; do echo `file -i "$f"`; done
you can list all files in a directory and subdirectories and the corresponding encoding.
If files have a space in the name, use:
IFS=$'\n'
for f in `find .`; do echo `file -i "$f"`; done
Remember it'll change your current Bash session interpreter for "spaces".
In PHP you can check it like below:
Specifying the encoding list explicitly:
php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), 'UTF-8, ASCII, JIS, EUC-JP, SJIS, iso-8859-1') . PHP_EOL;"
More accurate "mb_list_encodings":
php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings()) . PHP_EOL;"
Here in the first example, you can see that I used a list of encodings (detect list order) that might be matching.
To have a more accurate result, you can use all possible encodings via: mb_list_encodings()
Note the mb_* functions require php-mbstring:
apt-get install php-mbstring
This is not something you can do in a foolproof way. One possibility would be to examine every character in the file to ensure that it doesn't contain any characters in the ranges 0x00 - 0x1f or 0x7f -0x9f but, as I said, this may be true for any number of files, including at least one other variant of ISO 8859.
Another possibility is to look for specific words in the file in all of the languages supported and see if you can find them.
So, for example, find the equivalent of the English "and", "but", "to", "of" and so on in all the supported languages of ISO 8859-1 and see if they have a large number of occurrences within the file.
I'm not talking about literal translation such as:
English French
------- ------
of de, du
and et
the le, la, les
although that's possible. I'm talking about common words in the target language (for all I know, Icelandic has no word for "and" - you'd probably have to use their word for "fish" [sorry that's a little stereotypical. I didn't mean any offense, just illustrating a point]).
I know you're interested in a more general answer, but what's good in ASCII is usually good in other encodings. Here is a Python one-liner to determine if standard input is ASCII. (I'm pretty sure this works in Python 2, but I've only tested it on Python 3.)
python -c 'from sys import exit,stdin;exit()if 128>max(c for l in open(stdin.fileno(),"b") for c in l) else exit("Not ASCII")' < myfile.txt
If you're talking about XML files (ISO-8859-1), the XML declaration inside them specifies the encoding: <?xml version="1.0" encoding="ISO-8859-1" ?>
So, you can use regular expressions (e.g., with Perl) to check every file for such specification.
More information can be found here: How to Determine Text File Encoding.
I am using the following script to
Find all files that match FILTER with SRC_ENCODING
Create a backup of them
Convert them to DST_ENCODING
(optional) Remove the backups
#!/bin/bash -xe
SRC_ENCODING="iso-8859-1"
DST_ENCODING="utf-8"
FILTER="*.java"
echo "Find all files that match the encoding $SRC_ENCODING and filter $FILTER"
FOUND_FILES=$(find . -iname "$FILTER" -exec file -i {} \; | grep "$SRC_ENCODING" | grep -Eo '^.*\.java')
for FILE in $FOUND_FILES ; do
ORIGINAL_FILE="$FILE.$SRC_ENCODING.bkp"
echo "Backup original file to $ORIGINAL_FILE"
mv "$FILE" "$ORIGINAL_FILE"
echo "converting $FILE from $SRC_ENCODING to $DST_ENCODING"
iconv -f "$SRC_ENCODING" -t "$DST_ENCODING" "$ORIGINAL_FILE" -o "$FILE"
done
echo "Deleting backups"
find . -iname "*.$SRC_ENCODING.bkp" -exec rm {} \;
I was working in a project that requires cross-platform support and I encounter many problems related with the file encoding.
I made this script to convert all to utf-8:
#!/bin/bash
## Retrieve the encoding of files and convert them
for f `find "$1" -regextype posix-egrep -regex ".*\.(cpp|h)$"`; do
echo "file: $f"
## Reads the entire file and get the enconding
bytes_to_scan=$(wc -c < $f)
encoding=`file -b --mime-encoding -P bytes=$bytes_to_scan $f`
case $encoding in
iso-8859-1 | euc-kr)
iconv -f euc-kr -t utf-8 $f > $f.utf8
mv $f.utf8 $f
;;
esac
done
I used a hack to read the entire file and estimate the file encoding using file -b --mime-encoding -P bytes=$bytes_to_scan $f
You can extract encoding of a single file with the file command. I have a sample.html file with:
$ file sample.html
sample.html: HTML document, UTF-8 Unicode text, with very long lines
$ file -b sample.html
HTML document, UTF-8 Unicode text, with very long lines
$ file -bi sample.html
text/html; charset=utf-8
$ file -bi sample.html | awk -F'=' '{print $2 }'
utf-8
In Cygwin, this looks like it works for me:
find -type f -name "<FILENAME_GLOB>" | while read <VAR>; do (file -i "$<VAR>"); done
Example:
find -type f -name "*.txt" | while read file; do (file -i "$file"); done
You could pipe that to AWK and create an iconv command to convert everything to UTF-8, from any source encoding supported by iconv.
Example:
find -type f -name "*.txt" | while read file; do (file -i "$file"); done | awk -F[:=] '{print "iconv -f "$3" -t utf8 \""$1"\" > \""$1"_utf8\""}' | bash
With Perl, use Encode::Detect.