How to extract data from multiple files with Python?

I am new to Python, which is also my first programming language. I have a set of .txt files (academic papers), and I need to extract each paper's ID (e.g. ID: a1111111) and abstract (e.g. ABSTRACT: .....). I have no idea how to extract this data from multiple files across multiple folders. Thanks a lot!

So your question has two parts: reading files and accessing folders.
Reading files
The methods and objects used for reading files are covered in chapter 7 of Python's documentation:
http://docs.python.org/2/tutorial/inputoutput.html
The basic gist is that you use the open function to access files that are in the same directory:
f = open('stuff.txt', 'r')
where stuff.txt is the name of the file in the same directory as your Python file.
Calling print f.read() will display the text of the file as a string. Feel free to assign f.read() to a variable to capture the data:
>>> x = f.read()
>>> print x
This is the entire file.\n
It's best to read the documentation for all these methods, because there are subtleties. For example, calling f.read() once will return the entire file contents, but calling f.read() again will return an empty string, as the end of the file has been reached.
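If you do need to read the file again, f.seek(0) rewinds it to the start; a quick interactive sketch:
>>> f.read()
''
>>> f.seek(0)
>>> f.read()
'This is the entire file.\n'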
Accessing Folders
Can you explain how exactly you'd like to access folders? In this case, it would be much easier to just put all your files in the same directory where you are running your Python file.
However, the basic way to move around in Python is os.chdir(path), which is essentially cd'ing around. You must import os before you use it.
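Since your files are spread across multiple folders, os.walk is probably what you want: it visits every subdirectory for you. Here is a minimal sketch, assuming a top-level folder named papers and an ID line like ID: a1111111 (the folder name and the regex are made up; adjust them to your real data):
import os
import re

id_pattern = re.compile(r'ID:\s*(\w+)')   # made-up pattern; adjust to your files

for dirpath, dirnames, filenames in os.walk('papers'):   # visits every subfolder
    for name in filenames:
        if name.endswith('.txt'):
            f = open(os.path.join(dirpath, name), 'r')
            text = f.read()
            f.close()
            match = id_pattern.search(text)
            if match:
                print os.path.join(dirpath, name), match.group(1)
You'd extract the abstract the same way, with a second pattern anchored on the ABSTRACT: marker.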
Leave a comment if you'd like some more information

Related

Trouble with running words through text files and counting them

Python 3+
(The error traceback and code from the original post are omitted; the error was a UnicodeDecodeError raised when opening the wordlist files.)
I want the user to input some words; the program should then run each word through my two text files, and if a word exists in either of them, add 1 to the corresponding positive/negative count.
Thank you for your help :)
Seems like you have stumbled upon a decoding error when trying to open one of the input files in the wordlist function. It is usually hard to determine the encoding used for a particular file, so you could:
1. Try opening the file with a different encoding, such as ISO-8859-15:
def OpenFile():
    try:
        with open("My File.txt", mode="r", encoding="ISO-8859-15") as f:
            data = f.read()  # process My File here
    except UnicodeDecodeError:
        print("Something went wrong; try a different file encoding")
    else:
        return data  # everything was okay, return the required data
    finally:
        pass  # clean up here if needed
2. Look at modules that try to determine the correct encoding for a file, such as the chardet module.
Install the chardet module:
sudo pip3 install chardet
You can run it at the command line with your file as the argument to determine the encoding:
cd /path/to/File/
chardetect My\ File.txt
This should return the likely encoding for the given file.
3. You can use the chardet module inside your Python code; however, this is only recommended when you will be opening files you do not control in advance, e.g. on a client's computer where the user wants to open some arbitrary file. Re-opening the same file and re-detecting the encoding every time will make your program slow.
First of all, positive_count and negative_count should be integers, not lists. If you wish to count, appending 1 to a list isn't what you're trying to accomplish.
Second of all, the UnicodeDecodeError is there because the encoding of the underlying file is not UTF-8. Did you try utf-16 or utf-16-le? If you're using Windows, utf-16-le is probably the encoding used, unless you're dealing with legacy code pages, in which case guessing will be a nightmare.
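A hypothetical sketch of the counting loop with both fixes applied (the wordlist file names and the encoding are assumptions; use whatever matches your files):
positive_count = 0
negative_count = 0

# made-up file names; the encoding must match the actual files
positives = set(open("positive.txt", encoding="utf-16-le").read().split())
negatives = set(open("negative.txt", encoding="utf-16-le").read().split())

for word in input("Enter some words: ").split():
    if word in positives:
        positive_count += 1   # word found in the positive wordlist
    elif word in negatives:
        negative_count += 1   # word found in the negative wordlist

print(positive_count, negative_count)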

Deleting specific files from a directory

I am trying to delete all files from a directory apart from two (which will be erased, then re-written). One of these files, not to be deleted, contains the names of all files in the folder/directory (the other contains the number of files in the directory).
I believe there (possibly?) are 2 solutions:
Read the names of each file from the un-deleted file and delete them individually until only the final 2 files remain,
or...
Because all other files end in .txt, I could use some sort of filter that would only delete files with this ending.
Which of these 2 would be most efficient and how could it be done?
Any help would be appreciated.
You are going to end up deleting files one by one, regardless of which method you use. Any optimizations you make are going to be minuscule. Without actually timing your algorithms, I'd say they'd both take about the same amount of time (and this will vary from one computer to the next, based on CPU speed, HDD type, etc.). So, instead of debating that, I'll provide code for both the ways you've mentioned:
Method 1:
import os

def deleteAll(infilepath):
    with open(infilepath) as infile:
        for line in infile:
            os.remove(line.rstrip('\n'))   # strip the trailing newline before deleting
Method 2:
import os

def deleteAll():
    # the two files to keep; placeholder names for the name-list
    # file and the file-count file
    blacklist = set(['file_with_names', 'file_with_count'])
    for fname in (f for f in os.listdir() if f not in blacklist):
        os.remove(fname)

combine a binary file and a .txt file to a single file in python

I have a binary file (.bin) and a (.txt) file.
Using Python 3, is there any way to combine these two files into one file (WITHOUT using any compressor tool, if possible)?
And if I have to use a compressor, I want to do it with Python.
As an example, I have 'file.txt' and 'file.bin'; I want a library that takes these two and gives me one file, and can also un-merge the file.
Thank you
Just create a tar archive; a module that lets you accomplish this task is already bundled with CPython, and it's called tarfile.
More examples are in the tarfile documentation.
There are a lot of solutions for compressing! gzip or zlib would allow compression and decompression and could be a solution for your problem.
Example of how to gzip-compress an existing file, from http://docs.python.org:
import gzip
f_in = open('file.txt', 'rb')
f_out = gzip.open('file.txt.gz', 'wb')
f_out.writelines(f_in)
f_out.close()
f_in.close()
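Decompression works the same way in reverse (a short sketch):
import gzip
f_in = gzip.open('file.txt.gz', 'rb')   # gzip.open decompresses transparently
contents = f_in.read()
f_in.close()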
but also tarfile is a good solution!
Tar is the best solution if you want the result to stay binary.
If you want the output to be text, you can use base64 to transform the binary file into text data, then concatenate the two into one file (using some unique string, or another technique, to mark the point where they were merged).
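A sketch of that base64 approach (the separator string is made up; pick anything guaranteed not to occur in file.txt):
import base64

SEPARATOR = '\n===MERGE-BOUNDARY===\n'   # made-up marker; must not occur in file.txt

# merge: the text stays readable, the binary becomes base64 text
text = open('file.txt').read()
encoded = base64.b64encode(open('file.bin', 'rb').read()).decode('ascii')
with open('combined.txt', 'w') as out:
    out.write(text + SEPARATOR + encoded)

# un-merge: split on the marker and decode the binary half
text_part, encoded_part = open('combined.txt').read().split(SEPARATOR)
with open('file.bin', 'wb') as out:
    out.write(base64.b64decode(encoded_part))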

Reading a file directly from HDFS into a shell function

I have a shell function that is called from inside my map function. The shell function takes 2 parameters, an input file and an output file. Something like this:
$> unix-binary /pathin/input.txt /pathout/output.txt
The problem is that these input.txt files reside in HDFS and the output.txt files need to be written back to HDFS. Currently, I first copy the needed file to the local hard drive with fs.copyToLocalFile, call the unix binary, and then write output.txt back to HDFS with fs.copyFromLocalFile.
The problem with this approach is that it is not optimal: it involves a substantial amount of redundant reading and writing to the HDD, which slows down performance. So, my question is: how can I read the HDFS file directly as input and write the results directly to HDFS?
Obviously,
$> unix-binary hdfs://master:53410/pathin/input.txt hdfs://master:54310/pathout/output.txt
will not work. Is there any other way around this? Can I treat an HDFS file as a local file somehow?
I have access to the unix-binary source code written in C. Maybe changing the source code would help?
thanks
You can add the file to the DistributedCache and access it from the cache in your mapper. Call your shell function on the local file, write the output file to local disk, and then copy the local file to HDFS.
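If your job happens to be a Hadoop Streaming job, the mapper can be a small Python script; here is a hypothetical sketch of that "run locally, then copy back" step (it assumes the binary and input were shipped to the task's working directory via the distributed cache, and the paths are taken from the question):
import subprocess

# run the binary on the locally cached copy of the input;
# both paths here are local to the task's working directory
subprocess.check_call(['./unix-binary', 'input.txt', 'output.txt'])

# copy the local result back to HDFS, as suggested above
subprocess.check_call(['hadoop', 'fs', '-put', 'output.txt', '/pathout/output.txt'])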
However, operations such as calling shell functions or reading/writing files from within a mapper/reducer break the MapReduce paradigm. If you find yourself needing to perform such operations, MapReduce may not be the solution you're looking for. HDFS and MapReduce were designed to perform massive-scale batch processing on small numbers of extremely large files.
Since you have access to the unix-binary source code, your best option might be to implement the particular function(s) you want in Java. Feed the input files to your mapper and call the function from the mapper on the data, rather than working with files on HDFS/LocalFS.

C - Reading multiple files

Just had a general question about how to approach a certain problem I'm facing. I'm fairly new to C, so bear with me here. Say I have a folder with 1000+ text files; the files are not named in any kind of numbered order, but they are alphabetical. For my problem I have files of stock data, each file named after the company's respective ticker. I want to write a program that will open each file, read the data, find the historical low, compare it to the current price, calculate the percent change, and then print it.
Searching and calculating are not a problem; the problem is getting the program to go through and open each file. The only way I can see to attack this is to create a text file containing all of the ticker symbols, have the program read that into an array, and then run a loop that opens the first filename in the array, performs the calculations, prints the output, closes the file, and then loops back around to the second element (the next ticker symbol) in the array. This would be fairly simple to set up (I think), but I'd really like to avoid typing out over a thousand file names into a text file. Is there a better way to approach this? Not really asking for code (unless there is some amazing function in C that will do this for me ;) ), just some advice from more experienced C programmers.
Thanks :)
Edit: This is on Linux, sorry I forgot to mention that!
Under Linux/Unix (BSD, OS X, POSIX, etc.) you can use opendir / readdir to go through the directory structure. There is no need to generate static files that need updating when the file system already has the information you want. If you only want a subset of stocks at a given time, using glob would be quicker; there is also scandir.
I don't know what the Win32 (Windows / Platform SDK) functions are called if you are developing with Visual C++ as your C compiler; searching the MSDN Library should help you.
Assuming you're running on Linux...
ls /path/to/text/files > names.txt
is exactly what you want.
opendir() on Linux:
http://linux.die.net/man/3/opendir
Example:
http://snippets.dzone.com/posts/show/5734
In pseudocode it would look like this. I cannot give exact code, as I'm not 100% sure this is the correct approach...
for each directory entry
    scan the filename
    extract the ticker name from the filename
    open the file
    read the data
    create a record consisting of the filename, data.....
    close the file
    add the record to a list/array...
sort the list/array into alphabetical order based on the ticker name in the filename...
You could vary it slightly if you wish: scan the filenames in the directory entries and sort them first by building a record of the filenames, then go back to the start of the list/array and open each one individually, reading the data and putting it into the record.
Hope this helps,
best regards,
Tom.
There are no functions in standard C that have any notion of a "directory". You will need to use some kind of platform-specific function to do this. For some examples, take a look at this post from Cprogramming.com.
Personally, I prefer using the opendir()/readdir() approach as shown in the second example. It works natively under Linux and also on Windows if you are using Cygwin.
Approach 1) I would just have a specific directory in which I have ONLY these files containing the ticker data and nothing else. I would then use the C readdir API to list all files in the directory and iterate over each one performing the data processing that you require. Which ticker the file applies to is determined only by the filename.
Pros: Easy to code
Cons: It really depends where the files are stored and where they come from.
Approach 2) Change the file format so the ticker files start with a magic code identifying them as ticker files, plus a string containing the name. As before, use readdir to iterate through all files in the folder; open each file, check that the magic number is set, read the ticker name from the file, and process the data as before.
Pros: More flexible than before. Filename needn't reflect name of ticker
Cons: Harder to code, file format may be fixed.
but I'd really like to avoid typing out over a thousand file names into a text file. Is there a better way to approach this?
I have solved the exact same problem a while back, albeit for personal uses :)
What I did was use OS shell commands to generate a list of those files, redirect the output to a text file, and have my program run through them.
On UNIX, there's the handy glob function:
#include <glob.h>     /* glob, globfree */
#include <stdio.h>    /* printf */
#include <string.h>   /* memset */

glob_t results;
size_t i;
memset(&results, 0, sizeof(results));
glob("*.txt", 0, NULL, &results);   /* collect all .txt paths */
for (i = 0; i < results.gl_pathc; i++)
    printf("%s\n", results.gl_pathv[i]);
globfree(&results);                 /* release glob's memory */
On Linux or a related system, you could use the fts library, which is designed for traversing file hierarchies (man fts), or even something as simple as readdir.
If on Windows, you can use the Directory Management APIs; more specifically, the FindFirstFile function used with wildcards, in conjunction with FindNextFile.
