Trying to transfer the output data from multiple Python scripts to an SQLite3 database

I'm running 12 different Python scripts at the same time to screen thousands of data points each hour against a given set of criteria.
I would like to save the output data to a "master" CSV file, but figured it would be better to put the data into SQLite3 instead, as I would otherwise be overwhelmed with CSV files.
I am trying to transfer the output straight to SQLite3.
This is my script so far:
symbol = []
with open(r'c:\Users\Desktop\Results.csv') as f:  # raw string, so single backslashes suffice
    for line in f:
        symbol.append(line.strip())
# no explicit close needed: the with block closes the file

path_out = r'c:\Users\Desktop\Results.csv'
i = 0

While you can use a sqlite database concurrently, it might slow your processes down considerably.
Whenever a program wants to write to the database file, the whole database has to be locked:
When SQLite tries to access a file that is locked by another process, the default behavior is to return SQLITE_BUSY.
So you could end up with more than one of your twelve processes waiting for the database to become available because one of them is writing.
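As a hedged illustration (not code from the question), each of the twelve writer scripts could at least open the database with a generous busy timeout, so a writer waits for the lock instead of failing immediately; the database path, table and column names below are made up:

import sqlite3

# timeout=30 makes sqlite3 retry for up to 30 seconds while another
# process holds the write lock, instead of raising "database is locked"
conn = sqlite3.connect(r'c:\Users\Desktop\results.db', timeout=30)

with conn:  # the connection's context manager commits, or rolls back on error
    conn.execute("CREATE TABLE IF NOT EXISTS results (symbol TEXT, checked_at TEXT)")
    conn.execute(
        "INSERT INTO results (symbol, checked_at) VALUES (?, datetime('now'))",
        ("AAPL",))
conn.close()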
Basically, concurrent read/write access is what client/server databases like PostgreSQL are made for. It is not a primary use case for sqlite.
So having the twelve programs each write a separate CSV file and merging them later is probably not such a bad choice, in my opinion. It is much easier than setting up a PostgreSQL server, at any rate.
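If you do keep one CSV per process, the later merge step is short; a sketch, assuming the twelve scripts write their files into one directory (paths and file names are placeholders):

import csv
import glob

# append the rows of every per-process CSV to a single master file
with open(r'c:\Users\Desktop\master.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for path in sorted(glob.glob(r'c:\Users\Desktop\results\*.csv')):
        with open(path, newline='') as part:
            writer.writerows(csv.reader(part))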

Related

Make SAS export the code itself into an EXCEL sheet

Often I have to write individual SAS programs for various demands where the final output is an EXCEL file. For the sake of reproducibility I often copy and paste the SAS code into a separate EXCEL sheet -- so it is easy to get back to the point when there is a similar request or the need for modifications.
Now I am wondering if there exists a way to make SAS export the current code into an EXCEL sheet in an automated way. Such a task is probably easier if the SAS code has been saved as a physical file on the hard drive (so it does not exist only in working memory) -- so let us assume that this was indeed the case.
If the program has been saved as a file, then you can read it into a dataset (the same way you read any other file) and export that dataset to Excel. It's up to you how exactly you do both of those things, SAS gives you lots of options.
As far as telling which program is executing, it varies by how you're running things. If you are using DM SAS and the Enhanced Editor, you can use this code:
%let filename= %sysget(SAS_EXECFILEPATH)\%sysget(SAS_EXECFILENAME);
%put &=filename;
and now &filename. holds the current program's path and filename.
If you're running in batch, you have to use the environment variable SYSIN. Other modes have different methods, and it's not always possible (such as if you're in an embedded program in an EG project). See this doc page for more details.

Apache Spark: batch processing of files

I have directories and subdirectories set up on HDFS, and I'd like to pre-process all the files before loading them all at once into memory. I basically have big files (1 MB) that once processed will be more like 1 KB, and then I want to do sc.wholeTextFiles to get started with my analysis.
How do I loop on each file (*.xml) on my directories/subdirectories, do an operation (let's say for the example's sake, keep the first line), and then dump the result back to HDFS (new file, say .xmlr) ?
I'd recommend just using sc.wholeTextFiles and preprocessing the files with transformations, then saving them all back as a single compressed sequence file (you can refer to my guide to do so: http://0x0fff.com/spark-hdfs-integration/)
Another option might be to write a MapReduce job that processes a whole file at a time and saves it to a sequence file, as proposed before: https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/SmallFilesToSequenceFileConverter.java. It is the example described in the book 'Hadoop: The Definitive Guide'; take a look at it.
In both cases you would do almost the same thing: both Spark and Hadoop bring up a single process (a Spark task or a Hadoop mapper) to process these files, so in general both approaches work with the same logic. I'd recommend starting with the Spark one, as it is simpler to implement given that you already have a cluster with Spark.
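A minimal PySpark sketch of the first approach, using the question's example operation (keep only the first line of each XML file); the paths, app name and codec choice are assumptions:

from pyspark import SparkContext

sc = SparkContext(appName="PreprocessSmallFiles")

# (filename, content) pairs for every .xml file under the directory tree
files = sc.wholeTextFiles("hdfs:///data/input/*/*.xml")

# the example operation from the question: keep only the first line
processed = files.mapValues(lambda content: content.split("\n", 1)[0])

# write everything back as one compressed sequence file
processed.saveAsSequenceFile(
    "hdfs:///data/output/preprocessed",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")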

Tokyo Cabinet: .tcb.wal file created along with .tcb file; DB size does not decrease when deleting records

I am using Tokyo Cabinet's B+ tree API to create a lookup database. In a Linux environment I see a .tcb.wal file created along with the actual .tcb database file. The size of this file is 0. I wonder whether it's a lock file that is created to help synchronization. Also, when I delete records from the database, the size of the file does not decrease. Any reasons why it's behaving like that?
The extension .wal stands for Write Ahead Logging file. This file is only relevant if you use any transaction functions; most applications do not use these. (For details, search for "ahead" in the documentation.)
The file size does not change for every deletion for efficiency reasons. Similarly, if you create an empty database, it will reserve space for faster insertions.

Reading a file directly from HDFS into a shell function

I have a shell function that is called from inside my map function. The shell function takes 2 parameters -> an input file and an output file. Something like this
$> unix-binary /pathin/input.txt /pathout/output.txt
The problem is, that these input.txt files reside in HDFS and the output.txt files need to be written back to HDFS. Currently, I first copy the needed file with fs.copyToLocalFile into the local hard drive, call the unix binary and then write the output.txt back to HDFS with fs.copyFromLocalFile.
The problem with this approach is that, it is not optimal because it involves substantial amount of redundant reading and writing to HDD which slows down the performance. So, my question is, how I can read the HDFS file directly as an input and output the results directly to HDFS?
Obviously,
$> unix-binary hdfs://master:53410/pathin/input.txt hdfs://master:54310/pathout/output.txt
will not work. Is there any other way around this? Can I treat an HDFS file as a local file somehow?
I have access to the unix-binary source code written in C. Maybe changing the source code would help?
Thanks
You can add the file to the DistributedCache and access it from the mapper from the cache. Call your shell function on the local file and write the output file to local disk and then copy the local file to HDFS.
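This is not the DistributedCache/mapper code itself, but a hedged standalone sketch of the same copy-run-copy pattern, driven with the standard hadoop fs command line; the binary name and paths are the question's placeholders:

import os
import subprocess
import tempfile

def run_on_hdfs_file(hdfs_in, hdfs_out, binary="unix-binary"):
    with tempfile.TemporaryDirectory() as tmp:
        local_in = os.path.join(tmp, "input.txt")
        local_out = os.path.join(tmp, "output.txt")
        # stage the input from HDFS onto local disk
        subprocess.run(["hadoop", "fs", "-get", hdfs_in, local_in], check=True)
        # run the existing binary on the local copies
        subprocess.run([binary, local_in, local_out], check=True)
        # push the result back to HDFS (-f overwrites an existing file)
        subprocess.run(["hadoop", "fs", "-put", "-f", local_out, hdfs_out], check=True)

run_on_hdfs_file("hdfs://master:54310/pathin/input.txt",
                 "hdfs://master:54310/pathout/output.txt")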
However, operations such as calling shell functions, or reading/writing from within a mapper/reducer break the MapReduce paradigm. If you find yourself needing to perform such operations, MapReduce may not be the solution you're looking for. HDFS and MapReduce were designed to perform massive scale batch processing on small numbers of extremely large files.
Since you have access to unix-binary source code, your best option might be to implement the particular function(s) you want in java. Feed the input files to your mapper and call the function from the mapper on the data rather than working with files on HDFS/LocalFS.

Following multiple log files efficiently

I'm intending to create a programme that can permanently follow a large dynamic set of log files to copy their entries over to a database for easier near-realtime statistics. The log files are written by diverse daemons and applications, but the format of them is known so they can be parsed. Some of the daemons write logs into one file per day, like Apache's cronolog that creates files like access.20100928. Those files appear with each new day and may disappear when they're gzipped away the next day.
The target platform is an Ubuntu Server, 64 bit.
What would be the best approach to efficiently reading those log files?
I could think of scripting languages like PHP that either open the files themselves and read new data, or use system tools like tail -f to follow the logs, or other runtimes like Mono. Bash shell scripts probably aren't so well suited for parsing the log lines and inserting them into a database server (MySQL), not to mention easy configuration of my app.
If my programme will read the log files, I'd think it should stat() the file once in a second or so to get its size and open the file when it's grown. After reading the file (which should hopefully only return complete lines) it could call tell() to get the current position and next time directly seek() to the saved position to continue reading. (These are C function names, but actually I wouldn't want to do that in C. And Mono/.NET or PHP offer similar functions as well.)
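A rough Python sketch of that stat()/seek() bookkeeping (illustrative only; the log path is a placeholder):

import os
import time

positions = {}  # path -> byte offset already handed over

def poll_file(path):
    size = os.stat(path).st_size
    last = positions.get(path, 0)
    if size < last:           # file was truncated or replaced: start over
        last = 0
    if size == last:          # nothing new
        return []
    with open(path, "rb") as f:
        f.seek(last)
        data = f.read(size - last)
    # hand over only complete lines; a partial last line is re-read next time
    complete, sep, _partial = data.rpartition(b"\n")
    if not sep:
        return []
    positions[path] = last + len(complete) + 1
    return [line.decode("utf-8", "replace") for line in complete.split(b"\n")]

while True:
    for line in poll_file("/var/log/apache2/access.20100928"):
        pass  # parse the line and insert it into the database here
    time.sleep(1)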
Is that constant stat()ing of the files and subsequent opening and closing a problem? How would tail -f do that? Can I keep the files open and be notified about new data with something like select()? Or does it always return at the end of the file?
In case I'm blocked in some kind of select() or external tail, I'd need to interrupt that every 1, 2 minutes to scan for new or deleted files that shall (no longer) be followed. Resuming with tail -f then is probably not very reliable. That should work better with my own saved file positions.
Could I use some kind of inotify (file system notification) for that?
If you want to know how tail -f works, why not look at the source? In a nutshell, you don't need to periodically interrupt or constantly stat() to scan for changes to files or directories. That's what inotify does.
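For instance, with the third-party pyinotify binding (one of several inotify wrappers; the watched directory is a placeholder), getting notified about new, changed and deleted log files looks roughly like this:

import pyinotify

class LogDirHandler(pyinotify.ProcessEvent):
    def process_IN_CREATE(self, event):
        print("new file:", event.pathname)       # start following it
    def process_IN_MODIFY(self, event):
        print("data appended:", event.pathname)  # read the new lines
    def process_IN_DELETE(self, event):
        print("file removed:", event.pathname)   # stop following it

wm = pyinotify.WatchManager()
mask = pyinotify.IN_CREATE | pyinotify.IN_MODIFY | pyinotify.IN_DELETE
wm.add_watch("/var/log/myapp", mask)

# blocks and dispatches events to the handler as they arrive
pyinotify.Notifier(wm, LogDirHandler()).loop()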
