JPG/PDF Files in Storm - distributed

I wanted to know whether it's possible to manipulate JPG files in Storm. Should we expect any issues if JPG or PDF files are transmitted from bolt to bolt? We are manipulating these files in very large volumes and need a distributed platform to keep up.
From my understanding, messages (and hopefully files) go into in-memory queues between bolts.
Has anyone tried to pass JPG or PDF files between bolts in Storm? Are there any limitations that would prevent this from working? If not Storm, can anyone recommend an appropriate platform?
Thank you for your help!

I have never tried this, but I did some experiments with large tuples which worked fine, so I would not expect any problems. As long as you can provide appropriate (de)serialization (best via Kryo), Storm does not care what the data is. To Storm, everything looks like a bunch of bytes anyway (except for key attributes that are used for fieldsGrouping).
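For illustration, a minimal sketch of that idea: a hypothetical BinaryDoc wrapper for the file bytes plus a custom Kryo serializer registered with Storm (the class name, fields, and methods below are made up, not part of any Storm API):

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.apache.storm.Config;

// Hypothetical wrapper for a binary document (JPG or PDF) flowing between bolts.
public class BinaryDoc {
    public String name;   // could serve as the key attribute for fieldsGrouping
    public byte[] bytes;  // raw file content

    public static class BinaryDocSerializer extends Serializer<BinaryDoc> {
        @Override
        public void write(Kryo kryo, Output output, BinaryDoc doc) {
            output.writeString(doc.name);
            output.writeInt(doc.bytes.length, true);
            output.writeBytes(doc.bytes);
        }

        @Override
        public BinaryDoc read(Kryo kryo, Input input, Class<BinaryDoc> type) {
            BinaryDoc doc = new BinaryDoc();
            doc.name = input.readString();
            int len = input.readInt(true);
            doc.bytes = input.readBytes(len);
            return doc;
        }
    }

    // Register the serializer so tuples carrying BinaryDoc can cross worker boundaries.
    public static Config configure(Config conf) {
        conf.registerSerialization(BinaryDoc.class, BinaryDocSerializer.class);
        return conf;
    }
}
```

A bolt would then simply emit the BinaryDoc (or a plain byte[] field, which Storm's default Kryo setup already handles) in its output tuple.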
You might also check out Apache Flink (disclaimer: I am a contributor).

Related

How to write to different files based on content for batch processing in Flink?

I am trying to process some files on HDFS and write the results back to HDFS too. The files are already prepared before the job starts. The thing is, I want to write to different paths and files based on the file content. I am aware that BucketingSink (doc here) is provided to achieve this in Flink streaming. However, it seems that DataSet does not have a similar API. I have found some Q&As on Stack Overflow (1, 2, 3). Now I think I have two options:
Use the Hadoop API: MultipleTextOutputFormat or MultipleOutputs;
Read the files as a stream and use BucketingSink.
My question is how to choose between them, or is there another solution? Any help is appreciated.
EDIT: This question may be a duplicate of this.
We faced the same problem. We too are surprised that DataSet does not support addSink().
I recommend not switching to streaming mode; you might give up some optimizations (e.g. memory pools) that are available in batch mode.
You will have to implement your own OutputFormat to do the bucketing. Instead of addSink(), you can extend OutputFormat[YOUR_RECORD] (or RichOutputFormat[YOUR_RECORD]), where you can still use a BucketAssigner[YOUR_RECORD, String] to open/write/close output streams (see the sketch below).
That's what we did and it's working great.
I hope Flink supports this in batch mode soon.
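A minimal sketch of such a custom RichOutputFormat, under the assumption that the bucket id can be derived from the record itself (MyRecord, its getBucket()/toLine() methods, and the base path are placeholders; a BucketAssigner could be plugged in where the bucket id is computed):

```java
import org.apache.flink.api.common.io.RichOutputFormat;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Routes each record to a per-bucket file under basePath/<bucket>/part-<subtask>.
public class BucketingTextOutputFormat extends RichOutputFormat<MyRecord> {
    private final String basePath;
    private transient int taskNumber;
    private transient Map<String, FSDataOutputStream> openStreams;

    public BucketingTextOutputFormat(String basePath) {
        this.basePath = basePath;
    }

    @Override
    public void configure(Configuration parameters) { }

    @Override
    public void open(int taskNumber, int numTasks) {
        this.taskNumber = taskNumber;
        this.openStreams = new HashMap<>();
    }

    @Override
    public void writeRecord(MyRecord record) throws IOException {
        String bucket = record.getBucket();  // e.g. derived from the file content
        FSDataOutputStream out = openStreams.get(bucket);
        if (out == null) {
            Path path = new Path(basePath + "/" + bucket + "/part-" + taskNumber);
            FileSystem fs = path.getFileSystem();
            out = fs.create(path, FileSystem.WriteMode.OVERWRITE);
            openStreams.put(bucket, out);
        }
        out.write((record.toLine() + "\n").getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() throws IOException {
        for (FSDataOutputStream out : openStreams.values()) {
            out.close();
        }
    }
}
```

It can then be wired in with something like dataSet.output(new BucketingTextOutputFormat("hdfs:///output")) in place of the missing addSink().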

How can i gather lots of files from one filetype?

I'm trying to fuzz some tools, but I need a huge number of .zip or .jpg files for that. I've tried crawlers like WebRipper, but it's not very effective (or I'm doing it wrong). Is there a better way to get lots of different files?
OK, on the off chance that someone else needs something like this:
In the end I used WebRipper, and instead of generating links to Google/Bing results with the "filetype" parameter, I just set some upload/freeware pages as the targeted rip job with the maximum link depth.
WebRipper might crash sometimes and it takes quite a while, but it works well enough.
A better solution would probably be to use the Google search API (e.g. the C# SearchAPI). Then extract the clean links from the results and download them asynchronously. Using the direct result links most likely won't work, because Google will block you after a few files ("unusual data transfer").
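As a rough illustration of the "extract clean links, then download asynchronously" step (sketched in Java rather than C#; the URL list is just a placeholder for whatever the search API returns):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class BulkDownloader {
    public static void main(String[] args) {
        // Placeholder URLs; in practice these would be the links extracted from search results.
        List<String> urls = List.of(
                "https://example.com/sample1.jpg",
                "https://example.com/sample2.zip");

        HttpClient client = HttpClient.newHttpClient();

        // Fire off all downloads asynchronously, saving each file under its own name.
        List<CompletableFuture<Path>> downloads = urls.stream()
                .map(url -> {
                    Path target = Paths.get(url.substring(url.lastIndexOf('/') + 1));
                    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
                    return client.sendAsync(request, HttpResponse.BodyHandlers.ofFile(target))
                                 .thenApply(HttpResponse::body);
                })
                .collect(Collectors.toList());

        // Block until every download has finished.
        downloads.forEach(CompletableFuture::join);
    }
}
```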

How to process multiple text files at a time to analysis using mapreduce in hadoop

I have lots of small files, say more than 50,000. I need to process all of these files in a single MapReduce job to generate some analysis based on the input files.
Please suggest a way to do this, and also let me know how to merge these small files into one big file in HDFS.
See this blog post from Cloudera explaining the problem with small files.
There is a project on GitHub named FileCrush which merges large numbers of small files. From the project's homepage:
Turn many small files into fewer larger ones. Also change from text to sequence and other compression options in one pass.
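If you prefer to roll your own merge step, a minimal sketch of packing small HDFS files into one SequenceFile (filename as key, raw bytes as value; the input and output paths are placeholders) might look like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class SmallFileMerger {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/data/small-files");   // placeholder
        Path mergedFile = new Path("/data/merged.seq");  // placeholder

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(mergedFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isFile()) {
                    byte[] content = readFully(fs.open(status.getPath()), conf);
                    // Key = original file name, value = its raw content.
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(content));
                }
            }
        }
    }

    private static byte[] readFully(InputStream in, Configuration conf) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        IOUtils.copyBytes(in, out, conf, true);
        return out.toByteArray();
    }
}
```

The resulting SequenceFile can then be fed to a single MapReduce job instead of 50,000 tiny inputs.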

Storing Serialized Video files to SQL Server

I am currently faced with the need to host 20 small video files for my website. I know I could just host them in a folder with my project, but I came across this article.
http://www.kindblad.com/2008/04/how-to-store-files-in-ms-sql-server.html
The thought of storing the files in the database had not occurred to me. My question is: would there be a performance increase or decrease in storing the files as binary data in the database versus just streaming the files? I like the idea of having the data in the database for portability and for controlling who gets access to the videos. Thanks in advance.
Unless you have a pressing need to store them in a database, I wouldn't, personally. You can still control who gets access to which files by using a handler to validate access to the file. One big problem with the method in that article is that it doesn't support reading a byte range, so if someone wants to seek to the middle of a video, for example, they would have to wait for the whole thing to download. You'd want it to be able to support the Range header, as described in this question.
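The linked question covers the ASP.NET specifics; purely as an illustration of what honoring the Range header involves, a rough servlet-style sketch (file path and content type are made-up placeholders) could look like this:

```java
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;

// Streams a video file and honors "Range: bytes=start-end" so players can seek.
public class VideoServlet extends HttpServlet {
    private static final String VIDEO_PATH = "/videos/sample.mp4"; // placeholder

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(VIDEO_PATH, "r")) {
            long length = file.length();
            long start = 0;
            long end = length - 1;

            String range = req.getHeader("Range"); // e.g. "bytes=1000-"
            if (range != null && range.startsWith("bytes=")) {
                String[] parts = range.substring(6).split("-", 2);
                if (!parts[0].isEmpty()) start = Long.parseLong(parts[0]);
                if (parts.length > 1 && !parts[1].isEmpty()) end = Long.parseLong(parts[1]);
                resp.setStatus(HttpServletResponse.SC_PARTIAL_CONTENT); // 206
                resp.setHeader("Content-Range", "bytes " + start + "-" + end + "/" + length);
            }

            resp.setContentType("video/mp4");
            resp.setHeader("Accept-Ranges", "bytes");
            resp.setContentLengthLong(end - start + 1);

            // Copy only the requested byte range to the response.
            file.seek(start);
            byte[] buffer = new byte[8192];
            long remaining = end - start + 1;
            OutputStream out = resp.getOutputStream();
            while (remaining > 0) {
                int read = file.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (read == -1) break;
                out.write(buffer, 0, read);
                remaining -= read;
            }
        }
    }
}
```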

Clarification needed about Where I Should Put (Store) My Core Data’s SQLite File?

Yes, I know. This question has already been answered in Where to store the Core Data file? and in Store coredata file outside of documents directory?.
@Kendall Helmstetter Gelner and @Matthias Bauch provided very good replies. I upvoted them.
Now my question is quite conceptual and I'll try to explain it.
From the Where You Should Put Your App’s Files section in the Apple documentation, I've read the following:
Handle support files — files your application downloads or generates and can recreate as needed — in one of two ways:
In iOS 5.0 and earlier, put support files in the /Library/Caches directory to prevent them from being backed up.
In iOS 5.0.1 and later, put support files in the /Library/Application Support directory and apply the com.apple.MobileBackup extended attribute to them. This attribute prevents the files from being backed up to iTunes or iCloud. If you have a large number of support files, you may store them in a custom subdirectory and apply the extended attribute to just the directory.
Apple says that you can handle support files in one of two ways depending on the installed iOS version. In my opinion (but maybe I'm wrong) a Core Data file is a support file, and so it falls into these categories.
That said, is the approach suggested by Matthias and Kendall still valid? In particular, if I create a directory, say Private, within the Library folder, does this directory remain hidden on both iOS 5 versions (5.0 and 5.0.1), or do I need to follow Apple's solution? If the latter, could you provide a sample or link?
Thank you in advance.
I would say that a Core Data file is not really a support file: unless you have some way to replicate the stored data, you would want it backed up.
The support files are more things like images, or databases that are only caches for a remote web site.
So, you could continue to place your Core Data databases where you like (though it should be under Application Support).
Recent addition as of Jan 2013: Apple has started treating pre-loaded Core Data stores that you copy from a bundle into a writable area as if they were support files, even if you also write user data into the same databases. The solution (from DTS) is to set the do-not-backup flag when you copy the databases into place, and then unset it once user data is written into the database.
If your CoreData store is purely a cache of downloaded network data, continue to make sure it goes someplace like Caches or has the Do Not Backup flag set.
