Avro hadoop random access to files

Avro hadoop random access to files - file

I am wondering if Avro supports random access or queries. For example if I create a Avro file call it B.avro which contains 2 binary files X.png and Y.png, is it possible to directly access Y.png? without the need of iterating through the entire file, it would be great if there was a way to access the file contents directly using the the file key.
If not is there any other data structure that would allow me to do this in the hadoop environment sequenceFiles, HAR ? I am basically using Avro as a way to deal with a large number of small files in hadoop, but I would also like to query these files which makes it tough when storing them in larger collections.
Thanks.

I don't know if there is any OOTB feature that allows us to access value through its key. But random access to AVro data files is supported by public void seek(long position), provided by DataFileReader.
You might find MapFile useful. The class MapFile.Reader allows us to get the value for a named key.
Don't mind please if this is exactly not what you need.

Related

Is it possible to write/read metadata for a text file using Labview?

The situation
I use Labview 2012 on Windows 7
my test result data is written in text files. First, information about the test is written in the file (product type, test type, test conditions etc) and after that the logged data is written each second.
All data files are stored in folders, sorted to date and the names of the files contain some info about the test
I have years worth of data files and my search function now only works on the file names (opening each file to look for search terms costs too much time)
The goal
To write metadata (additional properties like Word files can have) with the text files so that I can implement a search function to quickly find the file that I need
I found here the way to write/read metadata for images, but I need it for text files or something similar.

You would need to be writing to data files that supports meta data to begin with (such as LabVIEW TDMS or datalog file formats). In a similar situation, I would simply use a separate file with the same name, but a different extension for example. Then you can index those file names, and if you want the data you just swap the meta data filename extension and you are good to go.

I would not bother with files and use database for results logging. It may be not what you wiling to do, but this is the ultimate solution for the search problem and it open a lot of data analytics possibilities.

The metadata in Word files is from a feature called "Alternative Data Streams" which is actually a function of NTFS. You can learn more about it here.
I can't say I've ever used this feature. I don't think there is a nice API for LabVIEW, but one could certainly be made. With some research you should be able to play around with this feature and see if it really makes finding files any easier. My understanding is that the data can be lost if transferred over the network or onto a non-NTFS thumbdrive.

What type of database for storing ML experiments

So I'm thinking to write some small piece of software, which to run/execute ML experiments on a cluster or arbitrary abstracted executor and then save them such that I can view them in real time efficiently. The executor software will have access for writing to the database and will push metrics live. Now, I have not worked too much with databases, thus I'm not sure what is the correct approach for this. Here is a description of what the system should store:
Each experiment will consist of a single piece of code/archive of code such that it can be executed on the remote machine. For now we will assume allow dependencies and etc are installed there. The code will accept command line arguments. The experiment also will consists of a YAML scheme defining the command line arguments. In the code byitself will specify what will be logged in (e.g. I will provide a library in the language for registering channels). Now in terms of logging, you can log numerical values, arrays, text, etc so quite a few types. Each channel will be allowed a single specification (e.g. 2 columns, first int iteration, second float error). The code will also provide special copy of parameters at the end of the experiments.
When one submit an experiments, it will need to provide its unique group name + parameters for execution. This will launch the experiment and log everything.
Implementing this for me is easiest to do with a flat file system. Each project will have a unique name. Each new experiment gets a unique id and folder inside the project. I can store the code there. Each channel gets a file, which for simplicity can be an csv delimeter, with a special schema file describing what type of values are stored there so I can load them there. The final parameters can also be copied in the folder.
However, because of the variety of ways I can do this, and the fact that this might require a separate "table" for each experiment, I have no idea if this is possible in any database systems? Additionally, maybe I'm overseeing something very obvious or maybe not, if you had any experience with this any suggestions/advices are most welcome. The main goal is at the end to be able to serve this to a web interface. Maybe noSQL could accommodate this maybe not (I don't know exactly how those work)?

The data for ML primarily would be unstructured data. That kind of data will not naturally fit into a RDBMS. Essentially a document database like mongodb is far better suited....for such cases.

How to version shapefiles

The program I work on has several shapefiles, with quite a few attributes. At the moment they are stored in our version control (Subversion) as compressed blobs (dbf.gz, shp.gz and shx.gz). This is how they are used by the program, but it's extremely inconvenient for versioning purposes. We get no information about changes to entries, or attributes - just that something, somewhere in the file has changed. No useful diff.
The DBF is the one that has the attributes. I was thinking maybe we could store it as CSV and then as part of the build process, convert it to DBF and do ??? (to be determined) to make it a valid shapefile, then make the zipped version as it currently uses.
Another approach might be to remove nearly all the attributes from the shapefile, store those in CSV/YAML/whatever (which can be versioned nicely), and either look them up by the shape IDs or try to attach them to our objects after they have been instantiated from shapefiles, something like that.
But maybe folks with more experience with shapefiles have better ideas?

The DBF you are referring to starting your second paragraph has the attributes. Why not dump out the table on a "per Shape" basis to an XML style file and use THAT for the subversion. If you are actually working within Visual Foxpro (which also uses DBF style files too), you could use the function CursorToXML() and just run that through a loop of distinct shapes and dump out to each respective XML file. Then, when reading it back in.... XMLToCursor() of the per file shape.

Fastest way to store/retrieve a dictionary - SQL, text file...?

I've got a text file of words and word frequencies. It's very large - theoretically we're talking millions of rows.
I just want to retrieve values from the file, and do it as quickly and efficiently as possible (for a web app, in Django).
My question is: what is the best way to store and retrieve the values? Should import them into SQL? Or keep the file and use grep? Or put them into a JSON dictionary...? Or some other way?
Would be very grateful for advice!

putting them in a json dictionary would be a bad idea unless you want to load the entire thing into memory when you search through it.
sql is basically built for this kind of thing, so i would use that. a file and grep would also work fine, but you wouldn't gain any benefits from indexing etc that sql would give you.

Evaluating HDF5: What limitations/features does HDF5 provide for modelling data?

We are in evaluating technologies that we'll use to store data that we gather during the analysis of C/C++ code. In the case of C++, the amount of data can be relatively large, ~20Mb per TU.
After reading the following SO answer it made me consider that HDF5 might be a suitable technology for us to use. I was wondering if people here could help me answer a few initial questions that I have:
Performance. The general usage for the data will be write once and read "several" times, similar to the lifetime of a '.o' file generated by a compiler. How does HDF5 compare against using something like an SQLite DB? Is that even a reasonable comparison to make?
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format. After reading the user guide I understand that HDF5 is similar to XML or a DB, in that information is associated with a tag/column and so a tool built to read an older structure will just ignore the fields that it is not concerned with? Is my understanding on this correct?
A significant chunk of the information that we wish to write out will be a tree type of structure: scope hierarchy, type hierarchy etc. Ideally we would model scopes as having parents, children etc. Is it possible to have one HDF5 object "point" to another? If not, is there a standard technique to solve this problem using HDF5? Or, as is required in a DB, do we need a unique key that would "link" one object to another with appropriate lookups when searching for the data?
Many thanks!

How does HDF5 compare against using something like an SQLite DB?
Is that even a reasonable comparison to make?
Sort of similar but not really. They're both structured files. SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.
They're both meant to be high performance.
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format.
If you store data in structured form, the data types of those structures are also stored in the HDF5 file. I'm a bit rusty as to how this works (e.g. if it includes innate backwards compatibility), but I do know that if you design your "reader" correctly it should be able to handle types that are changed in the future.
Is it possible to have one HDF5 object "point" to another?
Absolutely! You'll want to use attributes. Each object has one or more strings describing the path to reach that object. HDF5 groups are analogous to folders/directories, except that folders/directories are hierarchical = a unique path describes each one's location (in filesystems w/o hard links at least), whereas groups form a directed graph which can include cycles. I'm not sure whether you can store a "pointer" to an object directly as an attribute, but you can always store an absolute/relative path as a string attribute. (or anywhere else as a string; you could have lookup tables galore if you wanted.)

We produce HDF5 data on my project, but I don't directly deal with it usually. I can take a stab at the first two questions:
We use a write once, read many times model and the format seems to handle this well. I know a project that used to write both to an Oracle database and HDF5. Eventually they removed the Oracle output since performance suffered and no one was using it. Obviously, SQLite is not Oracle, but the HDF5 format was better suited for the task. Based on that one data point, a RDBMS may be better tuned for multiple inserts and updates.
The readers our customers use are robust when we add new data types. Some of the changes are anticipated, but we don't have to worry about breaking thing when adding more data fields. Our DBA recently wrote a Python program to read HDF5 data and populate KMZ files for visualization in Google Earth. Since it was a project he used to learn Python, I'd say it's not hard to build readers.
On the third question, I'll bow to Jason S's superior knowledge.
I'd say HDF5 is a completely reasonable choice, especially if you are already interested in it or plan to produce something for the scientific community.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight