I have a problem where I need to store (and later fairly quickly retrieve) large volumes of sensor data generated at high sampling frequencies (in the kHz range). Currently I'm using the HDF5 file format, where a single file contains roughly 10 minutes of data, and the files are stored on a file server. Most of the time the data doesn't vary much, and in those intervals the range of values is quite limited, so there is a lot of room for compression.
However, since I know next to nothing about current data science trends, I wonder if there is a more efficient solution for storage and retrieval. I often need to do some analysis on the data (mainly using Python), so reasonably quick retrieval is a must. I tried looking for a solution, but I only found cases where sensors generate a single value every few seconds or minutes, so I'm not sure those proposed solutions would work here, since the volumes are indeed quite large.
Edit:
To clarify a bit: my issue with the current HDF5 solution is that I often need to process the data as a stream, so in practice I have to open one file, crawl through it, close it, open the next one, and so on. I think this could be solved more elegantly if the data were stored in some sort of database. Another thing is that since I have multiple sensors and therefore multiple files/directories, the data is spread out across the file server, so I would really welcome a more centralized approach. Again, I can imagine some database here, but I'm not sure whether databases are built for this and can offer the same quick response times and compression.
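For concreteness, the current setup looks roughly like this (a minimal h5py sketch; the file pattern and the "signal" dataset name are made up):

```python
import glob
import h5py

# Writing: one file per ~10-minute window, chunked + gzip so the quiet
# intervals with a narrow value range compress well.
def write_window(path, samples):
    with h5py.File(path, "w") as f:
        f.create_dataset(
            "signal",              # hypothetical dataset name
            data=samples,
            chunks=True,           # let h5py pick a chunk shape; tune if needed
            compression="gzip",
            compression_opts=4,
        )

# Reading: stream over many files block by block instead of loading each
# file fully into memory.
def stream_blocks(pattern="sensor_A/*.h5", block=1_000_000):
    for path in sorted(glob.glob(pattern)):
        with h5py.File(path, "r") as f:
            ds = f["signal"]
            for start in range(0, len(ds), block):
                yield ds[start:start + block]

for block in stream_blocks():
    ...  # analysis goes here
```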
A client's system will connect to our system via an API for a data pull. For now this data will be stored in a data mart, with roughly 50,000 records per request.
I would like to know the most efficient way of delivering the payload which originates in a SQL Azure database.
The API will be RESTful. After a request is received, I was thinking the payload would be retrieved from the database, converted to JSON, and GZIP-encoded/transferred over HTTP back to the client.
I'm concerned about how much processing this may take with many clients connected, each pulling a lot of data.
Would it be best to just return the straight results in clear text to the client?
Suggestions welcome.
-- UPDATE --
To clarify, this is not a web client that is connecting. The connection is made by another application to receive a one-time, daily data dump, so no pagination.
The data consists primarily of text with one binary field.
First of all: do not optimize prematurely! That means: don't sacrifice the simplicity and maintainability of your code for gains you don't even know you need.
Let's see. 50,000 records doesn't really tell us anything without knowing the size of a record. I would advise you to start from a basic implementation and optimize only when needed. So try this:
1. Implement a simple JSON response with those 50,000 records and call it from the consumer app. Measure the size of the data and the response time, and evaluate carefully whether this is really a problem for a once-a-day operation.
2. If yes, turn on compression for that JSON response. This is usually a HUGE change with JSON because of all the repetitive text (see the size comparison sketched below). One tip here: set the content type header to "application/javascript" - Azure has dynamic compression enabled by default for this content type. Again, try it and evaluate whether the size of the data or the response time is still a problem.
3. If it is still a problem, maybe it is time for some serialization optimization after all, but I would strongly recommend something standard and proven here (no custom CSV mess), for example Google Protocol Buffers: https://code.google.com/p/protobuf-net/
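To get a feel for points 1 and 2, here is a quick back-of-the-envelope check (a Python sketch with a made-up record shape, not your actual SQL Azure data) of how much the repetitive JSON keys compress:

```python
import gzip
import json

# Hypothetical record shape; the real payload comes from the data mart.
records = [
    {"id": i, "name": f"customer-{i}", "status": "active", "note": "x" * 40}
    for i in range(50_000)
]

raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw JSON:     {len(raw) / 1e6:.1f} MB")
print(f"gzipped JSON: {len(compressed) / 1e6:.1f} MB")
```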
This is a bit long for a comment, so ...
The best method may well be one of those "it depends" answers.
Is just the database on Azure, or is your entire hosting on Azure? I've never done any production work on Azure myself.
What are you trying to optimize for -- total round-trip response time, total server CPU time, or perhaps something else?
For example, if your database server is on Azure but your web server is local, perhaps you can simply optimize the database request and rely on scaling out via multiple web servers if needed.
If the data changes with each request, you should never compress it if you are optimizing for server CPU load, but you should compress it if you are optimizing for bandwidth usage -- either one can be your bottleneck / expensive resource.
For 50K records, even JSON might be a bit verbose. If your data is a single table, you might see significant savings by using something like CSV (including the first row as a header, for a sanity check if nothing else). If your result comes from joining multiple tables, i.e. it is hierarchical, JSON would be recommended simply to avoid the complexity of rolling your own hierarchical representation.
Are you using SSL on your web server? If so, SSL could be your bottleneck (unless it is handled by dedicated hardware).
What is the nature of the data you are sending? Is it mostly text, numbers, images? Text usually compresses well, numbers less so, and images poorly (usually). Since you suggest JSON, I would expect you have little if any binary data, though.
If you compress JSON, it can be a very efficient format, since the repeated field names mostly compress out of your result. XML likewise (but less so, since the tags come in pairs).
ADDED
If you know beforehand what the client will be fetching and can prepare the packet data in advance, by all means do so (unless storing the prepared data is an issue). You could run this at off-peak hours, create it as a static .gz file, and let IIS serve it directly when needed. Your API could then simply be in 2 parts: 1) retrieve a list of static .gz files available to the client, 2) confirm processing of said files so you can delete them.
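For illustration, the off-peak preparation step could be something like this (a Python sketch only, since your stack is .NET/IIS; fetch_records and the output directory are hypothetical):

```python
import gzip
import json
from datetime import date

def prepare_daily_dump(client_id, fetch_records, out_dir="exports"):
    """Run off-peak: render the client's daily records to a static .gz file
    that the web server can serve directly."""
    records = fetch_records(client_id)            # hypothetical data-mart query
    payload = json.dumps(records).encode("utf-8")
    path = f"{out_dir}/{client_id}-{date.today().isoformat()}.json.gz"
    with gzip.open(path, "wb") as f:
        f.write(payload)
    return path  # the "list available files" endpoint would return these
```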
Presumably you know that JSON and XML are not as fragile as CSV, i.e. adding or deleting fields in your API is usually simple. So, if you can compress the files, you should definitely use JSON or XML -- XML is easier for some clients to parse, and to be honest, if you use Json.NET or similar tools you can generate either one from the same set of definitions and information, so it is nice to be flexible. Personally, I like Json.NET quite a lot; it's simple and fast.
Normally what happens with such large requests is pagination, so included in the JSON response is a URL to request the next lot of information.
Now the next question is: what is your client? e.g. a browser or a behind-the-scenes application.
If it is a browser there are limitations as shown here:
http://www.ziggytech.net/technology/web-development/how-big-is-too-big-for-json/
If it is an application, then your current approach of 50,000 records in a single JSON call would be acceptable; the only thing you need to watch here is the load on the DB from pulling the records, especially if you have many clients.
If you are willing to use a third-party library, you can try Heavy-HTTP which solves this problem out of the box. (I'm the author of the library)
I need to run regularly scheduled tasks that fetch relatively large XML documents (5 MB) and process them.
A problem I am currently experiencing is that I hit the memory limits of the application instance, and the instance is terminated while my task is running.
I did some rough measurements:
The task is usually scheduled into an instance that already uses 40-50 MB of memory.
Fetching the 5 MB text file over URL increases instance memory usage to 65-75 MB.
Decoding the fetched text into Unicode increases memory usage to 95-105 MB.
Passing the Unicode string to the lxml parser and accessing its root node increases instance memory usage to about 120-150 MB.
During the actual processing of the document (converting XML nodes to datastore models, etc.) the instance is terminated.
I could avoid the 3rd step and save some memory by passing the encoded text directly to the lxml parser, but specifying the encoding for the lxml parser has given me some problems on GAE.
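For reference, skipping the decoding step would look roughly like this (assuming the document's encoding is known, e.g. UTF-8; this is the part that has been problematic for me on GAE):

```python
from lxml import etree

def parse_bytes(raw_bytes, encoding="utf-8"):
    # Hand the fetched bytes straight to lxml and tell the parser which
    # encoding to assume, instead of decoding to unicode first.
    parser = etree.XMLParser(encoding=encoding)
    return etree.fromstring(raw_bytes, parser=parser)
```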
I could probably use the MapReduce library for this job, but is it really worthwhile for a 5 MB file?
Another option could be to split the task into several tasks.
Also, I could probably save the file to the blobstore and then process it by reading it line by line from the blobstore? As a side note, it would be convenient if the UrlFetch service allowed reading the response "on demand", to simplify processing of large documents.
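If I went the blobstore route, an incremental parse with lxml's iterparse might look roughly like this (a sketch only; BlobReader is file-like, and the "record" tag and the handler are placeholders):

```python
from lxml import etree
from google.appengine.ext import blobstore

def process_blob(blob_key, handle):
    # BlobReader is file-like, so lxml can parse the document incrementally
    # instead of holding the whole tree in memory.
    reader = blobstore.BlobReader(blob_key)
    for _, elem in etree.iterparse(reader, events=("end",), tag="record"):
        handle(elem)   # e.g. convert the element to a datastore model
        elem.clear()   # free the element once it has been processed
        while elem.getprevious() is not None:
            del elem.getparent()[0]
```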
So, generally speaking, what is the most convenient way to perform this kind of work?
Thank you!
Is this on a frontend or a backend instance? Looks like a job for a backend instance to me.
Have you considered using different instance types?
frontend types
backend types
I'm writing a web application that needs to store data sent from one client, wait for another client to request and read it (at short intervals, like 3 or 4 seconds), and then remove this data.
Currently I'm doing this by saving the data to flat files, but I'd like to know whether it would be more efficient to write it to a database.
I know that it's usually more efficient to use a database, but in this case I'll be handling a lot of requests with small amounts of data in them.
Thanks in advance and sorry about my English :)
I agree with David's comment above. The question is how much I/O will you incur for each read/write. That can be affected by a lot of factors. I'm guessing the flat file option will be fastest, especially if your database is on a remote server and the data has to be sent over your internal network to read and write it.
Depending on how much data you have and how many requests you are handling, the fastest I/O would be to hold the data in memory. Of course, this is not very fault tolerant -- but that is also another consideration. The DB would provide you better integrity (over using flat files) in the event of a failure -- but if that is not a consideration, you may want to just keep it in memory.
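If fault tolerance really isn't a concern, "keep it in memory" can be as simple as something like this (a single-process Python sketch; the class and the TTL are just illustrative):

```python
import threading
import time

class TransientStore:
    """Hold small payloads in memory until the other client picks them up
    (or they expire). No durability: a process restart loses everything."""

    def __init__(self, ttl_seconds=30):
        self._data = {}
        self._lock = threading.Lock()
        self._ttl = ttl_seconds

    def put(self, key, value):
        with self._lock:
            self._data[key] = (value, time.monotonic())

    def take(self, key):
        # Read and remove in one step, matching the question's workflow.
        with self._lock:
            value, stored_at = self._data.pop(key, (None, 0.0))
        if value is not None and time.monotonic() - stored_at > self._ttl:
            return None  # expired before the reader showed up
        return value
```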
I've got an interesting one: I need to marshal the download of files, many of them gigabytes in size.
I have a silverlight website that allows the upload of large volumes of data (Gigs) using the following plugin: http://silverlightuploader.codeplex.com/
However, I also want to allow users to download the same data. But I want to restrict the number of concurrent downloads. Thus the idea of directly controlling a stream of data to the client via Silverlight is compelling, as I don't want to install anything directly on the machine.
My question is: for the volume of data I am looking at retrieving, is it appropriate to use the WebClient class (I can specify how many bytes into the HTTP stream I want to read, so I can download incrementally, put some business rules around it to check how many people are currently downloading, and make it wait until the user count has gone down...), or should I use sockets to keep the HTTP overhead down?
Unless there is a project I haven't found which does exactly this!
Cheers in advance,
Matt
As long as you download the data in chunks of some smaller size, the actual volume of the total file won't matter, and it won't really matter what you use to do the downloading. For example, for a file of that size I would just use the WebClient class and download chunks of maybe 1 or 2 MB at a time to a temporary storage file on disk. You'll have to keep track of how much you've downloaded and where the next chunk needs to start from, but that isn't an overly difficult problem. You can use sockets, but then you have to talk to the web server yourself to get access to the file in the first place.
When a client connects to download the next chunk, that is where you would enforce your business logic concerning the number of concurrent users. There might be a library you can use to do all this but to be honest it's not a complex problem.
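To make the chunking idea concrete in a language-neutral way, here is a rough Python sketch of the Range-request loop (it assumes the server honors Range headers; in Silverlight you would do the equivalent with WebClient):

```python
import urllib.error
import urllib.request

def download_in_chunks(url, out_path, chunk_size=2 * 1024 * 1024):
    # Fetch the file in ~2 MB pieces, tracking the offset, so the server can
    # apply its concurrency rules on every chunk request.
    offset = 0
    with open(out_path, "wb") as out:
        while True:
            req = urllib.request.Request(url)
            req.add_header("Range", f"bytes={offset}-{offset + chunk_size - 1}")
            try:
                with urllib.request.urlopen(req) as resp:
                    data = resp.read()
            except urllib.error.HTTPError as err:
                if err.code == 416:  # requested range is past the end: done
                    break
                raise
            if not data:
                break
            out.write(data)
            offset += len(data)
            if len(data) < chunk_size:
                break  # short read means this was the last chunk
```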