Silverlight Large File Downloader

I've got an interesting one: I need to marshal the download of files, many of them in the gigabyte range.
I have a Silverlight website that allows the upload of large volumes of data (gigabytes) using the following plugin: http://silverlightuploader.codeplex.com/
However, I also want to allow users to download the same data, and I want to restrict the number of concurrent downloads. Thus the idea of directly controlling a stream of data to the client via Silverlight is compelling, as I don't want to install anything on the machine.
My question is: for the volume of data I am looking at retrieving, is it appropriate to use the WebClient class (I can specify how many bytes into the HTTP stream I want to read, so I can download it incrementally and put some business rules around it that check how many people are currently downloading, and make it wait until the user count has gone down), or should I use sockets to keep the HTTP overhead down?
Unless there is a project I've not found which does exactly this thing!
Cheers in advance,
Matt

As long as you download the data in chunks of some smaller size, the total volume of the file won't matter, and it won't really matter what you use to do the downloading. For example, for a file of that size I would just use the WebClient class and download chunks of maybe 1 or 2 MB at a time to a temporary storage file on disk. You'll have to keep track of how much you've downloaded and where the next chunk needs to start from, but that isn't an overly difficult problem. You can use sockets, but then you have to talk to the web server yourself to get access to the file in the first place.
When a client connects to download the next chunk, that is where you would enforce your business logic concerning the number of concurrent users. There might be a library you can use to do all this, but to be honest it's not a complex problem.
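To make the chunking idea concrete, here is a minimal sketch (not Silverlight-specific; it uses TypeScript on Node 18+, and the URL, chunk size, and file path are placeholders) of downloading a large file incrementally with HTTP Range requests. Each chunk is a separate request, which is also the point where the server can enforce a concurrent-download limit.

```typescript
// Minimal sketch of chunked downloading with HTTP Range requests.
// A Silverlight client would apply the same idea with WebClient/HttpWebRequest.
import { createWriteStream } from "fs";

const CHUNK_SIZE = 2 * 1024 * 1024; // 2 MB per request (illustrative)

async function downloadInChunks(url: string, dest: string): Promise<void> {
  const head = await fetch(url, { method: "HEAD" });
  const total = Number(head.headers.get("content-length"));
  if (!Number.isFinite(total) || total <= 0) {
    throw new Error("server did not report Content-Length");
  }

  const out = createWriteStream(dest);
  for (let start = 0; start < total; start += CHUNK_SIZE) {
    const end = Math.min(start + CHUNK_SIZE - 1, total - 1);
    // The server can reject or delay this request to enforce
    // its "maximum concurrent downloaders" business rule.
    const res = await fetch(url, { headers: { Range: `bytes=${start}-${end}` } });
    if (res.status !== 206 && res.status !== 200) {
      throw new Error(`Chunk ${start}-${end} failed: HTTP ${res.status}`);
    }
    out.write(Buffer.from(await res.arrayBuffer()));
  }
  out.end();
}
```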

Related

What is the best way to handle big media files in MEAN-Stack applications?

I have a MEAN-Stack application and I store media files in an AWS S3 Bucket.
Currently I handle the media file upload by encoding the files in base64 and transferring them with a simple POST request per file, over the Node.js backend to the S3 bucket, and returning the reference link to the file afterwards.
That worked well for a time, but now some users upload bigger files that partly exceed the size cap of a POST call (I think that's 100 MB per call, so I capped it at 95 MB plus a 5 MB buffer for meta information).
This obviously exceeds the technical capabilities of the application, but even for media files below that size the upload takes a long time and there is no feedback to the user about the upload progress.
What would be the best way to handle big files in the MEAN + S3 Stack?
What Angular-side libraries would you suggest? Maybe for video file compression / type conversion (.mov is part of the problem), but also for user feedback.
Does it make sense to put a data stream through the Node.js server?
How would you handle the RAM cap? (Currently 512 MB per EC2 VM on which the Node server is hosted.)
Or what other solutions would you suggest?
A small foreword: read about AWS request-signing if you do not already know what it is. This allows your back-end to sign a hash of the parameters of AWS requests so that they can be called by the front end securely. You should actually use this with your existing GetObject requests so that you can control, track, and expire accesses.
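For illustration, a minimal sketch of presigning a GetObject request with the AWS SDK for JavaScript v3 (the region, bucket, and key are placeholders, and the one-hour expiry is just an example):

```typescript
// Sketch: the back end generates a time-limited, signed GetObject URL
// that the front end can then fetch directly from S3.
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "eu-central-1" }); // region is a placeholder

export async function signedDownloadUrl(bucket: string, key: string): Promise<string> {
  const command = new GetObjectCommand({ Bucket: bucket, Key: key });
  // URL expires after one hour; tune this to your access-control needs.
  return getSignedUrl(s3, command, { expiresIn: 3600 });
}
```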
What would be the best way to handle big files in the MEAN + S3 Stack?
Either by uploading directly from the client, or by streaming to a server as a multipart upload to AWS S3. Note that doing so through the client will require some work, as you must call CreateMultipartUpload, orchestrate the signing of multiple UploadPart requests on the server, and then call CompleteMultipartUpload.
Multipart upload limits are huge and can handle any scale with your choice of chunk size.
In Node.js this can actually be done much more easily than by handling each command yourself. See the @aws-sdk/lib-storage package, which wraps the upload in a transaction that handles errors and retries.
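A sketch of what that looks like, assuming the AWS SDK for JavaScript v3; the part size, queue size, and names are illustrative, not prescriptive:

```typescript
// Sketch: streaming a large incoming file to S3 as a multipart upload
// with @aws-sdk/lib-storage, which handles part sizing and retries.
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import { Readable } from "stream";

export async function uploadLargeFile(body: Readable, bucket: string, key: string) {
  const upload = new Upload({
    client: new S3Client({}),
    params: { Bucket: bucket, Key: key, Body: body },
    partSize: 10 * 1024 * 1024, // 10 MB parts (5 MB is the S3 minimum)
    queueSize: 4,               // number of parts uploaded in parallel
  });

  // Progress events can be forwarded to the Angular client for user feedback.
  upload.on("httpUploadProgress", (p) =>
    console.log(`${p.loaded ?? 0}/${p.total ?? "?"} bytes uploaded`)
  );

  return upload.done();
}
```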
What Angular-side libraries would you suggest? Maybe for video file compression / type conversion (.mov is part of the problem), but also for user feedback.
I don't know much about Angular, but I would recommend not doing processing of objects on the front end. A great (and likely cheap) way to accomplish this without a dedicated server is through AWS Lambda functions that trigger on object upload. See more about Lambda invocation here.
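As a rough illustration (the event wiring and the processing step are placeholders, not a complete pipeline), an S3-triggered Lambda handler in TypeScript might look like this:

```typescript
// Sketch of a Lambda handler triggered by s3:ObjectCreated events.
// The actual processing (e.g. video transcoding) is only indicated here.
import { S3Event } from "aws-lambda";

export const handler = async (event: S3Event): Promise<void> => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));
    console.log(`New object uploaded: s3://${bucket}/${key}`);
    // e.g. kick off a transcoding/conversion job here, then write the
    // result or a reference back to S3 / your database.
  }
};
```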
Does it make sense to put a data stream through the Node.js server?
It does to me, as I mentioned in the answer to question 1, but it's not the only way. Lambda functions are again a suitable alternative, as is request presigning. See an AWS blog about the issue here.
There also seems to be a way to post from the front end directly and control access through S3 policies.
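If you go that route, a hedged sketch using @aws-sdk/s3-presigned-post (bucket, key, size cap, and expiry are placeholders):

```typescript
// Sketch: the back end creates a presigned POST policy; the browser then
// uploads straight to S3 with an ordinary multipart/form-data POST.
import { S3Client } from "@aws-sdk/client-s3";
import { createPresignedPost } from "@aws-sdk/s3-presigned-post";

export async function presignBrowserUpload(bucket: string, key: string) {
  const { url, fields } = await createPresignedPost(new S3Client({}), {
    Bucket: bucket,
    Key: key,
    Conditions: [["content-length-range", 0, 5 * 1024 * 1024 * 1024]], // cap at 5 GB
    Expires: 600, // policy valid for 10 minutes
  });
  // Return { url, fields } to the Angular client, which posts the file
  // as form data together with the signed fields.
  return { url, fields };
}
```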
How would you handle the RAM cap? (Currently 512 MB per EC2 VM on which the Node server is hosted.)
Like all performance questions, the answer is to measure. Monitor the usage of your server in production and through tests. In addition, it's always good to run stress tests on important architecture: hammer your architecture (replicated in a development deployment) in emulation of the worst-case, high-volume usage.
What might be most beneficial in your case is to run not a single server but a cluster of servers with autoscaling and load balancing. In addition, containerization can help decouple physical server deploys from your application. Containers can also run on AWS Fargate, a serverless platform for containers, which means memory scaling can happen without much configuration change.
To focus this answer: for your purposes, Fargate or Lambda seem appropriate to provide a serverless architecture.
Or what other solutions would you suggest?
See above answers.

How to transfer rules and configuration to edge devices?

In our application we have a server that stores entities, along with their relations and processing rules, in a DB. Any number of clients (Raspberry Pis, gateways, Android apps) connect to that server.
I want to push configuration and processing rules to those clients, so that when they read some data they can process it on their own. This is to make the edge devices self-sufficient and avoid outages when the server or network is down.
How should I push/pull the configuration? I don't want to maintain databases on the clients and configure replication, because maintaining and patching databases across that many clients would be tough.
Is there a better alternative?
At the same time I have to push logs upstream to the server.
Thanks in advance.
I have been there. You need an on-device data store. For this range of embedded Linux, in order of growing development complexity:
Variables: Fast to change and retrieve, makes sense if the data fits in memory. Lost if the process ends.
Filesystem: Requires no special libraries, just read/write access somewhere. Workable if the data is small enough to fit in memory and does not change much during execution (read on startup when lacking network, write on update from server). If your data can be structured as a few object variables, you could write them to JSON files, and there is plenty of documentation on other file storage options for Android apps.
In-memory datastore like Redis: Lightweight dependency, can automate messaging and filesystem-stored backup. Provides a managed framework/hybrid of the previous two.
Lightweight databases, especially SQLite: Lightweight SQL database, stored in one file and popular with Android apps (probably already installed on many of the target devices). It could work for frequent changes on a larger block of data in a memory-constrained environment, but does not look like a great fit. It gets worse for anything heavier.
Redis replication is easy, but indiscriminate, so mainly sensible if your devices receive a changing but identical ruleset. Otherwise, in all these cases, the easiest transfer option may be to request and receive the whole configuration (GET a string, download a JSON file, etc.) and parse the received values.
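As a rough sketch of that last approach (the endpoint URL, cache path, and config shape are hypothetical), a client could fetch the whole ruleset when the network is up and fall back to a JSON file cached on the device otherwise:

```typescript
// Sketch: pull the whole ruleset from the server when the network is up,
// otherwise fall back to the last copy cached on the device's filesystem.
// CONFIG_URL and CACHE_PATH are placeholders.
import { readFile, writeFile } from "fs/promises";

const CONFIG_URL = "https://server.example.com/api/devices/42/config";
const CACHE_PATH = "/var/lib/myapp/config.json";

interface DeviceConfig {
  version: number;
  rules: unknown[];
}

export async function loadConfig(): Promise<DeviceConfig> {
  try {
    const res = await fetch(CONFIG_URL);
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    const config = (await res.json()) as DeviceConfig;
    await writeFile(CACHE_PATH, JSON.stringify(config)); // refresh local cache
    return config;
  } catch {
    // Server or network down: keep running on the last known-good config.
    return JSON.parse(await readFile(CACHE_PATH, "utf8")) as DeviceConfig;
  }
}
```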

Where does IPFS store all the data?

I've been trying to implement and understand the working of IPFS and have a few things that aren't clear.
Things I've tried:
Implemented IPFS on my system and stored files on it. Even if I delete the files from my system and close the ipfs daemon, I am still able to access the files from a different machine through IPFS.
I've noticed there's a .ipfs folder in my home directory that contains the part of blocks of data that I add to IPFS.
Questions:
Are the blocks stored locally on my system too?
Where else is the data stored? On other peers that I am connected to? Because I'm still able to access the file if I close my ipfs daemon.
If this is true, and data is stored in several places, is there still a possibility of losing my data if all the peers disconnect from the network?
Does every peer on the network store the entire file or just a part of the file?
If a copy of data is being distributed across the p2p network, it means the data is being duplicated multiple times? How is this efficient in terms of storage?
We store data uploaded by other peers too?
Minimum System requirements for running IPFS? We just need abundant storage, not necessarily a powerful system?
When you upload something, the file is chunked by ipfs and stored in your cache folder (.ipfs).
If you check the file's existence on another peer of the network (say the main gateway, ipfs.io), that peer requests the file from you and caches it too.
If later you switch off your daemon and you can still see the file on the gateway it's probably because the gateway or some other peer on the web still has it cached.
When a peer wants to download a file but is out of space (it can no longer cache), it trashes the oldest cached files to free space.
If you want to dive deep into the technology, check first these fundamentals:
how git works
decentralized hash tables (DHT)
kademlia
merkle trees
The latter should give you an idea of how the mechanism works more or less.
Now, let's answer point by point
Are the blocks stored locally on my system too?
Yes
Where else is the data stored? On other peers that I am connected to? Because I'm still able to access the file if I close my ipfs daemon.
All the peers that request your file cache it
If this is true, and data is stored in several places, is there still a possibility of losing my data if all the peers disconnect from the network?
You lose the file when it's no longer possible to reconstitute your file from all the peers that had a part of it cached (including yourself)
Does every peer on the network store the entire file or just a part of the file?
A peer can hold just a part of it. Imagine you are watching a movie and you stop, more or less, at the half-way point: that's it, you've cached just half of it.
If a copy of the data is being distributed across the p2p network, it means the data is being duplicated multiple times? How is this efficient in terms of storage?
When you watch a video on YouTube your browser caches it (that is replication too!). IPFS is more efficient in terms of traffic: say you switch off the browser and two minutes later you want to watch the video again. IPFS gets it from your cache; YouTube makes you download it again. There's also an interesting point about delta storage (related to git) and about where you get the data from (it could be inside your LAN, which means blazing fast), but I want to stick to the questions.
We store data uploaded by other peers too?
If you get data, you cache it so...
Minimum System requirements for running IPFS? We just need abundant storage, not necessarily a powerful system?
The main daemon is written in Go. Go is efficient, but not as efficient as C++, C, or Rust. Also, the tech is pretty young and will improve with time. The more space you have, the more you can cache; CPU power isn't THAT important.
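If you want to poke at this programmatically rather than through the CLI, here is a small sketch using the ipfs-http-client package against a local daemon (assumed to be listening on the default API port); it just adds a string and reads it back by CID:

```typescript
// Sketch: add content through a local IPFS daemon, then read it back by CID.
import { create } from "ipfs-http-client";

async function demo() {
  const ipfs = create({ url: "http://127.0.0.1:5001" });

  // The content is chunked, hashed and stored in your local repo (~/.ipfs).
  const { cid } = await ipfs.add("hello from my node");
  console.log("CID:", cid.toString());

  // Any peer (or gateway) that fetches this CID caches the blocks too.
  const chunks: Uint8Array[] = [];
  for await (const chunk of ipfs.cat(cid)) {
    chunks.push(chunk);
  }
  console.log(Buffer.concat(chunks).toString());
}

demo().catch(console.error);
```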
If you are interested in ways to store data in a p2p manner, here are some links to interesting projects:
https://filecoin.io/
https://storj.io/
https://maidsafe.net/
https://www.ethereum.org/ and its related storage layer
https://ethersphere.github.io/swarm-home/
Files are stored inside IPFS objects, which are up to 256 KB in size. An IPFS object can also contain links to other IPFS objects. Files larger than 256 KB, such as an image or a video, are split into multiple IPFS objects that are each 256 KB in size, and the system then creates an empty IPFS object that links to all the other pieces of the file. Each object gets hashed and given a unique content identifier (CID), which serves as a fingerprint. This makes it faster and easier to store the small pieces of your data on the network quickly.
Because IPFS uses content-based addressing, once something is added it cannot be changed. It is an immutable data store, much like a blockchain. IPFS can help deliver content in a way that can save you considerable money.
IPFS removes duplication across the network and tracks version history for every file. IPFS also provides high performance and clustered persistence.
IPFS supports versioning of your files. Say you want to share an important file with someone over IPFS. IPFS will create a new commit object; it is very basic, it just tells IPFS which commit went before it and links to the IPFS object of your file. Say that after a while you want to update the file: you just add the updated file to the IPFS network and the software creates a new commit object for it. This commit object now links to the previous commit, and this can go on endlessly. IPFS will make sure your file plus its entire history is accessible to the other nodes on the network.
The biggest problem with IPFS is keeping files available. Every node on the network keeps a cache of the files it has downloaded and helps to share them if other people need them. But if a specific file is hosted by 4 nodes and those nodes go offline, the file becomes unavailable and no one can grab a copy of it. There are two possible solutions to this problem.
Either we incentivize people to store files and make them available, or we proactively distribute files and make sure that there is always a certain number of copies available on the network. That is exactly what Filecoin intends to do. Filecoin is created by the same group of people that created IPFS. It is basically a blockchain built on top of IPFS that wants to create a decentralized market for storage: if you have some free space you can rent it out to others and make money in the process.
IPFS and the blockchain are a perfect fit. You can address large amounts of data with IPFS and place the immutable IPFS links into a blockchain transaction. This timestamps and secures your content without having to put the data on the chain itself.
Reference

What is a good way to send large data sets to a client through API requests?

A client's system will connect to our system via an API for a data pull. For now this data will be stored in a data mart, with, say, 50,000 records per request.
I would like to know the most efficient way of delivering the payload which originates in a SQL Azure database.
The API will be RESTful. After the request is received, I was thinking that the payload would be retrieved from the database, converted to JSON, and GZIP encoded/transferred over HTTP back to the client.
I'm concerned about the processing this may take with many clients connected, each pulling a lot of data.
Would it be best to just return the straight results in clear text to the client?
Suggestions welcome.
-- UPDATE --
To clarify, this is not a web client that is connecting. The connection is made by another application to receive a one-time, daily data dump, so no pagination.
The data consists primarily of text with one binary field.
First of all: do not optimize prematurely! That means: don't sacrifice the simplicity and maintainability of your code for a gain you don't even know you need.
Let's see. 50,000 records does not really say anything without specifying the size of a record. I would advise you to start from a basic implementation and optimize when needed. So try this:
Implement a simple JSON response with those 50,000 records and try to call it from the consumer app. Measure the size of the data and the response time, and evaluate carefully whether this is really a problem for a once-a-day operation.
If yes, turn on compression for that JSON response. This is usually a HUGE change with JSON because of all the repetitive text. One tip here: set the content type header to "application/javascript"; Azure has dynamic compression enabled by default for this content type. Again, try it and evaluate whether the size of the data or the response time is a problem.
If it is still a problem, maybe it is time for some serialization optimization after all, but I would strongly recommend something standard and proven here (no custom CSV mess), for example Google Protocol Buffers: https://code.google.com/p/protobuf-net/
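As a rough sketch of the compressed-JSON step (the question is about a SQL Azure / .NET stack, but the idea is framework-agnostic; Express and the compression middleware here are stand-ins, and the data-access call is a placeholder):

```typescript
// Sketch: serve the daily extract as gzip-compressed JSON.
// Express and the `compression` middleware stand in for whatever
// web framework the real (.NET/Azure) service uses.
import express from "express";
import compression from "compression";

const app = express();
app.use(compression()); // gzip/deflate negotiated via Accept-Encoding

app.get("/api/extract", async (_req, res) => {
  const rows = await fetchRowsFromDatabase(); // hypothetical data-access call
  res.json(rows);
});

async function fetchRowsFromDatabase(): Promise<unknown[]> {
  return []; // placeholder for the real SQL Azure query
}

app.listen(3000);
```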
This is a bit long for a comment, so ...
The best method may well be one of those "it depends" answers.
Is just the database on Azure, or is your entire hosting on Azure? I've never done any production work on Azure myself.
What are you trying to optimize for: total round-trip response time, total server CPU time, or perhaps something else?
For example, if your database server is on Azure but your web server is local, perhaps you can simply optimize the database request and depend on scaling via multiple web servers if needed.
If the data changes with each request, you should never compress it if you are trying to optimize server CPU load, but you should compress it if you are trying to optimize bandwidth usage; either one can be your bottleneck / expensive resource.
For 50K records, even JSON might be a bit verbose. If your data is a single table, you might see significant savings by using something like CSV (including the first row as a record header for a sanity check if nothing else). If your result comes from joining multiple tables, i.e., it is hierarchical, using JSON would be recommended simply to avoid the complexity of rolling your own hierarchical representation.
Are you using SSL on your web server? If so, SSL could be your bottleneck (unless this is handled via other hardware).
What is the nature of the data you are sending? Is it mostly text, numbers, or images? Text usually compresses well, numbers less so, and images poorly (usually). Since you suggest JSON, I would expect that you have little if any binary data.
If you compress JSON, it can be a very efficient format, since the repeated field names mostly compress out of your result. XML likewise (but less so, since the tags come in pairs).
ADDED
If you know what the client will be fetching beforehand and can prepare the packet data in advance, by all means do so (unless storing the prepared data is an issue). You could run this at off-peak hours, create it as a static .gz file, and let IIS serve it directly when needed. Your API could simply be in two parts: 1) retrieve a list of static .gz files available to the client; 2) confirm processing of said files so you can delete them.
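A small sketch of that pre-compression step (file names are placeholders; Node's zlib stands in for whatever runs in the real off-peak job):

```typescript
// Sketch: pre-compress the nightly extract to a static .gz file during
// off-peak hours so the web server can hand it out with no per-request work.
import { createReadStream, createWriteStream } from "fs";
import { createGzip } from "zlib";
import { pipeline } from "stream/promises";

export async function prepareDailyDump(jsonPath: string, gzPath: string): Promise<void> {
  await pipeline(
    createReadStream(jsonPath),   // e.g. extract-2024-01-01.json (placeholder name)
    createGzip({ level: 9 }),     // maximum compression; the CPU cost is paid off-peak
    createWriteStream(gzPath),    // e.g. extract-2024-01-01.json.gz
  );
}
```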
Presumably you know that JSON and XML are not as fragile as CSV, i.e., adding or deleting fields in your API is usually simple. So, if you can compress the files, you should definitely use JSON or XML: XML is easier for some clients to parse, and to be honest, if you use Json.NET or similar tools you can generate either one from the same set of definitions and information, so it is nice to be flexible. Personally, I like Json.NET quite a lot; it is simple and fast.
Normally what happens with such large requests is pagination, so included in the JSON response is a URL to request the next lot of information.
Now the next question is: what is your client? E.g. a browser or a behind-the-scenes application.
If it is a browser there are limitations as shown here:
http://www.ziggytech.net/technology/web-development/how-big-is-too-big-for-json/
If it is an application, then your current approach of 50,000 records in a single JSON call would be acceptable; the only thing you need to watch here is the load on the DB from pulling the records, especially if you have many clients.
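For reference, a paginated response in the style described above might look like the following sketch (the field names and URL format are illustrative, not a standard):

```typescript
// Sketch of a paginated response shape: each page carries a link to the next.
interface Page<T> {
  items: T[];
  next: string | null; // null on the last page
}

// Example (values are illustrative):
const page: Page<{ id: number }> = {
  items: [{ id: 1 } /* ... */],
  next: "/api/records?offset=50000&limit=50000",
};
```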
If you are willing to use a third-party library, you can try Heavy-HTTP which solves this problem out of the box. (I'm the author of the library)

What's the best way to send pictures to a browser when they have to be stored as blobs in a database?

I have an existing database containing some pictures in blob fields. For a web application I have to display them.
What's the best way to do that, considering stress on the server and maintenance and coding efforts.
I can think of the following:
"Cache" the blobs to external files and send the files to the browser.
Read them directly from the database every time they are requested.
Some additionals facts:
I cannot change the database and get rid of the blobs altogether and only save file references in the database (like in the good ol' Access days), because the database is used by another application which actually requires the blobs.
The images change rarely, i.e. if an image is in the database it mostly stays that way forever.
There'll be many read accesses to the pictures; 10-100 pictures will be displayed per view (depending on the user's settings).
The pictures are relatively small, < 500 KB.
I would suggest a combination of your two ideas: the first time an item is requested, read it from the database, but afterwards make sure it is cached by something like Squid so you don't have to retrieve it from the database every time it is requested.
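A minimal sketch of that flow (Express is used for illustration, the cache directory and content type are placeholders, and the database call is left as a stub):

```typescript
// Sketch: serve a picture from a disk cache, falling back to the database
// blob on the first request. The cache headers let browsers and proxies
// (e.g. Squid) avoid repeat fetches of rarely-changing images.
import express from "express";
import { readFile, writeFile } from "fs/promises";
import path from "path";

const app = express();
const CACHE_DIR = "/var/cache/pictures"; // placeholder location

async function loadBlobFromDatabase(id: string): Promise<Buffer> {
  return Buffer.from(""); // placeholder for the real blob query
}

app.get("/pictures/:id", async (req, res) => {
  const file = path.join(CACHE_DIR, `${req.params.id}.jpg`);
  let data: Buffer;
  try {
    data = await readFile(file);                       // cache hit
  } catch {
    data = await loadBlobFromDatabase(req.params.id);  // cache miss: hit the DB once
    await writeFile(file, data);
  }
  // Images rarely change, so a long max-age is reasonable here.
  res.set("Cache-Control", "public, max-age=86400").type("image/jpeg").send(data);
});

app.listen(3000);
```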
One important thing is to use proper HTTP cache control: set expiration dates properly and respond to HEAD requests correctly (not all platforms/web servers allow that).
Caching those blobs to the file system makes sense to me, even more so if the DB is running on another server. But even if not, I think a simple file access is a lot cheaper than piping the data through a local socket. If you cache the DB contents to the file system, you can most probably configure any web server to do good cache control for you. If it isn't the case already, you should maybe request a field indicating the last update of each image, to make your cache more efficient.
greetz
back2dos
