Amazon S3: Multipart Upload (aws.typepad.com)
69 points by abraham on Nov 10, 2010 | 26 comments


Not to be confused with the "multipart/form-data" encoding type for form-based file uploads. This is not an implementation of that protocol for S3.


Because we love it when corporations come up with their own proprietary implementations of existing open protocols with similar if not superior functionality.


Apples and oranges. multipart/form-data is used for sending a set of form information, possibly including files, all together. This announcement is that S3 will now allow you to upload a file in pieces so that you don't lose everything when an upload is interrupted.


I don't understand your sarcasm. I think you misunderstand which of these protocols already exist, and they both seem pretty darn open.


Based on the docs, it appears that this also allows you to upload segments of files without knowing the final number of segments or the final file size.

This will be pretty damn useful for piping the output of some process generating a large file (i.e. video transcoding) and beginning the upload before the file has been fully created.
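
Something like this sketch, assuming boto's multipart API (the bucket name and producer command are hypothetical), could stream a process's output straight into S3 without knowing the final size in advance:

    import subprocess
    from cStringIO import StringIO
    from boto.s3.connection import S3Connection

    PART_SIZE = 5 * 1024 * 1024  # S3's minimum for every part except the last

    conn = S3Connection()  # credentials come from the environment/boto config
    bucket = conn.get_bucket('my-bucket')
    mp = bucket.initiate_multipart_upload('renders/output.mp4.gz')

    # Start the long-running producer and upload its output as it appears.
    proc = subprocess.Popen(['gzip', '-c', 'render-output.mp4'],
                            stdout=subprocess.PIPE)
    part_num = 1
    while True:
        chunk = proc.stdout.read(PART_SIZE)  # blocks until 5MB or EOF
        if not chunk:
            break
        mp.upload_part_from_file(StringIO(chunk), part_num)
        part_num += 1
    mp.complete_upload()  # S3 stitches the parts into one object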


This will be pretty damn useful for piping the output of some process generating a large file (i.e. video transcoding) and beginning the upload before the file has been fully created.

Even better: You can split a video into pieces, transcode each part on a different EC2 node, and upload the parts directly from those respective nodes.
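
As a sketch of how one worker node's side could look, assuming a boto-style client and that a coordinator has already initiated the upload and shared its ID (all names here are hypothetical):

    from boto.s3.connection import S3Connection
    from boto.s3.multipart import MultiPartUpload

    conn = S3Connection()
    bucket = conn.get_bucket('video-bucket')

    # Rebuild a handle to the shared upload from its ID; nothing is
    # transferred until this node sends its part.
    mp = MultiPartUpload(bucket)
    mp.key_name = 'output/movie.mp4'  # must match the key the coordinator used
    mp.id = 'UPLOAD_ID_FROM_COORDINATOR'

    # This node transcoded segment 3, so it uploads part number 3; S3
    # reassembles parts in part-number order when the upload completes.
    mp.upload_part_from_file(open('segment-03.mp4', 'rb'), 3)

Once every node has sent its part, the coordinator alone calls complete_upload().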


Won't work with all codecs, but the same concept can be applied in a lot of areas!


That's an interesting idea. The only problem I can see is that every segment needs to be at least 5MB, so you'd probably need to buffer an extra segment compared to the "naive" implementation to ensure the last segment is big enough.


The last segment can be any size, so the final part you send could be as small as 1 byte. All the other parts need to be 5MB or larger.
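
So a naive chunker works as-is; a minimal Python sketch, where only the 5MB minimum comes from the S3 docs:

    PART_SIZE = 5 * 1024 * 1024  # required minimum for every part but the last

    def iter_parts(stream, part_size=PART_SIZE):
        # Yield fixed-size parts; the final one may be any size, down
        # to a single byte, and S3 will still accept it.
        while True:
            chunk = stream.read(part_size)
            if not chunk:
                return
            yield chunk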


Limitations of the TCP/IP protocol make it very difficult for a single application to saturate a network connection.

I'm just curious, but exactly which "limitations" are those? I can believe that parallel connections help in practice (especially when fetching small objects), but for large objects, I find it surprising you can't get reasonably close to saturating a single network connection with a modern TCP stack (e.g., using TCP window scaling).


It's pretty much impossible to saturate even a LAN connection with a single TCP connection. There are a number of issues at play here — RTT (Round Trip Time, i.e. ping/latency), window sizes, packet loss and initcwnd (TCP's initial window).

The combination of the limitations imposed by the speed of light and TCP's windowing system means that you are buggered transferring large files over high-latency TCP connections. I haven't checked their figures, but here's a TCP rate calculator I just found which lets you tune the different parameters: http://osn.fx.net.nz/LFN/

The greater the delay, the bigger the impact. For example, if we take a standard Windows XP machine and plug in the values for a standard Gigabit LAN (typically 0.2ms latency between hosts) we get a maximum speed of 700Mbit/sec, but if we try it between two hosts, one of them in the USA (typically around 120ms), the maximum transfer rate falls to 1.17 Mbit/sec.
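
Those figures fall straight out of the one-window-per-round-trip ceiling (throughput ≈ window / RTT). A quick sanity check in Python, assuming XP's commonly cited 17,520-byte default receive window (12 x 1460-byte segments):

    WINDOW = 17520  # bytes: Windows XP's default receive window over Ethernet

    for label, rtt in [('Gigabit LAN, 0.2 ms', 0.0002),
                       ('USA round trip, 120 ms', 0.120)]:
        mbit = WINDOW * 8 / rtt / 1e6
        print('%s: %.2f Mbit/sec' % (label, mbit))

    # prints ~700.80 and ~1.17 Mbit/sec, matching the figures above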

There are a number of attempts trying to fix TCP's failings in this regard. For starters, see this presentation by the Google/SPDY guys making a case for changing TCP Slow Start: http://www.chromium.org/spdy/An_Argument_For_Changing_TCP_Sl... — here's the IETF Draft for increasing TCP's Initial Window http://tools.ietf.org/html/draft-hkchu-tcpm-initcwnd

And, more radically, see the uTP work by the Bittorrent folk who are trying to create a better alternative to TCP instead of simply fixing it — http://en.wikipedia.org/wiki/Micro_Transport_Protocol + https://github.com/bittorrent/libutp (source code).

Anyways, sorry for not going into too much detail (it's 4am), but hope I was able to clear things up a little.


There are a number of issues at play here — RTT (Round Trip Time, i.e. ping/latency), window sizes, packet loss and initcwnd (TCP's initial window).

Initial window size: not relevant AFAICS; I'm not talking about connection startup behavior.

RTT, Window size: if the bandwidth-delay product is large, obviously you need a large window size (>>65K). Thankfully, recent TCP stacks support TCP window scaling.

Packet loss: you need relatively large buffers (by the standards of traditional TCP) and a sane scheme for recovering from packet loss (e.g., SACK), but I don't see why this is a show stopper on modern TCP stacks.
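
To put a number on the window-size point: the window has to cover the bandwidth-delay product, which blows past the unscaled 64KB cap very quickly (the figures below are illustrative, not from the thread):

    # Bandwidth-delay product: bytes that must be in flight to keep a path full.
    link_bps = 100e6  # a 100 Mbit/sec path
    rtt = 0.120       # 120 ms round trip
    bdp = link_bps / 8 * rtt
    print('window needed: %.2f MB' % (bdp / 2 ** 20))
    # ~1.43 MB, roughly 22x the 64KB limit without RFC 1323 window scaling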

I'm not super familiar with the SPDY work, but from what I recall, it primarily addresses connection startup behavior, rather than steady-state behavior.


In theory you should be right, and yet in practice it seems to be a problem. Here's a recent exchange that offers some real life numbers: http://serverfault.com/questions/111813/what-is-the-best-way...


This is very good - uploading large files is a PITA. Now all we need is a Flash client we can add to a website, and we'll have a reliable way for website users to upload huge files.


Did you really just use the words "Flash client" and "reliable" in the same sentence?


As fashionable as it is to deride Flash, the vast majority of users find it quite stable. I believe a good Flash developer could produce a solid upload client for this service. I also believe that every developer on HN would use such a client without batting an eye.


I also believe that every developer on HN would use such a client without batting an eye.

Well, except maybe people on ipads....


Granted. Mobile devices are handicapped in terms of embeddable functionality, but that isn't limited to Flash. Java applets, the only obvious alternative for this use case, are in the same boat. I think that if mobile users want to upload >100MB files over the air, it's fair to make them use an app.


Am I the only one who thinks it's strange that the AWS blog is hosted on typepad?


Perhaps, but I started the blog in November of 2004, long before EC2, S3, or any of the other services had been released. It was a clean and simple way to get a blog up and running and I've never had a compelling reason to go through the trouble to move it to another host.

Here's the first post that I wrote for the AWS blog: http://aws.typepad.com/aws/2004/11/welcome.html


I would assume that the AWS guys are all about comparative advantage. Anybody not familiar with the term will probably enjoy Wikipedia's surprisingly good explanation:

http://en.wikipedia.org/wiki/Comparative_advantage

tl;dr: The AWS people get higher marginal return on their investment of time and money by hosting with TypePad and putting the time they save into making AWS better. Probably.


I did not find this in the description for the service, but I'm wondering what happens if you have a crash or power failure while doing a multi-part upload and don't have the 'upload-id' stored anywhere.

First of all, the storage for the data already uploaded is reserved, and there is no way to release it, since you cannot abort the upload without the 'id'.

Second of all, there doesn't seem to be a way to list active multi-part uploads; you can only list the status of an upload for which you have the 'id'.

Any ideas?


This doc: http://docs.amazonwebservices.com/AmazonS3/latest/dev/iearch... says that there is an operation to list all in-progress multi-part uploads, as well as to list the parts uploaded for a given multi-part upload.
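
So recovery after a crash should look something like this sketch, assuming boto's multipart calls (the bucket name is hypothetical):

    from boto.s3.connection import S3Connection

    conn = S3Connection()  # credentials from the environment/boto config
    bucket = conn.get_bucket('my-bucket')

    # ListMultipartUploads recovers the IDs of every in-progress upload,
    # including any orphaned by a crash or power failure...
    for mp in bucket.get_all_multipart_uploads():
        print('%s %s' % (mp.key_name, mp.id))
        mp.cancel_upload()  # ...and AbortMultipartUpload frees their parts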


What is an acceptable use case for this new feature?


Here's how I'm going to use it:

I run a service that processes S3 and CloudFront logs for people. Each S3 bucket generates between 200 and 1000 logfiles every day that need to be combined to form a full day's weblog.

Part of what my service does is re-upload that combined logfile back to the bucket in question, and since for large sites it can be upwards of 200MB zipped, it'd be nice to be able to upload it in little 5MB chunks that can be resent if anything goes wrong.
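
A sketch of that resend-on-failure loop, assuming boto's multipart API (the file and bucket names are made up):

    import time
    from cStringIO import StringIO
    from boto.s3.connection import S3Connection

    PART_SIZE = 5 * 1024 * 1024

    conn = S3Connection()
    bucket = conn.get_bucket('log-bucket')
    mp = bucket.initiate_multipart_upload('combined/2010-11-10.log.gz')

    f = open('combined.log.gz', 'rb')
    part_num = 1
    while True:
        data = f.read(PART_SIZE)
        if not data:
            break
        for attempt in range(3):
            try:
                # a failure here costs one 5MB chunk, not the whole 200MB
                mp.upload_part_from_file(StringIO(data), part_num)
                break
            except Exception:
                time.sleep(2 ** attempt)  # back off, then resend the chunk
        part_num += 1
    f.close()
    mp.complete_upload()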


Exactly the same use case here... :) Maybe Amazon should go a step further and let people get logs aggregated by a chosen time unit (hour, day, week), so we won't need to do a round trip just to join the files.



