Chunked File Upload

Status: DRAFT
Created: 20. January 2013
Author: shgupta
JIRA: SLING-2707
References: - http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html
Updated: -

Use Cases:

1. Large file upload - With high-speed internet connections, the advent of cloud storage, and HD media going mainstream, Sling supports uploading large files (> 2GB).
2. Fault-tolerant uploads - Sling provides the capability to resume an upload from the point of failure, so the client doesn't have to restart the complete upload process.

Approach

Sling provides an extension to SlingPostServlet that accepts file chunks in accordance with a specified protocol. The Sling client slices the file into chunks and uploads them serially to the server. Each chunk has an "Offset" attribute that identifies the chunk's position in the complete file. Upon receiving the last chunk, SlingPostServlet stitches all chunks into a single file and saves it to the final destination.

If an upload fails, Sling lets the client query how much data was uploaded before the failure. The client then resumes the chunk upload from that point.
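
The client-side slicing described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name slice_chunks, the chunk_size parameter, and the start parameter (used to resume after a failure) are hypothetical and not part of the proposal.

```python
from typing import Iterator, Tuple


def slice_chunks(data: bytes, chunk_size: int, start: int = 0) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, chunk) pairs covering data[start:] in upload order.

    `start` is the offset to resume from; 0 means a fresh upload.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(start, len(data), chunk_size):
        # The final slice may be shorter than chunk_size.
        yield offset, data[offset:offset + chunk_size]
```

Calling slice_chunks(data, 400) on a 1000-byte file yields chunks at offsets 0, 400 and 800; passing start=400 skips the part that is already on the server.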

Content Model

Chunks are stored under the actual target path, on a node marked with the sling:chunkMixin mixin node type. The content model for storing chunks is defined as follows:

Content model to store chunk
// Node type to store a chunk
// sling:offset: offset of chunk in file
// jcr:data: binary of chunk
[sling:chunkNode] > nt:hierarchyNode
  primaryitem jcr:data
  - sling:offset  (long) mandatory
  - jcr:data (binary) mandatory

//-----------------------------------------------------------------------------
// Mixin node type to identify that a node has chunks
// sling:fileLength : length of complete file
// sling:chunksLength : cumulative length of all uploaded chunks
[sling:chunkMixin]
  mixin
  - sling:fileLength (long)
  - sling:chunksLength (long)
  + * (sling:chunkNode) multiple

A typical nt:file node during a chunked upload looks like this:

Typical nt:file node under chunked upload
/content/dam/folder/catalog.pdf [nt:file]
                                + jcr:content [nt:resource] [sling:chunkMixin]
                                    - jcr:data = empty until completed
                                    - sling:fileLength = 982145 // (filename@Length from client)
                                    - sling:chunksLength = 30000 // cumulative length of all uploaded chunks
                                    + chunk_0_9999 [sling:chunkNode]
                                            - sling:offset = 0
                                            - jcr:data [binary data]
                                    + chunk_10000_19999 [sling:chunkNode]
                                    + ....
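
The bookkeeping in this model can be illustrated with a small sketch. It assumes that chunk nodes are named chunk_<first byte>_<last byte>, which is inferred from the names in the server responses in this document (for example chunk_400_799 for a 400-byte chunk at offset 400) and is not stated as a rule by the proposal. The helper names are hypothetical.

```python
from typing import List, Tuple


def chunk_node_name(offset: int, size: int) -> str:
    """Name of the node holding `size` bytes that start at `offset` (inferred convention)."""
    return f"chunk_{offset}_{offset + size - 1}"


def chunks_length(chunks: List[Tuple[int, int]]) -> int:
    """Cumulative length of all uploaded chunks, given (offset, size) pairs."""
    return sum(size for _, size in chunks)
```

For three 10000-byte chunks, chunks_length returns 30000, which matches the sling:chunksLength value in the tree above.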

Protocol Specification

Upload chunk using POST

Chunk upload request

The client sends a POST request to the parent path to upload a binary chunk of the file. It passes $filename@Offset and $filename@Length as multipart request parameters. "Offset" indicates the chunk's data offset in the complete file. "Length" is an optional parameter that indicates the length of the complete file. If "Length" is given, Sling automatically determines whether the request carries the last chunk and, if so, stitches all chunks together and stores the result at the final destination.
[request]

First/Intermediate chunk upload request
POST /content/dam/folder HTTP/1.1
Authorization: Basic YWRtaW46YWRtaW4=
Transfer-Encoding: chunked
Content-Type: multipart/form-data; boundary=CbZDcL_DxJIVQqSG1WkYaIoLWqT3FGYCVe
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.1 (java 1.5)
Host: localhost:4502

--CbZDcL_DxJIVQqSG1WkYaIoLWqT3FGYCVe
Content-Disposition: form-data; name="catalog.pdf@Length"
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

1000
--CbZDcL_DxJIVQqSG1WkYaIoLWqT3FGYCVe
Content-Disposition: form-data; name="catalog.pdf@Offset"
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

400
--CbZDcL_DxJIVQqSG1WkYaIoLWqT3FGYCVe
Content-Disposition: form-data; name="catalog.pdf"; filename="catalog.pdf"
Content-Type: application/pdf
Content-Transfer-Encoding: binary

$binary_data
--CbZDcL_DxJIVQqSG1WkYaIoLWqT3FGYCVe--
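
A request body like the one above could be assembled as follows. This is a minimal sketch, not the client used to produce the transcript: the helper names are hypothetical, the file part is sent as application/octet-stream for simplicity, and the last-chunk test (offset plus chunk size equals the total length) is an assumption about how the "Length" check works.

```python
from typing import Dict, Optional


def chunk_fields(filename: str, offset: int, file_length: Optional[int] = None) -> Dict[str, str]:
    """Text parameters that accompany one chunk of `filename`."""
    fields = {f"{filename}@Offset": str(offset)}
    if file_length is not None:  # "Length" is optional in the protocol
        fields[f"{filename}@Length"] = str(file_length)
    return fields


def is_last_chunk(offset: int, chunk_size: int, file_length: int) -> bool:
    """Assumed rule: the chunk is the last one if it ends at the end of the file."""
    return offset + chunk_size == file_length


def encode_multipart(fields: Dict[str, str], filename: str, chunk: bytes, boundary: str) -> bytes:
    """Serialise the text fields and the binary chunk as multipart/form-data."""
    body = b""
    for name, value in fields.items():
        head = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        body += head.encode("utf-8") + value.encode("utf-8") + b"\r\n"
    head = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="{filename}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
    )
    body += head.encode("utf-8") + chunk + b"\r\n"
    body += f"--{boundary}--\r\n".encode("utf-8")
    return body
```

The body is then posted to the parent path with the header Content-Type: multipart/form-data; boundary=<boundary>. The boundary must not occur inside the chunk data.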

[response]
The response shows that the chunk is stored under the actual path, on the node carrying the sling:chunkMixin node type.

Chunk upload response
HTTP/1.1 200 OK
Connection: Keep-Alive
Server: Day-Servlet-Engine/4.1.42
Content-Type: text/html;charset=UTF-8
Date: Mon, 06 May 2013 14:42:22 GMT
Transfer-Encoding: chunked

<html>
<head>
    <title>Content modified /content/dam/folder</title>
</head>
    <body>
    <h1>Content modified /content/dam/folder</h1>
    <table>
        <tbody>
            <tr>
                <td>Status</td>
                <td><div id="Status">200</div></td>
            </tr>
            <tr>
                <td>Message</td>
                <td><div id="Message">OK</div></td>
            </tr>
            <tr>
                <td>Location</td>
                <td><a href="/content/dam/folder" id="Location">/content/dam/folder</a></td>
            </tr>
            <tr>
                <td>Parent Location</td>
                <td><a href="/" id="ParentLocation">/</a></td>
            </tr>
            <tr>
                <td>Path</td>
                <td><div id="Path">/content/dam/folder</div></td>
            </tr>
            <tr>
                <td>Referer</td>
                <td><a href="" id="Referer"></a></td>
            </tr>
            <tr>
                <td>ChangeLog</td>
                <td><div id="ChangeLog">&lt;pre&gt;modified("/content/dam/folder/catalog.pdf/jcr:content/jcr:lastModified");&lt;br/&gt;modified("/content/dam/folder/catalog.pdf/jcr:content/jcr:mimeType");
&lt;br/&gt;created("/content/dam/folder/catalog.pdf/jcr:content/chunk_400_799");&lt;br/&gt;modified("/content/dam/folder/catalog.pdf/jcr:content/chunk_400_799/jcr:data");&lt;br/&gt;
modified("/content/dam/folder/catalog.pdf/jcr:content/chunk_400_799/sling:offset");&lt;br/&gt;modified("/content/dam/folder/catalog.pdf/jcr:content/sling:chunksLength");&lt;br/&gt;&lt;/pre&gt;</div></td>
            </tr>
        </tbody>
    </table>
    <p><a href="">Go Back</a></p>
    <p><a href="/content/dam/folder">Modified Resource</a></p>
    <p><a href="/">Parent of Modified Resource</a></p>
    </body>
</html>

Chunk Upload in streaming use case

In the streaming use case, the file's length is not known in advance. The Sling client must set "$filename@Completed" to true to indicate that it has reached the end of the file and that the current chunk request is the last one.
[request]

Last chunk upload request
POST /content/dam/folder HTTP/1.1
Authorization: Basic YWRtaW46YWRtaW4=
Transfer-Encoding: chunked
Content-Type: multipart/form-data; boundary=lMaKIb2KPscWvPV8B0fULKkKayVtcxugD8Lt
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.1 (java 1.5)
Host: localhost:4502

--lMaKIb2KPscWvPV8B0fULKkKayVtcxugD8Lt
Content-Disposition: form-data; name="catalog.pdf@Completed"
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

true
--lMaKIb2KPscWvPV8B0fULKkKayVtcxugD8Lt
Content-Disposition: form-data; name="catalog.pdf@Offset"
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

400
--lMaKIb2KPscWvPV8B0fULKkKayVtcxugD8Lt
Content-Disposition: form-data; name="catalog.pdf"; filename="catalog.pdf"
Content-Type: application/pdf
Content-Transfer-Encoding: binary

$binary_data
--lMaKIb2KPscWvPV8B0fULKkKayVtcxugD8Lt--
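
When the length is unknown, the client only learns that a chunk is the last one once the stream is exhausted. One way to handle this, sketched below under stated assumptions, is to read one chunk ahead. The function name stream_chunks and its parameters are hypothetical; only the "@Offset" and "@Completed" parameter names come from the protocol.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def stream_chunks(stream: BinaryIO, filename: str, chunk_size: int) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (form fields, chunk bytes) per request; the last one carries @Completed=true."""
    offset = 0
    current = stream.read(chunk_size)
    while current:
        # Read ahead so we know whether `current` is the final chunk.
        following = stream.read(chunk_size)
        fields = {f"{filename}@Offset": str(offset)}
        if not following:
            fields[f"{filename}@Completed"] = "true"
        yield fields, current
        offset += len(current)
        current = following


if __name__ == "__main__":
    for fields, chunk in stream_chunks(io.BytesIO(b"x" * 500), "catalog.pdf", 200):
        print(fields, len(chunk))
```

An empty stream yields no requests at all in this sketch; how an empty file should be uploaded is not covered by the proposal.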

[response]
The response shows that the uploaded chunks were merged at the final destination and that the chunk upload metadata was deleted.

Last chunk upload response
HTTP/1.1 200 OK
Connection: Keep-Alive
Server: Day-Servlet-Engine/4.1.42
Content-Type: text/html;charset=UTF-8
Date: Mon, 06 May 2013 15:52:16 GMT
Transfer-Encoding: chunked

<html>
<head>
    <title>Content modified /content/dam/folder</title>
</head>
    <body>
    <h1>Content modified /content/dam/folder</h1>
    <table>
        <tbody>
            <tr>
                <td>Status</td>
                <td><div id="Status">200</div></td>
            </tr>
            <tr>
                <td>Message</td>
                <td><div id="Message">OK</div></td>
            </tr>
            <tr>
                <td>Location</td>
                <td><a href="/content/dam/folder" id="Location">/content/dam/folder</a></td>
            </tr>
            <tr>
                <td>Parent Location</td>
                <td><a href="/" id="ParentLocation">/</a></td>
            </tr>
            <tr>
                <td>Path</td>
                <td><div id="Path">/content/dam/folder</div></td>
            </tr>
            <tr>
                <td>Referer</td>
                <td><a href="" id="Referer"></a></td>
            </tr>
            <tr>
                <td>ChangeLog</td>
                <td><div id="ChangeLog">&lt;pre&gt;modified("/content/dam/folder/catalog.pdf/jcr:content/jcr:lastModified");&lt;br/&gt;modified("/content/dam/folder/catalog.pdf/jcr:content/jcr:mimeType");
&lt;br/&gt;modified("/content/dam/folder/catalog.pdf/jcr:content/jcr:data");&lt;br/&gt;deleted("/content/dam/folder/catalog.pdf/jcr:content/chunk_0_199");
&lt;br/&gt;deleted("/content/dam/folder/catalog.pdf/jcr:content/chunk_200_399");&lt;br/&gt;deleted("/content/dam/folder/catalog.pdf/jcr:content/sling:chunksLength");
&lt;br/&gt;&lt;/pre&gt;</div></td>
            </tr>
        </tbody>
    </table>
    <p><a href="">Go Back</a></p>
    <p><a href="/content/dam/folder">Modified Resource</a></p>
    <p><a href="/">Parent of Modified Resource</a></p>
    </body>
</html>

Query Sling about the interrupted chunk upload status

The client sends a GET request for the file being uploaded to retrieve the chunk upload status.
[request]

Query interrupted chunk upload request
GET /content/dam/folder/catalog.pdf.3.json HTTP/1.1
Authorization: Basic YWRtaW46YWRtaW4=
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.1 (java 1.5)
Host: localhost:4502

[response]
The sling:chunksLength property indicates the cumulative length of all chunks stored in Sling. The Sling client should resume the upload at the offset given by the value of sling:chunksLength.

Query interrupted chunk upload response
{"jcr:createdBy":"admin","jcr:created":"Mon May 06 2013 21:31:22 GMT+0530","jcr:primaryType":"nt:file","jcr:content":{"jcr:lastModifiedBy":"admin",
"jcr:uuid":"845e9cee-f963-4f72-b115-fa021859c809",":jcr:data":0,"jcr:mixinTypes":["sling:chunkMixin"],"sling:chunksLength":200,"jcr:mimeType":"application/pdf",
"jcr:lastModified":"Mon May 06 2013 21:31:22 GMT+0530", "jcr:primaryType":"nt:resource","sling:fileLength":1700,"chunk_0_199":{"jcr:createdBy":"admin",":jcr:data":200,"sling:offset":0,"jcr:created":"Mon May 06 2013 21:31:22 GMT+0530","jcr:primaryType":"sling:chunkNode"}}}
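
A client could derive the resume offset from this status response roughly as follows. This is a sketch: the function name resume_offset is hypothetical, and treating a missing sling:chunkMixin mixin as "no upload in progress" is an assumption, not something the proposal specifies.

```python
import json
from typing import Optional


def resume_offset(status_json: str) -> Optional[int]:
    """Return the offset to resume from, or None if no chunk upload is in progress."""
    content = json.loads(status_json).get("jcr:content", {})
    if "sling:chunkMixin" not in content.get("jcr:mixinTypes", []):
        return None  # assumption: no mixin means nothing to resume
    return int(content.get("sling:chunksLength", 0))
```

For the response shown above, the function returns 200, so the next chunk is sent with an "Offset" of 200.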

Abort chunk upload

To abort a chunk upload, the Sling client passes the ":operation=delete" request parameter along with ":applyToChunks=true".

Abort incomplete chunk upload request
POST /content/dam/folder/catalog.pdf HTTP/1.1
Authorization: Basic YWRtaW46YWRtaW4=
Content-Length: 403
Content-Type: multipart/form-data; boundary=dDzF5u2n-HJu5tudkdVpFucFsmqcVV-CONtRqlL
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.1 (java 1.5)
Host: localhost:4502

--dDzF5u2n-HJu5tudkdVpFucFsmqcVV-CONtRqlL
Content-Disposition: form-data; name=":applyToChunks"
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

true
--dDzF5u2n-HJu5tudkdVpFucFsmqcVV-CONtRqlL
Content-Disposition: form-data; name=":operation"
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

delete
--dDzF5u2n-HJu5tudkdVpFucFsmqcVV-CONtRqlL--
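
For illustration, the abort request can be prepared with Python's standard library. The sketch below only builds the request object and does not send it. Two assumptions are made: the helper name build_abort_request is hypothetical, and the parameters are sent URL-encoded instead of as multipart/form-data as in the transcript above.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_abort_request(file_url: str) -> Request:
    """Prepare (but do not send) a POST that aborts the chunk upload at `file_url`."""
    body = urlencode({":operation": "delete", ":applyToChunks": "true"}).encode("ascii")
    return Request(
        file_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Authentication headers, such as the Basic credentials in the transcript, would have to be added before sending the request with urllib.request.urlopen.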

[response]
For a new resumable upload, Sling removes the nt:file node itself; for an upload to an existing nt:file node, it removes only the individual chunks.

Abort incomplete chunk upload response
HTTP/1.1 200 OK
Connection: Keep-Alive
Server: Day-Servlet-Engine/4.1.42
Content-Type: text/html;charset=UTF-8
Date: Mon, 06 May 2013 16:09:58 GMT
Transfer-Encoding: chunked

<html>
<head>
    <title>Content modified /content/dam/folder/catalog.pdf</title>
</head>
    <body>
    <h1>Content modified /content/dam/folder/catalog.pdf</h1>
    <table>
        <tbody>
            <tr>
                <td>Status</td>
                <td><div id="Status">200</div></td>
            </tr>
            <tr>
                <td>Message</td>
                <td><div id="Message">OK</div></td>
            </tr>
            <tr>
                <td>Location</td>
                <td><a href="/content/dam/folder/catalog.pdf" id="Location">/content/dam/folder/catalog.pdf</a></td>
            </tr>
            <tr>
                <td>Parent Location</td>
                <td><a href="/content/dam/folder" id="ParentLocation">/content/dam/folder</a></td>
            </tr>
            <tr>
                <td>Path</td>
                <td><div id="Path">/content/dam/folder/catalog.pdf</div></td>
            </tr>
            <tr>
                <td>Referer</td>
                <td><a href="" id="Referer"></a></td>
            </tr>
            <tr>
                <td>ChangeLog</td>
                <td><div id="ChangeLog">&lt;pre&gt;deleted("/content/dam/folder/catalog.pdf");&lt;br/&gt;&lt;/pre&gt;</div></td>
            </tr>
        </tbody>
    </table>
    <p><a href="">Go Back</a></p>
    <p><a href="/content/dam/folder/catalog.pdf">Modified Resource</a></p>
    <p><a href="/content/dam/folder">Parent of Modified Resource</a></p>
    </body>
</html>

Error Scenarios

Start concurrent chunk upload

If the Sling client starts a new upload while a chunk upload is already in progress at the same path, Sling responds with a 500 Internal Server Error and the error message "Chunk upload already in progress at {path}".

Concurrent chunk upload request
POST /content/dam/folder HTTP/1.1
Authorization: Basic YWRtaW46YWRtaW4=
Transfer-Encoding: chunked
Content-Type: multipart/form-data; boundary=WR64qwKjZHY7i8CXduKaVyT6hxsIyBjAie
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.1 (java 1.5)
Host: localhost:4502

--WR64qwKjZHY7i8CXduKaVyT6hxsIyBjAie
Content-Disposition: form-data; name="catalog.pdf@Length"
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

1000
--WR64qwKjZHY7i8CXduKaVyT6hxsIyBjAie
Content-Disposition: form-data; name="catalog.pdf@Offset"
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

0
--WR64qwKjZHY7i8CXduKaVyT6hxsIyBjAie
Content-Disposition: form-data; name="catalog.pdf"; filename="catalog.pdf"
Content-Type: application/pdf
Content-Transfer-Encoding: binary

$binary_data
--WR64qwKjZHY7i8CXduKaVyT6hxsIyBjAie--

[response]

Chunk upload already in progress
HTTP/1.1 500 Internal Server Error
Connection: Close
Server: Day-Servlet-Engine/4.1.42
Content-Type: text/html;charset=UTF-8
Date: Mon, 06 May 2013 16:09:58 GMT
Transfer-Encoding: chunked

<html>
<head>
    <title>Error while processing /content/dam/folder</title>
</head>
    <body>
    <h1>Error while processing /content/dam/folder</h1>
    <table>
        <tbody>
            <tr>
                <td>Status</td>
                <td><div id="Status">500</div></td>
            </tr>
            <tr>
                <td>Message</td>
                <td><div id="Message">javax.jcr.RepositoryException: Chunk upload already in progress at {/content/dam/folder/catalog.pdf}</div></td>
            </tr>
            <tr>
                <td>Location</td>
                <td><a href="/content/dam/folder" id="Location">/content/dam/folder</a></td>
            </tr>
            <tr>
                <td>Parent Location</td>
                <td><a href="/" id="ParentLocation">/</a></td>
            </tr>
            <tr>
                <td>Path</td>
                <td><div id="Path">/content/dam/folder</div></td>
            </tr>
            <tr>
                <td>Referer</td>
                <td><a href="" id="Referer"></a></td>
            </tr>
            <tr>
                <td>ChangeLog</td>
                <td><div id="ChangeLog">&lt;pre&gt;&lt;/pre&gt;</div></td>
            </tr>
        </tbody>
    </table>
    <p><a href="">Go Back</a></p>
    <p><a href="/content/dam/folder">Modified Resource</a></p>
    <p><a href="/">Parent of Modified Resource</a></p>
    </body>
</html>

Start resumable upload from non-zero offset

If the Sling client starts an upload from a non-zero offset, Sling responds with a 500 Internal Server Error and the error message "no chunk upload found at {path}", wrapped in a javax.jcr.RepositoryException.

Start chunk upload request from non-zero offset
POST /content/dam/folder HTTP/1.1
Authorization: Basic YWRtaW46YWRtaW4=
Transfer-Encoding: chunked
Content-Type: multipart/form-data; boundary=4SC3O7Wgs4nrN8yqNaH1TNfQRxPK62
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.1 (java 1.5)
Host: localhost:4502

--4SC3O7Wgs4nrN8yqNaH1TNfQRxPK62
Content-Disposition: form-data; name="catalog.pdf@Length"
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

1700
--4SC3O7Wgs4nrN8yqNaH1TNfQRxPK62
Content-Disposition: form-data; name="catalog.pdf@Offset"
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

200
--4SC3O7Wgs4nrN8yqNaH1TNfQRxPK62
Content-Disposition: form-data; name="catalog.pdf"; filename="catalog.pdf"
Content-Type: application/pdf
Content-Transfer-Encoding: binary

$binary_data
--4SC3O7Wgs4nrN8yqNaH1TNfQRxPK62--

[response]

No chunk upload found
HTTP/1.1 500 Internal Server Error
Connection: Close
Server: Day-Servlet-Engine/4.1.42
Content-Type: text/html;charset=UTF-8
Date: Mon, 06 May 2013 16:22:55 GMT
Transfer-Encoding: chunked

<html>
<head>
    <title>Error while processing /content/dam/folder</title>
</head>
    <body>
    <h1>Error while processing /content/dam/folder</h1>
    <table>
        <tbody>
            <tr>
                <td>Status</td>
                <td><div id="Status">500</div></td>
            </tr>
            <tr>
                <td>Message</td>
                <td><div id="Message">javax.jcr.RepositoryException: no chunk upload found at {/content/dam/folder/catalog.pdf}</div></td>
            </tr>
            <tr>
                <td>Location</td>
                <td><a href="/content/dam/folder" id="Location">/content/dam/folder</a></td>
            </tr>
            <tr>
                <td>Parent Location</td>
                <td><a href="/" id="ParentLocation">/</a></td>
            </tr>
            <tr>
                <td>Path</td>
                <td><div id="Path">/content/dam/folder</div></td>
            </tr>
            <tr>
                <td>Referer</td>
                <td><a href="" id="Referer"></a></td>
            </tr>
            <tr>
                <td>ChangeLog</td>
                <td><div id="ChangeLog">&lt;pre&gt;&lt;/pre&gt;</div></td>
            </tr>
        </tbody>
    </table>
    <p><a href="">Go Back</a></p>
    <p><a href="/content/dam/folder">Modified Resource</a></p>
    <p><a href="/">Parent of Modified Resource</a></p>
    </body>
</html>

Non-contiguous chunk upload

If the Sling client sends a chunk whose offset does not directly follow the data already uploaded, Sling responds with a 500 Internal Server Error and the error message "Chunk's offset {actual offset} doesn't match expected offset {expected offset}", wrapped in a javax.jcr.RepositoryException.

Non-contiguous chunk upload request
POST /content/dam/folder HTTP/1.1
Authorization: Basic YWRtaW46YWRtaW4=
Transfer-Encoding: chunked
Content-Type: multipart/form-data; boundary=i3nkScb8nmEmcC87H-LOXKXPO5cutm6
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.1 (java 1.5)
Host: localhost:4502

--i3nkScb8nmEmcC87H-LOXKXPO5cutm6
Content-Disposition: form-data; name="catalog.pdf@Length"
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

1000
--i3nkScb8nmEmcC87H-LOXKXPO5cutm6
Content-Disposition: form-data; name="catalog.pdf@Offset"
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

200
--i3nkScb8nmEmcC87H-LOXKXPO5cutm6
Content-Disposition: form-data; name="catalog.pdf"; filename="catalog.pdf"
Content-Type: application/pdf
Content-Transfer-Encoding: binary

$binary_data
--i3nkScb8nmEmcC87H-LOXKXPO5cutm6--

[response]

Offset mismatch error
HTTP/1.1 500 Internal Server Error
Connection: Close
Server: Day-Servlet-Engine/4.1.42
Content-Type: text/html;charset=UTF-8
Date: Mon, 06 May 2013 16:09:58 GMT
Transfer-Encoding: chunked

<html>
<head>
    <title>Error while processing /content/dam/folder</title>
</head>
    <body>
    <h1>Error while processing /content/dam/folder</h1>
    <table>
        <tbody>
            <tr>
                <td>Status</td>
                <td><div id="Status">500</div></td>
            </tr>
            <tr>
                <td>Message</td>
                <td><div id="Message">javax.jcr.RepositoryException: Chunk's offset {200} doesn't match expected offset {600}</div></td>
            </tr>
            <tr>
                <td>Location</td>
                <td><a href="/content/dam/folder" id="Location">/content/dam/folder</a></td>
            </tr>
            <tr>
                <td>Parent Location</td>
                <td><a href="/" id="ParentLocation">/</a></td>
            </tr>
            <tr>
                <td>Path</td>
                <td><div id="Path">/content/dam/folder</div></td>
            </tr>
            <tr>
                <td>Referer</td>
                <td><a href="" id="Referer"></a></td>
            </tr>
            <tr>
                <td>ChangeLog</td>
                <td><div id="ChangeLog">&lt;pre&gt;&lt;/pre&gt;</div></td>
            </tr>
        </tbody>
    </table>
    <p><a href="">Go Back</a></p>
    <p><a href="/content/dam/folder">Modified Resource</a></p>
    <p><a href="/">Parent of Modified Resource</a></p>
    </body>
</html>
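
The three error scenarios above amount to a check of the incoming "Offset" against the upload state. The following sketch models that check; it is not Sling's implementation. The function name check_offset and the convention that chunks_length is None when no upload is in progress are assumptions, while the message texts are taken from the responses above.

```python
from typing import Optional


def check_offset(path: str, offset: int, chunks_length: Optional[int]) -> Optional[str]:
    """Return the error message for a rejected chunk, or None if it is accepted.

    `chunks_length` is the current sling:chunksLength, or None if no chunk
    upload is in progress at `path`.
    """
    if chunks_length is None:
        if offset != 0:
            return f"no chunk upload found at {{{path}}}"
        return None  # first chunk of a new upload
    if offset == 0:
        return f"Chunk upload already in progress at {{{path}}}"
    if offset != chunks_length:
        return f"Chunk's offset {{{offset}}} doesn't match expected offset {{{chunks_length}}}"
    return None  # chunk continues exactly where the stored data ends
```

For example, a chunk at offset 200 sent while 600 bytes are stored produces the offset mismatch message shown in the last response.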

1 Comment

  1. Some comments:

    • I second Julian Reschke's proposal to return 201/Created from the POST requests. The Location header should be set to the URL used to test for chunked upload, except for the final POST, which of course returns the URL of the actual uploaded binary.
    • Request extension for POST: Yes, we need a request extension to properly assign selectors. I suggest using res instead of html. We also use this extension in the Default GET Servlet to request streaming the result in case we need an extension.
    • Chunk numbers: Using chunk numbering as proposed implies that there is a predefined size for each chunk. The respective specification is missing from this proposal. Otherwise, instead of using chunk numbers you could use size ranges. For example, the discovery request returns the size of contiguous data already uploaded successfully. The POST requests indicate the file offset (the number of bytes sent is equal to the Content-Length header).
    • If the check request has to have an extension, this should be JSON to reflect the actual data format expected in the response. This URL should (see above) be used as the Location header on the 201/CREATED responses to the POST request.