p:unarchive (3.0) 
Extracts documents from an archive file.
<p:declare-step type="p:unarchive">
  <input port="source" primary="true" content-types="any" sequence="false"/>
  <output port="result" primary="true" content-types="any" sequence="true"/>
  <option name="exclude-filter" as="xs:string*" required="false" select="()"/>
  <option name="format" as="xs:QName?" required="false" select="()"/>
  <option name="include-filter" as="xs:string*" required="false" select="()"/>
  <option name="override-content-types" as="array(array(xs:string))?" required="false" select="()"/>
  <option name="parameters" as="map(xs:QName, item()*)?" required="false" select="()"/>
  <option name="relative-to" as="xs:anyURI?" required="false" select="()"/>
</p:declare-step>
The p:unarchive step extracts documents from an archive file (for instance a ZIP file) and returns them on its result port.
Ports:
| Port | Type | Primary? | Content types | Seq? | Description |
|---|---|---|---|---|---|
| source | input | true | any | false | The archive to extract the documents from. |
| result | output | true | any | true | The extracted documents. |
Options:
The p:unarchive step allows you to extract one or more documents from an archive (for instance a ZIP file). The result will be a sequence of the
extracted documents on the result port.
You can specify exactly which documents to extract or not to extract using the include-filter and
exclude-filter options. See Determining which files to extract.
Sometimes it is important to specify the exact base URI of the extracted documents for subsequent steps. You can do this using the
relative-to option. See The base URI of the extracted files.
Although probably rare, it is also possible to control the content type (MIME type) of the extracted documents, using the
override-content-types option. See Overriding content types for an example.
Archives come in many formats. The only format the p:unarchive step is required to handle is ZIP. However, depending on the XProc processor used,
other formats may also be processed.
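As a minimal sketch of stating the format explicitly, the pipeline below sets the format option from the step signature to zip, the value for the one format every processor must support. Which other values are accepted depends on the processor, so treat anything other than zip as an assumption to check against its documentation.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result" sequence="true"/>

  <!-- Name the archive format explicitly instead of leaving it to the processor.
       "zip" is the only value a processor is required to understand. -->
  <p:unarchive format="zip"/>
</p:declare-step>
```

When the format option is left out (the default), the step has to work out the archive format by itself.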
The include-filter and exclude-filter options determine which documents to extract. Both options must be
a sequence of zero or more XPath regular expressions, as strings. For an example see Excluding documents.
Basic operation:
The paths of the documents in the archive are matched against the regular expressions.
A document must be included and not excluded.
An empty include-filter option means: all documents are included. An empty
exclude-filter option means: no documents are excluded.
In more detail:
First, the include-filter option is processed:
If it is empty (its value is the empty sequence ()), all documents in the archive are
included.
Otherwise, the path of every document in the archive is matched against the list of regular expressions in
the include-filter option (like in matches($path-in-archive, $regular-expression)). If one of the
regular expressions matches, the document is included, otherwise it is excluded.
Now the exclude-filter option is processed against the resulting list of entries:
If it is empty (its value is the empty sequence ()), no further documents are excluded.
Otherwise, the path of every remaining document is matched against the list of regular expressions in
the exclude-filter option (like in matches($path-in-archive, $regular-expression)). If one of the
regular expressions matches, the document is excluded, otherwise it is included.
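The two filtering passes described above can be sketched with both options set at once. The archive contents named in the comments (reference.xml, images/logo.png, drafts/notes.xml and drafts/readme.txt) are hypothetical and serve only to trace the rules.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result" sequence="true"/>

  <p:unarchive>
    <!-- Pass 1: keep only paths ending in .xml or .png.
         Of the hypothetical entries this keeps reference.xml, images/logo.png
         and drafts/notes.xml, and drops drafts/readme.txt. -->
    <p:with-option name="include-filter" select="('\.xml$', '\.png$')"/>
    <!-- Pass 2: of what is left, drop every path that starts with drafts/.
         This removes drafts/notes.xml, so reference.xml and images/logo.png remain. -->
    <p:with-option name="exclude-filter" select="'^drafts/'"/>
  </p:unarchive>
</p:declare-step>
```

Because exclusion is applied after inclusion, an entry that matches both filters is not extracted.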
If the value for one of these options is a single value, you can set it by attribute:
<p:unarchive exclude-filter="\.xml$"/>
However, if more than one value is involved you must use <p:with-option> (providing a sequence with multiple
values by attribute is not possible):
<p:unarchive>
<p:with-option name="exclude-filter" select="('\.xml$', '\.jpg$')"/>
</p:unarchive>
The relative-to option can be used to specify the base-uri document-property of the extracted
documents:
If the relative-to option is not specified, the base-uri document-property of
an extracted document is the full URI of the archive followed by the path of the document in the archive. For
instance: file:///path/to/archive/archive.zip/path/in/archive/test.xml.
If a relative-to option is specified, it must be a valid URI. The base-uri document-property of an
extracted document is this URI followed by the path of the document in the archive. For instance, with
the relative-to option set to file:///my/documents/, the base URI becomes
file:///my/documents/path/in/archive/test.xml.
The Basic usage and most other examples show what happens if you don’t specify a
relative-to option. The Using relative-to example shows what happens if you do.
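One reason to control the base URI is that a later step can use it as a target location. The sketch below is an illustration, not part of the original examples: it unpacks the archive relative to file:///my/documents/ and writes each extracted document to the location named by its base-uri document-property using p:store. Whether missing directories are created is up to the processor.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result" sequence="true"/>

  <!-- Give every extracted document a base URI below file:///my/documents/ -->
  <p:unarchive relative-to="file:///my/documents/"/>

  <p:for-each>
    <!-- Store the current document at the URI held in its base-uri property -->
    <p:store href="{p:document-property(., 'base-uri')}"/>
  </p:for-each>
</p:declare-step>
```

Without the relative-to option, the base URIs would point inside the archive file itself (for instance file:/…/…/test.zip/reference.xml), which is usually not a usable location to write to.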
Assume we have a simple ZIP archive with two entries:
An XML file in the root called reference.xml
An image in an images/ sub-directory called logo.png.
The following pipeline uses p:unarchive to extract its contents. The <p:for-each> construction after the p:unarchive creates an
overview of what was extracted. The actual extracted files are discarded.
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
<p:input port="source"/>
<p:output port="result"/>
<p:unarchive/>
<p:for-each>
<p:identity>
<p:with-input exclude-inline-prefixes="#all">
<unarchived-file href="{p:document-property(/, 'base-uri')}" content-type="{p:document-property(/, 'content-type')}"/>
</p:with-input>
</p:identity>
</p:for-each>
<p:wrap-sequence wrapper="unarchived-files"/>
</p:declare-step>
Resulting overview of the extracted files:
<unarchived-files>
<unarchived-file content-type="image/png" href="file:/…/…/test.zip/images/logo.png"/>
<unarchived-file content-type="application/xml" href="file:/…/…/test.zip/reference.xml"/>
</unarchived-files>
This example uses the same ZIP archive as Basic usage. The exclude-filter option excludes the
entries ending with .xml:
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
<p:input port="source"/>
<p:output port="result"/>
<p:unarchive exclude-filter="\.xml$"/>
<p:for-each>
<p:identity>
<p:with-input exclude-inline-prefixes="#all">
<unarchived-file href="{p:document-property(/, 'base-uri')}" content-type="{p:document-property(/, 'content-type')}"/>
</p:with-input>
</p:identity>
</p:for-each>
<p:wrap-sequence wrapper="unarchived-files"/>
</p:declare-step>
Resulting overview of the extracted files:
<unarchived-files>
<unarchived-file content-type="image/png" href="file:/…/…/test.zip/images/logo.png"/>
</unarchived-files>
The following example excludes all documents from the images sub-directory:
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
<p:input port="source"/>
<p:output port="result"/>
<p:unarchive exclude-filter="^images/"/>
<p:for-each>
<p:identity>
<p:with-input exclude-inline-prefixes="#all">
<unarchived-file href="{p:document-property(/, 'base-uri')}" content-type="{p:document-property(/, 'content-type')}"/>
</p:with-input>
</p:identity>
</p:for-each>
<p:wrap-sequence wrapper="unarchived-files"/>
</p:declare-step>
Resulting overview of the extracted files:
<unarchived-files>
<unarchived-file content-type="application/xml" href="file:/…/…/test.zip/reference.xml"/>
</unarchived-files>
This example uses the same ZIP archive as Basic usage. The following pipeline explicitly sets the base part of the
URIs for the extracted documents to file:///my/documents/:
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
<p:input port="source"/>
<p:output port="result"/>
<p:unarchive relative-to="file:///my/documents/"/>
<p:for-each>
<p:identity>
<p:with-input exclude-inline-prefixes="#all">
<unarchived-file href="{p:document-property(/, 'base-uri')}" content-type="{p:document-property(/, 'content-type')}"/>
</p:with-input>
</p:identity>
</p:for-each>
<p:wrap-sequence wrapper="unarchived-files"/>
</p:declare-step>
Resulting overview of the extracted files:
<unarchived-files>
<unarchived-file content-type="image/png" href="file:///my/documents/images/logo.png"/>
<unarchived-file content-type="application/xml" href="file:///my/documents/reference.xml"/>
</unarchived-files>
This example uses the same ZIP archive as Basic usage. The following pipeline explicitly sets the content type for
.png files to application/octet-stream:
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
<p:input port="source"/>
<p:output port="result"/>
<p:unarchive>
<p:with-option name="override-content-types" select="[ ['\.png$', 'application/octet-stream'] ]"/>
</p:unarchive>
<p:for-each>
<p:identity>
<p:with-input exclude-inline-prefixes="#all">
<unarchived-file href="{p:document-property(/, 'base-uri')}" content-type="{p:document-property(/, 'content-type')}"/>
</p:with-input>
</p:identity>
</p:for-each>
<p:wrap-sequence wrapper="unarchived-files"/>
</p:declare-step>
Resulting overview of the extracted files:
<unarchived-files>
<unarchived-file content-type="application/octet-stream" href="file:/…/…/test.zip/images/logo.png"/>
<unarchived-file content-type="application/xml" href="file:/…/…/test.zip/reference.xml"/>
</unarchived-files>
More information about how this mechanism works can be found in the description of the p:archive-manifest step.
No document-properties from the document on the source port survive.
A relative value for the relative-to option is resolved against the base URI of the element in the pipeline on which it is
specified. In most cases this will be the URI of the pipeline document.
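A small sketch of what this resolution means in practice. The pipeline location file:///project/pipeline.xpl and the directory name unpacked/ are assumptions made up for the illustration.

```xml
<!-- Assumption: this pipeline is stored as file:///project/pipeline.xpl -->
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result" sequence="true"/>

  <!-- "unpacked/" is resolved against the base URI of this p:unarchive element,
       giving file:///project/unpacked/. An archive entry reference.xml would
       then get the base URI file:///project/unpacked/reference.xml. -->
  <p:unarchive relative-to="unpacked/"/>
</p:declare-step>
```

Note the trailing slash on the value, as in the file:///my/documents/ example: the path in the archive is added after it.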
Paths in an archive are always relative. However, depending on how an archive was constructed, a path in it may or may not
start with a leading /. Usually it will not. Archives constructed by p:archive never have a leading slash in their
paths.
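If a pipeline has to cope with archives from unknown sources, a filter can be written so that it tolerates both forms. The sketch below is one way to do that, assuming the entries of interest live in an images/ directory: the /? in the regular expression makes the leading slash optional.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result" sequence="true"/>

  <!-- Matches both "images/logo.png" and "/images/logo.png" -->
  <p:unarchive include-filter="^/?images/"/>
</p:declare-step>
```

A filter anchored as ^images/ would silently skip every entry of an archive that stores its paths with a leading slash.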
The only format this step is required to handle is ZIP. The ZIP format definition can be found here.
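| Error code | Description |
|---|---|
| XC0079 | It is a dynamic error if the map parameters contains an entry whose key is defined by the implementation and whose value is not valid for that key. |
| XC0085 | It is a dynamic error if the format of the archive does not match the specified format, cannot be understood, determined and/or processed. |
| XC0120 | It is a dynamic error if the relative-to option is not present and the document on the source port does not have a base URI. |
| XC0147 | It is a dynamic error if the specified value is not a valid XPath regular expression. |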
This description of the p:unarchive step is for XProc version: 3.0. This is a required step (an XProc 3.0 processor must support this).
The formal specification for the p:unarchive step can be found here.
The p:unarchive step is also present in version 3.1.