Extracts documents from an archive file.
<p:declare-step type="p:unarchive">
  <input port="source" primary="true" content-types="any" sequence="false"/>
  <output port="result" primary="true" content-types="any" sequence="true"/>
  <option name="exclude-filter" as="xs:string*" required="false" select="()"/>
  <option name="format" as="xs:QName?" required="false" select="()"/>
  <option name="include-filter" as="xs:string*" required="false" select="()"/>
  <option name="override-content-types" as="array(array(xs:string))?" required="false" select="()"/>
  <option name="parameters" as="map(xs:QName, item()*)?" required="false" select="()"/>
  <option name="relative-to" as="xs:anyURI?" required="false" select="()"/>
</p:declare-step>
The p:unarchive step extracts documents from an archive file (for instance a ZIP file) and returns them on its result port.
Ports:
Port | Type | Primary? | Content types | Seq? | Description |
---|---|---|---|---|---|
source | input | true | any | false | The archive to extract the documents from. |
result | output | true | any | true | The extracted documents. |
Options:
The p:unarchive step allows you to extract one or more documents from an archive (for instance a ZIP file). The result is a sequence of the extracted documents on the result port.
You can specify exactly which documents to extract, or not to extract, using the include-filter and exclude-filter options. See Determining which files to extract.
Sometimes it is important to specify the exact base URI of the extracted documents for subsequent steps. You can do this using the relative-to option. See The base URI of the extracted files.
Although this is probably rarely needed, you can also control the content type (MIME type) of the extracted documents using the override-content-types option. See Overriding content types for an example.
Archives come in many formats. The only format the p:unarchive
step is required to handle is ZIP. However, depending on the XProc processor used,
other formats may also be processed.
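As a minimal sketch of the above, the pipeline below states the archive format explicitly through the format option instead of leaving it to the processor to determine. It assumes that the value zip identifies the ZIP format. Any other format name would depend on the XProc processor used and is not shown here.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <!-- The step can return several documents, so the output must allow a sequence. -->
  <p:output port="result" sequence="true"/>

  <!-- Assumption: "zip" is the format name for ZIP archives. -->
  <p:unarchive format="zip"/>
</p:declare-step>
```

If the archive on the source port turns out not to be a ZIP file, a pipeline like this would be expected to fail with a dynamic error rather than silently return nothing.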
The include-filter and exclude-filter options determine which documents to extract. Both options must be a sequence of zero or more XPath regular expressions, as strings. For an example, see Excluding documents. Basic operation:
The paths of the documents in the archive are matched against the regular expressions.
A document is extracted only if it is included and not excluded.
An empty include-filter
option means: all documents are included. An empty
exclude-filter
option means: no documents are excluded.
In more detail:
First, the include-filter option is processed:
If it is empty (its value is the empty sequence ()), all documents in the archive are included.
Otherwise, the path of every document in the archive is matched against each regular expression in the include-filter option (as in matches($path-in-archive, $regular-expression)). If at least one regular expression matches, the document is included; otherwise it is excluded.
Next, the exclude-filter option is processed against the resulting list of entries:
If it is empty (its value is the empty sequence ()), no further documents are excluded.
Otherwise, the path of every remaining document is matched against each regular expression in the exclude-filter option (as in matches($path-in-archive, $regular-expression)). If at least one regular expression matches, the document is excluded; otherwise it stays included.
If the value for one of these options is a sequence with just a single value, you can set it by attribute:
<p:unarchive exclude-filter="\.xml$"/>
However, if more than one value is involved you must use <p:with-option>
(providing a sequence with multiple
values by attribute is not possible):
<p:unarchive>
  <p:with-option name="exclude-filter" select="('\.xml$', '\.jpg$')"/>
</p:unarchive>
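The two filters can also be combined. The following sketch is an illustration only: it assumes an archive with the entries reference.xml and images/logo.png, stored without a leading slash. The include-filter first keeps every entry ending in .xml or .png, after which the exclude-filter drops everything in the images/ sub-directory, so only reference.xml should remain.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result" sequence="true"/>

  <p:unarchive>
    <!-- Step 1: an entry is included if at least one of these expressions matches its path. -->
    <p:with-option name="include-filter" select="('\.xml$', '\.png$')"/>
    <!-- Step 2: of the included entries, drop those whose path starts with images/. -->
    <p:with-option name="exclude-filter" select="'^images/'"/>
  </p:unarchive>
</p:declare-step>
```

Note that the expression ^images/ is anchored to the start of the path, so it would not match if the archive stored its entries with a leading slash.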
The relative-to option can be used to specify the base-uri document-property of the extracted documents:
If the relative-to option is not specified, the base-uri document-property of an extracted document is the full URI of the archive followed by the path of the document in the archive. For instance: file:///path/to/archive/archive.zip/path/in/archive/test.xml.
If the relative-to option is specified, it must be a valid URI. The base-uri document-property of an extracted document is this URI followed by the path of the document in the archive. For instance, with the relative-to option set to file:///my/documents/, the base URI becomes file:///my/documents/path/in/archive/test.xml.
The Basic usage and most other examples show what happens if you don’t specify a
relative-to
option. The Using relative-to example shows what happens if you do.
Assume we have a simple ZIP archive with two entries:
An XML file in the root called reference.xml
An image in an images/ sub-directory called logo.png.
The following pipeline uses p:unarchive
to extract its contents. The <p:for-each>
construction after the p:unarchive
creates an
overview of what was extracted. The actual extracted files are discarded.
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result"/>
  <p:unarchive/>
  <p:for-each>
    <p:identity>
      <p:with-input exclude-inline-prefixes="#all">
        <unarchived-file href="{p:document-property(/, 'base-uri')}" content-type="{p:document-property(/, 'content-type')}"/>
      </p:with-input>
    </p:identity>
  </p:for-each>
  <p:wrap-sequence wrapper="unarchived-files"/>
</p:declare-step>
Resulting overview of the extracted files:
<unarchived-files>
  <unarchived-file content-type="image/png" href="file:/…/…/test.zip/images/logo.png"/>
  <unarchived-file content-type="application/xml" href="file:/…/…/test.zip/reference.xml"/>
</unarchived-files>
This example uses the same ZIP archive as Basic usage. The exclude-filter option excludes the entries ending with .xml:
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result"/>
  <p:unarchive exclude-filter="\.xml$"/>
  <p:for-each>
    <p:identity>
      <p:with-input exclude-inline-prefixes="#all">
        <unarchived-file href="{p:document-property(/, 'base-uri')}" content-type="{p:document-property(/, 'content-type')}"/>
      </p:with-input>
    </p:identity>
  </p:for-each>
  <p:wrap-sequence wrapper="unarchived-files"/>
</p:declare-step>
Resulting overview of the extracted files:
<unarchived-files>
  <unarchived-file content-type="image/png" href="file:/…/…/test.zip/images/logo.png"/>
</unarchived-files>
The following example excludes all documents from the images
sub-directory:
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result"/>
  <p:unarchive exclude-filter="^images/"/>
  <p:for-each>
    <p:identity>
      <p:with-input exclude-inline-prefixes="#all">
        <unarchived-file href="{p:document-property(/, 'base-uri')}" content-type="{p:document-property(/, 'content-type')}"/>
      </p:with-input>
    </p:identity>
  </p:for-each>
  <p:wrap-sequence wrapper="unarchived-files"/>
</p:declare-step>
Resulting overview of the extracted files:
<unarchived-files>
  <unarchived-file content-type="application/xml" href="file:/…/…/test.zip/reference.xml"/>
</unarchived-files>
This example uses the same ZIP archive as Basic usage. The following pipeline explicitly sets the base part of the URIs for the extracted documents to file:///my/documents/:
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result"/>
  <p:unarchive relative-to="file:///my/documents/"/>
  <p:for-each>
    <p:identity>
      <p:with-input exclude-inline-prefixes="#all">
        <unarchived-file href="{p:document-property(/, 'base-uri')}" content-type="{p:document-property(/, 'content-type')}"/>
      </p:with-input>
    </p:identity>
  </p:for-each>
  <p:wrap-sequence wrapper="unarchived-files"/>
</p:declare-step>
Resulting overview of the extracted files:
<unarchived-files>
  <unarchived-file content-type="image/png" href="file:///my/documents/images/logo.png"/>
  <unarchived-file content-type="application/xml" href="file:///my/documents/reference.xml"/>
</unarchived-files>
This example uses the same ZIP archive as Basic usage. The following pipeline explicitly sets the content type for .png files to application/octet-stream:
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result"/>
  <p:unarchive>
    <p:with-option name="override-content-types" select="[ ['\.png$', 'application/octet-stream'] ]"/>
  </p:unarchive>
  <p:for-each>
    <p:identity>
      <p:with-input exclude-inline-prefixes="#all">
        <unarchived-file href="{p:document-property(/, 'base-uri')}" content-type="{p:document-property(/, 'content-type')}"/>
      </p:with-input>
    </p:identity>
  </p:for-each>
  <p:wrap-sequence wrapper="unarchived-files"/>
</p:declare-step>
Resulting overview of the extracted files:
<unarchived-files>
  <unarchived-file content-type="application/octet-stream" href="file:/…/…/test.zip/images/logo.png"/>
  <unarchived-file content-type="application/xml" href="file:/…/…/test.zip/reference.xml"/>
</unarchived-files>
More information about how this mechanism works can be found in the description of the p:archive-manifest
step.
No document-properties from the document on the source
port survive.
A relative value for the relative-to option is resolved against the base URI of the element in the pipeline on which it is specified. In most cases this is the location of the pipeline document.
Paths in an archive are always relative. However, depending on how an archive was constructed, a path in it may or may not start with a leading /. Usually it does not. Archives constructed by p:archive never contain a leading slash.
The only format this step is required to handle is ZIP. The ZIP format definition can be found here.
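To illustrate the note on relative values for the relative-to option, here is a minimal sketch. The pipeline location file:///my/pipelines/unzip.xpl and the directory name extracted/ are hypothetical and only serve the illustration. Under that assumption, the value extracted/ would be resolved to file:///my/pipelines/extracted/, and an archive entry reference.xml would be expected to get the base URI file:///my/pipelines/extracted/reference.xml.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result" sequence="true"/>

  <!-- Relative value: resolved against the base URI of this p:unarchive element,
       which normally is the location of the pipeline document itself. -->
  <p:unarchive relative-to="extracted/"/>
</p:declare-step>
```

Using an absolute URI instead, as in the Using relative-to example, makes the result independent of where the pipeline document is stored.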
Error code | Description |
---|---|
It is a dynamic error if the map | |
It is a dynamic error if the format of the archive does not match the specified format, cannot be understood, determined and/or processed. | |
It is a dynamic error if the | |
It is a dynamic error if the specified value is not a valid XPath regular expression. |
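The dynamic errors listed above can be handled inside a pipeline with p:try and p:catch. The following is a sketch only: it catches every error without testing for a specific error code, and the unarchive-failed element is a hypothetical placeholder document, not something defined by the step.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result" sequence="true"/>

  <p:try>
    <!-- Attempt the extraction. Any dynamic error raised here is handled below. -->
    <p:unarchive/>
    <p:catch>
      <!-- No code attribute is given, so this branch handles every error. -->
      <p:identity>
        <p:with-input exclude-inline-prefixes="#all">
          <unarchive-failed/>
        </p:with-input>
      </p:identity>
    </p:catch>
  </p:try>
</p:declare-step>
```

A more selective variant could put specific error codes in the code attribute of p:catch, so that unexpected errors still stop the pipeline.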
This description of the p:unarchive
step is for XProc version: 3.1. This is a required step (an XProc 3.1 processor must support this).
The formal specification for the p:unarchive
step can be found here.
The p:unarchive
step is also present in version:
3.0.