p:unarchive (3.0)

Extracts documents from an archive file.

Summary
Description
- Determining which files to extract
- The base URI of the extracted files
Examples
Additional details
Errors raised
Reference information

Summary

<p:declare-step type="p:unarchive">
  <input port="source" primary="true" content-types="any" sequence="false"/>
  <output port="result" primary="true" content-types="any" sequence="true"/>
  <option name="exclude-filter" as="xs:string*" required="false" select="()"/>
  <option name="format" as="xs:QName?" required="false" select="()"/>
  <option name="include-filter" as="xs:string*" required="false" select="()"/>
  <option name="override-content-types" as="array(array(xs:string))?" required="false" select="()"/>
  <option name="parameters" as="map(xs:QName, item()*)?" required="false" select="()"/>
  <option name="relative-to" as="xs:anyURI?" required="false" select="()"/>
</p:declare-step>

The p:unarchive step extracts documents from an archive file (for instance a ZIP file) and returns them on its result port.

Ports:

Port	Type	Primary?	Content types	Seq?	Description
`source`	`input`	`true`	`any`	`false`	The archive to extract the documents from.
`result`	`output`	`true`	`any`	`true`	The extracted documents.

Options:

Name	Type	Req?	Default	Description
`exclude-filter`	`xs:string*` (XPath regular expression)	`false`	`()`	A sequence of XPath regular expressions (as strings) that determine which files in the archive are not extracted. See Determining which files to extract.
`format`	`xs:QName?`	`false`	`()`	The format of the archive file on the `source` port: If its value is `zip`, the `p:unarchive` step expects a ZIP archive on the `source` port. If absent or the empty sequence, the `p:unarchive` step tries to guess the archive file format. The only format that this step is required to recognize and handle is ZIP. Whether any other archive formats can be handled and what their names (values for this option) are depends on the XProc processor used.
`include-filter`	`xs:string*` (XPath regular expression)	`false`	`()`	A sequence of XPath regular expressions (as strings) that determine which files in the archive are extracted. See Determining which files to extract.
`override-content-types`	`array(array(xs:string))?`	`false`	`()`	Use this to override the content-type determination for the extracted files (the value of their `content-type` document-property). This mechanism works the same as for the `p:archive-manifest` step. See Overriding content types for an example.
`parameters`	`map(xs:QName, item()*)?`	`false`	`()`	Parameters used to control the document extraction. The XProc specification does not define any parameters for this option. A specific XProc processor might define its own.
`relative-to`	`xs:anyURI?`	`false`	`()`	This option can be used to explicitly set the `base-uri` document-property of the extracted documents. See The base URI of the extracted files.

Description

The p:unarchive step allows you to extract one or more documents from an archive (for instance a ZIP file). The result will be a sequence of the extracted documents on the result port.

You can specify exactly which documents to extract or not to extract using the include-filter and exclude-filter options. See Determining which files to extract.
Sometimes it is important to specify the exact base URI of the extracted documents for subsequent steps. You can do this using the relative-to option. See The base URI of the extracted files.
Although probably rare, it is also possible to control the content type (MIME type) of the extracted documents, using the override-content-types option. See Overriding content types for an example.
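
As a minimal sketch of such an override, the fragment below assigns a different content type to PNG entries. It assumes the pair format used by p:archive-manifest (each inner array holds a regular expression for the path in the archive, followed by the content type to assign). The regular expression and the content type are illustrative choices only.

```xml
<p:unarchive>
  <!-- Each inner array is a pair: a regular expression matched against the
       path in the archive, then the content type to assign on a match.
       The values here are illustrative assumptions. -->
  <p:with-option name="override-content-types"
                 select="[ ['\.png$', 'application/octet-stream'] ]"/>
</p:unarchive>
```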

Archives come in many formats. The only format the p:unarchive step is required to handle is ZIP. However, depending on the XProc processor used, other formats may also be processed.
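
If you do not want the step to guess the archive format, you can state it explicitly. A minimal sketch, using zip, the one value every processor is required to support:

```xml
<!-- Explicitly declare that the document on the source port is a ZIP archive -->
<p:unarchive format="zip"/>
```

Names for any other formats depend on the XProc processor used, so they are not shown here.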

Determining which files to extract

The include-filter and exclude-filter options determine which documents to extract. Both options must be a sequence of zero or more XPath regular expressions, as strings. For an example see Excluding documents. Basic operation:

The paths of the documents in the archive are matched against the regular expressions.
To be extracted, a document must be included and not excluded.
An empty include-filter option means: all documents are included. An empty exclude-filter option means: no documents are excluded.

In more detail:

First, the include-filter option is processed:
- If it is empty (its value is the empty sequence ()), all documents in the archive are included.
- Otherwise, the path of every document in the archive is matched against the list of regular expressions in the include-filter option (like in matches($path-in-archive, $regular-expression)). If one of the regular expressions matches, the document is included, otherwise it is excluded.
Now the exclude-filter option is processed against the resulting list of entries:
- If it is empty (its value is the empty sequence ()), no further documents are excluded.
- Otherwise, the path of every remaining document is matched against the list of regular expressions in the exclude-filter option (like in matches($path-in-archive, $regular-expression)). If one of the regular expressions matches, the document is excluded, otherwise it stays included.
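
The two passes described above can be condensed into a single XPath test. The following is a sketch only, not the processor's actual implementation: the wrapper pipeline, its option names and their default values are hypothetical and serve purely to make the expression runnable.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:xs="http://www.w3.org/2001/XMLSchema" version="3.0">

  <p:output port="result"/>

  <!-- Hypothetical inputs: one path in an archive and the two filters -->
  <p:option name="path" as="xs:string" select="'images/logo.png'"/>
  <p:option name="include-filter" as="xs:string*" select="('^images/')"/>
  <p:option name="exclude-filter" as="xs:string*" select="('\.jpg$')"/>

  <!-- Included: the include filter is empty or at least one of its expressions matches.
       Not excluded: none of the exclude expressions matches. -->
  <p:variable name="extracted" as="xs:boolean" select="
      (empty($include-filter) or (some $re in $include-filter satisfies matches($path, $re)))
      and not(some $re in $exclude-filter satisfies matches($path, $re))"/>

  <p:identity>
    <p:with-input exclude-inline-prefixes="#all">
      <extracted path="{$path}">{$extracted}</extracted>
    </p:with-input>
  </p:identity>

</p:declare-step>
```

With the default values shown, images/logo.png matches the include expression and none of the exclude expressions, so it would be extracted. Note that matches() finds a match anywhere in the path unless the expression is anchored with ^ or $.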

If the value for one of these options is a sequence with just a single value, you can set it using an attribute:

<p:unarchive exclude-filter="\.xml$"/>

However, if more than one value is involved you must use <p:with-option> (providing a sequence with multiple values by attribute is not possible):

<p:unarchive>
  <p:with-option name="exclude-filter" select="('\.xml$', '\.jpg$')"/>
</p:unarchive>
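
Both options can be used together. A minimal sketch, assuming a hypothetical archive that keeps its content in a docs/ directory: only entries under docs/ are included, and of those, the XML and JPEG files are excluded again.

```xml
<p:unarchive include-filter="^docs/">
  <!-- More than one expression, so this option must be set with p:with-option -->
  <p:with-option name="exclude-filter" select="('\.xml$', '\.jpg$')"/>
</p:unarchive>
```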

The base URI of the extracted files

The relative-to option can be used to specify the base-uri document-property of the extracted documents:

If the relative-to option is not specified, the base-uri document-property of an extracted document is the full URI of the archive followed by the path of the document in the archive. For instance: file:///path/to/archive/archive.zip/path/in/archive/test.xml.
If a relative-to option is specified, it must be a valid URI. The base-uri document-property of an extracted document is this URI followed by the path of the document in the archive. For instance, if the relative-to option is set to file:///my/documents/, the result is: file:///my/documents/path/in/archive/test.xml.
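
As a minimal sketch of the second case, the invocation below sets the relative-to option to the same file:///my/documents/ value. The URI is purely illustrative.

```xml
<!-- Illustrative only: file:///my/documents/ is an assumed location -->
<p:unarchive relative-to="file:///my/documents/"/>
```

An entry stored in the archive as path/in/archive/test.xml would then get file:///my/documents/path/in/archive/test.xml as its base-uri document-property.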

The Basic usage and most other examples show what happens if you don’t specify a relative-to option. The Using relative-to example shows what happens if you do.

Examples

Basic usage

Assume we have a simple ZIP archive with two entries:

An XML file in the root called reference.xml.
An image in an images/ sub-directory called logo.png.

The following pipeline uses p:unarchive to extract its contents. The <p:for-each> construction after the p:unarchive creates an overview of what was extracted. The actual extracted files are discarded.

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <p:input port="source"/>
  <p:output port="result"/>

  <p:unarchive/>

  <p:for-each>
    <p:identity>
      <p:with-input exclude-inline-prefixes="#all">
        <unarchived-file href="{p:document-property(/, 'base-uri')}" content-type="{p:document-property(/, 'content-type')}"/>
      </p:with-input>
    </p:identity>
  </p:for-each>
  <p:wrap-sequence wrapper="unarchived-files"/>

</p:declare-step>