p:unarchive (3.1) 

Extracts documents from an archive file.

Summary

<p:declare-step type="p:unarchive">
  <input port="source" primary="true" content-types="any" sequence="false"/>
  <output port="result" primary="true" content-types="any" sequence="true"/>
  <option name="exclude-filter" as="xs:string*" required="false" select="()"/>
  <option name="format" as="xs:QName?" required="false" select="()"/>
  <option name="include-filter" as="xs:string*" required="false" select="()"/>
  <option name="override-content-types" as="array(array(xs:string))?" required="false" select="()"/>
  <option name="parameters" as="map(xs:QName, item()*)?" required="false" select="()"/>
  <option name="relative-to" as="xs:anyURI?" required="false" select="()"/>
</p:declare-step>

The p:unarchive step extracts documents from an archive file (for instance a ZIP file) and returns them on its result port.

Ports:

Port | Type | Primary? | Content types | Seq? | Description
source | input | true | any | false | The archive to extract the documents from.
result | output | true | any | true | The extracted documents.

Options:

Name | Type | Req? | Default | Description

exclude-filter | xs:string* (XPath regular expression) | false | () | A sequence of XPath regular expressions (as strings) that determine which files in the archive are not extracted. See Determining which files to extract.

format | xs:QName? | false | () | The format of the archive file on the source port:

  • If its value is zip, the p:unarchive step expects a ZIP archive on the source port.

  • If absent or the empty sequence, the p:unarchive step tries to guess the archive file format. The only format that this step is required to recognize and handle is ZIP.

  • Whether any other archive formats can be handled and what their names (values for this option) are depends on the XProc processor used.

include-filter | xs:string* (XPath regular expression) | false | () | A sequence of XPath regular expressions (as strings) that determine which files in the archive are extracted. See Determining which files to extract.

override-content-types | array(array(xs:string))? | false | () | Use this to override the content-type determination for the extracted files (the value of their content-type document-property). This mechanism works the same as for the p:archive-manifest step. See Overriding content types for an example.

parameters | map(xs:QName, item()*)? | false | () | Parameters used to control the document extraction. The XProc specification does not define any parameters for this option. A specific XProc processor might define its own.

relative-to | xs:anyURI? | false | () | This option can be used to explicitly set the base-uri document-property of the extracted documents. See The base URI of the extracted files.

Description

The p:unarchive step allows you to extract one or more documents from an archive (for instance a ZIP file). The result will be a sequence of the extracted documents on the result port.

  • You can specify exactly which documents to extract or not to extract using the include-filter and exclude-filter options. See Determining which files to extract.

  • Sometimes it is important to specify the exact base URI of the extracted documents for subsequent steps. You can do this using the relative-to option. See The base URI of the extracted files.

  • Although probably rare, it is also possible to control the content type (MIME type) of the extracted documents, using the override-content-types option. See Overriding content types for an example.

Archives come in many formats. The only format the p:unarchive step is required to handle is ZIP. However, depending on the XProc processor used, other formats may also be processed.
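
If you already know the archive is a ZIP file, you can state the format explicitly rather than letting the processor guess. The following is a minimal sketch, not a definitive pipeline: the archive location is a placeholder, and loading the archive with p:load (instead of passing it on a source port) is an assumption made to keep the example self-contained.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <p:output port="result" sequence="true" content-types="any"/>

  <!-- Placeholder location: replace with the URI of your own archive -->
  <p:load href="file:///path/to/archive/archive.zip" content-type="application/zip"/>

  <!-- Tell the step to expect a ZIP archive instead of guessing the format -->
  <p:unarchive format="zip"/>

</p:declare-step>
```

If the document turns out not to be a ZIP archive, the step is expected to raise error XC0085 (see Errors raised).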

Determining which files to extract

The include-filter and exclude-filter options determine which documents to extract. The value of each option must be a sequence of zero or more XPath regular expressions, as strings. For an example see Excluding documents. Basic operation:

  • The paths of the documents in the archive are matched against the regular expressions.

  • A document is extracted only if it is included and not excluded.

  • An empty include-filter option means: all documents are included. An empty exclude-filter option means: no documents are excluded.

In more detail:

  • First, the include-filter option is processed:

    • If it is empty (its value is the empty sequence ()), all documents in the archive are included.

    • Otherwise, the path of every document in the archive is matched against the list of regular expressions in the include-filter option (as in matches($path-in-archive, $regular-expression)). If at least one of the regular expressions matches, the document is included; otherwise it is excluded.

  • Now the exclude-filter option is processed against the resulting list of entries:

    • If it is empty (its value is the empty sequence ()), no further documents are excluded.

    • Otherwise, the path of every document that is still included is matched against the list of regular expressions in the exclude-filter option (as in matches($path-in-archive, $regular-expression)). If at least one of the regular expressions matches, the document is excluded; otherwise it remains included.

If the value for one of these options is a sequence with just a single value, you can set it using an attribute:

<p:unarchive exclude-filter="\.xml$"/>

However, if more than one value is involved you must use <p:with-option> (providing a sequence with multiple values by attribute is not possible):

<p:unarchive>
  <p:with-option name="exclude-filter" select="('\.xml$', '\.jpg$')"/>
</p:unarchive>
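
Both filters can also be combined in one invocation. As a minimal sketch, assume a hypothetical archive containing reference.xml, notes.txt and images/logo.png. The include-filter first narrows the selection to the .xml and .png entries. The exclude-filter then removes everything under images/, so under the rules described above only reference.xml should be extracted:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <p:input port="source"/>
  <p:output port="result" sequence="true" content-types="any"/>

  <!-- exclude-filter has a single value here, so an attribute is sufficient -->
  <p:unarchive exclude-filter="^images/">
    <!-- include-filter has two values, so it needs p:with-option -->
    <p:with-option name="include-filter" select="('\.xml$', '\.png$')"/>
  </p:unarchive>

</p:declare-step>
```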

The base URI of the extracted files

The relative-to option can be used to specify the base-uri document-property of the extracted documents:

  • If the relative-to option is not specified, the base-uri document-property of an extracted document is the full URI of the archive followed by the path of the document in the archive. For instance: file:///path/to/archive/archive.zip/path/in/archive/test.xml.

  • If a relative-to option is specified, it must be a valid URI. The base-uri document-property of an extracted document is this URI followed by the path of the document in the archive. For instance, if the relative-to option is set to file:///my/documents/, the resulting base URI is file:///my/documents/path/in/archive/test.xml.

The Basic usage and most other examples show what happens if you don’t specify a relative-to option. The Using relative-to example shows what happens if you do.
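
One situation where the base URI matters is writing the extracted documents to disk. The following is a sketch under stated assumptions rather than a definitive recipe: the target location file:///my/documents/ is only an illustration, and whether missing sub-directories are created on storing may depend on the XProc processor used.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <p:input port="source"/>
  <p:output port="result" sequence="true" content-types="any"/>

  <!-- Give every extracted document a base URI below file:///my/documents/ -->
  <p:unarchive relative-to="file:///my/documents/"/>

  <!-- Store each extracted document at the location given by its base URI -->
  <p:for-each>
    <p:store href="{p:document-property(., 'base-uri')}"/>
  </p:for-each>

</p:declare-step>
```

Without the relative-to option, the base URIs would point inside the archive itself (for instance file:///path/to/archive/archive.zip/path/in/archive/test.xml), which is usually not a location you can store to.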

Examples

Basic usage

Assume we have a simple ZIP archive with two entries:

  • An XML file in the root called reference.xml.

  • An image in an images/ sub-directory called logo.png.

The following pipeline uses p:unarchive to extract its contents. The <p:for-each> construction after the p:unarchive creates an overview of what was extracted. The actual extracted files are discarded.

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <p:input port="source"/>
  <p:output port="result"/>

  <p:unarchive/>
  
  <p:for-each>
    <p:identity>
      <p:with-input exclude-inline-prefixes="#all"> 
        <unarchived-file href="{p:document-property(/, 'base-uri')}" content-type="{p:document-property(/, 'content-type')}"/>
      </p:with-input>
    </p:identity>
  </p:for-each>
  <p:wrap-sequence wrapper="unarchived-files"/>
  
</p:declare-step>

Resulting overview of the extracted files:

<unarchived-files>
   <unarchived-file content-type="image/png" href="file:/…/…/test.zip/images/logo.png"/>
   <unarchived-file content-type="application/xml" href="file:/…/…/test.zip/reference.xml"/>
</unarchived-files>

Excluding documents

This example uses the same ZIP archive as Basic usage. The exclude-filter option excludes the entries ending with .xml:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <p:input port="source"/>
  <p:output port="result"/>

  <p:unarchive exclude-filter="\.xml$"/>
  
  <p:for-each>
    <p:identity>
      <p:with-input exclude-inline-prefixes="#all"> 
        <unarchived-file href="{p:document-property(/, 'base-uri')}" content-type="{p:document-property(/, 'content-type')}"/>
      </p:with-input>
    </p:identity>
  </p:for-each>
  <p:wrap-sequence wrapper="unarchived-files"/>
  
</p:declare-step>

Resulting overview of the extracted files:

<unarchived-files>
   <unarchived-file content-type="image/png" href="file:/…/…/test.zip/images/logo.png"/>
</unarchived-files>

The following example excludes all documents from the images sub-directory:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <p:input port="source"/>
  <p:output port="result"/>

  <p:unarchive exclude-filter="^images/"/>
  
  <p:for-each>
    <p:identity>
      <p:with-input exclude-inline-prefixes="#all"> 
        <unarchived-file href="{p:document-property(/, 'base-uri')}" content-type="{p:document-property(/, 'content-type')}"/>
      </p:with-input>
    </p:identity>
  </p:for-each>
  <p:wrap-sequence wrapper="unarchived-files"/>
  
</p:declare-step>

Resulting overview of the extracted files:

<unarchived-files>
   <unarchived-file content-type="application/xml" href="file:/…/…/test.zip/reference.xml"/>
</unarchived-files>

Using relative-to

This example uses the same ZIP archive as Basic usage. The following pipeline explicitly sets the base part of the URIs for the extracted documents to file:///my/documents/:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <p:input port="source"/>
  <p:output port="result"/>

  <p:unarchive relative-to="file:///my/documents/"/>
  
  <p:for-each>
    <p:identity>
      <p:with-input exclude-inline-prefixes="#all"> 
        <unarchived-file href="{p:document-property(/, 'base-uri')}" content-type="{p:document-property(/, 'content-type')}"/>
      </p:with-input>
    </p:identity>
  </p:for-each>
  <p:wrap-sequence wrapper="unarchived-files"/>
  
</p:declare-step>

Resulting overview of the extracted files:

<unarchived-files>
   <unarchived-file content-type="image/png" href="file:///my/documents/images/logo.png"/>
   <unarchived-file content-type="application/xml"
                    href="file:///my/documents/reference.xml"/>
</unarchived-files>

Overriding content types

This example uses the same ZIP archive as Basic usage. The following pipeline explicitly sets the content type for .png files to application/octet-stream:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <p:input port="source"/>
  <p:output port="result"/>

  <p:unarchive>
    <p:with-option name="override-content-types" select="[ ['\.png$', 'application/octet-stream'] ]"/>
  </p:unarchive>
  
  <p:for-each>
    <p:identity>
      <p:with-input exclude-inline-prefixes="#all"> 
        <unarchived-file href="{p:document-property(/, 'base-uri')}" content-type="{p:document-property(/, 'content-type')}"/>
      </p:with-input>
    </p:identity>
  </p:for-each>
  <p:wrap-sequence wrapper="unarchived-files"/>
  
</p:declare-step>

Resulting overview of the extracted files:

<unarchived-files>
   <unarchived-file content-type="application/octet-stream"
                    href="file:/…/…/test.zip/images/logo.png"/>
   <unarchived-file content-type="application/xml" href="file:/…/…/test.zip/reference.xml"/>
</unarchived-files>

More information about how this mechanism works can be found in the description of the p:archive-manifest step.

Additional details

  • No document-properties from the document on the source port survive.

  • A relative value for the relative-to option is resolved against the base URI of the element in the pipeline on which it is specified. In most cases this will be the URI of the pipeline document.

  • Paths in an archive are always relative. However, depending on how archives are constructed, a path in an archive can be with or without a leading /. Usually it will be without. For archives constructed by p:archive no leading slash will be present.

  • The only format this step is required to handle is ZIP. The ZIP format is defined in PKWARE's .ZIP File Format Specification (APPNOTE.TXT).
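
Because a path in an archive may or may not start with a /, a filter that anchors on the start of the path can allow for both forms. The following sketch assumes a hypothetical archive with an images/ sub-directory and uses an optional leading slash in the regular expression:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <p:input port="source"/>
  <p:output port="result" sequence="true" content-types="any"/>

  <!-- ^/? matches paths with and without a leading slash, -->
  <!-- so both images/logo.png and /images/logo.png are excluded -->
  <p:unarchive exclude-filter="^/?images/"/>

</p:declare-step>
```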

Errors raised

Error code | Description
XC0079 | It is a dynamic error if the map parameters contains an entry whose key is defined by the implementation and whose value is not valid for that key.
XC0085 | It is a dynamic error if the format of the archive does not match the specified format, cannot be understood, determined and/or processed.
XC0120 | It is a dynamic error if the relative-to option is not present and the document on the source port does not have a base URI.
XC0147 | It is a dynamic error if the specified value is not a valid XPath regular expression.
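
These errors can be caught with <p:try>. As a minimal sketch, the following pipeline catches XC0085 and returns a small report document instead of failing. The unarchive-failed element and its reason attribute are invented for this illustration and are not part of XProc:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:err="http://www.w3.org/ns/xproc-error" version="3.0">

  <p:input port="source"/>
  <p:output port="result" sequence="true" content-types="any"/>

  <p:try>
    <!-- Raises err:XC0085 if the source document is not a usable ZIP archive -->
    <p:unarchive format="zip"/>
    <p:catch code="err:XC0085">
      <!-- Hypothetical report element, returned instead of the extracted documents -->
      <p:identity>
        <p:with-input>
          <unarchive-failed reason="not-a-zip-archive"/>
        </p:with-input>
      </p:identity>
    </p:catch>
  </p:try>

</p:declare-step>
```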

Reference information

This description of the p:unarchive step is for XProc version: 3.1. This is a required step (an XProc 3.1 processor must support this).

The formal specification for the p:unarchive step can be found here.

The p:unarchive step is also present in version: 3.0.