p:archive (3.1) 

Perform operations on archive files.

Summary

<p:declare-step type="p:archive">
  <input port="source" primary="true" content-types="any" sequence="true"/>
  <output port="result" primary="true" content-types="any" sequence="false"/>
  <input port="archive" primary="false" content-types="any" empty="true" sequence="true">
    <p:empty/>
  </input>
  <input port="manifest" primary="false" content-types="xml" empty="true" sequence="true">
    <p:empty/>
  </input>
  <output port="report" primary="false" content-types="application/xml" sequence="false"/>
  <option name="format" as="xs:QName" required="false" select="'zip'"/>
  <option name="parameters" as="map(xs:QName, item()*)?" required="false" select="()"/>
  <option name="relative-to" as="xs:anyURI?" required="false" select="()"/>
</p:declare-step>

The p:archive step can perform several different operations on archive files (for instance ZIP files). The most common one will likely be creating one, but it could also provide services like update, freshen or even merge. The resulting archive appears on its result port.

Ports:

| Port | Type | Primary? | Content types | Seq? | Description |
| --- | --- | --- | --- | --- | --- |
| source | input | true | any | true | The source port is used to provide the documents to be archived. How and which of these documents are processed is governed by the document(s) appearing on the other input ports and the combination of options and parameters. See below for details. |
| result | output | true | any | false | The resulting archive. |
| archive | input | false | any | true | Optional archives for operations like update, freshen or merge. |
| manifest | input | false | xml | true | An optional manifest document that tells the step how to construct the archive. If no manifest document is provided on this port, a default manifest is constructed automatically. See The XML archive manifest document format for details. |
| report | output | false | application/xml | false | A report about the archiving operation. This will be the same as the manifest, optionally amended with additional attributes and/or elements. |

Options:

| Name | Type | Req? | Default | Description |
| --- | --- | --- | --- | --- |
| format | xs:QName | false | zip | The format of the archive. If its value is zip (the default), the p:archive step produces a ZIP archive and expects any archive supplied on the archive port to be a ZIP archive. Whether any other archive formats can be handled and what their names (values for this option) are is implementation-defined and therefore dependent on the XProc processor used. |
| parameters | map(xs:QName, item()*)? | false | () | Parameters controlling the archiving. Several parameters are defined for processing ZIP archives (see Handling of ZIP archives). A specific XProc processor might define its own. |
| relative-to | xs:anyURI? | false | () | This option is used in creating a manifest when no manifest is provided on the manifest port. If a manifest is present, this option is not used. |

Description

The p:archive step is the Swiss army knife for handling archives. Its most common use is creating archives, but it could also be used for operations like update, freshen or even merge.

To make all this possible, the operation of p:archive is unfortunately quite complicated. The details follow below; here is a summary:

  • Exactly what ends up in the resulting archive is controlled using a manifest document (see The XML archive manifest document format). In such a manifest you specify the URI of each document to add and its path in the archive.

    A manifest of an existing archive, sometimes useful as a starting point, can be produced using the p:archive-manifest step.

  • Besides the documents in the manifest you can also specify documents to add by providing these on the step’s source port. Any document appearing on this port that is not already mentioned in the manifest is automatically added to the manifest. The path of such a document in the resulting archive can be controlled using the relative-to option.

  • When adding documents to the archive, p:archive compares the base URIs in the manifest with those of the documents appearing on the source port (the value of the base-uri document-property). If these match, the document on the source port is added. If not, the URI in the manifest is used to load a document (usually from disk).

Archives come in many formats. The only format the p:archive step is required to handle is ZIP. However, depending on the XProc processor used, other formats may also be processed.

The XML archive manifest document format

An archive manifest is an XML document that specifies the files to process when constructing the archive. It is also used as the result format of the p:archive-manifest step.

Its root element is <c:archive> (the c prefix here is bound to the http://www.w3.org/ns/xproc-step namespace):

<c:archive>
  ( <c:entry> |
    (any other element)
  )*
</c:archive>

 

| Child element | # | Description |
| --- | --- | --- |
| c:entry | * | An entry (a file) in the archive. |

A <c:entry> element describes a single entry (a file) in the archive:

<c:entry name = xs:string
         href = xs:anyURI
         content-type? = xs:string
         comment? = xs:string
         method? = xs:string
         level? = xs:string
         (any other attribute) >
  (any child element)*
</c:entry>

 

| Attribute | # | Type | Description |
| --- | --- | --- | --- |
| name | 1 | xs:string | The name of the entry. This is the path of the file within the archive. Usually this is a relative path. However, depending on how archives are constructed, an absolute path (a path starting with a /) is possible. Archives constructed by XProc steps always use relative paths (no leading /). |
| href | 1 | xs:anyURI | The URI of the entry. This plays an important role in determining which files are added to the archive and how, see below. A relative value is made absolute against the base URI of the manifest itself. |
| content-type | ? | xs:string | The content-type (MIME type) of the entry. The p:archive step ignores it, but the p:archive-manifest step always adds it. |
| comment | ? | xs:string | An optional comment associated with the entry. |
| method | ? | xs:string | The compression method of the entry. There is only one defined value: none, meaning no compression. Any other values are XProc processor dependent. |
| level | ? | xs:string | The compression level of the entry. There are no defined values; all values are XProc processor dependent. |
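As an illustration of the format described above, here is a minimal sketch of a hand-written manifest. The file names and paths are hypothetical; only the elements and attributes are taken from the format definition.

```xml
<c:archive xmlns:c="http://www.w3.org/ns/xproc-step">
  <!-- Relative href: made absolute against the base URI of the manifest document -->
  <c:entry name="docs/readme.txt" href="readme.txt" comment="Project overview"/>
  <!-- method="none" asks for the entry to be stored without compression -->
  <c:entry name="images/logo.png" href="assets/logo.png" method="none"/>
  <!-- Absolute href; content-type is allowed here but ignored by p:archive -->
  <c:entry name="data/report.xml" href="file:///some/path/report.xml"
           content-type="application/xml"/>
</c:archive>
```

Note that name and href are independent: the entry can live under a completely different path in the archive than the file has on disk.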

The p:archive algorithm

The p:archive step follows a rather complicated algorithm. It has two phases:

1 - Construct a complete manifest

First, the manifest (the document, if any, appearing on the manifest port) is checked and completed if necessary:

  • If no document appears on the manifest port, an empty manifest is created.

  • The base URIs of the documents appearing on the source port are compared against the list of base URIs in the manifest (the c:entry/@href values, made absolute). For every document on the source port that is not in the manifest, an entry (<c:entry> element) is created:

    • The c:entry/@href attribute becomes the base URI of the document.

    • The c:entry/@name (which is the path/name of the entry in the archive) is computed against the value of the relative-to option:

      • If the base URI of the document starts with the value of the relative-to option, the c:entry/@name attribute value becomes the substring after this.

      • If the base URI of the document does not start with the value of the relative-to option, the c:entry/@name attribute value becomes the path of this base URI (without a leading /).

      For instance, assume the relative-to option is set to file:///some/path/. A document with base URI file:///some/path/etc/x.txt gets a c:entry/@name attribute value etc/x.txt. A document with base URI file:///someother/path/y.txt gets a c:entry/@name attribute value someother/path/y.txt.

The result of all this is that we now have a manifest that has entries (<c:entry> elements) for all documents appearing on the source port. It can also have entries for documents that are not on the source port. This happens when an entry was present in the initial manifest and no matching document was found for it on the source port.
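To make the relative-to example above concrete: assuming no manifest was supplied, relative-to is file:///some/path/, and the two documents mentioned above are the only ones on the source port, the completed manifest after phase 1 would look roughly like this (a sketch; a processor may add further attributes):

```xml
<c:archive xmlns:c="http://www.w3.org/ns/xproc-step">
  <!-- Base URI starts with the relative-to value: name is the substring after it -->
  <c:entry name="etc/x.txt" href="file:///some/path/etc/x.txt"/>
  <!-- Base URI does not start with the relative-to value: name is the URI's path without the leading / -->
  <c:entry name="someother/path/y.txt" href="file:///someother/path/y.txt"/>
</c:archive>
```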

2 - Process the manifest

The now completed manifest is processed. For every entry (<c:entry> element):

  • If the value of the c:entry/@href attribute matches the base URI of one of the documents appearing on the source port, this document is added to the archive.

    When appropriate (for instance for XML documents), the value of its (optional) serialization document-property is used to serialize it (convert it to text).

  • For other entries, the value of the c:entry/@href attribute is used to load the file (for instance from disk if it starts with file:/) and add it to the archive.

    These documents are used “as is”: no parsing/serialization takes place.

In both cases, the value of the c:entry/@name attribute becomes the name/path of the entry in the archive. The values of the other attributes of the <c:entry> element might also get used, but this is dependent on the XProc processor used and/or the archive’s format.

The p:archive step is supposed to retain the order of the <c:entry> elements. This is, for instance, important when constructing an e-book in EPUB format: this has a non-compressed entry that must be first in the archive.
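The EPUB case can be sketched as follows. This is an illustration under assumptions, not a complete EPUB builder: the epub/ directory and the file names in it are hypothetical and are expected to exist next to the pipeline. All entries are loaded from disk through their href values, so the source port is left empty. The mimetype entry comes first in the manifest and uses method="none" so that it is stored uncompressed.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step" version="3.0">

  <p:output port="result"/>

  <p:archive>
    <!-- Nothing on the source port: every entry is loaded via its href -->
    <p:with-input port="source"><p:empty/></p:with-input>
    <p:with-input port="manifest">
      <c:archive>
        <!-- Must be the first entry and must not be compressed -->
        <c:entry name="mimetype" href="epub/mimetype" method="none"/>
        <c:entry name="META-INF/container.xml" href="epub/META-INF/container.xml"/>
        <c:entry name="OEBPS/content.opf" href="epub/OEBPS/content.opf"/>
      </c:archive>
    </p:with-input>
  </p:archive>

  <p:store href="tmp/book.epub"/>

  <!-- Show what ended up in the archive, in archive order -->
  <p:archive-manifest/>

</p:declare-step>
```

Because the manifest is written inline, its relative href values are resolved against the location of the pipeline document.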

Handling of ZIP archives

When the value of the format option is absent or zip, the following applies:

  • The values of the c:entry/@name attributes in the manifest must be relative paths (without a leading /).

  • The archive port accepts zero or one ZIP archive. If this port is empty, an empty ZIP archive is used as its default value.

  • The parameters option is a map that associates parameters (the keys in the map) with values. For ZIP archives, the following parameters can be used:

    | Parameter | Description |
    | --- | --- |
    | command | Specifies the operation to perform. Its default value is update. See below for a description of the commands. |
    | level | For entries that have no c:entry/@level attribute specified, this is the default compression level for entries added or updated in the archive. For ZIP archives, its possible values are: smallest, fastest, default, huffman and none. |
    | method | For entries that have no c:entry/@method attribute specified, this is the default compression method for entries added or updated in the archive. For ZIP archives, its possible values are: deflated and none. |

    The command parameter can have one of the following values:

    Command

    Description

    update (default)

    The archive appearing on the archive port is updated:

    • An entry in this ZIP archive that corresponds with a c:entry/@name attribute in the manifest gets updated as specified in the <c:entry> element.

    • For other entries in the ZIP archive, first their name/path is made absolute using the base URI of the archive. If a file exists with that URI and is newer than the entry in the ZIP archive, the entry is updated from that file.

    • For all <c:entry> elements in the manifest that have no corresponding entry in the ZIP archive, the document gets added.

    Please note that when there is no document on the archive port, p:archive will always create a new, fresh, archive.

    create

    This behaves like the update command except that timestamps are ignored and updates (if any) always take place.

    freshen

    This behaves like the update command except that no new files will be added.

    delete

    For the delete command a ZIP archive must be present on the archive port. It removes all entries in the ZIP archive that have a corresponding c:entry/@name attribute in the manifest. All other manifest entries are ignored.
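The commands above are selected through the parameters option. The following is a minimal sketch of the delete command, assuming an existing archive tmp/result.zip (as created in the examples below) that contains an entry test/in2.xml; the output file name tmp/result-trimmed.zip is hypothetical. The map keys are written as strings, which XProc is expected to convert to QNames for an option of this type.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step" version="3.0">

  <p:output port="result"/>

  <p:archive parameters="map{'command': 'delete'}">
    <!-- No documents to add -->
    <p:with-input port="source"><p:empty/></p:with-input>
    <!-- The delete command requires an existing ZIP archive on the archive port -->
    <p:with-input port="archive" href="tmp/result.zip"/>
    <!-- Entries whose name matches an entry in the archive are removed -->
    <p:with-input port="manifest">
      <c:archive>
        <c:entry name="test/in2.xml" href="test/in2.xml"/>
      </c:archive>
    </p:with-input>
  </p:archive>

  <p:store href="tmp/result-trimmed.zip"/>

  <!-- List the entries that remain -->
  <p:archive-manifest/>

</p:declare-step>
```

The same pattern applies to the other commands: for instance, map{'command': 'freshen', 'level': 'smallest'} would select the freshen command with a different default compression level.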

Examples

Basic usage

In most cases, the p:archive step will be used to create an archive. If you have no special requirements, this is easy: simply supply the documents for the archive on the step’s source port. The only thing you need to take into account is the name/path of the entries in the archive; for this, the relative-to option is important.

Pipeline document:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <p:input port="source" sequence="true">
    <p:document href="in1.xml"/>
    <p:document href="test/in2.xml"/>
  </p:input>
  <p:output port="result"/>

  <p:variable name="relative-to" select="resolve-uri('.', static-base-uri())"/>

  <p:archive relative-to="{$relative-to}"/>

  <p:store href="tmp/result.zip"/>
  <p:archive-manifest relative-to="{$relative-to}"/>

</p:declare-step>

Result document:

<c:archive xmlns:c="http://www.w3.org/ns/xproc-step">
   <c:entry name="in1.xml"
            content-type="application/xml"
            href="file:/…/…/in1.xml"
            method="deflated"
            size="92"
            compressed-size="81"
            time="2024-12-02T13:40:20+01:00"/>
   <c:entry name="test/in2.xml"
            content-type="application/xml"
            href="file:/…/…/test/in2.xml"
            method="deflated"
            size="99"
            compressed-size="85"
            time="2024-12-02T13:40:20+01:00"/>
</c:archive>

  • The pipeline’s input consists of two documents, in1.xml and test/in2.xml. Note that (because the p:document/@href attributes have relative values) the paths to these documents are relative to the location of the pipeline itself.

  • When we construct an archive, we usually don’t want the full on-disk paths of the files to end up in the archive. In this case we choose to use their paths relative to the pipeline. To achieve this we need the path (directory) where the pipeline is stored. This is computed with the expression resolve-uri('.', static-base-uri()) and stored in the relative-to variable.

  • We then create the archive using p:archive. The two input documents appear on its source port. We do not provide a manifest on the manifest port, so one will get constructed automatically.

  • The names of the entries in the resulting archive get constructed by “subtracting” the value of the relative-to option from the base URIs of the source documents. The results are their names relative to the pipeline’s location.

  • We store the resulting zip and, just to show you what’s inside, ask for an archive manifest using the p:archive-manifest step.

Using the report port

The p:archive step also has a report port that outputs the manifest of the resulting archive. So, building on the Basic usage example, we could also have shown what’s inside the created archive like this:

Pipeline document:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <p:input port="source" sequence="true">
    <p:document href="in1.xml"/>
    <p:document href="test/in2.xml"/>
  </p:input>
  <p:output port="result" pipe="report@create-archive"/>

  <p:variable name="relative-to" select="resolve-uri('.', static-base-uri())"/>

  <p:archive relative-to="{$relative-to}" name="create-archive"/>

  <p:store href="tmp/result.zip"/>

</p:declare-step>

Result document:

<c:archive xmlns:c="http://www.w3.org/ns/xproc-step">
   <c:entry href="file:/…/…/in1.xml" name="in1.xml"/>
   <c:entry href="file:/…/…/test/in2.xml" name="test/in2.xml"/>
</c:archive>

Note that the information in the manifest is less than what p:archive-manifest produces. What exactly happens here is implementation-defined and therefore dependent on the XProc processor used.

Using a manifest

This example creates a manifest that references an additional file for the archive. Note that in the archive we give it a different name than its source, using the c:entry/@name attribute. When the manifest is processed, p:archive notices that test/in2.xml is not on its source port and therefore loads it from disk.

Pipeline document:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0" name="example">

  <p:input port="source" href="in1.xml"/>
  <p:output port="result"/>

  <p:identity name="manifest">
    <p:with-input>
      <c:archive xmlns:c="http://www.w3.org/ns/xproc-step">
        <c:entry name="test/extra.xml" href="test/in2.xml"/>
      </c:archive>
    </p:with-input>
  </p:identity>

  <p:variable name="relative-to" select="resolve-uri('.', static-base-uri())"/>
  <p:archive relative-to="{$relative-to}">
    <p:with-input pipe="source@example"/>
    <p:with-input port="manifest" pipe="result@manifest"/>
  </p:archive>

  <p:store href="tmp/result.zip"/>
  <p:archive-manifest relative-to="{$relative-to}"/>

</p:declare-step>

Result document:

<c:archive xmlns:c="http://www.w3.org/ns/xproc-step">
   <c:entry name="test/extra.xml"
            content-type="application/xml"
            href="file:/…/…/test/extra.xml"
            method="deflated"
            size="60"
            compressed-size="47"
            time="2024-09-03T10:36:32+02:00"/>
   <c:entry name="in1.xml"
            content-type="application/xml"
            href="file:/…/…/in1.xml"
            method="deflated"
            size="92"
            compressed-size="81"
            time="2024-12-02T13:40:20+01:00"/>
</c:archive>

Additional details

  • The only document-property for the document appearing on the result port is content-type (its value depending on the archive’s format). Note that it has no base-uri document-property and that no document-properties from the documents on the source or archive ports survive.

  • Documents appearing on the source port must have a base-uri document-property. All these base-uri document-properties must have a unique value.

  • A relative value for the relative-to option is resolved against the base URI of the element in the pipeline it is specified on. In most cases this will be the path of the pipeline document.

  • The only format this step is required to handle is ZIP. The ZIP format definition can be found here.
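The requirement that every document on the source port carries a unique base-uri document-property matters most for documents created inside the pipeline. The following is a minimal sketch of one way to handle this, using p:set-properties to assign an explicit base URI before archiving. The URI file:///some/path/generated/note.xml and the note element are hypothetical, and the sketch assumes the processor accepts a string value for the base-uri property.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

  <p:output port="result"/>

  <!-- A document created inside the pipeline -->
  <p:identity>
    <p:with-input>
      <note>Generated inside the pipeline</note>
    </p:with-input>
  </p:identity>

  <!-- Give it an explicit, unique base URI (other properties are kept) -->
  <p:set-properties
      properties="map{'base-uri': 'file:///some/path/generated/note.xml'}"/>

  <!-- The base URI starts with relative-to, so the remainder becomes the entry name -->
  <p:archive relative-to="file:///some/path/"/>

  <p:archive-manifest/>

</p:declare-step>
```

Following the rules for the relative-to option, the entry name should come out as generated/note.xml. Nothing needs to exist on disk at that URI, because the document is taken from the source port.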

Errors raised

| Error code | Description |
| --- | --- |
| XC0079 | It is a dynamic error if the map parameters contains an entry whose key is defined by the implementation and whose value is not valid for that key. |
| XC0080 | It is a dynamic error if the number of documents on the archive port does not match the expected number of archive input documents for the given format and command. |
| XC0081 | It is a dynamic error if the format of the archive does not match the format as specified in the format option. |
| XC0084 | It is a dynamic error if two or more documents appear on the p:archive step's source port that have the same base URI or if any document that appears on the source port has no base URI. |
| XC0085 | It is a dynamic error if the format of the archive does not match the specified format, cannot be understood, determined and/or processed. |
| XC0100 | It is a dynamic error if the document on port manifest does not conform to the given schema. |
| XC0112 | It is a dynamic error if more than one document appears on the port manifest. |
| XC0118 | It is a dynamic error if an archive manifest is invalid according to the specification. |
| XD0011 | It is a dynamic error if the resource referenced by the href option does not exist, cannot be accessed or is not a file. |
| XD0064 | It is a dynamic error if the base URI is not both absolute and valid according to RFC 3986. |

Reference information

This description of the p:archive step is for XProc version: 3.1. This is a required step (an XProc 3.1 processor must support this).

The formal specification for the p:archive step can be found here.

The p:archive step is part of categories:

The p:archive step is also present in version: 3.0.