Submission guide
Audience: archivists, system administrators, developers
Terms appearing in bold are referenced in the glossary.
Archival process overview
Pocket Archive receives new contents, and updates to existing contents, via submissions. A submission is an individual contribution to the archive that can add, update, or delete resources, or do any combination of these. A submission may include multiple resources, which can be related but do not have to be.
- Archivist selects and lays out digital resources to be archived on their own workstation.
- Archivist creates a laundry list that includes an inventory of the resources and their metadata. This, together with the files and folders previously prepared, constitutes the SIP.
- Archivist transfers the SIP to the Drop box: first the files and folders, then the laundry list.
- Upon receipt of the laundry list, Pocket Archive processes the incoming materials and archives them.
- Pocket Archive generates a report after the process is complete (regardless of whether it was successful or failed).
- Depending on setup, Pocket Archive may delete the SIP from the Drop box if the submission succeeded.
- Depending on setup, Pocket Archive may (re-)generate the static site.
- If the archivist wants to update the archived resources, they can request a full copy of the SIP (or, to update only metadata, just the laundry list), edit it and/or replace files, and re-submit the new SIP.
- The archivist can remove a resource and, optionally, all its members at any time.
Processing of the SIP (point 4 above) either succeeds or fails as a unit. This means that a submission will never perform only a part of the task that it is meant to do. This is called an atomic operation and it is designed to ensure consistency of the data.
Individual steps are described in detail in the following chapters.
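The transfer order in the overview (data first, laundry list last) can be sketched as follows. This is only an illustration: it assumes the Drop box is reachable as a local or mounted directory, and the function name `transfer_sip` and its parameters are hypothetical, not part of Pocket Archive.

```python
import shutil
from pathlib import Path


def transfer_sip(source_dir: Path, laundry_list: Path, drop_box: Path) -> Path:
    """Copy the SIP data first, then the laundry list, into the drop box.

    Assumption: the drop box is a plain directory on a local or mounted
    file system. Adapt this to whatever transfer method your setup uses.
    """
    # Step 1: copy files and folders. Processing is only triggered by the
    # arrival of the laundry list, so the data must be complete beforehand.
    shutil.copytree(source_dir, drop_box / source_dir.name)
    # Step 2: copy the laundry list last, once all the data is in place.
    return Path(shutil.copy2(laundry_list, drop_box / laundry_list.name))
```

Sending the laundry list last matters because, as described above, its receipt is what starts processing.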
Submission Information Package structure
A submission is performed by preparing a Submission Information Package, or SIP, and sending it to Pocket Archive for processing. A SIP consists of data, i.e. files optionally arranged in a curator-defined folder hierarchy, and metadata, gathered in a single file called a laundry list.
A working SIP example including files and a laundry list, used for testing, is available as a quick reference (note: the CSV file is currently displayed as a raw file. To view it as a spreadsheet, download it and open it with LibreOffice or another spreadsheet editor). Other examples are illustrated further down in this document.
As the above life cycle chart shows, the SIP is a disposable artifact. Once it is successfully archived, it can be deleted. The full SIP can be regenerated by the archive and retrieved at a later time.
The original files on the archivist's workstation can optionally be kept and/or copied to local storage. This is strongly recommended, at least until Pocket Archive reaches a stable status and can be exclusively relied on for long-term preservation. More copies means more chances to recover data from corruption or loss, but it also means higher storage costs.
Source file & folder layout
Preparation of the SIP begins with selecting the materials to submit. Generally, it is good practice to select a group of artifacts more or less related to one another, e.g. a small coherent collection, or a day's work within a large collection that may take long to complete. It is not critical to get this part perfectly right, as more can be added to the archive at a later time. It is more important to keep submissions not too large, as a single malformed element can cause the whole submission to fail, and not too small, to avoid too many iterations that can become confusing. Submissions of tens to hundreds of files are in a quite safe range.
The arrangement of files and folders is important; the ordering of elements in a folder is less so. A file or sub-folder inside a parent folder creates a membership relationship between the two, so that, for example, one can create the following structure:
my_collection
 |
 `- artifact1
 |   |
 |   `- file1.tiff
 |   |
 |   `- file2.pdf
 |
 `- artifact2
     |
     `- file3.mpg
This creates a collection, my_collection, with two members, artifact1 and artifact2, the former containing file1.tiff and file2.pdf, and the latter containing file3.mpg.
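To make the membership rule concrete, the following sketch walks a SIP folder and lists the direct members that each folder would implicitly get. It is an illustration only; `implicit_members` is a hypothetical helper, and Pocket Archive's own logic may differ in detail.

```python
from pathlib import Path


def implicit_members(sip_root: Path) -> dict[str, list[str]]:
    """Map each folder (path relative to the SIP) to its direct children."""
    members: dict[str, list[str]] = {}
    for folder in sorted(p for p in sip_root.rglob("*") if p.is_dir()):
        key = folder.relative_to(sip_root).as_posix()
        # Every file or sub-folder directly inside a folder is a member of it.
        members[key] = sorted(
            child.relative_to(sip_root).as_posix() for child in folder.iterdir()
        )
    return members
```

For the structure above, the entry for my_collection would list my_collection/artifact1 and my_collection/artifact2, and each artifact would list its own files.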
Ordering of the files or folders in a SIP is defined in the laundry list, as we will see further down, so using file names to force a certain order is not necessary (however, it can provide a good starting point for large lists of files or folders under a parent).
Some file and folder structure will also be used in future versions of Pocket Archive to create more metadata, but at the moment this is not implemented.
Empty folders can be created and submitted: they can be used as placeholders for resources that have no files directly related. But the same effect can be obtained by other means with the laundry list.
Laundry list
Once the selection of files to be included in the SIP is complete, a laundry list is compiled. As the name suggests, this is basically an inventory of all the resources that go into the submission; but it provides much more information than that, by defining metadata and relationships between resources.
The laundry list is a CSV file.
Laundry lists may be edited in any application that supports CSV reading and writing. Care must be taken to export the file to CSV. In LibreOffice, for example, "Save" writes the file in .ods format, which is not usable as a laundry list. The spreadsheet must instead be exported in .csv format.
Multi-sheet documents
Many spreadsheet applications allow grouping multiple tables or sheets in one file. CSV supports only one table per file. While some may find it convenient to keep multiple laundry lists in one spreadsheet file, one must take care to export each sheet individually as a CSV file.
Laundry list format
The first row of a laundry list is reserved for the header, which indicates the field names. These can be in any order, but following a specific order is recommended. The order used in this document and in all laundry lists automatically generated by Pocket Archive is: content_type, id, source_path, and then all ordinary fields in alphabetical order.
Each subsequent row represents a resource (except in the multi-value case, described below). The content_type field is mandatory for each resource. The source_path field is only mandatory for files. All other fields are optional for the submission; however, some schema definitions may impose constraints on them, so that certain fields are required or at least strongly recommended. This depends on the content model used.
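For those who prefer to generate laundry lists with a script rather than a spreadsheet editor, here is a minimal sketch using Python's standard csv module. It applies the recommended column order (special fields first, then the rest alphabetically). The helper names `ordered_header` and `write_laundry_list` are hypothetical, and the field names other than the three special ones are taken from the examples in this guide, not from any particular content model.

```python
import csv
import io

SPECIAL_FIELDS = ["content_type", "id", "source_path"]


def ordered_header(field_names):
    """Special fields first, then all other fields in alphabetical order."""
    ordinary = sorted(set(field_names) - set(SPECIAL_FIELDS))
    return SPECIAL_FIELDS + ordinary


def write_laundry_list(rows, out):
    """Write resources (a list of dicts, one per row) as CSV to a text stream."""
    fields = ordered_header({name for row in rows for name in row})
    # restval="" leaves a cell empty when a row has no value for a field.
    writer = csv.DictWriter(out, fieldnames=fields, restval="")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_laundry_list(
        [
            {
                "content_type": "still_image",
                "id": "Sg9hYIISjRjlkP62",
                "source_path": "my_collection/artifact1",
                "label": "My first deposited artifact",
            }
        ],
        buffer,
    )
    print(buffer.getvalue())
```

When writing to a real file instead of a StringIO buffer, open it with newline="" as the csv module documentation recommends.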
Fields with a special meaning
- content_type: mandatory, single-valued. It defines the content type assigned to the resource. For files, it must be file or a sub-type thereof, except for inferred resources (see below). For folders, it must not be file or a sub-type thereof. Consult the content model of your archive for a list of defined type names.
- id: optional, single-valued. If provided, it becomes the primary identifier for the resource, which is used anywhere information about the resource is retrieved. The IDs generated by default by Pocket Archive are 16-character random strings containing only uppercase and lowercase letters and digits. The depositor is responsible for ensuring that the provided ID is unique across the system. If left blank, the system generates an identifier that is guaranteed to be unique. However, re-submitting the laundry list with the same blank field will create a duplicate resource.
- source_path: mandatory for files, single-valued. It refers to the file or folder path relative to the package, using forward slash (/) characters to separate folders and subfolders or files.
- has_member: this behaves like all normal properties, but it has a special meaning when deleting resources. If the --members option is provided, resources linked via the has_member property to the resource being deleted are also deleted, along with their own members, recursively. See the "Deleting resources" section below.
Note: when a field is defined as "mandatory" above, this is intended per-resource. If the resource spans multiple rows, as when it has multi-valued fields, a mandatory field is only required to have a value on the first row of the resource.
Example of a table representing an artifact with two files:
content_type | id | source_path | creation_date | label |
---|---|---|---|---|
still_image | Sg9hYIISjRjlkP62 | my_collection/artifact1 | 12-07-2002 | My first deposited artifact |
still_image_file | 7hic19YTXA8Fudxo | my_collection/artifact1/file1.tiff | 09-22-2025 | |
still_image_file | Z509TdNhpTjPYDS4 | my_collection/artifact1/file2.pdf | 09-23-2025 | |
Note the difference between the still_image and the still_image_file resources. We will get back to it further down.
Multi-valued fields
Some fields may allow multiple values. To provide multiple values for one or more fields, additional values are added in the rows below the first. For these additional rows, the special fields content_type, id, and source_path must not be filled.
Example of a table with a single resource with multi-valued fields:
| content_type | id | source_path | alt_label | description | label |
|---|---|---|---|---|---|
| still_image | Sg9hYIISjRjlkP62 | my_collection/artifact1 | An alternative label | A description of the artifact goes here. | This is the title and must have only one value. |
| | | | You can have as many as you like of these | Another description goes here. | |
| | | | FREE alt labels! (as long as supplies last) | | |
The submission process checks whether the content_type field is filled in to determine whether a row in the table is a continuation of the previous one, adding multiple values. Having a row without content_type and with id and/or source_path is considered an error.
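The continuation rule can be illustrated with a short parser. This is a sketch of the rule as described in this section, not Pocket Archive's actual implementation; `group_resources` is a hypothetical name, and the reported row numbers assume that no cell contains a line break.

```python
import csv
import io


def group_resources(csv_text):
    """Group laundry list rows into resources with multi-valued fields."""
    resources = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row 1 is the header, so data rows are counted from 2.
    for row_no, row in enumerate(reader, start=2):
        if row.get("content_type"):
            # A filled content_type starts a new resource.
            resources.append({name: [] for name in row})
        elif row.get("id") or row.get("source_path"):
            raise ValueError(f"row {row_no}: id or source_path without content_type")
        elif not resources:
            raise ValueError(f"row {row_no}: continuation row before any resource")
        # Non-empty cells add values to the current (latest) resource.
        for name, value in row.items():
            if value:
                resources[-1][name].append(value)
    return resources
```

Applied to the table above, this yields a single resource whose alt_label field holds three values, description two, and label one.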
Ordering
The ordering of rows in a laundry list determines the ordering of the resources in their container. The system automatically assigns an order to the resources, using their source path and their position in the laundry list. Resources at the top level, i.e. directly under the SIP folder, are not assigned an order, as they are considered self-standing. If an order is needed for those, the pas:next property can be set to the desired resource (see the point below about relationships), or they can be put in an enclosing folder that acts as a collection.
Relationships can be established between resources. These are stored as persistent links and appear as hyperlinks in the discovery interface. A relationship can only be set for a field that is configured as "resource" type. Consult your content model to find which properties are relationships.
To set a relationship with a resource in the same laundry list that doesn't have an explicit ID set, insert the source path of the resource. For a resource that already has an ID, either because one was assigned manually or because the resource is already deposited, insert the ID string.
Example table with implicit and explicit relationships, some path-based and some ID-based:
| content_type | id | source_path | has_member | label |
|---|---|---|---|---|
| collection | p9tXQGBb9iC7xEqm | my_collection-1 | | This collection has implicit members from the folder hierarchy. |
| still_image | KHwYidw4R7xUAEMN | my_collection-2/image001 | | Resource with an explicit ID. The ID can be used in a reference. |
| text | | my_collection-2/text0001 | | Resource without explicit ID. It can be referenced by source_path. |
| collection | EUXRg9igmU9ouzVH | my_collection-2 | p9tXQGBb9iC7xEqm | This collection has explicit member relationships. |
| | | | my_collection-2/text0001 | |
When the laundry list is processed for submission, the path-based references are replaced with IDs, which are automatically generated where not provided. Therefore, a laundry list generated from archived resources may look different from the original one. The generated laundry list should be used for re-submission.
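The replacement of path-based references can be pictured roughly as follows. This is a simplified sketch under stated assumptions: resources are plain dictionaries, only has_member is treated as a relationship field, and the ID generator is passed in by the caller. The function name `resolve_references` is hypothetical and does not reflect Pocket Archive internals.

```python
def resolve_references(resources, new_id, link_fields=("has_member",)):
    """Assign missing IDs, then replace path-based references with IDs.

    resources: list of dicts with "id", "source_path" and list-valued
               relationship fields (assumed structure).
    new_id:    a callable returning a fresh unique ID.
    """
    path_to_id = {}
    for res in resources:
        if not res.get("id"):
            res["id"] = new_id()
        if res.get("source_path"):
            path_to_id[res["source_path"]] = res["id"]
    for res in resources:
        for field in link_fields:
            if field in res:
                # Known source paths become IDs; ID references are kept as-is.
                res[field] = [path_to_id.get(ref, ref) for ref in res[field]]
    return resources
```

This also shows why a laundry list generated from the archive differs from the original: every path-based reference and every blank ID has been replaced with a concrete ID.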
Resource types and sub-types
This chapter is a very concise introduction to content modeling in Pocket Archive, which is treated in detail in the Content modeling introduction. It is strongly recommended to read that guide before archiving resources in earnest.
The three main resource types found in a submission are: Artifact, File, and Brick. See the Content modeling introduction for more information about these.
These three key content types are seldom used as-is. They usually have sub-types, which are defined in the content model. See the content modeling guide for more information about sub-types.
Types provided by Pocket Archive may have similar names but different uses. For example, the still_image type, a sub-type of artifact, designates a visual object, e.g. a photograph. still_image_file may be the capture (e.g. scan) of that object, but also the capture of a text artifact if it is the scan of a book page.
See the provided sample laundry list for examples of artifacts, files, and bricks making up a two-sided postcard. (Note: you may need to download the file and open it with a spreadsheet editor. The current platform shows the raw file.)
Submission ID and submission name
Each submission gets a randomly generated ID when it starts. This ID is attached to all the resources in the submission. This makes it easier to find out later on when and how a certain resource was submitted. It also makes it possible to generate a laundry list that contains all the resources of the original submission.
The ID is automatically generated and system-controlled. It cannot be changed.
A submission can also have a name, which is optional and user-defined. The submission name is determined by the file name used for the laundry list. E.g. pkar_submission-my_new_collection.csv will use my_new_collection, i.e. the text between pkar_submission- and .csv, as the submission name. Submission names are not required to be unique. Of course, the laundry list file names must be unique in the drop box they are deposited to.
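The naming rule can be expressed in a few lines. This sketch only mirrors the rule stated above; `submission_name` is a hypothetical helper, and returning None for file names that do not match the pattern is an assumption, since this guide does not specify how such files are handled.

```python
from pathlib import Path

PREFIX = "pkar_submission-"
SUFFIX = ".csv"


def submission_name(laundry_list_path):
    """Return the text between the prefix and the extension, or None."""
    file_name = Path(laundry_list_path).name
    if file_name.startswith(PREFIX) and file_name.endswith(SUFFIX):
        # An empty name (e.g. "pkar_submission-.csv") is treated as no name.
        return file_name[len(PREFIX):-len(SUFFIX)] or None
    return None
```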
Update
A submission is also used to update existing resources. Each resource update is a full replacement of all the resource's metadata, so a submission must include a full representation of each of the resources updated.
To facilitate this task while avoiding the need to hold on to all of the archive's laundry lists, Pocket Archive can generate a laundry list for one or more selected resources. This list, which represents the current state of the resources requested, can be edited and re-submitted for an update.
Advanced techniques
Some hidden tricks can be employed to facilitate the creation and management of larger submissions.
Implicit resources
TODO
Bulk ID generation
As mentioned before, explicitly adding IDs in a laundry list simplifies later editing and management. However, this is one of the most tedious parts of creating a laundry list.
Fortunately, such repetitive and error-prone tasks can be easily automated with tools provided by most spreadsheet applications. A macro (a mini-program that runs in an application) for LibreOffice Calc is provided here to automatically generate 16-character IDs for all the cells selected in a table.
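For those working with scripts rather than a spreadsheet, the same idea can be sketched in Python. This is not the provided macro, only an illustration that follows the ID format described earlier (16 characters, letters and digits). The names `new_id` and `fill_missing_ids` are hypothetical. Random generation makes collisions extremely unlikely but does not check the archive, so uniqueness remains the depositor's responsibility.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def new_id(length=16):
    """Return a random string of uppercase and lowercase letters and digits."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def fill_missing_ids(rows):
    """Give an ID to every row that starts a resource and lacks one.

    rows: list of dicts, one per laundry list row. Continuation rows
    (blank content_type) are left untouched, as they must not carry an ID.
    """
    for row in rows:
        if row.get("content_type") and not row.get("id"):
            row["id"] = new_id()
    return rows
```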
Deleting resources
Although some archivists argue against deleting anything from an archive, Pocket Archive acknowledges that in real life things may actually need to be removed. The cause may be a duplicate, or something that was not supposed to be archived, etc. In any case, the resource-conservative alignment of Pocket Archive supports deleting resources immediately and irreversibly.
A resource can be deleted via the pkar remove CLI command, or by uploading a special file to the drop box, named pkar_remove* (the asterisk means zero or more characters; note that the file name does not need an extension). The delete file must be a list of archival IDs, in the short URI form (par:<ID>), one per line.
If pkar_watch, the process watching the drop box, was started with the -r option, all members of the resources are recursively deleted (this also includes members of members).
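A delete file in the format described above can be produced with a few lines of code. This sketch assumes the drop box is a local or mounted directory; `write_remove_file` and its `suffix` parameter are hypothetical. Since deletion is immediate and irreversible, review the ID list carefully before writing it to a live drop box.

```python
from pathlib import Path


def write_remove_file(ids, drop_box, suffix=""):
    """Write a pkar_remove* file listing one par:<ID> per line.

    ids:      iterable of bare resource IDs (without the "par:" prefix).
    drop_box: directory watched by Pocket Archive (assumed to be local).
    suffix:   optional text appended to "pkar_remove" to keep names unique.
    """
    target = Path(drop_box) / f"pkar_remove{suffix}"
    lines = [f"par:{res_id}" for res_id in ids]
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target
```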