Pocket Archive glossary

Audience: all users

WORK IN PROGRESS

This document is a glossary of terms specific to Pocket Archive and to mainstream archival, cataloging, and IT practices. The former are prefixed by an asterisk: *. Terms used in descriptions that also have a definition in this document are written in bold.

Archival master file

A file fit for conservation purposes. This is usually the version of a digital capture that contains the most information, and which can be used to generate other derivatives, most notably, a production master file.

This is a standard definition of the US Federal Agencies Digital Guidelines Initiative (FADGI).

*Artifact

A human-made object with cultural value. It can be a physical object, such as a book, a sculpture, or a document, or digital data (e.g. a born-digital photograph or video clip, or a software application). It roughly corresponds to the Intellectual Entity concept in the PREMIS data dictionary for preservation metadata.

Atomic, atomicity

An atomic operation is an operation on data that either succeeds completely or fails completely, never leaving the data in a partially changed state. Complex data structures can be handled via atomic operations, if the system supports them, to ensure that at each step of a transfer process all parts of the data structure are intact.
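
As an illustration only (this is not Pocket Archive's own code), the sketch below shows one common way to make a file write atomic: the data is written to a temporary file first, and then renamed over the target in a single step. The function name atomic_write is hypothetical.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data to path so that readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory, hence the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace renames in one step; on POSIX systems this is atomic.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, discard the partial file; the target is left untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process is interrupted before the rename, the target file still holds its previous contents in full.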

Checksum

A sequence of bytes, usually visualized as an alphanumeric sequence (e.g., blake2:e974d0e881f151ee293519e[…]), that represents the "fingerprint" of a digital file. Many algorithms are available for generating a checksum for a file, but for each algorithm, a file has only one checksum. If even one bit changes in the file, the checksum changes completely. It is a fundamental tool for digital preservation, as it can easily indicate if a file has changed on the storage medium (due to storage corruption) or in transit (due to network glitches), or if it may have been forged.

Pocket Archive calculates and stores checksums using the BLAKE2b algorithm, which is less popular than others, but very fast and secure. In future releases, it may support multiple algorithms.
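
A minimal sketch of how a BLAKE2b checksum can be calculated with Python's standard hashlib module. The function name and the chunk size are illustrative. The digest length and the "blake2:" prefix used by Pocket Archive are not reproduced here, so the output is not guaranteed to match what the archive stores.

```python
import hashlib


def blake2b_checksum(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal BLAKE2b digest of the file at path."""
    digest = hashlib.blake2b()  # default digest size: 64 bytes
    with open(path, "rb") as f:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running the function twice on an unchanged file returns the same string. Changing a single byte of the file yields a completely different one.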

*Brick

*Collection

In Pocket Archive, a Collection is a group of artifacts that are logically related in one way or another. There is no well-defined rule for how collections should be arranged; that is a curatorial decision based on the materials at hand and the target audience.

Collections can contain other collections, other structural Bricks, or artifacts. They are presented in a special way, in that they can have a long description stored in a separate file, which can provide a rich presentation for the collection's landing page.

Any resource can belong to multiple collections. Laundry lists have means to indicate implicit memberships (e.g. a file inside a folder), which are added automatically by the submission process, and explicit ones, which are defined via the has_member property.

*Codename

The name used to reference a field in a laundry list by content managers. It is by convention made of lowercase letters, numbers, and underscores, e.g. path_name, submission_date, creator…
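
The naming convention above can be checked with a short regular expression. This is a sketch of the convention as described here, not a rule that Pocket Archive is known to enforce. The function name is hypothetical.

```python
import re

# Lowercase letters, numbers, and underscores only, as per the convention.
CODENAME_RE = re.compile(r"[a-z0-9_]+")


def is_conventional_codename(name: str) -> bool:
    """Return True if name follows the codename naming convention."""
    return CODENAME_RE.fullmatch(name) is not None
```

For example, path_name and submission_date pass the check, while "Path Name" does not.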

*Content model

In Pocket Archive, a content model is the complete set of definitions of all the content types in a Pocket Archive instance, the properties that define them, and how they interact with one another via relationships. It is otherwise known as an ontology.

*Content type

Classification of a resource according to a content model. Each resource in Pocket Archive is assigned one and only one content type.

CSV

Stands for Comma-Separated Values. It is a file format for tabular data, which can be read and edited by several software packages, such as LibreOffice or Google Sheets. Opened in these applications, a CSV looks like a spreadsheet, and can be edited as such. Text formatting, column or row sizes, borders, and similar style features are not supported. CSV is pure data, which is all we are interested in.

Pocket Archive's laundry lists are formatted in CSV. Spreadsheet applications can be used to compile laundry lists, with the caveat that the file must be exported as a .csv file rather than the native spreadsheet application file (usually employed by the "Save" command).
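
To show what "pure data" means in practice, the sketch below reads a tiny CSV with Python's standard csv module. The two column names are borrowed from the codename examples in this glossary, and the rows are invented. The actual layout of a laundry list is defined elsewhere in the Pocket Archive documentation and may differ.

```python
import csv
import io

# Invented sample data; a real laundry list may have different columns.
SAMPLE = "path_name,creator\nphotos/0001.tif,Jane Doe\nphotos/0002.tif,\n"


def read_rows(text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dictionary per row, keyed by column header."""
    return list(csv.DictReader(io.StringIO(text)))


for row in read_rows(SAMPLE):
    print(row["path_name"], "|", row["creator"])
```

An empty cell, such as the creator in the second row, is read as an empty string.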

*Descriptive resource

In Pocket Archive, this is a resource that the system may parse and understand as meaningful data, i.e., as RDF data. The Artifact and Brick content types are descriptive resources. Files, which are opaque resources, are paired with an implicit descriptive resource that presents the file's metadata so that the file can be described and found in searches.

*Drop box

A folder, on a local or remote filesystem, that is being watched by a running Pocket Archive (pkar_watch) instance. Any laundry list that is put into this folder will trigger a submission process.

*Field

The description of a resource property as a column in a laundry list CSV. A field has a name (in the laundry list, the codename) and one or multiple values.

Fixity

The assurance that a digital file is intact and bit-by-bit identical to how it was submitted. Fixity is checked by verifying a checksum.
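
A fixity check boils down to recalculating the checksum and comparing it with the one recorded at submission time. The sketch below assumes that the recorded value is a plain hexadecimal BLAKE2b digest with the default length. How Pocket Archive actually stores and compares checksums is not covered here, and the function name is hypothetical.

```python
import hashlib


def fixity_ok(data: bytes, stored_hex: str) -> bool:
    """Return True if data still matches the checksum recorded earlier."""
    current = hashlib.blake2b(data).hexdigest()
    # Hexadecimal digests may be stored in upper or lower case.
    return current == stored_hex.lower()
```

A False result means that the content is no longer bit-by-bit identical to what was submitted.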

*Laundry list

A CSV file with tabular data listing all the resources included in a submission and their metadata. The Laundry list is produced by the depositor of a SIP and triggers an automatic submission process.

Linked Data, Linked Open Data

Data (in the case of Linked Open Data, published and freely accessible on the Web) in the RDF format. Linked (Open) Data is a popular publishing format among cultural heritage, humanitarian, and scientific institutions, and other organizations that value interoperability and the free exchange of data sets. Linked Data facilitates the aggregation and reconciliation of heterogeneous data sets produced by different sources, by relying on controlled vocabularies and unambiguous, globally unique identifiers.

Markdown

Plain-text writing format that can be converted to HTML or other formatted text by using conventional marks and embedded HTML. Markdown is very popular among technical documentation writers because it doesn't need a specialized application to write. This glossary and the other Pocket Archive documentation are written in Markdown.

Pocket Archive supports writing Markdown documents for its "long description" property that can be used to create content-rich introduction pages for Collections.

Metadata

Literally, data about data. Metadata are administrative and technical information about a physical or digital object that do not constitute the object itself, but are helpful to classify, inventory, find, and relate it.

Namespace

The prefix of a group of UIDs or URIs that is constant for a whole organization or business unit. It is a convention used to separate identifiers into broad categories for administrative purposes. Namespaces are used extensively in RDF and in Pocket Archive; however, they are a more technical aspect of archiving that is not easily visible to occasional users.

Namespaces in RDF can be shortened within a contained system, as they can be lengthy, and the mapping between the short prefix and the full-length namespace is maintained by that system. URIs published on the Web must be either in their fully-qualified form, or accompanied by the namespace mapping in the same document.

E.g.: the URI http://purl.org/dc/terms/contributor can be represented internally in Pocket Archive as dc:contributor, as long as the relation between dc: and http://purl.org/dc/terms/ is registered.

Pocket Archive supports user-defined namespaces and mappings, which can be configured by the archive administrator.
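
The dc: example above can be sketched as a simple lookup table. This illustrates the general idea of a namespace mapping and says nothing about how Pocket Archive stores or configures its own. The names NAMESPACES, expand, and shorten are hypothetical.

```python
# Registered mapping between short prefixes and full-length namespaces.
NAMESPACES = {"dc": "http://purl.org/dc/terms/"}


def expand(short: str, namespaces: dict[str, str]) -> str:
    """Turn a prefixed name such as dc:contributor into a full URI."""
    prefix, sep, local = short.partition(":")
    if not sep or prefix not in namespaces:
        raise KeyError(f"unregistered namespace prefix in {short!r}")
    return namespaces[prefix] + local


def shorten(uri: str, namespaces: dict[str, str]) -> str:
    """Turn a full URI into its prefixed form, if a namespace matches."""
    # Try the longest namespaces first, in case one contains another.
    for prefix, ns in sorted(
        namespaces.items(), key=lambda item: len(item[1]), reverse=True
    ):
        if uri.startswith(ns):
            return f"{prefix}:{uri[len(ns):]}"
    return uri  # no registered namespace applies
```

With this mapping, expand("dc:contributor", NAMESPACES) gives back the full-length URI, and shorten reverses the operation.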

Ontology

See Content model.

*Opaque resource

In Pocket Archive, a digital file preserved in the archive. It is "opaque" in the sense that Pocket Archive is only aware of its presence and fixity, but it doesn't know about its contents. Each opaque resource is accompanied by a descriptive resource that contains its metadata and points to it.

*Presentation

In Pocket Archive, this is the whole package of Web pages, presentation files, and ancillary digital assets that make up a static site generated by Pocket Archive.

Presentation data are disposable and can be regenerated on demand. Pocket Archive does not decide whether or how a presentation should be published, or who has access to it. That is a decision left to the archive owners and system administrators.

*Presentation file

Also known as Derivative file and other names by FADGI. This is a file derived from a production master file that is fit for presentation. It often has a lower quality than its source and uses lossy compression. It does not need to be preserved, as it can be regenerated without manual intervention.

Pocket Archive automatically generates presentation files and thumbnails during its static site generation process.

*Property

A metadata element that can be attributed to a resource. Properties are more or less strictly defined in the content model by the archive administrator and they may have a data type, a cardinality, and a range. See the content model introduction for more information.

*Production master

A file fit for generating presentation files. This is usually a file generated from the archival master file and manually adjusted for presentation, with elements that are important for preservation but not for public display (e.g. color bars or working layers in a still image) removed. Because it is manually adjusted, it should be preserved along with the archival version.

In cases of necessity, the same file may serve both archival and production master roles; however, this is not a recommended practice and is only acceptable when a proper archival master is not (or no longer) available.

Derivatives of this file are usually lower-quality copies that are automatically generated and not preservation-worthy.

This is a standard definition of the US Federal Agencies Digital Guidelines Initiative (FADGI).

*Relationship

A relationship, in Pocket Archive, is a special type of hyperlink that points to a resource managed by Pocket Archive itself. Unlike hyperlinks in the WWW, which do not always own the resource pointed to and do not guarantee its existence, Pocket Archive guarantees the consistency of relationship links.

Resource

A self-standing unit of digital data that can be identified with a URI.

In RDF parlance, "everything is a resource", which means, every unit of information can be represented by a globally unique document on the Web.

In Pocket Archive, the definition of resource is more specific, and it refers to any record individually retrievable in the archive. Every resource is assigned a content type.

RDF

Acronym for Resource Description Framework. It is the data format used for Linked Data.

Pocket Archive uses RDF internally and is able to export RDF for interoperability with external systems. End users and content managers need not be concerned with the internals of RDF, but it is good to have an awareness of the underlying support for this format.

RDF was standardized by the World Wide Web Consortium (W3C), founded by Tim Berners-Lee, the inventor of the World Wide Web, and it is a format expressly made for the WWW. In RDF, everything is a resource, represented by a Web document, that can be identified globally by a URI. This format is particularly fit for aggregating and sharing data sets from heterogeneous sources that may have been cataloged according to different standards, using different tools.

Pocket Archive uses RDF to maintain a flexible method to relate resources together and to facilitate sharing its data in the wild.

Schema (pl. schemata)

The complete set of rules governing a given content type. A schema defines all the properties applicable to a specific type and their constraints. It is written out as a set of files that include the content type in question and all its super-types.

SIP

Submission Information Package: a package of files, folders, and metadata that constitute a complete submission package. A SIP is normally prepared by archivists, either by hand or with the aid of automated tools, and is the first step of the actual archival process.

This is a term of the OAIS standard that defines guidelines for digital archival practices.

Static site

A web site that is made up entirely of static files. This means that all contents of the website are pre-generated and consist of actual files living on a filesystem, in contrast with most modern dynamic sites whose contents are mostly generated on demand by a continuously running process.

While much less flexible than dynamic sites, static sites are still widely used today. Dynamic sites rely on complex, often resource-intensive applications and infrastructure that can be subject to exploits and attacks of all sorts, and on more applications and infrastructure to prevent those attacks. Sites of small to medium size with predefined content can take advantage of a simple and economical static site, which needs only a simple web server to run.

Pocket Archive generates static sites for presentation. It also has the option to generate contents that can be viewed directly on the user's local computer with a web browser, without any web server or even any Internet connection [WIP note: serverless option is not yet implemented].

*Submission

The act of assembling and sending a curated data set to Pocket Archive for archival.

A submission is made up of files, often arranged in folder hierarchies (the data), and an accompanying inventory, or laundry list, that contains the metadata. A submission has a unique identifier that gets assigned to all the resources included in it.

UID

Unique Identifier. Usually, this identifier is intended to be unique only in the system it is working in. By default, Pocket Archive resources are assigned 16-character random strings, prefixed by a namespace to denote a resource. This is sufficient to keep millions of records in the archive without collision (i.e. duplicate IDs).
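
To give a feel for the numbers, the sketch below generates a namespaced 16-character random identifier and estimates the chance of a collision. The alphabet (letters and digits, 62 symbols) and the "ex:" namespace are assumptions made for the sake of the example, and Pocket Archive's actual generator may differ. The estimate uses the common "birthday" approximation n² / (2 · 62^16).

```python
import secrets
import string

# Assumed alphabet: upper- and lowercase letters plus digits (62 symbols).
ALPHABET = string.ascii_letters + string.digits


def new_uid(namespace: str = "ex:", length: int = 16) -> str:
    """Return a random identifier prefixed by a (hypothetical) namespace."""
    return namespace + "".join(secrets.choice(ALPHABET) for _ in range(length))


def collision_probability(n_records: int, length: int = 16) -> float:
    """Approximate chance that any two of n_records random UIDs coincide."""
    return n_records**2 / (2 * len(ALPHABET) ** length)


print(new_uid())
print(collision_probability(10_000_000))
```

Under these assumptions, even ten million records leave a collision chance far below one in a trillion.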

URI

Uniform Resource Identifier. It is a globally unique identifier that is able to pinpoint a specific resource on the WWW. A URI may or may not resolve to an actual location on the Web. URIs are a key component of RDF.

Pocket Archive uses URIs to identify individual resources, metadata properties, content types, and other entities. These are usually hidden from the end user but viewable in the resources' raw data representation.

UUID

Universally Unique Identifier. Similar to a UID, but with a reasonable guarantee of uniqueness in the global space (WWW). Uniqueness is usually guaranteed by a namespace prefix that is a Web domain name owned by the UUID publisher, and/or by a long string of random characters that makes the chance of collision (overlap) small enough to be negligible, or by a progressive sequence controlled by the publishing system.