INTRODUCTION
In this paper I would like to do four things. First I
will present the reader with a brief survey of what can be found
on the Internet. Then I will touch on some problems these
objects present to the cataloger. Next, I will discuss some
developments in the engineering segment of information science.
Finally I will imagine what a working partnership between
engineers, librarians, and business might mean for access to
networked objects.
THE PROBLEM OF GENERA
I first became aware of networked objects by volunteering
to participate in OCLC's (the Online Computer Library Center's)
ground-breaking attempt to assess the fit between the
Anglo-American Cataloguing Rules, 2nd ed., 1988 revision (AACR2R)
and the kind of computer file one comes across while surfing the
Internet. In OCLC's published report of its experiment,
Assessing Information on the Internet (Dillon, p. 20), there is a
list of generic terms for the types of electronic objects that
can be found there. This list includes:
system
source
news
text
PC
data
images
games
executable files
unknown
To the relief of the participants, the OCLC experiment
focused on the approximately 10% of this universe represented by
text files, although other types of files were also included.
Certainly text files are the objects most nearly equivalent to
the majority of objects cataloged under AACR2R. But it is
equally certain that AACR2R is not well enough equipped to handle
many other computer file genera.
Networked sites, for example, are full of unfamiliar and
evolving species, unlabeled hybrids. Directories of addresses
are peppered with information on course listings at unspecified
institutions. These entries then nestle up against entries for
what look like title listings for actual books in actual
libraries. Spelunking the Internet, one imagines that these
ill-defined and poorly organized files are a somewhat strange,
computerized, sedimentary paleolith, because the components do
not naturally form part of one another. There are lots of
fragments on the Internet. There are, for example, recipes, with
a provenance as layered and ghostlike to a novice Internetworker
as their copyright notices are prominent and their punctuation
odd.
Example 1:
.ig
Path: decwrl!recipes
From: liz@unirot (Mamaliz)
Newsgroups: mod.recipes
Subject: Recipe: Orange Pound Cake
Message-ID: <4241@decwrl.DEC.COM>
Date: 18 Jul 86 03:42:03 GMT
Sender: recipes @ decwrl.DEC.COM
Organization: The Soup Kitchen, Edison NJ
Lines: 70
Approved: reid@decwrl.UUCP
Copyright (C) 1986 USENET Community Trust
Permission to copy without fee all or part of this material is
granted provided that the copies are not made or distributed for
direct commercial advantage, the USENET copyright notice and the
title of the newsgroup and its date appear, and notice is given
that copying is by permission of the USENET Community Trust or
the original contributor.
.RH MOD.RECIPES-SOURCE POUND CAKE-1 D "19 May 86" 1986
.RZ "ORANGE POUND CAKE" "A luscious orange-flavored pound cake"
Absolutely the best cake I have ever eaten! I got the recipe from
the mother of a friend. I think Don's mom got the recipe off the
back of a sugar box.
.IH "1 cake"
.IG "1 lb" "butter" "450g"
.IG "1 lb" "powdered sugar" "450 g"
.IG "2 Tbsp" "grated orange rind" "30 ml"
.IG "6" "large eggs"
.IG "3\(12 cups" "sifted all purpose flour" "350 g"
.IG "\(14 tsp" "salt" "1 ml"
...
There are also instructions for mounting, running, and
maintaining both services and executable programs that one does
not always have nor know how to get. Another kind of material is
the networked equivalent of course-packs: a professor selects
portions of other texts and presents them as a discrete package.
There are files whose only title is on the menu through which
they're accessed and files that consist of frequently asked
questions (FAQs) on a variety of themes: sports trivia, veronica
services, AIDS facts, and the Cleveland FreeNet. Finding what
appears to be an excerpt from a newspaper, for which there is no
attribution, no date, and no raison d'etre is not uncommon. All
the context this one is given is that Bert Dalmer wrote it and
that it appears on "pg. 28."
Example 2:
pg. 28. Sluggers bats pound Irish as Sinak slams two homers.
Sports story by Bert Dalmer, pg. 28
It seems the Illini just aren't happy with a .297 team batting
average. Forry Wells and the Illinois baseball team continued to
tear up their opponents with their hitting, shelling Notre Dame's
top two pitchers Tuesday night in an 11-4 win in South Bend, Ind.
As if Tom Sinak's two home runs were not enough, Wells, who leads
the team with a .421 average, put the exclamation point on the
night with a ninth-inning grand slam. The Illini finished the
night with 15 hits.
...
Single poems can also be found. This poem lacks not
only the computer-ese introduction that came with the recipe, but
also the slightest attempt to situate it within print, as did the
story about the Illini. On the other hand, the author's name is
still included.
Example 3:
Winter is icummen in,
Lhude sing Goddamm,
Raineth drop and staineth slop,
and how the wind doth ramm,
Sing: Goddamm.
Skiddeth bus and sloppeth us,
An ague hath my ham.
Freezeth river, turneth liver,
Damn you, sing: Goddamm.
Goddamm, Goddamm, 'tis why I am, Goddamm.
So 'gainst the winter's balm.
Sing goddamm, damm, sing Goddamm,
Sing goddamm, sing goddamm, DAMM.
-Ezra Pound
Because literature is something we catalog, librarians
might take some comfort in knowing that poems can be usefully
identified by their first lines. Making that extra effort might
mean consulting Granger's Index to Poetry or an anthology of this
poet's work to find out whether the poem has a title. This one
does, after all, have one, even if it wasn't included here.
Could it be possible that the record for this poem would be
improved by a uniform title? This line of questioning will be
developed later in the paper, not as an inquiry into poems, and
not merely as an inquiry into titles, but as an inquiry into the
relationship between traditional cataloging rules and the
uncontrolled material one finds on the Internet. Would it be
useful to index this material by its first lines? Are we going
to see a return to the library of the incipits? Should we be
cataloging snippets? AACR2R is basically a tool for the entirety
and is not very good at snippets.
The Internet is not just texts and hypertexts and text
fragments; it is also weather maps and other (potentially compound)
multimedia objects. It is journal illustrations, and soon it
will even be feature films. According to the 5/24/93 New York
Times (curiously enough, this citation comes from a 5/13/93
edition of EDUPAGE) the first Internet cult movie, "Wax: Or the
Discovery of Television Among the Bees" was successfully
digitized into the Internet from a mid-Manhattan video recording
studio on an unspecified Saturday. Although it was only
transmitted at 2 frames per second, the experiment was considered
a success, and more movies are in the works. "Oh good,"
catalogers might say. "We have chapters on maps and movies in
AACR2R."
No one will be surprised to learn that cataloging
Internet objects in the OCLC experiment was exceptionally
difficult. A common difficulty encountered when cataloging a
monograph might be, "How do I properly construct my series
statement?" When cataloging an Internet object, however, the
cataloger is most often challenged even before she begins a
transcription of the elements of the description. The first
challenge is to know and be able to name in a standardized way,
"What is this?"
Example 4:
ABSTRACT 92 A1 V 8192 Trunc=8192 Size =18 Line=9
Col=1 Alt=0
|...+.....1......+......2......+......3......+......4......+.....
.5......+..
....6......+
===***Top of File***
===###<&///|||///&#>>>>\\\\########||||||||//////&&&
===%%%$###++++++&&//////||||||||%%%%#####||||||??**//
===***End of File***
What are the specific material designators that apply to an
object like this? Is there a chapter in AACR2R that covers it?
Because definitions for the generic terms used by OCLC were not
published in "Assessing Information on the Internet", there was
some discussion on Autocat, the cataloging and authorities
listserv, in early 1993 that questioned the actual distinctions
between system files and source files, between executable files
and games, between data files and every other kind of file except
image and audio, and so forth. Surely we do need to do a better
job defining and entitling these generic forms. However, even
within the genus that we are most comfortable with, the text
file, the range of possibility on the Internet is enormous, as
has been illustrated above.
CATALOGING'S CONCERNS
One of the reasons that Internet material is so misshapen is that
it has not been subjected to the rigors of publishing. This
launches a vicious cycle of indescribability. No attempt has
been made to control most of this material because the material
is ephemeral, or it is too poorly put together to afford its
would-be describers any handles. Catalogers continue to resist
drafting viable descriptive conventions for this material because
it is too slippery to be generalized about. The material
continues to be released without any standard with which
to compare itself. The real (non-virtual), public appearance of
commercially available information has long since been shaped to
meet the consumer expectations of users. These expectations
generally include title pages, colophons, summaries, accompanying
manuals, and even statements of responsibility, just like books.
Descriptive conventions for Internet objects are not very well
developed because the objects themselves are created and released
in uncontrolled and unconventional ways.
Networked government documents, technical reports, and
serials can, like their print equivalents, be difficult to
catalog. They are, however, something familiar, something to
which entire chapters of AACR2R are devoted. They are, after
all, serials; or they're functioning like monographs or like
pamphlets, or like graphic materials. When they are seen as
merely the electronic version of something we know about on
paper, the challenges that these things pose to the cataloger do
not seem to thwart the basic cataloging paradigm the way that
other kinds of Internet objects do. The challenge of networked
objects is in the essential mutability of virtual reality, the
chameleon formatting, the effortless changes in location, the
easy effacement of authorship, the transparent refreshment to
accommodate newer platforms, the pointer poised from within
another document that makes the original object a part of a
larger whole.
When one contemplates the current state of the Net,
questions are inevitable: Will the networked versions of these
objects present us with chief sources we can really work from?
Example 5:
; f r o
m
uakari!indri!ames!apple!rutgers!aramis.rutgers.edu!athos.rutgers.
edu!mende
Tue Jul 18 05:54:12 PDT 1989
;Article 69 of comp.emacs:
; p a t
h
:arkl!uakari!indri!ames!apple!rutgers!aramis.rutgers.edu!athos.ru
tgers.edu!m
ende
;>From:mende@athos,rutgers,edu (Bob Mende)
;Newsgroups: gnu.emacs,comp.emacs,alt.sex
;Subject: purity.el (part 1)
;Message-;ID:
;Date: 18 jul 89 04:00:08 GMT
;Organization: Rutgers Univ., New Brunswick, N.J.
;Lines: 603
;Xref: ark1 gnu.emacs:44 comp.emacs:69
;
;Since I have had over 100 requests for this, I am posting
it....enjoy.
;please replace the following three characters
;
; with a real delete
; with a real ctrl-c
; with a real ctrl-s
;
;;
;;Purity.el Emacs lisp program to administer the purity test.
;; Robert Mende (mende@aramis.rutgers.edu)
;; 5/5/89
;;
;;This file is not officially part of GNU Emacs, but can be if
FSF wishes
;;it to be so. Distributed under the GNU copyleft
;;GNU Emacs is distributed in the hope that it will be useful
;;but without any warranty. No author or distributor
;;accepts responsibility to anyone for the consequence of using
it
;;or for whether it serves any particular purpose or works at
all,
;;unless he says so in writing. Refer to the GNU Emacs General
Public
;;License for full details.
...
This was not an easy one to work from. It is doubtful
that "purity.el," a questionnaire about sexual experience,
exists in monograph form somewhere, but let us pretend that it
does. Its title page surely would not resemble this electronic
title screen. On the other hand, if Internet chief sources are
only the electronic equivalents of what would have appeared on a
print version, would they be enough? Probably not. One still
wants to know the formatting history and something about any
editorial changes that might have been made. The
interconnectivity status of the object should be clear. It is
possible, after all, that networked status changes the nature of
an object in ways parallel to the subtle and important ways that
the presence of an observer changes the nature of data observed,
as physicists and anthropologists have known for decades. A
networked computer file is different from a non-networked
computer file, which is in turn different from a print item. It is
more, and it demands more description.
What are the elements we want to include in the chief
source of our users' dreams? That is part of what needs to be
worked out. Appropriate labeling has attracted the attention of
some very important standards-developing bodies. The National
Information Standards Organization has worked on standards for
the construction of periodicals (Z39.1), for headers on
microfiche (Z39.32), and for manufacturer's labels on CD-ROMs
(Z39.68). At its 1993 Midwinter meeting, MARBI, a joint
committee of the American Library Association that concerns
itself with machine readable bibliographic information,
considered Proposal 93-9 about file label specifications for
machine-readable catalog (MARC) records sent according to the
File Transfer Protocol (FTP). When one sends a FAX, one fills
out an accompanying template to send along as an identifier. The
need for a chief source for Internet objects is plain.
Catalogers need a plan of action for convincing the producers of
these objects to provide a chief source for every object on the
Internet.
WELCOME THE ENGINEERS
The Internet Engineering Task Force (IETF) is a group of
engineers, many of whom seem also to have worked on Z39.50
implementations. Clifford
Lynch, in a paper he presented to the IETF last March, said that
two groups had been working on three main problems associated
with accessing networked information. These three problems are
identification, location, and description. One group, made up of
electronic engineers and developers, has focused on structures
that can be used to identify and locate networked objects. The
other group, which he commends for their testing of AACR2R, is
the library community, by which he means the Library of Congress,
MARBI, and OCLC.
The IETF is trying to develop standards that distinguish
between identification and location. The URN
(Universal/Unique/Uniform Resource Number) is meant to identify
an object uniquely by its content. Unfortunately, neither the
library community nor the IETF has, as yet, a consensus on what
constitutes unique object content. Is it one that is bit-for-bit
different from any other object, or can an object's identity
transcend delible manipulation? Does the WriteNow version of a
file differ enough from the ASCII version to require a separate
identifier?
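The two notions of identity at stake here can be put side by side in
a short sketch. It is only an illustration: the function names and
the normalization rule (ignore line endings and trailing blanks) are
assumptions made for this example, not anything proposed by the IETF
or the library community.

```python
import hashlib


def bitwise_id(data: bytes) -> str:
    """Identifier that changes whenever any bit of the file changes."""
    return hashlib.sha256(data).hexdigest()


def normalized_id(data: bytes) -> str:
    """Identifier that ignores line-ending and trailing-space differences.

    The normalization rule is an illustrative assumption only.
    """
    text = data.decode("ascii")
    lines = [line.rstrip() for line in text.splitlines()]
    return hashlib.sha256("\n".join(lines).encode("ascii")).hexdigest()


# Two hypothetical copies of the same text, saved on different platforms.
unix_copy = b"Orange Pound Cake\n1 lb butter\n"
dos_copy = b"Orange Pound Cake\r\n1 lb butter\r\n"

print(bitwise_id(unix_copy) == bitwise_id(dos_copy))        # False
print(normalized_id(unix_copy) == normalized_id(dos_copy))  # True
```

Under the first rule the two copies are two objects; under the
second they are one. Which rule an identifier should follow is
exactly the open question.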
URNs are readily compared to ISBNs and ISSNs. As cited in
MARBI's Discussion Paper 68 (A 007 Physical Description Fixed
Field for Computer Files), the International Serials Data System
(ISDS) Directors feel that separate ISSNs are needed for serials
published in different media. We need a level of consensus as to
the uniqueness or non-uniqueness of a medium that ISDS Directors,
IETF engineers, and struggling Internetworkers can agree to. To
get there, we have to sort through much complexity and we may
have to shatter a lot of tradition. There is nothing traditional
about Internet objects. Is PKZIP, a program that compresses
files, a medium? Does tarring a file make it different from an
untarred manifestation? (Tarring uses the UNIX tar utility to
bundle a group of files into a single archive, which is then often
compressed as a unit rather than file by file.) Questioning the
identity of various manifestations of computer files complements
another semantic debate about whether one can catalog a database
or only the implementation of a database. Arguments have been
made that, since the database is never available except through
its implementation (GEAC, NOTIS, etc.), one has no choice
but to catalog the implementation. Counter-arguments have been
made that cataloging the Platonic form of a database once will
allow an infinity of other catalog records for the
implementations to be somehow associated with that record. These
questions of identity and difference need to be resolved so that
identifiers can be constructed for unique objects. The IETF sees
identifiers as permanent. They are not substitutes for locators.
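The question about compressed and tarred manifestations can be made
concrete with a minimal sketch, assuming Python's standard gzip and
tarfile modules stand in for programs like PKZIP and tar. The
filename "recipe.txt" is hypothetical.

```python
import gzip
import io
import tarfile

original = b"Absolutely the best cake I have ever eaten!\n"

# Compress the bytes; the stored form differs from the original.
compressed = gzip.compress(original)

# Bundle the same bytes into an in-memory tar archive.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    info = tarfile.TarInfo(name="recipe.txt")  # hypothetical filename
    info.size = len(original)
    archive.addfile(info, io.BytesIO(original))
tarred = buffer.getvalue()

# Recover the content from each manifestation.
from_gzip = gzip.decompress(compressed)
with tarfile.open(fileobj=io.BytesIO(tarred), mode="r") as archive:
    from_tar = archive.extractfile("recipe.txt").read()

print(compressed == original, tarred == original)   # False False
print(from_gzip == original, from_tar == original)  # True True
```

Each packaged form is a different string of bits, yet each yields
the original exactly. Whether that makes three objects or one is a
matter for the identifier rules, not for the software.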
The IETF locator is often referred to as the URL
(Universal/Unique/Uniform Resource Locator). The URL is still
in development, but it is safe to say that it will probably not
be much different from the kind of network address one is used to
seeing. Unlike the URN, however, the URL is not necessarily
complete or permanent with regard to any particular object. Objects
can be moved away from and into the space once occupied by
another object. An object may reside at multiple locations. The
syntax for FTP type objects is fairly straightforward: service
identifier (such as TELNET, FTP, etc.) followed by a protocol to
be used to retrieve particular objects. Some kind of registry
service is envisioned that will keep service protocols
standardized.
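As a rough illustration of what a locator carries, the sketch below
splits one into its service, host, path, and name. It assumes the
service://host/path form that Python's standard urllib.parse module
understands; since the syntax is described above as still in
development, that form is an assumption, and the address used is
hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def split_locator(locator: str) -> dict:
    """Break a locator into service, host, path, and name."""
    parts = urlsplit(locator)
    directory, name = posixpath.split(parts.path)
    return {
        "service": parts.scheme,
        "host": parts.hostname,
        "path": directory,
        "name": name,
    }


# Hypothetical address, used only for illustration.
print(split_locator("ftp://ftp.example.edu/pub/recipes/pound-cake.txt"))
```

Nothing in the result says what the object is, only where a client
should look for it and by which service.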
THE CATALOGING/ENGINEERING HANDSHAKE
Lynch sees the library community's foray into locator
structures, like the USMARC 856 field and the NOTIS system's A22
field, as a transitional development that mirrors some
encoding problems encountered in the IETF's own early proposals
for locator syntax. (Lynch 1993,9) Hard questions are now
being asked about locator structures for networked objects, many
of them left unanswered for the present: Is a locator structure
really the place for file size or compression information?
Although each file should surely have facts like this written
into itself so that they can be referred to by users, this type
of information is not intrinsically part of an address or a
location at all. Should the locator structure be
self-referential; should it be human readable? What is the
level of granularity that a locator structure needs to
accommodate? Title level? Article level? Paragraph level? To
locate an object, users will need at a minimum its host, path, and
name. Other things that users may want the object to
inform them about before they go to the trouble of
retrieving it are: information on the last update time, the
number of links to that item in a gopher network, the names of
veronica servers referencing the item, and so forth. These are
valuable data indeed, but not as part of an artificially
cluttered locator.
In discussing the idea of a name, which to a cataloger
reads more like the idea of a title, Tim Berners-Lee of the IETF
URL Working Group says that "The life of a name is limited by
any information contained within it that may become prematurely
invalid. It is therefore necessary to limit the contents of a
name to the information required for [allowing a 'client' program
to retrieve or operate on objects via a 'server' program]. Other
extraneous information about an object (its size, data format,
authorization details, etc.) may change with time and
shouldn't be part of the name. One might expect such information
to be part of the 'header'... and for the header to be able to be
retrieved independently of the object." (Lynch 1993,4)
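The separation described in this passage can be sketched as two small
records: a name that holds only what a client needs for retrieval,
and a header that holds the facts that change. The class names,
field names, in-memory stores, and sample values below are all
hypothetical, chosen only to illustrate the division of labor.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Name:
    """Only what a client program needs to retrieve the object."""
    service: str
    host: str
    path: str


@dataclass
class Header:
    """Changeable facts about the object, kept out of the name."""
    size_bytes: int
    data_format: str
    last_updated: str


# Hypothetical stores standing in for a server.
OBJECTS: Dict[Name, bytes] = {}
HEADERS: Dict[Name, Header] = {}


def publish(name: Name, body: bytes, data_format: str, updated: str) -> None:
    """Store the object and regenerate its header from the object itself."""
    OBJECTS[name] = body
    HEADERS[name] = Header(len(body), data_format, updated)


def fetch_header(name: Name) -> Header:
    """Retrieve the header without retrieving the object."""
    return HEADERS[name]


name = Name("ftp", "ftp.example.edu", "/pub/recipes/pound-cake.txt")
publish(name, b"1 lb butter\n", "ASCII text", "1986-07-18")
publish(name, b"1 lb butter\n1 lb powdered sugar\n", "ASCII text", "1993-10-27")

# The name is unchanged; only the header reflects the revision.
print(fetch_header(name))
```

Because size and date live in the header, revising the object does
not invalidate the name that others have already recorded.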
Lynch encourages the library community to explore the
problems of description (he calls it "content") in a more
fundamental way. (Lynch 1993,12) As a preliminary step, let
us focus the discussion of description on the problem of the
chief source. AACR2R's rule 9.0B1 says, "The chief source of
information for computer files is the title screen(s). If there
is no title screen, take the information from other formally
presented internal evidence (e.g. main menus, program
statements)." AACR2R recognizes that all the information may not
be "available" to the cataloger because she doesn't have
appropriate machines or software, and it makes broad provision
for this circumstance until in the end, if necessary, a cataloger
can use just about any source to catalog a computer file.
However, according to 9.1B3, catalogers are not supposed to use
the filename or data set as the title proper, unless this is the
only name given in the chief source. The OCLC guidelines
(Dillon 1993,B:3) go on to say that to use the filename title,
not only can there be no other title on the chief source, but
that the cataloger must be incapable of supplying a useful title.
It is within this context that a cataloger without the
capability of uncompressing a hex file might use the string
"Resource Info" as the title proper because of the filename
"resource-info-09.hqx", which she can read without acquiring the
software necessary to interpret the hexadecimal characters of the
file itself. Then again, after she gets hold of BinHex, she
probably could see that the title screen reads simply, "Source
Info." Unless catalogers find a benevolent funding source that
can supply them with all the software and computing power they
need, it may make more sense to get rid of 9.1B3 for networked
resources and canonize the filename as title proper.
Conversational names, commercial names, and other natural
language-type names could be recorded as added titles.
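If the filename were to serve as title proper, the derivation could
even be mechanical. The following is a minimal sketch under stated
assumptions: the function name is hypothetical, and the rule of
dropping trailing numeric tokens (taken here to be version or
sequence numbers) is a choice made for this example, not a
provision of AACR2R or the OCLC guidelines.

```python
import os
import re


def title_from_filename(filename: str) -> str:
    """Derive a candidate title proper from a filename."""
    stem, _extension = os.path.splitext(filename)
    words = [word for word in re.split(r"[-_.\s]+", stem) if word]
    # Drop trailing numeric tokens, assumed to be version numbers.
    while len(words) > 1 and words[-1].isdigit():
        words.pop()
    return " ".join(word.capitalize() for word in words)


print(title_from_filename("resource-info-09.hqx"))  # Resource Info
```

A rule this simple would give every cataloger the same title from
the same filename, whatever software she happened to have at hand.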
In a brilliantly argued paper, Preston and Lynch state that
unless network information sources can be "to some extent
self-describing" it is difficult to envision that descriptive
records will ever really be provided for them. Most
organizations that will supply these resources do not have the
"expertise to prepare appropriate descriptive records in the
appropriate standard interchange formats." The Library of
Congress or the various university libraries cannot be relied
upon to supply this cataloging. One alternative is for the
"suppliers of information resources to fund the creation
of...records, or for the overall user community to fund
development of such descriptions as a community benefit." (Lynch
and Preston 1992,3) It seems unlikely that the user community
could become organized, knowledgeable, and munificent enough to
fund this development in a timely way. Benefits may accrue,
however, to the information suppliers, if they choose to develop
and fund appropriately self-referential records. For catalogers
to dialog with information suppliers along these lines is
professionally responsible. It is professionally responsible to
start defining descriptive parameters now, so that creators of
Internet objects can easily and consistently invoke them in the
resources they release. Ideally, the descriptive data embedded
in the records themselves would be protocol-independent. The
data should slip into an object available via FTP with no more
difficulty than they reside in an object available via Z39.50.
Engineers may not be able to do everything, but they surely can,
with cataloger support, do this.
Because the name is bound up with the identification and
the identification is bound up with the location, and these
three topics are the proper pursuit of engineers, there is some
hope that an elegant, standard solution to names and locations
and identifications will become available. We, on the other
hand, who come to the problem from AACR2R instead of from our
compilers, have deep, unmet needs for some indication of whether
a record should cite other editions or works or whether it
should be appended to something else as a version, or whether it
should be classed with something else. We need to know
whether more will come or if the item is complete. We also need
to have subject headings and authority work, but these are all
topics for another paper.
It is not only the time to list what we need in these
records. It is also time to list what we must omit. What
happens if we, who are the inheritors of the Library of Congress
Rule Interpretations, in all their Mandarin ornateness, are too
unused to an unadorned, engineered elegance to work toward an
object describing itself? Can we continue to apply rules written
for an item-in-hand situation (AACR2R Rule 0.24) to a space where
the same item is not the item when it is not remote?
BRIGHT SPOT ON THE HORIZON
MARBI's Discussion Paper 69 ("Accommodating Online
Systems and Services in USMARC") says, "As further work is done
on directory services, it may be possible to establish a
mechanism for using existing directory services to keep USMARC
records up-to-date. For example the InterNIC Information
Services in San Diego provides a template for systems to fill in
and thus be registered in the directory service." (MARBI 69, 8)
While we're working to establish all the data elements we need
for networked resources, and we're talking with the engineers who
are creating headers drawn from marked-up text data, why not
examine the template that this company and others like it have
put together? Engineers and catalogers could learn something
from the business community.
One of the problems that was encountered back when people
tried to teach machines to catalog print materials without
professional catalogers as mediators was the fact that the
machine spoke machine language and the print material was mute.
The print material didn't flag its title. The creators of its
title page were layout artists, whose goal was not
standardization. Networked objects, on the other hand, are
written in machine language. With a few good guidelines, we
could have a title positioned in the same place or marked the
same way every time. With a good template, we could find
creators of Internet objects willing to inscribe themselves into
the header. The size, version, and up-to-dateness of an object
could be extracted from the object itself. What would the
payoffs be for indexing and abstracting businesses, or for
academia, or for the government? How much access can users
afford?
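What a self-describing object might look like can be sketched with
the same labeled-line convention that the recipe in Example 1 already
uses for its Subject and Date. This is only an illustration: the
field names Title, Creator, and Version are assumptions, no such
template has been agreed upon, and Python's standard email.parser is
used simply because it reads headers of this shape.

```python
from email.parser import Parser

# Hypothetical object: a labeled header, a blank line, then the body.
OBJECT = """\
Title: Orange Pound Cake
Creator: liz@unirot (Mamaliz)
Version: 1
Date: 18 Jul 86 03:42:03 GMT

Absolutely the best cake I have ever eaten!
"""


def describe(object_text: str) -> dict:
    """Build a descriptive record from the object's own header."""
    message = Parser().parsestr(object_text)
    record = {key.lower(): value for key, value in message.items()}
    # Size is measured from the body rather than trusted from a label.
    record["size_bytes"] = len(message.get_payload().encode("ascii"))
    return record


record = describe(OBJECT)
print(record["title"])    # Orange Pound Cake
print(record["creator"])  # liz@unirot (Mamaliz)
```

Because the title is marked the same way every time, the machine
finds it without a cataloger's mediation, and the size is taken from
the object itself.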
CONCLUSION
The Internet is a non-static space that is host to a
variety of information objects. Cataloging rules were not
drafted with these objects in mind, and it is difficult to apply
them. There has been some work done by computer scientists to
name, locate, and describe these objects in machine-driven ways.
Librarians can advance their profession by helping to build
bridges between the technical, economic, and service issues
surrounding access to networked objects. We should actively work
to dispel the frustrating idea that human catalogers can ever
seize the time, find the funding, or create the tools to handle
the Internet all by themselves.
REFERENCES
ALCTS/LITA/RASD. MARBI. [1992?] "Accommodating Online Systems and
Services in USMARC." Discussion Paper 69. Photocopy.
ALCTS/LITA/RASD. MARBI. [1993?] "A 007 Physical Description Fixed
Field for Computer Files." Discussion Paper 68. Photocopy.
Anglo-American Cataloguing Rules. 1988. 2nd ed. Chicago: American
Library Association.
Berners-Lee, Tim. 1993. "Uniform Resource Locators." Internet
Draft, IETF URL Working Group.
Dillon, et al. 1993. Assessing Information on the Internet:
Library Services for Computer-mediated Communication. Dublin,
Ohio: OCLC, Office of Research.
Lynch, Clifford A. 1993. "A Framework for Identifying, Locating,
and Describing Networked Information Resources." Draft for
Discussion at March-April 1993 IETF Meeting.
Lynch, Clifford A. and Cecilia M. Preston. 1992. "Describing and
Classifying Networked Information Resources." Preprint.
Electronic Networking: Research, Applications and Policy.
Judith M. Brugger is Catalog Management and Authorities Librarian,
107 B Olin Library, Cornell University, Ithaca, NY 14850
MC JOURNAL: THE JOURNAL OF ACADEMIC MEDIA LIBRARIANSHIP
Vol. 1 #2
Fall 1993
ISSN 1069-6792
October 27, 1993