The Virtual Observatory and the Roman de la Rose: Unexpected Relationships and the Collaborative Imperative

by Sayeed Choudhury and Timothy L. Stinson, Johns Hopkins University

Do most faculty in the humanities see collaborative research and engagement with large datasets in their future?

Two projects led by Johns Hopkins University that come from apparently opposite ends of the spectrum–the Virtual Observatory (VO) and the Roman de la Rose Digital Library–reveal some unexpected relationships that bring to light a shift in scholarly practices that cyberinfrastructure may encourage.

The Virtual Observatory represents one of the quintessential cyberinfrastructure projects–large, complex datasets shared, visualized and analyzed by a distributed group of astronomers. The Rose project features a digital library (along with related services) comprising digital surrogates of Old French manuscripts copied from the late-13th to the mid-16th centuries, many of which are richly illuminated. How could the Rose project offer any insight into data-driven scientific investigation? Even a completely digitized corpus of all extant Rose manuscripts (some 250 are known worldwide) would not approach the scale of the VO datasets. But, upon reflection, we suggest there may be an important relationship between “data mining” and collaborative scholarly practices in the sciences and humanities.

Greater collaboration among humanities scholars in the discovery and production of knowledge is cited by the ACLS cyberinfrastructure report, Our Cultural Commonwealth, as one of the goals and characteristics of an effectively implemented cyberinfrastructure for the humanities and social sciences.  But it is posed in the report against the entrenched, traditional academic culture of the “individual genius” working in isolation.[1] Referencing a recent study on American literary scholarship online, the ACLS report commented that:

Despite the demonstrated value of collaboration in the sciences, there are relatively few formal digital communities and relatively few institutional platforms for online collaboration in the humanities. In these disciplines, single-author work continues to dominate. Lone scholars, the report remarked, are working in relative isolation, building their own content and tools, struggling with their own intellectual property issues, and creating their own archiving solutions.

While this may be true, we have reason to believe that a change is at hand; what is more, when one considers the evolving historical relationship between the humanities and sciences, the picture becomes more complex.

Rudolphine Tables = Open Content Alliance?

Scientists were not always good collaborators. In discussing the Rudolphine Tables (Johannes Kepler’s 1627 star catalog and planetary tables that radically improved the ability to calculate planetary positions), computer scientist Michael Nelson made the startling suggestion that the Tables might be considered on a par with today’s Google Book Search or the Open Content Alliance, in their power to inspire a new generation of scholarship.[2] But he continues by noting that there were a host of issues standing in the way of the Tables’ publication, including “significant infrastructure costs (in the form of purpose-built observatories), professional jealousy, intellectual property restrictions, and political and religious instability.” This suggests that at the time, astronomy was a discipline defined by lone practitioners who would guard their data with great secrecy; in the “data-poor” environments of the early scientific era, scientists did not readily share data or collaborate.

In contrast, by 1627 when the Rudolphine Tables were published, the Roman de la Rose had been written, re-written, re-purposed, recast, illuminated, and shared many times over. In that era, before the development of scientific instrumentation, when “data” consisted of the spoken word, the written word, and illuminations, this body of manuscripts represented a “data-rich” environment where humanists did collaborate in the creation of new knowledge.

Perhaps it is not a set of inherent characteristics within specific disciplines that defines their mode of scholarship or communication, but rather the relative ease or difficulty with which practitioners of those disciplines can generate, acquire or process data. While many may think that humanities materials are comparatively data-poor, we suggest they can be data-rich in numerous ways. A single Rose manuscript, for example, contains a tremendous amount of textual, visual, and semantic content that is sometimes difficult to extract in meaningful ways, and nearly impossible to represent adequately in a printed edition. As our ability to move these data into digital formats improves, we believe that humanists will be drawn into new forms of collaboration that will inspire new kinds of scholarship: large-scale digitization might bring the humanities into a new age of “data-driven scholarship,” much as the Rudolphine Tables inspired astronomers.

Data-Driven Convergence?

The NSF’s 2007 report, Cyberinfrastructure Vision for 21st Century Discovery, cites 27 recent cyberinfrastructure studies and reports from across the sciences, engineering, social sciences, and humanities[3]. This surely represents an unprecedented convergence of interest across C.P. Snow’s “Two Cultures”[4] in the promise of cyberinfrastructure and of data-driven research. There is no doubt that the sciences and engineering are leading the way for data-driven scholarship in our current environment, but many areas of humanities research are increasingly data-driven as well. As our digital library group at Johns Hopkins has learned more about the data curation needs of projects from a variety of disciplines, we have realized that we are facing a data deluge–not only relating to the Virtual Observatory, but also to the ever-increasing size and number of data files that other humanities projects such as the Roman de la Rose Digital Library are now generating.

Manuscripts, so evidently data-rich in the era in which they were created, today retain their former value and meaning while they inspire a new generation of humanists to create new sets of data. This includes the metadata needed to encode, organize, and understand the texts, annotations, and the visual art embodied in the manuscripts. Not only does this demonstrate the parallel need for data curation and preservation in the humanities and the sciences (for at the level of storage infrastructure, a byte is a byte and a terabyte a terabyte) but it underscores an increasing convergence in what humanities scholars and scientists analyze: data. In addition, there is an increasing overlap between the two communities in the tools needed for storing, accessing, and manipulating these data. Let us propose, then, that putting aside obvious aesthetic differences, scientific datasets are a modern “equivalent” of medieval manuscripts.

In fact, one could argue that manuscripts such as The Rose represented the richest sets of data/information available in their day and were stored for subsequent examination, analysis and repurposing. Additionally, they contained multiple types of data such as integrated texts and images, user annotations, and intertextual allusions and references. These intertextual references frequently pointed the reader to other texts available in the same monastic or university libraries. Thus the early codex, situated in a library of other codices to which it was linked in a semantic web of intertextuality, was a collection of active links, hyperlinks if you will, that simultaneously informed the reader how to navigate the text at hand and pointed outward to other relevant documents. The library was the “web” before the Web existed.

Digital tools are allowing us to capture, manipulate, and examine books and their data in ways that are revolutionizing the humanities. Entire libraries are now being digitized, linking their components in unforeseen ways. Libraries that have been dispersed by auction, theft or the vagaries of time may be virtually reassembled. And new libraries, whether a collection of all extant Rose manuscripts (which of course has never been, and could never have been, assembled) or something on the immense scale of the Google Books project, are emerging, bringing with them powerful tools and possibilities for research that have barely been realized. Finding themselves in new kinds of data-rich, multi-media environments, created by the mass digitization projects as well as the continuing efforts of libraries, museums and archives to digitize their special collections and their image, moving-image and sound files, humanists are increasingly considering the potential for cyberinfrastructure-related research and teaching.


Lessons Learned?

Digital media provide an opportunity to reflect more accurately forms of medieval textuality and transmission that disappeared during the print era. The routine integration of text and image on computer screens, the recombinant nature of electronic texts, and the idea that anyone can copy, alter, edit, and retransmit a document (much to the chagrin of those with the most to lose from the potential collapse of traditional copyright laws), all have strong parallels in medieval texts and acts of textual transmission.

Print culture played a formative role in creating the notion of a single, authoritative text, as well as the expectation of an individual genius working alone. Technology in the form of the printing press shifted the scholarly landscape. Old models of collaboration, as well as the attendant mechanisms of creating, publishing, and transmitting works of scholarship, were replaced by a new world of large-scale publishing whose aim was the production of multiple, identical copies of a single authoritative text. The mechanization of production, copying, and transmission led to the virtual extinction of a scribal culture that produced unique versions of texts in which the roles of author, scribe, editor, and publisher were inextricably blurred.

Technology, in both its processes and its tools, will always influence and shape a culture. But how do we ensure that the evolving cyberinfrastructure supports but doesn’t overly define the new forms of emerging data-driven scholarship? One of the imperatives for the humanities community is to define its own needs on a continuous basis and from that to create the specifications for and build many of its own tools.[5] At the same time, it will be worthwhile to discover whether new cyberinfrastructure-related tools, services, and systems from one discipline can support scientists, engineers, social scientists, and humanists in others. NSF (perhaps in collaboration with the NEH and IMLS) might help track the portability of such resources.

Finally, we want to point out that we can apply a historical lens to this issue today only because of earlier commitments to the preservation of our heritage. However, as highly coveted manuscripts and other valuable physical objects are digitized, the resultant datasets are often not as highly regarded by libraries. We believe this represents a shortcoming of vision. For while the curation of physical codices will remain an essential role for libraries, the collection and curation of digital objects will assume greater importance for libraries of the future, and the infrastructure, budgetary priorities, and strategic plans of library organizations would do well to account for this sooner rather than later. In the digital age, data can become at risk in as short a period as five years, and we have already irrevocably lost important datasets. The importance of curating datasets to ensure long-term, persistent access cannot be overstated. Imagine the loss to science and scholarship if we had not preserved the Rudolphine Tables or the Roman de la Rose manuscripts.


Works Cited

[1] American Council of Learned Societies’ Commission on Cyberinfrastructure for Humanities and Social Sciences, Our Cultural Commonwealth (2006), p. 28. See also p. 48 on how “traditional scholarly work, in the form of a single-authored, printed book or article published by a university press or scholarly society, is the currency of tenure and promotion, and work online or in new media, especially work involving collaboration, is not encouraged.” http://www.acls.org/cyberinfrastructure/acls.ci.report.pdf

[2] Michael L. Nelson, “I Don’t Know and I Don’t Care,” NSF/JISC Repositories Workshop, April 2, 2007. http://www.sis.pitt.edu/~repwkshop/papers/nelson.html. Retrieved September 2, 2007.

[3] National Science Foundation Office of Cyberinfrastructure, Cyberinfrastructure Vision for 21st Century Discovery, 3:46 (2007): Appendix B, “Representative Reports and Workshops.” http://www.nsf.gov/od/oci/CI_Vision_March07.pdf.  Retrieved August 8, 2007.

[4] The term, coined by British scientist and novelist C.P. Snow in his 1959 Rede Lecture “The Two Cultures and the Scientific Revolution,” became a shorthand for the rift between the sciences and humanities in approaches to problems. See C.P. Snow, The Two Cultures (Cambridge University Press, 1959; reprinted 1993).

[5] See, for example, the effort to articulate this in the 2005 report from the University of Virginia’s Institute for Advanced Technology in the Humanities, Summit on Digital Tools for the Humanities. http://www.iath.virginia.edu/dtsummit/SummitText.pdf. Accessed October 6, 2007.

From Data to Wisdom: Humanities Research and Online Content

by Michael Lesk, Rutgers University

1. Introduction
President Clinton’s 1998 State of the Union Address called for “an America where every child can stretch a hand across a keyboard and reach every book ever written, every painting ever painted, every symphony ever composed.”[1] If that dream is realized, we would have the resources for all humanistic research online. What difference would that make for the humanities?

The widespread availability of online data has already changed scholarship in many fields, and in scientific research the entire paradigm is changing, with experiments now being conducted before rather than after hypotheses are proposed, simply because of the massive amounts of available data. But what is likely to happen in the humanities? So far, much of the work on “cyberinfrastructure” in the humanities has been about accumulating data. This is an essential part of the process, to be sure, but one would like to see more research on the resulting files. At the moment, we have a library that accumulates more books than are read. T. S. Eliot wrote “Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?”[2] Modern computer scientists have tended to turn this into a four-stage process, from data to information to knowledge to wisdom. We are mostly still at the stage of having only data.

Important humanities projects already use textual databases, including authorship studies dating back more than forty years. Soon, however, nearly every published work will be searchable online. Not only will this solve the problem of locating sources but, more importantly, it will give us the ability in textual studies to automate queries about, for example, the spread of stylistic techniques, the links between authors, and the references to cultural icons in books. Today, we are making the twenty-first-century equivalent of scholarly editions, with the ability also to conduct research on them.

Simply locating important physical items is a major gain from the collection of digital data. Art historians are unlikely to write significant essays based on thumbnail images, but they can use thumbnails to decide which museums to visit. Humanists are more likely than natural scientists to want to see a specific original object (medical researchers reading about a toxicity study in mice are unlikely to ask to see the actual rodents). Humanists’ need to consult manuscripts, paintings or a composer’s autograph scores will remain and this reinforces the value of online library catalogs and archival descriptions, even if the subjects themselves are not available remotely over the Web.

The new communications technologies, especially the spread of broadband Internet service, will improve our ability to do both interdisciplinary and international research. On my campus, the music library, the art library, and the library with the best collection of literature are in three different buildings, about two miles apart. On my screen, they are all together. Similarly, I can easily question colleagues worldwide, whatever their specialty; people are not necessarily less accessible because they are not in my building. Davidson also argues for the importance of international cooperation in building digital resources, pointing to two particularly successful projects: the International Dunhuang Project, reuniting in virtual space manuscripts and paintings from Dunhuang (many now in London, Paris or St. Petersburg), making them easier to view and study; and the Law in Slavery and Abolition project, which brings together abolition materials from many countries.[3]

However, the physical location of humanities infrastructure has still to be settled. Should each research institution have a repository that covers all subjects of interest to it? Should there be different data centers for each subject or should there be a few large national centers? It is probably easier to get political support if many institutions are able to participate in the creation of cyberinfrastructure, even if that causes some duplicate work. Even with the wide disparity of size among North American universities, the low cost of computer hardware and the ease of distributing software platforms such as Fedora or DSpace mean that almost everyone can have their own specialty. In the sciences, for example, Eckerd College (an institution with 1,750 students) maintains a database of dolphins seen off the east coast of the United States, while in literature many campuses have claimed their specialty–whether it be Thackeray at Penn State, Twain at Berkeley, Tennyson at San Francisco State, or Thoreau at Santa Barbara. In the UK, the Arts and Humanities Research Council has recently suggested that data services now be provided by university repositories, rather than by the somewhat centralized Arts & Humanities Data Service.

One of the biggest problems still to be solved is that of scholarly access to copyright-protected materials. Regrettably, much of the discussion of cyberinfrastructure in the humanities gets bogged down in copyright law. It’s not, as it is in the sciences, about how to manage the data or what to do with it, but what you are allowed to do with it. Broadly speaking, literature and textual study is less affected by this concern than music and film scholarship, since most texts are in the public domain (first published in the U.S. before 1923). However, nearly all recorded sound, movies and films remain under copyright, as does most photography and a considerable amount of twentieth-century literature, painting and sculpture. And even in literature, most modern commentary is published under copyright restrictions.

Since data mining routines require open access to material, it is not clear how we will reconcile the financial interests of rights holders with the scholarly interests of researchers. The entertainment industry is of course trying to establish principles for the use of intellectual property online, which may wind up controlling what researchers can do. In many cases, however, rights holders such as museums are willing to provide special access to scholarly users. The Metropolitan Museum of Art, for example, has greatly simplified the process of getting reproduction-quality images of many objects in their collection.

2. Changes in the Natural Sciences
In the traditional paradigm of scientific research, a researcher poses a hypothesis, designs an experiment to test it, carries out the experiment, evaluates the results, and then decides whether the hypothesis is valid. Running the experiment is typically the slowest step in this process. Today, the experiment, or data collection, is often not conducted at all, since in many scientific areas there are now enormous collections of online information already available. Consider astronomy as an example. A researcher wanting to discover the comparative spectral characteristics of particular stars used to have to sign up for time at Kitt Peak or some other observatory, wait until the assigned two-week slot came around, go to the telescope, and take observations, hoping for good weather those nights. But now there are more than forty terabytes of sky images online. Archives such as the Sloan Digital Sky Survey or the Two Micron All Sky Survey[4] may well provide the necessary information.

Molecular biology was the first scientific area to be transformed this way. Much genomic research is now done entirely by comparing information from different databases, for example looking at the sequences of DNA base pairs (or the structures of biomolecules) from various species. This is not only faster than research using traditional “wet chemistry,” but encourages a different kind of research focused on searching out evolutionary similarities rather than the characteristics of single species. Among the best known archives are the Protein Data Bank[5] and the genome data archives, such as the National Library of Medicine’s GenBank. The same is true of other sciences as well: high-energy physics, climatology, and seismology all have large and growing data archives.

In addition to the use of research data by the science that gathered them, data-oriented research is also being conducted on the archives themselves, from the creation and evaluation of new interfaces for access and retrieval, to methods for data mining and experiments on using data in education–the results of which may be applicable to other scholarly areas. For example, visualization techniques for displaying charts of temperatures, such as spreadsheet chart generators, can also display the use of words in documents or the ages of buildings, while more complex 3-D visualization software can show chemical molecules or sculptures.

3. Data in the Social Sciences
Social scientists have been gathering large quantities of data for years, through public surveys, economic monitoring and other methods. The Inter-university Consortium for Political and Social Research (ICPSR), established in 1962, is the world’s largest archive of digital social science data, maintaining more than a terabyte of information from surveys. “Quantitative history” also got its start in the 1960s, exploiting economic data, genealogical data, polling data and more, yielding important insights and results (such as, in just one example, the evocation of the life of textile workers by Clubb[6]).

Social scientists also exploit data collected in the natural sciences. For example, geographic information systems are widely applicable in both areas. In archeology and ethnographic studies, GIS data is fundamental, and of course it is relevant in history as well.[7] We can make far better maps of battles from Marathon to Waterloo than any participant (or even a 19th century expert such as Major-General Stanley) could have.

One particular social science application, “collaborative filtering,” which makes automatic predictions about the interests of a user by pooling information about the preferences of many users and is based on work by Hill, is now a familiar feature of commercial sites.[8] This shares the social effects of citation indexing, but is applicable in wider areas, since it does not depend on the use of formal citations. If databases were centralized, and they tracked usage, it would be technically possible to suggest resources to scholars, even while maintaining anonymity.
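To make the mechanics concrete, here is a minimal Python sketch of user-based collaborative filtering (it is not Hill’s system, and the scholars and resources named in it are purely hypothetical): each resource a scholar has not yet used is scored by the usage of other scholars, weighted by how closely their overall usage profiles match the target scholar’s.

    # A minimal, illustrative sketch of user-based collaborative filtering
    # (not Hill's actual system). Scholar IDs and resource names are
    # hypothetical; a real deployment would add normalization and privacy
    # safeguards.
    from math import sqrt

    usage = {  # anonymized usage counts: scholar -> {resource: uses}
        "scholar_a": {"rose_ms_douce195": 5, "artfl_corpus": 2},
        "scholar_b": {"rose_ms_douce195": 4, "artfl_corpus": 1, "tlg_homer": 3},
        "scholar_c": {"tlg_homer": 4, "artfl_corpus": 5},
    }

    def cosine(a, b):
        """Cosine similarity between two usage profiles."""
        dot = sum(a[k] * b[k] for k in set(a) & set(b))
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def recommend(target, usage, top_n=3):
        """Score resources the target has not used, weighted by the
        similarity of each other scholar's profile to the target's."""
        scores = {}
        for other, profile in usage.items():
            if other == target:
                continue
            sim = cosine(usage[target], profile)
            for resource, uses in profile.items():
                if resource not in usage[target]:
                    scores[resource] = scores.get(resource, 0.0) + sim * uses
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    print(recommend("scholar_a", usage))  # e.g. ['tlg_homer']

Commercial recommenders elaborate on this core arithmetic with normalization, larger neighborhoods, and, as noted above, safeguards for anonymity.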

4. Humanities: Beyond Data Accumulation
The rate of the accumulation of digital data in the humanities has been increasing rapidly. Digital conversion of text dates from the 1950s, when scholars realized it was much easier to make concordances on computers than by hand. In the 1970s, Project Gutenberg, in the process of converting all major works of English and American literature to machine-readable form, shifted to scanning text rather than keying it into computers. The Trésor de la Langue Française was also busily providing digital text for most works of classical French literature (now 2600 texts available through the ARTFL Project), and the Thesaurus Linguae Graecae digitized most literary texts written in Greek from Homer to the fall of Byzantium. With the advent of the Million Book Project, Google Books, Amazon’s “Search Inside”, the Open Content Alliance and Microsoft’s books.live.com, an enormous fraction of the printed literature is now searchable and much of it is actually readable. Estimating the number of books published in English at around 20 million, of which about five million are out of copyright, more than 1 million seem to be searchable now.[9] Scanning technology continues to evolve at breakneck speed. Until 2005, the best book scanners (such as the Minolta PS3000) were still moving a scanning element across a page. With the rapid increase in the resolution of digital cameras, today an entire page can be digitized at once with adequate resolution for reading or optical character recognition.

Archiving, preserving and ensuring permanent availability of this data has become of vital importance, and digital archives are becoming essential to the future scholarly enterprise. Not only must an archive be able to keep items indefinitely and find them again, it must convince its users that it will be able to do this. Scholars are not likely to be happy to deposit work in an archive if they do not believe that it is permanent.[10] A NITLE survey reported that faculty expect to be increasingly dependent on electronic resources and less on their institution’s library, although they also reported that the libraries are less aware of this trend than are their patrons.[11]

It might be that the focus of study shifts as new projects become possible using scanned works. It is now trivial to count how many times a letter or word occurs in an author’s works, but perhaps no easier to discuss the sources of an author’s inspiration or the relationship of somebody’s texts to contemporary culture. It is not clear whether projects that involve automatic analysis of humanistic materials will become as important as scientific data analysis. Brockman argues that what humanists do is read; for him, what matters is how people find their material; once scholars have a document, they won’t make any other machine-readable use of it. Similarly, Unsworth’s overview of humanities scholars of the future focuses on the importance of searching, especially in heterogeneous bodies of media and materials.[12] Again, we would hope that in the future people will not just use online resources to find the traditional materials, but to actually analyze the materials themselves.
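To make the “trivial to count” point concrete, a few lines of Python suffice; the corpus filename and the queried word below are hypothetical stand-ins for any plain-text file of an author’s collected works.

    # Counting word and letter frequencies in a plain-text corpus; the
    # filename and the query terms are hypothetical examples.
    import re
    from collections import Counter

    with open("austen_collected_works.txt", encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())

    counts = Counter(words)
    print(counts.most_common(10))            # ten most frequent words
    print(counts["sensibility"])             # occurrences of one word
    print(sum(w.count("e") for w in words))  # occurrences of the letter 'e' in those words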

5. Imagery: Paintings, Drawings and Photographs
Museums were initially very concerned about the misuse of their images online, and the scanning of cultural heritage visual works lagged behind text conversion. However, it has quickly picked up and many cultural heritage institutions have taken the lead. Examples now include the websites of the Museum of Fine Arts in Boston (330,000 images), the Musée du Quai Branly (100,000 images), the Fine Arts Museums of San Francisco (82,000 images), and the Victoria and Albert Museum (43,000 images). More broadly of course there are the 9 million digital objects in the “American Memory” site of the Library of Congress and the 550,000 images of photographs, prints and documents in the New York Public Library’s “Digital Gallery.” The Mellon Foundation’s ARTstor project contains over 500,000 images, available for educational licensing, covering a very wide variety of artists, periods, and cultures. A larger fraction of photographs than of paintings is under copyright, and some very large agencies (Corbis and Getty Images, for instance) have enormous photographic resources.

While these visual resources are now being used increasingly by faculty in many other disciplines, trends to watch in the area of digital investigation of the visual arts include:

  • Investigating the characteristics of paintings using digital techniques: Maître presented mathematical methods for analyzing the light reflected from paintings, in order to understand which pigments were used; Hardeberg showed how to analyze a painter’s palette, for example studying the range of colors in paintings by La Tour, Goya and Corot; and Lombardi investigated automatic analysis of style.[13]
  • Using digital images for restoration purposes: Pappas investigated restoring faded colors in paintings, and Giakoumis looked at patching cracks digitally. Although today it is typical that such studies are done when paintings are originally scanned, in the future one can expect that research into the materials will be separated from their digitization, as has already happened in science and in literature.[14]
  • Annotating paintings and photographs: Wang showed how to use machine-learning software to decide how to label different images: images such as terracotta warriors or roof tiles were recognized with accuracies ranging from 82% to 98%, with fewer than 10 training examples for each (holding out hope for content-based image search for art materials).[15]
  • Conducting authorship studies: Li and Wang examined Chinese paintings, extracting stroke and wash information useful for distinguishing different artists from classical to modern times; Lyu examined Brueghel drawings in order to determine features that would identify authorship.[16]

6. Music and Moving Images
Scholars working with music and with moving images are seriously compromised by copyright issues. In the United States, virtually all recorded music is protected by copyright and the recorded music industry is vigilant in enforcing copyright law, with no effective understanding or agreement about the practical implementation of “fair use.” Articles about music will be available from online journals, catalogs will help locate relevant materials, and performance reviews will be relatively easy to locate in cities with newspapers online. However, listening to performances will not be so easy, as performances available for sale online tend still to comprise a narrow range of currently popular music.

Scores published before 1923 in the United States are out of copyright, and sites such as the Lester Levy Collection at Johns Hopkins provide a large variety of old songs in sheet-music format. To help amateur enthusiasts, some websites have converted classical music to MIDI format, and musical OCR software can assist in converting scores to MIDI. Online music search and retrieval is being actively developed, although it is behind the level of text retrieval.[17] Music scholarship is thus more likely to be based in traditional cataloging, references from other scholars, and personal experiences. Library resources are still critical: the music industry’s insistence on maintaining copyright control does not mean that it is keeping old material available for sale.[18]

Automatic musical analysis also lags behind textual analysis. Bill Birmingham developed methods for extracting themes from complex scores, but we generally do not have the kind of authorship studies that exist for text or art.[19] Nettheim’s summary of the situation concludes that “few applications of statistics in musicology have so far been fully convincing.”[20] When music scholarship is about general cultural context, it will of course benefit from the web resources created for textual material.

For film scholars, the availability of DVD recordings has made it enormously easier to watch films that were very rarely seen when only theatrical performances were possible. Generally, the money spent on buying DVDs is comparable to that spent on books (about $24 billion per year), but few libraries purchase DVDs at anything like the same rate at which they purchase books. Similarly, the state of preservation of film and video for future study is behind that for printed material. About half the movies made before 1950, for example, are lost, as is nearly all early television. The Vanderbilt Television News Archive is a notable exception, providing a searchable collection of copies of the main network newscasts back to 1968.

With NTSC-quality digitization notably less accurate than film (even HDTV quality is below that of 35mm film), film scholars may feel that even collections of current DVDs are not suitable for in-depth study, while for television programs, DVDs are better than the quality of what was broadcast. For successful commercial films, conversion is being done by the movie companies. But for material with no likely market, it is unclear how it will be preserved. Film libraries at places like USC, UCLA, NYU, the American Film Institute, and the Museum of the Moving Image, just to mention a few, are attempting to address this. In some cases, commercial companies are cooperating with these non-profit organizations to support scholarly research into cinema. We can hope that in the future, when the commercial value of old movies is better understood, more research use will be allowed in circumstances where the film owners realize there is little financial risk.

7. Sculpture, Architecture, and Urban Design

With the development of a variety of techniques for 3-D scanning, it is now possible to model three-dimensional forms in space. One example is Marc Levoy’s imaging of Michelangelo’s David. Using a laser rangefinder he imaged the sculpture to an accuracy of 1/4 mm, enabling scholars to tell which chisel Michelangelo had used on what parts of the stone.[21]

While contemporary buildings are designed using 3-D CAD programs, earlier ones can be represented as 3-D models (using software such as Photomodeler or Realviz). Virtual reconstructions can also show buildings at different times in their history, and can even represent buildings that no longer exist (the accuracy of the reconstruction depending on the level of documentary detail available). In Germany, the group “Memo 38” completed a 3-D reconstruction of the destroyed synagogue at Wiesbaden, aided by drawings, photographs, memories and architectural expertise; the work has since been extended to include eleven other synagogues by Koob and others at Darmstadt.[22]

Other reconstructions include architect Dennis Holloway’s images of Native American structures from the southwest, re-created from drawings, foundation measurements and surviving elements; the images have a tremendous impact on tribal artists, who can “see” complete buildings that their culture once used, instead of just the ruins that remain. Entire modern cities (such as Los Angeles and Kyoto) have been recreated and may have applications ranging from tourism to emergency response, while the most ambitious historical reconstruction is Rome Reborn, a network of sites that will ultimately include the modeling of 7,000 buildings. Jacobson, looking at applications for 3-D archaeological models, remarks that many users will benefit from seeing an object rather than reading about it, and even more from being able to walk through it, using their memory of spaces and motion. So the use of virtual reality models of historical cities or sites can be expected to increase public interest in and support of cultural heritage.[23]

8. Born-Digital Creations
Artists are now using computers in a wide variety of creative ways extending over music, dance, text, imagery, and interactive software games. The analysis of this material by scholars has barely started, and it is not clear who will both be willing to collect it and actually have legal permission to do so–libraries, museums, or the artists themselves.[24] Worse yet, this kind of material is likely to depend on the details of the computer software used to make it and show it; even the increases in processor speed that we see every few months may change the emotional impact of the work as the speed of the display changes. Preserving and accessing the material may depend on software from companies that go out of business or discontinue support. One dramatic example of the problems posed by digital preservation is that of the BBC’s digital Domesday book. In celebrating the 900th anniversary of William the Conqueror’s 1086 survey of England, the BBC built its 20th-century version on an analog laser-disc technology that disappeared from the marketplace almost immediately, along with the BBC Micro computer that ran the software to operate the player. As a result, the Digital Domesday rapidly became unavailable, while the 1086 version was still readable. The BBC has recovered the original material, and built an equivalent website, but has legal difficulties making it generally available.[25]

9. Future: the economics and politics of access
There still remain many unsolved economic, legal and political problems surrounding digital resources and scholarly access. We need long-term financial support for the building and preservation of digital resources. Although some universities have received large donations for computational humanities research, most notably the University of Virginia, it is more common for this work to be supported by research grants, which do not provide indefinite funding for long-term storage and support. The National Science Foundation has recently announced a competition for digital archiving support, which is most welcome as a possible source of funding. We also have library cooperatives, such as LOCKSS, that try to leverage individual library contributions into a large shared system. The Mellon Foundation stands out for its support of long-term archiving and research into new business models. However, we still do not know what kind of business model will support sustainable digital resources. Should repositories be organized by university or by discipline? Should there be a few very large ones or many repositories with small specialties? Should funding be sought as endowment funds, pay per use, subscriptions, or something else? Fortunately, one cause for optimism is the steadily declining cost of disk space and computer equipment; if you can afford to keep something this year, the cost to keep those bits around in five years will be less.
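The arithmetic behind that optimism is simple to sketch; the starting price and the steady 30% annual decline assumed below are illustrative figures, not forecasts.

    # Back-of-the-envelope projection of the cost to keep one terabyte,
    # assuming a hypothetical $100/TB price today and a steady 30% annual
    # decline in storage prices; both figures are illustrative only.
    cost_per_tb_today = 100.0
    annual_decline = 0.30

    for year in range(6):
        cost = cost_per_tb_today * (1 - annual_decline) ** year
        print(f"year {year}: ${cost:.2f} per TB")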

What level of access can be provided for material still under copyright? Our Cultural Commonwealth, the cyberinfrastructure report of the American Council of Learned Societies, urges that all content be freely available under open access. It does not, however, put forward any specific goal for content, nor does it make recommendations for dealing with the tough intellectual property issues that restrict online use of nearly all recorded music, film and video, and large chunks of photography.

In an early survey of archaeologists’ use of data, Condron found widespread interest in mapping data and site records, but great confusion over whether access should be open and free or whether there should be fees to offset production costs. This leads to the publication-related question of whether or not the presence of online materials may inhibit traditional paper publication. If faculty members can only get tenure via traditional publications, they will tend to resist anything that might cause a decline in traditional publishing. Meanwhile, tenure evaluation committees are starting to ask for citation counts and journal impact factors; Stevan Harnad has shown that online publications are now generally more cited than works that appear only on paper. A faculty member in the future may be more anxious to have a website that is highly-visited and linked-to than a journal paper. Overall, though, the situation is currently so confused that we don’t understand whether data archives should charge those who put things in, those who take things out, both, or neither.[26]

Even in scientific areas where copyright is less relevant, we still find political or ethical issues affecting the availability of infrastructure data. Should the scientist who first collects some information have special privileges to use it? Different areas have different ethical rules. Anyone publishing a paper that reports measuring a protein structure is expected to deposit the structure in the Protein Data Bank immediately, while astronomers have a convention of two years’ private use, despite the potentially enormous commercial importance of some protein structures compared with the complete absence of commercial applications for cosmology. Of course, the Dead Sea Scrolls were kept secret for decades: humanities communities will also have to work out whether any rights attach to scholars as well as to the original creators or publishers of works, and how long these should last.

Most important in the long run will be the development of better techniques for analyzing and using the data accumulated in humanities repositories. Scholars can find works, they can view works, at least as surrogates, and they can exchange information with other scholars. Repositories thus make traditional research easier. But will they enable new and significant kinds of research? We would like to see more authorship studies, critical evaluation, annotation, and the like. Today, computers can count; they can read a little, see a little, hear a little, and feel a little. But as yet they do not read, see, hear, or feel at the levels needed to provide insights for humanities scholars.

NOTES

[1] W. J. Clinton, State of the Union Address (1998), http://clinton4.nara.gov/textonly/WH/SOTU98/address.html.

[2] T.S. Eliot, Complete Poems and Plays 1909-1950 (New York: Harcourt, 1952), 96.

[3] C. N. Davidson, “Data Mining, Collaboration, and Institutional Infrastructure for Transforming Research and Teaching in the Human Sciences and Beyond,” CTWatch Quarterly 3, no. 2 (2007), http://www.ctwatch.org/quarterly/articles/2007/05/data-mining-collaboration-and-institutional-infrastructure/.

[4] M. Skrutskie, The Two Micron All Sky Survey at IPAC (2007), http://www.ipac.caltech.edu/2mass/.

[5] See http://www.pdb.org or http://www.rcsb.org and links therein. Protein Data Bank project leaders are Dr. Helen Berman (Rutgers) and Dr. Philip Bourne (UCSD).

[6] J. M. Clubb, Erik W. Austin, and Gordon W. Kirk, Jr., The Process of Historical Inquiry: Everyday Lives of Working Americans (Columbia University Press, 1989).

[7] D. Rumsey, “Tales from the Vault: Historical Maps Online,” Common-place 3, no. 4 (2003), http://common-place.dreamhost.com//vol-03/no-04/tales/index.shtml, http://purl.oclc.org/coordinates/b3.htm.

[8] See W. Hill, Larry Stead, Mark Rosenstein and George Furnas, “Recommending and evaluating choices in a virtual community of use,” Human Factors in Computing Systems, in CHI ’95 Conference Proceedings, (ACM, 1995), 194-201.

[9] J. Unsworth, et al, “Supporting Digital Scholarship,” in SDS Final Report (University of Virginia IATH, 2003), http://www3.iath.virginia.edu/sds/SDS_AR_2003.html.

[10] This has been estimated by comparing the titles found in a large library of known size with those in the various online systems. This statistic is somewhat misleading since the most important and frequently used books appear in major libraries the most often and are likely to be more quickly entered into the major scanning efforts. For example, Prescott’s History of the Conquest of Mexico can be searched in every one of the four big projects mentioned above.

[11] R. Schonfeld and Kevin M. Guthrie, “The Changing Information Services Needs of Faculty,” EDUCAUSE Review 42, no. 4 (2007), 8-9.

[12] W. Brockman, Laura Neumann, Carole Palmer, and Tonyia Tidline, Scholarly Work in the Humanities and the Evolving Information Environment (Washington, D.C.: Digital Library Federation and Council on Library and Information Resources, 2001), http://www.clir.org/PUBS/reports/pub104/pub104.pdf; J. Unsworth, The Scholar in the Digital Library, IATH (Charlottesville, 2000), http://www.iath.virginia.edu/~jmu2m/sdl.html.

[13] H. Maître, Francis Schmitt, Jean-Pierre Crettez, Yifeng Wu and John Yngve Hardeberg, “Spectrophotometric image analysis of fine art paintings,” Proc. IS & T and SID 4th Color Imaging Conf. (Scottsdale, AZ, 1996), 50-53; J.Y. Hardeberg, Jean-Pierre Crettez, and Francis Schmitt, “Computer Aided Image Acquisition and Colorimetric Analysis of Paintings,” Visual Resources: an International Journal of Documentation 20, no. 1 (2004), 67-84; T. Lombardi, “The Classification of Style in Fine-Art Painting,” PhD diss., Pace University (2005), http://csis.pace.edu/~lombardi/professional/dthesis.html.

[14] M. Pappas and Ioannis Pitas, “Digital Color Restoration of Old Paintings,” IEEE Trans. on Image Processing 9, no. 2 (2000), 291-294; I. Giakoumis and Ioannis Pitas, “Digital Restoration of Painting Cracks,” Proceedings of the 1998 IEEE International Symposium on Circuits and Systems 4 (1998), 269-272.

[15] J. Z. Wang, Jia Li, and Ching-chih Chen, “Machine Annotation for Digital Imagery of Historical Materials using the ALIP System,” in Proc. DELOS-NSF Workshop on Multimedia in Digital Libraries (Crete, 2003).

[16] J. Li and James Z. Wang, “Studying Digital Imagery of Ancient Paintings by Mixtures of Stochastic Models,” IEEE Transactions on Image Processing 13, no. 3 (2004), 340-353; S. Lyu, Daniel Rockmore, and Hany Farid, “A digital technique for art authentication,” Proc. Nat. Acad. Sci. 101, no. 49 (2004), 17006-17010.

[17] J.S. Downie, “Music Information Retrieval,” Annual Review of Information Science and Technology 37 (2003), 295-340.

[18] T. Brooks, “Survey of Reissues of U. S. Recordings” (sponsored by the Council on Library and Information Resources and the Library of Congress, Washington, D.C., 2005), http://www.clir.org/PUBS/reports/pub133/pub133.pdf.

[19] W. Birmingham, Bryan Pardo, Colin Meek, and Jonah Shifrin, “The MusArt Music-Retrieval System,” D-Lib Magazine 8, no. 2 (2002).

[20] Nigel Nettheim, “A Bibliography of Statistical Applications in Musicology,” Musicology Australia 20 (1997), 94-106.

[21] M. Levoy, K. Pulli, B. Curless, S. Rusinkiewicz, D. Koller, L. Pereira, M. Ginzton, S. Anderson, J. Davis, J. Ginsberg, J. Shade, and D. Fulk, “The Digital Michelangelo Project: 3D Scanning of Large Statues,” Computer Graphics, SIGGRAPH 2000 Proceedings (2000), 131-144.

[22] G. Kerscher, “Architecture Digitalized” (talk, Section 23D of CIHA conference, London, England, September 3-8, 2000), http://www.unites.uqam.ca/AHWA/Meetings/2000.CIHA/Kerscher.html; M. Koob, Synagogues in Germany–A Virtual Reconstruction, http://www.cad.architektur.tu-darmstadt.de/synagogen/inter/start_de.html (accessed November 9, 2007).

[23] D. Holloway, “Native American Virtual Reality Archeology,” Virtual Reality in Archeology, ed. Juan Barcelo (London: Archeo Press, 2000);  B. Jepson, Urban Simulation Team at UCLA, http://www.ust.ucla.edu/ustweb/PDFs/USTprojects.PDF (Note videos and models at http://www.ust.ucla.edu/ustweb/projects.html, accessed November 9, 2007); Y. Keiji, “Virtual Kyoto through 4D-GIS and Virtual Reality,” Ritsumeikan 2, no. 1 (Winter 2006), http://www.ritsumei.ac.jp/eng/newsletter/winter2006/gis.shtml (Note link to the actual 3-D models); J. Jacobson and Jane Vadnal, “Multimedia in Three Dimensions for Archaeology,” Proceedings of the SCI’99/ISAS’99 Conference (1999), http://www.planetjeff.net/IndexDownloads/sci-isas-99.pdf.

[24] J. Lewis, “Conserving Pixels, Bits, and Bytes,” Artinfo (August 2, 2007), http://www.artinfo.com/articles/story/25439/conserving_pixels_bits_and_bytes?page=1.

[25] J. Darlington, Andy Finney and Adrian Pearce, “Domesday Redux: The rescue of the BBC Domesday Project videodiscs,” Ariadne 36 (2003), http://www.ariadne.ac.uk/issue36/tna/; A. Charlesworth, Legal issues arising from the work aiming to preserve elements of the interactive multimedia work entitled “The BBC Domesday Project” (2002), http://www.si.umich.edu/CAMILEON/reports/IPRreport.doc.

[26] F. Condron, Julian Richards, Damien Robinson and Alicia Wise, Strategies for Digital Data (York: Archaeology Data Service, 1999); C. Hajjem, Y. Gingras, T. Brody, L. Carr and S. Harnad, “Open Access to Research Increases Citation Impact” (paper published by the Electronics and Computer Science Eprints Service, Univ. of Southampton, 2005), http://eprints.ecs.soton.ac.uk/11687/.

Beyond the ACLS Report: An interview with John Unsworth

by Kevin Guthrie, Ithaka

I sat down with John Unsworth for 90 minutes at the American Library Association’s conference in Washington, DC. John is Dean and Professor of the Graduate School of Library and Information Science at the University of Illinois, Urbana-Champaign, and chaired the ACLS Commission on Cyberinfrastructure for the Humanities and Social Sciences. Its report, Our Cultural Commonwealth, was released in December 2006. See Gary Wells’s review of the report in this issue. While the Commission’s report does not focus directly on liberal arts colleges, the development of a working cyberinfrastructure for the humanities and social sciences would definitely affect them, presumably in positive ways that would enhance their teaching and learning capacities.

Kevin Guthrie: I have to start with the somewhat obvious question. You have been thinking about this and talking about this for a long time: what exactly is cyberinfrastructure?
John Unsworth: We worked hard on the definition of cyberinfrastructure for the report and there’s a pretty good one there that builds on the one established in the 2003 National Science Foundation (NSF) report, Revolutionizing Science and Engineering Through Cyberinfrastructure (to which the ACLS Report was one of many responses). I like to think of cyberinfrastructure as the middle layer of a cake. The base layer is all of the hardware and basic operating systems-level technology on the network. Fiber optic cables, storage devices, things like that. The icing is made up of specific applications to serve a particular purpose. Software applications and tools that can be shared by different people for different purposes represent the middle layer of the cake and are what we mean by cyberinfrastructure. It is important to point out that cyberinfrastructure is not just equipment or software, it also includes the human interactions, protocols, standards, work processes, and so on, needed to make the system work and to structure collaborative or related activities. So, in the case of highways, the infrastructure is not just the roads, it is the maintenance crews that keep them functional, the speed limits and police cruisers that ensure safety, the understanding that you pass on the left. All of these elements are part of the “infrastructure.”

One of the compelling quotes from the NSF report is “if infrastructure is required for an industrial economy, then we could say that cyberinfrastructure is required for a knowledge economy.”[1] Your definition is helpful, but I do sometimes feel challenged to get my arms around how to build such an infrastructure on a system-wide basis. Are there any real-world examples of cyberinfrastructure yet? Would you say that eBay is a kind of cyberinfrastructure that facilitates the online exchange of goods? It has software applications, services for exchanging money (such as PayPal), acceptable mechanisms for posting items for auction, social protocols for evaluating the quality and reliability of sellers and buyers. Is that a reasonable example?
Yes, I think that’s a good one. One of the other ways of thinking about the development of cyberinfrastructure is to recognize that it is not something that gets established in one fell swoop. It’s so broad and encompassing that it tends to get built incrementally. We’ve seen the first wave in the building of the cyberinfrastructure that will transform scholarship in the form of digitized content. Over the last decade or so, libraries, publishers, individuals and nonprofits have created great quantities of digitized content with tools and applications to facilitate their use. When that content is combined with the born-digital content being created every day on the Web you have a huge layer of content on which to build valuable tools. Search engines are a layer on top of that content. Together these establish a base component of cyberinfrastructure upon which we can build.

Your example highlights the fast pace and relentless nature of technological progress. What starts out as new and innovative becomes a commodity layer so fast. A publisher might have built unique content and a home-grown search engine in the late 1990s and could have built a loyal and growing following based on the value of searching that content. Then, enterprises like Google enter the search business and their tools become widely used so that publishers will have to add a new layer to the cyberinfrastructure to continue to be useful and valuable.

Let me step back for a moment to the process of creating the cyberinfrastructure report. One thing I noticed in the description of that process was that there was international representation on the commission. Was there a difference in how representatives from other parts of the world viewed the challenge as compared to here in the U.S.?
I was struck by the difference in the funding structures for higher education and scholarship. For the most part, in other parts of the world, there is very little private philanthropy. There are some exceptions, for example the Wellcome Trust in the U.K., but for the most part there are no major foundations providing recurring grant funding into the environment, and government funding may not distinguish between the sciences and the humanities. I have to admit that initially the international model struck me as the better model, because it means that (apparently) you are competing in the same larger funding environment as your computer-science colleagues–and when funding is dominated by government agencies, especially a single agency as is the case in many places, there is a huge opportunity to ensure that activities are coordinated and even integrated. But over the course of watching reactions to our report play out in the community, I’ve come to appreciate the flexibility and vibrancy of the system here in the U.S. and the value in having diverse funding options and opportunities. Yes, there is less vertical integration, but there is some protection in that. The announcement earlier this year of a reduction of funds to the Library of Congress’s NDIIPP program[2] is one example of the perils of relying too much on single-source government funding, as are the cutbacks to the Arts and Humanities Data Service in the U.K.[3] So I have come to believe that one system is not necessarily better than the other; each has its pros and cons.

Continuing on the international theme, did the commission engage in issues related to cyberinfrastructure in under-resourced parts of the world, such as in developing countries?
You might be surprised at what counts as an under-resourced part of the world, with respect to technology. It’s not just countries in the developing world that have a long way to go in building cyberinfrastructure. For example, when I was at the University of Virginia, we brought in a number of American Studies scholars from around the world to ask what they would like to have networked access to in our library’s special collections. Their response was that the collections were great, but in Ireland, for one example, students wouldn’t have the lab or classroom facilities to make working with networked resources practical.

More than just the base layer, though, what is definitely in short supply at some institutions are the human resources that are needed to make cyberinfrastructure work. There are simply not enough skilled people out there, and this is very true of small colleges in this country as well as overseas. And even if you are fortunate at a small place to have some capability to help enable digital scholarship, that human capacity is mobile. You can invest a lot in it and it can leave you. If a key person leaves a small place, you can be totally back to square one. That is why it is important to remember that cyberinfrastructure includes people.

So there was the cyberinfrastructure report for the sciences, and then there is this report for the humanities and social sciences. Is there a difference? Why is there a need for more than one report?
In many ways there is convergence and the distinctions are blurring. But there are some very real differences in the way people do their work across disciplines that have to be taken into account in the resources and tools they need and will use. In some ways, this is less about the discipline per se, and more about the nature of the resources they depend on. So, for example, in areas of the sciences that depend heavily on massive quantities of observed data (say, from the Hubble telescope) there is already considerable collaboration, and tools are needed to facilitate that. In the humanities, by contrast, much research remains more singular.

Having said that, though, I do believe we are moving to a world where there will be massive amounts of data that humanists will need to sort and understand. Like the data from the telescope, it will have to be processed in a way that will require large-scale data mining activity, and that will promote more collaboration. There is going to be much more “distant reading” of texts.[4] By that I mean computers will “read” texts and process them in a variety of ways, using more sophisticated semantic comparison and search tools, and prepare them for higher-level analysis by scholars. The mass digitization projects are going to accelerate this process.
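To make the idea of “distant reading” a little more concrete, here is a minimal sketch, entirely illustrative rather than drawn from the report, of the simplest kind of machine “reading”: each text in a small corpus is reduced to a word-frequency profile, and the profiles are compared so a scholar can see which texts resemble one another before reading any of them closely. The corpus directory and file names are assumptions made for the example.

import math
import re
from collections import Counter
from pathlib import Path

def profile(path):
    # Tokenize a plain-text file into lowercase words and count them.
    words = re.findall(r"[a-z']+", path.read_text(encoding="utf-8").lower())
    return Counter(words)

def cosine(a, b):
    # Cosine similarity between two word-frequency profiles (1.0 = identical profiles).
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# "corpus" is a hypothetical directory of plain-text files.
corpus = {p.name: profile(p) for p in Path("corpus").glob("*.txt")}
names = sorted(corpus)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        print(f"{x} ~ {y}: {cosine(corpus[x], corpus[y]):.3f}")

Scaled up to millions of digitized books and to more sophisticated semantic tools, this same basic move–turning texts into data that can be sorted, compared and clustered–is what makes large-scale, collaborative analysis possible.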

Speaking of the mass digitization projects, Google announced its Google Books Library Project while the commission was writing up the report. Did that affect your deliberations and your final conclusions?
It didn’t impact our report that much. We saw the potential and pointed to Google’s effort and the Open Content Alliance as important examples. I think Google’s announcement had a bigger impact on the report’s reception, as humanities scholars could actually imagine that millions of books would be available in digital form. The idea of cyberinfrastructure became much more real to them. Prior to those projects such a thing was only the stuff of dreams.

Some concluding thoughts: what other impacts has the report had?
We have been encouraged by the way the report has been received as a framework to think about these issues, and maybe even to help inform foundations’ and government agencies’ grantmaking strategies. I have been told on numerous occasions that the report has been helpful in this regard. Last spring, the Council on Library and Information Resources convened a group of federal funding agencies and foundations to discuss follow-up to the report, and there was some good progress. We then had a meeting to discuss the need and potential for centers of excellence to take components of cyberinfrastructure forward. One of the things that I learned through those conversations is that the funding agencies tend to operate like any other enterprise (colleges and universities included), and collaboration is hard. He who starts it, owns it. If the report can provide a structure that helps guide even a small amount of collective action or coordination of resources, that would be a very good thing. The requirements of cyberinfrastructure are beyond the means of any single funding agency; in fact they exceed the resources of those agencies combined. We require sustained investment from the private, governmental, nonprofit and university sectors to realize the great potential that digital and network technologies offer for scholarship in the next century.

Kevin Guthrie is President of Ithaka.
[1] National Science Foundation (NSF), Revolutionizing Science and Engineering Through Cyberinfrastructure (2003), 5.

[2] $47 million was rescinded from the budget of the National Digital Information Infrastructure and Preservation Program, Feb 17, 2007. See, for example, “LC Hit By $47 Million Cut in Digital Preservation Funds,” Library Journal (March 20, 2007), http://www.libraryjournal.com/article/CA6426077.html (accessed September 2,  2007).

[3] The Arts and Humanities Research Council announced that it would cease funding the Arts and Humanities Data Service as of March 31, 2008. The JISC is engaged in a review to determine the future of the AHDS and has stated that it will not fund the service in its current form alone. See http://ahds.ac.uk/news/futureAHDS.htm (accessed September 2, 2007).

[4] See Franco Moretti, Graphs, Maps, Trees: Abstract Models for a Literary History (Verso, 2005).

The (Uncommon) Challenge of the Cultural Commonwealth

by Gary Wells, Ithaca College

A Review of Our Cultural Commonwealth: ACLS Report on Cyberinfrastructure for Humanities and Social Sciences

Our Cultural Commonwealth, the 2006 report of the American Council of Learned Societies Commission on Cyberinfrastructure for the Humanities and Social Sciences, makes a rational and forceful case for coordinated action and clear policy on technology in the humanities and social sciences.

Emphasizing the importance and potential of technology for the study and dissemination of cultural knowledge through the development of an appropriate digital infrastructure, the report provides an important basis for forward-looking discussion and planning, not only about technology, but also about the nature of cultural scholarship in a digital environment.

The commission, chaired by John Unsworth, set out to make the case for building, implementing, supporting and integrating a technological foundation for the humanities and the “non-normative” social sciences. These disciplines have traditionally been slow, even reluctant, to embrace information technology (a reluctance compounded by the lack of a sense of urgency in this regard by college administrators). Our Cultural Commonwealth should be read by academic, governmental and cultural leaders, as well as scholars and academics, who have a vested interest in securing the future vitality of the humanities and social sciences.

The report is divided into three parts, corresponding to the commission’s charge: an analysis and overview of the history and current state of humanities and social science cyberinfrastructure; an analysis of the needs and potentials of such an infrastructure; and an action plan for the future, including a set of eight recommendations in support of the framework established by the report. It complements, to a certain extent, the National Science Foundation’s 2003 report, Revolutionizing Science and Engineering through Cyberinfrastructure.

The charge to the ACLS commission points to the convergence of humanists and scientists within this technology infrastructure, and to the need to give the humanities and social sciences a voice in shaping and directing that infrastructure. The report defines “cyberinfrastructure” as the technology components “shared broadly across communities of inquiry but developed for specific scholarly purposes” (authors’ emphasis, page 1). Infrastructure is the least sexy, most expensive and most essential component of the information technology landscape. As the authors acknowledge, cyberinfrastructure is more than the hardware that stores and transmits information; it also includes the software tools, practices, standards and data sets that are essential to meaningful interaction between scholars and their subject of inquiry.

Technology already exerts a powerful influence on the study and dissemination of cultural knowledge. The Internet has become a significant source of information for the specialist and layman alike. Cultural institutions like museums have embraced the web as a parallel site for their collections; academic institutions exploit internal and external digital databases of text and images to supplement traditional media; libraries are reinventing themselves as digital media centers. The practice of scholarship has been altered dramatically by such technologies as email, ubiquitous networking and digital presentation formats.

If technology has already exerted such a profound influence over cultural studies, why is a humanities and social science cyberinfrastructure an issue? The answer is that if the humanities and social sciences are to benefit fully from information technology, this cyberinfrastructure must address the specific needs and practices of those disciplines. One size does not fit all, and the humanities and social sciences have largely had to make do with inadequate tools, incompatible standards, tiny budgets and uninterested leaders. The “chalk and talk” reputation of the humanities and social sciences has resulted in a profound gap between the potential for technology to enrich these disciplines, and the reality of what tools, data and support are actually available.

The commission’s recommendations first define the characteristics that a humanities/social science cyberinfrastructure must have, including accessibility, sustainability and interoperability. In addition, the report suggests that facilitating collaboration and support for experimentation are key characteristics, the implications of which I will return to shortly.

The eight recommendations themselves call for, among other things, elevating the strategic importance of such an infrastructure within institutional planning, developing policies to foster openness and collaboration, and seeking coordination between public and private sectors. Other recommendations include support for digital scholarship in the humanities and social sciences, the creation of national centers for digital scholarship, support for standards and appropriate tools to carry out this scholarship and developing digital collections of data for scholars and the public.

The report’s recommendations are broadly inclusive, in keeping with the commission’s charge. The audience for the ACLS report is itself diverse: administrators, scholars, faculty, legislators, librarians, indeed anyone who has a stake in the humanities and social sciences or in the technology planning and development process. With this breadth of reach, the report must necessarily skim lightly over some of the issues that, for one constituent or another, might appear to be crucial. For example, the commission’s recommendation that any cyberinfrastructure must leverage the interaction between public and private sectors raises the question of conflict between the business interests of private-sector stakeholders and the need for reaching a wide audience at the lowest cost. The thorny problem of intellectual property rights and the “fair use” limitations of those rights must necessarily color any discussion of accessibility and interoperability. The economic benefits of the humanities, the “value-added” proposition that intellectual capital translates into material and cultural gain, are discussed as a way of justifying investment in a humanities cyberinfrastructure, although the fuller argument is beyond the scope of this report. This is not a weakness of the report, but it does indicate that the issues with which the commission grappled are complex and difficult.

The investment in time and money necessary to implement the goals outlined in the report must be seen against the larger picture of information technology in contemporary education and cultural practice. As any technology officer at a university, college, museum or library already knows all too well, the great cost of building, supporting and maintaining even the general technology infrastructure of an institution (including networking, file storage, email, security, and hardware) often prohibits or severely curtails the opportunity to develop specialized capabilities. While the authors of the report believe that investment in humanities cyberinfrastructure should be a matter of strategic priority, and indeed they make this their first recommendation, the argument for this priority rests upon (understandably) general notions of public good and future intellectual discovery. This is a recurring theme in the report–the public good brought by the humanities and social sciences justifies the effort to establish and support a unique technological ecosystem accessible to scholars and public alike. But the competition for resources within academic and cultural institutions is fierce, and the public good is often juxtaposed against the demands of marketing and institutional survival.

There are other, more specific challenges contained within the recommendations of the ACLS commission. At first glance, the recommendation that leadership for technology must be cultivated from within the humanities and social sciences makes a great deal of sense–who else understands the nature of the task better than those who practice the discipline? But how is this accomplished? How does the system breed technologically savvy scholars willing to step up to a leadership role without sacrificing their academic interests?

Another challenge is the recommendation for “open standards and robust tools” with which data may be manipulated, examined, shared and presented. Who creates these tools? What mechanism will support a dedicated cadre of software specialists who work closely with the social scientists and humanists in creating such tools? How do these cadres share information about such work to avoid an endless reinvention of the same wheel? Does the specialist lose out because no one is available to help design a necessary but highly-specific application? These challenges are not insurmountable, but they are formidable, especially within an environment of constrained resources and limited vision, which, unfortunately, includes a vast majority of academic and cultural institutions.

The most daunting challenge alluded to in the report is the impact that technology has on the very culture of humanities and social science practice and scholarship. The report notes that established practice and a long tradition have made the humanistic disciplines conservative and risk-averse. The culture of the solitary scholar is so much a part of the fabric of academia that the notion of scholarly collaboration and collective discovery may seem alien and even threatening. Certainly collective effort is required to assemble the data that is essential to scholarly practice, and the existing scholarly and academic infrastructure that supports the humanities and social sciences is built upon the efforts of many individuals pooling their knowledge and providing access through journals, books, and collections. But the products of research, the fruits of scholarly labor, are more likely acts of individual insight, driven by established disciplinary expectations and practice, and turning inward to an audience of other experts.

When the authors of the ACLS report cite “collaboration” and “experimentation” as an expected and desired outcome to a robust humanities/social science cyberinfrastructure, we see that technology has already significantly altered the culture of scholarly practice. In an era where the voice of the expert is increasingly drowned by the collective voices of the masses, when social networking lends power to the many at the risk of ignoring the individual, the notion that a humanities and social sciences cyberinfrastructure can foster both the deep reflection of the single scholar and yet appeal to the needs and interests of the informed public may seem contradictory. The challenge of a collaborative and experimental methodology in the humanities especially, enabled by a deep and effective technology foundation, contains both the allure and anxiety of radical and disruptive change.

Cyberinfrastructure For Us All: An Introduction to Cyberinfrastructure and the Liberal Arts

This is going to be big. According to Arden Bement, Director of the National Science Foundation, the Cyberinfrastructure Revolution that is upon us “is expected to usher in a technological age that dwarfs everything we have yet experienced in its sheer scope and power.”[1]

With a trajectory shooting from the solitary performance of legendary room-size machines (with less computing power than today’s handhelds) to the complex interactions within a pulsing infrastructure of many layered, parallel and intersecting networks, “computing” is continuing to develop exponentially. But in fact, as David Gelernter has put it, “the real topic in computing is the Cybersphere and the cyberstructures within it, not the computers we use as telescopes and tuners.”[2]

We are currently in the middle of the second big opportunity we’ve had to collectively take stock of our computing capabilities, assessing social, intellectual, economic, and industrial requirements, envisioning the future, and calling for coordinated planning across agencies and sectors. The early 1990s was the first such period. As the technical components of the Web came together in Geneva, Senator Al Gore’s High Performance Computing and Communication Act of 1991 led to the creation of the National Research and Education Network and proposals for a “National Information Infrastructure” (NII). These led in turn to funding structures that enabled the construction of hardware and software, of transmission lines and switches, and of a host of physical, connectible devices and interactive services that gave rise to the Internet we know today.

Just as the NII discussions had a galvanizing effect on building those earlier networks, the National Science Foundation’s 2003 report on Revolutionizing Science and Engineering Through Cyberinfrastructure is having a similar effect today. The product of a more sophisticated understanding of our civilization’s dependence on computer networking–a dense, multi-layered cyberinfrastructure that goes beyond switches and technical standards–the NSF report calls for a massive set of new investments (public and private), for leadership from many quarters, for changing professional practices, and for necessary institutional and organizational changes to match the opportunities provided by the tremendous recent advances in computing and networking. That report, often referred to as the Atkins report (justifiably named after Dan Atkins, the visionary chair of the NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure), inspired no less than 27 related reports on cyberinfrastructure and its impacts on different sectors.[3]

These reports have essentially laid out the territory for how best to harness the power of distributed, computer-assisted collaborative production. They forcefully and formally call attention to the shift in economic and social production from a classic industrial base to a networked information base. Interestingly, this time around, the reports not only acknowledge, they highlight humanistic values and the role of the arts, humanities and social sciences (“the liberal arts”) in a way that was not done in the documents of the National Information Infrastructure. At the heart of NSF’s mission to build the most advanced capacity for scientific and engineering research is the emphasis that it be “human-centered.”[4] This is an invitation for the liberal arts to contribute to the design and construction of cyberinfrastructure (CI).

Most significant of those 27 reports for the liberal arts community is Our Cultural Commonwealth, the 2006 report of the Commission on Cyberinfrastructure for the Humanities and Social Sciences, created by the American Council of Learned Societies (ACLS). The report underscores the values of designing an environment that cultivates the richness and diversity of human experience, cultures and languages, using the strengths of this community: “clarity of expression, the ability to uncover meaning, the experience of organizing knowledge and above all a consciousness of values.”[5] It reminds us of the founding legislation of the NEH that asserts that, parallel to the core activities of the sciences, there needs to be a healthy capacity, provided by humanities disciplines, to achieve “a better understanding of the past, a better analysis of the present and a better view of the future.” As we understand the power of software tools to parse massive amounts of data, and the potential of collaborative expertise to wield those tools and articulate the results, we need to emphasize the place of the values of individual and collective creative imagination.

In the wake of these reports, as the term “cyberinfrastructure” gains currency, as initiatives are born and decisions made, this seemed a good moment for Academic Commons to capture a range of perspectives from scholars, scientists, information technologists and administrators, on the challenges and opportunities CI presents for the liberal arts and liberal arts colleges. What difference will cyberinfrastructure make and how should we prepare?

How do we get there from here? Reviewing Our Cultural Commonwealth, art historian Gary Wells notes some key challenges. First, who’s to pay for some of the necessary transformations and how? Budget, especially for technology, has always been a big issue for a community which, in Wells’s words, “has had to make do with inadequate tools, incompatible standards, tiny budgets and uninterested leaders.” There’s a gap between what is possible and what is available for faculty right now. How do we effectively make the case for attention to CI among the other competing demands for a limited budget? How can the budget be expanded, especially when there are strong calls to make this CI both a means for greater collaboration within and among academic disciplines and a route out to the general public? Who will lead this call to arms?

While institutional response and organizational change is called for, classics scholar and Georgetown University Provost James O’Donnell, a bold yet pragmatic voice for envisioning change, affirms that change will have to come from the faculty, who have been mostly quite complacent about the future of the Web. Humanists, for the most part, are changing their practices incrementally through the benefits of email and the Web, but the compelling vision that will inspire faculty to develop a new kind of scholarship is still missing, despite the individual accomplishments of a notable few.[6]

Cyberinfrastructure draws attention to another significant challenge to academic liberal arts culture: in a word, collaboration. While that culture is created through scholarly communication–journals, conferences, teaching, the activity of scholarly societies and the continuing evolution of “disciplines”–much of the daily activity of the humanities is rooted in the assumption that humanities research and publication is essentially an individual rather than a collaborative activity. Will CI bring a revolution in the degree of real and active collaboration in research and the presentation/publication of the results?

In confronting this thorny issue, Sayeed Choudhury and colleague Timothy Stinson step back and take a long-term view. Perhaps scientists were not always such good collaborators. Perhaps there’s a cycle to the culture and practice of disciplines as they evolve. With tongue slightly in cheek, looking backward as well as forward, they make a modest proposal for a new paradigm for humanities research.

Computer scientist Michael Lesk has had a long interest in bridging the Two Cultures and in building digital libraries. While at the NSF, he spearheaded the development of the Digital Libraries Initiative (1993-1999) that funded a number of advanced humanities projects.[7] Observing a new paradigm at work in the sciences, where direct observation is often replaced by consulting results posted in massive data repositories like the Sloan Digital Sky Survey, the Protein Data Bank or GenBank, he turns to the humanities and sees little progress beyond the digitizing of material. But while waiting for new creative uses of what digitized material there is, Lesk underscores the significant economic, legal, ethical and political problems that need resolution. Citing just one: there is still great confusion among all players about which economic models should apply: who pays for what, when and how?

But again, how do we begin? John Unsworth, chair of the ACLS Commission, and now well-versed in defining and describing CI (you’ll enjoy his culinary analogies in his discussion with Kevin Guthrie), sees construction of a humanities cyberinfrastructure as necessarily incremental.[8] The first wave is the fundamental (but still difficult) task of building the digital library: bringing together representations of the full array of cultural heritage materials in as interoperable, usable and sustainable a digital form as possible. This is ‘content as infrastructure.’

Different disciplines are doing this with different degrees of success. Aided now by the operations of Google, the Open Content Alliance [see a profile in this issue], the Internet Archive and others, our libraries and archives have made available a wide panoply of materials in digital form: certainly the core texts of Western history and culture, and a considerable array of material from the West and other cultures in other media. The Getty’s Kenneth Hamma, however, argues here that, despite the images that are available in some form, many museums are holding a lot of cultural heritage material hostage. Even public domain work is still under digital lock and key by many gatekeepers who worry about the fate of “their” images once they are released into the digital realm. Millions of well-documented images of objects held by museums (of art, history, natural history), more easily accessible in very high-definition formats, will have a tremendous impact on all kinds of disciplines, let alone on the traditional ‘canon’ of works central to Art History. Along these lines, museum director John Weber writes convincingly here of the potential offered by CI for campus museums such as his own to be radically more relevant and useful for curricula around the globe by transforming museum exhibitions into three-dimensional, interactive and visceral “texts” for study and response.

While even public domain material is proving elusive, material still under copyright is often a nightmare both to find and to use in digital form, as the traditional sense of “fair use” is under siege and many faculty clearly have a lot to learn about copyright law.[9] Elsewhere, John Unsworth has cited intellectual property as “the primary data-resource-constraint in the humanities” (paralleling privacy rights as the “primary data-resource-constraint in the social sciences”). Believing the solutions to be partly technical, Unsworth sees them as the “primary ‘cyberinfrastructure’ research agenda for the humanities and social sciences.”[10] Michael Lesk underscores this message in his essay in this issue, reporting that much of the cyberinfrastructure-related discussion in the humanities is not so much “about how to manage the data or what to do with it, but what you are allowed to do with it.” Some combination of technical, social and legal answers is surely called for here.

But all of this, as Lesk reiterates, is just the beginning. Only a comparative handful of scholars in a variety of fields have begun to build new knowledge and experiment with forms of new scholarship. Here, we are fortunate to have noted media scholar Janet Murray open up some other paths in her gripping account of what the process and products of a new cyberscholarship might look like.

Murray’s starting point is that a new medium requires new genres and new strategies for making meaning; she suggests some approaches that will be more practical as the Semantic Web, sometimes nicknamed Web 3.0,[11] approaches. When software can analyze everything online as if it were in the form of a database, we will have access to tremendously powerful tools that will enable us to conduct “computer-assisted research, not computer-generated meaning.” Such structure will help us “share focus across large archives and retrieve the information in usable chunks and meaningful patterns.” Just as the highly evolved technology of the book (with its segmentation and organization into chapters and sections, with titles, section heads, tables of contents and indices, etc.) allows us greater mastery of information than we had using oral memory, so better established conventions of “segmentation and organization in digital media could give us mastery over more information than we can focus on within the confines of linear media.” Overall, she stresses cyberinfrastructure’s potential as a “facilitator of a vast social process of meaning making” (a more developed collaborative process) rather than focusing on the usual data-mining approach.
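One way to picture the kind of structure the Semantic Web promises is to express a few statements about cultural objects as machine-readable data and then ask a structured question of them. The sketch below illustrates the general approach rather than anything from Murray’s own work; it assumes the Python rdflib library is available, and the vocabulary and identifiers are invented for the example.

from rdflib import Graph

# A tiny, hypothetical description of two cultural objects in Turtle syntax.
turtle = """
@prefix ex: <http://example.org/terms#> .
ex:object1 ex:title "Portrait of a Scholar" ;
           ex:medium "oil on panel" ;
           ex:century "17th" .
ex:object2 ex:title "View of a Harbor" ;
           ex:medium "oil on canvas" ;
           ex:century "17th" .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

# A structured question: which seventeenth-century objects do we know about, and in what medium?
query = """
PREFIX ex: <http://example.org/terms#>
SELECT ?title ?medium WHERE {
    ?obj ex:century "17th" ;
         ex:title ?title ;
         ex:medium ?medium .
}
"""
for title, medium in g.query(query):
    print(f"{title} ({medium})")

The point is not the particular syntax but the shift it represents: once descriptions are segmented and structured in a shared way, software can retrieve “usable chunks and meaningful patterns” across collections far larger than any one scholar could survey.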

For a closer look at how one discipline might change with access to cyberinfrastructure, we asked three art historians (Guy Hedreen, Amelia Carr, and Dana Leibsohn) to discuss their expectations. How might their practice and their discipline evolve? Their roundtable discussion focuses initially on the critical importance of access to images (the “content infrastructure”) before moving on to consider the importance of taking responsibility for fostering new forms of production “more interesting than the book.” Ultimately, CI will be useless unless it not only revolutionizes image access and metadata management, but also helps us to think differently about vision and objects: “what kind of image work is the work that matters most?”

Zooming out again to get the big picture beyond any one discipline, I’d like to encourage all readers of this collection to read the recent, groundbreaking report out of a joint NSF/JISC Repositories Workshop on data-driven scholarship. The report, The Future of Scholarly Communication: Building the Infrastructure for Cyberscholarship, defines cyberscholarship (“new forms of research and scholarship that are qualitatively different from traditional ways of using academic publications and research data”), reviews the current state of the art, the content and tools still required, the palpable resistance to the changes necessary for it to take hold, and some of the international organizational issues. It even sketches out a roadmap for establishing an international infrastructure for cyberscholarship by 2015. Reviewing the report, Gregory Crane, one of the workshop participants, zeroes in on the core issue, the first requirement for launching sustainable cyberscholarship: getting a system of institutional repositories for scholarly production in place, working and actively being used by scholars. By the way, two of the papers in this Academic Commons collection (those by Choudhury and Murray) had their roots in position papers delivered at the NSF/JISC Repositories Workshop.

How all this goes down on the college campus is examined here by physicist Francis Starr, speaking from his experience in installing the latest in “cluster computing” at Wesleyan University. While hooking into a network is part of what cyberinfrastructure is about, so is developing one’s own local infrastructure as efficiently as possible. His main theme, though, is the equal importance of human expertise (local and distributed) and installed hardware. This theme is carried further by Todd Kelley in his demonstration of the wisdom of using cyber services that outside organizations can provide. Kelley stresses the balance to be achieved among the human, organizational and technological components when implementing such services.

Finally, chemist Matthew Coté beautifully illustrates how cyberinfrastructure might be visible on a small liberal arts campus through the example of one small but powerful new building: the Bates College Imaging and Computing Center. Designed more specifically to bring the arts and sciences together in exemplifying the potency of the liberal arts ideal (as codified by Bates’s recently adopted General Education Program), the building should prove to be one of the most creative and plugged-in cyberinfrastructure-ready places on campus. Its almost iconic organization into lab, gallery/lounge and classroom links group research and learning, individual creativity and discovery, and the key role of open social space. Artists, humanists, and scientists are equally welcome in this space, where the equipment is open to all (with training programs and nearby expertise for help in using it). As Professor Coté puts it, “Its array of equipment and instrumentation, and its extensive computer networking, make [the Imaging Center] the campus hub for collaborative and interdisciplinary projects, especially those that are computationally intensive, apply visualization techniques, or include graphical or image-based components.”

Where do we go from here? The focus of these pieces has been on institutions and disciplines. Cyberinfrastructure will bring significant changes to both, and their evolutions are intertwined. Cyberinfrastructure is not a one-way street but rather a massive intersection. Just as Web 2.0 has provided more of a user-oriented network in which communities create value from multiple, individual contributions, so the future limned here by our guests is one that will depend not only on large supercomputing centers and government agencies but on the changing practices of multiple arrays of individuals, all of whom are at work in designing this new environment.

NOTES

[1] “Shaping the Cyberinfrastructure Revolution: Designing Cyberinfrastructure for Collaboration and Innovation.” First Monday 12, no. 6 (June 2007). http://firstmonday.org/issues/issue12_6/bement/index.html. Accessed September 26, 2007.

[2] David Gelernter, “The Second Coming–A Manifesto.” The Edge, 2000. http://www.edge.org/3rd_culture/gelernter/gelernter_p1.html. Accessed October 30, 2007.

[3] National Science Foundation Office of Cyberinfrastructure, Cyberinfrastructure Vision for 21st Century Discovery, Sec3:46 (2007): Appendix B, “Representative Reports and Workshops.” http://www.nsf.gov/od/oci/CI_Vision_March07.pdf. Retrieved August 8, 2007.

[4] “The mission is for cyberinfrastructure to be human-centered, world-class, supportive of broadened participation in science and engineering, sustainable, and stable but extensible.” Cyberinfrastructure Vision, Sec3:2.

[5] Our Cultural Commonwealth, p.3.

[6] See for example, Edward Ayers’s questioning article, “Doing Scholarship on the Web: 10 Years of Triumphs and a Disappointment,” Chronicle of Higher Education 50, no. 21 (January 30, 2004) B24-25.

[7] See Michael Lesk, “Perspectives on DLI-2 – Growing the Field.” D-Lib Magazine 5 no. 7/8 (July/August 1999) http://www.dlib.org/dlib/july99/07lesk.html. Accessed October 30, 2007.

[8] For a superb introduction to the issues, see John Unsworth’s address at the 2004 annual meeting of the Research Libraries Group: “Cyberinfrastructure for the Humanities and Social Sciences.” http://www3.isrl.uiuc.edu/~unsworth/Cyberinfrastructure.RLG.html. Accessed October 30, 2007.

[9] See, for example, Renee Hobbs, Peter Jaszi, and Pat Aufderheide, The Cost of Copyright Confusion for Media Literacy. Center for Social Media, American University, 2007. http://www.centerforsocialmedia.org/files/pdf/Final_CSM_copyright_report.pdf. Accessed October 31, 2007.

[10] Unsworth, ibid.

[11] The classic document here is Tim Berners-Lee, James Hendler and Ora Lassila, “The Semantic Web.” Scientific American (May 2001). http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21. Accessed October 30, 2007. See Berners-Lee’s recent thoughts in Nigel Shadbolt, Tim Berners-Lee, and Wendy Hall, “The Semantic Web Revisited,” IEEE Intelligent Systems 21, no. 3 (May/June 2006): 96-101. http://eprints.ecs.soton.ac.uk/12614/01/Semantic_Web_Revisted.pdf. Accessed October 30, 2007.
