Notes for Susan Hockey, “The History of Humanities Computing”

Key concepts: archive, concordance program, edition, humanities, humanities computing, index verborum, lemmatization.


Related theorists: Paul Bratley, Theodore Brunner, Roberto Busa, Nancy Ide, Willard McCarty, Frederick Mosteller, Michael Sperberg-McQueen, David Wallace.


Introduction

Initial scope of humanities computing as applications to research and teaching within humanities or arts subjects, with a bias toward textual sources.

Rigor and systematic, unambiguous procedural knowledge characteristic of the sciences applied to humanities problems previously treated serendipitously, that is, through narratives relying on literary associations and prior scholarship.

(3) Suffice it to say we are concerned with the applications of computing to research and teaching within subjects that are loosely defined as “the humanities,” or in British English “the arts.” Applications involving textual sources have taken center stage . . . but by its very nature, humanities computing has had to embrace “the two cultures,” to bring the rigor and systematic unambiguous procedural methodologies characteristic of the sciences to address problems within the humanities that had hitherto been most often treated in a serendipitous fashion.

Beginnings: 1949 to early 1970s

Periodization of humanities computing begins with the 1949 to early 1970s era, marked by the signal work of the Busa project indexing the words of Aquinas.

Busa helped by Thomas Watson at IBM to transfer texts to punched cards and write a concordance program.

(4) Unlike many other interdisciplinary experiments, humanities computing has a very well-known beginning. In 1949, an Italian Jesuit priest, Father Roberto Busa, began what even to this day is a monumental task: to make an index verborum of all the words of St Thomas Aquinas and related authors, totaling some 11 million words of medieval Latin. Father Busa imagined that a machine might be able to help him, and, having heard of computers, went to visit Thomas J. Watson at IBM in the United States in search of support (Busa 1980). Some assistance was forthcoming and Busa began his work. The entire texts were gradually transferred to punched cards and a concordance program written for the project.
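To make the concept concrete, here is a minimal keyword-in-context (concordance) sketch in Python. It only illustrates what a concordance program does; it is not Busa's software, and the sample phrase is arbitrary.

    # Minimal keyword-in-context (KWIC) concordance sketch; illustrative only.
    import re
    from collections import defaultdict

    def build_concordance(text, width=4):
        """Map each word form to the word-window contexts in which it occurs."""
        words = re.findall(r"[a-zA-Z]+", text.lower())
        concordance = defaultdict(list)
        for i, w in enumerate(words):
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            concordance[w].append(f"{left} [{w}] {right}")
        return concordance

    sample = "In principio erat Verbum et Verbum erat apud Deum"
    for entry in build_concordance(sample)["verbum"]:
        print(entry)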

First humanities computing software developed to parse and lemmatize medieval Latin in what came to be a semi-automatic fashion; echoes of this in my own attempt to develop tapoc software to automatically write a dissertation.

Rudimentary hypertextual features advertised in Latin as “cum hypertextibus”; user guide in Latin, English, and Italian.

(4) His team attempted to write some computer software to deal with this and, eventually, the lemmatization of all 11 million words was completed in a semi-automatic way with human beings dealing with word forms that the program could not handle. . . . A CD ROM of the Aquinas material appeared in 1992 that incorporated some hypertextual features (“cum hypertextibus”) (Busa 1992) and was accompanied by a user guide in Latin, English, and Italian.
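A minimal sketch of the semi-automatic workflow described in the quote: the program resolves the word forms it knows and queues the rest for a human editor. The lookup table below is invented for illustration, not drawn from Busa's lexicon.

    # Semi-automatic lemmatization sketch: known forms resolved by lookup,
    # unknown forms queued for a human editor. The tiny table is illustrative.
    FORM_TO_LEMMA = {
        "est": "sum",
        "erat": "sum",
        "verbum": "verbum",
        "verbi": "verbum",
    }

    def lemmatize(forms):
        resolved, needs_review = {}, []
        for form in forms:
            lemma = FORM_TO_LEMMA.get(form.lower())
            if lemma is None:
                needs_review.append(form)  # left for the human editor
            else:
                resolved[form] = lemma
        return resolved, needs_review

    print(lemmatize(["erat", "Verbum", "principio"]))
    # ({'erat': 'sum', 'Verbum': 'verbum'}, ['principio'])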

Examples of quantitative approaches to style and authorship studies predating automatic computing dwarfed in scope by affordances of the latter.

(5) The use of quantitative approaches to style and authorship studies predates computing. . . . But the advent of computers made it possible to record word frequencies in much greater numbers and much more accurately than any human being can.

Early humanities computing work by Mosteller and Wallace on the authorship of the disputed Federalist Papers primarily interested in statistical methods, demonstrating consciousness of purposes as well as reflection on the expansion of techniques.

(5) Mosteller and Wallace were primarily interested in the statistical methods they employed, but they were able to show that Madison was very likely to have been the author of the disputed papers.
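A crude sketch of the underlying idea: compare relative frequencies of a few marker words across samples. Mosteller and Wallace's actual study used Bayesian inference over function-word rates; the sample texts and the per-thousand normalization here are illustrative only.

    # Crude frequency-based comparison of marker words; not the Bayesian
    # procedure Mosteller and Wallace actually used.
    import re

    MARKERS = ["upon", "while", "whilst"]  # function words of the kind examined in such studies

    def marker_rates(text):
        words = re.findall(r"[a-z]+", text.lower())
        total = len(words) or 1
        return {m: round(1000 * words.count(m) / total, 2) for m in MARKERS}  # per 1,000 words

    known_hamilton = "upon the whole, while the national government ..."
    disputed_paper = "whilst the latter confines itself to the people ..."
    print(marker_rates(known_hamilton))
    print(marker_rates(disputed_paper))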

Data input, storage, and representation recognized as key technological limitations; nod to Unicode as breakthrough.

(5) Character-set representation was soon recognized as a substantial problem and one that has only just begun to be solved now with the advent of Unicode, although not for every kind of humanities material.

Serial processing limitation of magnetic tape affected the encoding of historical material, forcing it into a single linear stream.

(6) Most large-scale datasets were stored on magnetic tape, which can only be processed serially. . . . This was not so problematic for textual data, but for historical material it could mean the simplification of data, which represented several aspects of one object (forming several tables in relational database technology), into a single linear stream.
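A small illustration of this point, assuming an invented historical record: several aspects of one object that would form separate tables in a relational database get flattened into a single linear stream for tape processing.

    # Invented example: one person's record as it might be normalized into
    # related tables, versus flattened into a single linear stream for
    # serial (tape) processing.
    person = {"id": 1, "name": "John Smith"}
    occupations = [
        {"person_id": 1, "year": 1851, "occupation": "weaver"},
        {"person_id": 1, "year": 1861, "occupation": "clerk"},
    ]

    # Tape-era flattening: everything squeezed into one linear record.
    flat = "1|John Smith|1851:weaver;1861:clerk"
    print(flat)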

COCOA concordance program provided markup and economical file space usage; fixed format coding the other major citation technique.

(6) Modeled on a format developed by Paul Bratley for an Archive of Older Scottish texts (Hamilton-Smith 1971), COCOA enables the user to define a specification for the document structure which matches the particular set of documents. . . . COCOA is also economical of file space, but is perhaps less readable for the human.
(6) The other widely used citation scheme was more dependent on punched card format. In this scheme, often called “fixed format,” every line began with a coded sequence of characters giving citation information.
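A simplified approximation of the two citation styles in Python; the exact COCOA and fixed-format conventions are not reproduced here, and the category letters and column layout are invented for illustration.

    # Simplified approximation of the two citation styles; not the exact
    # COCOA or fixed-format conventions.
    import re

    cocoa_text = "<A MILTON> <T PARADISE LOST> <B 1>\nOf Mans First Disobedience, and the Fruit"

    def parse_cocoa(text):
        """Collect <category value> references and return them with the text body."""
        refs = dict(re.findall(r"<(\w)\s+([^>]+)>", text))
        body = re.sub(r"<[^>]+>\s*", "", text).strip()
        return refs, body

    print(parse_cocoa(cocoa_text))

    # Fixed-format style: citation information packed into fixed columns at
    # the start of every line (column layout invented for this example).
    fixed_line = "MIL PL 001 001 Of Mans First Disobedience, and the Fruit"
    author, work, book, line_no = fixed_line[:3], fixed_line[4:6], fixed_line[7:10], fixed_line[11:14]
    print(author, work, book, line_no, fixed_line[15:])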

One-off IBM conference in 1964 a forerunner of later humanities computing conferences.

(6) In 1964, IBM organized a conference at Yorktown Heights. The subsequent publication, Literary Data Processing Conference Proceedings, edited by Jess Bessinger and Stephen Parrish (1965), almost reads like something from twenty or so years later, except for the reliance on punched cards for input.

Precursor of the ALLC/ACH conference series organized in 1970.

(6-7) The first of a regular series of conferences on literary and linguistic computing and the precursor of what became the Association for Literary and Linguistic Computing/Association for Computers and the Humanities (ALLC/ACH) conferences was organized by Roy Wisbey and Michael Farringdon at the University of Cambridge in March, 1970.

Computers and the Humanities journal began publication in 1966.

(7) Another indication of an embryonic subject area is the founding of a new journal. Computers and the Humanities began publication in 1966 under the editorship of Joseph Raben.

Dedicated computing centers established for humanities research in 1960s; TuStep software set high standards.

(7) The 1960s also saw the establishment of some centers dedicated to the use of computers in the humanities. . . . The TuStep software modules are in use to this day and set very high standards of scholarship in dealing with all phases from data entry and collation to the production of complex print volumes.

Key problems focused on textual material, the symbolic, inherited from the early period.

(7) Working in this early period is often characterized as being hampered by technology, where technology is taken to mean character sets, input/output devices and the slow turnaround of batch processing systems. . . . What is more characteristic is that key problems which they identified are still with us, notably the need to look at “words” beyond the level of the graphic string, and to deal effectively with variant spellings, multiple manuscripts, and lemmatization.

Consolidation: 1970s to mid-1980s

Widening range of interests at conferences and consolidation of common software platforms like Oxford Concordance Program noted during second period from 1970s to mid 1980s.

(8) ICCH attracted a broader range of papers, for example on the use of computers in teaching writing, and on music, art, and archeology. The Association for Computers and the Humanities (ACH) grew out of this conference and was founded in 1978.
(8) The second version of the COCOA concordance program in Britain was designed to be run on different mainframe computers for exactly this purpose (Berry-Rogghe and Crawford 1973). . . . Dissatisfaction with its user interface coupled with the termination of support by the Atlas Laboratory, where it was written, led the British funding bodies to sponsor the development of a new program at Oxford University. Called the Oxford Concordance Program (OCP), this software was ready for distribution in 1982 and attracted interest around the world with users in many different countries (Hockey and Marriot 1979a, 1979b, 1979c, 1980).

Oxford Text Archive an initiative to avoid duplication of effort in text archiving and maintenance; text preparation rather than programming began to take majority of project time.

(8) The need to avoid duplication of effort also led to consolidation in the area of text archiving and maintenance. With the advent of packaged software and the removal of the need for much programming, preparing the electronic text began to take up a large proportion of time in any project. The key driver behind the establishment of the Oxford Text Archive (OTA) in 1976 was the need simply to ensure that a text that a researcher had finished with was not lost.

TLG quintessential project focused on creating a new research archive versus preserving individual projects by others.

(8-9) The OTA's approach was to offer a service for maintenance of anything that was deposited. It managed to do this for some considerable time on very little budget, but was not able to promote the creation of specific texts. Groups of scholars in some discipline areas made more concerted attempts to create an archive of texts to be used as a source for research. Notable among these was the Thesaurus Linguae Graecae (TLG) begun at the University of California Irvine and directed for many years by Theodore Brunner. Brunner raised millions of dollars to support the creation of a “databank” of Ancient Greek texts, covering all authors from Homer to about AD 600, some 70 million words (Brunner 1993).

Debate over learning programming included its role as a replacement for Latin as mental discipline, but also objections that it was too difficult and distracting from core humanities work; principal languages SNOBOL and Fortran.

Programming replaced by interface use as primary humanities computing activity taking time from core practices.

(9) A debate about whether or not students should learn computer programming was ongoing. Some felt that it replaced Latin as a “mental discipline” (Hockey 1986). Others thought that it was too difficult and took too much time away from the core work in the humanities. The string handling language SNOBOL was in vogue for some time as it was easier for humanities students than other computer languages, of which the major one was still Fortran.

Despite the shift to disk storage, relational technologies still created problems in forcing information into tables.

(9) There were some developments in processing tools, mostly through the shift from tape to disk storage. Files no longer had to be searched sequentially. . . . However, relational technologies still presented some problems for the representation of information that needed to be fitted into tables.

Preponderance of vocabulary studies leveraging concordance programs.

Dissemination through conferences and journals the other primary feature of second period.

(9-10) A glance through the various publications of this period shows a preponderance of papers based on vocabulary studies generated initially by concordance programs. . . . The important developments during this period lay more in support systems generated by the presence of more outlets for dissemination (conferences and journals) and the recognition of the need for standard software and for archiving and maintaining texts.

New Developments: Mid-1980s to Early 1990s

Personal computer period of mid 1980s to early 1990s freed humanities computing from the computing centers, their expertise and scrutiny; result was much duplication of effort but also innovation, comparable to cathedral versus bazaar models of software development.

(10) The initial impact in humanities computing was that it was no longer necessary to register at the computer center in order to use a computer. Users of personal computers could do whatever they wanted and did not necessarily benefit from expertise that already existed. This encouraged duplication of effort, but it also fostered innovation where users were not conditioned by what was already available.

Macintosh attractive for ability afforded by GUI to display non-standard character sets and build hypertexts via the HyperCard programming tool.

HyperCard the first simple programming tool available to humanities scholars.

(10-11) The Apple Macintosh was attractive for humanities users for two reasons. Firstly, it had a graphical user interface long before Windows on PCs. This meant that it was much better at displaying non-standard characters. . . . Secondly, the Macintosh also came with a program that made it possible to build some primitive hypertexts easily. HyperCard provided a model of file cards with ways of linking between them. It also incorporated a simple programming tool for making it possible for the first time for humanities scholars to write computer programs easily.

Electronic mail addresses shared at the 1985 conference led to a new era of immediate communication, followed by the Humanist ListServ in 1987.

Humanist the model electronic discussion list, credited for developing and maintaining distributed scholarly community defining humanities computing.

(11) At the 1985 ALLC conference in Nice, electronic mail addresses were exchanged avidly and a new era of immediate communication began.
(11) On his [Willard McCarty] return from the [1987 ICCH conference] he discovered the existence of ListServ, and Humanist was born (McCarty 1992). The first message was sent out on May 7, 1987.
(11) Humanist has become something of a model for electronic discussion lists. . . . Humanist has become central to the maintenance and development of a community and it has made a significant contribution to the definition of humanities computing.

Development of TEI from SGML major intellectual development of third period, inspired by 1987 meeting at Vassar to ponder standard encoding scheme.

(12) In terms of intellectual development, one activity stands out over all others during this period. In November 1987 Nancy Ide, assisted by colleagues in ACH, organized an invitational meeting at Vassar College, Poughkeepsie, to examine the possibility of creating a standard encoding scheme for humanities electronic texts (Burnard 1988).

Text Encoding Initiative reflects interest in markup in addition to providing a systematic attempt to categorize and define all features of humanities texts of interest to scholars, yielding some 400 tags.

(12) The size, scope, and influence of the TEI far exceeded what anyone at the Vassar meeting envisaged. It was the first systematic attempt to categorize and define all the features within humanities texts that might interest scholars. In all, some 400 encoding tags were specified in a structure that was easily extensible for new application areas. The specification of the tags within the Guidelines illustrates some of the issues involved, but many deeper intellectual challenges emerged as the work progressed. Work in the TEI led to an interest in markup theory and the representation of humanities knowledge as a topic in itself.
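A much-simplified, TEI-style fragment to make the idea of descriptive tags concrete; this is not a complete or valid TEI document, only an illustration of markup that names textual features, parsed here with Python's standard library.

    # Much-simplified TEI-style fragment; not a complete or valid TEI document.
    import xml.etree.ElementTree as ET

    fragment = """
    <text>
      <body>
        <div type="poem">
          <head>Sonnet 18</head>
          <lg type="quatrain">
            <l n="1">Shall I compare thee to a summer's day?</l>
            <l n="2">Thou art more lovely and more temperate:</l>
          </lg>
        </div>
      </body>
    </text>
    """

    root = ET.fromstring(fragment)
    for line in root.iter("l"):
        print(line.get("n"), line.text)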

Divergence of spin-off disciplines like computers and writing, and linguistic computing, which served defense and speech-analysis communities without much communication between them.

(13) Gradually, certain application areas spun off from humanities computing and developed their own culture and dissemination routes. “Computers and writing” was one topic that disappeared fairly rapidly. More important for humanities computing was the loss of some aspects of linguistic computing, particularly corpus linguistics, to conferences and meetings of its own. . . . In spite of a landmark paper on the convergence between computational linguistics and literary and linguistic computing given by Zampolli and his colleague Nicoletta Calzolari at the first joint ACH/ALLC conference in Toronto in June 1989, there was little communication between these communities, and humanities computing did not benefit as it could have done from computational linguistics techniques.

The Era of the Internet: Early 1990s to the Present

Impact of the Web initially missed by entrenched humanities computing practitioners, just as Microsoft missed it; TEI adherents criticized HTML as a weak, appearance-based markup system like word-processor formats.

(13) Initially, some long-term humanities computing practitioners had problems in grasping the likely impact of the Web in much the same way as Microsoft did. Those involved with the TEI felt very much that HyperText Markup Language (HTML) was a weak markup system that perpetuated all the problems with word processors and appearance-based markup.
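To illustrate the complaint (with simplified examples, not taken from Hockey): appearance-based markup records only how text should look, while descriptive markup of the TEI kind records what the text is.

    # Simplified contrast between appearance-based and descriptive markup.
    appearance_based = "<i>Paradise Lost</i>"                 # italics: presentation only
    descriptive = '<title level="m">Paradise Lost</title>'    # names the feature (a work title)
    print(appearance_based)
    print(descriptive)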

Delivery of scholarly material over Internet became new focus.

(13) Anyone can be a publisher on the Web and within a rather short time the focus of a broader base of interest in humanities computing became the delivery of scholarly material over the Internet.

Archive versus edition debate, emphasizing navigation or scholarly added value; Hockey argues predominance of navigation concerns over analysis tools and techniques of prior period.

(14) The term “archive” was favored by many, notably the Blake Archive and other projects based in the Institute for Advanced Technology in the Humanities at the University of Virginia. “Archive” meant a collection of material where the user would normally have to choose a navigation route. “Edition” implies a good deal of scholarly added value, reflecting the views of one or more editors, which could be implemented by privileging specific navigation routes. SGML (Standard Generalized Markup Language), mostly in applications based on the TEI, was accepted as a way of providing the hooks on which navigation routes could be built, but significant challenges remained in designing and building an effective user interface. The emphasis was, however, very much on navigation rather than on the analysis tools and techniques that had formed the major application areas within humanities computing in the past.

Libraries new players in putting collections on the Internet.

(14) Although at first most of these publishing projects had been started by groups of academics, it was not long before libraries began to consider putting the content of their collections on the Internet.

Example of Orlando Project creating new material and forms of scholarly writing.

(14-15) With substantial research funding, new material in the form of short biographies of authors, histories of their writing, and general world events was created as a set of SGML documents (Brown et al. 1997). It was then possible to consider extracting portions of these documents and reconstituting them into new material, for example to generate chronologies for specific periods or topics. This project introduced the idea of a completely new form of scholarly writing and one that is fundamentally different from anything that has been done in the past.

New collaborative projects made possible by Internet technologies; importance of project management underappreciated.

(15) The Internet also made it possible to carry out collaborative projects in a way that was never possible before. . . . The technical aspects of this are fairly clear. Perhaps less clear is the management of the project, who controls or vets the annotations, and how it might all be maintained for the future.

TEI extensibility clashed with needs of libraries for durable, closely followed standards, raising questions about philosophy of TEI.

(15) The TEI's adoption as a model in digital library projects raised some interesting issues about the whole philosophy of the TEI, which had been designed mostly by scholars who wanted to be as flexible as possible. Any TEI tag can be redefined and tags can be added where appropriate. A rather different philosophy prevails in library and information science where standards are defined and then followed closely – this to ensure that readers can find books easily.

Multimedia added a new dimension to humanities electronic resources, but at the time of writing mostly limited to manuscript images, awaiting ubiquitous high-speed access, perhaps through convergence with television.

(15) An additional dimension was added to humanities electronic resources in the early 1990s, when it became possible to provide multimedia information in the form of images, audio, and video. . . . The potential of other forms of multimedia is now well recognized, but the use of this is only really feasible with high-speed access and the future may well lie in a gradual convergence with television.

Media theorists began studying electronic resources themselves, especially hypertext.

Noticeable gap between sayers and doers among media theorists.

(16) Electronic resources became objects of study in themselves and were subjected to analysis by a new group of scholars, some of whom had little experience of the technical aspects of the resources. Hypertext in particular attracted a good many theorists. This helped to broaden the range of interest in, and discussion about, humanities computing but it also perhaps contributed to misapprehensions about what is actually involved in building and using such a resource. Problems with the two cultures emerged again, with one that was actually doing it and another that preferred talking about doing it.

Introduction of academic programs final symptom of emerging discipline; compare to Hayles survey in How We Think.

(16) The introduction of academic programs is another indication of the acceptance of a subject area by the larger academic community. For humanities computing this began to happen by the later 1990s although it is perhaps interesting to note that very few of these include the words “Humanities Computing” in the program title.

Other parties trying to define the field and provide research agendas.

(16) As the Internet fostered the more widespread use of computers for humanities applications, other organizations began to get involved. This led to some further attempts to define the field or at least to define a research agenda for it.

Conclusion

TEI credited with influence on development of XML, especially its hyperlinking mechanisms.

(16-17) It [the TEI] represents the most significant intellectual advances that have been made in our area, and has influenced the markup community as a whole. The TEI attracted the attention of leading practitioners in the SGML community at the time when XML (Extensible Markup Language) was being developed and Michael Sperberg-McQueen, one of the TEI editors, was invited to be co-editor of the new XML markup standard. The work done on hyperlinking within the TEI formed the basis of the linking mechanisms within XML.

Humanities computing can grow interest in cultural heritage among lifelong learners and the general public, which Bauerlein should praise.

(17) Humanities computing can contribute substantially to the growing interest in putting the cultural heritage on the Internet, not only for academic users, but also for lifelong learners and the general public. Tools and techniques developed in humanities computing will facilitate the study of this material and, as the Perseus Project is showing (Rydberg-Cox 2000), the incorporation of computational linguistics techniques can add a new dimension.



Hockey, Susan. “The History of Humanities Computing.” A Companion to Digital Humanities. Eds. Susan Schreibman, Ray Siemens, and John Unsworth. Malden, MA: Blackwell, 2004. 3-19. Print.