Project Casaubon

Highly Preliminary Notes on Next-generation Scholarly Communications

Joseph J. Esposito
Portable CEO
(831) 425-1143
espositoj@gmail.com

November 28, 2004

[This is an early draft of an evolving project.  It is being added to the "library" section of the Processed Book Project to give it an address online, as it is mentioned in the project summary.]

This is a rough, early, and incomplete draft of some ideas concerning scholarly communications.  The core idea is ambitious; it is not clear if it is workable.  It is also sufficiently "big picture" to give me pause; accordingly, I have named it "Project Casaubon" in an ironic and humbling reference to the sterile pedant of George Eliot's magnificent Middlemarch, whose never-finished "Key to All Mythologies" is a tonic example to those who don't believe in starting small and working from the bottom up.

Throughout I will be referring to and borrowing from various writers and trends in the communications field, and I apologize in advance for not always being scrupulous about citing who said what where and first.  The aim here is not to say something new but to craft a speculative synthesis of various ideas and themes now current.  Indeed, it is my view, and this is my central thesis, that we are now in a position to tie a number of disparate projects together and that the whole will happily be greater than the sum of the parts.

1. The backdrop.  Lurking in the background of this discussion is the question of "the future of publishing."  There are different conjectures as to what this future will look like, conjectures that vary by segment (STM journals, trade publishing, college texts, K-12, etc.), time frame (what will happen next year or in three years or in ten—and everybody's favorite: what will happen ultimately, presumably just before the sun burns itself out), and perspective (academic librarian, stock analyst, literary agent, not-for-profit STM publisher, etc.).  While there are generic questions (publishing is, after all, publishing), some topics are highly specific, such as the difference between when K-12 classrooms become fully digitized (probably never) and when STM journals cease to publish in hardcopy.  In this memorandum I would like to focus primarily on academic publishing, though I will draw in some illustrations from other areas.  The academy is, after the financial industry, the most wired segment of the society, and what happens there will likely be a model for the dissemination of content everywhere else.

2. The current situation. No summary of the current situation is going to please everybody, but I will stick my neck out and say that while just about everybody in academic publishing today believes that this is a transitional period, for the most part the vision of the next period is only a half-step from the old, as though when paradigms shift they do so in digestible amounts so as not to disrupt anyone's metabolism.  Oddly, considering the implicit conservatism of "half-step-ism," discourse concerning scholarly communications often resembles CNN's Crossfire, in which participants simply shout at each other.  So we hear, for example, that the world of Open Access is coming tomorrow and will destroy all legacy publishing—and good riddance.  Or we hear that alternatives to proprietary journals will compromise peer review, leaving the public at risk for fraud.  Or (my personal favorite) someone will assert that traditional publishers should continue to do everything that they currently do, including ongoing investment of capital, but that they should stop charging for access, for the good of humanity.  I know this last point sounds bizarre in my formulation, but it was the position asserted by a Nobel Laureate to a client of mine, a leading STM publisher.  (I have made my own contribution to the Crossfire-like debate in an article on Open Access, The Devil You Don't Know.)  My own view, in quieter moments, is that we are beginning to see the convergence of a number of strands of scholarly publishing, that this convergence will not take place overnight, but that if we can see where things are heading, we can begin to build what we will need when we get there.  In other words, we need long-term vision coupled with practical plans, and we would do well to avoid the histrionics that somehow have beset the research community over this issue.

3. Anecdote concerning Hewlett-Packard. In early 2002 I consulted with HP concerning a publishing-related technology called Digital Content Remastering, or what I called optical character recognition and scanning on steroids.  As part of my project I began to investigate how the HP corporate library worked with documents, both internally generated documents and formally published ones.  Despite the fact that HP is one of the world's largest tech companies and is in the business of building archives and services for other companies, I was surprised to discover that formally published documents did not reside on the HP Intranet except in rare and haphazard instances, where someone may have clipped something and embedded it in another document, usually email.  I had a long discussion about this with the head of the library, who told me that it was a conscious decision on her part to host no content.  She had been through that, especially when CD-ROM was the medium of choice, and she quickly saw that her library staff was being converted into computer systems administrators.  It was simply more efficient to purchase passwords from a publisher, post them on the Intranet, and have librarians and other corporate users access the content on the publisher's site.

There are some big implications in this.  The main one is the fact that in order to search for a particular topic, a researcher (whether a library professional or, say, the product manager for DeskJet printers) had to log onto numerous sites; the inefficiency of this is staggering.  There was no opportunity for aggregated search, dynamic linking, or indeed anything else that digital media uniquely make possible.  HP had taken a half-step from the printed page, but not the full step into a full-blown digital repository.

At the same time, HP, like all major corporations, devotes a lot of time to such things as content management and so-called knowledge management.  (The term knowledge management strikes me as pretentious, but corporate America can't live without it.)  Knowledge management implies the existence of a corporate repository, not unlike the institutional repositories that many universities are now contemplating or building.  There is, alas, no discussion of marrying the repository for knowledge management with the content from formally published documents.  Indeed, it would be exceedingly difficult to do so, as the formally published content resides off-site (on multiple sites, each with its own set of access controls) and is managed by an entirely different department from the LAN administrators.

4. Articulating the problem. I submit that the situation at many universities is not unlike HP's, though, of course, the information-management requirements of even a second-tier research institution dwarf those of the largest corporations.  The problem is simply that a document has meaning and value in its stand-alone form, but a different meaning and potentially a far greater value when it is put into the context of other documents.  All documents, in other words, are expressions and extensions of an intellectual community, not just of individuals, as we all stand on the shoulders of giants, but the way we store information understates the role and value of that community.  (I have developed this point at greater length in The Processed Book.)

Let's think for a moment of the tangible expression of such a community, the documents that capture the discourse (not knowledge, but discourse; an activity, not a fixed state) of a particular field.  I defer to librarians to create a complete taxonomy, but I will offer an ad hoc tool here for purposes of discussion.  First we have formally published documents.  The major research libraries each spend between $4 million and $20 million annually on materials (and a figure about double that to process and access that content).  These figures do not include the budgets for rare book archives, nor do they address the special case of Harvard, which spends more by far.  Every year a larger proportion of those materials is delivered in electronic form, and most electronically published documents are accessed over the Internet from a publisher's or intermediary's (e.g., Highwire's) servers.  Second, there are institutional repositories such as the new DSpace service, which archive the intellectual output of faculty and staff.  (DSpace, incidentally, was developed by HP, which doesn't use it.)  Third, there are the collections at libraries that are unconstrained by copyright law (e.g., a collection of private documents, which may be in any medium).  Finally, there are a library's holdings of copyrighted material that is not in a suitable form for online access (e.g., hardcopy monographs), even assuming that the copyright holders would authorize it.  What is needed is to get all of these categories onto the same software platform, where the documents can be mutually supportive of one another through cross-referencing, linking, and various machine processes that are only now coming into view (e.g., using a complete document as a search query applied to the entire repository, yielding a network of related documents whose relationships to one another are indicated without manual editorial intervention).
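The idea of using a complete document as a search query can be made concrete with a small sketch.  What follows is only an illustration under stated assumptions, not a description of any existing Casaubon tool: it assumes the repository is a plain mapping of document identifiers to text, and it swaps in a standard technique (TF-IDF weighting with cosine similarity) to rank every other document by its closeness to the query document.  The function names and the toy repository are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using smoothed TF-IDF."""
    counts_by_doc = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in counts_by_doc.values():
        doc_freq.update(counts.keys())  # each term counted once per document
    return {
        doc_id: {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, tf in counts.items()
        }
        for doc_id, counts in counts_by_doc.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_documents(query_id, docs, top_n=3):
    """Rank all other documents by similarity to the whole query document."""
    vectors = tfidf_vectors(docs)
    query = vectors[query_id]
    scores = [
        (other_id, cosine(query, vector))
        for other_id, vector in vectors.items()
        if other_id != query_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Toy repository, invented purely for illustration.
TOY_REPOSITORY = {
    "monograph": "orphans and inheritance in the victorian novel",
    "article": "the victorian novel and the figure of the orphan",
    "archive": "private letters on crop yields in norfolk",
}

if __name__ == "__main__":
    for doc_id, score in related_documents("monograph", TOY_REPOSITORY):
        print(doc_id, round(score, 3))
```

Running the same ranking with each document in turn as the query would yield the network of related documents described above, with no manual editorial intervention; a production system would of course need stemming, better weighting, and an index rather than a full scan.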

Integrating all these documents may seem like a hopeless, vainglorious task, and perhaps it is, at least if it is to be accomplished in one go.  That hasn't stopped people from trying, though; indeed, one strain of the Open Access debate envisions a seamless network of all research materials available to anyone, anywhere, at any time, at no cost to the user and with the convenience of a mouse-click.  As a practical matter, however, there are so many things in the way of this New Alexandrian Library, ranging from copyright conflicts to fundamental questions as to how the infrastructure is to be defined, built, and funded, that it may be that a less ambitious tortoise will get much further than the hares of Alexandria and Open Access.

5. The Casaubon platform.  The goal of Project Casaubon would be to create a common platform for all academic materials.  We needn't get into where the data for such a project would physically reside, since with the Internet and given enough bandwidth to move things around rapidly, physical location is meaningless to a user equipped with a Web browser.  It may be that Casaubon would have a single, global, centralized database to house everything.  (I am not getting into questions of back-up and long-term archiving.)  Or it may be that a Casaubon database would be unique to each institution.  Or perhaps some institutions would have special prominence and house data for a consortium (Example: Stanford, as the preeminent institution for the central coast of California, might manage the database for any number of smaller institutions such as San Jose State and Cabrillo College).  Or perhaps Casaubon would distribute the database across a wide range of institutions.  No matter: this is for computer architects to address.  It is not for technical reasons that Casaubon does not exist today, but because the rationale to place content into such a database is missing.

Project Casaubon would include a set of specifications for what a Casaubon-compliant document is.  Publishers would be required to put their documents into this format; failure to comply would cut into sales.  The published document would then be uploaded to the Casaubon repository, where the various Casaubon tools for searching, indexing, automated linking, and much, much more could be brought to bear on the document.  The library would no longer license content that resides on a publisher's server or anywhere else other than Casaubon.  The repository would grow with the Casaubon-compliant documents, creating a seed to attract other documents.

My instinct is that the place to start such a project is with scholarly monographs, in part because this is something of a neglected part of a library's thinking about digitization, at least in comparison to the digital energy that is going into the serials area, especially for STM.  A few years ago Questia, a commercial start-up, attempted to do something like this, but the plan was doomed from the beginning, as it bypassed libraries and took on the burden of having to market to 15 million undergraduates, essentially giving Questia the overhead of a consumer marketing company.  With Casaubon the costs are a fraction of Questia's and the utility is much greater.  After the initial cost of developing the specifications for the repository or repositories and setting up the database management system with attendant user interfaces, the repository grows book by book, article by article, one small philanthropic grant at a time.  Thus, the cost of building the system is stretched out over a long period of time and the work is spread across the global academic community.

6. The "multicultural" archive.  The multiple cultures I refer to here are the cultures of commerce and the not-for-profit sector.  One of the curious things about much of the current discussion of scholarly communications is the strong feeling in the not-for-profit world that anything that touches the world of commerce is necessarily tainted.  Putting purists and extremists aside (that is, the kind of people who see Starbucks as an agent of global imperialism), there is still a large contingent of scholars and librarians who are very suspicious of anything involving commercial publishers.  And not without good reason: commercial publishers have acted horribly toward the library community in recent years, often combining extortionate price increases with deteriorating customer service.  Thus, more and more members of the academic community now seek to build their own separate world for scholarly communications.  It is in this separate world that the Open Access sect resides.

The problem with this is that the future of scholarly communications will be pluralistic.  Reed Elsevier, Springer, and John Wiley are not going away anytime soon.  Even if the most optimistic projections of OA advocates come to pass, in five years' time a majority of the 24,000 peer-reviewed journals will still be published under a proprietary user-pays model.  The research community will be creating and assessing materials from professional societies, commercial publishers, newly formed OA publishers, institutional repositories, preprint servers, and more.  It simply is inefficient for academic librarians to attempt to build "a world elsewhere."  Better to encourage these various forms of communications to work together, to become mutually supportive, for the benefit of the research community.  (Here it should be noted that for the most part, OA is a shibboleth of librarians.  With some important exceptions, researchers have not supported it.  It is simply wrongheaded for the library community to attempt to foist a system of scholarly communications upon the research community that libraries ostensibly serve.)

Creating a pluralistic environment for scholarly communications is not simply practical; it is also shrewd.  Commercial publishers deploy capital.  If that capital is invested into infrastructure that is also used by not-for-profit practitioners, the not-for-profits benefit.  It is axiomatic that the best source of funding is other people's money.

7. Documents and metadocuments.  If all scholarly documents were placed atop a common platform irrespective of source or type, and if the aim of such an integrated database were simply to find a particular document, then the Casaubon project is not worth the effort.  Project Casaubon should be more than a larger database than currently exists upon which searches can be conducted.  The aggregated document, the metadocument, must display qualities that are not discernible in any single document resident in the database.  There is a world of difference between Dickens's Great Expectations and the field of Victorian literature.  We can and should ask different questions of a single text or document (e.g., Great Expectations) and an entire database.  For example, it is meaningful to note that the protagonist of GE is an orphan, but it is also meaningful to note how often orphans appear in Victorian literature.  Another kind of question is what percentage of all characters of the literature of the period are orphans, how prominent their roles are (measured by number of appearances or lines of dialogue), and their fates in their fictional world.  We may wonder how many orphans there were in the real Victorian world (statistics adjusted by country, region, religion, gender, and economic status) and map these figures onto those derived from literature, with the aim of developing a variance report—in effect, a statistical measure of the gap between reality and literary realism.  The point here is not that any particular analysis is valuable, but that the tools to allow a researcher to draw useful or tedious conclusions are not available.  The metadocument of Project Casaubon becomes, as it were, a literary document in itself and is thus primed for investigation from any perspective the research community cares to bring to it.
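The orphan example can be sketched as a small calculation.  This is a minimal illustration under loud assumptions: it supposes that some earlier process has already tagged each character in the corpus with the fields shown, and that a real-world orphan rate is supplied from an outside source.  The `Character` record, the function names, and the idea of expressing the "variance report" as a simple difference of proportions are all hypothetical choices made here, not anything specified for Casaubon.

```python
from dataclasses import dataclass


@dataclass
class Character:
    """One fictional character, as tagged by some upstream process (assumed)."""
    name: str
    work: str
    is_orphan: bool
    lines_of_dialogue: int


def orphan_share(characters):
    """Fraction of all characters in the corpus who are orphans."""
    if not characters:
        raise ValueError("no characters supplied")
    return sum(1 for c in characters if c.is_orphan) / len(characters)


def orphan_dialogue_share(characters):
    """Fraction of all dialogue lines spoken by orphans (a prominence measure)."""
    total = sum(c.lines_of_dialogue for c in characters)
    if total == 0:
        return 0.0
    return sum(c.lines_of_dialogue for c in characters if c.is_orphan) / total


def realism_gap(characters, population_orphan_rate):
    """Literary orphan share minus the supplied real-world rate.

    A positive result means orphans are over-represented in the literature
    relative to the population figure passed in by the caller.
    """
    return orphan_share(characters) - population_orphan_rate
```

A researcher would call `realism_gap(characters, rate)` with a rate drawn from historical statistics for the country, region, or class of interest; no such figure is assumed here.  The value of a common platform is that the `characters` list could be drawn from the whole metadocument rather than from a hand-built sample.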

It is noteworthy that this is an area where the corporate world is already significantly advanced.  Usually called data mining, analyses of large databases are becoming routine, and for the most part with little regard to the content of any particular document included in the database; rather, the emphasis is on discovering themes that straddle multiple documents.  The business press is filled with stories of how corporate marketers study customer behavior in order to help design new programs and products.  When we swipe our shoppers' cards at the supermarket checkout, we are participating in the creation of a database about which we know little.  Interestingly, partly for privacy reasons and mostly because the information is of no economic interest, personal identifications are usually stripped from the data collected.  The Safeway supermarket chain does not care that Joseph Esposito purchased a jar of Advil (extra large!), but it does care that the purchase was of a particular size, in a particular ZIP Code, at a certain time of day, and that along with that purchase Esposito, now anonymous, also purchased a number of other items, revealing a pattern or trend that could lead to new marketing activity.  There is a sense that the author of Great Expectations is not necessarily relevant to the study of the period, as large databases lead to statistical extrapolations, which are inherently impersonal.

8. The agnostic research platform. Intellectual historians may wonder why research platforms are designed as they are and what implications are built into those designs.  Speaking as an outsider to the academic community, I am fascinated by the emphases on authenticity, originality, and precedence.  Viewing edgy, post-modern independent films through my Netflix subscription, I wonder if authenticity is all it's cracked up to be; and as I read Lawrence Lessig's commentary on the current intellectual property system, with the emphasis on how creative people borrow or reuse materials already in existence, the idea of originality does not seem self-evident.  It's clear that precedence is important, at least for the credentialing system of the modern research university.  But just as Dickens himself is not always relevant to the study of certain aspects of the world in which he lived, even when his texts are part of such a study, similarly the identity of a particular researcher and his or her place in the development of a particular line of thought are not always at the center of an analysis of a field.  There is Freudianism without Freud and market capitalism without Adam Smith.

A platform for academic research materials should do more than assist university administrators in reaching tenure decisions.  It should also enable the creation of new software tools for inventive investigations of the growing database of materials.  In software parlance, the platform must have an exposed Application Program Interface, or API.  It is not possible to anticipate all the tools that can and will be created; that is precisely the point.  The aim of the platform is to encourage investigations of which we may have no inkling at this time.  There may be Marxist tools and Structuralist tools and Feminist tools and tools of any other stripe; Casaubon must make no comment.  This is different from the tools currently available (though the number and kind of these tools grow daily).  Researchers now can in many cases search the text of a particular document or search a collection of documents or search multiple collections (federated searching); and researchers can often follow links for citations, which bring them to new documents.  But there is no tool to help ascertain, say, the use and implications of architectural metaphors in life sciences research from the 1930s to the present or to map the entire content of a database against an unabridged dictionary, whose entries have been classified by domain (the set of mathematics or tennis, for example).  Information technology is now used primarily to find a particular document with its presumably fixed text, but the fixity of the text is itself a cultural assumption, which can limit certain kinds of research.  One does not have to be an advocate of dynamic or processed text to believe that research in this area is worthy of pursuit.
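To suggest what an exposed API might mean in practice, here is a minimal sketch, offered as an assumption rather than a design.  The `Repository` class, its method names, and the `domain_profile` tool are all hypothetical; no Casaubon API exists.  The sketch shows a platform that holds documents, lets outside developers register tools it knows nothing about, and runs them on request.  The sample tool maps the words of the repository against a lexicon whose entries have been classified by domain, in the spirit of the dictionary example above.

```python
import re
from collections import Counter


class Repository:
    """A toy document store with a plug-in point for third-party tools."""

    def __init__(self):
        self._documents = {}
        self._tools = {}

    def add_document(self, doc_id, text):
        """Store a document under an identifier."""
        self._documents[doc_id] = text

    def register_tool(self, name, func):
        """Register a tool: any callable taking (documents, **options).

        The platform passes no judgment on what the tool does.
        """
        if name in self._tools:
            raise ValueError(f"tool already registered: {name}")
        self._tools[name] = func

    def run_tool(self, name, **options):
        """Run a registered tool over a copy of the document mapping."""
        return self._tools[name](dict(self._documents), **options)


def domain_profile(documents, lexicon):
    """Count word occurrences per domain, given a {word: domain} lexicon."""
    counts = Counter()
    for text in documents.values():
        for word in re.findall(r"[a-z]+", text.lower()):
            domain = lexicon.get(word)
            if domain is not None:
                counts[domain] += 1
    return dict(counts)


if __name__ == "__main__":
    repo = Repository()
    repo.add_document("paper-1", "The cell wall acts as a scaffold.")
    repo.register_tool("domain_profile", domain_profile)
    # The lexicon below is an invented stand-in for a classified dictionary.
    toy_lexicon = {"wall": "architecture", "scaffold": "architecture", "cell": "biology"}
    print(repo.run_tool("domain_profile", lexicon=toy_lexicon))
```

The design choice that matters is the registration step: the platform defines only how a tool receives documents and options, so a tool of any theoretical stripe can be added without changes to the platform itself.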

9. Envisioning Project Casaubon.  Casaubon would be a software platform—not simply software but a software platform.  It would encourage the writing of new applications.  It seems desirable that the platform be published under an Open Source license and that all extensions of the platform also be Open Source.  (There are philosophical and economic issues to be reviewed concerning Open Source.)  The platform would serve as a repository for academic materials; specifications for the structure of documents to be placed in the repository would be published for the use of all participants.  It is envisioned that the documents would cover the widest range, from commercially published materials to digitized versions of hardcopy and microfilm archives.  This is emphatically not an Open Access project, though OA materials are likely to be included in the repository, which will be scrupulously pluralistic.

Casaubon would bring together a number of different projects and activities in the academic community.  OA publications, including preprints, would mingle with formal, peer-reviewed papers from established proprietary journals.  Archival materials would reside side by side with today's newspaper; one hopes a software tool would be devised to link the newspaper automatically to related material in the archives.  Every document is potentially presented, as it were, as the sun in a Copernican universe, around which orbit the "planets" drawn from the broader database.  Current projects (e.g., Oxford Scholarship Online) would be migrated to Casaubon, to the mutual advantage of Casaubon and Oxford Scholarship itself.  New methods of investigation would continually be added to the service, which would grow in size and in the number of ways it can be viewed and manipulated.  Casaubon would include business rules for the use of materials, rules designed to encourage the broadest participation by the creators of documents.

There is an assumption at the bottom of this proposal, namely, that the way to tie a number of projects together is to tap the creative energies of as many members of the academic community as possible, who will participate as authors, software developers, sources of funding, and curators.  The project is to be designed as something to which things, including pieces of the platform itself, can and will continually be added.  Although there will be central administration of the development of the platform, this is not a "top-down" planning activity, but rather a project whose aim is to encourage participation at as many nodes in the network as possible.

Copyright © 2004 by Joseph J. Esposito