How much time do academics waste chasing down references and managing them right now? The ceremony of fishing for, saving, organizing and inserting references may take up a significant share of any academic’s working time.
James Howison & Abby Goodrum make a very good point about how little use we currently make of metadata. Why do music and images get tagged, but not academic papers? You can easily search by artist name, but not by author name when using PDFs (not natively, at least). In my case, I try to make up a filename that contains all the key terms, author names, etc. that I anticipate I may need. Then I index the filenames only (not the full text) using a desktop search program (locate 3.0). This is definitely a lot worse than the way my music is organized, and I didn’t dedicate much time to my music, since it already came tagged or was easily mass-tagged using a program that talks to Amazon or CDDB. I wonder how we got to the point where, even after dedicating ten times more resources to organizing references than to music, references are still harder to find and handle.
Howison ventures to say that the experience of managing mp3s is far more fluid than managing any other documents, certainly more than managing pictures, Word documents, and of course, academic papers in PDF form. This is simply because music files have embedded metadata that travels with the media, while academic papers don’t.
Right now, publishers use pdf because of its ease of distribution, but implicitly and silently assume that articles will be printed out for reading and that librarians will still want a paper copy using up space on a shelf (even though most academics I know try to avoid the trip to the library more and more).
This implies that many decisions are made with a ‘paper world’ in mind: black-and-white images (cheaper to print), limits on page counts, the structure of the page (two-column layouts), etc. And figures are, of course, static.
Metadata that travels with the reference
It seems that latest-generation reference managers like Zotero and collaborative sites like CiteULike and Connotea are getting closer to a tag-based (metadata) classification of PDFs. Still, since the metadata do not travel with the PDF, the tagging effort is not shared when a reference is sent by email, for example.
Actually, the PDF standard has the ability to store a pre-defined and limited set of metadata, such as author, title, etc., but these fields are rarely used by the document creators (publishers mainly). Rudimentary as it might be (no fields for journal name, pages, etc.), its use could alleviate some of the problems we face today.
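To make the idea concrete, here is a minimal Python sketch of reading those built-in fields. It only scans the raw bytes for literal-string entries such as /Title and /Author, so it is an illustration, not a real PDF parser: files that store the information dictionary in compressed streams, as hex strings or in UTF-16 would need a proper PDF library. The function name and the sample bytes are made up for this example.

```python
import re

# Matches "/Key (literal string)" for a few standard information-dictionary keys
# and tolerates backslash-escaped characters inside the string.
_INFO_RE = re.compile(rb"/(Title|Author|Subject|Keywords)\s*\(((?:\\.|[^\\()])*)\)")


def read_info_fields(pdf_bytes):
    """Naively extract literal-string metadata fields from raw PDF bytes.

    Assumption: the fields are stored uncompressed as simple literal strings.
    """
    fields = {}
    for key, raw in _INFO_RE.findall(pdf_bytes):
        # Undo the escapes for backslash and parentheses.
        value = re.sub(rb"\\([\\()])", rb"\1", raw)
        # Keep the first occurrence of each key.
        fields.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return fields


# Hypothetical fragment of a PDF file, written by hand for this example.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n"
    b"<< /Title (Metadata that travels \\(with the paper\\)) /Author (Example Author) >>\n"
    b"endobj\n"
)

if __name__ == "__main__":
    print(read_info_fields(SAMPLE))
```

Note what is missing from the result: there is no standard key for journal name or page range, which is exactly the limitation described above.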
In an ideal world, a committee of a few large universities would impose a new metadata standard, and maybe create open-source software libraries to handle that kind of metadata. Publishers would then create PDFs that follow these guidelines.
However, a new format that aims to standardize documents, the OASIS OpenDocument Format (ODF), could be the solution. ODF is overseen by the Organization for the Advancement of Structured Information Standards (OASIS), and OpenOffice has been behind it for a while. Word processors also seem to be slowly taking an interest in reference management: Word 2007 features a reference manager, although it is really primitive and not usable for serious academic work. If ODF becomes a de facto standard, we may not need to rely on PDF. And ODF is XML, so adding different fields that can be mined by reference managers shouldn’t be hard. That way, the metadata is no longer an extension of the document: the entire document could be parsed, and each component could contribute to its indexing. This would make it easy to do what CiteSeer is trying to do ‘the hard way’ (parsing author, title, etc. out of the papers that we academics have on our homepages, and making them available and searchable).
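As a rough illustration of why XML helps, the Python sketch below reads the meta.xml part of an ODF package (an ODF file is a zip archive) and collects the title, the creator and any user-defined fields. The "journal" and "pages" fields are hypothetical names chosen for this example, not an agreed standard, and the sample package contains only meta.xml, so it is not a complete ODF document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the metadata part of an ODF package.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
    "meta": "urn:oasis:names:tc:opendocument:xmlns:meta:1.0",
}

# Hand-written sample; the user-defined field names are assumptions.
SAMPLE_META = """<?xml version="1.0" encoding="UTF-8"?>
<office:document-meta
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0">
  <office:meta>
    <dc:title>Example Paper</dc:title>
    <dc:creator>Example Author</dc:creator>
    <meta:user-defined meta:name="journal">Example Journal</meta:user-defined>
    <meta:user-defined meta:name="pages">1-10</meta:user-defined>
  </office:meta>
</office:document-meta>
"""


def build_sample_package():
    """Create an in-memory zip holding only meta.xml (not a full ODF file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("meta.xml", SAMPLE_META)
    return buffer.getvalue()


def read_odf_metadata(odf_bytes):
    """Return title, creator and user-defined fields from an ODF package."""
    with zipfile.ZipFile(io.BytesIO(odf_bytes)) as package:
        root = ET.fromstring(package.read("meta.xml"))
    meta = root.find("office:meta", NS)
    result = {}
    if meta is None:
        return result
    for tag in ("title", "creator"):
        element = meta.find(f"dc:{tag}", NS)
        if element is not None and element.text:
            result[tag] = element.text
    # Attribute names are namespace-qualified in ElementTree: "{uri}name".
    name_attr = f"{{{NS['meta']}}}name"
    for element in meta.findall("meta:user-defined", NS):
        result[element.get(name_attr)] = element.text or ""
    return result


if __name__ == "__main__":
    print(read_odf_metadata(build_sample_package()))
```

A reference manager could read such fields without any guessing, which is the opposite of scraping author and title out of page layout.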
The need is there. I think the company or university department that gets this right will have a winner. For example, the Zotero forums express this need as follows:
(post by CuriousGeorge) Here is what I would like to do ideally:
1. Begin literature review on new topic using databases like JSTOR, Proquest, and Web of Science.
2. Use Zotero’s current “folder” icon in address bar to select articles of interest.
3. Zotero downloads citation information (this already works well), abstract (this often works), and the associated PDF file (with this option enabled in Zotero preferences, it currently works well in JSTOR but not other databases like Proquest).
4. Zotero stores all PDFs in one folder and automatically renames the PDFs based on the associated citation information in the format “Author, Year, Article Title.pdf” (or customized format selected by user).
5. PDFs are read in the browser window and notes are taken in the associated Zotero entry.
6. Zotero allows search in any combination of citation information, abstract/notes, and full text of website/PDF snapshots (stored locally).
7. Lit Review is built by creating new notes that synthesize various articles (these notes take advantage of the “related” option in Zotero to link back to the associated references).
8. The lit review notes and “related” citations are exported to a word processor.
9. The word processor is dynamically linked to the Zotero database for adding new citations and for searching the Zotero database for quotes/notes.
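Step 4 of this wish list, automatic renaming to “Author, Year, Article Title.pdf”, is easy to sketch once the citation information is available. The Python function below is only an illustration and not how Zotero works: its name, the list of characters it removes and the title length limit are all assumptions.

```python
import re


def citation_filename(author, year, title, max_title_length=80):
    """Build 'Author, Year, Article Title.pdf' from citation fields."""

    def clean(text):
        # Drop characters that are unsafe in filenames on common systems.
        text = re.sub(r'[\\/:*?"<>|]', "", text)
        # Collapse runs of whitespace into single spaces.
        return re.sub(r"\s+", " ", text).strip()

    # Truncate long titles (the limit is an arbitrary choice for this sketch).
    short_title = clean(title)[:max_title_length].rstrip()
    return f"{clean(author)}, {year}, {short_title}.pdf"


if __name__ == "__main__":
    # Hypothetical citation, used only to show the format.
    print(citation_filename("Doe", 2007, "Tagging papers: what works?"))
```

The filename still only carries three fields, so it is a workaround for, not a replacement of, metadata embedded in the file itself.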