On metadata, indexing, and mucking around with PDFs

February 19th, 2007 by jose

How much time do academics waste chasing down and managing references right now? The ritual of finding, saving, organizing and inserting references can take up a significant share of any academic's time.

James Howison & Abby Goodrum make a very good point about how little use we currently make of metadata. Why do music and images get tagged, but not academic papers? You can easily search your collection by artist name, but not by author name when dealing with PDFs (not natively, at least). In my case, I try to compose a filename that contains all the key terms, author names, etc. that I anticipate I may need, and then index the filenames only (not the full text) using a desktop search program (locate 3.0).

[Figure: current workflow for reference management]

This is definitely a lot worse than the way my music is organized, and I didn't dedicate much time to my music: it already came tagged, or was easily mass-tagged using a program that talks to Amazon or CDDB. I wonder how we got to the point where, even after dedicating ten times more resources to organizing references than to organizing music, references are still harder to find and handle.

Howison ventures to say that the experience of managing mp3s is far more fluid than managing any other documents: certainly more fluid than managing pictures, Word documents and, of course, academic papers in PDF form. This is simply because music files have embedded metadata that travel with the media, while academic papers don't.

Alternative content

Right now, publishers use PDF because of its ease of distribution, but they implicitly and silently assume that articles will be printed out for reading, and that librarians will still want a paper copy taking up space on a shelf (even though most academics I know increasingly avoid the trip to the library).

This implies that many decisions are made with a 'paper world' in mind: black-and-white images (cheaper to print), page limits, page structure (two-column layouts), and so on. And figures are, of course, static.

What happens if we drop these assumptions? PDFs could contain color images, and even sound and video if needed. Animations or interactive 3D charts could help visualize results. A simple interface could make figures and tables react to user actions: imagine folding parts of tables, or seeing the effect of one variable on others by moving a slider. This is all technically possible within a PDF (thanks to JavaScript).

Metadata that travels with the reference

It seems that last-generation reference managers like Zotero and collaborative sites like CiteULike and Connotea are getting closer to a tag-based (metadata) classification of PDFs. Still, since the metadata do not travel with the PDF, the tagging effort is not shared when a reference is sent by email, for example.

Actually, the PDF standard can already store a pre-defined and limited set of metadata fields, such as author, title, etc., but these fields are rarely used by document creators (mainly publishers). Rudimentary as it might be (there are no fields for journal name, pages, etc.), its use could alleviate some of the problems we face today.
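For illustration, here is a rough sketch of how those classic Info-dictionary fields could be fished out of an uncompressed PDF. This is a toy example (real PDFs often compress or encode these strings, so a proper PDF library should be used in practice), and the sample bytes below are invented:

```python
import re

def read_info_fields(pdf_bytes: bytes) -> dict:
    """Extract classic Info-dictionary fields (/Title, /Author, ...)
    from raw PDF bytes. Naive: only handles plain, uncompressed
    literal strings, which is enough to show what the standard stores."""
    fields = {}
    for key in (b"Title", b"Author", b"Subject", b"Keywords"):
        m = re.search(rb"/" + key + rb"\s*\((.*?)\)", pdf_bytes, re.DOTALL)
        if m:
            fields[key.decode()] = m.group(1).decode("latin-1")
    return fields

# An invented fragment of an uncompressed Info dictionary:
sample = b"1 0 obj << /Title (On Metadata) /Author (J. Howison) >> endobj"
print(read_info_fields(sample))
# {'Title': 'On Metadata', 'Author': 'J. Howison'}
```

Note how thin this built-in vocabulary is: there is simply no standard slot for journal, volume or pages.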

In an ideal world, a committee of a few large universities would agree on a new metadata standard, and perhaps create open-source software libraries to handle that kind of metadata. Publishers would then create PDFs that follow these guidelines.

However, a new document standard, the OASIS OpenDocument Format (ODF), could be the solution for this. Word processors seem to be slowly taking an interest in reference management: Word 2007 features a reference manager, although it is really primitive and not usable for serious academic work, and OpenOffice has been behind ODF for a while. If ODF becomes a de-facto standard, we may no longer need to rely on PDF. And ODF is XML, so adding fields that can be mined by reference managers shouldn't be hard. That way, the metadata is no longer an appendage to the document: the entire document could be parsed, and each component could contribute to its indexing. This would make it easy to do what CiteSeer is trying to do 'the hard way' (parsing author, title, etc. out of the papers that we academics post on our homepages, and making them available and searchable).
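To see how minable ODF metadata is, consider that an ODF file is just a ZIP archive containing, among other things, a meta.xml with Dublin Core fields. A small sketch (the title and author values below are invented) of pulling those fields out with a standard XML parser:

```python
import xml.etree.ElementTree as ET

# Namespaces used by ODF's meta.xml (the file lives inside the
# ZIP archive that an .odt document really is).
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
}

sample_meta = """<?xml version="1.0"?>
<office:document-meta xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
                      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <office:meta>
    <dc:title>A Sample Paper</dc:title>
    <dc:creator>J. Doe</dc:creator>
  </office:meta>
</office:document-meta>"""

root = ET.fromstring(sample_meta)
title = root.find("office:meta/dc:title", NS).text
creator = root.find("office:meta/dc:creator", NS).text
print(title, "-", creator)  # A Sample Paper - J. Doe
```

Since the whole document is XML like this, a reference manager could index any part of it, not just a fixed set of fields.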

The need is there. I think the company or university department that gets this right will have a winner. The Zotero forums, for example, express this need as follows:

(post by CuriousGeorge) Here is what I would like to do ideally:
1. Begin literature review on new topic using databases like JSTOR, Proquest, and Web of Science.
2. Use Zotero’s current “folder” icon in address bar to select articles of interest.
3. Zotero downloads citation information (this already works well), abstract (this often works), and the associated PDF file (with this option enabled in Zotero preferences, it currently works well in JSTOR but not other databases like Proquest).
4. Zotero stores all PDFs in one folder and automatically renames the PDFs based on the associated citation information in the format “Author, Year, Article Title.pdf” (or customized format selected by user).
5. PDFs are read in the browser window and notes are taken in the associated Zotero entry.
6. Zotero allows search in any combination of citation information, abstract/notes, and full text of website/PDF snapshots (stored locally).
7. Lit Review is built by creating new notes that synthesize various articles (these notes take advantage of the “related” option in Zotero to link back to the associated references).
8. The lit review notes and “related” citations are exported to a word processor.
9. The word processor is dynamically linked to the Zotero database for adding new citations and for searching the Zotero database for quotes/notes.
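Step 4 of the wishlist above is easy to prototype. A minimal, hypothetical sketch (the function and format are mine, not Zotero's) of such a renaming scheme:

```python
import re

def citation_filename(author: str, year: int, title: str) -> str:
    """Build an 'Author, Year, Title.pdf' filename from citation data,
    stripping characters that are illegal on common filesystems."""
    raw = f"{author}, {year}, {title}.pdf"
    return re.sub(r'[<>:"/\\|?*]', "", raw)

print(citation_filename("Howison", 2007,
                        "Why can't I manage academic papers like MP3s?"))
# Howison, 2007, Why can't I manage academic papers like MP3s.pdf
```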


Link to James Howison’s paper


11 Responses to “On metadata, indexing, and mucking around with PDFs”

  1. atom prober Says:

    There is no reason to abandon PDF. PDF supports XMP, and XMP allows all the Dublin Core metadata that Zotero, refbase, OpenOffice.org, and other products are using.

    We just need publishers to care enough to put this data in, and more end-user tools to index/view/search/edit it.
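[To make the comment concrete: XMP is just RDF/XML embedded in the PDF, so its Dublin Core fields are readable with any XML parser. A minimal sketch — the packet below is invented and stripped down; a real one carries a packet wrapper and more namespaces:]

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A minimal, invented XMP packet. Inside a PDF, this XML sits in a
# metadata stream attached to the document catalog.
xmp = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:title><rdf:Alt><rdf:li xml:lang="x-default">A Sample Paper</rdf:li></rdf:Alt></dc:title>
    <dc:creator><rdf:Seq><rdf:li>A. Author</rdf:li><rdf:li>B. Coauthor</rdf:li></rdf:Seq></dc:creator>
  </rdf:Description>
</rdf:RDF>"""

root = ET.fromstring(xmp)
authors = [li.text for li in root.findall(".//dc:creator/rdf:Seq/rdf:li", NS)]
print(authors)  # ['A. Author', 'B. Coauthor']
```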

  2. BadgerOne Says:

    atom prober is spot on with his comment.

  3. Martin Says:

    XMP is an interesting solution. For those using LaTeX and BibTeX to manage their references, I recommend trying the new version 2.2 of JabRef (http://jabref.sourceforge.net).
    It can write the BibTeX metadata to a PDF file, and also import the metadata from a PDF file.

    In other words – you can store the bibliographic information (journal name, page range, authors, title, …) in a structured way in the PDF files.

    Even if the publishers do not yet provide this information in their pdf articles, the user already can add that information and benefit while searching for the right reference.
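    [What JabRef embeds is simply a BibTeX record. As an illustration only — the entry key and field values below are made up, not taken from actual JabRef output — the structured information travelling inside the PDF could look like:]

```bibtex
@article{howison2007metadata,
  author  = {Howison, James and Goodrum, Abby},
  title   = {Why can't I manage academic papers like MP3s?},
  journal = {Hypothetical Journal of Documentation},
  year    = {2007},
  pages   = {1--10}
}
```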


  4. jose Says:

    Thanks Atom, BadgerOne, Martin,

    That's really nice. Does anyone know of any PDF creator that writes XMP, for those not using LaTeX? I use the open-source PDFCreator (http://sourceforge.net/projects/pdfcreator/). It offers saving some basic fields, but I doubt that's XMP. Can you post a link to a PDF that has those XMP fields filled in? What software other than JabRef can read, catalog and write XMP-enriched PDFs?

  6. Kevin Says:

    I don’t care whether it supports XMP or not. There’s no need for PDF. We need to get over this 'what it looks like on paper' mentality. Give me text.

  7. jose Says:

    Good point. In fact, in my case I have to use Adobe Acrobat (expensive!) simply to highlight and comment PDFs. Text would be better, with formatting left up to the user (e.g., CSS). Sometimes I don’t like the fonts, or the fact that the paper is two-column; there's not much you can do about it if it’s in PDF format.

    The problem is, I don’t think any new format is going to take over from PDF anytime soon. Look at what happened with mp3 and Ogg: mp3 is proprietary, and we pay a levy when we buy an mp3 player. Ogg gives equivalent, if not better, quality. It is open, and here you don’t find any of the typical criticisms of OSS: “the interface sucks, too geeky” (there is no interface in a file format!), “the documentation sucks” (no docs either). But very few people I know use Ogg (I do), and most mp3 players don’t even support it.

  8. Academic Productivity » The definitive hack for your music collection and how to use it to help you reach productivity nirvana: MusicIP review Says:

    [...] I have talked about how managing music and academic paper collections are similar here; See also ‘noise for academics‘ by [...]

  9. Mark Says:

    I’m joining the thread a bit late, but I’m sure the discussion concerning document storage and attributes continues.

    I appreciate journal articles in PDF format, and prefer it to text-based documents (i.e., full-text articles that come not in a proprietary format but in HTML), as it replicates the journal's look and feel. I don’t feel that this is too wedded to the paper age; rather, it continues the investment we’ve all made in publishing and consuming the articles.

    I DO want to learn more about how to use metadata to my advantage, and will check into the resources already mentioned. I use Thomson’s Endnote, which I like (version X for Mac – earlier incarnations were problematic) but which I wish could handle my PDFs better. In particular, I want Endnote not only to store them (tagged with keywords, etc.) but to functionally work with them: to embed citation data in the PDF, for example, or even my own abstract. I often generate my own PDFs from scans, so they are not even searchable by text beyond the title.

    For highlighting, I’m starting to work with Skim, which uses the PDF's metadata capacity to store highlighting. Evidently the shortcoming at present is that the metadata may not be accessible to all programs. And, as someone noted, transportability is key.

  10. Pandammonium: blogs [pandammonia] Says:

    [...] On metadata, indexing, and mucking around with PDFs | Academic Productivity [...]

  11. Frank Says:

    Essentially all reference management software available today performs extremely poorly when it comes to metadata. Mendeley and Zotero claim to be able to scan PDFs and automatically retrieve information such as author, title and year, but their performance is dismal. I have hundreds of PDFs, and for 99 out of 100 files both are unable to retrieve accurate information. Worse, Zotero's PDF scan does not work on 64-bit systems.
    Neither Zotero nor Mendeley supports writing to XMP, which means that even if one enters metadata, that information is not stored inside the PDF. JabRef is the only software that supports writing XMP to PDF, but it does not scan PDFs for existing metadata. Endnote neither scans/retrieves metadata nor permits writing to the XMP of the PDF.
    What every simple music program like iTunes does with music files, what any photo management software does on the fly (writing metadata tags into the files and searching the internet for more details), not one single reference manager can do!!!
    Really, really sad…

Leave a Reply