Visualizing search data: What’s the right amount of visibility?
Written on April 29, 2009 by Michael Lascarides
In a fortuitous bit of timing for us here at the Library, Google released its Analytics API last week. We had already been discussing ways of exposing site usage statistics to NYPL staff, and the API makes that job a lot easier. This means that we can set about accessing our 18 months or so of Analytics data with some programming, and thereby turn that raw data into web sites and other visualizations. As soon as the API was announced, I hacked together a Ruby on Rails application to start digging into the ungodly amount of cool data now exposed (side note: I’ll soon be posting a technical how-to for any Ruby-ists interested in exploring their Analytics). Within a couple of hours, we were pulling live stats data into a locally-formatted web application.
First came a tag cloud of search terms sized in proportion to their popularity. Next came a thumbnail gallery of Digital Gallery images ranked by popularity, and lastly a keyword explorer where we could enter a keyword (say, “dvd”, “military”, “japan”, “staten island” or “art deco”) and retrieve a thorough analysis of that keyword: where searchers for that keyword are located, what other terms they searched for, and what pages they visited the most after searching.
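For readers curious how a tag cloud "sized in proportion to popularity" might work, here is a minimal Python sketch. It is not the Rails code described above. The function name, the pixel range, and the sample counts are all invented for illustration, and it assumes a simple linear scale between the least and most popular terms.

```python
def tag_cloud_sizes(term_counts, min_px=12, max_px=48):
    """Map each search term to a font size (in pixels) by linear scaling.

    term_counts: dict of term -> number of searches (hypothetical input shape).
    min_px / max_px: assumed display range, not taken from the original app.
    """
    if not term_counts:
        return {}
    lo = min(term_counts.values())
    hi = max(term_counts.values())
    span = hi - lo
    sizes = {}
    for term, count in term_counts.items():
        # If every term has the same count, fall back to the midpoint size.
        weight = (count - lo) / span if span else 0.5
        sizes[term] = round(min_px + weight * (max_px - min_px))
    return sizes


# Invented sample counts, for illustration only.
sample = {"dvd": 100, "japan": 50, "art deco": 0}
print(tag_cloud_sizes(sample))
```

A linear scale keeps the sizes literally proportional, which is what makes the relative popularity so visible. A logarithmic scale is a common alternative when a few terms dwarf the rest.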
This is revolutionary for us. We’ve already been exploring lists of the most popular search terms, but to see them proportionally scaled in a tag cloud really drives home the relative popularity of different terms. And the keyword explorer is just amazing. We’ve long known that the Digital Gallery is extremely popular among military history buffs (to pick a random subject), but to see the follow-up searches for “military” on the DG really paints a picture of the breadth of topics and level of detail of these enthusiasts.
There is no doubt that we’ll be using the aggregate search data internally. For example, looking at searches containing “dvd” reveals a huge list of movies and DVDs being sought by our users whether we have them in stock or not, which can then be used by our Acquisitions department to meet demand. And searches relevant to divisions (e.g., “maps” or “manuscripts”) or locations (“staten island”) can help staff in those areas gain valuable insight into the needs and wants of their users.
But almost immediately, we also started thinking about ways of using some of this data to bring context to the Library experience of the end users. Could we display popular searches by location on screens in the branch libraries? Could we make maps of the country showing the most popular searches by city? Should we let end users explore other relevant searches by keyword? The API solves a big chunk of the technical hurdles, which leaves only the question, “Are we sure this is a good idea?”
We checked with our counsel’s office about whether our privacy policy precludes us from publishing this data or its derivatives, and to my surprise, they didn’t have any problem with it from a legal point of view. Still, it makes me nervous to start producing any public products using user search data in any form until we’ve had a thorough discussion of what current best practices are.
Just to be clear, this data from Analytics is always aggregate data. There’s no way to say that a particular search came from a particular person or computer. But it can potentially get pretty specific: terms that were searched a single time, for example, or the city from which a specific term was searched.
So I’d like to throw open the question to the library community and the public: what is the balance of rich context versus privacy? Should we make none of this data public, or all of it (that is, allow users to see the context of every search term)? Should we only make contextual search information public when it reaches a certain threshold of use (the “safety in numbers” argument)? If so, what is that threshold? Is it a case-by-case basis? Is it possible to generalize any guidelines? Have other institutions explored a similar policy?
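To make the “safety in numbers” option concrete, here is a minimal Python sketch of threshold-based suppression. It assumes the aggregate data arrives as (city, term, count) rows, which is a hypothetical shape rather than the actual Analytics API response. The threshold of 10 is a placeholder only, since the right value is exactly the open question.

```python
from collections import defaultdict


def publishable_by_city(rows, threshold=10):
    """Keep only (city, term) pairs searched at least `threshold` times.

    rows: iterable of (city, term, count) tuples (assumed input shape).
    threshold: placeholder value; choosing it is a policy decision.
    """
    kept = defaultdict(dict)
    for city, term, count in rows:
        if count >= threshold:
            kept[city][term] = count
    # Cities where every term fell below the threshold are omitted entirely.
    return dict(kept)


# Invented sample rows, for illustration only.
rows = [
    ("New York", "dvd", 120),
    ("New York", "rare term", 1),
    ("Albany", "maps", 3),
]
print(publishable_by_city(rows))
```

Note that in this sketch a city disappears from the output altogether if none of its terms clears the threshold, so a single search from a small town would never be shown.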
We’re very interested in hearing from you in the comments.
Filed in: Ruby on Rails, Search, privacy, visualization.




I really don’t think that privacy concerns are relevant when you’re talking about aggregate data – single searches are not going to be of much use/interest to others, but searches performed by dozens of others are. One of my favorite features of the last Barnes & Noble web site redesign is the search cloud – invariably, that’s the easy way to find a title you can’t remember, but can recognize. Especially if it’s an Oprah book, or recently mentioned on a TV show, or recently announced (i.e., Dan Brown’s newest). I think a search cloud would be a great addition to the library home page, and would get a lot of use.
Personally, I would be most interested in the most general data – the total searches. I think breaking it down by branch/neighborhood/city would not really produce data all that different, and at a certain point *would* start to have privacy concerns.
On the other hand, breaking it down by type of search could be interesting, if enough people are changing the search type away from just Keyword. Most popular author(people) searches, most popular titles, most popular subjects.
It’s great that you’re thinking about handing this data back to the users, where it can be most valuable. I agree that presenting this sort of aggregate data shouldn’t raise too many privacy concerns, but it’s good that you’re thinking about it. I think by now it’s clear that “anonymous” search data from one individual isn’t really anonymous. But, as you say, there’s safety in numbers.
Search clouds can be useful, but they can also have the unintended effect of driving traffic away from the long tail and toward the “short head.” There’s a positive feedback loop created as more-popular searches show up bigger, and so are clicked on more, and so become even more popular. This can tend to drive demand rather than just reflecting it. It might be interesting to try normalizing search results, displaying not the most popular searches, but the ones whose popularity has increased the most in the past week.
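One possible reading of "the ones whose popularity has increased the most in the past week" is a week-over-week growth ratio. The Python sketch below assumes two dicts of term counts, one per week. The function name, the smoothing constant, and the sample numbers are all hypothetical.

```python
def trending_terms(this_week, last_week, top_n=10, smoothing=1):
    """Rank terms by week-over-week growth rather than raw popularity.

    this_week / last_week: dicts of term -> search count (assumed shape).
    smoothing: added to both counts so brand-new terms do not divide by
    zero and tiny counts do not produce extreme ratios.
    """
    scores = {}
    for term, count in this_week.items():
        previous = last_week.get(term, 0)
        scores[term] = (count + smoothing) / (previous + smoothing)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Invented sample counts, for illustration only.
this_week = {"dvd": 1000, "dan brown": 60, "maps": 40}
last_week = {"dvd": 990, "dan brown": 5, "maps": 40}
print(trending_terms(this_week, last_week))
```

With these made-up numbers, "dvd" is by far the most searched term but barely grew, so a fast-rising term ranks ahead of it. That is the kind of counterweight to the feedback loop suggested above.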
It would be really cool to leverage this search information to facilitate patrons’ discovery of new titles, like Amazon does. Users would see something like “Patrons searching for american modernist poetry also searched for william carlos williams.”
Perhaps you could also use this to help build user-defined metadata about records. You know when a user viewed a certain record after a search–add that search phrase as metadata for the record. Over thousands of searches, this could be valuable; you show the aggregated metadata to help describe the resource to patrons. You could even use it in ranking for searches, although this would tend to build another feedback loop.
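As a rough illustration of the metadata idea, here is a Python sketch that counts which search phrases led to which records and keeps only phrases seen more than once. The (phrase, record_id) event shape, the function name, and the minimum count are assumptions for the example, not something the Analytics data is known to provide in this form.

```python
from collections import Counter, defaultdict


def build_search_tags(events, min_count=2):
    """Aggregate search phrases into per-record tags.

    events: iterable of (search_phrase, record_id) pairs, one for each time
    a record was viewed after a search (hypothetical input shape).
    min_count: drop phrases seen fewer times than this, so a one-off
    search never becomes public metadata.
    """
    counts = defaultdict(Counter)
    for phrase, record_id in events:
        # Normalize case and whitespace so "Art Deco" and "art deco" merge.
        counts[record_id][phrase.strip().lower()] += 1
    return {
        record_id: [p for p, n in counter.most_common() if n >= min_count]
        for record_id, counter in counts.items()
    }


# Invented sample events, for illustration only.
events = [
    ("Art Deco", "r1"),
    ("art deco", "r1"),
    ("chrysler building", "r1"),
    ("maps", "r2"),
    ("maps", "r2"),
]
print(build_search_tags(events))
```

The minimum count doubles as a small privacy safeguard, in the same spirit as the threshold question raised in the post.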
[...] ‘Visualizing search data: What’s the right amount of visibility?’ from the New York Public Library. [...]
I believe the greatest privacy concern with Google Analytics lies not in the ways that libraries might aggregate and use analytics data of library website use but in allowing Google to do the same.
The current privacy policy states that Google will not share information with third parties without consent, but there’s no guarantee that this will always be true. From the Analytics TOS: “Google reserves the right to change or modify any of the terms and conditions contained in this Agreement or any policy governing the Service, at any time…” Consent to any new terms is required, but continued use of the service counts as consent. Are we all willing to sleeplessly watch the Google Privacy Policy page for changes, in case we accidentally agree to new, less agreeable terms? Even if third parties are denied access to this information, Google itself can use the data for a variety of purposes, including advertising.
Would we allow Google to access the borrowing records for our patrons, even if they promised to keep them secret? These tracking scripts are loaded with the page, so in most cases we are tracking, saving, and giving corporations access to our patrons’ library habits without their consent. Is that what libraries should be doing?
After some discussion here at the University of Virginia Library and with the University’s IT organization about server logs and tracking tools–including a wonderfully informative session led by our legal counsel–the consensus was that the utility of Google Analytics was outweighed by the privacy concerns it raises. If we really wanted the power of a tool like Google Analytics, installing the Urchin tool or other locally controlled analytics software seemed like a better solution than turning all our patrons’ web habits over to a corporation. A community-led decision was made to proceed with Urchin in an effort to enrich our patrons’ library experience while still trying to preserve the confidentiality and privacy that continue to be an integral part of the library’s mission.
[...] being used in exciting new ways. For example, Michael L. of the New York Public Library is visualizing search data as tag clouds, and Patrick C. is using the Google Analytics API in conjunction with technical software to [...]
[...] L’s visualization of search data and Patrick C’s use of Google [...]
I have just started to look at moving my offers to NY and find this whole idea very interesting. Good luck to us all for the networking opportunities.
Peter A
[...] Note: I still haven’t found another name for this type of development. P.S.: Here is the link to the NYPL article: http://labs.nypl.org/2009/04/29/visualizing-search-data-whats-the-right-amount-of-visibility/ [...]
[...] This post was Twitted by moffittk – Real-url.org [...]
@Joe G: Very good points about having the data on Google’s servers. We had that conversation here and on balance decided we were comfortable with the online version of Analytics for now, but it’s probably worth revisiting as the technology and business environments evolve.
Otherwise, I’m quite surprised that the general sense seems to be “go for it!” We’ll keep you posted where our experiments lead.
[...] – An Adobe Air based desktop app for quick and easy access. http://labs.nypl.org/2009/04/29/visualizing-search-data-whats-the-right-amount-of-visibility/ – The New York Library are looking at how to visualise their search data as tag clouds. There is [...]