On metrics
Joe Esposito's recent post alerted the list to bepress's (The
Berkeley Electronic Press's) download efforts. We recently spent
substantial resources overhauling our methodology for counting
full-text downloads of articles, and applied the new methodology
to all our logfiles in order to provide our users with our most
accurate estimate of legitimate downloads. See
http://www.bepress.com/download_counts.html.
We consider this work to be part of our effort to act true to
our motto as "The New Standard in Scholarly Publishing Since
1999." We discovered that we had a lot of room for improvement
in the way our downloads were counted, and so we did our best to
fix this for bepress customers.
A lot of people on liblicense have been asking very good
questions about what we did, why we did it, and whether it
matters. Here are some answers.
WHY DOWNLOADS MATTER.
For sound reasons, Liz Lorbeer questions the value of download
counts. We appreciate her concerns. We were simply responding
to what we heard from users.
Put simply, downloads matter to bepress because they matter to our
users: authors, libraries, and Digital Commons subscribers. Many of
my colleagues monitor the downloads of their papers like my wife
monitors the fever of our baby. Of course, they know that
downloads are a highly imperfect metric of a good paper and of
successfully reaching an audience. Are the people actually
reading the paper? Do they like the paper or cite it? Or do
people simply click on a catchy title? Who knows? For some time,
"Fuck" was the most downloaded paper at bepress; for a while, it
surpassed a Paul Krugman article in downloads per day since
publication. Could this have more to do with the title of the
paper than its content? I shouldn't judge without reading it,
but I wonder.
For better or for worse, downloads are being used as a sign of
prestige and productivity for merit reviews, journal acceptance
and the success of repositories. Therefore accurate measurement
is important.
Accuracy requires, at the least, that double clicks by people
should be counted only once and that downloads by automated
processes, as opposed to downloads by people, should not be
counted at all. However, although double clicks are easy to
identify, automated processes are not.
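To make the "counted only once" rule concrete, here is a minimal sketch of double-click collapsing. It is not the bepress methodology, which this post does not spell out. The function name, the event layout (client key, article ID, timestamp), and the 30-second window are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

def count_downloads(events, window_seconds=30):
    """Count downloads per article, collapsing rapid repeat clicks.

    events: iterable of (client_key, article_id, timestamp) tuples.
    A request is skipped when the same client requested the same
    article within the previous `window_seconds` (assumed value).
    """
    window = timedelta(seconds=window_seconds)
    last_seen = {}  # (client_key, article_id) -> time of previous request
    counts = {}     # article_id -> number of counted downloads
    for client, article, ts in sorted(events, key=lambda e: e[2]):
        key = (client, article)
        previous = last_seen.get(key)
        last_seen[key] = ts  # always remember the latest request
        if previous is not None and ts - previous <= window:
            continue  # treat as a double click, do not count again
        counts[article] = counts.get(article, 0) + 1
    return counts
```

Because the last-seen time is refreshed on every request, a long burst of rapid clicks collapses to a single count. That is one possible design choice, not the only defensible one.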
Peter Shepherd, Director of COUNTER, asks whether "inflation is
proportionately the same across vendors." I'll go out on a limb
and say that it seems awfully unlikely. It takes expense,
research, and cleverness to catch automated processes.
Publishers and repository owners may have little or no incentive
to limit download counts, as things stand. After all, everyone
wants more downloads.
HOW DOWNLOAD INFLATION VARIES
Since we only studied bepress data, we can't tell you how count
inflation varies across publishers, but we can say something
about how it varies between open access content and restricted
access content.
Download inflation is highly variable, just as we suspected.
Open access papers have dramatically higher inflation than
restricted access, but restricted access inflation remains
substantial. Even within these two classes download inflation
varies a great deal. One paper, for example, had over 8500
downloads even with our old filters. With our newer, more
accurate filters, it had only 6. Happily, that is not typical,
but significant download inflation is typical in our sample.
Some of our findings surprised me a lot. I, for example, shared
the skepticism that was recently expressed by Phil Davis, a PhD
student at Cornell, and by Peter Shepherd, the Director of
COUNTER, a valiant project that tries to make sure that when
humans or machines double click there is only one count. All
three of us figured that for subscription-based academic
journals, where permitted access is limited to those from IPs at
subscribing institutions, double clicks by humans could be
significant, but downloads by automated processes would be
negligible. My tech group hurt my feelings by calling me naive.
And, I guess they were right.
It is true that the problem of download inflation is much worse
in open access than in restricted. But the problem is
substantial in restricted access journals too, assuming that
bepress's experience is representative. In fact, we "catch"
automated processes coming from subscriber IPs downloading our
restricted access journals in roughly equal number to the double
counts that COUNTER compliance eliminates. So, if COUNTER
matters for restricted access journals, what we have done matters
too.
How can automated processes come from a closed community? I asked
our tech team. First, they disabused me of my professorial idea
that all members of the academic community are benign. They
reminded me that computer viruses may be written by college kids or
perhaps by professors like me and that denial of service attacks
can come from them too. In addition, people outside the academy
may hijack machines within the closed community and use them.
Computer science researchers interested in building newfangled
search engines might download thousands of papers not to read but
to serve as a database for their research. Moreover, LOCKSS
crawlers turn out to download a lot of restricted access content.
Are the other publishers excluding those counts? I hope so, but
do not know. If other publishers are on this list, please do
tell. Our restricted access journals are probably subject to
more automated processes than other publishers because we have a
liberal guest access policy intended for humans, but imperfectly
restricted to them. However, we isolated that effect and still
find lots of downloads that we identify as coming from automated
processes arising (at least most directly) from the IP addresses
of the closed communities of our subscribers. Again, we reject
downloads from automated processes in roughly equal number to
COUNTER rejections. So my tech team wins again. I was naive.
On which subject, bepress excludes all downloads coming from
within bepress. Do other publishers? Should we? Some of our own
downloads reflect human interest no different from any other. These
should be counted, I think. Other downloads are connected with our
business, testing response time and the like. To be
conservative, we exclude them all. Do other publishers? Again, I
don't know.
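For readers who want to see what excluding a publisher's own traffic can look like, here is a minimal sketch. The network range is a reserved documentation block standing in for real office addresses, and the names `INTERNAL_NETWORKS`, `is_internal`, and `external_only` are hypothetical; none of this describes the actual bepress filter.

```python
import ipaddress

# Hypothetical list of the publisher's own networks. 192.0.2.0/24 is
# a reserved documentation range used here only as a placeholder.
INTERNAL_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

def is_internal(ip):
    """Return True if the address falls inside any internal network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in INTERNAL_NETWORKS)

def external_only(events):
    """Drop every event whose first field is an internal IP address.

    events: iterable of tuples whose first element is the client IP.
    This is the conservative choice: it discards staff reading for
    interest along with response-time tests and similar traffic.
    """
    return [event for event in events if not is_internal(event[0])]
```

The conservative rule is simple precisely because it does not try to tell an interested staff reader from a monitoring script.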
WHY DID WE INVEST IN REDUCING OUR DOWNLOADS?
We gathered together a few big bepress meetings last winter and
spring. We discussed several things. First, we were hearing more
and more about anomalies: papers with far more downloads than
was plausible. Second, it was clear that new madness happens on
the internet all the time. Once upon a time, we spent a lot of
time making as sure as we could that we only reported human
downloads. Should we open this can of worms again? I had two
hesitations.
One hesitation was technical. Distinguishing the download of an
automated process from a human interested in reading an article
seems difficult. Some automated processes call themselves out and
declare "I am a crawler," but if they don't, then at first
glance all downloads appear alike. One must look at patterns in
the data to tell them apart. This seems like a job for the
NSA or for Steve Levitt, author of Freakonomics, and founder of
forensic economics. Only problem was this: the NSA is busy with
terrorists and Steve Levitt isn't on our staff. Luckily our
biggest baddest programmer was interested in the challenge, so
this difficulty was solved.
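The two cases above, crawlers that announce themselves and crawlers that must be inferred from patterns, can be sketched roughly as follows. This is only an illustration under stated assumptions: the User-Agent tokens, the per-hour threshold, and both function names are hypothetical, and the real bepress filters are certainly more elaborate than a single volume rule.

```python
import re
from collections import defaultdict

# Assumed tokens that self-declaring crawlers often put in their
# User-Agent header. The list is illustrative, not exhaustive.
BOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)

def is_self_declared_crawler(user_agent):
    """True when the User-Agent string announces an automated client."""
    return bool(BOT_PATTERN.search(user_agent or ""))

def flag_heavy_clients(events, max_per_hour=100):
    """Return the client IPs that exceed a download volume threshold.

    events: iterable of (client_ip, unix_timestamp_seconds) tuples.
    A client is flagged if it makes more than `max_per_hour` requests
    inside any single clock hour (the threshold is an assumption).
    """
    buckets = defaultdict(int)  # (ip, hour index) -> request count
    for ip, ts in events:
        buckets[(ip, int(ts // 3600))] += 1
    return {ip for (ip, _hour), n in buckets.items() if n > max_per_hour}
```

A volume threshold is the crudest possible pattern. It will miss a slow, patient crawler and may flag a busy proxy that serves many human readers, which is part of why the problem is hard.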
The other hesitation was with the business logic. In both our
open access services and our restricted access journals, we, like
everyone else on the internet, are in a competition for eyeballs.
Could it possibly make sense, when everyone is competing for more
and more downloads, to compete by investing a lot of money to be
able to lower our downloads by 10, 20, 50% or who knows how much?
At first blush, this seemed simply insane. Could we possibly, as a
small player in the scheme of things, come out and say: "Your
downloads are down 20% and this is a good thing"? Many on the
staff thought we could. I was skeptical. I remain skeptical
from a business perspective. This time, I think they are naive.
But if they are naive, it is a wonderful kind of naive. And, if
what I wanted out of life was to make a zillion dollars and own
the world, I would not be spending this kind of time working on
scholarly communication. Hopefully, the decision to do this was
not naive. But, regardless, I am sure that the effort was the
right thing to do. We hope it starts a conversation.
Best, Aaron
Aaron Edlin
Chairman, The Berkeley Electronic Press
Richard Jennings Professor of Economics and Law, UC Berkeley
Homepage: http://works.bepress.com/aaron_edlin/
Co-Editor, The Economists' Voice, http://www.bepress.com/ev
Editor, The B.E. Journals of Theoretical Economics,
http://www.bepress.com/bejte