Library Of Congress Announces It Will Be Selective In Which Tweets It Archives

Published December 27, 2017 at 2:33 PM MST

ROBERT SIEGEL, HOST:

Since 2010, the Library of Congress has been collecting every single tweet published on Twitter. The idea was to create a comprehensive public archive of the social media platform for researchers to comb through and to study. But that was all easier said than done. More than half a billion tweets are sent every day, a staggering amount of information to catalog even for one of the world's biggest libraries. And yesterday the library announced that it would scale back its Twitter collection considerably and will now archive tweets on a very selective basis.

Professor Michael Zimmer of the University of Wisconsin-Milwaukee has written about the challenges of storing this never-ending stream of social media data, and he joins us now. Welcome to the program.

MICHAEL ZIMMER: Thank you, Robert.

SIEGEL: Why is the Library of Congress rolling back its original plan to create a Twitter archive?

ZIMMER: It's not terribly surprising that they've made this decision. I think when they first made the agreement in 2010, it seemed like an easier task than it's become. In those first few years, they had originally collected about 170 billion tweets. The most recent update they gave us was in 2013. And as you said, since then, there's been about 200 billion tweets per year added, and I think they've realized this is just a much larger task than they anticipated.

SIEGEL: And when the Library of Congress would collect data, would they be collecting the text of the tweets or more information than that?

ZIMMER: There's actually quite a bit more than just the 140 characters or - now it's 280 characters. By recent count, there's been about 50 different pieces of metadata that comes along with each tweet. So it's not just the text, but it's also your location, your account settings, IP address, whether your account is public or private, who your friends or followers might be.

And then they also have to deal with other kind of policy issues like, what if someone later deletes a tweet? Should they remove that from the archive? Or if I make my account restricted in the future, or maybe my tweet turned out to be, you know, harmful or dangerous and my account was banned from the platform? So there's all kinds of interesting policy issues alongside the technical challenges.

SIEGEL: The library will still collect tweets on a selective basis. Does that mean that, say, President Trump's tweets will be stored but Joe Blow's tweets about the weather will not be stored?

ZIMMER: Yeah, I think it's going to be a combination of notable accounts and also when there's certain events. There might be an election, a protest, a sporting event, and they'll collect, you know, more tweets around those certain kind of moments in time versus the entire archive of every single utterance that we're making on that platform.

SIEGEL: And when they began, the Library of Congress promised access to this archive for researchers. Have researchers had that access?

ZIMMER: No, not really. The library did have a small program where they were giving people some limited access - who won access - to come into the library and go through the archives. But even those people didn't get true access. They've really struggled with how to organize, how to make this searchable and make it useful for scholars.

SIEGEL: You said this wasn't surprising. Are there any other ambitious projects out there that simply underestimated the scale of social media and that are collapsing under the weight of its volume?

ZIMMER: Well, there are some successes - the Internet Archive, of course, who is trying to archive every website. And there are, you know, smaller projects. There's a Trump Twitter archive where people are trying to collect everything that President Trump tweets. I actually manage The Zuckerberg Files, where I try to archive everything that Mark Zuckerberg says online or in media interviews. But usually these smaller-scale projects are the ones that are more successful.

SIEGEL: Well, Professor Zimmer, thanks for talking with us about it today.

ZIMMER: Thank you, Robert.

SIEGEL: Michael Zimmer is director of the Center for Information Policy Research at the University of Wisconsin-Milwaukee.

Transcript provided by NPR, Copyright NPR.