Archiving Twitter During the Upheaval

Nov 29, 2022

By Derek Cameron

When Jim Clifford and I started archiving the Canadian conversations about COVID-19 on Twitter, it did not seem an urgent task. While Elon Musk had made overtures to buy Twitter on 13 April 2022, he had cooled by May. We did not have the foresight to imagine that six months later Musk would fire half of the 7,500-strong workforce, or that he would follow this up by canceling remote work and demanding “hardcore” commitments, which led hundreds more to quit. In recent weeks, our leisurely approach to mining Twitter for the Remember Rebuild Saskatchewan project gained new urgency.

Jim and I are exploring how COVID-19 policies in Saskatchewan were received by residents of the province. While it would be possible to get glimpses of this through traditional print sources, or cursory glances into social media, we knew that a fuller picture would emerge by systematically analyzing social media. We also see this as a way to test the new digital methods historians will need to learn to study the history of the pandemic, where the internet is now the main repository of primary sources. As a result, we decided we would capture tweets from politicians, activists, and journalists. More importantly, however, we wanted to know what everyday people were saying in reply. Luckily for us, Twitter gives academic researchers generous free access: each account can capture ten million tweets per month and can request tweets from any point between the platform’s launch and the present. But that access, as discussed below, may not last.

Derek Cameron runs Twitter collection code. Photo Credit: Kyle Stringer

Our project began work in June 2022, as a small component of the larger Remember Rebuild Saskatchewan project. At the time, I had no coding experience, and I spent my summer acquiring the skills I would need to perform network analysis, which maps the complex interactions of actors to give evidence for hypotheses about those interactions or to reveal connections that were not anticipated. I also learned various tools to comb through massive amounts of text, including simple concordance (which words appear near each other) and more complex text mining via the incredible Natural Language Toolkit (NLTK). Clifford requested the Academic developer account but was busy with other aspects of the project, so no further progress was made. As far as Twitter collection was concerned, it would wait until September.
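For readers who have not met the term, a concordance can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used in our project: the `concordance` function and the sample tweets are invented for this example, and NLTK provides its own, more capable concordance tools.

```python
import re


def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence.

    `width` is the number of words kept on each side of the keyword.
    Hypothetical helper written for this post, not part of NLTK.
    """
    # Lowercase and split into words, keeping leading # and @ so that
    # hashtags and mentions survive as single tokens.
    tokens = re.findall(r"[#@]?\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((" ".join(left), token, " ".join(right)))
    return hits


# Invented sample tweets, for illustration only.
sample_tweets = [
    "Masks are required again in Saskatoon schools",
    "Why are masks optional on the bus?",
]

for tweet in sample_tweets:
    for left, word, right in concordance(tweet, "masks", width=2):
        print(f"{left:>20} | {word} | {right}")
```

Lining up every occurrence of a word with its neighbours makes it easy to scan how the word is being used across thousands of tweets before turning to heavier text mining.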

In September, Musk was embroiled in legal proceedings by Twitter’s shareholders to force him to abide by his 44-billion-dollar offer. The discovery process of the trial revealed the dysfunctional plans the billionaire had for the company, but even then, as we were starting to ramp up our collecting efforts, Twitter seemed to have a secure future. It remained hard to imagine someone spending billions of dollars on something and not developing a plan for its medium-term sustainability. On October 28th, he officially bought Twitter, avoiding any legal repercussions and starting a series of missteps that increasingly compromised the stability of the platform. The day he posted disinformation about the attack on Paul Pelosi, I became concerned. At first, I worried that Twitter would be drowned in misinformation and hate speech as Musk shifted to lax moderation policies. Then he hastily fired half the workforce, only to scramble to hire back staff involved in critical operations. At this point, pulling our data from Twitter took on new urgency. We have no idea who is left to maintain the Academic access, and nobody knows whether this free service (or Twitter itself) will survive into the future.

By mid-November, we had collected all the data we had set out to use in our analysis. But as any historian knows, the first documents we pull often lead to more questions and return trips to the archives to look at more documents. While our project currently focuses on Saskatchewan, who is to say we will not want to expand to look at more of Canada, or at the complex transnational exchanges that happen on the internet? In the past few weeks, we have expanded our collection efforts, adding leaders of Canadian political parties, key hashtags, and other events. I have even enlisted my brother’s computer to speed up the rate of collection.

It is impossible to know what historians in the future will want to explore. The Internet Archive, started in 1996, had the vision to understand that our digital resources are far more fragile than one might think. It now houses over 45 petabytes of data from the web (an unfathomably large amount). But as the web has changed, the basic web crawlers and archival tools have developed blind spots. Social media, which has come to encompass much of the day-to-day experience of netizens, is one of those blind spots, with infinite scrolling that can stump web crawlers. Paywalls and passwords, which are appearing more frequently on news and other sites, also keep a growing share of content out of reach of internet archiving. While tools exist to crawl social media, they are clunky, and coverage is haphazard. With the potential death of Twitter, one of the most important primary sources for the global pandemic, the Trump presidency, and the Idle No More and Me Too movements may be consigned to oblivion.

Right now it is time to react to the chaos at Twitter. Those with academic access to the API should make strides to capture the parts of Twitter that they study, or anything they think is particularly important for future historians. If historians want to answer historical questions in the future about our present moment, we need to preserve social media. We are going to do what we can with our two accounts, or 20 million tweets a month, but this will only be a drop in the bucket.

Derek Cameron is a PhD Candidate at the University of Saskatchewan. He studies anti-vaccine activism in Canada, using digital history methods to understand how anti-vaccine networks have changed from the 1980s to the present. He can be reached at @DerekHilCameron (for now).

This project is funded by SSHRC and the Archives Unleashed Cohort program.

This post was first featured on