Stochastic Parrots Case Study

Stochastic Parrots: Opening Up Twitter Conversations

Case study 3: Opening up Twitter conversations for controversy analysis

By Matias Valderrama Barragan with Greta Timaite and Iain Emsley

What data are we talking about?

The Stochastic Parrots Twitter data set is a database of tweets collected by the Shaping AI research project in order to capture the public controversy surrounding the research paper “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” by Emily M. Bender, Timnit Gebru, Angelina McMillan-Major and Shmargaret Shmitchell. The paper became the subject of extensive debate online, especially after one of its authors, Timnit Gebru, was ousted from her job as Co-lead of the Ethical AI Research Team at Google AI following internal disagreements about its publication.

The Stochastic Parrots Twitter dataset was captured by the CIM RSE team to support AI controversy analysis as part of the international social research project Shaping AI. Based on an online consultation of UK-based experts in AI and society (Marres et al., 2024, p. 2), the Shaping AI research team at the University of Warwick, led by Prof. Noortje Marres, selected five notable research controversies around AI and society that had taken place between 2012 and 2022 for further study. The Stochastic Parrots controversy was one of the most frequently mentioned by UK experts during the consultation. Follow-up interviews with selected experts confirmed that the social media platform Twitter served as one of the main stages, or ‘primary settings,’ for this public controversy about AI, hence the relevance of studying its traces on this platform.

How was the dataset created?

The dataset was created by CIM research software engineers by submitting requests, in the form of queries, to Twitter's academic API. Queries consisted of keywords related to the academic paper on Stochastic Parrots and related URLs derived from ‘manual’ search and inspection of how the paper was mentioned on Twitter. All tweets matching the queries, as well as all their replies and quotes, were collected for the period between 10 January 2019 and 18 January 2021. The download, made using the Twitter ‘search all’ API endpoint (Marres et al., 2024, p. 17), comprised 24,673 tweets. The dataset was subsequently cleaned and filtered at the conversation-ID level, including all replies and sub-replies within tweet threads of more than 10 tweets. This resulted in a database of 6,295 in-scope Twitter conversations that addressed the controversy.
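The conversation-level filtering step described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: it assumes each collected tweet is a dictionary with a "conversation_id" key (the field name used by Twitter API v2 tweet objects), and it reads "threads of more than 10 tweets" as a simple size threshold. The function name and the `min_size` parameter are hypothetical.

```python
from collections import Counter


def filter_conversations(tweets, min_size=10):
    """Keep only tweets whose conversation contains more than `min_size` tweets.

    Assumption: `tweets` is a list of dicts, each with a "conversation_id" key.
    """
    # Count how many collected tweets fall into each conversation.
    sizes = Counter(t["conversation_id"] for t in tweets)
    # Keep a tweet only if its whole thread exceeds the size threshold.
    return [t for t in tweets if sizes[t["conversation_id"]] > min_size]


if __name__ == "__main__":
    # Toy data: one thread of 11 tweets and one thread of 3 tweets.
    sample = [{"id": str(i), "conversation_id": "c1"} for i in range(11)]
    sample += [{"id": str(100 + i), "conversation_id": "c2"} for i in range(3)]
    kept = filter_conversations(sample)
    print(len(kept), {t["conversation_id"] for t in kept})
```

Filtering on whole conversations rather than single tweets keeps replies and sub-replies together, so a thread is either in scope as a unit or excluded as a unit.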

Figure: Topics in the Shaping AI Twitter dataset on the Stochastic Parrots controversy. Data visualisation was made by Ginevra Terenghi.

Why is this dataset relevant to open interpretative research?

The Stochastic Parrots controversy on Twitter presents an important resource for understanding controversies about AI research and research ethics in big tech companies during the relevant period. The Twitter data set also addresses a moment that may be of interest to researchers investigating the social and ecological implications of AI. At the time the controversy erupted, Twitter arguably still served as a hybrid forum (Callon and Rip, 1992), a setting where different types of experts meet to negotiate scientific issues that affect research and society alike (they are now migrating to other platforms). In this regard, the dataset is also relevant to social research on knowledge communities and knowledge democracy. In analysing the tweets, the Shaping AI researchers observed that there was not much agreement on what this controversy was all about. Users disagreed about what they disagreed about: the content of the paper, the dismissal of Gebru, the power dynamics in the technology industry, and so on. The interpretation of the object, or of what is at stake, is therefore also somewhat unstable, and the controversy itself is an example of open interpretation: the actors interpreted what was happening in different ways. Computer scientists, sociologists, journalists and others were involved, which implies a kind of reflexivity in the ongoingness of the situation.

But the question of how the data set can be opened up for open interpretative research is also a techno-methodological and ethical question: how can Twitter data be made available for secondary data analysis? There are important ethical challenges even in making "public" data like tweets available, because tweeting does not imply consent to being subjected to data analysis (see the Association of Internet Researchers Ethics Statement). At the same time, making social media data open is no longer only an aspiration: all research funded by UK Research and Innovation (UKRI) Councils is now required to report on open data and to make data sets openly available wherever this is possible. UKRI requires that data from all funded projects is stored using their UK Data Service. For Twitter data sets, this is possible in the form of tweet IDs, which can be made available for future “rehydration”, that is, retrieving the data again using the IDs. For the Shaping AI Stochastic Parrots data set, the team will be storing the tweet IDs, along with a description of both the data and the method, using this service.
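Reducing a collected dataset to tweet IDs before deposit can be sketched as follows. This is only an illustration under stated assumptions: each tweet is taken to be a dictionary with an "id" key holding a numeric string, and the function names `dehydrate` and `to_id_file` are hypothetical. Rehydration itself is not shown, since it depends on whatever platform API access is available at the time.

```python
def dehydrate(tweets):
    """Return the unique tweet IDs in `tweets` as strings, in numeric order.

    Assumption: each tweet is a dict with an "id" key holding a numeric string.
    All other fields (text, author, timestamps) are deliberately dropped.
    """
    ids = {str(t["id"]) for t in tweets}
    return sorted(ids, key=int)


def to_id_file(ids):
    """Serialise IDs one per line, a simple layout for a shareable ID list."""
    return "\n".join(ids) + "\n"


if __name__ == "__main__":
    collected = [
        {"id": "10", "text": "first tweet"},
        {"id": "2", "text": "second tweet"},
        {"id": "10", "text": "first tweet"},  # duplicate from overlapping queries
    ]
    print(to_id_file(dehydrate(collected)), end="")
```

Sharing only IDs means that anyone re-using the dataset retrieves the content afresh, so tweets deleted in the meantime are not redistributed.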

What are the challenges of opening up this Twitter dataset?

Our participatory workshop session on the Stochastic Parrots Twitter dataset highlighted several considerations and recommendations for the ethical and practical handling of social media data. The first consideration is the public character of these tweets. Applied Linguistics participants saw Twitter data as highly valuable for understanding enunciation in open, public forums. However, other participants from digital media studies raised ethical concerns about user privacy, particularly given the difficulty of obtaining informed consent, dealing with deleted tweets, and the risk of de-anonymization, and stressed the importance of not assuming that, because these tweets are posted online, their creators expected them to be used for social research. To address privacy concerns, some participants recommended working with data donation (Boeschoten et al., 2022) by Twitter users as a method, in a community-involved, small-scale project, although this approach presents challenges at scale. Automated tweet rephrasing is another possible means of reducing de-anonymization risk, though it limits data accuracy. Moreover, some challenges regarding the comparative analysis of social media controversies were raised: the group noted that the lack of comprehensive comparisons across Twitter datasets hinders more substantive analysis, and that recent changes under Elon Musk have increased data usage restrictions, further complicating ethical data access. In other words, the setting is no longer the same since Musk took ownership of what is now called X.

How then to work around these important ethical and methodological issues?

To ensure data sharing while also addressing usage restrictions, we explored alternative approaches to opening up Twitter data, such as focusing on “public voices.” In this approach, we would identify social media figures such as influencers, politicians, or popular figures from whom we can expect an intention to reach a large audience, or use hashtag-based topics to limit scope. Another option is to open up social media datasets in aggregated form, although this might restrict the agency and freedom of researchers in analysing the data (as in the case of the controversial Social Science One project for Facebook data). The less aggregated the data, the greater the researcher's agency and control. Additionally, datasets could be opened under specific licenses, akin to Creative Commons, that restrict harmful re-use, such as targeting employees critical of their companies or contributing to language model training without any notice. There are several cases where Twitter datasets have faced re-use and access challenges, such as the research project Digital Narratives of COVID-19, which uploaded a Twitter dataset to GitHub and Zenodo in the form of a list of Twitter conversation IDs that have to be “hydrated” to repopulate the actual dataset. However, after recent changes in social media APIs, the Twitter hydrator does not seem to work anymore, something which may also impact the UK Data Service repository of Twitter or X data sets. Other cases include a dataset of scholars on Twitter in 2022 published by researchers, which encountered various problems keeping it re-usable after changes in Twitter and Crossref, and something similar happened with OpenAlex or Twitter datasets uploaded to OSF. These examples underscore the need for sustainable data infrastructures for social media research that can accommodate evolving ethical standards and new technical constraints.
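The aggregated-release option mentioned above can be illustrated with a minimal sketch that publishes only daily tweet counts and withholds days with very few tweets. It is not a method used by the project: it assumes each tweet is a dictionary with a "created_at" key holding an ISO 8601 timestamp (as in Twitter API v2 tweet objects), and both the function name and the `min_count` suppression threshold are hypothetical choices.

```python
from collections import Counter


def daily_counts(tweets, min_count=5):
    """Aggregate tweets into per-day counts, suppressing sparse days.

    Assumption: each tweet is a dict whose "created_at" value is an ISO 8601
    string such as "2020-12-03T10:15:00.000Z"; the first 10 characters are
    the calendar date.
    """
    counts = Counter(t["created_at"][:10] for t in tweets)
    # Days with fewer than `min_count` tweets are dropped, because very small
    # cells make it easier to trace a count back to individual accounts.
    return {day: n for day, n in sorted(counts.items()) if n >= min_count}


if __name__ == "__main__":
    sample = [{"created_at": "2020-12-03T10:15:00.000Z"}] * 5
    sample += [{"created_at": "2020-12-04T08:00:00.000Z"}] * 2
    print(daily_counts(sample))
```

The threshold makes the trade-off discussed above concrete: a higher `min_count` protects users more but leaves the secondary researcher with coarser material to interpret.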

A different way of opening up the Shaping AI Twitter dataset: a proposal

Drawing on our participatory workshop, we offer the following recommendations for the Shaping AI project, which can also be applied to other Twitter research projects to ensure that data set curation serves the ends of open interpretive research.

Actively devise a strategy for opening up social media datasets, defining for whom and for what purposes this is done. Datasets of public interest could also be made available to certified researchers, journalists and activists, or held in semi-closed archives.
While achieving informed consent at scale on Twitter datasets may be unrealistic, experimentation could be carried out on forms of informed consent collection at scale using AI techniques.
Focus on the analysis of “public figures” (politicians, spokespersons, academics, etc.) whose expressions are relevant to the public interest, but always explicitly define the boundary between public and non-public figures, which is easily blurred, and the criteria used to draw it.
Establish forms of aggregation of the dataset that enable opening data to the public without compromising privacy or consent.
Identify, cultivate and maintain autonomous and sustainable data infrastructures that enable work on data in the face of changes in ownership or control of social media platforms.
Explore the creation of an open archive or inventory of controversies in which datasets can be explored, compared or combined. An inspiration for this Twitter inventory could be the famous Mapping Controversies website created by the MACOSPOL project led by Bruno Latour (see the screenshot below) or the work of xcol, which is an online “inventory of the endless invention that is integral to any ethnographic inquiry” that documents and curates four kinds of inventions: field devices, pedagogic open formats, “intraventions,” and prototypes.
A Twitter controversies inventory could focus on curating datasets for the study of public controversy, specify documentation on data provenance, and create so-called data sheets or brief, peer-reviewed publications that describe a dataset and its re-use potential. Rather than being static, this inventory could allow open interpretation of the datasets with interactive features. For example, researcher annotations of tweets or posts could be opened for interpretation, adding more and more categories relevant to further analyses.
For the inventory, an online application system for access to the datasets can be implemented to verify, before the conditions for re-use are explained, that they are shared only with certified researchers, journalists, or activists. Data-sharing licensing could be established to define intended uses and exclude misuses (e.g., targeting employees and training AI models).
The datasets could also be uploaded to an online general-purpose open repository like Zenodo or the UK Data Service. For the sake of privacy, the dataset should be configured under restricted access, and all user IDs should be encrypted. The original text of the tweets should not be provided, to avoid de-anonymisation. An AI-rephrased text could be offered as an alternative, explicitly stating that this is not the original text of the tweet. The original hashtags could be retained. A "data sample" can be included so researchers can get an idea of what the dataset contains.
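The last recommendation (protected user IDs, no original text, hashtags retained) can be sketched as below. This is a minimal illustration, not a vetted anonymisation procedure. It swaps in keyed hashing (HMAC-SHA256 from the Python standard library) for the "encryption" mentioned above, so the result is a one-way pseudonym rather than a reversible ciphertext. The field names "author_id" and "text", the function names, and the example key are all assumptions.

```python
import hashlib
import hmac
import re

# Simple hashtag pattern; real-world hashtag rules are more involved.
HASHTAG = re.compile(r"#\w+")


def pseudonymise_user(user_id, secret_key):
    """Map a user ID to a stable pseudonym with a keyed hash (HMAC-SHA256).

    Without `secret_key` (bytes, kept by the data custodian) the mapping
    cannot be recomputed from a list of known user IDs.
    """
    message = str(user_id).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def to_release_record(tweet, secret_key):
    """Build a shareable record: pseudonymous author and hashtags, no text.

    Assumption: `tweet` is a dict with "author_id" and "text" keys.
    """
    return {
        "author": pseudonymise_user(tweet["author_id"], secret_key),
        "hashtags": HASHTAG.findall(tweet.get("text", "")),
    }


if __name__ == "__main__":
    key = b"example-key-held-by-the-custodian"  # hypothetical key
    tweet = {"author_id": "12345", "text": "Reading #StochasticParrots on #AIethics"}
    print(to_release_record(tweet, key))
```

Because the same key always yields the same pseudonym, researchers can still follow one account across a conversation without learning who it is; an AI-rephrased text field, if offered, would be added as a separate, clearly labelled field.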

Feedback from Noortje Marres

Your recommendations for creating an open inventory of data sets for the study of public controversies are very appealing to me, but they also leave me feeling a tinge of regret. This is because I have been working towards precisely this objective for quite a few years, and yet much of its potential remains unrealised.

The curation of online data for the study of public controversies has been a central objective of digital researchers since the late 1990s, including in my own work. In 1998, I started mapping debates on the Web together with Richard Rogers (University of Amsterdam) and colleagues in computer-related design at the Royal College of Art (London). As part of this collaborative research, we created various public inventories of data mappings, including in the form of an online tool and archive called IssueCrawler (https://www.issuecrawler.net/).

In the early 2000s, we co-organised the workshop series The Social Life of Issues with the aim of creating an Online Issue Atlas. As part of this project, too, we created a public inventory of data maps, in the form of a basic web archive, a museum exhibit (where we projected data mappings in the “Making Things Public” exhibition space, as in the figure below), as well as the MACOSPOL website that you mentioned. Later still, we created the Wiki repository www.issuemapping.net, which is now hosted with support from the University of Warwick.

While these public inventories of data maps continue to be available, the communities of researchers that work with them are quite dispersed, and ongoing transformations of digital data infrastructures make it challenging to maintain them. Indeed, you could say that the recent history of the Internet has made it more difficult, not easier, to realise a shared vision of collaborative social and cultural research supported by open data.

For the online data mappings that we created in the late 1990s, the generally accessible Web served as the principal data source, and during this time the Web itself qualified as "open data." With the rise of social media platforms like Twitter and the advent of API-based data capture (Marres and Weltevrede, 2013), we in effect experienced the gradual enclosure of what were previously open data (Venturini et al., 2018). Online data became less open when online activity migrated to proprietary platforms like Twitter. This is a much longer story, but we are now in a phase where even the previously "open" Web is threatened as a site for the creation of open data, as AI companies treat this data as a resource for the creation of closed generative systems and clog the Web with synthetic content.

As a consequence of all this, we now live in a paradoxical time where "open" research requires restrictive settings, as you precisely suggest with your proposal to create protective digital spaces that can enable data curation for open research. It pleases me to read this proposal, and while the recent history of open data has been rather disappointing from the perspective of social and cultural research, I would be no less delighted to work with colleagues to explore how we can achieve this aim.

Scenario for the Issue Crawler Space in the Making Things Public Exhibition (ZKM, Karlsruhe, 2005)

Figure: Scenario for the projection of online data mappings created with Issuecrawler.net, Making Things Public, ZKM, Winter 2004