



Text Data in Economics

Warwick Summer School 2022


Teaching Team

Instructor: Elliott Ash
TA: Claudia Marangon



Lectures: XX

TA office hrs: XX


Important Links

GitHub with Notebooks

Problem Set


Learning Objectives:

LO1. Implement and evaluate text-as-data methods.

LO2. Evaluate the use of text-analysis tools in economics research.

LO3. Plan a research project using text data.


Course Format

     8 lectures on Zoom (12 hours)

     In-person workshopping of student project papers



     Problem set

     Referee report on one of the course readings

     Research proposal on a text-data project (first and second draft)






Topics Outline and Main Economics Papers Readings


  1. Overview
    1. Gentzkow, Kelly, and Taddy, “Text as Data.”
    2. Ash and Hansen, Text Algorithms in Economics
  2. Dictionaries (Macro)
    1. Baker, Bloom, and Davis (2016), Measuring economic policy uncertainty
    2. Hassan, Hollander, Van Lent, and Tahoun, Firm-Level Political Risk: Measurement and Effects
  3. Dictionaries (Micro)
    1. Michalopoulos and Xue (2019), Folklore
    2. Enke 2020, Moral values and voting
    3. Djourelova, Media persuasion through slanted language: Evidence from the coverage of immigration
    4. Truffa and Wong (2021), Undergraduate Gender Diversity and Direction of Scientific Research
    5. Advani, Ash, Cai, and Rasul (2022), Race-Related Research in Economics.
  4. Document Distance
    1. Kelly, Papanikolaou, Seru, and Taddy, Measuring technological innovation over the very long run.
    2. Cagé, Hervé, and Viaud, The production of information in an online world
  5. Topic Models
    1. Hansen, McMahon, and Prat, Transparency and deliberation with the FOMC: A computational linguistics approach.
    2. Ash, Morelli, and Vannoni, “More laws, more growth? Evidence from U.S. states”
  6. Supervised Learning
    1. Gentzkow and Shapiro (2010), What Drives Media Slant? Evidence from U.S. Daily Newspapers.
    2. Gentzkow, Shapiro, and Taddy (2019)
    3. Widmer, Galletta, and Ash, Media Slant is Contagious
  7. Word Embeddings (2 classes)
    1. Ash, Chen, and Ornaghi (2022)
    2. Gennaro and Ash, Emotion and Reason in Political Language (2021) and Transparency and Emotionality in Politics: Evidence from C-SPAN (2022).
    3. Ash, Gennaro, Hangartner, and Stampi-Bombelli (2022)
    4. Kozlowski et al
  8. Syntactic and Semantic Parsing
    1. Ash, Gauthier, and Widmer, Text semantics capture political and economic narratives


Learning Materials


     Natural Language Processing with Python, Third Edition (“NLTK Book”).

     Available at

     Classic treatments of traditional NLP tools.

     Aurelien Geron, Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow (2019)

     O’Reilly Book, should be available with an academic account using ETH email.

     A great practical book for machine learning and deep learning in Python, but not NLP-focused. We will use material from Chapters 2-4, 7-11, 13, and 15-17.

     The deep learning chapters use Keras + TensorFlow.

     Jupyter notebooks

     Yoav Goldberg, Neural Network Methods for Natural Language Processing (2017)

     ETH Library Online Access (email me if this doesn’t work)

     A more advanced theoretical treatment of neural networks with an NLP focus, but already somewhat dated. We will use material from Chapters 1-17 and 19.

     Jurafsky and Martin, Speech and Language Processing (3d Ed. 2019).

     Available here.

     The standard theory text on computational linguistics.



Python is probably the best option for NLP and is used by most data scientists. All the sample code is in Python, but you are welcome to use another programming language.

     New to Python?

     Python installation instructions

     Codecademy Online Python Course

     numpy tutorial

     pandas tutorial

     Jupyter Notebooks Tutorial

     Jupyter Notebook Keyboard Shortcuts

     Google Colab Tips for Power Users

     Dash Web Apps Tutorial

     Other Resources

     New to Machine Learning?

     Codecademy Machine Learning Course

     Read the Geron Book, Chapters 1-7

     Practical Deep Learning for Coders Course

     New to Text Mining / NLP?

     Codecademy Online NLP Course

     Read the NLTK Book, Chapters 1-5

     Code-First Introduction to Natural Language Processing

     Papers with Code (NLP)

     Lists of papers with replication repos.

     Want to use R instead?

     Quanteda is popular for text analysis among political scientists.

     Other resources:

     How to use the terminal

     How to use Google Colab notebooks


Python Libraries

pip install pandas seaborn scikit-learn tensorflow nltk gensim flair spacy transformers


     pandas: data loading and management

     seaborn: visualization

     sklearn: general purpose Python ML library

     Keras + TensorFlow: deep learning library

     NLP Necessities:

     nltk: standard NLP tools

     gensim: topic models and embeddings

     spaCy: tokenization, NER, syntactic parsing, word vectors

     flair: sentiment analysis and some other tools (tutorials)

     huggingface transformers: transformer architectures

     Specialized tools:

     AllenNLP: library of models for semantic role labeling, entailment, question answering, etc

     fastText: library of embeddings

     spacy-transformers: interface from spaCy to huggingface
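
After running the pip command above, a quick check of which libraries are actually importable can save time before the first notebook. The following is a minimal sketch using only the Python standard library; the names `COURSE_MODULES` and `check_installed` are made up for this illustration and are not part of the course materials. Note that the pip package `scikit-learn` is imported as `sklearn`.

```python
from importlib import util

# Import names for the packages in the pip command above.
# The pip package "scikit-learn" is imported under the name "sklearn".
COURSE_MODULES = [
    "pandas", "seaborn", "sklearn", "tensorflow",
    "nltk", "gensim", "flair", "spacy", "transformers",
]


def check_installed(modules=COURSE_MODULES):
    """Return {module_name: True/False} without importing the modules."""
    status = {}
    for name in modules:
        try:
            # find_spec locates a module without executing it.
            status[name] = util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for unusual module states; treat as missing.
            status[name] = False
    return status


if __name__ == "__main__":
    for name, ok in check_installed().items():
        print(f"{name:<14}{'ok' if ok else 'MISSING'}")
```

Any library reported as MISSING can be installed on its own with pip. The check only confirms that a module can be found, not that it loads without errors.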




Yellow highlighting indicates required reading
Blue highlighting indicates recommended methods reading


Reference (Overview):

     Gentzkow, Kelly, and Taddy, “Text as Data.”

     Goldberg, Ch. 1

     NLTK book, Chapters 1, 2, 4

     Grimmer and Stewart, “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.”

     Raschka, “Turn your Twitter Timeline into a Word Cloud”.


Reference (Dictionary Methods):

     RegExOne Regular Expressions Lessons


Reference (Tokenization):

     scikit-learn text feature extraction

     Goldberg, Ch. 6

     NLTK book, Chapters 3, 5, 7, 8

     A deep dive into preprocessing in NLP

     Denny and Spirling, “Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It.”



Reference (Dimensionality Reduction):

     Geron, Chapters 8-9

     Gillis, The why and how of nonnegative matrix factorization.

Methods (Document Distance):

     Lee et al, An empirical evaluation of models of text document similarity.

     Brandon Rose, “Document clustering in python.”

Methods (Topic Models)

     Prabhakaran, Topic Modeling with Gensim

     Quinn, et al, How to analyze political attention with minimal assumptions and costs.

     Roberts et al, “Structural Topic Models for Open-Ended Survey Responses”.

     Christian Fong and Justin Grimmer, “Discovery of treatments from text corpora.”


Reference (Machine Learning):

     Goldberg Ch. 2, 7

     Google Developers Text Classification Guide

     NLTK book, chapter 6

     Geron, Chapters 2-4, 7



Overview (Deep Learning for NLP)

     Sebastian Ruder, Deep Learning for NLP, Best Practices

Reference (Neural Nets):

     Text classification from raw text (Google Colab)

     Goldberg, Ch. 3-5

     Geron, Chapters 10-11

     Leslie Smith, A disciplined approach to neural network hyper-parameters

     Chris Olah, Backpropagation

     Baldi and Sadowski, Understanding Dropout

Reference (Embedding Layers):

     Goldberg, Ch. 8

     Bag of tricks for efficient text classification

References (RNNs):

     Geron Ch. 15-17

     Goldberg, Ch. 14-17

     Sutskever, Vinyals, and Le, Sequence to sequence learning with neural networks

     Michael Nguyen, Illustrated Guide to LSTMs and GRUs

     Andrej Karpathy, The unreasonable effectiveness of recurrent neural networks

     Chang and Masterson, Using word order in political text classification with long short-term memory models.

Reference (Model Interpretation)

     Ribeiro, Singh, and Guestrin, Local interpretable model-agnostic explanations (LIME): An introduction.

     Python Notebook with Model Interpretation Examples

Applications (MLP):

     Vamossy, Investor Emotions and Earnings Announcements

     Meursault, The language of earnings announcements

Applications (RNN):

     [short] Iyyer et al, Political ideology detection using recursive neural networks.     

     Ash et al, In-Group Bias in the Indian Judiciary


Reference (Word Embeddings):

     Spirling and Rodriguez, Word embeddings: What works, what doesn’t, and how to tell the difference for applied research.

     Goldberg, Ch. 10-11

     Yoav Goldberg and Omer Levy, “Word2Vec explained: Deriving Mikolov et al's Negative Sampling Word Embedding Method”.

     Piero Molino, “Word embeddings: Past, present, and future”.

     Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger, “From word embeddings to document distances”.

     Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski, Linear Algebraic Structure of Word Senses, with Applications to Polysemy

     Allen and Hospedales, Analogies Explained: Towards Understanding Word Embeddings

     Ruder, Approximating the Softmax

     Peters, Ruder, and Smith, To tune or not to tune: Adapting pretrained representations to diverse tasks.

     ConceptNet NumberBatch

     Bojanowski et al, Enriching word vectors with subword information.

     Antoniak and Mimno, Evaluating the stability of embedding-based word similarities.

     Ash, Chen, and Ornaghi, Gender attitudes in the judiciary: Evidence from U.S. Circuit Courts

     Hamilton, Clark, Leskovec, and Jurafsky, 2016, Inducing domain-specific sentiment lexicons from unlabeled corpora.

Tools (Word Embeddings)

     Word embeddings in Flair

Contextualized Word Embeddings:

     Peters et al, Deep contextualized word representations.

     ELMo embeddings with Flair


Reference (Syntactic Parsing):

     NLTK Book, Chapter 8 Analyzing Sentence Structure

     Jurafsky and Martin, Chapters 12-15, 20

     ClearNLP Dependency Labels

Reference (Semantic Role Labeling):

     Jurafsky and Martin, Ch. 19: Semantic Role Labeling

     English PropBank Annotation Guidelines


Tools (Syntactic Parsing):

     spaCy 101

Tools (Semantic Role Labeling):

     Google Syntactic N-Grams Corpus


Tools for Document Embeddings

     Ruder, Deep Learning for NLP Best Practices

     huggingface transformers

     spaCy interface to transformers

References (Document Embeddings):

     Arora, Liang, and Ma, “A simple but tough-to-beat baseline for sentence embeddings.”


     Le and Mikolov, “Distributed representations of sentences and documents.”

     A gentle introduction to Doc2Vec

     Doc2vec implementation in Keras

     Explanation of Doc2Vec Infer Vector

     Wu et al, Starspace: Embed all the things!

     Bhatia, Lau, and Baldwin, “Automatic labeling of topics with neural embeddings”



     Cer et al, Universal Sentence Encoder (code)

     Yang et al, Multilingual universal sentence encoder for semantic retrieval.

     Clark, Celikyilmaz, and Smith, Sentence mover’s similarity: Automatic evaluation for multi-sentence texts.

References (Attention / Transformers)

     Bloem, Transformers from scratch

     Ruder, NLP’s ImageNet moment has arrived

     The transformer explained

     Geron, Chapter 16

     Goldberg, Ch. 17

     Vaswani et al, Attention is all you need



     Devlin et al, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

     The illustrated BERT

     A primer in BERTology

     Reimers and Gurevych, Sentence-BERT

     Nie et al, DisSent: Learning Sentence Representations from Explicit Discourse Relations

     Hassan et al, BERT, ELMo, USE, and InferSent Sentence Encoders


Reference (Language Models):

     Goldberg, Ch. 9, 17

Reference (Transformers):

     Huggingface Transformers Summary of Models

Reference (Autoregressive Models)

     Shree, The journey of OpenAI GPT Models

     Radford et al, Language models are unsupervised multitask learners.

     GPT-2 Demo

     XLNet: Generalized autoregressive pretraining for language understanding.

     Transformer-XL: Attentive language models beyond a fixed-length context.

     Brown et al, Language models are few-shot learners (GPT-3)

     GPT-Neo (open-source GPT-3-style model)

Reference: Conditioned Text Generation

     Grover: A state-of-the-art defense against neural fake news

     Dathathri et al, Controlling text generation with plug and play language models



Reference (Sequence-to-Sequence Transformers)


     Lewis et al, BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

     Raffel et al, Exploring the limits of transfer learning with a unified text-to-text transformer

     Narang et al, WT5?! Training Text-to-Text Models to Explain their Predictions


Reference (Coreference Resolution):

     Jurafsky and Martin, Ch. 21: Coreference Resolution

Reference (Textual Entailment)


Reference (Siamese Neural Nets)

     Few shot learning in NLP with USE and siamese networks

Reference (Discourse):

     Jurafsky and Martin, Ch. 23: Discourse Coherence

Reference (Dialogue):

     Li et al, Adversarial learning for neural dialogue generation.

     Toward a conversational agent that can chat about anything

     A state-of-the-art open source chatbot

     Luo et al, Detecting stance in media on global warming


Big Bird

     Understanding BigBird’s Block Sparse Attention

Reference (Information Extraction):

     NLTK Book, Chapter 7: Extracting Information

     Jurafsky and Martin, Ch. 17 Information Extraction

     Angeli et al, Leveraging linguistic structure for open domain information extraction.

     Stanford OpenIE Python Wrapper

     Qin et al, ERICA: Improving entity and relation understanding for pre-trained language models via contrastive learning

Reference (Knowledge Graphs)


     Nickel, Knowledge Graph Embeddings

     Joulin et al, Fast linear model for knowledge graph embeddings

     Lehmann et al, DBPedia -- A large-scale, multilingual knowledge base extracted from Wikipedia.

     Dense Representations for Entity Retrieval (code)

     Connecting the Dots: Document level neural relation extraction with edge-oriented graphs

     Yao et al, KG-BERT: BERT for knowledge graph completion

Reference (Summarization):

     Gabriel et al, Cooperative Generator-Discriminator Networks for Abstractive Summarization with Narrative Flow

     See et al, Get to the point: Summarization with pointer-generator networks

     TLDR: Extreme Summarization of Scientific Documents

     Stiennon et al, Learning to summarize from human feedback

Reference (Question Answering):

     Jurafsky and Martin, Ch. 25

     Question answering with huggingface transformers

     NLP Progress: Question Answering

Reference (Claim Checking):

     Vlachos, e-Fever

Language Model Interpretation:



Reference (Legal AI)

     Zhong et al, A summary of legal artificial intelligence

Methods (Causal Inference with Text):

     Keith et al, Text and causal inference: A review of using text to remove confounding from causal estimates

     Wood-Doughty et al, Challenges of Using Text Classifiers for Causal Inference

     Egami, Fong, Grimmer, Roberts, and Stewart, How to Make Causal Inferences Using Texts



Additional Applications


Complexity in Text:

     Katz and Bommarito, “Measuring the complexity of the law: The United States Code.”

     Katz et al, Complex societies and the growth of the law

     Benoit, Munger, and Spirling (2017), “Measuring and Explaining Political Sophistication Through Textual Complexity”.

     [short] Louis and Nenkova (2013), What Makes Writing Great? First Experiments on Article Quality Prediction in the Science Journalism Domain


Dictionary Methods:

     Michalopoulos and Xue (2019), Folklore

     Baker, Bloom, and Davis (2016), Measuring economic policy uncertainty

     Djourelova, Media persuasion through slanted language: Evidence from the coverage of immigration

     Enke 2020, Moral values and voting

     Cao et al, How to talk when machines are listening: Corporate disclosure in the age of AI


     Gentzkow and Shapiro (2010), What Drives Media Slant? Evidence from U.S. Daily Newspapers.

     Ash, Morelli, and Van Weelden, Elections and divisiveness: Theory and evidence.


Document Distance:

     Kelly, Papanikolaou, Seru, and Taddy, Measuring technological innovation over the very long run.

     Hoberg and Phillips, Text-based network industries and endogenous product differentiation.


Topic Models

     Barron, Huang, Spang, and DeDeo, Individuals, institutions, and innovation in the debates of the French Revolution. [has appendix]

     Hansen, McMahon, and Prat, Transparency and deliberation with the FOMC: A computational linguistics approach.

     Ash, Morelli, and Vannoni, “More laws, more growth? Evidence from U.S. states”



Text Classification:

     Osnabrugge, Ash, and Morelli, Cross-domain topic classification for political texts

     Widmer, Galletta, and Ash, Media slant is contagious

     Gentzkow, Shapiro, and Taddy (2019), “Measuring Group Differences in High-Dimensional Choices: Method and Application to Congressional Speech.”

     Kelly et al, Text selection

     Peterson and Spirling, Classification Accuracy as a Substantive Quantity of Interest: Measuring Polarization in Westminster Systems


Word Embeddings:

     Ash, Chen, and Ornaghi, Gender attitudes in the judiciary: Evidence from U.S. Circuit Courts

     Gennaro and Ash, Emotion and Reason in Political Language

     Caliskan et al, “Semantics derived automatically from language corpora contain human-like biases”

     Bolukbasi et al, Man is to computer programmer as woman is to homemaker: Debiasing word embeddings.

     Kozlowski, Taddy, and Evans 2019, The geometry of culture: Analyzing the meanings of class through word embeddings

     Stoltz and Taylor, Concept Mover's Distance: Measuring Concept Engagement in Texts via Word Embeddings

     [short] Gillani and Levy, Simple dynamic word embeddings for perceptions in the public sphere.

     Garg et al 2018, Word embeddings quantify 100 years of gender and ethnic stereotypes [includes appendix]

     [short] Lucy et al, Content analysis of textbooks via natural language processing

     Thompson et al, Cultural influences on word meanings revealed through large-scale semantic alignment [includes appendix]

     Rheault and Cochrane, Word embeddings for the analysis of ideological placement in parliamentary corpora.

     Nyarko and Sanga, A statistical test for legal interpretation.


Syntactic Parsing:

     Hoyle et al, Unsupervised discovery of gendered language through latent-variable modeling.

     [short] Ash, Jacobs, MacLeod, Naidu, and Stammbach, Unsupervised extraction of workplace rights and duties from collective bargaining agreements.

     Vannoni, Ash, and Morelli, Measuring Discretion and Delegation in Legislative Texts

     Michael Webb, The impact of artificial intelligence on the labor market


Semantic Role Labeling:

     Ash, Gauthier, and Widmer, Mining narratives from large text corpora

     Fetzer, Can workfare programs moderate conflict? Evidence from India


Information Extraction:

     [short] Surdeanu et al 2011, Customizing an information extraction system for a new domain.

     [short] Jurafsky and Chambers, Unsupervised learning of narrative schemas and their participants.

     [short] Clark, Ji, and Smith, Neural text generation in stories using entity representations as context.

     [short] Bamman and Smith, Open extraction of fine-grained political statements.

     [short] Wyner and Peters, On rule extraction from regulations.

     [short] Xia and Ding, Emotion-Cause Pair Extraction


Document Embeddings:

     [short] Demszky et al, 2019, Analyzing Polarization in Social Media: Method and Application to Tweets on 21 Mass Shootings

     [short] Dai, Olah, and Le, Document Embedding with Paragraph Vectors.

     Ash and Chen 2018, Mapping the geometry of law using document embeddings.

     Galletta, Ash, and Chen, “Causal Effects of Judicial Sentiment: Methods and Application to U.S. Circuit Courts”

     Ash, Chen, and Naidu (2020), “Ideas have consequences: The effect of law and economics on American justice.”

     [short] Tong et al, Low-skilled jobs face the highest re-skilling pressure.

Transformer Classification:

     Bingler et al, Cheap Talk and Cherry-Picking: What ClimateBert has to say on Corporate Climate Risk Disclosures

     Pei and Jurgens, Quantifying intimacy in language


Language Models:

     [short] Peric, Mijic, Stammbach, and Ash, Legal language modeling with transformers

     Kreps, McCain, and Brundage, All the news that’s fit to fabricate.

     [short] Peng et al, Fine-tuning a transformer-based language model to avoid generating non-normative text

     [short] Nadeem, Bethke, and Reddy, StereoSet: Measuring stereotypical bias in pre-trained language models


Local Semantics

     [short] Ross et al, Explaining NLP models via minimal contrastive editing

     Prabhakaran et al, How metaphors impact political discourse.


Global Semantics

     [short] Stammbach and Ash, e-FEVER: Explanations and summaries for automated fact checking

     Chen et al, Opinion aware knowledge graph for political ideology detection

     Vold and Conrad, Using transformers to improve answer retrieval for legal questions.


Causal Inference with Text:

     Margaret Roberts, Brandon Stewart, and Richard Nielsen, “Matching Methods for High-Dimensional Data with Applications to Text”

     [short] Veitch et al, Using text embeddings for causal inference

     All the papers in Table 1 here.

     Zeng et al, Uncovering interpretable potential confounders in electronic medical records.


Argument Mining

     [short] Subramanian et al, Target Based Speech Act Classification in Political Campaign Text


Quote Extraction

     Newell et al, Quote extraction and analysis for news