Text Data in Economics
Warwick Summer School 2022
Teaching Team
Instructor: Elliott Ash
TA: Claudia Marangon
Lectures: XX
TA office hrs: XX
Learning Objectives:
LO1. Implement and evaluate text-as-data methods.
LO2. Evaluate the use of text-analysis tools in economics research.
LO3. Plan a research project using text data.
Course Format
● 8 lectures on Zoom (12 hours)
● In-person workshopping of student project papers
● Problem set
● Referee report on one of the course readings
● Research proposal on a text-data project (first and second draft)
Topics Outline and Main Economics Paper Readings
- Overview
- Gentzkow, Kelly, and Taddy, “Text as Data.”
- Ash and Hansen, Text Algorithms in Economics
- Dictionaries (Macro)
- Baker, Bloom, and Davis (2016), Measuring economic policy uncertainty
- Hassan, Hollander, Van Lent, and Tahoun, Firm-Level Political Risk: Measurement and Effects
- Dictionaries (Micro)
- Michalopoulos and Xue (2019), Folklore
- Enke 2020, Moral values and voting
- Djourelova, Media persuasion through slanted language: Evidence from the coverage of immigration
- Truffa and Wong (2021), Undergraduate Gender Diversity and Direction of Scientific Research
- Advani, Ash, Cai, and Rasul (2022), Race-Related Research in Economics.
- Document Distance
- Kelly, Papanikolaou, Seru, and Taddy, Measuring technological innovation over the very long run.
- Cage, Herve, and Viaud, The production of information in an online world
- Topic Models
- Hansen, McMahon, and Prat, Transparency and deliberation with the FOMC: A computational linguistics approach.
- Ash, Morelli, and Vannoni, “More laws, more growth? Evidence from U.S. states”
- Supervised Learning
- Gentzkow and Shapiro (2010), What Drives Media Slant? Evidence from U.S. Daily Newspapers.
- Gentzkow, Shapiro, and Taddy (2019)
- Widmer, Galletta, and Ash, Media Slant is Contagious
- Word Embeddings (2 classes)
- Ash, Chen, and Ornaghi (2022)
- Gennaro and Ash, Emotion and Reason in Political Language (2021) and Transparency and Emotionality in Politics: Evidence from C-SPAN (2022).
- Ash, Gennaro, Hangartner, and Stampi-Bombelli (2022)
- Kozlowski et al
- Syntactic and Semantic Parsing
- Ash, Gauthier, and Widmer, Text semantics capture political and economic narratives
Learning Materials
● Natural Language Processing with Python, Third Edition (“NLTK Book”).
○ Available at
○ Classic treatments of traditional NLP tools.
● Aurelien Geron, Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow (2019)
○ O’Reilly Book, should be available with an academic account using ETH email.
○ A great practical book for machine learning and deep learning in Python, but not NLP-focused. We will use material from Chapters 2-4, 7-11, 13, and 15-17.
○ The deep learning chapters use Keras + TensorFlow.
● Yoav Goldberg, Neural Network Methods for Natural Language Processing (2017)
○ ETH Library Online Access (email me if this doesn’t work)
○ A more advanced theoretical treatment of neural networks with an NLP focus, but already somewhat dated. We will use material from Chapters 1-17 and 19.
● Jurafsky and Martin, Speech and Language Processing (3d Ed. 2019).
○ The standard theory text on computational linguistics.
Python is probably the best option for NLP and is what most data scientists use. All the sample code is in Python, but you are welcome to use another programming language.
● New to Python?
○ Python installation instructions
○ Codecademy Online Python Course
○ Jupyter Notebook Keyboard Shortcuts
○ Google Colab Tips for Power Users
● New to Machine Learning?
○ Codecademy Machine Learning Course
○ Read the Geron Book, Chapters 1-7
○ Practical Deep Learning for Coders Course
● New to Text Mining / NLP?
○ Codecademy Online NLP Course
○ Read the NLTK Book, Chapters 1-5
○ Code-First Introduction to Natural Language Processing
○ Lists of papers with replication repos.
● Want to use R instead?
○ Quanteda is popular for text analysis among political scientists.
● Other resources:
○ How to use Google Colab notebooks
Python Libraries
pip install pandas seaborn scikit-learn tensorflow nltk gensim flair spacy transformers
● Basics:
○ pandas: data loading and management
○ seaborn: visualization
○ sklearn: general purpose Python ML library
○ Keras + TensorFlow: deep learning library
● NLP Necessities:
○ nltk: standard NLP tools
○ gensim: topic models and embeddings
○ spaCy: tokenization, NER, syntactic parsing, word vectors
○ flair: sentiment analysis and some other tools (tutorials)
○ huggingface transformers: transformer architectures
● Specialized tools:
○ AllenNLP: library of models for semantic role labeling, entailment, question answering, etc.
○ fastText: library of embeddings
○ spacy-transformers: interface from spaCy to huggingface
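After running the install command, it may help to confirm which of the libraries above can be found in your environment. The following is a minimal sketch using only the standard library. The helper name `check_packages` is hypothetical, and the import names in the list are assumptions about how each package is imported (for example, scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names for the packages listed above (assumed; note that
# scikit-learn is imported as "sklearn").
COURSE_PACKAGES = [
    "pandas", "seaborn", "sklearn", "tensorflow",
    "nltk", "gensim", "flair", "spacy", "transformers",
]


def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    status = {}
    for name in names:
        # find_spec returns None when a top-level module is not installed;
        # it locates the package without importing it.
        status[name] = importlib.util.find_spec(name) is not None
    return status


for name, found in check_packages(COURSE_PACKAGES).items():
    print(f"{name:15s} {'ok' if found else 'MISSING'}")
```

Any package reported as MISSING can be installed individually with pip.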
Yellow highlighting indicates required reading.
Blue highlighting indicates recommended methods reading.
Reference (Overview):
● Gentzkow, Kelly, and Taddy, “Text as Data.”
● Goldberg, Ch. 1
● NLTK book, Chapters 1, 2, 4
● Grimmer and Stewart, “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.”
● Raschka, “Turn your Twitter Timeline into a Word Cloud”.
Reference (Dictionary Methods):
● RegExOne Regular Expressions Lessons
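As a rough illustration of how regular expressions support a dictionary method, the sketch below counts occurrences of terms from a small word list and reports them as a share of all tokens. The term list and sample sentence are made up for illustration and are not taken from any course reading; `dictionary_share` is a hypothetical helper.

```python
import re

# Illustrative term list (an assumption, not from any course reading).
UNCERTAINTY_TERMS = ["uncertain", "uncertainty", "risk", "risks"]

# \b anchors restrict matches to whole words; IGNORECASE handles capitalization.
PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, UNCERTAINTY_TERMS)) + r")\b",
    flags=re.IGNORECASE,
)


def dictionary_share(text):
    """Return (number of dictionary hits, share of tokens that are hits)."""
    tokens = re.findall(r"\b\w+\b", text)
    hits = PATTERN.findall(text)
    share = len(hits) / len(tokens) if tokens else 0.0
    return len(hits), share


sample = "Policy uncertainty rose as firms flagged new risks. Risk is uncertain."
print(dictionary_share(sample))
```

Applied to each document in a corpus, a count like this gives a simple document-level measure that can be aggregated by date or by source.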
Reference (Tokenization):
● scikit-learn text feature extraction
● Goldberg, Ch. 6
● NLTK book, Chapters 3, 5, 7, 8
● A deep dive into preprocessing in NLP
● Denny and Spirling, “Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It.”
Reference (Dimensionality Reduction):
● Geron, Chapters 8-9
● Gillis, The why and how of nonnegative matrix factorization.
Methods (Document Distance):
● Lee et al, An empirical evaluation of models of text document similarity.
● Brandon Rose, “Document clustering in python.”
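One common document-similarity measure is the cosine similarity between word-count vectors. The sketch below is a minimal pure-Python version under simple assumptions (lowercasing, word-character tokens, raw counts with no TF-IDF weighting). It is not the method of any particular reading above, and the helper names and example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word tokens."""
    return Counter(re.findall(r"\b\w+\b", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two Counter term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for empty documents
    return dot / (norm_a * norm_b)


doc1 = bag_of_words("The patent describes a new engine design.")
doc2 = bag_of_words("A new engine design is described in the patent.")
doc3 = bag_of_words("Central bank minutes discuss inflation.")
print(cosine_similarity(doc1, doc2), cosine_similarity(doc1, doc3))
```

In practice the same computation is usually run on sparse TF-IDF matrices built with scikit-learn rather than on Python dictionaries.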
Methods (Topic Models)
● Prabhakaran, Topic Modeling with Gensim
● Quinn, et al, How to analyze political attention with minimal assumptions and costs.
● Roberts et al, “Structural Topic Models for Open-Ended Survey Responses”.
● Christian Fong and Justin Grimmer, “Discovery of treatments from text corpora.”
Reference (Machine Learning):
● Goldberg Ch. 2, 7
● Google Developers Text Classification Guide
● NLTK book, chapter 6
● Geron, Chapters 2-4, 7
Overview (Deep Learning for NLP)
● Sebastian Ruder, Deep Learning for NLP, Best Practices
Reference (Neural Nets):
● Text classification from raw text (Google Colab)
● Goldberg, Ch. 3-5
● Geron, Chapters 10-11
● Leslie Smith, A disciplined approach to neural network hyper-parameters
● Chris Olah, Backpropagation
● Baldi and Sadowski, Understanding Dropout
Reference (Embedding Layers):
● Goldberg, Ch. 8
● Bag of tricks for efficient text classification
References (RNNs):
● Geron Ch. 15-17
● Goldberg, Ch. 14-17
● Sutskever, Vinyals, and Le, Sequence to sequence learning with neural networks
● Michael Nguyen, Illustrated Guide to LSTMs and GRUs
● Andrej Karpathy, The unreasonable effectiveness of recurrent neural networks
● Chang and Masterson, Using word order in political text classification with long short-term memory models.
Reference (Model Interpretation)
● Ribeiro, Singh, and Guestrin, Local interpretable model-agnostic explanations (LIME): An introduction.
● Python Notebook with Model Interpretation Examples
Applications (MLP):
● Vamossy, Investor Emotions and Earnings Announcements
● Meursault, The language of earnings announcements
Applications (RNN):
● [short] Iyyer et al, Political ideology detection using recursive neural networks.
● Ash et al, In-Group Bias in the Indian Judiciary
Reference (Word Embeddings):
● Spirling and Rodriguez, Word embeddings: What works, what doesn’t, and how to tell the difference for applied research.
● Goldberg, Ch. 10-11
● Yoav Goldberg and Omer Levy, “Word2Vec explained: Deriving Mikolov et al's Negative Sampling Word Embedding Method”.
● Piero Molino, “Word embeddings: Past, present, and future”.
● Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger, “From word embeddings to document distances”.
● Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski, Linear Algebraic Structure of Word Senses, with Applications to Polysemy
● Allen and Hospedales, Analogies Explained: Towards Understanding Word Embeddings
● Ruder, Approximating the Softmax
● Peters, Ruder, and Smith, To tune or not to tune: Adapting pretrained representations to diverse tasks.
● Bojanowski et al, Enriching word vectors with subword information.
● Antoniak and Mimno, Evaluating the stability of embedding-based word similarities.
● Ash, Chen, and Ornaghi, Gender attitudes in the judiciary: Evidence from U.S. Circuit Courts
● Hamilton, Clark, Leskovec, and Jurafsky, 2016, Inducing domain-specific sentiment lexicons from unlabeled corpora.
Tools (Word Embeddings)
Contextualized Word Embeddings:
● Peters et al, Deep contextualized word representations.
Reference (Syntactic Parsing):
● NLTK Book, Chapter 8 Analyzing Sentence Structure
● Jurafsky and Martin, Chapters 12-15, 20
Reference (Semantic Role Labeling):
● Jurafsky and Martin, Ch. 19: Semantic Role Labeling
● English PropBank Annotation Guidelines
Tools (Syntactic Parsing):
Tools (Semantic Role Labeling):
● Google Syntactic N-Grams Corpus
Tools for Document Embeddings
● Ruder, Deep Learning for NLP Best Practices
● spaCy interface to transformers
References (Document Embeddings):
● Arora, Liang, and Ma, “A simple but tough-to-beat baseline for sentence embeddings.”
● Doc2Vec:
○ Le and Mikolov, “Distributed representations of sentences and documents.”
○ A gentle introduction to Doc2Vec
○ Doc2vec implementation in Keras
○ Explanation of Doc2Vec Infer Vector
● Wu et al, Starspace: Embed all the things!
● Bhatia, Lau, and Baldwin, “Automatic labeling of topics with neural embeddings”
● InferSent
● USE:
○ Cer et al, Universal Sentence Encoder (code)
○ Yang et al, Multilingual universal sentence encoder for semantic retrieval.
● Clark, Celikyilmaz, and Smith, Sentence mover’s similarity: Automatic evaluation for multi-sentence texts.
References (Attention / Transformers)
● Bloem, Transformers from scratch
● Ruder, NLP’s ImageNet moment has arrived
● Geron, Chapter 16
● Goldberg, Ch. 17
● Vaswani et al, Attention is all you need
○ Devlin et al, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
○ Reimers and Gurevych, Sentence-BERT
○ Nie et al, DisSent: Learning Sentence Representations from Explicit Discourse Relations
○ Hassan et al, BERT, ELMo, USE, and InferSent Sentence Encoders
Reference (Language Models):
● Goldberg, Ch. 9, 17
Reference (Transformers):
● Huggingface Transformers Summary of Models
Reference (Autoregressive Models)
● Shree, The journey of OpenAI GPT Models
● Radford et al, Language models are unsupervised multitask learners.
● XLNet: Generalized autoregressive pretraining for language understanding.
● Transformer-XL: Attentive language models beyond a fixed-length context.
● Brown et al, Language models are few-shot learners (GPT-3)
● GPT-Neo (open-source GPT-3-style model)
Reference (Conditioned Text Generation)
● Grover: A state-of-the-art defense against neural fake news
● Dathathri et al, Controlling text generation with plug and play language models
Reference (Sequence-to-Sequence Transformers)
● EasyNMT
● Lewis et al, BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
● Raffel et al, Exploring the limits of transfer learning with a unified text-to-text transformer
● Narang et al, WT5?! Training Text-to-Text Models to Explain their Predictions
Reference (Coreference Resolution):
● Jurafsky and Martin, Ch. 21: Coreference Resolution
Reference (Textual Entailment)
Reference (Siamese Neural Nets)
● Few shot learning in NLP with USE and siamese networks
Reference (Discourse):
● Jurafsky and Martin, Ch. 23: Discourse Coherence
Reference (Dialogue):
● Li et al, Adversarial learning for neural dialogue generation.
● Toward a conversational agent that can chat about anything
○ A state-of-the-art open source chatbot
● Luo et al, Detecting stance in media on global warming
Big Bird
● Understanding BigBird’s Block Sparse Attention
Reference (Information Extraction):
● NLTK Book, Chapter 7: Extracting Information
● Jurafsky and Martin, Ch. 17 Information Extraction
● Angeli et al, Leveraging linguistic structure for open domain information extraction.
● Stanford OpenIE Python Wrapper
● Qin et al, ERICA: Improving entity and relation understanding for pre-trained language models via contrastive learning
Reference (Knowledge Graphs)
● Nickel, Knowledge Graph Embeddings
● Joulin et al, Fast linear model for knowledge graph embeddings
● Lehmann et al, DBPedia -- A large-scale, multilingual knowledge base extracted from Wikipedia.
● Dense Representations for Entity Retrieval (code)
● Connecting the Dots: Document level neural relation extraction with edge-oriented graphs
● Yao et al, KG-BERT: BERT for knowledge graph completion
Reference (Summarization):
● Gabriel et al, Cooperative Generator-Discriminator Networks for Abstractive Summarization with Narrative Flow
● See et al, Get to the point: Summarization with pointer-generator networks
● TLDR: Extreme Summarization of Scientific Documents
● Stiennon et al, Learning to summarize from human feedback
Reference (Question Answering):
● Jurafsky and Martin, Ch. 25
● Question answering with huggingface transformers
● NLP Progress: Question Answering
Reference (Claim Checking):
● Vlachos, e-Fever
Language Model Interpretation:
Reference (Legal AI)
● Zhong et al, A summary of legal artificial intelligence
Methods (Causal Inference with Text):
● Keith et al, Text and causal inference: A review of using text to remove confounding from causal estimates
● Wood-Doughty et al, Challenges of Using Text Classifiers for Causal Inference
● Egami, Fong, Grimmer, Roberts, and Stewart, How to Make Causal Inferences Using Texts
Additional Applications
Complexity in Text:
● Katz and Bommarito, “Measuring the complexity of the law: The United States Code.”
● Katz et al, Complex societies and the growth of the law
● Benoit, Munger, and Spirling (2017), “Measuring and Explaining Political Sophistication Through Textual Complexity”.
● [short] Louis and Nenkova (2013), What Makes Writing Great? First Experiments on Article Quality Prediction in the Science Journalism Domain
Dictionary Methods:
● Michalopoulos and Xue (2019), Folklore
● Baker, Bloom, and Davis (2016), Measuring economic policy uncertainty
● Djourelova, Media persuasion through slanted language: Evidence from the coverage of immigration
● Enke 2020, Moral values and voting
● Cao et al, How to talk when machines are listening: Corporate disclosure in the age of AI
● Gentzkow and Shapiro (2010), What Drives Media Slant? Evidence from U.S. Daily Newspapers.
● Ash, Morelli, and Van Weelden, Elections and divisiveness: Theory and evidence.
Document Distance:
● Kelly, Papanikolaou, Seru, and Taddy, Measuring technological innovation over the very long run.
● Hoberg and Phillips, Text-based network industries and endogenous product differentiation.
Topic Models
● Barron, Huang, Spang, and DeDeo, Individuals, institutions, and innovation in the debates of the French Revolution. [has appendix]
● Hansen, McMahon, and Prat, Transparency and deliberation with the FOMC: A computational linguistics approach.
● Ash, Morelli, and Vannoni, “More laws, more growth? Evidence from U.S. states”
Text Classification:
● Osnabrugge, Ash, and Morelli, Cross-domain topic classification for political texts
● Widmer, Galletta, and Ash, Media slant is contagious
● Gentzkow, Shapiro, and Taddy (2019), “Measuring Group Differences in High-Dimensional Choices: Method and Application to Congressional Speech.”
● Kelly et al, Text selection
● Peterson and Spirling, Classification Accuracy as a Substantive Quantity of Interest: Measuring Polarization in Westminster Systems
Word Embeddings:
● Ash, Chen, and Ornaghi, Gender attitudes in the judiciary: Evidence from U.S. Circuit Courts
● Gennaro and Ash, Emotion and Reason in Political Language
● Caliskan et al, “Semantics derived automatically from language corpora contain human-like biases”
● Bolukbasi et al, Man is to computer programmer as woman is to homemaker: Debiasing word embeddings.
● Kozlowski, Taddy, and Evans 2019, The geometry of culture: Analyzing the meanings of class through word embeddings
● Stoltz and Taylor, Concept Mover's Distance: Measuring Concept Engagement in Texts via Word Embeddings
● [short] Gillani and Levy, Simple dynamic word embeddings for perceptions in the public sphere.
● Garg et al 2018, Word embeddings quantify 100 years of gender and ethnic stereotypes [includes appendix]
● [short] Lucy et al, Content analysis of textbooks via natural language processing
● Thompson et al, Cultural influences on word meanings revealed through large-scale semantic alignment [includes appendix]
● Rheault and Cochrane, Word embeddings for the analysis of ideological placement in parliamentary corpora.
● Nyarko and Sanga, A statistical test for legal interpretation.
Syntactic Parsing:
● Hoyle et al, Unsupervised discovery of gendered language through latent-variable modeling.
● [short] Ash, Jacobs, MacLeod, Naidu, and Stammbach, Unsupervised extraction of workplace rights and duties from collective bargaining agreements.
● Vannoni, Ash, and Morelli, Measuring Discretion and Delegation in Legislative Texts
● Michael Webb, The impact of artificial intelligence on the labor market
Semantic Role Labeling:
● Ash, Gauthier, and Widmer, Mining narratives from large text corpora
● Fetzer, Can workfare programs moderate conflict? Evidence from India
Information Extraction:
● [short] Surdeanu et al 2011, Customizing an information extraction system for a new domain.
● [short] Jurafsky and Chambers, Unsupervised learning of narrative schemas and their participants.
● [short] Clark, Ji, and Smith, Neural text generation in stories using entity representations as context.
● [short] Bamman and Smith, Open extraction of fine-grained political statements.
● [short] Wyner and Peters, On rule extraction from regulations.
● [short] Xia and Ding, Emotion-Cause Pair Extraction
Document Embeddings:
● [short] Demszky et al, 2019, Analyzing Polarization in Social Media: Method and Application to Tweets on 21 Mass Shootings
● [short] Dai, Olah, and Le, Document Embedding with Paragraph Vectors.
● Ash and Chen 2018, Mapping the geometry of law using document embeddings.
● Galletta, Ash, and Chen, “Causal Effects of Judicial Sentiment: Methods and Application to U.S. Circuit Courts”
● Ash, Chen, and Naidu (2020), “Ideas have consequences: The effect of law and economics on American justice.”
● [short] Tong et al, Low-skilled jobs face the highest re-skilling pressure.
Transformer Classification:
● Bingler et al, Cheap Talk and Cherry-Picking: What ClimateBert has to say on Corporate Climate Risk Disclosures
● Pei and Jurgens, Quantifying intimacy in language
Language Models:
● [short] Peric, Mijic, Stammbach, and Ash, Legal language modeling with transformers
● Kreps, McCain, and Brundage, All the news that’s fit to fabricate.
● [short] Peng et al, Fine-tuning a transformer-based language model to avoid generating non-normative text
● [short] Nadeem, Bethke, and Reddy, StereoSet: Measuring stereotypical bias in pre-trained language models
Local Semantics
● [short] Ross et al, Explaining NLP models via minimal contrastive editing
● Prabhakaran et al, How metaphors impact political discourse.
Global Semantics
● [short] Stammbach and Ash, e-FEVER: Explanations and summaries for automated fact checking
● Chen et al, Opinion aware knowledge graph for political ideology detection
● Vold and Conrad, Using transformers to improve answer retrieval for legal questions.
Causal Inference with Text:
● Margaret Roberts, Brandon Stewart, and Richard Nielsen, “Matching Methods for High-Dimensional Data with Applications to Text”
● [short] Veitch et al, Using text embeddings for causal inference
● All the papers in Table 1 here.
● Zeng et al, Uncovering interpretable potential confounders in electronic medical records.
Argument Mining
● [short] Subramanian et al, Target Based Speech Act Classification in Political Campaign Text
Quote Extraction
● Newell et al, Quote extraction and analysis for news