
Proposal Form - Noortje Corpus Classification

Proposal form

Last updated: 18th July 2018

Date Update
18th July 2018 No response from Noortje
19th July 2018 Noortje responded and proposal edited

Project name

Lexical Tool (Corpus classification)

Date started

4 July 2018


Create a tool with a web-based interface. The tool should identify correlations between two types of categories within a textual corpus. The category types and the categories are defined by the user, who provides indicator terms and phrases for each category ("the lexicon") according to a template. Indicator terms and phrases are derived from the corpus, or from a random sample of it.

A template for listing categories and queries is made available via the web interface (including instructions for use, e.g. on the use of quotation marks, phrases, root words). Output should be readable by the user, and should allow both iterative improvement of the lexicon and network-based visualisation. Code for the tool should be exportable to different contexts.
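As an illustration of how queries with quotation marks, phrases and root words might be interpreted, here is a minimal sketch. The query syntax is an assumption and is not defined in this proposal: a term in double quotes is treated as an exact phrase, a trailing `*` marks a root word, and anything else is matched as a whole word. The function names `compile_query` and `count_matches` are hypothetical.

```python
import re


def compile_query(query):
    """Turn one lexicon query into a compiled, case-insensitive pattern.

    Assumed syntax (not taken from the proposal):
      "exact phrase"  -> the phrase, bounded by word boundaries
      root*           -> any word that starts with 'root'
      word            -> the whole word only
    """
    query = query.strip()
    if len(query) >= 2 and query.startswith('"') and query.endswith('"'):
        pattern = r"\b" + re.escape(query[1:-1]) + r"\b"
    elif query.endswith("*"):
        pattern = r"\b" + re.escape(query[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(query) + r"\b"
    return re.compile(pattern, re.IGNORECASE)


def count_matches(query, text):
    """Count non-overlapping matches of a query in a piece of text."""
    return len(compile_query(query).findall(text))
```

For example, under these assumptions `count_matches("sustain*", text)` would count both "sustainable" and "sustainability" but not "unsustainable".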


Category type


Category Query

  • Write Python functions to process input and generate output
  • Flask front end written to execute the analysis via a web interface
  • Source available via
  • Hosted on an internal server temporarily
  • ITS hosting for Flask if possible

Timeline

  1. Functions to process and summarise data written in one month
  2. Web interface written within two months




Python - Flask, base libraries


CSV file - input data. Headers: a, b

CSV file - criteria description. Headers: Category-type, category, query-term, search-column
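A minimal sketch of how the criteria file could be read, assuming it is a comma-separated file whose header row uses exactly the four names listed above. The function name `load_lexicon` and the sample rows are hypothetical and only serve as illustration.

```python
import csv
import io
from collections import defaultdict


def load_lexicon(handle):
    """Group (query-term, search-column) pairs by (category type, category)."""
    lexicon = defaultdict(list)
    for row in csv.DictReader(handle):
        key = (row["Category-type"].strip(), row["category"].strip())
        lexicon[key].append(
            (row["query-term"].strip(), row["search-column"].strip())
        )
    return dict(lexicon)


# Made-up sample content; 'a' and 'b' are the input data columns.
SAMPLE = """Category-type,category,query-term,search-column
topic,energy,solar,a
topic,energy,wind power,a
actor,government,ministry,b
"""

lexicon = load_lexicon(io.StringIO(SAMPLE))
```

Grouping by category type and category keeps all indicator terms for one category together, which may make iterative editing of the lexicon easier to follow.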


CSV file - co-occurrence table. Headers: Category-type, category, Category-type, category, co-occurrence-frequency
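One way the co-occurrence frequencies could be computed is sketched below. It assumes that the unit of co-occurrence is a single row of the input data, and that a category is present in a row when any of its query terms appears as a case-insensitive substring of the named search column. Both are assumptions, and the names `categories_in_row` and `co_occurrence` are hypothetical.

```python
from collections import Counter
from itertools import product


def categories_in_row(row, lexicon):
    """Return the set of (category_type, category) pairs matched in one row."""
    found = set()
    for category_type, category, term, column in lexicon:
        if term.lower() in row.get(column, "").lower():
            found.add((category_type, category))
    return found


def co_occurrence(rows, lexicon, type_1, type_2):
    """Count rows in which a type_1 category and a type_2 category both occur."""
    counts = Counter()
    for row in rows:
        found = categories_in_row(row, lexicon)
        first = sorted(c for t, c in found if t == type_1)
        second = sorted(c for t, c in found if t == type_2)
        for pair in product(first, second):
            counts[pair] += 1
    return counts
```

Each key of the returned counter is a (category, category) pair, one from each category type, and each value is the number of rows in which both were found.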

CSV file - word frequency table. Headers: Category-type, category, query-term, frequency
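The word frequency table could be produced along the following lines. This is a sketch under the assumption that frequency means the number of case-insensitive substring occurrences of each query term in its search column, summed over all rows. The function names are hypothetical.

```python
import csv


def term_frequencies(rows, lexicon):
    """Return (category_type, category, term, frequency) for each lexicon entry."""
    table = []
    for category_type, category, term, column in lexicon:
        frequency = sum(
            row.get(column, "").lower().count(term.lower()) for row in rows
        )
        table.append((category_type, category, term, frequency))
    return table


def write_frequency_csv(table, handle):
    """Write the frequency table with the headers named in this proposal."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Category-type", "category", "query-term", "frequency"])
    writer.writerows(table)
```

Because the output keeps one line per query term, terms that never match show up with a frequency of 0, which may help when refining the lexicon.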

GDF file - Category co-occurrence. Nodes: Categories by type. Edges: co-occurrence of categories
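A minimal sketch of the GDF export, assuming the usual GDF layout of a `nodedef>` section followed by an `edgedef>` section. The input is assumed to be a mapping from (category, category) pairs to frequencies, and category names are assumed not to contain commas. The function name `to_gdf` and the extra `category_type` node column are hypothetical choices.

```python
def to_gdf(counts, type_1, type_2):
    """Render co-occurrence counts as GDF text.

    Node names are prefixed with their category type so that a category
    name used under both types does not collapse into one node.
    """
    first = sorted({a for a, _ in counts})
    second = sorted({b for _, b in counts})

    lines = ["nodedef>name VARCHAR,label VARCHAR,category_type VARCHAR"]
    for category in first:
        lines.append(f"{type_1}:{category},{category},{type_1}")
    for category in second:
        lines.append(f"{type_2}:{category},{category},{type_2}")

    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE")
    for (a, b), weight in sorted(counts.items()):
        lines.append(f"{type_1}:{a},{type_2}:{b},{weight}")

    return "\n".join(lines) + "\n"
```

Carrying the co-occurrence frequency as the edge weight lets a network visualisation tool scale edges by how often two categories appear together.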


Raw word frequency, to show the absolute number of words per category (visualisation). Perhaps also the text (occurrences).