Proposal Form - Noortje Corpus Classification
Proposal form
Last updated: 18th July 2018
Updates
Date | Update |
18th July 2018 | No response from Noortje |
19th July 2018 | Noortje responded and proposal edited |
Project name
Lexical Tool (Corpus classification)
Date started
4 July 2018
Criteria
Create a tool with a web-based interface. The tool should identify the correlation of two types of categories within a textual corpus. The category types and the categories are defined by the user, who provides indicator terms and phrases for each category ("the lexicon"), according to a template. Indicator terms and phrases are derived from the corpus, or a random sample thereof.
A template for listing categories and queries is made available via the web interface (including instructions for use, e.g. on the use of quotation marks, phrases, root words). Output should be readable by the user, and allow both iterative improvement of the lexicon and network-based visualisation. Code for the tool should be exportable to different contexts.
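A minimal sketch of how a single lexicon query could be matched against text. The query syntax is not yet fixed by this proposal, so the rules here are assumptions for illustration only: a query in double quotes is treated as an exact phrase, a trailing `*` marks a root word, and anything else is a whole-word match. The function names `compile_query` and `count_matches` are hypothetical.

```python
import re


def compile_query(query: str) -> re.Pattern:
    """Turn one lexicon query into a case-insensitive regex (assumed syntax)."""
    q = query.strip()
    if len(q) >= 2 and q.startswith('"') and q.endswith('"'):
        # Quoted query: exact phrase; whitespace between the words may vary.
        words = q[1:-1].split()
        body = r"\s+".join(re.escape(w) for w in words)
        pattern = rf"\b{body}\b"
    elif q.endswith("*"):
        # Root word: match any word that starts with the root.
        pattern = rf"\b{re.escape(q[:-1])}\w*"
    else:
        # Plain term: whole-word match only.
        pattern = rf"\b{re.escape(q)}\b"
    return re.compile(pattern, re.IGNORECASE)


def count_matches(query: str, text: str) -> int:
    """Count non-overlapping matches of a query in a piece of text."""
    return len(compile_query(query).findall(text))
```

For example, under these assumed rules `count_matches("pollut*", "Pollution pollutes; the polluter")` counts all three words, while `count_matches("air", "air, fair, Air")` ignores "fair".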
Terms
Category type
Category
Category Query
Proposal
- Write Python function to process input and generate output
- Flask front end written to execute the analysis via a web interface
- Source available via gitlab.cim.warwick.ac.uk
- Hosted on an internal server temporarily
- ITS hosting for Flask if possible
- Timeline
  - Functions to process and summarise data written within one month
  - Web interface written within two months
Type
Software
Languages
Python - Flask, base libraries
HTML, CSS + JS
Input
CSV file - input data. Headers: a, b
CSV file - criteria description. Headers: Category-type, category, query-term, search-column
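A minimal sketch of reading the criteria file into a nested lexicon, using only the base libraries. It assumes the four headers listed above appear exactly as written; the function name `load_lexicon` and the example rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def load_lexicon(handle):
    """Group query terms by category type and category.

    Returns {category_type: {category: [(query_term, search_column), ...]}}.
    """
    lexicon = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(handle):
        category_type = row["Category-type"].strip()
        category = row["category"].strip()
        lexicon[category_type][category].append(
            (row["query-term"].strip(), row["search-column"].strip())
        )
    # Convert to plain dicts so missing keys raise instead of being created.
    return {ctype: dict(cats) for ctype, cats in lexicon.items()}


# Made-up criteria rows, for illustration only.
EXAMPLE_CRITERIA = """Category-type,category,query-term,search-column
topic,energy,solar,a
topic,energy,wind farm,a
actor,government,minister,b
"""

lexicon = load_lexicon(io.StringIO(EXAMPLE_CRITERIA))
```

In the web interface the same function could be given the uploaded file object instead of an `io.StringIO`.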
Output
CSV file - co-occurrence table. Headers: Category-type, category, Category-type, category, co-occurrence-frequency
CSV file - word frequency table. Headers: Category-type, category, query-term, frequency
GDF file - Category co-occurrence. Nodes: Categories by type. Edges: co-occurrence of categories
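A minimal sketch of how the co-occurrence counts and the GDF file could be produced. It assumes the matching step has already reduced each corpus row to a set of (category type, category) pairs, and that co-occurrence means two categories being matched in the same row. The function names, the node naming scheme `type:category` and the `category_type` node column are assumptions, and category names containing commas would need quoting.

```python
from collections import Counter
from itertools import product


def cooccurrence(rows, type_a, type_b):
    """Count rows where a category of type_a and one of type_b both occur.

    `rows` is an iterable of sets of (category_type, category) pairs,
    one set per corpus row.
    """
    counts = Counter()
    for matched in rows:
        cats_a = sorted(c for t, c in matched if t == type_a)
        cats_b = sorted(c for t, c in matched if t == type_b)
        for a, b in product(cats_a, cats_b):
            counts[(a, b)] += 1
    return counts


def to_gdf(counts, type_a, type_b):
    """Render the counts as GDF text: one node per category, weighted edges."""
    nodes_a = sorted({a for a, _ in counts})
    nodes_b = sorted({b for _, b in counts})
    lines = ["nodedef>name VARCHAR,label VARCHAR,category_type VARCHAR"]
    lines += [f"{type_a}:{a},{a},{type_a}" for a in nodes_a]
    lines += [f"{type_b}:{b},{b},{type_b}" for b in nodes_b]
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE")
    lines += [
        f"{type_a}:{a},{type_b}:{b},{n}" for (a, b), n in sorted(counts.items())
    ]
    return "\n".join(lines) + "\n"


# Made-up matches for three corpus rows, for illustration only.
rows = [
    {("topic", "energy"), ("actor", "government")},
    {("topic", "energy"), ("actor", "government"), ("actor", "industry")},
    {("topic", "transport")},
]
counts = cooccurrence(rows, "topic", "actor")
gdf = to_gdf(counts, "topic", "actor")
```

The same `counts` mapping could also be written out row by row as the co-occurrence CSV table.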
Notes
Raw word frequency to show the absolute number of matched words per category (for visualisation). Perhaps also the matched text (occurrence).
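A minimal sketch of the raw per-category count, assuming it is simply the sum of the per-term frequencies from the word frequency table. The function name `category_totals` and the example figures are made up for illustration.

```python
from collections import Counter


def category_totals(term_rows):
    """Sum per-term frequencies into one raw count per (type, category)."""
    totals = Counter()
    for category_type, category, _term, frequency in term_rows:
        totals[(category_type, category)] += int(frequency)
    return totals


# Made-up word frequency rows: Category-type, category, query-term, frequency.
term_rows = [
    ("topic", "energy", "solar", 12),
    ("topic", "energy", "wind farm", 3),
    ("actor", "government", "minister", 7),
]
totals = category_totals(term_rows)
```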