

[3rd Year Projects from Previous Years] [4th Year Project Ideas]

[Java Data Mining] [Data Mining Applications] [Case Based Reasoning]

Notes to students

  • All projects will have a strong programming element
  • There will be a strong Java bias to most projects
  • There are some specific projects up for grabs, but you are welcome to propose your own project based on the themes below. In fact, that is preferable. The list of previous projects should help you do that.

    Java Data Mining

    Java Data Mining is a standard API being developed as part of the Java Community Process. It aims to achieve for data mining what JDBC has done for databases: a standard API for accessing data mining servers.

    In the department, we have access to a partial implementation of a server conforming to the JDM standard. Projects may aim to:

    • Implement missing functionality
    • Improve the implementation of current functionality
    • Extend the standard

    Specific Projects up for grabs:

    1. Implementation of a web service for data mining
      Outline: As part of the JDM standard, an XML based web service interface for data mining has been defined. This project will aim to build a prototype system adhering to this standard.
    2. Implementation of a GUI to support the JDM functionality
      Outline: The data mining server available within the department provides a partial implementation of the JDM API. The aim of this project is to provide a graphical user interface to the functionality offered by the server, supporting those phases of the CRISP-DM methodology for data mining that the server can handle.
    3. Visualisation tools for knowledge discovered by data mining algorithms.
      All algorithms within JDM produce knowledge that can be exported to an XML based knowledge representation standard called the Predictive Modelling Mark-up Language (PMML). This project will aim to implement a tool that would take PMML as input and produce a visual representation of the knowledge.
    4. Develop a real-time scoring module for JDM algorithms
      Outline: Once a data mining algorithm has generated knowledge and stored it within the database, that knowledge must be used to make predictions. For example, consider the call centre of a telecom service provider that has built a model to predict how likely a customer is to switch their telecom services to a competitor. When a customer rings the call centre, this model can be used to “score” the customer and obtain a prediction of the likelihood of this customer switching telecom providers. If the likelihood is significant, the telecom provider may want to offer the customer an incentive not to switch. This project will aim to develop a scalable architecture for performing such scoring.
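      The scoring step described above can be sketched as follows. This is a toy illustration only: a real module would load the model from the JDM server's repository rather than hard-coding it, and the feature names, weights and threshold below are all invented for illustration.

```java
import java.util.Map;

/** Toy real-time scoring sketch: a hard-coded logistic model stands in for
 *  one retrieved from a data mining server. All names/weights are hypothetical. */
public class ChurnScorer {

    /** Scores a customer record; returns an estimated churn probability. */
    public static double score(Map<String, Double> customer) {
        // Invented model: more complaints raise risk, tenure lowers it.
        double z = -1.0
                 + 0.8 * customer.getOrDefault("complaints", 0.0)
                 - 0.05 * customer.getOrDefault("monthsAsCustomer", 0.0);
        return 1.0 / (1.0 + Math.exp(-z));   // logistic function
    }

    /** True if the customer should be offered a retention incentive. */
    public static boolean shouldOfferIncentive(Map<String, Double> customer) {
        return score(customer) > 0.5;        // threshold is an assumption
    }
}
```

      A scalable architecture would wrap such a scorer behind a service interface so the call-centre application can score each caller with a single low-latency request.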


    Data Mining Applications

    I have an interest in the following applications of data mining

    • Web Personalisation
    • Online Crime Prevention
    • Mining Text Corpora

    Specific Projects up for grabs:

    1. Search Engine Personalisation
      Search engines like Google, Yahoo and MSN provide facilities for users to search the web for documents of interest. While these engines provide valuable assistance in locating such documents, more often than not the result set of a search is too large for a human to consume, and a number of the documents returned are not relevant to the user. This is because, while these search engines use very intelligent techniques to rank documents, the ranking algorithms to date take into account only the content of a document and the hyperlinks between it and other documents on the Internet; they do not take individual preferences into account. This project aims to develop techniques for implicitly extracting user interests by observing the user's behaviour, and for utilising this information to reorder search results based on the user's induced preferences.
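      One simple way to exploit an induced profile is to re-rank results by their overlap with the user's profile terms. A minimal sketch (the profile terms and result snippets in the usage example are invented):

```java
import java.util.*;

/** Sketch: re-rank search results by overlap with an implicit user profile. */
public class PersonalizedRanker {

    /** Counts distinct profile terms occurring in a result snippet. */
    static long overlap(String snippet, Set<String> profileTerms) {
        return Arrays.stream(snippet.toLowerCase().split("\\W+"))
                     .filter(profileTerms::contains)
                     .distinct()
                     .count();
    }

    /** Stable sort: results sharing more terms with the profile come first,
     *  so the engine's original ranking is preserved among ties. */
    public static List<String> rerank(List<String> results, Set<String> profileTerms) {
        List<String> reranked = new ArrayList<>(results);
        reranked.sort(Comparator.comparingLong(
                (String s) -> overlap(s, profileTerms)).reversed());
        return reranked;
    }
}
```

      A real system would of course build the profile automatically from observed behaviour (clicks, dwell time) rather than take it as a given set of terms.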

    2. Mining Keystroke Dynamics for improved Online security
      Outline: The most common form of authentication online is the use of passwords. As more safety-critical services move online and the limitations of passwords alone as an authentication mechanism become obvious through scams such as “phishing”, an increasing amount of research effort is being channelled towards the discovery of cheap biometric authentication mechanisms. The idea behind keystroke dynamics is that the typing behaviour of individuals, sometimes referred to as their “Habitual Rhythm Patterns”, is unique, and that if the correct features can be extracted from data collected on the user's keyboard activity, it can provide a cheap biometric, possibly as unique as a fingerprint. This project aims to investigate this hypothesis.
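      A crude first version of the hypothesis can be tested by comparing inter-key latencies against an enrolled profile; the distance measure and threshold below are arbitrary assumptions, and a real system would use richer features (dwell times, digraph latencies) and a learned classifier.

```java
/** Sketch: authenticate a typing sample against a stored latency profile. */
public class KeystrokeAuthenticator {

    /** Latencies (ms) between successive key-press timestamps. */
    public static double[] latencies(long[] pressTimes) {
        double[] d = new double[pressTimes.length - 1];
        for (int i = 1; i < pressTimes.length; i++)
            d[i - 1] = pressTimes[i] - pressTimes[i - 1];
        return d;
    }

    /** Euclidean distance between two latency vectors of equal length. */
    public static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    /** Accept the sample if it is close enough to the enrolled profile. */
    public static boolean authenticate(double[] profile, double[] sample,
                                       double threshold) {
        return distance(profile, sample) <= threshold;
    }
}
```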

    3. Machine Learning for generating a Voice Signature as a Biometric
      Outline: The project aims to investigate the use of the human voice as a method for securing access to online information sources. You will need to investigate techniques used to extract features from audio signals that distinguish a user from other users of the system, and design and develop a system that authenticates the user securely, across the Internet, using his/her voice.

    4. Investigating Keyboard Acoustic Attacks
      Outline: Recent studies have shown that the acoustic emanations from keyboards carry substantial information about the text being typed. This of course has serious consequences for snooping within open-plan offices, where users of these techniques may be able to learn passwords to secure parts of networks by recording the acoustics emanating from the keyboards of valid users. This project aims to build a system for capturing keyboard acoustics and reproducing the text being typed.

    5. Using Machine Learning to filter Spam
      Outline: This project aims to build a system based on machine learning that can accurately identify spam.
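      A standard starting point for such a system is a naive Bayes classifier with Laplace smoothing over word counts; the tiny training corpus in the usage example is invented for illustration.

```java
import java.util.*;

/** Sketch of a naive Bayes spam filter with Laplace (add-one) smoothing. */
public class SpamFilter {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();
    private int spamDocs = 0, hamDocs = 0, spamWords = 0, hamWords = 0;

    /** Adds one labelled message to the training counts. */
    public void train(String message, boolean spam) {
        if (spam) spamDocs++; else hamDocs++;
        for (String w : message.toLowerCase().split("\\W+")) {
            vocab.add(w);
            if (spam) { spamCounts.merge(w, 1, Integer::sum); spamWords++; }
            else      { hamCounts.merge(w, 1, Integer::sum);  hamWords++;  }
        }
    }

    /** True if the message is more likely spam than ham under the model. */
    public boolean isSpam(String message) {
        double logSpam = Math.log((double) spamDocs / (spamDocs + hamDocs));
        double logHam  = Math.log((double) hamDocs  / (spamDocs + hamDocs));
        int v = vocab.size();   // vocabulary size for smoothing
        for (String w : message.toLowerCase().split("\\W+")) {
            logSpam += Math.log((spamCounts.getOrDefault(w, 0) + 1.0) / (spamWords + v));
            logHam  += Math.log((hamCounts.getOrDefault(w, 0)  + 1.0) / (hamWords + v));
        }
        return logSpam > logHam;
    }
}
```

      Accurate filtering would require a large labelled corpus and careful feature engineering (headers, HTML structure), but the probabilistic core is this simple.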

    6. Detecting Concept Drift in Web Navigation
      Outline: When navigating the web, we often drift from our initial context based on content that we discover during the navigation itself. A system that attempts to personalise such navigation needs to identify a shift in context and adjust its recommendations to the current context. This project aims to investigate how this can be achieved.
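      One simple drift signal is a drop in similarity between the term distribution of recently visited pages and that of the earlier session. A sketch using cosine similarity over bag-of-words vectors (the drift threshold is an assumption):

```java
import java.util.*;

/** Sketch: flag a context shift when recent pages stop resembling the session. */
public class DriftDetector {

    /** Term counts aggregated over a list of page texts. */
    static Map<String, Integer> bagOfWords(List<String> pages) {
        Map<String, Integer> counts = new HashMap<>();
        for (String page : pages)
            for (String w : page.toLowerCase().split("\\W+"))
                counts.merge(w, 1, Integer::sum);
        return counts;
    }

    /** Cosine similarity between two sparse count vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            na += e.getValue() * (double) e.getValue();
            dot += e.getValue() * (double) b.getOrDefault(e.getKey(), 0);
        }
        for (int v : b.values()) nb += v * (double) v;
        return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Reports drift when the recent window no longer resembles the history. */
    public static boolean hasDrifted(List<String> history, List<String> recent,
                                     double threshold) {
        return cosine(bagOfWords(history), bagOfWords(recent)) < threshold;
    }
}
```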

    7. Mining multiple content web sites for collating views
      Outline: Multiple web sites provide information on the same event. For example, multiple movie review sites may give different ratings to the same movie, and different news web sites will give their own interpretations of events taking place around the world. The project will explore ways of collating information from different sources and presenting a précis of the information to the user.

    8. Learning structure from text
      Outline: Given that most of the web today is semi-structured, how can this information be used to populate the Semantic Web? The project will investigate the techniques that have been developed for parsing text and learning structure from it.

    9. Music Recommendation
      Outline: As more music becomes available for download onto portable media players, tools are required for managing music repositories and helping users find music that they will enjoy. The audio signal, ratings by other users, content descriptors of the track and its lyrics all play a role in defining what we as individuals enjoy about a particular track. The project aims to use these data sources to generate useful music recommendations.
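      A content-based baseline is to recommend the catalogue track whose feature vector is most similar to a seed track the user enjoys. A sketch (the feature dimensions and values in the usage example are invented):

```java
/** Sketch: content-based recommendation via cosine similarity over
 *  per-track feature vectors (e.g. tempo, energy, acousticness). */
public class MusicRecommender {

    /** Cosine similarity between two equal-length feature vectors. */
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Returns the index of the catalogue track most similar to the seed. */
    public static int recommend(double[] seed, double[][] catalogue) {
        int best = 0;
        double bestSim = -1;
        for (int i = 0; i < catalogue.length; i++) {
            double sim = cosine(seed, catalogue[i]);
            if (sim > bestSim) { bestSim = sim; best = i; }
        }
        return best;
    }
}
```

      Combining this with collaborative signals (ratings by other users) and lyrics is where the interesting research lies.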

    10. Stock Market Prediction
      Outline: The use of machine learning techniques for predicting the stock market from past stock prices has been investigated for a number of years. This project aims to discover shifts in the behaviour of stocks and to harvest information from the Internet, for example from news articles, that may have caused the shift.

    11. Using Text Mining for Authorship authentication
      Outline: Given a set of documents written by a limited set of authors, this project will aim to automatically extract patterns from the documents that will enable authorship of unlabelled documents to be identified. The applications for such a system are wide ranging from analysis of fraudulent claims regarding historical documents to tackling plagiarism.
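      A classical stylometric baseline for this task compares relative frequencies of common function words, which authors use habitually and largely unconsciously. A minimal sketch (the word list and texts are illustrative; a real system would use a much larger feature set):

```java
import java.util.*;

/** Sketch: authorship attribution via function-word frequency profiles. */
public class AuthorshipIdentifier {

    // A tiny illustrative list; real stylometry uses hundreds of features.
    static final List<String> FUNCTION_WORDS = List.of("the", "of", "and", "to", "in");

    /** Relative frequency of each function word in the text. */
    public static double[] profile(String text) {
        String[] words = text.toLowerCase().split("\\W+");
        double[] p = new double[FUNCTION_WORDS.size()];
        for (String w : words) {
            int i = FUNCTION_WORDS.indexOf(w);
            if (i >= 0) p[i] += 1.0 / words.length;
        }
        return p;
    }

    /** Index of the candidate profile closest (squared Euclidean) to the document's. */
    public static int closestAuthor(String document, double[][] candidates) {
        double[] p = profile(document);
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < candidates.length; i++) {
            double d = 0;
            for (int j = 0; j < p.length; j++)
                d += (p[j] - candidates[i][j]) * (p[j] - candidates[i][j]);
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
    }
}
```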

    12. Incorporating Survival Analysis techniques into Lazy Learning
      Outline: Survival Analysis is the phrase used to describe the analysis of data that correspond to the time from a well-defined time-origin until the occurrence of some particular event or end-point [1]. The event may be the death of a patient suffering from an illness or a customer switching telecom services from one provider to another. A number of statistical techniques for analysing survival data have been proposed [1]; however, little research has been carried out into how such data can be used when applying machine learning in general, and lazy learning techniques more specifically. This project will extend work carried out by me during my PhD.
      [1] D. Collett. Modelling Survival Data in Medical Research, Chapman and Hall, 1994.
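      One of the standard techniques described in [1] is the Kaplan-Meier product-limit estimate of the survivor function, which handles censored observations and could serve as a baseline before integrating such data into a lazy learner. A sketch (the observations in the usage example are invented):

```java
import java.util.*;

/** Sketch: Kaplan-Meier product-limit estimator. Each observation is a
 *  (time, eventOccurred) pair; eventOccurred == false means censored. */
public class KaplanMeier {

    /** Returns the survival probability S(t) at each distinct event time. */
    public static LinkedHashMap<Double, Double> estimate(double[] times, boolean[] event) {
        // Process observations in order of increasing time.
        Integer[] idx = new Integer[times.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> times[i]));

        LinkedHashMap<Double, Double> s = new LinkedHashMap<>();
        double surv = 1.0;
        int atRisk = times.length;
        int i = 0;
        while (i < idx.length) {
            double t = times[idx[i]];
            int deaths = 0, leaving = 0;
            while (i < idx.length && times[idx[i]] == t) {
                if (event[idx[i]]) deaths++;
                leaving++;          // both events and censorings leave the risk set
                i++;
            }
            if (deaths > 0) {
                surv *= 1.0 - (double) deaths / atRisk;  // product-limit step
                s.put(t, surv);
            }
            atRisk -= leaving;
        }
        return s;
    }
}
```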

    Case-based Reasoning

    My interests lie in

    • Scaling CBR technology to Very Large Case bases

    When the size of the case-base becomes very large, using it can prove too onerous, especially in applications where a real-time response is required. Solutions to this problem proposed to date include:

    • indexing the case base
    • “learning to forget” cases that are not contributing to the utility of the case-base

    • Using Data Mining to generate the knowledge required to build a Case-based Reasoning System

      Case-based reasoning systems [3] were proposed as a solution to the knowledge acquisition bottleneck faced by rule-based systems. However, for all but the simplest applications, building a case-based reasoning system is not free of the need for knowledge acquisition. Indeed, CBR systems require a number of different knowledge containers, such as the case-base, indexing/similarity knowledge, adaptation knowledge and maintenance knowledge [1,2]. Whether data mining can be used to discover the knowledge required by these knowledge containers is an open research question.
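      As a concrete illustration of two of these containers, the sketch below shows a case-base with weighted nearest-neighbour retrieval, where the feature weights stand in for the similarity knowledge; all names and values in the usage example are illustrative.

```java
import java.util.*;

/** Sketch: a case-base with weighted nearest-neighbour retrieval. */
public class CaseBase {

    /** A stored case: a feature vector plus its recorded solution. */
    public static class Case {
        final double[] features;
        final String solution;
        public Case(double[] features, String solution) {
            this.features = features;
            this.solution = solution;
        }
    }

    private final List<Case> cases = new ArrayList<>();

    /** The "retain" step of the CBR cycle: store a solved case. */
    public void retain(Case c) { cases.add(c); }

    /** Weighted similarity: higher means more similar. */
    static double similarity(double[] a, double[] b, double[] weights) {
        double dist = 0;
        for (int i = 0; i < a.length; i++)
            dist += weights[i] * Math.abs(a[i] - b[i]);
        return 1.0 / (1.0 + dist);
    }

    /** The "retrieve" step: return the solution of the most similar case. */
    public String retrieve(double[] query, double[] weights) {
        Case best = null;
        double bestSim = -1;
        for (Case c : cases) {
            double sim = similarity(query, c.features, weights);
            if (sim > bestSim) { bestSim = sim; best = c; }
        }
        return best == null ? null : best.solution;
    }
}
```

      A linear scan like this is exactly what becomes too onerous for very large case-bases, motivating the indexing and forgetting strategies listed above.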

      [1] D. B. Leake, B. Smyth, D. C. Wilson, Q. Yang. Maintaining Case Based Reasoning Systems, Computational Intelligence, Special Issue on Maintaining Case-based Reasoning Systems, 2001.

      [2] M. M. Richter. Introduction. In Case-based Technology: From Foundations to Applications, LNAI 1400, Springer, 1998.

      [3] A. Aamodt and E. Plaza. Case-based Reasoning: Foundational Issues, Methodological Variations and System Approaches, AI Communications 7(1): pp. 39-59, 1994.

      Previous Projects


      • Improving Classifier Accuracy for Keystroke Dynamics
      • Survey of Machine Learning Approaches to Stock Market Prediction
      • A Social Network to Discover relationships Between Scholarly Articles Published by Different Academics
      • Learning Unrestricted Hidden Markov Models from Multiple Observation Sequences
      • A web-based semi-autonomous ontology miner for the Semantic Web
      • Investigating the Application of Artificial Intelligence in Web-Search Personalisation
      • Generating Recommendations for Music using Content-based Similarity
      • Implementation of a Web-Service for Data Mining
      • H2H Texas Hold 'Em Poker
      • Automated Web-Site Evaluator

      • Identity Authentication using Typing Biometrics
      • A Personalized Search Engine and Recommendation tool for Japanese Animation
      • A Hybrid Recommendation Algorithm Based on Clustering of User Ratings and Content Descriptors
      • Search Engine Personalization
      • Development of a Legal Knowledge Base
      • Text Mining for Authorship Identification
      • Harnessing the Power of Concurrent Version Control System
      • Graphical Application for Creating Custom Web-based Management Tools

      • Data Mining using the Warwick Air-accident Database
      • Evaluation of Approaches to Handling Large Exemplar Sets in Lazy Learning
      • A Study of the Data Mining Approaches to Network Security

    4th Year Projects

          1. Implementation of a JDM compliant Data Mining Server
            Java Data Mining is a standard API being developed as part of the Java Community Process. It aims to achieve for data mining what JDBC has done for databases, i.e. provide a standard API for accessing data mining servers. JDM provides facilities for storing and retrieving the meta-data associated with data mining activities. However, as it is an interface rather than a set of classes, its success depends on implementations.

            The aim of the project is to build an open-source JDM-compliant server that provides all the functionality of JDM, including an implementation of the Web Services defined in JDM. You will also be required to build some applications that use various parts of the JDM server's functionality. Note that you will not be expected to implement any of the data mining algorithms yourself, but to develop wrappers for WEKA's data mining algorithms instead.

            Useful info:
            WEKA Web Site
            Knowledge Discovery Standards

          2. Implementation of a Personalized Search Engine
            Most search engines on individual web sites provide poor performance to the user. This project aims to develop a search engine for a single web site that provides personalized access to the information on that site. The project will involve the creation of a web search engine from scratch and then layering different approaches to personalization on top of the basic engine. The scope of the project has been limited by aiming to personalize a single web site; however, an alternative would be to build a topic-specific search engine.
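            The core of such a from-scratch engine is an inverted index mapping terms to the pages that contain them; the minimal sketch below shows the basic retrieval layer on which personalization could later be built (the pages in the usage example are invented).

```java
import java.util.*;

/** Sketch: a minimal inverted index for a single-site search engine. */
public class SiteSearchEngine {

    private final Map<String, Set<Integer>> index = new HashMap<>();
    private final List<String> pages = new ArrayList<>();

    /** Adds a page and indexes its terms; returns the new page id. */
    public int addPage(String text) {
        int id = pages.size();
        pages.add(text);
        for (String w : text.toLowerCase().split("\\W+"))
            index.computeIfAbsent(w, k -> new TreeSet<>()).add(id);
        return id;
    }

    /** Returns ids of pages containing every query term (AND semantics). */
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String w : query.toLowerCase().split("\\W+")) {
            Set<Integer> postings = index.getOrDefault(w, Collections.emptySet());
            if (result == null) result = new TreeSet<>(postings);
            else result.retainAll(postings);   // intersect posting lists
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```

            A personalization layer would then re-order the matching page ids using a model of the individual user's interests, rather than returning them in a fixed order.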