Guide to Resetting TCAT Crawlers

TCAT collects Twitter data using PHP scripts. The crawler component uses a controller.php script to restart data collection.

Data collection sometimes stops. Follow the steps below to investigate the issue and, if needed, reset the crawler.


Instructions for setting up an automatic reset

Linux servers have a process called cron which will run a script at a specific time interval (e.g., every hour).

A bash script for restarting the crawler is shown below:


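	#!/bin/bash
	# Stop every running PHP process, including the TCAT capture scripts
	killall php
	# Reconnect to Twitter; allow the tracking script at most 20 seconds
	timeout 20s php /var/www/dmi-tcat/capture/stream/dmitcat_track.php
	# Run the controller, which restarts data collection
	php /var/www/dmi-tcat/capture/stream/controller.php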
	

Save this file on the server with the extension .sh (e.g., restart_tcat.sh). The location does not matter, but note the full path because the crontab entry needs it. The file must also be executable (chmod +x) for cron to run it directly.

Add the script to the crontab (the list of scheduled jobs) by using the following command:


	sudo crontab -e
	

Then add the following line to the crontab file (with the correct file location):

*/12 * * * * /home/jtripp/restart_tcat.sh

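The script will now run every 12 minutes, at minutes 0, 12, 24, 36 and 48 of each hour. To run it once an hour at 12 minutes past the hour instead, change the first field from */12 to 12.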
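To make the meaning of a step value such as */12 concrete, the sketch below lists the minutes of the hour that a given step matches. The function name cron_step_minutes is made up for this illustration and is not part of cron or TCAT.

```shell
#!/bin/bash
# Illustration only: print the minutes of the hour matched by a cron
# minute field of the form */STEP. cron_step_minutes is a hypothetical
# helper name, not a cron or TCAT command.
cron_step_minutes() {
    local step="$1" minute out=""
    # Reject anything that is not a positive whole number.
    if ! [[ "$step" =~ ^[1-9][0-9]*$ ]]; then
        echo "step must be a positive integer" >&2
        return 1
    fi
    # Cron starts at minute 0 and adds the step until the hour ends.
    for ((minute = 0; minute < 60; minute += step)); do
        out+="${out:+ }$minute"
    done
    echo "$out"
}

cron_step_minutes 12
```

Running it with 12 prints 0 12 24 36 48, which is why */12 fires five times an hour rather than once.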

Instructions for manually restarting the server

Log into the capture section of the web interface

If the crawler has hit the rate limit (too many tweets downloaded within a time period) then either:

(a) leave the TCAT interface and data collection will continue after a timeout period, or

(b) alter the search queries to exclude terms which may collect too many tweets (thus hitting the rate limit).

If data collection has stopped and there is no rate limit error, the crawler may have hit an authentication error and you will need to restart it. Follow the steps below to check the error log and reset the crawler.


SSH into the machine

Open a terminal and type the following, using your own username and the address of the server:

ssh james@serveraddress

Check the server log

The following command shows the last 50 lines of the tracking error log, assuming the default file path.

tail -n 50 /var/www/dmi-tcat/logs/track.error.log


If an authentication or similar error is to blame, it will be shown in the log.
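Reading 50 lines by eye is easy to get wrong, so the check can be wrapped in a small function. This is a minimal sketch: the function name check_track_log and the search terms (auth, 401) are assumptions, and the actual wording of TCAT's error messages may differ, so adjust the pattern to what your log contains.

```shell
#!/bin/bash
# Hypothetical helper: scan the last N lines of a log file for text that
# looks like an authentication problem. The search terms are assumptions,
# not a list of real TCAT error messages.
check_track_log() {
    local log_file="${1:-/var/www/dmi-tcat/logs/track.error.log}"
    local lines="${2:-50}"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi
    # Case-insensitive match; "auth" also covers "unauthorized".
    if tail -n "$lines" "$log_file" | grep -Ei 'auth|401' > /dev/null; then
        echo "possible authentication error"
    else
        echo "no authentication error found"
    fi
}

# Example call with the default path and line count from this guide:
# check_track_log
```

A "no authentication error found" result only means the assumed terms were absent, so it is still worth reading the log lines themselves.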


Stop all PHP processes

kill $(ps aux | grep '[p]hp' | awk '{print $2}')
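The brackets in '[p]hp' are what keep grep from matching its own entry in the process list: the pattern matches the text "php", but the grep command line contains the literal text "[p]hp", which does not include "php". The sketch below shows this on made-up sample lines in ps aux layout; the function name php_pids and the sample data are illustrative only.

```shell
#!/bin/bash
# Illustration only: extract the PID column (field 2 of ps aux output)
# for lines that mention php. php_pids is a hypothetical helper name.
php_pids() {
    grep '[p]hp' | awk '{print $2}'
}

# Made-up sample in ps aux layout: one PHP process and the grep itself.
sample_ps_output='www-data 101 0.0 0.1 12345 678 ? S 10:00 0:00 php /var/www/dmi-tcat/capture/stream/controller.php
james 202 0.0 0.0 6789 123 pts/0 S+ 10:01 0:00 grep [p]hp'

# Only the PHP process is selected; the grep line is skipped.
printf '%s\n' "$sample_ps_output" | php_pids
```

With the sample above, only PID 101 is printed, so kill never tries to signal the grep process.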

Reconnect to the Twitter server

Run the following command, then go to the web interface and confirm that data is now being collected.

sudo php /var/www/dmi-tcat/capture/stream/dmitcat_track.php

Restart controller.php

If data is being collected, restart controller.php using the following command.

sudo php /var/www/dmi-tcat/capture/stream/controller.php

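Accessing the Containers

If TCAT runs inside a Docker container, the following opens a shell in it (replace trusting_fermat with the name of your container).

docker exec -it trusting_fermat /bin/bash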

Notes

If the crawler is still not collecting data, the cause may be one of the following:

  • The credentials are not working
  • There is a network problem
  • TCAT is out of date. There may be an API change.
  • The MySQL tables are corrupted

For any of the above, contact James.tripp@warwick.ac.uk.