Guide to Resetting TCAT Crawlers
TCAT collects Twitter data using PHP scripts. The crawler component uses a controller.php script to restart data collection.
Data collection sometimes stops. Follow the steps below to investigate the issue and then reset the crawler, if needed.
Instructions for setting up an automatic reset
Linux servers have a process called cron, which runs a script at a set time interval (e.g., every hour).
A bash script for restarting the crawler is shown below:
#!/bin/bash
killall php
timeout 20s php /var/www/dmi-tcat/capture/stream/dmitcat_track.php
php /var/www/dmi-tcat/capture/stream/controller.php
Save this file on the server with the extension .sh (e.g., restart_tcat.sh) and make it executable (chmod +x restart_tcat.sh) so that cron can run it. The location does not matter, but make a note of it.
Add the script to the crontab (the list of scheduled scripts) by using the following command:
sudo crontab -e
and add the following line to the crontab file (with the correct file location):
*/12 * * * * /home/jtripp/restart_tcat.sh
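The script will now run every 12 minutes (at minutes 0, 12, 24, 36, and 48 of each hour). To run it once an hour at 12 minutes past the hour instead, replace */12 with 12 in the first field.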
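To see which minutes a step value such as */12 in the crontab minute field selects, the following minimal sketch lists them. The function name cron_step_minutes is hypothetical and is not part of cron or TCAT; it only imitates how a step is counted from minute 0 up to minute 59.

```shell
#!/bin/bash
# Hypothetical helper (not part of cron or TCAT): list the minutes of the
# hour selected by a "*/STEP" value in the minute field of a crontab entry.
cron_step_minutes() {
  step="$1"
  minute=0
  out=""
  while [ "$minute" -le 59 ]; do
    # Append the minute, adding a space only when the list is non-empty.
    out="${out:+$out }$minute"
    minute=$((minute + step))
  done
  printf '%s\n' "$out"
}

cron_step_minutes 12   # prints: 0 12 24 36 48
```

A plain number in the minute field (for example 12) selects only that one minute of each hour.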
Instructions for manually restarting the server
Log into the capture section of the web interface
If the crawler has hit the rate limit (too many tweets downloaded within a time period) then either:
(a) leave the TCAT interface and data collection will continue after a timeout period, or
(b) alter the search queries to exclude terms which may collect too many tweets (thus hitting the rate limit).
If data collection has stopped and there is no rate limit error, then the crawler may have hit an authentication error and you will need to restart the crawler. Follow the steps below to check the error log and reset the crawler.
SSH into the machine
Open a terminal and type:
ssh james@serveraddress
Check the server log
The following command uses the default file path and shows the last 50 lines of the tracking error log.
tail -n 50 /var/www/dmi-tcat/logs/track.error.log
If an authentication or similar error is to blame then it will be shown in the log.
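If the log is noisy, it can help to filter the tail for lines that look like errors. The sketch below is one way to do that. The function name recent_log_errors and the default search pattern are assumptions, since the exact wording of TCAT's log messages is not given here; adjust the pattern to match what your log actually contains.

```shell
#!/bin/bash
# Hypothetical helper: print lines from the end of a log file that match
# an error-like pattern. The default pattern is a guess, not TCAT's
# documented message format.
recent_log_errors() {
  local log_file="$1"
  local line_count="${2:-50}"
  local pattern="${3:-auth|unauthorized|401|error}"
  # Case-insensitive extended-regex search over the last N lines.
  tail -n "$line_count" "$log_file" | grep -iE "$pattern"
}

# Example call with the default log path used in this guide:
# recent_log_errors /var/www/dmi-tcat/logs/track.error.log 50
```

The function returns a non-zero exit status when nothing matches, so it can also be used in an if statement.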
Stop all php processes
kill $(ps aux | grep '[p]hp' | awk '{print $2}')
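The pattern '[p]hp' in the command above matches the text php but not the grep process itself, whose own command line contains the literal text [p]hp. The short demonstration below runs the same filter on made-up sample output (the users, PIDs, and columns are illustrative only), so nothing is actually killed.

```shell
#!/bin/bash
# Made-up sample in the style of `ps aux` output; column 2 is the PID.
sample_ps='root  101  0.0  0.1  php /var/www/dmi-tcat/capture/stream/controller.php
james 202  0.0  0.0  grep [p]hp'

# Same filter as the kill command: keep lines containing "php",
# then print the second column (the PID).
matching_pids=$(printf '%s\n' "$sample_ps" | grep '[p]hp' | awk '{print $2}')
echo "$matching_pids"   # prints: 101
```

Only the PHP process is selected; the grep line is skipped because it does not contain the three characters php in a row.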
Reconnect to the Twitter server
Run the following command, then go to the web interface and confirm that data is now being collected.
sudo php /var/www/dmi-tcat/capture/stream/dmitcat_track.php
Restart controller.php
If data is being collected, restart controller.php using the following command.
sudo php /var/www/dmi-tcat/capture/stream/controller.php
Accessing the Containers
docker exec -it trusting_fermat /bin/bash
Notes
If the crawler is still not collecting data, the problem may be one of the following:
- The credentials are not working
- There is a network problem
- TCAT is out of date. There may be an API change.
- The MySQL tables are corrupted
For any of the above, contact James.tripp@warwick.ac.uk.