
Guide to Importing Geocrawler Data to TCAT

CIM runs a Twitter geocrawler (the DML project on GitHub) on an ITS instance. The tweets can be exported from its PostGIS database and then imported into other software (e.g., TCAT).

This guide explains how to export this data from the Twitter geocrawler and then import it into TCAT.

Process overview

  • Log into the linux instance running the geocrawler
  • Export the data into a .json file
  • Transfer the .json file to the TCAT server
  • Run a PHP script to import the data into TCAT


Log into the linux instance

Open a terminal. You can log into ITS instances using your university username and password, though you will need the admin account password to export data.


You will need a suitable private key to access the CIM-managed linux systems. You can generate a public/private key pair on both macOS and Linux. To log in, ask James to place your public key in the authorized_keys file on the server. Then you can log in:
ssh -i /path/to/key username@server
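If you do not yet have a key pair, a minimal sketch of generating one with ssh-keygen follows; the file name and comment are illustrative, not prescribed by CIM.

```shell
# Sketch: generate an ed25519 key pair (file name and comment are illustrative).
ssh-keygen -t ed25519 -f ./geocrawler_key -N "" -C "your-university-username"
# geocrawler_key      - private key (keep this safe, pass it to ssh -i)
# geocrawler_key.pub  - public key (this is what goes on the server)
```

The .pub file is the one to send for adding to the server; the private key never leaves your machine.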

Export the data into a json file

The geocrawler uses a postgres database. The PostGIS extensions are installed and the table is called tweets. Within this table is a column containing the raw JSON output from the Twitter API.

All data

To export all of the data between September 1st 2017 and October 1st 2017:

sudo -u postgres psql -d mycapture -qAt -c "SELECT tweet FROM tweets WHERE created_at >= '2017-09-01' AND created_at < '2017-10-01';" | grep -v '^$' > london_sep_17.json

The command runs psql as the postgres user with sudo privileges. The 'mycapture' database is selected and the command is run with the no-align (-A), tuples-only (-t) and quiet (-q) flags. The SQL query sits between the double quotes and requests the values from the tweet column in the tweets table where the created_at value is on or after 2017-09-01 and before 2017-10-01.

The output of this query is piped through grep to remove empty lines and is then saved in the file london_sep_17.json.
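The effect of the grep stage can be seen on a small stand-in sample (the file name and contents here are illustrative only):

```shell
# Sketch: psql's -qAt output can include blank lines; grep -v '^$' strips them,
# leaving one JSON object per line.
printf 'tweet-json-1\n\ntweet-json-2\n\n' | grep -v '^$' > demo.json
wc -l < demo.json   # 2 lines remain: one tweet per line, no blanks
```

This one-object-per-line layout is what the TCAT import script expects later in this guide.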

Only geolocated tweets

The postgres server includes the geospatial extensions from PostGIS. To export tweets from within a specific geographic bounding box:

sudo -u postgres psql -d mycapture -qAt -c "SELECT tweet FROM tweets WHERE created_at >= '2017-09-01' AND created_at < '2017-10-01' AND ST_CONTAINS(ST_GeomFromText('POLYGON((-2.1544 52.6081, -1.6469 52.6081, -1.6469 52.332, -2.1544 52.332, -2.1544 52.6081))', 4326),coordinates)" | grep -v '^$' > london_sep_17.json

The command is the same as the previous one, except that we add an ST_CONTAINS condition to the SQL query. This new condition builds a bounding polygon geometry from text (WKT) in the 4326 projection (WGS 84) and keeps only the tweets whose coordinates fall within that polygon. You can use this tool to get the bounding box.
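The POLYGON string is plain well-known text (WKT): a closed ring of lon/lat corners in which the first corner is repeated at the end. A sketch of building it from four bounding-box edges (the values below are the ones used in the query above):

```shell
# Bounding box edges (longitude west/east, latitude south/north).
WEST=-2.1544; EAST=-1.6469; SOUTH=52.332; NORTH=52.6081
# Closed ring: NW, NE, SE, SW, and back to NW to close the polygon.
WKT="POLYGON(($WEST $NORTH, $EAST $NORTH, $EAST $SOUTH, $WEST $SOUTH, $WEST $NORTH))"
echo "$WKT"
```

Substituting the resulting string into ST_GeomFromText gives exactly the polygon used in the query.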

This may take a long time. Patience is good.

Transfer the JSON file to TCAT server

Transfer the files via SFTP (SSH File Transfer Protocol), i.e., file transfer over SSH. If you ran the export command above then the .json file will be in the current directory. Connect to the TCAT server via sftp (the hostname here is a placeholder):

sftp james@tcat-server
Then transfer the file using the put command. The file will be transferred into the home directory of the user (in the case above, james).

put london_sep_17.json
The file is now on the target server. Congratulations.

Import the json file into TCAT

Please back up the TCAT database first !add link to guide for this at later date!. Importing can completely bork your install.

SSH into the machine, create a directory called json and move the data into the directory.

mkdir json
mv london_sep_17.json json
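As a quick sanity check after staging the file, the line count equals the number of exported tweets (one JSON object per line). A self-contained sketch, using a stand-in file so it runs anywhere (in practice london_sep_17.json already exists from the transfer):

```shell
# Sketch: stage the dump in a dedicated directory and count the tweets in it.
# A three-line stand-in file is created here so the example is self-contained.
printf '{"id":1}\n{"id":2}\n{"id":3}\n' > london_sep_17.json
mkdir -p json                      # -p: no error if the directory already exists
mv london_sep_17.json json/
wc -l < json/london_sep_17.json    # lines = tweets staged (3 in this sketch)
```

If the count differs wildly from what the export reported, check the transfer before importing.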

TCAT has a PHP script which will import the json file into an existing data bin. As good practice, we will use the screen command; screen allows the import to continue running even if our connection drops. Next, edit the PHP script. I use vim but you can use nano if you would like.
sudo vim /var/www/dmi-tcat/import/import-jsondump.php

and then edit the $dir and $bin_name values accordingly:
// specify the name of the bin here
$bin_name = 'LondonArea';
// specify dir with the user timelines (json)
$dir = '/home/james/json';
// set type of dump ('import follow' or 'import track')
$type = 'import track';

then run the script using php:

sudo php /var/www/dmi-tcat/import/import-jsondump.php

You should see a continual stream of dots (e.g., ........). The dots indicate that the data is being imported. At the end you should see a summary of how many tweets have been imported.

Job done. Go to the bin and confirm the data has been imported.


The psql command may just stop for no apparent reason. This happens when the system runs out of memory and the command is silently killed (not ideal behaviour). Try again with a smaller number of tweets (e.g., a shorter time period).
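One way to keep each export small is to split the month into roughly weekly windows and run the export once per window. A sketch of generating the half-open date ranges (GNU date is assumed; echo stands in for the psql export command above):

```shell
# Sketch: split September 2017 into ~weekly half-open windows [FROM, TO).
START=2017-09-01
for i in 0 1 2 3 4; do
  FROM=$(date -d "$START + $((i * 7)) days" +%F)
  TO=$(date -d "$START + $(((i + 1) * 7)) days" +%F)
  echo "export window: $FROM to $TO"   # replace echo with the psql export command
done
```

Because each window uses created_at >= FROM AND created_at < TO, the windows do not overlap, and concatenating the resulting files reproduces the full month.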