
Guide to Importing Geocrawler Data to TCAT


CIM runs a Twitter geocrawler (available on GitHub) on an ITS instance and on archer.cim.warwick.ac.uk. The tweets can be exported from the PostGIS database and then imported into other software (e.g., TCAT).

This guide explains how to export this data from the Twitter geocrawler and then import it into TCAT.


Process overview

  • Log into the Linux instance running the geocrawler
  • Export the data into a .json file
  • Transfer the .json file to the TCAT server
  • Run a PHP script to import the data into TCAT

Instructions


Log into the Linux instance

Open a terminal. You can log into ITS instances using your university username and password, though you will need the admin account password to export data.

ssh pssnac@servername.lnx.warwick.ac.uk

You will need a suitable private key to access the CIM-managed Linux systems. You can generate a public/private key pair on both macOS and Linux. Ask James to add your public key to the authorized_keys file on the server, then you can log in.
ssh -i /path/to/key james@servername.cim.warwick.ac.uk
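
If you do not already have a key pair, you can generate one with ssh-keygen; the key type, file path and comment below are only illustrative choices, not requirements of this guide.
# generate a key pair; send the public half (~/.ssh/cim_geocrawler.pub) to be added to the server
ssh-keygen -t ed25519 -f ~/.ssh/cim_geocrawler -C "your.name@warwick.ac.uk"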

Export the data into a JSON file

The geocrawler uses a Postgres database with the PostGIS extensions installed. The table is called tweets, and it contains a column holding the raw JSON output from the Twitter API.
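
If you want to check the table layout before exporting (an optional step; this assumes the database name mycapture used in the commands below), you can describe the table with psql
sudo -u postgres psql -d mycapture -c '\d tweets'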

All data

To export all of the data from September 1st 2017 up to (but not including) October 1st 2017

sudo -u postgres psql -d mycapture -qAt -c "SELECT tweet FROM tweets WHERE created_at >= '2017-09-01' AND created_at < '2017-10-01';" | grep -v '^$' > london_sep_17.json


The command runs psql as the postgres user with sudo privileges. The mycapture database is selected and the command is run with the quiet (-q), no-align (-A) and tuples-only (-t) flags. The SQL query is enclosed in double quotes and selects the tweet column from the tweets table where the created_at value is on or after 2017-09-01 and before 2017-10-01.

The output of this query is piped through grep to remove blank lines and is then saved to the file london_sep_17.json.
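
As a quick sanity check (optional, and assuming the file name used above), each tweet is written as a single JSON object per line, so the line count gives the number of exported tweets
wc -l london_sep_17.json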

Only geolocated tweets

The Postgres server includes the geospatial extensions from PostGIS. To export only the tweets from within a specific bounding box

sudo -u postgres psql -d mycapture -qAt -c "SELECT tweet FROM tweets WHERE created_at >= '2017-09-01' AND created_at < '2017-10-01' AND ST_CONTAINS(ST_GeomFromText('POLYGON((-2.1544 52.6081, -1.6469 52.6081, -1.6469 52.332, -2.1544 52.332, -2.1544 52.6081))', 4326),coordinates)" | grep -v '^$' > london_sep_17.json


The command is the same as before, except that an ST_CONTAINS condition is added to the SQL query. This condition builds a bounding polygon geometry from text (using SRID 4326, i.e. the WGS 84 coordinate system) and keeps only the tweets whose coordinates fall within that polygon. An online bounding box tool can be used to obtain the polygon coordinates.

This may take a long time. Patience is good.
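
If you want a rough idea of how many tweets will be exported before committing to the full export (an optional step), you can run a count with the same WHERE clause first; the ST_CONTAINS condition can be added in the same way.
sudo -u postgres psql -d mycapture -qAt -c "SELECT count(*) FROM tweets WHERE created_at >= '2017-09-01' AND created_at < '2017-10-01';"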


Transfer the JSON file to the TCAT server

Transfer the file via SFTP (SSH File Transfer Protocol), which transfers files over an SSH connection. If you have run the above commands then the .json file will be in your current directory. Connect to the TCAT server via sftp

sftp james@servername.cim.warwick.ac.uk

Then transfer the file using the put command. The file will be transferred into the home directory of the user on the target server (in the case above, james).

put london_sep_17.json
The file is now on the target server. Congratulations.
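
To check that the transfer completed intact (an optional extra, assuming sha256sum is available on both machines), compare checksums on the source and target servers
# on the geocrawler server
sha256sum london_sep_17.json
# on the TCAT server
sha256sum ~/london_sep_17.json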

Import the JSON file into TCAT

Please back up the TCAT database first (a link to a guide for this will be added at a later date). Importing can completely bork your install.
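
A minimal backup sketch, assuming the TCAT MySQL database is called twittercapture and that you have the MySQL root password; check TCAT's config.php (in /var/www/dmi-tcat on this install) for the actual database name and credentials.
mysqldump -u root -p twittercapture > tcat_backup_$(date +%F).sql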

SSH into the machine, create a directory called json and move the data into the directory.

ssh james@servername.cim.warwick.ac.uk
mkdir json
mv london_sep_17.json json

TCAT has a PHP script which will import the JSON file into an existing data bin. As good practice, we will run the import inside screen, which allows the script to continue running even if our connection is broken. Start a screen session (shown below), then edit the import script; I use vim but you can use nano if you would like.
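
To start the screen session (the session name tcat-import is just an illustrative choice)
# start a named session; if the connection drops, reattach later with: screen -r tcat-import
screen -S tcat-import
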
sudo vim /var/www/dmi-tcat/import/import-jsondump.php

and then edit the $bin_name and $dir values accordingly (and check that $type is set correctly)
// specify the name of the bin here
$bin_name = 'LondonArea';
// specify dir with the user timelines (json)
$dir = '/home/james/json';
// set type of dump ('import follow' or 'import track')
$type = 'import track';


Then run the script with PHP

sudo php /var/www/dmi-tcat/import/import-jsondump.php

You should see a continual stream of dots (e.g., ........), which indicates that the data is being imported. A summary of how many tweets have been imported is printed at the end.

Job done. Go to the bin in the TCAT interface and confirm the data has been imported.
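
If you would rather check from the command line, the sketch below assumes TCAT's usual layout of one <bin_name>_tweets table per bin inside a database called twittercapture (so here LondonArea_tweets); adjust the names to match your install.
mysql -u root -p -e "SELECT count(*) FROM twittercapture.LondonArea_tweets;"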


Notes

The psql command may appear to stop for no reason. This happens when the system runs out of memory and the process is silently killed (not ideal behaviour). Try again with a smaller number of tweets (e.g., a shorter time period).
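
To confirm that the out-of-memory killer was responsible (an optional check, assuming you have sudo on the geocrawler server), look for kill messages in the kernel log
sudo dmesg | grep -iE 'out of memory|killed process'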