CRAIGSLIST ALERTS, PART 2: SCRAPING CRAIGSLIST AND SENDING E-MAILS WITH PYTHON

Following on from last week, I’ve put together a little program that sends me Craigslist alerts using automated_craigslist_search, the module I created. I’ve broken it down into four steps:

  1. Set up an Amazon EC2 server
  2. Run a search using automated_craigslist_search
  3. Keep a record of your previous alerts
  4. Use cron to run your script periodically

1. Setting up an Amazon EC2 server

You need to run the script from somewhere, and you need to keep it running. Along comes Amazon Web Services (AWS) with the Elastic Compute Cloud (EC2). If you’re wondering what the hell an Elastic Compute Cloud is (as I was), a good friend of mine described it as “a Unix box in space”, i.e. through Amazon we acquire a machine we can run our script on, simple as that. We access the machine through Terminal using SSH keys. Amazon offers an AWS package with the first 12 months free. Here’s how to set one up:

  • Set up a free AWS account by visiting aws.amazon.com. Sign in using your regular Amazon credentials.
  • Once logged in, you will be directed to your AWS console. From here select EC2 Virtual Servers in the Cloud and then wait patiently… It can take a few hours for the request to be accepted, but once it is, you will receive an e-mail notification titled “Your AWS Account is Ready – Get Started Now”. Click the Amazon EC2 Instance link inside the e-mail.
  • From here click Launch Instance
  • Now you will be prompted to choose an operating system for your brand spanking new EC2. I definitely recommend Ubuntu over anything else. It is by far the easiest to use, and it’s awesome.
  • Choose the free tier, then press Review and Launch, then hit Launch.
  • You will then be prompted to select an existing key pair or create a new pair. This is required for you to access your EC2 from Terminal using SSH keys. Select Create a new key pair, then give it a name. I called mine AWSKey, then downloaded the AWSKey.pem file.
  • Now you can access your EC2 using your SSH key from the terminal. Just follow these next steps, or for the more detailed general description click here.

Using SSH Keys to access your EC2 from Terminal

    • cd to the directory containing the file AWSKey.pem
    • Run the following commands:
$   chmod 400 AWSKey.pem 
$   ssh -i AWSKey.pem ubuntu@<your-public-dns>

Your public DNS is available in your AWS console. Look in the table under the column named “Public DNS”.

Now you have access to your server. You’ll want to go ahead and install everything you have on your home machine. Ubuntu has an easy-to-use package manager called apt-get.

To install Git:

sudo apt-get install git-core

Now you can download and install automated_craigslist_search. To install pip:

sudo apt-get install python-pip

then go ahead and install all of your usual Python packages.

Good to go! You have a fully functional Unix box in space!

2. Run a search using automated_craigslist_search

Start with your search criteria:

#############################################
# Search criteria
#############################################

search_key_words = 'wafer thin mints'
words_not_included = ''
min_value = 0
max_value = 50
category = 'all for sale / wanted'
city = 'newyork'

then enter the e-mail address your alerts will come from and create your mailing list (I’m just sending the results to myself):

#############################################
# E-mail information
#############################################
 
# the Gmail address that alerts will come from: this will require the user's password
send_alerts_from = "ticket.alerts.from.robert@gmail.com"
 
# the mailing list. A Python List of strings, each containing e-mail addresses:
mailing_list = ["robert.david.west@gmail.com"]
 
#############################################

then just execute the function search_and_send():

# search and send results
df = connect_to_craigslist.search_and_send(
    send_alerts_from, mailing_list, password, search_key_words,
    previous_alerts.df, min_value=min_value, max_value=max_value,
    words_not_included=words_not_included, city='newyork')

Boom. If the search returned any info, it will be sent out to your mailing list. Good to go! Now we can get this running periodically, right? Um… not quite. I did that step next, then realised I was going to be e-mailing people the same information over and over again. First we need to keep track of the searches we’ve made and make sure we don’t duplicate our alerts.

There is one input to search_and_send that I haven’t mentioned: previous_alerts.df. I will cover this in section 3, but to run the function all you need is an empty DataFrame that has the same columns as the DataFrame returned by search_and_send. search_and_send will throw out any search results that are already contained in this DataFrame.
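If you don’t yet have a stored record, a minimal way to bootstrap one is an empty DataFrame with matching columns. This is only a sketch — the column names below are my assumption; use whatever columns search_and_send actually returns in your copy of automated_craigslist_search:

```python
import pandas as pd

# Hypothetical column names -- replace with the columns that
# search_and_send actually returns in your version of the module.
columns = ['title', 'price', 'url', 'date']

# An empty frame with those columns: search_and_send will treat every
# result as new, since nothing can match against an empty record.
empty_alerts = pd.DataFrame(columns=columns)
print(empty_alerts.empty)  # True
```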

3. Keep a record of your previous alerts

I’m using HDF5 via PyTables, since it plays nicely with pandas DataFrames.

The idea is to start with an empty DataFrame stored in a .h5 file, and then append all previously e-mailed search results to this dataset, to make sure the same result isn’t sent twice.

I started by running a search that would return an empty DataFrame:

# search and send results
df_old_alerts = connect_to_craigslist.search_craigslist('sa@$%gieu')

then I stored the resulting DataFrame in my working directory:

df_old_alerts.to_hdf('previous_alerts.h5','df')

Now I have an empty DataFrame that I can use with search_and_send. So run the search, append the results to the stored DataFrame, drop any duplicates, then save the .h5 file:

# search and send results
df = connect_to_craigslist.search_and_send(
    send_alerts_from, mailing_list, password, search_key_words,
    previous_alerts.df, min_value=min_value, max_value=max_value,
    words_not_included=words_not_included, city='newyork')

# append current search results to full list
previous_alerts.df = previous_alerts.df.append(df)

# remove duplicates
previous_alerts.df = previous_alerts.df.drop_duplicates()

# update previous search entries in hdf5 file
previous_alerts.df.to_hdf('previous_alerts.h5','df')
previous_alerts.close()

Now if you run the script multiple times a new alert will only be sent if the search results change!
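The reason repeated runs stay quiet is drop_duplicates: appending the same search results a second time leaves the stored record unchanged. A quick self-contained illustration (made-up listings, not real Craigslist data; I use pd.concat here, which is the modern equivalent of the append call above):

```python
import pandas as pd

old = pd.DataFrame({'title': ['wafer thin mints'], 'price': [25]})
new = old.copy()  # a second run returning the identical listing

# Appending then deduplicating leaves the record unchanged.
combined = pd.concat([old, new]).drop_duplicates()
print(len(combined))  # 1 -- the duplicate listing was dropped
```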

OK, this still needs work, since I now have an ever-growing DataFrame stored on disk. I need to set up another script that runs periodically to manage this DataFrame.
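One way that management script could work (a sketch only — it assumes the results carry a 'date' column, which may not match the real columns) is to periodically drop rows older than, say, 30 days:

```python
import pandas as pd

def prune_old_alerts(df, days=30, date_column='date'):
    """Drop rows whose date_column is more than `days` days old."""
    cutoff = pd.Timestamp.now() - pd.Timedelta(days=days)
    dates = pd.to_datetime(df[date_column])
    return df[dates >= cutoff]

# Example with made-up listings: the fresh row survives, the stale one goes.
df = pd.DataFrame({
    'title': ['fresh listing', 'stale listing'],
    'date': [pd.Timestamp.now(),
             pd.Timestamp.now() - pd.Timedelta(days=90)],
})
pruned = prune_old_alerts(df)
print(list(pruned['title']))  # ['fresh listing']
```

Old listings can’t re-trigger an alert anyway once they expire from Craigslist, so dropping them keeps the file small at no cost.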

Here‘s the full script I used working through sections 2 and 3.

4. Using Cron to run your script periodically

Cron allows you to execute tasks in the background of your machine at designated times. The Ubuntu cron how-to page is very user friendly and gives good examples of how to use it. Here are the steps I took to get my script running on my EC2:

  1. Cron comes preinstalled on Ubuntu, so there is nothing to install. (sudo apt-get install gnome-schedule gets you a GUI front end for it, but it isn’t needed here.)
    
  2. Then open the crontab editor from a terminal window with the following command:
     crontab -e   
    
  3. Enter the following line:
     */1 * * * * /usr/bin/python /home/ubuntu/periodic_craigslist_search/periodic_craigslist_search.py
    

    This will run the Python file periodic_craigslist_search.py every minute. To change how often the script is run, adjust the 1 you see in */1 to any value from 1 to 59 minutes. If you want to run the script every hour the schedule would be 0 */1 * * * ...

And that’s it, done! For now I’d still probably recommend IFTTT for doing this robustly, but right now their server is down… so my solution should be useful for the next 30 mins or so at least ;)

Proof! (screenshot: IFTTT’s error page, “something isn’t working”)
