Server Setup, Automated

Setting Up the Development Environment

I decided to spin up a DigitalOcean VPS to host my projects for now. This is easily done with a Terraform file I've used before:

variable "do_token" {
  default="####################################"
}

provider "digitalocean" {
  token = "${var.do_token}"
}

resource "digitalocean_droplet" "a1" {
  image  = "ubuntu-16-04-x64"
  name   = "a1"
  region = "tor1"
  size   = "1gb"
  # size = "4gb"
  # user_data = <<EOF
  # #!/bin/bash
  # apt-get -y update
  # apt-get -y install python
  # EOF
  ssh_keys = ["###############################"]
}

resource "digitalocean_floating_ip" "myip" {
  region = "tor1"
  droplet_id = "${digitalocean_droplet.a1.id}"
}

output "ip" {
  value = "${digitalocean_floating_ip.myip.ip_address}"
}

The command to run is terraform apply.

Done!


The IP is 174.138.114.17. I'll add an entry to my hosts file for now; I will probably add a subdomain for it soon so it's accessible by domain name:

174.138.114.17 marathon

We will use Ansible for configuration management. First, Ansible requires Python on the managed host, but it is not installed in the default DO Ubuntu image. Let's bootstrap it with this playbook:

- hosts: all
  gather_facts: False
  
  tasks:
  - name: install python 2
    raw: test -e /usr/bin/python || (apt -y update && apt install -y python-minimal)

The command to run is ansible-playbook ansible-init.yml -u root -i marathon, (the trailing comma tells Ansible to treat marathon as an inline inventory list rather than an inventory file).

Success!

We will also install Docker on the server, so we can push our Docker deployments there. A search on Ansible Galaxy shows that somebody has already created a darn good Docker role with thousands of downloads. Let's use that:

- hosts: all
  roles:
    - mongrelion.docker

Then run ansible-galaxy install mongrelion.docker, followed by ansible-playbook -vvv init-docker.yml -u root -i marathon,.

Do a docker run --rm -p 80:80 nginx, then open http://marathon and see the Nginx default page. It works!


We are already seeing a repeating pattern, so let's write a bash function to save us the trouble of re-typing:

function play() {
    # run a playbook as root against the marathon host (inline inventory)
    ansible-playbook -vvv "$1" -u root -i marathon,
}

Works well: play init-docker.yml now replaces the long command.

Developing the App

For today, I want to start with something simple: write some web scraping scripts in Python and display the aggregated data in a nicely formatted spreadsheet so that we can sort it.

I have an idea: how about we archive all the DevOps jobs on Indeed (Canada) and put them in a spreadsheet?

Why would anybody want this?

It's much easier to sort in a spreadsheet than by clicking through the filtering criteria on the website. Plus, I get to keep a record of jobs that are no longer available, so we can run some statistics on job market history later.

Let's get started!


First, we need to figure out what to search for. Let's play with the advanced search feature.

We want all the jobs in Canada, so we will leave out the location; the age of the post will be set to "since yesterday", because our script will run daily. Let's display 50 results per page.

Running the resulting URL gives me 91 results on 2 pages.

We can see that the search gives some false positives: an "Advanced Web UI Application Developer" is not a DevOps job just because it has "Nice to have: Devops Experience" in the description. Let's match on the title but not the description. We will match "devops" as well as "Development Operations", since some postings spell it out.

After some trial and error, I find that this URL gives the best results:

https://ca.indeed.com/jobs?as_and=&as_phr=&as_any=&as_not=&as_ttl=Devops+or+%22Development+Operation%22&as_cmp=&jt=all&st=&salary=&radius=50&l=&fromage=1&limit=50&sort=date&psf=advsrch

It returns jobs whose title matches "Devops" or "Development Operation", case ignored.
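
For the record, the same query can be built programmatically. Here is a minimal sketch that spells out only the parameters that matter (the empty fields from the advanced-search form are dropped):

from urllib.parse import urlencode

params = {
    'as_ttl': 'Devops or "Development Operation"',  # match in the title only
    'fromage': 1,     # posted within the last day
    'limit': 50,      # results per page
    'sort': 'date',
    'psf': 'advsrch',
}
url = 'https://ca.indeed.com/jobs?' + urlencode(params)
print(url)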


Then, we need to figure out what information to scrape.

We will scrape:

  1. Job title
  2. Company name
  3. Short description
  4. Posting time (we can use yesterday's date as the timestamp, as long as we run the script at 00:01 every day)
  5. Location
  6. Salary
  7. Whether the job can be applied to directly on Indeed (boolean)

I will try the requests-html Python library by Kenneth Reitz. It looks really easy to use, and it supports JavaScript.

Run pipenv install requests-html, then pipenv shell, then python.

With the help of another tool called SelectorGadget, I can easily find the CSS selectors I need.

It went smoothly; the first draft took about 45 minutes. Here's the code:

from requests_html import HTMLSession
import datetime

def get_text(el, sel):
    """Return the text of the first element matching sel, or "" if absent."""
    try:
        return el.find(sel)[0].text
    except Exception:
        return ""

s = HTMLSession()
r = s.get('https://ca.indeed.com/jobs?as_and=&as_phr=&as_any=&as_not=&as_ttl=Devops+or+%22Development+Operation%22&as_cmp=&jt=all&st=&salary=&radius=50&l=&fromage=1&limit=50&sort=date&psf=advsrch')

r.html.render()  # render the JavaScript (headless Chromium under the hood)
cards = r.html.find('.clickcard')  # one element per job posting

res = []

for card in cards:
    d = {}
    d['ttl'] = get_text(card, '.jobtitle .turnstileLink')
    d['comp'] = get_text(card, '.company')
    d['loc'] = get_text(card, '.location')
    d['desc'] = get_text(card, '.summary')
    d['apply'] = card.find('.iaLabel') != []  # True if the "apply" label is present
    d['date'] = datetime.datetime.now().strftime("%Y-%m-%d")
    d['sal'] = get_text(card, '#resultsCol .no-wrap')  # salary, when listed
    res.append(d)

import pprint; pprint.pprint(res)

I decided to use Google Sheets to store and display the data, following this tutorial.

I manually created a spreadsheet on my Google account and shared it with the API service account. Now I just need Python to insert the records into the spreadsheet.
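
The insertion itself is only a few lines. A minimal sketch, assuming the gspread library with a service-account JSON key (the file name indeed-creds.json and the sheet name indeed_jobs are placeholders):

import gspread
from oauth2client.service_account import ServiceAccountCredentials

scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
# indeed-creds.json is the service account's JSON key (placeholder name)
creds = ServiceAccountCredentials.from_json_keyfile_name('indeed-creds.json', scope)
gc = gspread.authorize(creds)

# open the manually created spreadsheet (placeholder name) by title
sheet = gc.open('indeed_jobs').sheet1

for d in res:  # res is the list of dicts built by the scraper above
    sheet.append_row([d['date'], d['ttl'], d['comp'], d['loc'],
                      d['sal'], d['desc'], str(d['apply'])])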

I also added a header row, debugged the salary selector, and wrote a deduplication feature into the Python code (sketched below). Now it works very well.
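
The deduplication can be as simple as skipping records whose key fields already appear in the sheet. A minimal sketch, assuming the column order used in the insertion sketch above:

def dedup(sheet, records):
    """Drop records whose (title, company, location) already appear in the sheet."""
    # [1:] skips the header row; column indices follow the sketch above
    existing = {(row[1], row[2], row[3]) for row in sheet.get_all_values()[1:]}
    return [d for d in records
            if (d['ttl'], d['comp'], d['loc']) not in existing]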

Visit the Google Sheet!


Now I will package the code in a Docker image and pull it onto the server, which will run a container from it daily via a cron job.

Build the image:

FROM ubuntu:xenial

RUN apt-get -y update && apt-get install -y python3 python3-pip

WORKDIR /app
# copy the dependency list in first, so the pip layer is cached
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY insert_gsheet.py .

CMD ["python3", "insert_gsheet.py"]

It took me 6 hours and 20 minutes of straight work on this thing, and I still cannot deploy it with Docker. It seems that even with Docker, we are still at "works on my laptop but not on prod". I'm blocked on this error: pyppeteer.errors.BrowserError: Failed to connect to browser port: http://127.0.0.1:45439/json/version, which seems to be Docker-related. It might be related to this issue.

Finally, I got it working after following the instructions here (essentially, installing inside the image all the shared libraries that headless Chromium needs). I ended up with one giant Dockerfile.


Now we push the image to Docker Hub: docker tag dohsimpson:indeed_jobs dohsimpson/indeed_jobs, then docker push dohsimpson/indeed_jobs. Then pull it on the server: docker pull dohsimpson/indeed_jobs. The push takes forever because of my slow uplink.

Lastly, we add a cron entry with Ansible:

- hosts: all
  tasks:
  - name: Run indeed_jobs every day after midnight
    cron:
      name: Run indeed jobs docker container
      job: docker run --rm dohsimpson/indeed_jobs
      minute: "1"
      hour: "0"

All seems to be working!

[Screenshot: the populated Google Sheet, 2018-03-20]

Lesson Learned

Do not use Docker if you are working solo; it's not worth the trouble.

TODO:

  • The script doesn't handle pagination, but if we run it frequently enough, that won't be a problem (a possible approach is sketched after this list).
  • I will leave out any CSS magic for now.
  • Try packaging the app with Dokku; it might be easier.
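
For the pagination item, here is a hypothetical sketch. It assumes Indeed offsets results with a start query parameter (so page 2 of a 50-per-page search would be start=50), which is an assumption, not something verified in this post; session is the HTMLSession from the script above.

def fetch_all_cards(session, base_url, page_size=50, max_pages=5):
    """Fetch result cards page by page until a page comes back empty."""
    cards = []
    for page in range(max_pages):
        # "start" is assumed to be Indeed's result-offset parameter
        r = session.get('%s&start=%d' % (base_url, page * page_size))
        r.html.render()
        found = r.html.find('.clickcard')
        if not found:
            break
        cards.extend(found)
    return cards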