Automatic package and documentation deployment using GitLab CI

by Susanne Groothuis, Data Scientist on September 24, 2019

A how-to article on creating a pipeline for automated testing and deployment of code and documentation using GitLab CI and Kubernetes

Let's say you're working for a client and you want to reuse some awesome piece of code your colleague made a week prior. But she wasn't allowed to put it on PyPI (since that was also for a client), so you try to just clone the repo and install it that way. "But I'm still working on it!" she exclaims. OK, that's fine. You'll just add it as a submodule and then it will get updated as she pushes her changes. But then you realize that her module already depends on another submodule, and that submodule depends on some legacy code made by someone who no longer works here! And so on and so forth, and before you know it you're in Dante's submodule inferno and there is no escape.

Photo by Jay Heike on Unsplash
Me in recursive submodule hell.

This used to happen a lot in our team, and we often found that reusing code was much more trouble than it was worth. This resulted in low adoption of older code and meant that we were often reinventing the bdist wheel. Needless to say, something needed to change.

To finally tackle this issue, we decided that the following things were important for us to create more cohesion and increase code adoption:

  1. Make sure tests are run on all our project code and packages
  2. Create easy access to packages created by team members
  3. Create easy access to documentation on these packages
  4. Make creating this easy so people actually do it

Points two and three could easily be solved by setting up a package index server and a documentation server, respectively. On top of that, we could use GitLab's CI/CD functionality to automatically run the tests and deploy the packages and documentation. Making sure we have templates in place for new projects would hopefully make it second nature.

In this blog post I'll explain how we set up our PyPI and documentation servers on Kubernetes and what we added to our GitLab CI/CD pipeline for automated testing and deployment.

Please note that I will not be going into details of our security settings, as this is handled on the cluster level and so is out of scope of this blog post.

GitLab CI

CI/CD stands for Continuous Integration / Continuous Deployment (or Delivery, depending on who you talk to). As the name suggests, it consists of two parts: the CI and the CD. The CI is about making sure all your code is stored in the same place and always goes through the same process before you publish your results. For example, all code for the same project is stored in the same repository, and you always run the same tests.

The CD part takes care of deploying your code, which can be boring and tedious when done manually. Luckily there are a number of tools available to automate this process, like Travis CI or the GitLab CI runner. In the context of this post we'll regard CI/CD as the automated process we use to test, build and deploy our Python packages and their documentation.

The GitLab CI and CD pipelines.

Setting up the GitLab CI/CD functionality is pretty straightforward. All you need to do is add a .gitlab-ci.yml file to the root of your repository. GitLab even provides an online linter to check the syntax of your YAML file. Depending on the contents of the YAML, a new pipeline will trigger every time a change happens on the branches you've specified.

We start out with a simple .gitlab-ci.yml file that has only one stage (tests) and runs the tests on Python 3.6, as shown below:


image: python:3.6-buster

before_script:
  - pip install pip --upgrade

stages:
  - tests

test36:
  stage: tests
  script:
    - pip install pytest pytest-cov
    - pip install .
    - cd tests
    - pytest --cov $PACKAGE_NAME
  only:
  - merge_requests
  - master

This file now specifies a few things in order to run a basic pipeline:

  • The base image
  • Anything you want to run at the start of every job, specified in before_script
  • All stages and their order of execution
  • Each job in the pipeline, and for each job:
    • To which stage it belongs
    • What script to run
    • When to trigger it (in this case on merge_requests and commits on master)

This simple pipeline will be the basis for our final configuration, and we will keep adding to it throughout the post to include the other deployment steps.

pypiserver

Next up is setting up a private garden of Eden (read: package registry) for the team. There are a few options you can consider for this purpose, such as pypicloud and of course devpi.

For our use case we settled on pypiserver, a minimal index for Python packages that seemed to cover most of what we needed. Plus, it also had a Helm chart available for easy deployment on Kubernetes!

Installing a Helm chart is quite easy and can be done with the following two commands:

helm repo add owkin https://owkin.github.io/charts
helm install cicd-pypi owkin/pypiserver --version=1.0.0-rc.1 --namespace=pypi-server -f values.yaml

Here cicd-pypi is the name of our deployment and pypi-server is the namespace it lives in. You only need to adapt the values.yaml to your specifications, such as adding the hostname and the credentials for the gitlab-ci user.

auth:
  actions: update
  credentials:
    [username]: "[password-string]"

ingress:
  enabled: true
  labels: {}
  annotations:
    kubernetes.io/ingress.class: nginx
    certmanager.k8s.io/cluster-issuer: tls-cert-letsencrypt-prod # we added a certificate issuer for https
    nginx.ingress.kubernetes.io/whitelist-source-range: [range] # we also specified only specific source ranges to only allow internal traffic
  path: /
  tls:
    - hosts:
      - [host url]
      secretName: [secret name]
  hosts:
    - [host url]

You can add separate credentials for each of your users, so people can also manually push packages to the server using twine. However, we didn't want to manage a lot of individual accounts, and we only wanted to deploy packages through the CI/CD pipeline. So we only need one user, namely for the GitLab runners.
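
If you do set up per-user credentials, a manual upload with twine against the private index looks roughly like this (the URL, username and password are placeholders, not our actual values):

twine upload --repository-url https://pypi.example.internal/ -u some-user -p 'some-password' dist/*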

For more information on pypiserver, check out its documentation.
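
Installing from the server is the easy part: pypiserver exposes a pip-compatible index, so team members can install a package by pointing pip at it (again, the URL is a placeholder):

pip install YourPackage --extra-index-url https://pypi.example.internal/simple/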

Pushing to the pypiserver using GitLab CI

Now that we have a PyPI server running, we need to start pushing packages to it. The building and deployment can all be handled by GitLab CI on the fly; we only need to give it the credentials of the user we created in the previous step.

Luckily, GitLab lets you define custom environment variables that can be made available either to a specific repository or at the group level, under Settings > CI/CD > Variables of the repo or group. Here we add three variables: the server URL (PYPI_REPOSITORY), the username (PYPI_USERNAME) and the password (PYPI_PASSWORD).

Pushing packages with twine requires a .pypirc file containing the pypiserver credentials. Using the variables defined above, we can build this file in one of the stages.
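
For reference, the file those steps generate ends up looking roughly like this (the index name matches the serverIndexName used below; the URL and credentials are placeholders):

[distutils]
index-servers =
  serverIndexName

[serverIndexName]
repository: https://pypi.example.internal/
username: gitlab-ci
password: some-password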

Below you can see how we use these variables to build the .pypirc authentication file during the deploy stage, after which we push the package to the server using twine.

image: python:3.6-buster

before_script:
  - pip install pip --upgrade

variables:
  PACKAGE_NAME: "YourPackage"
  PYPI_REPOSITORY: SECURE # placeholders; the real values come from the CI/CD variables set in the GitLab settings
  PYPI_USERNAME: SECURE
  PYPI_PASSWORD: SECURE

stages:
  - tests
  - deploy

test36:
  stage: tests
  script:
  - pip install pytest pytest-cov
  - pip install .
  - cd tests
  - pytest --cov $PACKAGE_NAME
  only:
  - merge_requests
  - master

deploy_pypi:
  stage: deploy
  script:
    - pip install twine setuptools wheel
    - python3 setup.py bdist_wheel #build the package
    - pip install .
    - echo "[distutils]" >> ~/.pypirc # create credential file
    - echo "index-servers =" >> ~/.pypirc
    - echo "  serverIndexName" >> ~/.pypirc
    - echo "" >> ~/.pypirc
    - echo "[serverIndexName]" >> ~/.pypirc
    - echo "repository:" $PYPI_REPOSITORY >> ~/.pypirc
    - echo "username:" $PYPI_USERNAME >> ~/.pypirc
    - echo "password:" "$PYPI_PASSWORD" >> ~/.pypirc
    - twine upload --repository serverIndexName dist/* # upload to server
    - echo "" > ~/.pypirc && rm ~/.pypirc # delete everything
  only:
    - master

Documentation server using NGINX

Setting up the documentation server was a bit more difficult. At first we wanted to use GitLab Pages, but we found out that this was not implemented for the Kubernetes instance of GitLab.

As an alternative, we decided to use a standard NGINX server that auto-indexes a file share hosting our HTML pages.

Setting up the NGINX server

As with the pypiserver, we wanted the documentation server to run on Kubernetes as well. We found a Helm chart for an NGINX server too, but unfortunately it didn't allow for custom mounted volumes, so we had to adapt the chart to allow for this. In the spirit of reusability, we took this part straight from the pypiserver Helm chart and copy-pasta'd that baby right into the NGINX deployment template.

Under volumeMounts we set the mountPath to the default NGINX content folder /usr/share/nginx/html:

{{- if .Values.persistence.enabled }}
- name: html-source
  mountPath: /usr/share/nginx/html
{{- end }}

Under volumes we specify the PVC we want to use:

{{- if .Values.persistence.enabled }}
- name: html-source
  persistentVolumeClaim:
    claimName: {{ if .Values.persistence.existingClaim }}{{ .Values.persistence.existingClaim }}{{- else }}{{ template "nginx.fullname" . }}{{- end }}
{{- end }}

As for the relevant values to add to the defaults, consider the following snippet from our values.yaml. To keep the documents available for internal users only, we limit the incoming source ranges to our network. You will also need to pass the URL of your server host, as well as add a custom serverBlock that points to our file share.

ingress:
  enabled: true
  certManager: false
  annotations:
    kubernetes.io/ingress.class: nginx
    certmanager.k8s.io/cluster-issuer: tls-cert-letsencrypt-prod # the certificate issuer of our cluster
    nginx.ingress.kubernetes.io/whitelist-source-range: # add ip ranges for internal use only
  hosts:
  - name: [url of your server host]
    path: /

  tls:
  - hosts:
      - [url of your server host]
    secretName: tls-secret

# Custom serverblock that listens to containerport 8080 (default) and autoindexes the contents
serverBlock: |-
  server {
    listen       8080;
    server_name  localhost;

    location / {
      root   /app;
      index  index.html index.htm;
      autoindex on;
    }
  }

persistence:
  enabled: true
  accessMode: ReadOnlyMany
  size: 1Gi

Building and pushing documentation using GitLab CI

Now that we have our server running, the final thing we need to add to our .gitlab-ci.yml file is a script to push files to the server! Since we're using Azure, we opted to use AzCopy.

Adding this to our yaml file, the final pipeline looks like this:

image: python:3.6-buster

before_script:
  - pip install pip --upgrade

variables:
  PACKAGE_NAME: "YourPackage"
  PYPI_REPOSITORY: SECURE
  PYPI_USERNAME: SECURE
  PYPI_PASSWORD: SECURE
  DOCS_SAS: SECURE
  DOCS_URL: SECURE

stages:
  - tests
  - deploy

test36:
  stage: tests
  script:
  - pip install pytest pytest-cov
  - pip install .
  - cd tests
  - pytest --cov $PACKAGE_NAME
  only:
  - merge_requests
  - master

deploy_pypi:
  stage: deploy
  script:
    - pip install twine setuptools wheel
    - python3 setup.py bdist_wheel #build the package
    - pip install .
    - echo "[distutils]" >> ~/.pypirc # create credential file
    - echo "index-servers =" >> ~/.pypirc
    - echo "  aabd" >> ~/.pypirc
    - echo "" >> ~/.pypirc
    - echo "[aabd]" >> ~/.pypirc
    - echo "repository:" $PYPI_REPOSITORY >> ~/.pypirc
    - echo "username:" $PYPI_USERNAME >> ~/.pypirc
    - echo "password:" "$PYPI_PASSWORD" >> ~/.pypirc
    - twine upload --repository serverIndexName dist/* # upload to server
    - echo "" > ~/.pypirc && rm ~/.pypirc
  only:
    - master

deploy_docs:
  stage: deploy # on which stage to execute this runner
  script:
    - pip install -r requirements.txt # install your package and it's requirements
    - pip install .
    - wget -O- https://aka.ms/downloadazcopy-v10-linux | tar -C. -xzf- --strip-components=1 # download azcopy
    - pip install -U sphinx sphinx-rtd-theme sphinxcontrib-napoleon # install sphinx and extensions
    - cd docs # change to docs folder
    - sphinx-build -b html source build # build html documentation using sphinx
    - ../azcopy copy "./build/*" "${DOCS_URL}${PACKAGE_NAME}/?${DOCS_SAS}" --recursive  # copy the built html to the package folder on the file share
  only:
    - master

⚠️ Highlighted choices

  • We only push packages and documentation on a commit to master.
  • Tests are only triggered on a merge request, not on a push to master.
  • To prevent the tests from being circumvented, we disabled direct pushes to master, so the tests always run on a merge request first.

How and when you want certain steps to trigger in your pipeline is completely up to you. This is only an example, so think about what works best for your case!

Too long; didn't yaml

Remember when I said we wanted this to be easy to use? The YAML file we have created up there is already getting pretty long, and this is just for a 'simple' pipeline where we run one test, do one type of deployment and only push documents once. If you want to run tests for different Python versions, or do some sort of version control on your documentation, the specification can very quickly get out of hand.

To clean this up a bit, and to keep things consistent throughout the team, we opted to create a set of bash scripts for all the standard steps, like running pytest and building and deploying code or documentation.
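
The scripts themselves aren't included in this post, but as a rough idea, a minimal sketch of something like test_with_coverage.sh could look as follows (assuming PACKAGE_NAME is available as a CI/CD variable, as in the pipelines above):

#!/usr/bin/env bash
# Minimal sketch of a shared test script; mirrors the test36 job from the earlier pipelines.
set -euo pipefail

pip install pytest pytest-cov
pip install .
cd tests
pytest --cov "$PACKAGE_NAME"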

Using GitLab's builtin environment variables and a separate repository to store the scripts, we were able to reduce the YAML file to this:

image: python:3.6-buster

before_script:
  - pip install pip --upgrade
  - git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@url/to/our/cicd.git  # get our standard cicd scripts

variables:
  PACKAGE_NAME: "YourPackage"
  PYPI_REPOSITORY: SECURE
  PYPI_USERNAME: SECURE
  PYPI_PASSWORD: SECURE
  DOCS_SAS: SECURE
  DOCS_URL: SECURE

stages:
  - tests
  - deploy

test36:
  stage: tests
  script:
    - ./cicd/scripts/test_with_coverage.sh  # Run pytest with coverage report
  only:
  - merge_requests
  - master

update_docs:
  stage: deploy
  script:
    - ./cicd/scripts/build_sphinx.sh  # Split up building and deployment into separate steps
    - ./cicd/scripts/update_python_docs.sh  # Update your documentation when you do a commit on master
  only:
    - master

deploy_docs:
  stage: deploy
  script:
    - ./cicd/scripts/build_sphinx.sh  # Running the same build script
    - ./cicd/scripts/deploy_docs.sh  # On a tag, deploy a version of your documentation to a subfolder
  only:
    - tags

deploy_pypi:
  stage: deploy
  script:
    - ./cicd/scripts/deploy_pypi.sh  # On a tag, deploy a new version of your package to the registry
  only:
    - tags

As you can see, we split the documentation deployment into a deploy_docs and an update_docs job. The difference is that update_docs updates the latest version (in the main folder), while deploy_docs creates a new folder with the tag as the version number. This not only allows us to deploy documentation automatically, but also keeps older versions alive!
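
The scripts live in our separate cicd repository and aren't shown here, but the core difference comes down to the azcopy destination. Roughly, assuming the same DOCS_URL, DOCS_SAS and PACKAGE_NAME variables plus GitLab's builtin CI_COMMIT_TAG (the exact paths are assumptions, not the real scripts):

# update_python_docs.sh: overwrite the docs in the package's main folder (the "latest" version)
./azcopy copy "./docs/build/*" "${DOCS_URL}${PACKAGE_NAME}/?${DOCS_SAS}" --recursive

# deploy_docs.sh: copy the same build into a subfolder named after the git tag
./azcopy copy "./docs/build/*" "${DOCS_URL}${PACKAGE_NAME}/${CI_COMMIT_TAG}/?${DOCS_SAS}" --recursive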

A new version folder on the server for our 'utilipy' package.
The versions page of the documentation is also updated.

Sidenote: if you really want to have at it, there is also this thing.

Part two!

So we did it! We escaped hell and entered the kingdom of CI/CD heaven. But why stop here when we're having so much fun? In part two I'll explain how we continued our quest for automation: using CookieCutter to create a repository template, and using git hooks for code formatting and quality testing.

Photo by Bruce Mars on Unsplash
This is obviously you, excited for the next installment.

So stay tuned!

About the author

Susanne Groothuis has been a Data Scientist at KPMG since 2017, with a focus on NLP and image processing.