Tips and tricks from personal experience, followed by a list of study materials

Hi! You must be interested in passing the Google Cloud Professional Data Engineer Exam. Google recommends that you have 3+ years of experience before attempting the exam. However, I think that if you have some experience with other cloud providers, databases and SQL, you can still do it, mainly because GCP is much more intuitive than its competitors (in my humble opinion).

Unlike other certifications, there isn’t a regimented coursebook or training manual. That is because Google expects you to be a practitioner and know most things…


Anomalies and outliers and how to find them.

When I started to write this article, the first thing I did was look for formal definitions of anomaly and outlier. It turns out that there isn’t a consensus on the matter. Every field has (annoyingly) a slightly different opinion:

  • Statisticians, for the most part, will use the two terms interchangeably. [ref]
  • Climatologists would say that a (temperature) anomaly is the difference between a value and the mean [ref]
  • In manufacturing, anomalies are defects (I have not seen anyone use the term outlier)
  • In banking and insurance anomalies are synonymous with fraud…
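As a rough illustration of the climatologists' definition in the list above, an anomaly can be computed as each value minus the mean of the series. The readings are made up for the example, and `anomalies` is a hypothetical helper name, not something taken from the referenced sources.

```python
from statistics import mean


def anomalies(values):
    """Return each value's deviation from the mean of the series."""
    baseline = mean(values)
    return [v - baseline for v in values]


# Made-up temperature readings, for illustration only
readings = [14.0, 15.0, 13.0, 18.0]
deviations = anomalies(readings)
```

Under this definition every reading has an anomaly value; whether a large one also counts as an outlier depends on which of the fields above you ask.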


How to get rid of Jupyter Notebook and use Jupyter Lab instead.

The Problem

We at DataSparQ love Kubeflow. But we don’t like the default Notebook options that are on the menu:

“You can have any colour you want, as long as it’s black”

They all come with:

  • Jupyter Notebook: which does not have dark mode or the ability to paste in the terminal
  • Lack of (working) sudo, which becomes apparent only after you realise apt-get is the only way to install packages like ssh-keygen
  • Default packages are fine if you need Tensorflow. Besides you can always install from your requirements…


How to return meaningful responses when something goes wrong.

TLDR:

  • Use custom Exceptions in your Flask app
  • Capture all of those and return them to the client in a uniform format in the response body
  • Have an SST (single source of truth) for your HTTP response codes
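The three points above could be sketched roughly as follows. This is a framework-agnostic illustration, not the article's actual code: the class names and `to_response` are hypothetical, and the standard library's `http.HTTPStatus` is assumed to play the role of the single source of truth for response codes.

```python
from http import HTTPStatus


class ApiError(Exception):
    """Base class for errors that should be reported to the client."""

    status = HTTPStatus.INTERNAL_SERVER_ERROR

    def __init__(self, message=None):
        # Fall back to the standard phrase ("Not Found", ...) if no message is given
        self.message = message or self.status.phrase
        super().__init__(self.message)


class NotFoundError(ApiError):
    status = HTTPStatus.NOT_FOUND


class ValidationError(ApiError):
    status = HTTPStatus.BAD_REQUEST


def to_response(error):
    """Build the uniform (body, status_code) pair for any ApiError."""
    body = {
        "error": type(error).__name__,
        "status": error.status.value,
        "message": error.message,
    }
    return body, error.status.value
```

In a Flask app, a function like this would typically be registered with `@app.errorhandler(ApiError)` and return `jsonify(body), code`, so that every custom exception reaches the client in the same shape.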

The problem

Has something like this ever happened to you?


3 simple steps to build a simple CI/CD using your existing Kubernetes cluster

TLDR:

  • We run our tests as Kubernetes jobs so that they are as close to production as possible
  • A Python script running in Cloud Build monitors the job until completion. If the job fails, so does the rest of the build
  • We used a custom Docker container that had all the required packages and could automatically authenticate with most GCP resources
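The monitoring step in the list above might look something like the polling loop below. It is a minimal sketch, not the script from the article: `get_status` is a hypothetical callable that would wrap whatever Kubernetes client call reports the job state, and the status strings are assumptions made for the example.

```python
import time


def wait_for_job(get_status, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll get_status() until the job succeeds, fails, or the timeout passes.

    get_status is a hypothetical callable returning "running",
    "succeeded" or "failed". Returns True on success and False on
    failure; raises TimeoutError if the job never finishes.
    """
    waited = 0.0
    while waited <= timeout_seconds:
        status = get_status()
        if status == "succeeded":
            return True
        if status == "failed":
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError("job did not finish in time")
```

A build step could then end with `sys.exit(0 if wait_for_job(...) else 1)`, since a non-zero exit code is what makes the remaining build steps stop.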

A diagram and the source code are at the bottom of the page.

Background

Houston is our home-grown serverless solution for workflow management. It has a couple of…


… when you want Helm without Helm

TLDR:
We wanted parameterised Kubernetes deployments, but Helm was too complicated to integrate with our CI/CD. So we solved the problem with Jinja templates and a Python script running in Cloud Build.

Code is available as a GitHub gist: here

The Problem

We have a product called Houston that simplifies workflows. It has an API, a web interface, and a couple of other components that all live in Kubernetes. We wanted to have a continuous integration/deployment where testing, staging and production were all running on the same cluster but in different namespaces. …
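To give a feel for the idea of one parameterised manifest rendered once per namespace, here is a small sketch. It deliberately uses Python's built-in `string.Template` as a stand-in for Jinja so that it runs without extra packages; the manifest, the `render` helper and the `example-api` name are all hypothetical and are not taken from the gist.

```python
from string import Template

# Hypothetical, heavily trimmed manifest; a real one would have more fields
MANIFEST = Template("""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $app
  namespace: $namespace
spec:
  replicas: $replicas
""")


def render(namespace, app="example-api", replicas=1):
    """Fill in the manifest for one namespace."""
    return MANIFEST.substitute(app=app, namespace=namespace, replicas=replicas)


# One rendered manifest per environment, all destined for the same cluster
rendered = {ns: render(ns) for ns in ("testing", "staging", "production")}
```

With Jinja the placeholders would be written as `{{ namespace }}` and the template could also contain loops and conditionals, which is where it starts to cover more of what Helm charts are used for.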


The Majestic Million dataset, TLDs and punycode

If you have ever looked at enterprise network traffic, you quickly find out that the majority of the traffic goes to a handful of domains like Google, Apple, Facebook, Instagram and of course Netflix (makes you wonder how much work exactly gets done there).

And if your job is to find anomalous traffic/connections, it is very useful to get rid of those big and noisy domains by having some kind of whitelist. I am not going to insult the reader’s intelligence by listing unrealistic and broken solutions. The way I went about it is to use domain rankings. I have…
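A ranking-based whitelist could be sketched along these lines. The ranked list and the connections are made up for the example, the helper names are hypothetical, and punycode handling relies on Python's built-in `idna` codec; none of this is the article's actual implementation.

```python
def normalise(domain):
    """Lower-case a domain and convert Unicode labels to punycode (ASCII)."""
    return domain.strip().lower().rstrip(".").encode("idna").decode("ascii")


def is_whitelisted(host, whitelist):
    """True if host equals a whitelisted domain or is a subdomain of one."""
    labels = normalise(host).split(".")
    return any(".".join(labels[i:]) in whitelist for i in range(len(labels)))


# Made-up ranking, highest-ranked first; a real one would come from a ranking dataset
ranked = ["google.com", "netflix.com", "münchen.de", "example.org"]
whitelist = {normalise(d) for d in ranked[:3]}  # keep only the top 3

connections = ["mail.google.com", "WWW.Netflix.com", "www.münchen.de", "suspicious.example"]
unlisted = [c for c in connections if not is_whitelisted(c, whitelist)]
```

Normalising both sides to punycode means internationalised domains compare equal whichever form they arrive in. Note that a bare TLD in the whitelist would match everything under it, so the ranking should contain registered domains only.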


I recently decided to give Jupyter Lab a try as an alternative to RStudio. I had a couple of basic requirements:

  • Run it in a container
  • Have an NGINX reverse proxy that provides both SSL and client-side certificates authentication
  • Deploy with Terraform

What seemed a trivial task quickly turned into more than a day of frustration. I managed to get the Jupyter container up and running behind NGINX, but I could not get any code running in the notebooks: I could not connect to the Python kernel.

After several hours of chasing the wrong questions…


Image source: http://horizonlightinginc.com

This is part 2 of a series of articles that explore the prospective salaries and living expenses across several countries and cities. If you want to read part 1:

In the previous part, we talked about salaries in the driven industry for California, the UK and Germany. I also touched on the taxes in those locations and came up with a net annual salary estimate. Next I would like to talk about expenses in those places.

Basic Living Expenses

This is where it gets really messy and complicated. I have given up on the idea of achieving high validity, but that doesn’t have to…


Image Source: www.londonphotofestival.org

A couple of months ago I went on a journey to find myself one of those elusive creatures called jobs. Those of you who have been in my shoes (or still are) are painfully aware of how time-consuming this process can be. It can often feel overwhelming and futile, and if you lack the emotional capacity, it can get the best of you.

For reasons I will not dwell upon in this article, I decided to take the challenge up a notch or two: I would apply to multiple countries simultaneously without ever having…

Ivan N.

When the machines take over, I will be on the winning side 🤖
