Data … as usual

All things about data by Laurent Leturgez

Playing with Python and Machine Learning in the Cloud … for Free

If you are interested by the Python language and Machine learning programming (which is usually linked), you will probably think about configuring an environment on your laptop or instantiate an environment hosted by a Cloud provider, which is a bit expensive especially if you want doing some very basic tests.

If you want to try (and build) those very trendy neural networks, you will need a GPU to speed up your programs (and some related boring stuff like installing and configuring Cuda etc.). Same thing if you want to play with spark (and specifically with pyspark)

That can be a boring stuff to do. But do you know that it’s possible to quickly set up an environment in the cloud for free … yes for free. So let’s have a look to two solutions : Google colaboratory (named colab) and Kaggle.

But before we start, we need to know what a notebook is, because these platforms use python notebook as playground.

What is a notebook ?

A notebook is a file which embed code, markup language (HTML) and equations. Each notebook is divided by cells and each cell can be executed individually inside a kernel.

When the code is python, the file extension is usually ipynb. Please note, that notebooks can run other languages than python, each kernel run a specific language for example, Python or Scala.

If you want to know more about notebook, follow these links: or


Google Colaboratory (Colab notebooks)

Google colab is a free notebook environment hosted by Google. To access it, you only need a free google account. Once you created your notebook, you have the possibility to save it on a Google drive file (with ipynb extension) and, optionally, export it on github.

To create your first notebook, you have to go to and then click the “File” Menu and Select “new Python 3 notebook” (or New Python 2 Notebook, if you want to deal with python 2).

Colaboratory Notebook 1

This will create a new folder in your google drive home root directory named “Colab Notebooks” with the file you created in it.


Once in your notebook, you can add some cells and write your first python lines.

But, what you have to know is that you are in a remote environment with packages installed (by default you have many python packages already installed), and once instantiated, you can even modify your kernel by installing new softwares etc.

For example, let’s say … we want to set up a pyspark environment. We first need to install pyspark with pip and then run a bunch of pyspark code to test everything is ok.

The notebook is available at this URL (I saved it on github):, and the result is represented below:


Colaboratory Notebook 2

You can even load files from your local disk to your runtime, and then run code on it. In the example given below (and integrated in the notebook linked above), I used the google API to do that:

Colaboratory notebook 3

Of course, this environment is for testing purpose only, you don’t have a lot of power behind but it’s useful if you want to start learning Python, or test a bunch of script without any local infrastructure and … it’s free.


The second platform to start playing python is more machine learning oriented. Indeed kaggle is a platform for data scientists who are allowed to share and find some data sets, build model, enter in datascience challenges etc.

Accessing to kaggle is free, you just have to subscribe at and then log in.

Once logged into the system, you have to go to and click on “New Kernel” and select your preferred style, and you will have access to a notebook with default packages loaded (like numpy and pandas) :


Basically, kaggle is not very different from Google Colaboratory … but kaggle is interesting because you can enable a GPU for your notebook.

To do that, you can go to the “settings” area (in the bottom right corner) and set “GPU” to “ON”.


This will enable a GPU (ok it’s not a farm 😉 ) but this can help you to work on small workload and why not on a small neural network.

Below, a small bunch of code that use tensorflow and gives you the information about GPU enablement.


Please note that you can easily upload your datasets (or use datasets provided by kaggle) by clicking on the “+ Add Dataset” Button

That’s it for today 😉



Colors identification for images stored in the Cloud with Python

I recently worked on some Python code to detect which are the main colors in an image.

To do that, my images were stored in an Oracle Cloud Infrastructure block storage bucket.

The process had to be done in 3 steps:

  • I had first to extract them by using the “oci” python package.
  • Then I had to convert the unstructured binary image to a structured numpy array.
  • And finally, I used an unsupervised ML routine (KMeans Clustering) to analyze the numpy array and detect which were the main colors in this image.

Reading Images stored in an OCI block storage bucket

To read images, or more generally, files store in  an OCI block storage bucket. You need to have configured your client environment to access the OCI.

To do that, you will need various OCIDs (user, tenant), some keys (private and public). I will not develop this part because I already did it in a previous post … see here !

Once your configuration is ok, you have to load it into your python script, get an ObjectStorageClient object from the configuration, and request the namespace data of your ObjectStorageClient.

After that, it becomes easy to read an object (file) inside a bucket referenced inside the namespace.

This is done by the following code

compartment_id = config["tenancy"]
object_storage = oci.object_storage.ObjectStorageClient(config)
namespace = object_storage.get_namespace().data

my_object = object_storage.get_object(namespace,

print("type( = ",type(

As you can see, I printed the class type of the object content … and without any surprise, it’s a “bytes” class.

type( =  <class 'bytes'>

Note: If your images are stored by another cloud provider. They usually have a Python SDK in order to do the same things 😉

Converting an unstructured binary image to a numpy array

Once I did that, if I want to process my image I have to convert it in a usable data structure. And, with Python, the best data structure to process images is a numpy array, so I had to find a way to convert my binary soap (Bytes) to a structures numpy array.

As I don’t want to use a temporary file to do that stuff, I used a BytesIO object to process them directly in memory. At the end of the stream, I used a pillow Image (new name for the deprecated PIL package) from the BytesIO stream.

After that, a conversion to a numy array was possible. Please note that I had to convert a bit my numpy array structure. As you may know, an image file is represented in a multi-dimension array.

The first two dimensions represent the pixels of your Image. Added to that, you have 3rd dimension which encode for Red, Green and Blue values of each pixel. Sometimes a fourth value is added for what is called “Alpha” which is intended in transparency encoding. As I don’t know how were encoded Images, and as I don’t need to process the Alpha layer, I converted my 3 or 4 layers array into a 3 layers array (R,G and B encoding only).

The following code do the stuff:

from PIL import Image
from io import BytesIO

This will produce the result below:

img.shape= (640, 1280, 3)

So my image is represented by a numpy array (ndarray). my image width is 640 pixels, height is 1280 pixels and each pixel is encoded by 3 values for Red, Green and Blue.

Using a clustering ML algorithm to detect colors

Next step, but not least. We have to choose a method to detect colors in the image.

First, I thought about getting the “average” color, but doing this is not a good way, because in the case of your image is equally colored by yellow, blue, red, and green … your average color will be a crappy brown which is not realistic.

The best way to get colors is to run a unsupervised machine learning algorithm (K-Means) to group all your colors into clusters based on R, G and B values. No matter the ML framework you will use to execute the KMeans, after execute your program you will get, the center point of each cluster which represent the color associated with the cluster and the differents labels for your clusters. Then you will be able to count the number of occurence of your label, and you will get the number of points inside your cluster.

It becomes easy to count the number of points in each color, this is for the most important thing in this algorithm. The other key point is how to structure your data as input for your KMeans.

This is simply resolved by flattening your image representation (in the numpy array). The array is flatten to a one-dimension list of triplets (reprensenting your RGB values).

In the following code, I used opencv (cv2 package) which is often used for image detection and capturing. This package is delivered with a kmeans algorithm that is optimized for image processing.

import cv2

# pixels is the 1D array, results of the img flattening process (made by reshape function)
pixels = np.float32(img.reshape(-1, 3))
print("Pixel shape = ", pixels.shape)

# Here is the number of colors we are trying to detect.
n_colors = 5

# Opencv kmeans parameters (See the following URL for more information: 
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 200, .1)

# palette represents clusters centers
# Labels represents the cluster labels.
#   As we have 5 colors, labels are 0,1,2,3,4 
_, labels, palette = cv2.kmeans(pixels, n_colors, None, criteria, 10, flags)
# And counts represents the number of occurence for each label
_, counts = np.unique(labels, return_counts=True)

# Our dominant color is the color that have the maximum number of occurence in the "counts" array
dominant = palette[np.argmax(counts)]
print("dominant color (RVB) =",dominant)

If you prefer to use tensorflow, the code below will do the stuff

import tensorflow as tf
# this is for removing all the tensorflow INFO and WARN messages

# pixels is the 1D array, results of the img flattening process (made by reshape function)
pixels = np.float32(img.reshape(-1, 3))
print("Pixel shape = ", pixels.shape)

def input_fn():
    return tf.train.limit_epochs(tf.convert_to_tensor(pixels, dtype=tf.float32), num_epochs=1)

n_colors = 5

kmeans = tf.contrib.factorization.KMeansClustering(num_clusters=n_colors, 

num_iterations = 20
for _ in range(num_iterations):
    print('Training ... score:', kmeans.score(input_fn))
    cluster_centers = kmeans.cluster_centers()

cluster_indices = list(kmeans.predict_cluster_index(input_fn))
counts=np.unique(cluster_indices, return_counts=True)[1]

dominant = palette[np.argmax(counts)]
print("dominant =",dominant)

Now we have our results, we are able to produce a nice plot with:

  • the initial picture,
  • the dominant colors gradient,
  • the main dominant color
  • the second dominant color (I did that because In the code I worked on, many pictures had a white background which was detected and the main color in 99% of the cases)

And to do that, I used the matplotlib library:

import matplotlib as mpl
%matplotlib notebook
from matplotlib import pyplot as plt

indices = np.argsort(counts)[::-1]  
freqs = np.cumsum(np.hstack([[0], counts[indices]/counts.sum()]))
rows = np.int_(img.shape[0]*freqs)

dom_patch = np.zeros(shape=img.shape, dtype=np.uint8)
main_patch=np.ones(shape=img.shape, dtype=np.uint8)*np.uint8(palette[indices[0]])
second_patch=np.ones(shape=img.shape, dtype=np.uint8)*np.uint8(palette[indices[1]])

for i in range(len(rows) - 1):
    dom_patch[rows[i]:rows[i + 1], :, :] += np.uint8(palette[indices[i]])

fig, (ax0, ax1, ax2, ax3 ) = plt.subplots(1, 4 , figsize=(9,6))

ax1.set_title('Dominant colors')

ax2.set_title('Main color')

ax3.set_title('Second color')

Please note that, this code was running inside a jupyter notebook … so adapt the code if you want to run it in another context.

This will produce that kind of result :



Dealing with Oracle Cloud Infrastructure and Python

Oracle provides various SDK to create resources in the OCI.

Recently, I played with the Python SDK for OCI. In this blog post, I will show you the basics to create a simple bucket in the Object Storage part of OCI, and simply put a file on this Bucket.

OCI Client configuration

First, you will need to install the python OCI package. The best for that is to create a python virtual environment, activate it, and install all the packages you need inside.

mbp:python_venv $ python -m virtualenv oci
Using base prefix '/Users/leturgezl/miniconda3/envs/general'
New python executable in /Users/leturgezl/python_venv/oci/bin/python
Installing setuptools, pip, wheel...

mbp:python_venv $ source oci/bin/activate

(oci) mbp:python_venv leturgezl$ pip install oci numpy pandas

Now that packages are installed, we have to configure the client to access OCI.

To do that, we need many things :

  • User OCID: this can be found in the User’s Page in OCI
  • Tenancy OCI: this can be found in the Tenancy’s page in OCI
  • Your OCI region
  • A private key file, its public key, and the related fingerprint.

The keys have been generated like this (I used a key without passphrase)

# Private key generation

$ mkdir ~/.oci
$ openssl genrsa -out ~/.oci/oci_api_key.pem 2048
$ chmod go-rwx ~/.oci/oci_api_key.pem

# Public key generation
$ openssl rsa -pubout -in ~/.oci/oci_api_key.pem -out ~/.oci/oci_api_key_public.pem

# Fingerprint generation
$ openssl rsa -pubout -outform DER -in ~/.oci/oci_api_key.pem | openssl md5 -c

Once you did that, you will have to configure you user in OCI to add it the public key, the fingerprint given in the interface must match with the previous command:

OCI User 1


It’s important to keep your private key “private” (don’t send it to another people, or leave it without any protection on your laptop).

Now, your local environment is configured, we will need a dictionary structure in our python script to use the SDK.

This dictionary can be build manually and embedded in the code, then you will have to fill the required fields (Key file location is the private key location):

config = {
    "user": "ocid1.user.oc1..aaaaaaaamcel7xygkvhe....aaaaaaaaaaaaaaaaaaaaa" ,
    "key_file": "~/.oci/oci_api_key.pem",
    "fingerprint": "35:aa:aa:aa:aa:aa:aa:aa:aa:aa:aa:aa:aa:aa:aa:aa",
    "tenancy": "ocid1.tenancy.oc1..aaaaaaaahgagkf7xygkvhe....aaaaaaaaaaaaaaaaaaaaa",
    "region": "eu-frankfurt-1"

Or, you can configure a local “config” file in your ~/.oci/ directory and then load it in the code with the given python code below:

$ cat ~/.oci/config

Note : you can embed more than one user profile in this file. The only one required is the DEFAULT profile.

>>> import oci
>>> import pandas as pd
>>> config=oci.config.from_file()
>>> df=pd.DataFrame.from_dict(config, orient='index')
>>> df
log_requests                                                       False
pass_phrase                                                         None
user                             ocid1.user.oc1..aaaaaaaamcel7xygkvhe...
fingerprint              35:aa:aa:aa:aa:aa:aa:aa:aa:aa:aa:aa:aa:aa:aa:aa
key_file                                          ~/.oci/oci_api_key.pem
tenancy                     ocid1.tenancy.oc1..aaaaaaaahgagkf7xygkvhe...
region                                                    eu-frankfurt-1

You can read the oci config file and select another profile by using this:

config = oci.config.from_file(profile_name="laurent")

Or use another file by using this parameter

config = oci.config.from_file(file_location="~/OCI_config.uat")

You can see there are more parameters in the dictionary, you can find the details by reading this:

Creating an Object Storage bucket in OCI

Now our client is well configured to access OCI through a user and his keys.

it’s really easy to create a bucket. We have to request an ObjectStorageClient object and use it to create the bucket:

import oci
from oci.object_storage.models import CreateBucketDetails

compartment_id = config["tenancy"]
object_storage = oci.object_storage.ObjectStorageClient(config)

namespace = object_storage.get_namespace().data
bkt_name = "python-bucket"
object_name = "python_file"

print("Creating a new bucket {!r} in compartment {!r}".format(bkt_name, compartment_id))
request = CreateBucketDetails()
request.compartment_id = compartment_id = bkt_name
bucket = object_storage.create_bucket(namespace, request)

This will produce that kind of output:

Creating a new bucket 'python-bucket' in compartment 'ocid1.tenancy.oc1..aaaaaaaahgagkf7xygkvhe...'

And in the OCI web interface, our bucket appeared :

OCI Bucket 1

Put a file into the Object storage bucket

Now we have a bucket created in our compartment, it’s easy to put a file on it (I’ll put a binary file which is a PNG file).

To do that, the below code will be enough (considering variables have been initiated by previous code parts … see above)

with open("images/myimage.png", mode='rb') as file:
my_data =

obj = object_storage.put_object(

In the OCI console, inside the previously created bucket, the file has been created and is available:

OCI Bucket 2

As you can see, deploying resources on the OCI is easy and you can deploy your full infrastructure with a bunch of code.

Next investigations will be made soon, specially to deploy virtual machines, storage and databases of course.

That’s it for today 🙂

Calling a PLSQL procedure or function from a Python script

There’s a couple of days I didn’t write here, and specially in the Python for Oracle series.

In this blog post, I will write a bit on how calling a plsql procedure or function from a Python script using the cx_Oracle package.

There’s nothing very difficult to do that, you just have to consider 3 cases :

  • Calling a stored procedure with IN arguments
  • Calling a stored procedure with IN and OUT arguments
  • Calling a function and getting the result.

For all these cases, there are two common operations to do:

  1. Connect to Oracle database
  2. Create a cursor from this connection

Once you get the cursor … it depends from what you want to do:

Calling a stored procedure with IN arguments

In the case you have a simple call to do with only IN arguments, you have to call the callproc function from the cx_Oracle cursor object, and give an array of parameters … and that’s it.

In real, and in the case you call a stored procedure which have this declaration :

create or replace procedure t(a in number, b in varchar2)


it gives something like that:


    curs.callproc("T", [2,"parameter1"])
except cx_Oracle.DatabaseError as ex:
    err, =ex.args
    print("Error code    = ",err.code)
    print("Error Message = ",err.message)

connectToOracle is a function I coded previously and which is not detailed here, but its code is the one below:

def connectToOracle(url, username, password, mode=None):
    if mode is not None:
        connection = cx.Connection (user=username, password=password, dsn=url, mode=mode)
        connection = cx.Connection (user=username, password=password, dsn=url)
    return connection

Calling a stored procedure with IN and OUT parameters

In that case, it’s not really different from the previous example because you will give the same array as procedure parameter. The only difference is one (or many parameters) will be modified by the procedure execution.

But let’s see an example.

Consider this procedure definition (very simple ok 😉 )

create or replace procedure t(a in number, b out number) as
  b := a*a;

Calling this procedure in a python script will look like this


    curs.callproc("T", [10,r])
    print("Result = ",r.getvalue(pos=0))

except cx_Oracle.DatabaseError as ex:
    err, =ex.args
    print("Error code    = ",err.code)
    print("Error Message = ",err.message)

You have remark that, you have to create a variable with the same cx_Oracle internal data type as you plsql out parameter, and then convert it into the python type with getvalue function. (Every function related to cx_Oracle variables are documented here:

Calling a function and get the result into a python script

The principle is basically the same, once you get a connection, you create a cursor from it, and then you will call the “callfunc” function with these parameters:

And this call will return the result of this function.

For example, if we consider a PLSQL function declared like this, and that return the salary of an employee:

CREATE OR REPLACE EDITIONABLE FUNCTION "LAURENT"."GET_SAL" (fname in varchar2, lname in varchar2) return number
 sal number;
 l_ln varchar2(128) := lower(lname);
 l_fn varchar2(128) := lower(fname);
 select salary into sal from emp where lower(first_name) like '%'||l_fn||'%' and lower(last_name) like '%'||l_ln||'%';
 return sal;
   when NO_DATA_FOUND then raise_application_error(-20000,'No employee with this last_name and first_name');

It will be called like this in a python script:


    janette_king_sal = curs.callfunc("GET_SAL", cx_Oracle.NUMBER, ["Janette", "King"]);
    print("janette king sal", janette_king_sal)
except cx_Oracle.DatabaseError as ex:
    err, =ex.args
    print("Error code    = ",err.code)
    print("Error Message = ",err.message)


As you can see, dealing with PLSQL and Python is not very difficult, and let you keeping the code close to your data.

That’s it for today 🙂

Monitor your Oracle Linux startup time

Linux redhat’s based systems (like Oracle Linux) come with systemd init system that is used to bootstrap the user space and to manage system processes after booting.

Systemd has been created to reduce boot time (and shutdown time) of your server, but sometime, you can have a process that slows down the boot process and, in order to diagnose it, you can use a very interesting command to do that stuff.

The command is systemd-analyze.

With no argument, it will help you to know in how many time your server starts-up and how many times it spent on each phase (kernel, initrd, and user space) :

[root@oel7 ~]# systemd-analyze
Startup finished in 1.942s (kernel) + 6.141s (initrd) + 1min 44.688s (userspace) = 1min 52.772s


Another interesting option is the “blame” option that will give you the list of running systemd units ordered by time to initialize (remark: I used the no-pager switch to disable results paging)

[root@oel7 ~]# systemd-analyze blame --no-pager
    1min 15.289s kdump.service
         50.119s plymouth-quit-wait.service
         37.486s vboxadd.service
          8.134s accounts-daemon.service
          6.110s tuned.service
          5.179s systemd-logind.service
          5.092s libvirtd.service
          5.026s NetworkManager-wait-online.service
          4.873s systemd-udev-settle.service
          4.870s lvm2-monitor.service
          4.863s ModemManager.service
          4.749s ksm.service
          4.742s rhel-dmesg.service
          4.316s postfix.service
          4.052s NetworkManager.service
          3.924s dev-mapper-ol\x2droot_lv.device
          3.678s abrt-ccpp.service
          3.200s rtkit-daemon.service
          3.087s polkit.service
          2.937s gssproxy.service
          1.875s lvm2-pvscan@8:2.service
          1.508s rsyslog.service
          1.505s network.service
          1.425s packagekit.service
          1.420s sysstat.service
          1.401s sshd.service
          1.374s systemd-tmpfiles-setup-dev.service
          1.214s avahi-daemon.service
          1.109s netcf-transaction.service
          1.063s proc-fs-nfsd.mount
          1.014s chronyd.service
           932ms rhel-readonly.service
           913ms rhnsd.service
           877ms systemd-udevd.service
           842ms dev-mapper-ol\x2dswap_lv.swap
           800ms auditd.service
           778ms plymouth-read-write.service
           750ms u01.mount
           746ms systemd-tmpfiles-setup.service
           709ms systemd-vconsole-setup.service
           692ms systemd-fsck@dev-mapper-ol\x2dorasoft_lv.service
           638ms systemd-fsck@dev-mapper-ol\x2doradata_lv.service
           632ms systemd-journald.service
           631ms u02.mount

With this command, you can display aa tree of every steps in the init critical chain and how many time it took:

[root@oel7 ~]# systemd-analyze critical-chain
The time after the unit is active or started is printed after the "@" character.
The time the unit takes to start is printed after the "+" character. @1min 44.678s
└─ @1min 44.678s
  └─vboxadd-service.service @50.648s +372ms
    └─vboxadd.service @13.158s +37.486s
      └─ @12.969s
        └─ @12.969s
          └─cups.path @12.968s
            └─ @12.509s
              └─systemd-update-utmp.service @12.403s +105ms
                └─auditd.service @11.602s +800ms
                  └─systemd-tmpfiles-setup.service @10.719s +746ms
                    └─rhel-import-state.service @10.120s +590ms
                      └─ @10.118s
                        └─run-user-0.mount @1min 8.832s
                          └─ @9.141s
                            └─lvm2-monitor.service @4.269s +4.870s
                              └─lvm2-lvmetad.service @5.220s
                                └─lvm2-lvmetad.socket @4.267s

The last but not least option offers the possibility to generate a SVG diagram (readable in a web browser) to visualize all the details of the bootstrap and daemons initialisation.

To do that you have to use the plot option:

[root@oel7 ~]# systemd-analyze plot > t.html

Then, you will get a nice plot in your browser to easily identify which process took the major part of the boot time:

For information, a legend is displayed on the page bottom.



That’s it for today 😉