18. Libraries

Python's built-in modules cover a lot of what you need, but many external libraries (also called packages or modules) provide Application Programming Interfaces (APIs) for tasks the standard library does not handle. We will look at a few of these external libraries.

digraph libraries { rankdir=LR; node [shape=box, style="rounded"]; pypi [label="PyPI packages"]; pip [label="pip install"]; pandas [label="pandas\nCSV and tables"]; numpy [label="numpy / scipy\nnumerical computing"]; sklearn [label="scikit-learn\nmachine learning"]; joblib [label="joblib\nparallel work"]; pypi -> pip; pip -> pandas; pip -> numpy; pip -> sklearn; pip -> joblib; }

18.1. pip

One popular way to install external libraries is through pip. pip is a command-line tool that installs Python packages from the Python Package Index (PyPI). To install a package, you usually only need its name, such as pandas.

pip install <package_name>

You can also install multiple packages in one line.

pip install <package_name_1> <package_name_2>

Note

pip also resolves transitive dependencies and installs them for you. Transitive dependencies are the packages that the package you are installing itself depends on in order to work.
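To see which dependencies an installed package declares, the standard library's importlib.metadata module can read them back. The following is only a minimal sketch: the helper name declared_dependencies is made up for this illustration, and the packages queried at the end are assumed to have been installed already (an empty list is printed for any that are not).

```python
from importlib.metadata import PackageNotFoundError, requires


def declared_dependencies(package_name):
    """Return the requirement strings a package declares.

    Hypothetical helper for illustration: returns an empty list when the
    package declares no dependencies or is not installed at all.
    """
    try:
        # requires() returns None when no dependencies are declared.
        return list(requires(package_name) or [])
    except PackageNotFoundError:
        return []


# Assumed package names; swap in whatever you have installed.
for name in ("pandas", "scikit-learn"):
    print(name, declared_dependencies(name))
```

Each entry is a requirement string that pip reads when deciding what else to install alongside the package you asked for.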

18.2. Pandas

pip install pandas

Pandas is a library for working with tabular data. Writing a CSV file with Pandas takes a single call to to_csv().

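import pandas as pd
import random

# 10 rows by 10 columns of random integers from 0 to 101
# (randint includes both endpoints).
data = [[random.randint(0, 101) for _ in range(10)] for _ in range(10)]

# Name the columns x0 through x9.
df = pd.DataFrame(data, columns=[f'x{i}' for i in range(10)])
print(df.shape)

# Write the column names as a header row, but leave out the row index.
df.to_csv('test.csv', header=True, index=False)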

Reading data from a CSV using Pandas is just as easy.

import pandas as pd

df = pd.read_csv('test.csv')

print(df.shape)
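As a rough illustration of the write-then-read round trip, the sketch below saves a small hand-written table, loads it back and computes one summary statistic per column. The column values and the file name demo.csv are assumptions made up for this example, and a temporary directory is used so that nothing is left behind on disk.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small hand-written table so the statistics are easy to check by eye.
df = pd.DataFrame({"x0": [1, 2, 3, 4], "x1": [10, 20, 30, 40]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "demo.csv"  # hypothetical file name
    df.to_csv(csv_path, index=False)       # write without the row index
    loaded = pd.read_csv(csv_path)         # read the same file back

# mean() on a DataFrame returns one value per column.
column_means = loaded.mean()

print(loaded.shape)
print(column_means)
```

Because the index was not written, the loaded table has the same two columns and four rows as the original.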

18.3. NumPy

pip install numpy scipy

NumPy is a library for numerical computing. SciPy builds on NumPy and is a general-purpose scientific computing library. To draw samples from a normal distribution with mean 0 and standard deviation 1, \(\mathcal{N}(0, 1)\), we can use the normal() function.

from numpy.random import normal

values = normal(0, 1, 100)
print(values)
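One way to check that the samples behave as expected is to compare their mean and standard deviation with the parameters that were requested. The sketch below uses NumPy's Generator interface (np.random.default_rng) instead of the module-level normal() function; the seed and the sample size are arbitrary choices for this illustration.

```python
import numpy as np

# A seeded generator makes the run reproducible; 37 is an arbitrary seed.
rng = np.random.default_rng(seed=37)

# loc is the mean and scale is the standard deviation of the distribution.
values = rng.normal(loc=0.0, scale=1.0, size=100_000)

sample_mean = values.mean()
sample_std = values.std()

print(f'mean = {sample_mean:.4f}, std = {sample_std:.4f}')
```

With this many samples, the printed mean should land close to 0 and the standard deviation close to 1, though never exactly on them.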

18.4. Scikit-Learn

pip install scikit-learn

Scikit-Learn is a machine learning library. We can use it to fit predictive models, generate synthetic data, transform data and so on.

from sklearn.datasets import make_regression

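# Generate a synthetic regression problem: 1000 samples, 50 features,
# of which only 10 actually influence the target.
X, y = make_regression(
    n_samples=1000,
    n_features=50,
    n_informative=10,
    n_targets=1,
    bias=5.3,
    random_state=37,
)

print(f'X shape = {X.shape}, y shape = {y.shape}')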
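To illustrate the "learn predictive models" part, the generated data can be handed to an estimator. The sketch below fits scikit-learn's LinearRegression to the same synthetic data. The choice of model and the 1e-6 threshold used to count non-negligible coefficients are assumptions made for this example, not something the chapter prescribes.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Same synthetic data as above; make_regression adds no noise by default.
X, y = make_regression(
    n_samples=1000,
    n_features=50,
    n_informative=10,
    n_targets=1,
    bias=5.3,
    random_state=37,
)

# Fit an ordinary least squares model to the generated data.
model = LinearRegression().fit(X, y)

# score() returns the coefficient of determination (R^2) on the given data.
r_squared = model.score(X, y)

# Count coefficients that are not numerically negligible.
n_nonzero = int((abs(model.coef_) > 1e-6).sum())

print(f'R^2 = {r_squared:.4f}')
print(f'intercept = {model.intercept_:.4f}')
print(f'non-negligible coefficients = {n_nonzero}')
```

Since the data is an exact linear function of the informative features, the fitted intercept should recover the bias passed to make_regression, and only the informative features should receive non-negligible coefficients.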

18.5. joblib

pip install joblib

Joblib is a library that makes multi-core processing easier in Python.

from math import sqrt
from joblib import Parallel, delayed

results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
print(results)
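The delayed() wrapper also accepts several arguments, and Parallel returns the results in the same order as the inputs. The sketch below is a small variation on the example above that uses the built-in pow function and compares the parallel result with a plain list comprehension; the variable names are made up for this illustration.

```python
from joblib import Parallel, delayed

# delayed(pow)(i, 2) records the call pow(i, 2) without running it yet;
# Parallel then spreads the recorded calls over two workers.
parallel_squares = Parallel(n_jobs=2)(delayed(pow)(i, 2) for i in range(10))

# The same computation done sequentially, for comparison.
sequential_squares = [pow(i, 2) for i in range(10)]

print(parallel_squares)
print(parallel_squares == sequential_squares)
```

For a task this small the parallel version is unlikely to be faster, because starting workers has a cost; the pattern pays off when each call does a substantial amount of work.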

18.6. Exercise

Pick one of the libraries in this chapter and build a tiny demo around it:

  • pandas: read a CSV and compute one summary statistic

  • numpy: generate random values and compute their mean

  • scikit-learn: create fake regression data and print the shapes

  • joblib: parallelize a simple math task

Draw a short conclusion about why that library is more convenient than writing everything yourself.

digraph library_exercise { rankdir=LR; node [shape=box, style="rounded"]; choose [label="choose one library"]; run [label="write small demo"]; inspect [label="print result"]; explain [label="explain why it helps"]; choose -> run -> inspect -> explain; }