I attended EuroPython 2019. Here are some of my takeaways and notes.
docker is widely used for reproducible environments, especially for training sessions and for deploying models.
docker images would be deployed as flask apps with a REST API.
- REST APIs can be managed using
docker images would be managed using
continuous delivery was advertised a lot
VS Code seems to be the most popular IDE; very good demo at the MS booth. Beats pycharm in my opinion.
scikit-learn 0.21 release has great new features: ColumnTransformer replaces sklearn-pandas, and histogram gradient boosting is faster and more lightweight than regular gradient boosting.
dirty_cat has great encoders for dirty data. SimilarityEncoder and
The new scikit-learn column transformer has a very simple API that makes it easy to transform your pandas df into a numpy matrix. However, the method get_feature_names() is not working for most transformers (github issue), so it is not really possible to inspect the transformations inside the matrix in a systematic way.
sklearn-pandas does not have that problem, however its API is a bit more complicated.
swagger to build YAML-configurable REST APIs.
swagger provides documentation and a user interface based on YAML.
- pytest can run tests from other testing suites; in order to incorporate examples from docstrings, doctest can be used as
- pytest goes by "no API is a good API"
- markers can be used to organise tests
fixtures can be used to pass data to tests
hypothesis can be used to generate random test strings. If a failure is detected, it will give a minimal reproducible example.
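To illustrate the idea of turning docstring examples into tests, here is a minimal sketch that uses only the standard library doctest module. The function `mean` and the helper `run_docstring_examples` are hypothetical names made up for this example; they are not from the talk.

```python
import doctest


def mean(values):
    """Return the arithmetic mean of a non-empty list.

    >>> mean([1, 2, 3])
    2.0
    >>> mean([10])
    10.0
    """
    return sum(values) / len(values)


def run_docstring_examples(func):
    """Run the examples in func's docstring; return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    failed = attempted = 0
    # module=False: do not try to locate the defining module, use our globs.
    tests = finder.find(
        func, name=func.__name__, module=False, globs={func.__name__: func}
    )
    for test in tests:
        result = runner.run(test)
        failed += result.failed
        attempted += result.attempted
    return failed, attempted


failed, attempted = run_docstring_examples(mean)
print(f"{attempted} examples run, {failed} failed")
```

pytest can collect the same kind of docstring examples through its `--doctest-modules` option, so they run alongside the regular test suite.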
- intel published a tensorflow configuration that is 2x to 4x faster than out-of-the-box
- rf models are larger than gb trees
- histogram gradient boosted trees are implemented, low memory cost, faster training
- early stopping: stops training once performance has reached a plateau, for example when adding more trees no longer helps.
- column transformer has been introduced, game changer, better alternative to
- KBinsDiscretizer preprocessor bins continuous features to generate new features, which lets linear models learn nonlinear decision boundaries and improves their performance.
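The binning idea behind the discretizer can be sketched in plain Python. This is only an illustration of equal-width binning followed by one-hot encoding, not scikit-learn's implementation; the function names and the sample data are hypothetical.

```python
import bisect


def interior_bin_edges(values, n_bins):
    """Return the n_bins - 1 interior edges of equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [low + i * width for i in range(1, n_bins)]


def one_hot_bin(value, edges):
    """Encode value as a one-hot vector marking the bin it falls into."""
    index = bisect.bisect_right(edges, value)
    row = [0] * (len(edges) + 1)
    row[index] = 1
    return row


feature = [0.0, 1.5, 3.0, 4.5, 6.0, 9.0]
edges = interior_bin_edges(feature, n_bins=3)
encoded = [one_hot_bin(v, edges) for v in feature]
for value, row in zip(feature, encoded):
    print(value, row)
```

A linear model fitted on the one-hot columns gets one coefficient per bin, so it can represent a step function of the original feature instead of a single straight line.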
lazy loads large data from disk to RAM
creates virtual columns from column calculations
- has great histogram plotting features
- supports hdfs and Apache Arrow
- user defined functions can be translated to C using numba jit
- window functions like aggregate in pandas
- comes with a lot of wrappers for python ml libraries; supports scikit-learn but does not support all scalers
- pca on data points on a map with x-y coordinates can be used to rotate points that form a tilted density grid into a horizontally aligned grid
- all operations done on one df can be saved as json and applied to a new dataframe; a replacement for pipelines
Make Docker Images Safe
- Large docker images have many exposed vulnerabilities
- security tools like clair can be used to scan them
- shells can be attached to the docker image
- use distroless images (reduced images)
- distroless images get rid of everything that is not needed, for example the ls command from the shell.
pyinstaller can be used to reduce dependencies, but it does not detect all dependencies, so the missing ones need to be added manually.
Do: don't run as root; use the image hash instead of the image name and tag (the hash identifies an exact image version, similar to git commit hashes); build your own distroless images; sign docker images.
- use euclidean distance of ratings to calculate similarity score
- calculate weighted average (by similarity score) as prediction
- recommend if predicted recommendation is higher than average rating
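The three steps above can be sketched in plain Python. Several details are assumptions made for this example and were not stated in the notes: the distance is turned into a score with 1 / (1 + distance), "average rating" is read as the user's own mean rating, and the names and ratings are invented.

```python
import math


def similarity(ratings_a, ratings_b):
    """Score between 0 and 1 from the euclidean distance over shared items."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    distance = math.sqrt(sum((ratings_a[i] - ratings_b[i]) ** 2 for i in shared))
    # Assumption: 1 / (1 + d) is one common way to map a distance to a score.
    return 1.0 / (1.0 + distance)


def predict_rating(user, item, ratings):
    """Similarity-weighted average of other users' ratings for item."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        weight = similarity(ratings[user], other_ratings)
        weighted_sum += weight * other_ratings[item]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


def should_recommend(user, item, ratings):
    """Recommend if the prediction beats the user's own mean rating."""
    prediction = predict_rating(user, item, ratings)
    own = ratings[user].values()
    return prediction is not None and prediction > sum(own) / len(own)


ratings = {
    "ann": {"a": 5, "b": 3},
    "bob": {"a": 5, "b": 3, "c": 5},
    "cat": {"a": 1, "b": 5, "c": 2},
}
prediction = predict_rating("ann", "c", ratings)
print(prediction, should_recommend("ann", "c", ratings))
```

Because bob rates exactly like ann on the shared items, his rating for item c dominates the weighted average, while cat's differing taste contributes only a small weight.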
Dev Test workflow
pip install -e . installs the package from the current working directory during package development. Changes to files in the repo are picked up immediately, like devtools::load_all() in R. It does not install directly into the Python distribution.
Put .py files into a src/ directory instead of a packagename directory. This forces your tests to run on the pip install -e . version of your code.
tox is somewhat like rcmdcheck; different python versions can be specified.
check-manifest checks whether all files are included in the tarball.
sphinx: pythonic solution for python documentation; generates an api reference from docstrings.
mkdocs: language-agnostic markdown documentation for projects and libraries.
pydocmd claims to generate an api reference from docstrings. sphinx might be more advanced on this; best to check the docstring layout recommended for the tool.
pydoc can also be used to make documentation from docstrings. It is in the standard library.
add readme.md to setuptools so it properly shows up on python
from setuptools import setup

# Read the README so it can be used as the long package description.
with open("README.md", "r") as fh:
    long_description = fh.read()

setup(
    …
    long_description=long_description,
    long_description_content_type="text/markdown",
    ...
)
cookiecutter has python package templates
Dirty categorical features
- manually break up into two or more separate features, for example first name, last name
- manually group categories
- Similarity Encoding: encode by similarity distance to each category; new strings can be fitted on old categories
- Jaro-Winkler, Levenshtein, 3-gram similarity scores
dirty_cat has a similarity encoder
from dirty_cat import SimilarityEncoder
- TargetEncoder: encode a categorical feature as the mean/median of another value, for example police officer rank by salary
- Latent Category Encoder, builds new categories based on substring similarities
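To make the similarity encoding idea concrete, here is a small plain-Python sketch based on character 3-grams. It uses the Jaccard overlap of 3-gram sets as the similarity score, which is a simplification chosen for this example and not necessarily the formula dirty_cat uses; the function names and category strings are hypothetical.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of text, padded with spaces."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the n-gram sets of a and b, between 0 and 1."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(value, categories):
    """Encode value as its similarity to each known category."""
    return [ngram_similarity(value, category) for category in categories]


categories = ["police officer", "police sergeant", "teacher"]
# A misspelled, previously unseen string still lands close to its category.
encoded = similarity_encode("police oficer", categories)
print(encoded)
```

Unlike one-hot encoding, a typo does not create a new, unrelated column; the misspelled value simply gets a high score on the category it resembles.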
- classical data generation assumption: data generation is complete and missing entries are random.
- NA values are seldom random, and sometimes are the result of the data model, like age of spouse will be NA for people that are single
- mean imputation distorts the distribution, which is concerning for statistical models but not for algorithmic models
- when imputing age of spouse, a missingness indicator could be used to flag single people
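A short sketch of mean imputation combined with a missingness indicator, using only the standard library. The helper name and the sample ages are invented for illustration.

```python
from statistics import mean, pvariance


def impute_mean_with_indicator(values):
    """Fill None with the observed mean and return a 0/1 missingness flag."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [int(v is None) for v in values]
    return imputed, was_missing


# None stands for "no spouse", so the value is missing by design.
spouse_age = [30, None, 50, None, 40]
imputed, was_missing = impute_mean_with_indicator(spouse_age)

observed_variance = pvariance([v for v in spouse_age if v is not None])
imputed_variance = pvariance(imputed)
print(imputed, was_missing)
print(observed_variance, imputed_variance)
```

The imputed column has a smaller variance than the observed values, which is the distortion mentioned above, while the indicator column keeps the information about who had no value in the first place.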
From Script to Open Source
docopt helps you build GNU compliant CLI-tools
- code guides, only 2 parameters per function.
python setup.py develop is the same as pip install -e .
- setup.py lets you define entry points (for package plug-ins) and the CLI callable name
- requirements.txt file can augment setup.py dependencies, stating tested dependency versions
black reformats code to be pythonic
pre-commit runs formatters such as black before git commit
flake8 to check your code
tox.ini configuration file for
black code standard and
- use static type analysis; mypy checks whether a function has been called with the wrong type
tox manages all those tools, including testing tools, similar to rcmdcheck I guess
travisCI: pip installing tox is enough to run all tests
- requirements updater is a bot that will continuously check versions of dependencies
pytest-cov will check test coverage
- automated code review
- automated pull request merge
twine to upload to pypi
State of Production ML in 2019
GITOPS STRATEGIES FOR ML
CI/CD via github, using docker and kubernetes
- data assessment
- model assessment (feature importance, shap-values, pdp-plots, interpretability)
- production monitoring (see that assessments remain intact during production)
- explainer: a model that adds explanations to predictions; alibi delivers pertinent negatives and pertinent positives (minimum changes for a negative or positive prediction)
seldon can be used to manage kubernetes
- Container Versioning
Modern Continuous Delivery
- deploy to production from commit #1
- take over release schedule from IT to Business
- CDEV is a concept; CI and CDEP are techniques
* immutable infrastructure
* container orchestration
* version control and automation
* cloud native apps
tools? choice or lock-in?, lock-in choices should be avoided
cookiecutter seems to be for python what devtools/usethis is for R; it can be used to set up CDEV for projects.
- generate + seal your secrets, otherwise you can't continuously deliver
- don't overload your yaml
- test-driven, pair programming
- "the only way to go fast is to go well", Robert C. Martin
Practical Clean Architecture
Typing for data interfaces
- typing, use type annotations when writing functions
- the typing package has objects that allow type specifications for dictionaries
- python 3.7 offers data classes that make this easier
- dataclasses can be frozen, immutability can be added
- use abstractions to interact with databases,
- we can use injector to build interfaces; it will inject stuff into data classes
- the interfaces can be documented with swagger
- make an in-RAM db for testing
These interfaces are easily testable
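A minimal sketch of how these pieces could fit together, using only the standard library: a frozen dataclass as the typed data interface, an abstract repository as the database abstraction, and two in-RAM implementations for tests. All class names are hypothetical, and dependencies are passed in by hand here rather than through the injector library.

```python
import sqlite3
from abc import ABC, abstractmethod
from dataclasses import dataclass, FrozenInstanceError
from typing import Dict, Optional


@dataclass(frozen=True)
class User:
    """Immutable data interface with type annotations."""
    user_id: int
    name: str


class UserRepository(ABC):
    """Abstraction the application talks to instead of a concrete db."""

    @abstractmethod
    def add(self, user: User) -> None: ...

    @abstractmethod
    def get(self, user_id: int) -> Optional[User]: ...


class DictUserRepository(UserRepository):
    """Simplest test double: a typed dictionary held in RAM."""

    def __init__(self) -> None:
        self._users: Dict[int, User] = {}

    def add(self, user: User) -> None:
        self._users[user.user_id] = user

    def get(self, user_id: int) -> Optional[User]:
        return self._users.get(user_id)


class SqliteUserRepository(UserRepository):
    """Real SQL, but the connection can point at an in-memory database."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._conn = connection
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS users "
            "(user_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
        )

    def add(self, user: User) -> None:
        self._conn.execute(
            "INSERT INTO users (user_id, name) VALUES (?, ?)",
            (user.user_id, user.name),
        )
        self._conn.commit()

    def get(self, user_id: int) -> Optional[User]:
        row = self._conn.execute(
            "SELECT user_id, name FROM users WHERE user_id = ?", (user_id,)
        ).fetchone()
        return User(*row) if row else None


# The same test code runs against both in-RAM implementations.
repositories = [DictUserRepository(), SqliteUserRepository(sqlite3.connect(":memory:"))]
for repo in repositories:
    repo.add(User(1, "Ada"))

# Frozen dataclasses reject attribute assignment.
try:
    User(2, "Bob").name = "Eve"
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True
```

Because application code depends only on `UserRepository`, swapping the production database for an in-memory one in tests needs no changes outside the place where the repository is constructed.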
- lifetime 10 years
- make an application centric infrastructure
- do not put your db at the center of your architecture