
qdscreen

Remove redundancy in your categorical variables and increase your models' performance.


qdscreen provides a Python implementation of the Quasi-determinism screening algorithm (also known as qds-BNSL) from T. Rahier's PhD thesis, 2018.

Most data scientists are familiar with the concept of correlation between continuous variables. This concept extends to categorical variables, where it is known as functional dependency in the field of relational database mining. We also call it determinism in the context of Machine Learning and Statistics, to indicate that when a random variable X is known, the value of another variable Y is determined with absolute certainty. "Quasi-"determinism extends this concept to handle noise or extremely rare cases in data.
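To make the idea concrete, here is a minimal standard-library sketch that measures how strongly one categorical variable determines another using the conditional entropy H(Y|X). It is an illustration only, not qdscreen's implementation: the function names, the `relative_eps` threshold and the toy data are all assumptions of this sketch.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(xs, ys):
    """H(Y | X) = H(X, Y) - H(X), estimated from paired samples."""
    return entropy(list(zip(xs, ys))) - entropy(xs)


def is_quasi_deterministic(xs, ys, relative_eps=0.05):
    """True if knowing X leaves at most a fraction relative_eps of H(Y).

    relative_eps is a hypothetical tolerance chosen for this sketch;
    relative_eps=0 corresponds to strict determinism.
    """
    h_y = entropy(ys)
    if h_y == 0:  # Y is constant, so it is trivially determined
        return True
    # small tolerance to absorb floating-point rounding
    return conditional_entropy(xs, ys) <= relative_eps * h_y + 1e-12


# Toy data: each city belongs to exactly one country, but not the reverse.
cities = ["Paris", "Lyon", "Berlin", "Paris", "Munich", "Berlin"]
countries = ["FR", "FR", "DE", "FR", "DE", "DE"]

city_determines_country = is_quasi_deterministic(cities, countries, relative_eps=0.0)
country_determines_city = is_quasi_deterministic(countries, cities)
```

In this toy dataset `city_determines_country` is true while `country_determines_city` is false, which shows that the relationship is directional: Y can be redundant given X without X being redundant given Y.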

qdscreen is able to detect and remove (quasi-)deterministic relationships in a dataset:

  • either as a preprocessing step in any general-purpose data science pipeline

  • or as an accelerator of a Bayesian Network Structure Learning method such as pyGOBN
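As a rough illustration of the preprocessing use case, the sketch below drops every column that is strictly determined by another column still present in the table. It is a simplified stand-in written with the standard library only; it does not use or reproduce qdscreen's API, and the helper names `determines` and `screen_deterministic` are hypothetical.

```python
def determines(xs, ys):
    """True if each value of X always co-occurs with a single value of Y."""
    seen = {}
    for x, y in zip(xs, ys):
        if seen.setdefault(x, y) != y:
            return False
    return True


def screen_deterministic(table):
    """Split columns into kept ones and removed (redundant) ones.

    table maps column names to equal-length lists of categorical values.
    Returns (kept, removed), where removed maps each dropped column to the
    column that determines it.
    """
    removed = {}
    for y in table:
        for x in table:
            # only columns that are still kept may justify a removal
            if x != y and x not in removed and determines(table[x], table[y]):
                removed[y] = x
                break
    kept = [c for c in table if c not in removed]
    return kept, removed


# Toy dataset (made up for this sketch)
table = {
    "city": ["Paris", "Lyon", "Berlin", "Paris", "Munich", "Berlin"],
    "country": ["France", "France", "Germany", "France", "Germany", "Germany"],
    "country_code": ["FR", "FR", "DE", "FR", "DE", "DE"],
    "weather": ["sun", "rain", "sun", "rain", "rain", "sun"],
}
kept, removed = screen_deterministic(table)
```

Here `country` and `country_code` are both functions of `city`, so only `city` and `weather` are kept. Note that a column with a unique value per row (an identifier, for instance) trivially determines every other column, so such columns should be set aside before applying a naive screen like this one.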

Installing

> pip install qdscreen

Usage

1. Remove correlated variables

See this example.

2. Learn a Bayesian Network structure

TODO see #6.

Main features / benefits

  • A feature selection algorithm able to eliminate quasi-deterministic relationships

    • a base version compliant with numpy and pandas datasets
    • a scikit-learn compliant version (numpy only)
  • An accelerator for Bayesian Network Structure Learning tasks

See Also

Others

Do you like this library? You might also like smarie's other Python libraries.

Want to contribute?

Details on the github page: https://github.com/python-qds/qdscreen