Home

This package implements some resampling algorithms to generate synthetic samples. Here, with "some", I mean one algorithm. More will probably be added in the future.

The main function is smote. See the tutorial in the menu for a walkthrough and the docs below for the API reference.

Public API

Resample.smote — Function

smote([rng=default_rng()], data::AbstractVecOrMat, n::Int; k::Union{Nothing,Int}=nothing)
smote([rng=default_rng()], data, n::Int; k::Union{Nothing,Int}=nothing)

Return the sample obtained via the Synthetic Minority Over-sampling TEchnique (SMOTE) (Chawla et al., 2002), given the following arguments:

data: Data as a matrix or satisfying the tables interface. For matrices, each column denotes a point, and for tables, each row denotes a point.
n: Number of synthetic points that should be created.
k: Number of nearest neighbors to consider for each point.

For each minority class, the algorithm creates synthetic points along the line between a point and one of its k nearest neighbors. The location of the synthetic point along the line is chosen randomly.

The implementation is based on the pseudocode from the paper, but do note that the paper has a weird API (especially N) and the implementation is full of indexing logic. The essence is much simpler than the pseudocode: to create n new points, pick a random point p from the minority group for each new point, and then take a random point along the line from p to one of its k nearest neighbors.
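The essence described above can be sketched in a few lines of plain Julia. This is only an illustration under stated assumptions, not the package's implementation: the name smote_sketch, the brute-force neighbor search, and the default k=5 are all hypothetical, and the input is assumed to contain only minority-class points, one per column.

```julia
using LinearAlgebra: norm
using Random: AbstractRNG, MersenneTwister

# Hypothetical helper for illustration; NOT part of the package's API.
# `points` is assumed to hold one minority-class point per column.
function smote_sketch(rng::AbstractRNG, points::AbstractMatrix{<:Real}, n::Int; k::Int=5)
    n_points = size(points, 2)
    n_points >= 2 || throw(ArgumentError("need at least two points"))
    # A point cannot have more neighbors than there are other points.
    k = min(k, n_points - 1)
    synthetic = Matrix{Float64}(undef, size(points, 1), n)
    for i in 1:n
        # Pick a random point p from the minority group.
        p_index = rand(rng, 1:n_points)
        p = points[:, p_index]
        # Brute-force search for the k nearest neighbors of p, excluding p itself.
        others = [j for j in 1:n_points if j != p_index]
        distances = [norm(points[:, j] - p) for j in others]
        nearest = others[partialsortperm(distances, 1:k)]
        # Pick one of those neighbors and a random location on the line from p to it.
        neighbor = points[:, rand(rng, nearest)]
        synthetic[:, i] = p + rand(rng) * (neighbor - p)
    end
    return synthetic
end

# Usage: four points on the corners of the unit square, ten synthetic points.
rng = MersenneTwister(1)
points = [0.0 1.0 0.0 1.0; 0.0 0.0 1.0 1.0]
new_points = smote_sketch(rng, points, 10; k=2)
```

Each synthetic point lies on a line segment between two existing points, so the new points never leave the region spanned by the minority class.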


smote(rng::AbstractRNG, data, col::Union{Int,AbstractString,Symbol}; ratio::Real=1.0, k::Union{Nothing,Int}=nothing)
smote(data, col::Union{AbstractString,Symbol}; ratio::Real=1.0, k::Union{Nothing,Int}=nothing)

This is a helper function to simplify balancing data. Return the sample obtained via the Synthetic Minority Over-sampling TEchnique (SMOTE), given the following arguments:

data: Data as a matrix or satisfying the tables interface. For matrices, each column denotes a point, and for tables, each row denotes a point.
col: A column number or name specifying on which column the oversampling needs to be based.
ratio: The desired ratio between the element types in col. For example, when column :class contains 1200 elements of class 1 and 1400 elements of class 2, then smote(data, :class; ratio=1.0) will add 200 elements of class 1. With a ratio of 0.9, smote will add only 60 elements, since 1400 * 0.9 = 1260. Both examples assume that class 1 comes first in the data and class 2 second. If class 2 came first, a ratio of 0.9 would fail since it would need to remove elements from class 1.
k: Number of nearest neighbors to consider for each point.
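The arithmetic behind ratio can be checked with a small sketch. The helper points_to_add is hypothetical and not part of the package, and the rounding rule is an assumption; it only reproduces the numbers from the example above, where n_first and n_second are the counts of the class that comes first and second in the data.

```julia
# Hypothetical helper for illustration; NOT part of the package's API.
# Computes how many synthetic points the first class needs to reach
# `ratio` times the size of the second class.
function points_to_add(n_first::Int, n_second::Int; ratio::Real=1.0)
    # Rounding to the nearest integer is an assumption of this sketch.
    target = round(Int, n_second * ratio)
    # Oversampling can only add points; a smaller target would require removing some.
    target >= n_first || throw(ArgumentError("ratio would require removing elements"))
    return target - n_first
end

points_to_add(1200, 1400; ratio=1.0)  # 200
points_to_add(1200, 1400; ratio=0.9)  # 60
```

Swapping the counts, as in points_to_add(1400, 1200; ratio=0.9), throws an error in this sketch, which mirrors the failure case described for ratio.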

Note

This functionality is currently only implemented for 2 classes.
