Home
This package implements some resampling algorithms to generate synthetic samples. Here, with "some", I mean one algorithm. More will probably be added in the future.
The main function is smote. See the tutorial in the menu for a walkthrough and the docs below for the API reference.
Public API
Resample.smote - Function
smote([rng=default_rng()], data::AbstractVecOrMat, n::Int; k::Union{Nothing,Int}=nothing)
smote([rng=default_rng()], data, n::Int; k::Union{Nothing,Int}=nothing)
Return the sample obtained via the Synthetic Minority Over-sampling TEchnique (SMOTE) (Chawla et al., 2002) for the following arguments:
data: Data as a matrix or satisfying the tables interface. For matrices, each column denotes a point and for tables each row denotes a point.
n: Number of synthetic points that should be created.
k: Number of nearest neighbors to consider for each point.
For each minority class, the algorithm creates synthetic points along the line between a point and one of its k nearest neighbors. The location of the synthetic point along the line is chosen randomly.
The implementation is based on the pseudocode from the paper, but do note that the paper has a weird API (especially N) and the implementation is full of indexing logic. The essence is much simpler than the pseudocode: to find n new points, take n random points p from the minority group and, for each p, take a random point along the line to one of its k nearest neighbors.
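To make the essence described above concrete, here is a minimal sketch that uses only the Julia standard library. It is not the package's implementation: the names smote_sketch and nearest_neighbors, the brute-force neighbor search, and the default k=5 are all assumptions made for illustration. As in the docs above, each column of the matrix is one point.

```julia
using Random
using LinearAlgebra

# Hypothetical helper (not part of the package): indices of the `k` columns
# of `data` that are closest, in Euclidean distance, to column `i`.
function nearest_neighbors(data::AbstractMatrix, i::Int, k::Int)
    dists = [norm(data[:, j] - data[:, i]) for j in axes(data, 2)]
    dists[i] = Inf                      # never pick the point itself
    return partialsortperm(dists, 1:k)  # indices of the k smallest distances
end

# Hypothetical sketch: create `n` synthetic points from a matrix whose
# columns are the minority-class points. The default `k=5` is an assumption.
function smote_sketch(rng::AbstractRNG, data::AbstractMatrix, n::Int; k::Int=5)
    size(data, 2) >= 2 || throw(ArgumentError("need at least two points"))
    k = min(k, size(data, 2) - 1)       # cannot have more neighbors than other points
    synthetic = Matrix{Float64}(undef, size(data, 1), n)
    for s in 1:n
        i = rand(rng, axes(data, 2))                  # random point p
        j = rand(rng, nearest_neighbors(data, i, k))  # one of its k nearest neighbors
        t = rand(rng)                                 # position along the line, in [0, 1)
        synthetic[:, s] = data[:, i] + t * (data[:, j] - data[:, i])
    end
    return synthetic
end

rng = MersenneTwister(1)
minority = [0.0 1.0 0.0 1.0;
            0.0 0.0 1.0 1.0]   # four 2-dimensional points, one per column
new_points = smote_sketch(rng, minority, 6; k=2)
```

Every synthetic point lies on a line segment between two existing points, so in this example all of them stay inside the unit square spanned by the four input points.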
smote(rng::AbstractRNG, data, col::Union{Int,AbstractString,Symbol}; ratio::Real=1.0, k::Union{Nothing,Int}=nothing)
smote(data, col::Union{AbstractString,Symbol}; ratio::Real=1.0, k::Union{Nothing,Int}=nothing)
This is a helper function to simplify balancing data. Return the sample obtained via the Synthetic Minority Over-sampling TEchnique (SMOTE) for the following arguments:
data: Data as a matrix or satisfying the tables interface. For matrices, each column denotes a point and for tables each row denotes a point.
col: A column number or name specifying the column on which the oversampling is based.
ratio: The desired ratio between the element types in col. For example, when column :class contains 1200 elements of class 1 and 1400 elements of class 2, then smote(data, :class; ratio=1.0) will add 200 elements of class 1. With a ratio of 0.9, smote will add only 60 elements, since 1400 * 0.9 = 1260, assuming that class 1 comes first in the data and then class 2. If class 2 came first, then a ratio of 0.9 would fail since it would need to remove elements from class 1.
k: Number of nearest neighbors to consider for each point.
This functionality is currently only implemented for 2 classes.
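The arithmetic in the ratio example can be checked with a small sketch. The helper points_to_add is hypothetical and not part of the package. It encodes one reading of the example above, namely that ratio is the count of the first class divided by the count of the second class, and that a target below the current count is an error.

```julia
# Hypothetical helper (not the package's code): number of synthetic points to
# add to the first class so that n_first / n_second reaches `ratio`.
function points_to_add(n_first::Int, n_second::Int, ratio::Real)
    target = round(Int, ratio * n_second)  # desired size of the first class
    n = target - n_first
    # Oversampling can only add points, so a negative count is an error.
    n >= 0 || throw(ArgumentError("ratio $ratio would require removing $(-n) elements"))
    return n
end

points_to_add(1200, 1400, 1.0)  # 200
points_to_add(1200, 1400, 0.9)  # 60
```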