discretize {arules} | R Documentation |
This function implements several basic unsupervised methods to convert a continuous variable into a categorical variable (factor) using different binning strategies. For convenience, a whole data.frame can be discretized (i.e., all numeric columns are discretized).
discretize(x, method = "frequency", breaks = 3, labels = NULL,
  include.lowest = TRUE, right = FALSE, dig.lab = 3,
  ordered_result = FALSE, infinity = FALSE, onlycuts = FALSE,
  categories, ...)

discretizeDF(df, methods = NULL, default = NULL)
x |
a numeric vector (continuous variable). |
method |
discretization method. Available are: |
breaks, categories |
|
labels |
character vector; labels for the levels of the resulting category. By default, labels are constructed using "(a,b]" interval notation. If |
include.lowest |
logical; should the first interval be closed to the left? |
right |
logical; should the intervals be closed on the right (and open on the left) or vice versa? |
dig.lab |
integer; number of digits used to create labels. |
ordered_result |
logical; return an ordered factor? |
infinity |
logical; should the first/last break boundary be changed to -Inf/+Inf? |
onlycuts |
logical; return only computed interval boundaries? |
... |
for method "cluster" further arguments are passed on to
|
.
df |
data.frame; each numeric column in the data.frame is discretized. |
methods |
named list of lists or a data.frame;
the named list contains list of discretization parameters
(see parameters of |
default |
named list; parameters for |
discretize only implements unsupervised discretization. See packages arulesCBA, discretization or RWeka for supervised discretization.
discretizeDF applies discretization to each numeric column. Individual discretization parameters can be specified in the form: methods = list(column_name1 = list(method = ,...), column_name2 = list(...)). If no discretization method is specified for a column, then the discretization in default is applied (NULL invokes the default method in discretize()). The special method "none" can be specified to suppress discretization for a column.
A factor representing the categorized continuous variable, with the attributes "discretized:breaks", indicating the breaks used, and "discretized:method", giving the method used. If onlycuts = TRUE is used, a vector with the calculated interval boundaries is returned instead.
discretizeDF returns a discretized data.frame.
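The attributes described above can be inspected with attr(). The following is a minimal sketch, assuming the arules package is installed and using the built-in iris data; the variable names d and cuts are chosen only for illustration.

```r
library(arules)

x <- iris$Sepal.Length

# discretize into three equal-frequency bins (the default method)
d <- discretize(x, method = "frequency", breaks = 3)

# the breaks and the method are stored as attributes of the returned factor
attr(d, "discretized:breaks")
attr(d, "discretized:method")

# onlycuts = TRUE returns the interval boundaries instead of a factor
cuts <- discretize(x, method = "frequency", breaks = 3, onlycuts = TRUE)
cuts
```

The stored breaks should match the vector returned with onlycuts = TRUE, which makes it possible to reapply the same binning later with method = "fixed".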
Michael Hahsler
data(iris)
x <- iris[,1]

### look at the distribution before discretizing
hist(x, breaks = 20, main = "Data")

def.par <- par(no.readonly = TRUE) # save default
layout(mat = rbind(1:2,3:4))

### convert continuous variables into categories (there are 3 types of flowers)
### the default method is equal frequency
table(discretize(x, breaks = 3))
hist(x, breaks = 20, main = "Equal Frequency")
abline(v = discretize(x, breaks = 3, onlycuts = TRUE), col = "red")
# Note: the frequencies are not exactly equal because of ties in the data

### equal interval width
table(discretize(x, method = "interval", breaks = 3))
hist(x, breaks = 20, main = "Equal Interval length")
abline(v = discretize(x, method = "interval", breaks = 3,
  onlycuts = TRUE), col = "red")

### k-means clustering
table(discretize(x, method = "cluster", breaks = 3))
hist(x, breaks = 20, main = "K-Means")
abline(v = discretize(x, method = "cluster", breaks = 3,
  onlycuts = TRUE), col = "red")

### user-specified (with labels)
table(discretize(x, method = "fixed", breaks = c(-Inf, 6, Inf),
  labels = c("small", "large")))
hist(x, breaks = 20, main = "Fixed")
abline(v = discretize(x, method = "fixed", breaks = c(-Inf, 6, Inf),
  onlycuts = TRUE), col = "red")

par(def.par) # reset to default

### prepare the iris data set for association rule mining
### use default discretization
irisDisc <- discretizeDF(iris)
head(irisDisc)

### discretize all numeric columns differently
irisDisc <- discretizeDF(iris, default = list(method = "interval", breaks = 2,
  labels = c("small", "large")))
head(irisDisc)

### specify discretization for the petal columns and don't discretize the others
irisDisc <- discretizeDF(iris, methods = list(
  Petal.Length = list(method = "frequency", breaks = 3,
    labels = c("short", "medium", "long")),
  Petal.Width = list(method = "frequency", breaks = 2,
    labels = c("narrow", "wide"))
  ),
  default = list(method = "none")
  )
head(irisDisc)

### discretize new data using the same discretization scheme as the
### data.frame supplied in methods. Note: NAs may occur if a new
### value falls outside the range of values observed in the
### originally discretized table (use argument infinity = TRUE in
### discretize to prevent this case.)
discretizeDF(iris[sample(1:nrow(iris), 5),], methods = irisDisc)
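
The note on out-of-range values can be illustrated on a single vector. This is a sketch under stated assumptions: the arules package is installed, and new_x holds made-up values that lie partly outside the range of the original data. With infinity = TRUE, the outer boundaries are expected to become -Inf and Inf, so that reusing them with method = "fixed" assigns every new value to a bin.

```r
library(arules)

x <- iris$Sepal.Length

# boundaries computed from the observed data only
discretize(x, breaks = 3, onlycuts = TRUE)

# with infinity = TRUE the first/last boundaries are replaced by -Inf/Inf
cuts_inf <- discretize(x, breaks = 3, infinity = TRUE, onlycuts = TRUE)
cuts_inf

# hypothetical new measurements, two of them outside the range of x
new_x <- c(3.0, 5.8, 9.5)

# reuse the open-ended boundaries as fixed breaks for the new data
new_d <- discretize(new_x, method = "fixed", breaks = cuts_inf)
new_d
```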