How do I choose an appropriate discretization algorithm, if I don't have any knowledge about the domain?
It depends on the type of model you are developing:

- For models with a target variable, follow the discretization approach for Supervised Learning.
- For models without a target variable, i.e. with all variables having (a priori) equal importance, follow the discretization approach for Unsupervised Learning.

Discretization for Supervised Learning:

In the context of Supervised Learning, the Decision Tree algorithm is generally the best approach. Each continuous variable is binned based on its individual relationship with the target variable.

Information: Please note that Decision Tree discretization only works with discrete target variables. So, if your target variable happens to be continuous, you will first need to discretize it individually, either with the Manual approach (so you can set the thresholds based on your own domain knowledge) or with one of the automatic unsupervised discretization algorithms (e.g. Density Approximation).
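BayesiaLab's exact tree-building criterion is not public, so purely as an illustration of the idea, here is a minimal Python sketch of tree-based supervised binning: recursively pick the threshold that maximizes information gain with respect to a discrete target, and use the collected split points as bin boundaries. All function names and defaults here are hypothetical, not BayesiaLab's.

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of discrete labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_split(xs, ys):
    """Threshold with the highest information gain, or None if no gain."""
    pairs = sorted(zip(xs, ys))
    base = entropy(ys)
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical x values: no threshold fits between them
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(pairs)
        if gain > best_gain:
            best_gain = gain
            best_t = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_t

def tree_thresholds(xs, ys, depth=2, min_size=5):
    """Collect bin boundaries from a shallow entropy-based tree."""
    if depth == 0 or len(xs) < 2 * min_size:
        return []
    t = best_split(xs, ys)
    if t is None:
        return []  # no informative split: supervised discretization fails
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return sorted(
        tree_thresholds([x for x, _ in left], [y for _, y in left],
                        depth - 1, min_size)
        + [t]
        + tree_thresholds([x for x, _ in right], [y for _, y in right],
                          depth - 1, min_size))
```

For example, with a target that flips class around x = 0, the recovered threshold lands at 0, i.e. the bins follow the target rather than the marginal density.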
Once its thresholds are defined, the target variable is considered discrete and can then be selected as the target for the Decision Tree algorithm.

If the Decision Tree discretization fails for a variable (meaning there is no significant relationship between that variable and the target), the discretization defaults to the algorithm for Unsupervised Learning (see below).

Discretization for Unsupervised Learning:

When users are not prompted to choose the algorithms, BayesiaLab applies the following default policy:

1. Density Approximation; if this algorithm fails,
2. K-Means; if this algorithm fails,
3. Normalized Equal Distances; if this algorithm fails,
4. Equal Distances.

Information: We do not recommend Equal Frequencies, for two reasons:

- the shape of the initial density function is lost;
- the marginal entropy of the discretized variable is at its maximum value. Since the objective of BayesiaLab's structural learning algorithms is to reduce the global entropy of the system, this discretization leads to overly complicated networks.

Number of Bins:

BayesiaLab's structural learning algorithms are based on the Minimum Description Length (MDL) score. The general idea is to add a link between two variables only if the relationship is strong enough to compensate for the structural complexity induced by the resulting conditional probability table used to quantify it. The more intervals per variable, the more complex the resulting conditional probability tables, and thus the more data is needed to find relationships strong enough to compensate for that complexity.

The maximum number of intervals to choose is therefore directly dependent on the size of the dataset you are analyzing. Defining too many intervals can lead to overly sparse networks.
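The entropy argument against Equal Frequencies can be checked numerically. The sketch below (illustrative code, not BayesiaLab's implementation) compares the marginal entropy of an equal-width binning (in the spirit of Equal Distances) against an equal-frequency binning on skewed data: equal-frequency bins always reach the maximum of log2(k) bits, regardless of the underlying density.

```python
import math
from collections import Counter

def equal_width_bins(xs, k):
    """Assign each value to one of k equally wide intervals."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k
    return [min(int((x - lo) / width), k - 1) for x in xs]

def equal_freq_bins(xs, k):
    """Assign each value to one of k bins holding (nearly) equal counts."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    bins = [0] * len(xs)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(xs), k - 1)
    return bins

def marginal_entropy(bins):
    """Shannon entropy (bits) of the resulting bin distribution."""
    n = len(bins)
    return -sum(c / n * math.log2(c / n) for c in Counter(bins).values())
```

On a skewed sample such as `[0.0, 0.1, ..., 0.7, 9.0, 10.0]` with k = 2, the equal-width binning keeps the skew visible (entropy about 0.72 bits), whereas the equal-frequency binning hits the maximum 1 bit, erasing the shape of the density.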
Defining too few intervals can lead to overly connected networks. For datasets with 1,000 observations, choosing 5 bins is usually a safe choice, and at least 3 bins are necessary to capture nonlinear relationships.

Changing the Structural Coefficient (α) changes the balance in the MDL score between relationship strength and structural complexity.
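The bins-versus-data trade-off can be made concrete with the classic MDL structural penalty of roughly (log2 N)/2 bits per free CPT parameter. This is a common textbook form; BayesiaLab's exact score, and how its Structural Coefficient enters it, may differ, so the `alpha` parameterization below is an assumption for illustration:

```python
import math

def cpt_free_parameters(child_bins, parent_bins):
    """Free parameters of a CPT: (child states - 1) per parent configuration."""
    parent_configs = math.prod(parent_bins) if parent_bins else 1
    return (child_bins - 1) * parent_configs

def mdl_penalty(n_obs, child_bins, parent_bins, alpha=1.0):
    """Structural cost in bits of adding the link; alpha mimics a
    structural coefficient rescaling the complexity term (hypothetical)."""
    return alpha * 0.5 * math.log2(n_obs) * cpt_free_parameters(child_bins, parent_bins)
```

With 1,000 observations, a link between two 5-bin variables costs about 100 bits of structure, while the same link between two 10-bin variables costs about 450 bits, so the relationship must carry correspondingly more information to justify the link. Lowering alpha makes links cheaper, which is how the Structural Coefficient shifts the balance described above.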