Library
I have built a segmentation model in BayesiaLab, where the resultant segments are derived from a set of induced factors. I now have a new set of data that I would like to classify into the segments created from the first set of data. To identify the best variables for doing that, I ran an Augmented Markov Blanket (MB) on the manifest variables in the original model. Not surprisingly, given that the segments were created from factors, the overall precision was not great (~70%). Running the same procedure, but this time focusing only on the induced factors, yields a much higher precision (~96%).

So what I would like to be able to do is generate "factor scores" for each of the original induced factors for each respondent in the new dataset. Then, I would like to use the factors identified in the Augmented Markov Blanket procedure to classify new respondents into a given segment, and then save those classifications out to a data file.

Is this possible? And if so, how is it accomplished in BayesiaLab?
Library
The process you are describing is what we call "BayesiaLab Hierarchical Clustering". We will first describe the workflow for Data Clustering based on Factors, and then how to use the resulting network to classify a new set of observations into the created segments.

Cluster Induction:
Here are the steps for creating the Hierarchical network:
1. Run one of the Unsupervised Structural Learning algorithms on the Manifest variables (the original variables of your dataset) that you want to include in your segmentation. The best choice is probably the Maximum Weight Spanning Tree. Even though the generated tree is usually not the best representation of the joint probability distribution:
   a. This is the fastest algorithm.
   b. The results are stable.
   c. This is an intermediate step. The network will only be used for generating clusters of variables.
2. Go to Validation Mode (F5).
3. Run Variable Clustering.
4. Go back to Modeling Mode (F4).
5. Run Multiple Clustering to induce one Factor per cluster of variables:
   a. Check "Add all Nodes to the Final Network" to get the Factors and the Manifests in the final network.
   b. Check "Connect Factors to their Manifest Variables" to get a set of Naïve structures, one per Factor.
   c. Check "Forbid new Relations with Manifest Variables" to focus only on the Factors.
6. Select all the nodes (Ctrl + A) and run Data Clustering. The constraints on the Manifests will prevent them from being connected to the Cluster node.

Classification:
Here are the steps for using the Hierarchical network to classify a new dataset.
The new observations do not have Factors.
1. Associate the new dataset with the Hierarchical network (Data - Associate Data Source). All the Factors will appear as white nodes, indicating that they are Hidden nodes, i.e. nodes without any corresponding data in the associated dataset.
2. Impute the Factors by using the evidence we have on the Manifests:
   a. Select all of the Factors.
   b. Right-click on one of these Factors and select Imputation - Choose the Values with the Maximum Probability.
3. Impute (classify) the Cluster node by right-clicking on this node and selecting Imputation - Choose the Values with the Maximum Probability.

You can then save the entire dataset, or only your Cluster node by selecting it (Data - Save Data).

If you do not want to use the original Hierarchical network (i.e. the one with the Cluster node connected to all the Factors and the Factors connected to all their Manifests), you can impute the Cluster node after Step 6 of the Cluster Induction workflow, and then use a Supervised Learning algorithm to find the best variables. However, as the segments have been induced based on the Naïve structure, the initial structure should give you the best results. Using one of the Markov Blanket based algorithms is then only useful, in that particular case, for trying to reduce the number of variables that are necessary for the classification.
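
For readers who want to see what the two imputation steps of the Classification workflow amount to numerically, here is a minimal Python sketch. It is not BayesiaLab's API and not its inference engine: the function names (`map_state`, `classify`), the node names, and every probability in the tables are made up for illustration. It also makes a simplifying assumption, namely that each Factor is scored only from its own Manifests with a fixed prior, whereas an exact engine would propagate evidence through the whole network.

```python
# Illustrative sketch only. Nothing here is a BayesiaLab call; all tables
# are hypothetical numbers for a toy Cluster -> Factors -> Manifests model.

def map_state(prior, likelihoods, evidence):
    """Pick the most probable state of a parent node in a Naive structure.

    prior:       {state: P(state)}
    likelihoods: {child: {state: {value: P(value | state)}}}
    evidence:    {child: observed value}
    Returns (best_state, posterior dict).
    """
    scores = {}
    for state, p in prior.items():
        for child, value in evidence.items():
            p *= likelihoods[child][state][value]
        scores[state] = p
    total = sum(scores.values())
    posterior = {s: v / total for s, v in scores.items()}
    return max(posterior, key=posterior.get), posterior

# Hypothetical Factors, each the parent of two binary Manifests.
FACTORS = {
    "F1": {"prior": {"low": 0.5, "high": 0.5},
           "children": {
               "m1": {"low": {"no": 0.8, "yes": 0.2}, "high": {"no": 0.3, "yes": 0.7}},
               "m2": {"low": {"no": 0.9, "yes": 0.1}, "high": {"no": 0.4, "yes": 0.6}}}},
    "F2": {"prior": {"low": 0.6, "high": 0.4},
           "children": {
               "m3": {"low": {"no": 0.7, "yes": 0.3}, "high": {"no": 0.2, "yes": 0.8}},
               "m4": {"low": {"no": 0.6, "yes": 0.4}, "high": {"no": 0.1, "yes": 0.9}}}},
}

# Hypothetical Cluster node, the parent of both Factors.
CLUSTER = {"prior": {"A": 0.5, "B": 0.5},
           "children": {
               "F1": {"A": {"low": 0.9, "high": 0.1}, "B": {"low": 0.2, "high": 0.8}},
               "F2": {"A": {"low": 0.8, "high": 0.2}, "B": {"low": 0.3, "high": 0.7}}}}

def classify(respondent):
    """Impute each Factor from its Manifests, then the segment from the Factors."""
    factors = {}
    for name, spec in FACTORS.items():
        evidence = {m: respondent[m] for m in spec["children"]}
        factors[name], _ = map_state(spec["prior"], spec["children"], evidence)
    segment, posterior = map_state(CLUSTER["prior"], CLUSTER["children"], factors)
    return factors, segment, posterior

new_respondents = [
    {"m1": "yes", "m2": "yes", "m3": "yes", "m4": "yes"},
    {"m1": "no", "m2": "no", "m3": "no", "m4": "no"},
]
for row in new_respondents:
    factors, segment, _ = classify(row)
    print(factors, "->", segment)
```

The imputed Factor values play the role of the "factor scores" asked about in the question, and the second call to `map_state` corresponds to imputing the Cluster node once those values are filled in.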