AldenBlack
What is the best way to compare the effectiveness of adding a latent variable to a network (using Learning | Clustering | Data Clustering)? More specifically, what is the best way to determine the accuracy and likelihood of the structure before and after the latent variable is added? I'm thinking that the Network Performance function might be the best way to go, but I suspect I'm missing something obvious.

If this topic is adequately addressed in one of the white papers or elsewhere and I missed it, please redirect me there and I'll take a look.
Dan
If you are adding latent variables in the context of Supervised Learning, the best way to measure their added value is to use Network Performance | Target with an independent test set.

If it's in the context of Unsupervised Learning, the best approach is indeed to use the Contingency Table Fit via Network Performance | Overall. However, you need to set your latent variables as Not Observable (via Edit | Cost) in order to exclude them from the log-likelihood computation.
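To illustrate why the latent variables are left out of the log-likelihood, here is a minimal sketch in plain Python. It is not BayesiaLab's internal computation; the two-variable network, the parameter values, the test rows, and the function names `loglik_independent` and `loglik_latent` are all made up for illustration. The idea is that both networks are scored on the same observed variables only, with the latent variable Z summed out, so the two scores are directly comparable and can still differ.

```python
import math

def loglik_independent(rows, p_a, p_b):
    """Log-likelihood of binary (a, b) rows under a reference network
    with no arc between A and B (hypothetical example network)."""
    total = 0.0
    for a, b in rows:
        pa = p_a if a == 1 else 1.0 - p_a
        pb = p_b if b == 1 else 1.0 - p_b
        total += math.log(pa * pb)
    return total

def loglik_latent(rows, p_z, p_a_given_z, p_b_given_z):
    """Log-likelihood of the observed (a, b) rows under a network
    Z -> A, Z -> B, where Z is latent and therefore summed out:
    P(a, b) = sum_z P(z) * P(a | z) * P(b | z)."""
    total = 0.0
    for a, b in rows:
        marginal = 0.0
        for z in (0, 1):
            pz = p_z if z == 1 else 1.0 - p_z
            pa = p_a_given_z[z] if a == 1 else 1.0 - p_a_given_z[z]
            pb = p_b_given_z[z] if b == 1 else 1.0 - p_b_given_z[z]
            marginal += pz * pa * pb
        total += math.log(marginal)
    return total

# Made-up held-out test rows in which A and B tend to agree.
test_rows = [(1, 1), (1, 1), (0, 0), (0, 0), (1, 0), (1, 1), (0, 0), (0, 1)]

# Assumed (not learned) parameters for both networks.
ll_ref = loglik_independent(test_rows, p_a=0.5, p_b=0.5)
ll_lat = loglik_latent(
    test_rows,
    p_z=0.5,
    p_a_given_z={0: 0.1, 1: 0.9},
    p_b_given_z={0: 0.1, 1: 0.9},
)

print(f"reference network: {ll_ref:.3f}")
print(f"with latent Z:     {ll_lat:.3f}")
```

Because Z changes the joint distribution implied over A and B, the second score differs from the first even though Z itself never appears in the test data. Scoring Z as well would require values for it in the test set, and the two networks would no longer be evaluated on the same set of variables.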
AldenBlack
Thanks. Why would I want to set my latent variables as non-observable? Wouldn't I want to include them in the score, since they form part of the new network? In other words, the reference structure and the new structure differ only by the latent variable(s), so by excluding them, wouldn't I end up with the same score?

Also, can you please explain the difference between the structure score (which you see for each structure when performing cross-validation) and the contingency table fit percentage? I was under the impression that the "best" unsupervised-learned network was one that had at least 75% contingency table fit but the lowest score (indicating the highest P(structure | data) from the EM algorithm).

Finally, while we're on the topic of hidden variables: I think I have a good idea of how to introduce latent variables via modeling, but I do not understand the significance of Analysis | Report | Hidden Variable Discovery, especially if this function is intended to help show where latent variables should be introduced. The documentation in the FAQ is vague, but I thought this report was supposed to show the chance of two nodes (or a V-structure) being independent. However, the values for the G-test and p-value in the report do not make much sense in the context of the network I'm currently working with. Can you provide a little more insight here, please? Also, what does the change in color of the edges represent when I ask for this report?

Apologies for the long, multi-part question.