Mining local and global patterns for complex data classification

Thumbnail Image
Li, Geng
Issue Date
Electronic thesis
Computer science
Research Projects
Organizational Units
Journal Issue
Alternative Title
In the second part of this thesis, we study the challenging problem of graph classification. With the proliferation of graph data, there has been a lot of interest in recent years to develop effective methods for classifying graph objects. Various graph kernel methods have been proposed recently. These methods have proven to be effective, but they tend to have high computational overhead. We propose an alternative approach to graph classification that is based on easily computable feature-vectors constructed from different global topological pattern attributes, as well as global label-based features. The main idea is that the graphs from the same class should have similar topological patterns and label attributes. Our method is simple to implement, and via a detailed comparison on real benchmark datasets, we show that our topological and label feature-based approach delivers competitive classification accuracy, with significantly better results on those datasets that have large unlabeled graph instances. Our method is also substantially faster than most other graph kernels.
In the first part of this thesis, we tackle the challenging problem of mining the simplest Boolean patterns from categorical datasets, which are subsequently used as features for classification. Enumerating all possible frequent Boolean patterns is prohibitive in most real-world datasets, so instead of complete enumeration, which is typically infeasible for this class of patterns, we propose the first approach to generate a near-uniform sample of the minimal Boolean expressions. We make both theoretical and practical contributions. Our method, based on Markov Chain Monte Carlo (MCMC) sampling, yields a succinct subset of the simplest frequent Boolean patterns. In our method, each Markov state can be taken to be a Boolean expression, with transitions allowed only between parent and child expressions. Starting from the empty expression, we use Monte Carlo methods to perform random walks with designed probability transition matrix P in the expression space to sample the minimal expressions. We propose a novel theoretical characterization of the minimal Disjunctive Normal Form (DNF) expressions, which allows us to prune the MCMC search space effectively. When combined with other optimization techniques, our method is also practically effective. For instance, we are able to sample interesting support-less patterns, i.e., where minimum frequency threshold is set to one. In order to demonstrate the effectiveness of our method, we perform an extensive set of experiments. In particular, we classify a variety of datasets from the UCI Machine Learning Repository, and show that minimal DNF patterns make very effective features for classification; much more so than purely conjunctive features. We also study the sample quality of our approach, as well as its scalability.
Pattern mining is an important data mining technique that involves extracting interesting (e.g., frequent) patterns from complex data. There is abundant work published in this research area in the past and lots of progress has been made, ranging from itemset mining, to sequence mining and graph mining. One significant application of pattern mining lies in its use for effective classification, since pattern mining can help discover structure in complex domains. The basic idea is that interesting patterns can be defined as new features, which can then be used to build a classifier. However, with the growing complexity of the data as well as the types of patterns and groups sought, traditional methods based on complete enumeration of all interested patterns suffers in terms of either the time complexity or memory constraint problem, which usually make the computation very expensive or even intractable. This is especially common for dense datasets and complex graph data.
August 2013
School of Science
Full Citation
Rensselaer Polytechnic Institute, Troy, NY
PubMed ID