Solar irradiance forecasting by using wavelet based denoising

Predicting global solar irradiance is important in applications that use solar energy resources. In many of these applications, the collected data contains noise from different sources, and this noise can strongly influence the process of building regression models for irradiance forecasting. Denoising based on the wavelet transform is therefore proposed as a preprocessing step for the time-series meteorological data. An artificial neural network and a support vector machine are then used to build predictive models of Global Horizontal Irradiance (GHI) for three cities located in California, Kentucky and New York, individually. A detailed experimental analysis of the developed predictive models is presented, and comparisons with existing methodologies show that the proposed approach gives a significant improvement with increased generality.
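As a concrete illustration of the preprocessing step described above, a minimal sketch is given below: a GHI series is denoised with a discrete wavelet transform before a regression model is fit to it. It assumes the PyWavelets and scikit-learn libraries; the wavelet family ('db4'), decomposition level, universal soft threshold, and SVR model are illustrative choices, not the exact configuration from the paper.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_denoise(signal, wavelet="db4", level=3):
    """Decompose, soft-threshold the detail coefficients, and reconstruct."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745        # MAD noise estimate
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))     # universal threshold
    denoised = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft")
                              for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[:len(signal)]

# Hypothetical usage: `ghi` is a 1-D array of hourly GHI readings and `X`
# holds lagged meteorological features aligned with the next-step target.
# from sklearn.svm import SVR
# ghi_clean = wavelet_denoise(ghi)
# model = SVR(kernel="rbf").fit(X, ghi_clean[1:])
```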

A grid density based framework for classifying streaming data in the presence of concept drift

Tegjyot Singh Sethi, Mehmed Kantardzic and Hanqing Hu

Mining data streams is the process of extracting information from non-stopping, rapidly flowing data records to provide knowledge that is reliable and timely. Streaming data algorithms need to be one-pass and operate under strict limitations of memory and response time. In addition, the classification of streaming data requires learning in an environment where the data characteristics might change constantly. Many of the classification algorithms presented in the literature assume a 100% labeling rate, which is impractical and expensive when data records are rapidly flowing in. In this paper, a new incremental grid density based learning framework, the GC3 framework, is proposed to perform classification of streaming data with concept drift and limited labeling. The proposed framework uses grid density clustering to detect changes in the input data space, and it maintains an evolving ensemble of classifiers to learn and adapt to model changes over time. The framework also uses a uniform grid density sampling mechanism to obtain a uniform subset of samples, which yields better classification performance at a lower labeling rate. The entire framework is designed to be one-pass, incremental, and able to work with limited memory to perform any-time classification on demand. Experimental comparison with state-of-the-art concept drift handling systems demonstrates the GC3 framework's ability to provide high classification performance, using fewer models in the ensemble and with only 4-6% of the samples labeled. The results show that the GC3 framework is effective and attractive for use in real-world data stream classification applications.
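A minimal sketch of the grid-density bookkeeping such a framework relies on is given below: incoming samples are hashed to grid cells, cell densities are maintained incrementally, and labels are requested so that occupied cells are sampled roughly uniformly. The class name, cell width, and per-cell label budget are assumptions made for illustration, not the settings used in the GC3 paper.

```python
from collections import defaultdict
import numpy as np

class GridDensitySampler:
    """Incremental grid density counts with uniform per-cell label sampling."""

    def __init__(self, cell_width=0.1, labels_per_cell=2):
        self.cell_width = cell_width
        self.labels_per_cell = labels_per_cell
        self.density = defaultdict(int)   # samples seen per grid cell
        self.labeled = defaultdict(int)   # labels requested per grid cell

    def _cell(self, x):
        # Map a feature vector to its grid-cell coordinates.
        return tuple((np.asarray(x) // self.cell_width).astype(int))

    def observe(self, x):
        """Update the grid density for one sample; return True if a label
        should be requested for it."""
        cell = self._cell(x)
        self.density[cell] += 1
        if self.labeled[cell] < self.labels_per_cell:
            self.labeled[cell] += 1
            return True   # uniform per-cell sampling keeps the labeling rate low
        return False

# Hypothetical usage on a stream of feature vectors:
# sampler = GridDensitySampler()
# for x in stream:
#     if sampler.observe(x):
#         query_label(x)   # hypothetical labeling oracle
```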


On the Reliable Detection of Concept Drift from Streaming Unlabeled Data (Tegjyot Singh Sethi and Mehmed Kantardzic)

Classifiers deployed in the real world operate in a dynamic environment where the data distribution can change over time. These changes, referred to as concept drift, can cause the predictive performance of the classifier to drop over time, thereby making it obsolete. To be of any real use, such classifiers need to detect drifts and be able to adapt to them over time. Detecting drifts has traditionally been approached as a supervised task, with labeled data constantly being used to validate the learned model. Although effective in detecting drifts, these techniques are impractical, as labeling is a difficult, costly and time-consuming activity. On the other hand, unsupervised change detection techniques are unreliable, as they produce a large number of false alarms. The inefficacy of the unsupervised techniques stems from excluding the characteristics of the learned classifier from the detection process. In this paper, we propose the Margin Density Drift Detection (MD3) algorithm, which tracks the number of samples in the uncertainty region of a classifier as a metric to detect drift. The MD3 algorithm is a distribution-independent, application-independent, model-independent, unsupervised and incremental algorithm for reliably detecting drifts from data streams. Experimental evaluation on 6 drift-induced datasets and 4 additional datasets from the cybersecurity domain demonstrates that the MD3 approach can reliably detect drifts, with significantly fewer false alarms than unsupervised feature-based drift detectors, while producing performance comparable to that of a fully labeled drift detector. The reduced false alarms enable drifts to be signaled only when they are most likely to affect classification performance. As such, the MD3 approach leads to a detection scheme that is credible, label efficient and general in its applicability.
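A minimal sketch of the margin-density signal described above is given below: the fraction of recent unlabeled samples that fall inside the classifier's uncertainty region is tracked, and a drift warning is raised when it deviates from a reference value. A linear SVM margin is used here purely for illustration, and the window size and sensitivity factor are assumed values, not the configuration evaluated in the paper.

```python
from collections import deque
import numpy as np

class MarginDensityMonitor:
    """Track the fraction of samples falling in a linear classifier's margin."""

    def __init__(self, clf, window=500, sensitivity=2.0):
        self.clf = clf                      # trained linear classifier
        self.window = deque(maxlen=window)  # 1 if the sample is in the margin, else 0
        self.reference = None               # expected margin density
        self.sensitivity = sensitivity

    def in_margin(self, x):
        # Uncertainty region of a linear SVM: |w.x + b| <= 1.
        return abs(self.clf.decision_function(x.reshape(1, -1))[0]) <= 1.0

    def update(self, x):
        """Consume one unlabeled sample; return True if drift is suspected."""
        self.window.append(1.0 if self.in_margin(x) else 0.0)
        md = np.mean(self.window)
        if self.reference is None:
            if len(self.window) == self.window.maxlen:
                self.reference = md         # establish reference on stable data
            return False
        # Flag drift when margin density moves far from the reference value.
        return abs(md - self.reference) > self.sensitivity * np.std(self.window)

# Hypothetical usage:
# from sklearn.svm import LinearSVC
# clf = LinearSVC().fit(X_train, y_train)
# monitor = MarginDensityMonitor(clf)
# drift_flags = [monitor.update(x) for x in X_stream]
```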


An improved input parameters-insensitive trajectory clustering algorithm

Abstract: The existing trajectory clustering algorithm TRACLUS is sensitive to its input parameters ε and MinLns: even a small change in the parameter values can produce entirely different clustering results. To address this vulnerability, a shielding parameters sensitivity trajectory clustering (SPSTC) algorithm is proposed, which is insensitive to the input parameters. Firstly, definitions of the core distance and reachable distance of a line segment are presented, and the algorithm generates a cluster ordering according to these distances. Secondly, reachability plots of the line segment sets are constructed according to the cluster ordering and reachable distance. Thirdly, a parameterized sequence, which represents the inner cluster structure of the trajectory data, is extracted from the reachability plot, and the final trajectory clusters are obtained based on this sequence. Experiments on real data sets and test data sets show that the SPSTC algorithm effectively reduces sensitivity to the input parameters while obtaining trajectory clusters of better quality.
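The sketch below illustrates the kind of OPTICS-style cluster ordering the abstract describes: each line segment receives a core distance, a reachability ordering is produced, and clusters can be read off the resulting reachability plot. The segment distance used here is a crude midpoint distance chosen only to keep the example self-contained; it is not the perpendicular/parallel/angular segment distance of TRACLUS, and the function names are hypothetical.

```python
import numpy as np

def seg_dist(s, t):
    # s, t: segments given as (x1, y1, x2, y2); crude midpoint distance,
    # standing in for a proper line-segment distance.
    ms = np.array([(s[0] + s[2]) / 2.0, (s[1] + s[3]) / 2.0])
    mt = np.array([(t[0] + t[2]) / 2.0, (t[1] + t[3]) / 2.0])
    return float(np.linalg.norm(ms - mt))

def reachability_order(segments, min_lns=3):
    """Return an OPTICS-like ordering of segments and their reachability values."""
    n = len(segments)
    d = np.array([[seg_dist(a, b) for b in segments] for a in segments])
    core = np.sort(d, axis=1)[:, min_lns]   # distance to the MinLns-th neighbour
    reach = np.full(n, np.inf)
    seen = np.zeros(n, dtype=bool)
    order = []
    current = 0
    for _ in range(n):
        seen[current] = True
        order.append(current)
        # Update reachability of unseen segments from the current segment.
        for j in np.where(~seen)[0]:
            reach[j] = min(reach[j], max(core[current], d[current, j]))
        rest = np.where(~seen)[0]
        if len(rest) == 0:
            break
        current = rest[np.argmin(reach[rest])]
    return order, reach

# order, reach = reachability_order(segment_list)
# Valleys in reach[order] correspond to candidate trajectory clusters.
```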

Ensemble Framework for Missing Feature Problem in Data Stream Classification

Hanqing Hu, Mehmed Kantardzic

A dynamic data stream requires the classification framework to adapt to changes in the stream. A common strategy for adaptation is to train new models or update existing models when changes occur. However, in real-world applications, some features of the data can be missing when training new models, for example due to faulty devices or interruptions in data transmission. The performance of new models trained with incomplete data may be negatively impacted, and if the models are never updated, performance may remain low even after the data stream is restored to its full feature set. To solve this missing feature problem we propose the Ensemble Framework for Missing Feature (EFMF). The framework trains new models using the currently available features and then updates those models once the data stream is restored. Experimentally, we show that our framework outperforms two naïve approaches: one that waits for all features to become available before training new models, and one that trains with incomplete data and never updates the models.
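A minimal sketch of the idea is given below: when some features stop arriving, a new model is trained only on the columns that are still available, and once the stream is restored that model is replaced by one trained on the full feature set. The simple list-of-members ensemble, decision trees, and majority voting are illustrative assumptions, not the exact design of EFMF.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class MissingFeatureEnsemble:
    """Ensemble whose members may be trained on different feature subsets."""

    def __init__(self):
        self.members = []   # list of (model, feature_indices) pairs

    def add_model(self, X_chunk, y_chunk, available_idx):
        # Train a new member on whichever feature columns are currently arriving.
        model = DecisionTreeClassifier().fit(X_chunk[:, available_idx], y_chunk)
        self.members.append((model, list(available_idx)))

    def replace_last(self, X_full, y_full):
        # Stream restored: retrain the most recent member on the full feature set.
        model = DecisionTreeClassifier().fit(X_full, y_full)
        self.members[-1] = (model, list(range(X_full.shape[1])))

    def predict(self, X):
        # Majority vote over members; assumes non-negative integer class labels.
        votes = np.array([m.predict(X[:, idx]) for m, idx in self.members])
        return np.apply_along_axis(
            lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)

# Hypothetical usage:
# ens = MissingFeatureEnsemble()
# ens.add_model(chunk_X, chunk_y, available_idx=[0, 2, 3])   # feature 1 missing
# ens.replace_last(restored_X, restored_y)                    # full features back
# y_pred = ens.predict(test_X)
```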

Sliding Reservoir Approach for Delayed Labeling in Streaming Data Classification

Hanqing Hu, Mehmed Kantardzic


Abstract

When concept drift occurs within streaming data, a streaming data classification framework needs to update its learning model to maintain performance. In real-world applications, however, the labeled samples required for training a new model are often not available immediately, and this labeling delay can negatively impact the performance of traditional streaming data classification frameworks. To solve this problem, we propose the Sliding Reservoir Approach for Delayed Labeling (SRADL). By combining chunk-based semi-supervised learning with a novel approach to managing labeled data, SRADL does not need to wait for the labeling process to finish before updating the learning model. Experiments with two delayed-label scenarios show that SRADL improves prediction performance over the naïve approach by as much as 7.5% in certain cases. The largest gain comes from the 18-chunk labeling delay under the continuous labeling delivery scenario in the real-world data experiments.
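A minimal sketch of the reservoir idea is given below: because labels arrive several chunks after their samples, the model for the current chunk is trained from a bounded, sliding reservoir of the most recently labeled samples combined with the current unlabeled chunk via self-training. The reservoir size, base classifier, and use of scikit-learn's SelfTrainingClassifier are assumptions for illustration, not the exact components evaluated for SRADL.

```python
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.tree import DecisionTreeClassifier

class SlidingReservoir:
    """Bounded store of the most recently delivered labeled samples."""

    def __init__(self, max_size=2000):
        self.max_size = max_size
        self.X = None
        self.y = None

    def add_labeled(self, X_late, y_late):
        # Labels just delivered for an older chunk; keep only the newest samples.
        self.X = X_late if self.X is None else np.vstack([self.X, X_late])
        self.y = y_late if self.y is None else np.concatenate([self.y, y_late])
        self.X, self.y = self.X[-self.max_size:], self.y[-self.max_size:]

    def train_for_chunk(self, X_unlabeled):
        # Combine the reservoir (labeled) with the current chunk (unlabeled, y = -1)
        # and fit a semi-supervised model without waiting for the chunk's labels.
        # Assumes integer class labels, with -1 reserved for "unlabeled".
        X = np.vstack([self.X, X_unlabeled])
        y = np.concatenate([self.y, -np.ones(len(X_unlabeled), dtype=int)])
        return SelfTrainingClassifier(DecisionTreeClassifier()).fit(X, y)

# Hypothetical usage:
# reservoir = SlidingReservoir()
# reservoir.add_labeled(old_chunk_X, old_chunk_y)      # labels delivered late
# model = reservoir.train_for_chunk(current_chunk_X)
# preds = model.predict(current_chunk_X)
```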