Truth Discovery in Crowdsourcing Systems


Crowdsourced data is usually noisy, so repetitive labeling is commonly used to ensure the quality of the collected labels. In repetitive labeling, multiple workers provide a label for each item. Aggregating these labels yields a single result per item, called the consensus result, which serves as the estimated true label for that item.
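As an illustration of label aggregation, the simplest baseline is majority voting: each item receives the label most of its workers chose. The sketch below is only that baseline, not the truth discovery method developed in this project; the function name `majority_vote`, the `(item, worker, label)` tuple layout, and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter, defaultdict


def majority_vote(judgments):
    """Aggregate (item, worker, label) judgments into one consensus label per item.

    Ties are broken by the string order of the labels so that the result is
    deterministic. This tie-breaking rule is an assumption of the sketch.
    """
    votes = defaultdict(Counter)
    for item, _worker, label in judgments:
        votes[item][label] += 1

    consensus = {}
    for item, counter in votes.items():
        # Highest count first; among equal counts, smallest label string first.
        best_label, _count = min(
            counter.items(), key=lambda kv: (-kv[1], str(kv[0]))
        )
        consensus[item] = best_label
    return consensus


# Hypothetical judgments in the spirit of an adult-content labeling task.
example_judgments = [
    ("site1", "w1", "adult"),
    ("site1", "w2", "adult"),
    ("site1", "w3", "clean"),
    ("site2", "w1", "clean"),
    ("site2", "w2", "clean"),
]
consensus_labels = majority_vote(example_judgments)
```

Majority voting treats every worker as equally reliable, which is exactly the assumption that reliability-aware truth discovery methods try to relax.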
The AC2 data set was originally used by Ipeirotis et al. in “Quality management on Amazon Mechanical Turk”. It contains AMT judgments of websites for the presence of adult content on the page. The original TREC data set was used by Buckley et al. in “Overview of the TREC 2010 relevance feedback track (notebook)”. It contains AMT ordinal graded relevance judgments for pairs of search queries and URLs.

Dr. Kantardzic

Personal Page

Dr. Kantardzic received his doctorate in Computer Science from the University of Sarajevo, Bosnia and Herzegovina, in 1980. He also holds an MS in Computer Science (1976) and a BS in Electrical Engineering (1972), both from the University of Sarajevo. He served as an Assistant and later an Associate Professor at the University of Sarajevo from 1972 to 1994, and during this period he was twice elected Associate Dean for Research (1987 – 1989 and 1992 – 1994). He joined the Engineering faculty at the University of Louisville in 1995 as a Visiting Professor, was appointed Associate Professor in 2001, and was promoted to Professor in 2004. Currently, he is the Director of the Data Mining Lab as well as the Director of CECS Graduate Studies. His research focuses on data mining & knowledge discovery, machine learning, soft computing, click fraud detection and prevention, concept drift in streaming data, and distributed intelligent systems.

Dr. Kantardzic is the author of six books, including “Data Mining: Concepts, Models, Methods, and Algorithms” (Wiley, second edition, 2011), which is widely accepted as the textbook for data mining courses at over 100 universities worldwide, and “New Generation of Data Mining Applications” (Wiley, 2005). He is the author of over 30 peer-reviewed journal publications, 20 book chapters, and over 200 reviewed articles in the proceedings of international conferences in the areas of data mining, machine learning, and soft computing. His recent research projects are supported by NSF, KSTC, the US Treasury Department, and NASA. Among his accolades are a University of Louisville Distinguished Service Award (2008), numerous Outstanding, Best, and Honorable Mention Awards for his research papers (2012, -09, -07, -05, -03), and several Faculty Favorite and Distinguished Teaching Awards (2012, -11, -07, -04). Dr. Kantardzic has served on the Editorial Boards of Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Transactions on Machine Learning and Data Mining Journal, The Information Resources Management Journal, and The Mathematical Methods for Learning Journal. He was a General Chair or Program Chair for several international conferences, including ICAT’11, ICMLA’09, IEEE ISSPIT’08, and WSIPT’07, and was an invited keynote speaker at conferences in the USA, Mexico, Bosnia, Egypt, Japan, and Algeria. He was a reviewer and panel member for NSF, the Austrian Science Foundation (ASF), the Canadian Research Council (NSERC), and the Norman Hackerman Advanced Research Program (NHARP).

Data mining & knowledge discovery, machine learning, soft computing, click fraud detection and prevention, concept drift in streaming data, and distributed intelligent systems.

Solar irradiance forecasting by using wavelet based denoising

Predicting global solar irradiance is very important in applications that use solar energy resources. In many applications, the collected data includes noise from different sources, and this noise can strongly affect the regression models built for irradiance forecasting. We propose wavelet-based denoising as a preprocessing step applied to the time series meteorological data. Artificial neural networks and support vector machines are then used to build predictive models of Global Horizontal Irradiance (GHI) for three cities, located in California, Kentucky, and New York, individually. A detailed experimental analysis of the developed predictive models is presented, and comparisons with existing methodologies show that the proposed approach gives a significant improvement with increased generality.
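To make the preprocessing idea concrete, the following is a minimal sketch of wavelet denoising: transform the series, shrink the small detail coefficients, and transform back. The abstract above does not state the wavelet family, decomposition depth, or thresholding rule, so the choices here (a single-level Haar transform, soft thresholding, a hand-picked threshold, and all function names) are assumptions for illustration only.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """One level of the orthonormal Haar transform; requires even length."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward, interleaving the reconstructed sample pairs."""
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) / SQRT2)
        signal.append((a - d) / SQRT2)
    return signal


def soft_threshold(coeffs, threshold):
    """Shrink each coefficient toward zero by `threshold`, zeroing small ones."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coeffs]


def denoise(signal, threshold):
    """Keep the approximation, shrink the detail coefficients, and reconstruct."""
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, soft_threshold(detail, threshold))


# Hypothetical irradiance-like readings with small sample-to-sample jitter.
noisy_series = [10.0, 10.2, 20.0, 19.8]
smoothed_series = denoise(noisy_series, threshold=1.0)
```

With the threshold set to zero the series is reconstructed unchanged, so the threshold alone controls how much fine-scale variation is treated as noise. The denoised series would then be passed to the regression model.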

Lingyu Lyu

Lingyu Lyu is a PhD candidate working with Prof. Mehmed Kantardzic in the Data Mining Lab. Her current research focuses on inferring user reliability and on truth discovery in crowdsourcing applications. Her other interests include machine learning, predictive models on time series data, and Bayesian modeling.

Datasets and framework for induction of feature drifts

Datasets for concept drift detection

The repository presents datasets used in the paper:

Sethi, Tegjyot Singh, and Mehmed Kantardzic. “On the Reliable Detection of Concept Drift from Streaming Unlabeled Data.” Expert Systems with Applications (2017).

A grid density based framework for classifying streaming data in the presence of concept drift

Tegjyot Singh Sethi, Mehmed Kantardzic and Hanquing Hu

Mining data streams is the process of extracting information from non-stopping, rapidly flowing data records to provide knowledge that is reliable and timely. Streaming data algorithms need to be one-pass and operate under strict limitations of memory and response time. In addition, the classification of streaming data requires learning in an environment where the data characteristics might change constantly. Many of the classification algorithms presented in the literature assume a 100% labeling rate, which is impractical and expensive when data records are rapidly flowing in. In this paper, a new incremental grid density based learning framework, the GC3 framework, is proposed to perform classification of streaming data with concept drift and limited labeling. The proposed framework uses grid density clustering to detect changes in the input data space. It maintains an evolving ensemble of classifiers to learn and adapt to the model changes over time. The framework also uses a uniform grid density sampling mechanism to obtain a uniform subset of samples for better classification performance with a lower labeling rate. The entire framework is designed to be one-pass and incremental, and to work with limited memory to perform any-time classification on demand. Experimental comparison with state-of-the-art concept drift handling systems demonstrates the GC3 framework's ability to provide high classification performance, using fewer models in the ensemble and with only 4-6% of the samples labeled. The results show that the GC3 framework is effective and attractive for use in real-world data stream classification applications.
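The idea of drawing a spatially uniform subset of samples from a grid can be sketched in a few lines. This is a simplified illustration, not the GC3 sampling mechanism itself, whose details are not given in the abstract: the fixed cell size, the "first points seen in each cell" rule, and the function names are all assumptions.

```python
import math


def grid_cell(point, cell_size):
    """Map a point to the integer coordinates of the grid cell containing it."""
    return tuple(math.floor(x / cell_size) for x in point)


def sample_per_cell(points, cell_size, per_cell=1):
    """Select up to `per_cell` points from each occupied grid cell in one pass.

    Dense regions contribute no more labeled samples than sparse ones, so the
    selected subset is spread roughly uniformly over the occupied input space.
    """
    counts = {}
    selected = []
    for point in points:
        cell = grid_cell(point, cell_size)
        if counts.get(cell, 0) < per_cell:
            counts[cell] = counts.get(cell, 0) + 1
            selected.append(point)
    return selected


# Hypothetical 2-D stream: three points crowd one cell, two another, one a third.
stream = [(0.1, 0.1), (0.2, 0.3), (0.15, 0.05), (1.5, 1.5), (1.6, 1.7), (-0.2, 0.1)]
to_label = sample_per_cell(stream, cell_size=1.0)
```

Here only three of the six points would be sent for labeling, one per occupied cell. Memory grows with the number of occupied cells rather than with the length of the stream, which fits the one-pass setting.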


On the Reliable Detection of Concept Drift from Streaming Unlabeled Data (Tegjyot Singh Sethi and Mehmed Kantardzic)

Classifiers deployed in the real world operate in a dynamic environment, where the data distribution can change over time. These changes, referred to as concept drift, can cause the predictive performance of the classifier to drop over time, thereby making it obsolete. To be of any real use, these classifiers need to detect drifts and be able to adapt to them over time. Detecting drifts has traditionally been approached as a supervised task, with labeled data constantly being used for validating the learned model. Although effective in detecting drifts, these techniques are impractical, as labeling is a difficult, costly and time-consuming activity. On the other hand, unsupervised change detection techniques are unreliable, as they produce a large number of false alarms. The inefficacy of the unsupervised techniques stems from the exclusion of the characteristics of the learned classifier from the detection process. In this paper, we propose the Margin Density Drift Detection (MD3) algorithm, which tracks the number of samples in the uncertainty region of a classifier as a metric to detect drift. The MD3 algorithm is a distribution-independent, application-independent, model-independent, unsupervised and incremental algorithm for reliably detecting drifts from data streams. Experimental evaluation on 6 drift-induced datasets and 4 additional datasets from the cybersecurity domain demonstrates that the MD3 approach can reliably detect drifts, with significantly fewer false alarms compared to unsupervised feature-based drift detectors. At the same time, it produces performance comparable to that of a fully labeled drift detector. The reduction in false alarms enables the signaling of drifts only when they are most likely to affect classification performance. As such, the MD3 approach leads to a detection scheme which is credible, label-efficient and general in its applicability.
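The margin density signal described above can be sketched for a linear classifier as the fraction of unlabeled samples that fall inside the margin. This is a rough illustration under stated assumptions, not the MD3 algorithm as published: the linear decision function, the fixed-window comparison, the `tolerance` threshold, and the function names are all hypothetical choices made for the example.

```python
def decision_value(sample, weights, bias):
    """Signed score of a linear classifier: w . x + b."""
    return sum(w * x for w, x in zip(weights, sample)) + bias


def margin_density(samples, weights, bias, margin=1.0):
    """Fraction of samples whose score lies inside the classifier's margin.

    No labels are needed: only the position of each sample relative to the
    decision boundary is used.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    inside = sum(
        1 for s in samples if abs(decision_value(s, weights, bias)) <= margin
    )
    return inside / len(samples)


def drift_suspected(reference_density, window, weights, bias, tolerance, margin=1.0):
    """Flag a window whose margin density departs from the reference value."""
    current = margin_density(window, weights, bias, margin)
    return abs(current - reference_density) > tolerance


# Hypothetical classifier that separates on the first feature only.
weights, bias = [1.0, 0.0], 0.0
reference_window = [(2.0, 0.0), (-3.0, 0.0), (0.5, 0.0), (4.0, 0.0)]
later_window = [(0.2, 0.0), (-0.4, 0.0), (0.9, 0.0), (3.0, 0.0)]

reference_density = margin_density(reference_window, weights, bias)
alarm = drift_suspected(reference_density, later_window, weights, bias, tolerance=0.2)
```

In the later window most samples crowd the uncertainty region, so the density shifts well beyond the tolerance and the window is flagged. A change in the data that leaves the margin untouched would not raise an alarm, which is the intuition behind the reduced false alarm rate.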


Tegjyot Singh Sethi (TJ)

TJ is a PhD candidate in the Data Mining Lab. His research interests lie in the areas of data mining, adversarial machine learning, change detection, learning with limited labeling, and data stream mining. He is a program committee member for INNS – Big Data, and has published several international conference and journal papers.

Improved fuzzy possibilistic C-means model based on quadratic distance

Abstract: Most fuzzy clustering algorithms are sensitive to the sample data set, so the same algorithm can produce very different clustering results on different kinds of data sets. To address this problem, we propose an improved fuzzy possibilistic C-means model based on quadratic distance. We analyze the features of interval-valued data and introduce a mathematical representation of interval-valued sample data. On this basis, we present three measures between interval-valued sample data and prototypes, together with the corresponding methods for computing the weight matrix, and then propose the objective function to be optimized. The iterative update functions for the centroids, memberships and typicalities are obtained by constructing the Lagrange equation, and the iteration is then proved to converge. Finally, we provide the steps of the algorithm. Experiments on three data sets of two types show that the algorithm performs well not only on point prototypes but also on interval-valued prototypes.