The volume of data collected around the world is growing exponentially, a phenomenon widely known as big data. The volume and velocity of big data are facilitating the transition of machine learning (ML), deep learning (DL) and artificial intelligence (AI) from research laboratories to real life. There are numerous other claims made about big data. Can we, however, rely on data blindly? What happens when a dataset used to train ML models has a hidden statistical paradox? Data, like fossil fuels, is valuable, but it must be refined carefully for accurate outcomes. Statistical paradoxes are hard to observe with classical data cleaning and analysis techniques, so they need to be investigated separately in training datasets. In this paper, we discuss the impact of Simpson’s paradox on categorical data and demonstrate its effects on AI and ML application scenarios. Next, we provide an algorithm to automatically identify the confounding variable and detect Simpson’s paradox within categorical datasets. The algorithm is evaluated on datasets from two real-world case studies. The outcome uncovers the existence of the paradox and indicates that Simpson’s paradox is severely harmful in automatic data analysis, especially in AI, ML and DL.
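As a rough illustration of what a reversal in categorical data looks like, the sketch below checks whether the trend between two treatment groups flips once the data is split by a candidate confounder. It is not the algorithm from the paper: the function names, the column names, and the restriction to exactly two treatment values are assumptions made for this example, and the counts are the widely cited kidney stone treatment figures.

```python
from collections import defaultdict


def success_rates(rows, keys):
    """Return the mean of the binary 'outcome' for each combination of `keys`."""
    counts = defaultdict(lambda: [0, 0])  # group -> [successes, total]
    for row in rows:
        group = tuple(row[k] for k in keys)
        counts[group][0] += row["outcome"]
        counts[group][1] += 1
    return {group: s / n for group, (s, n) in counts.items()}


def simpson_reversal(rows, treatment, confounder):
    """True if the trend in every stratum is opposite to the aggregated trend.

    Assumes exactly two treatment values and that every stratum contains both.
    """
    a, b = sorted({row[treatment] for row in rows})
    overall = success_rates(rows, [treatment])
    overall_diff = overall[(a,)] - overall[(b,)]
    by_stratum = success_rates(rows, [treatment, confounder])
    strata = sorted({row[confounder] for row in rows})
    diffs = [by_stratum[(a, s)] - by_stratum[(b, s)] for s in strata]
    return overall_diff != 0 and all(d * overall_diff < 0 for d in diffs)


def make_rows(treatment, stone, successes, total):
    """Expand aggregate counts into one record per patient."""
    hit = {"treatment": treatment, "stone": stone, "outcome": 1}
    miss = {"treatment": treatment, "stone": stone, "outcome": 0}
    return [hit] * successes + [miss] * (total - successes)


# Kidney stone example: treatment A wins in both strata but loses overall.
rows = (
    make_rows("A", "small", 81, 87)
    + make_rows("A", "large", 192, 263)
    + make_rows("B", "small", 234, 270)
    + make_rows("B", "large", 55, 80)
)
print(simpson_reversal(rows, "treatment", "stone"))
```

Requiring every stratum to reverse is a deliberately strict criterion; a looser check could flag a reversal in only some strata.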
ADBIS-2022
Detecting Simpson’s Paradox: A Step Towards Fairness in Machine Learning
In the last two decades, artificial intelligence (AI) and machine learning (ML) have grown tremendously. However, understanding and assessing the impacts of causality and statistical paradoxes remain critical challenges in these domains. Currently, these topics are widely discussed within the context of explainable AI (XAI) and algorithmic fairness, but they are not yet part of mainstream AI and ML application development. In this paper, we first discuss the impact of Simpson’s paradox on linear trends, i.e., on continuous values, and then demonstrate its effects via three benchmark training datasets used in ML. Next, we provide an algorithm for detecting Simpson’s paradox. The algorithm has been evaluated on the three datasets and appears beneficial in detecting cases of Simpson’s paradox in continuous values. In the future, the algorithm can be utilized in designing a next-generation platform for fairness in ML.
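For continuous values, a reversal shows up as a linear trend whose sign flips between the pooled data and the individual groups. The following is a minimal sketch of that idea, not the paper's algorithm: `slope`, `trend_reversal`, and the toy points are assumptions for illustration, and each group is assumed to contain at least two distinct x values.

```python
def slope(xs, ys):
    """Least-squares slope of y on x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def trend_reversal(points):
    """True if every group's slope has the opposite sign of the pooled slope.

    `points` is a list of (group, x, y) triples.
    """
    pooled = slope([x for _, x, _ in points], [y for _, _, y in points])
    groups = {}
    for group, x, y in points:
        xs, ys = groups.setdefault(group, ([], []))
        xs.append(x)
        ys.append(y)
    group_slopes = [slope(xs, ys) for xs, ys in groups.values()]
    return pooled != 0 and all(s * pooled < 0 for s in group_slopes)


# Toy data: y falls with x inside each group, yet rises across the pooled data.
points = [
    ("g1", 1, 10), ("g1", 2, 9), ("g1", 3, 8),
    ("g2", 6, 20), ("g2", 7, 19), ("g2", 8, 18),
]
print(trend_reversal(points))
```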
IEEE Access
A Novel Framework for Unification of Association Rule Mining, Online Analytical Processing and Statistical Reasoning
Measuring interestingness between data items is one of the key steps in association rule mining. Since the introduction of the classical measures (support, confidence and lift), over 40 different measures of interestingness have been published in the literature. Given this large variety, it is very difficult to select the appropriate measures in a concrete decision support scenario. In this paper, based on the diversity of measures proposed to date, we conduct a preliminary study to identify the most typical and useful roles of the measures of interestingness. Research on selecting useful measures of interestingness according to their roles will not only help to decide on optimal measures, but can also be a key factor in proposing new measures of interestingness in association rule mining.
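The three classical measures named above have standard textbook definitions, which can be sketched as follows. The function names and the toy transactions are assumptions for illustration; transactions are modelled as Python sets of items.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence relative to the baseline frequency of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
print(lift(transactions, {"bread"}, {"milk"}))
```

A lift below 1, as in this toy data, indicates that the antecedent makes the consequent less likely than its baseline frequency.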
DASFAA
Towards Unification of Statistical Reasoning, OLAP and Association Rule Mining: Semantics and Pragmatics
Over the last decades, various decision support technologies have gained massive ground in practice and theory. Of these technologies, statistical reasoning has been used widely to elucidate insights from data. Later, we have seen the emergence of online analytical processing (OLAP) and association rule mining, each of which comes with specific rationales and objectives. Unfortunately, both OLAP and association rule mining have been introduced with their own specific formalizations and terminologies. This has made it hard to reuse results from one domain in another. In particular, it is not always easy to see the potential of statistical results in OLAP and association rule mining application scenarios. This paper aims to bridge the artificial gaps between the three decision support techniques, i.e., statistical reasoning, OLAP, and association rule mining, and contributes by elaborating the semantic correspondences between their foundations, i.e., probability theory, relational algebra, and the itemset apparatus. Based on the semantic correspondences, we argue that the unification of these techniques can serve as a foundation for designing next-generation multi-paradigm data mining tools.
DASFAA
Why Not to Trust Big Data: Discussing Statistical Paradoxes
Big data is driving the growth of businesses, data is money, big data is the fuel of the twenty-first century: these and many other claims are made about big data. Can we, however, rely on big data blindly? What happens if the training dataset of a machine learning module is incorrect and contains a statistical paradox? Data, like fossil fuels, is valuable, but it must be refined carefully for the best results. Statistical paradoxes are difficult to observe in datasets, but it is important to analyse them in every dataset, small or big. In this paper, we discuss the role of statistical paradoxes in big data. Mainly, we discuss the impact of Berkson’s paradox and Simpson’s paradox on different types of data and demonstrate how they affect big data. We show that statistical paradoxes are common across a variety of data and that they lead to wrong conclusions, potentially with harmful consequences. Experiments on two real-world datasets and a case study indicate that statistical paradoxes are severely harmful to big data and automatic data analysis techniques.
Machine Learning Assisted Methodology for Multiclass Classification of Malignant Brain Tumors
The Internet of Things (IoT) is commonly employed to detect different kinds of diseases in the health sector. Presently, diseases are diagnosed using MRI images, X-rays, CT scans, and so on. The manual detection process is time-consuming and may result in detection errors that affect the diagnosis. Hence, there is a need for an automatic system, for which deep learning methods have gained major interest, and this motivates the idea of combining deep learning with disease prediction. In this research, the IoT is combined with deep learning for the effective prediction of diseases, where the IoT network is employed for data collection from the patients. The proposed cuckoo-based deep convolutional long-short term memory (deep convLSTM) classifier is employed for disease prediction, where cuckoo search optimization is utilized for tuning the deep convLSTM classifier. The proposed method is compared with the conventional methods, and it achieved a training percentage of 97.591%, 95.874%, and 97.094%, respectively, for accuracy, sensitivity, and specificity. The comparative analysis showed that the proposed method obtained higher accuracy than other methods.
2021
Springer
A Systematic Assessment of Numerical Association Rule Mining Methods
In data mining, the classical association rule mining techniques deal with binary attributes; however, real-world data have a variety of attributes (numerical, categorical, Boolean). To deal with this variety of data attributes, the classical association rule mining technique was extended to numerical association rule mining. Initially, numerical association rule mining started with the discretization method; later, many other methods, e.g., optimization and distribution, were proposed in the literature. Different authors have presented various algorithms for each numerical association rule mining method; therefore, it is hard to select a suitable algorithm for a numerical association rule mining task. In this article, we present a systematic assessment of various numerical association rule mining methods and provide a meta-study of thirty numerical association rule mining algorithms. We investigate how far the discretization techniques have been used in the numerical association rule mining methods.
BDA
Impact-Driven Discretization of Numerical Factors: Case of Two- and Three-Partitioning
Kaushik, Minakshi, Sharma, Rahul, Peious, Sijo Arakkal, and Draheim, Dirk
Many real-world data sets contain a mix of various types of data, i.e., binary, numerical, and categorical; however, many data mining and machine learning (ML) algorithms, e.g., association rule mining, work only with discrete values. Therefore, the discretization process plays an essential role in data mining and ML. In state-of-the-art data mining and ML, different discretization techniques are used to convert numerical attributes into discrete attributes. However, existing discretization techniques do not best reflect the impact of the independent numerical factor on the dependent numerical target factor. This paper proposes and compares two novel measures for order-preserving partitioning of numerical factors that we call Least Squared Ordinate-Directed Impact Measure and Least Absolute-Difference Ordinate-Directed Impact Measure. The main aim of these measures is to optimally reflect the impact of a numerical factor on another numerical target factor. We implement the proposed measures for two-partitions and three-partitions. We evaluate the performance of the proposed measures by comparison with human-perceived cut-points. We use twelve synthetic data sets and one real-world data set for the evaluation, i.e., school teacher salaries from New Jersey (NJ). As a result, we find that the proposed measures are useful in finding the best cut-points perceived by humans.
BDA
Big Data Analytics in Association Rule Mining: A Systematic Literature Review
Shahin, Mahtab, Arakkal Peious, Sijo, Sharma, Rahul, Kaushik, Minakshi, Ben Yahia, Sadok, Shah, Syed Attique, and Draheim, Dirk
In 3rd International Conference on Big Data Engineering and Technology (BDET) Jan 2021
DaWaK
Grand Reports: A Tool for Generalizing Association Rule Mining to Numeric Target Values
Arakkal Peious, Sijo, Sharma, Rahul, Kaushik, Minakshi, Shah, Syed Attique, and Yahia, Sadok Ben
In Big Data Analytics and Knowledge Discovery Jan 2020
Since its introduction in the 1990s, association rule mining (ARM) has proven to be one of the essential concepts in data mining, both in practice and in research. Discretization is the only means of dealing with a numeric target column in today’s association rule mining tools. However, domain experts and decision-makers are used to arguing in terms of mean values when it comes to numeric target values. In this paper, we provide a tool that reports mean values of a chosen numeric target column for all possible combinations of influencing factors, so-called grand reports. We give an in-depth explanation of the functionalities of the proposed tool. Furthermore, we compare the capabilities of the tool with one of the leading association rule mining tools, i.e., RapidMiner. Moreover, the study delves into the motivation of grand reports and offers some useful insight into their theoretical foundation.
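One reading of the grand report idea can be sketched as follows: for every subset of the influencing factors, and every combination of values those factors take in the data, report the mean of the numeric target and the number of matching rows. This is an illustration under that assumption only, not the tool's implementation; `grand_report` and the salary records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean


def grand_report(rows, factors, target):
    """Map each observed factor/value combination to (mean target, row count)."""
    report = {}
    for size in range(len(factors) + 1):
        for subset in combinations(factors, size):
            groups = defaultdict(list)
            for row in rows:
                groups[tuple(row[f] for f in subset)].append(row[target])
            for values, targets in groups.items():
                # Key is a tuple of (factor, value) pairs; () means "all rows".
                report[tuple(zip(subset, values))] = (mean(targets), len(targets))
    return report


# Hypothetical records with two influencing factors and a numeric target.
rows = [
    {"dept": "A", "senior": True, "salary": 50},
    {"dept": "A", "senior": False, "salary": 30},
    {"dept": "B", "senior": True, "salary": 70},
    {"dept": "B", "senior": False, "salary": 50},
]
for key, (avg, count) in grand_report(rows, ["dept", "senior"], "salary").items():
    print(key, avg, count)
```

The number of factor subsets doubles with each added factor, so a practical tool would need to bound or prune the enumeration.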
FDSE
On the Potential of Numerical Association Rule Mining
Kaushik, Minakshi, Sharma, Rahul, Peious, Sijo Arakkal, Shahin, Mahtab, Ben Yahia, Sadok, and Draheim, Dirk
In Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications Jan 2020
In association rule mining, both the classical algorithms and today’s available tools either use binary data items or discretized data. However, in real-world scenarios, data are available in many different forms (numerical, text) and these types of data items are not supported in the classical association rule mining algorithms. There are some association rule mining algorithms that have been proposed for numerical data items but unfortunately, for working data scientists and decision makers, it is challenging to find concrete algorithms that fit their purposes best. Therefore, it is highly desired to have a study on the different existing numerical association rule mining algorithms (NARM). In this paper, we provide such a detailed study by thoroughly reviewing 24 NARM algorithms from different categories (optimization, discretization, distribution).