The volume of data collected around the world is growing exponentially, a phenomenon popularly known as big data. The volume and velocity of big data are facilitating the transition of machine learning (ML), deep learning (DL) and artificial intelligence (AI) from research laboratories to real life. There are numerous other claims made about big data. Can we, however, rely on data blindly? What happens when a dataset used to train ML models has a hidden statistical paradox? Data, like fossil fuels, is valuable, but it must be refined carefully for accurate outcomes. Statistical paradoxes are hard to observe with classical data cleaning and analysis techniques, so they need to be investigated separately in training datasets. In this paper, we discuss the impact of Simpson’s paradox on categorical data and demonstrate its effects on AI and ML application scenarios. Next, we provide an algorithm to automatically identify the confounding variable and detect Simpson’s paradox within categorical datasets. The algorithm is evaluated on datasets from two real-world case studies. The results uncover the existence of the paradox and indicate that Simpson’s paradox is severely harmful in automatic data analysis, especially in AI, ML and DL.
@inproceedings{dexa2022,author={Sharma, Rahul and Garayev, Huseyn and Kaushik, Minakshi and Peious, Sijo Arakkal and Tiwari, Prayag and Draheim, Dirk},editor={Strauss, Christine and Cuzzocrea, Alfredo and Kotsis, Gabriele and Tjoa, A. Min and Khalil, Ismail},title={Detecting Simpson's Paradox: A Machine Learning Perspective},booktitle={Database and Expert Systems Applications},year={2022},publisher={Springer International Publishing},address={Cham},pages={323--335},abbr={DEXA-2022},selected={true},bibtex_show={true},isbn={978-3-031-12423-5}}
ADBIS-2022
Detecting Simpson’s Paradox: A Step Towards Fairness in Machine Learning
In the last two decades, artificial intelligence (AI) and machine learning (ML) have grown tremendously. However, understanding and assessing the impacts of causality and statistical paradoxes remain critical challenges in these domains. Currently, these topics are widely discussed within the context of explainable AI (XAI) and algorithmic fairness. However, they are still not part of mainstream AI and ML application development. In this paper, first, we discuss the impact of Simpson’s paradox on linear trends, i.e., on continuous values, and then we demonstrate its effects via three benchmark training datasets used in ML. Next, we provide an algorithm for detecting Simpson’s paradox. The algorithm has been evaluated on the three datasets and appears beneficial in detecting cases of Simpson’s paradox in continuous values. In the future, the algorithm can be utilized in designing a next-generation platform for fairness in ML.
@inproceedings{ADBIS2022,author={Sharma, Rahul and Kaushik, Minakshi and Peious, Sijo Arakkal and Bertl, Markus and Vidyarthi, Ankit and Kumar, Ashwani and Draheim, Dirk},title = {Detecting Simpson's Paradox: A Step Towards Fairness in Machine Learning},booktitle = {New Trends in Database and Information Systems},year = {2022},publisher = {Springer International Publishing},address = {Cham},pages = {67--76},abbr = {ADBIS-2022},selected = {true},bibtex_show = {true},isbn = {978-3-031-15743-1}}
IEEE Access
A Novel Framework for Unification of Association Rule Mining, Online Analytical Processing and Statistical Reasoning
Measuring interestingness between data items is one of the key steps in association rule mining. Since the introduction of the classical measures (support, confidence and lift), over 40 different measures of interestingness have been published in the literature. Given the large variety of proposed measures, it is very difficult to select the appropriate measures in a concrete decision support scenario. In this paper, based on the diversity of measures proposed to date, we conduct a preliminary study to identify the most typical and useful roles of the measures of interestingness. The research on selecting useful measures of interestingness according to their roles will not only help to decide on optimal measures of interestingness, but can also be a key factor in proposing new measures of interestingness in association rule mining.
@article{RahulA1,author={Sharma, Rahul and Kaushik, Minakshi and Peious, Sijo Arakkal and Bazin, Alexandre and Shah, Syed Attique and Fister, Iztok and Yahia, Sadok Ben and Draheim, Dirk},journal={IEEE Access},title={A Novel Framework for Unification of Association Rule Mining, Online Analytical Processing and Statistical Reasoning},year={2022},volume={10},pages={12792--12813},doi={10.1109/ACCESS.2022.3142537},abbr={IEEE Access},bibtex_show={true},pdf={RahuAl.pdf},html={https://ieeexplore.ieee.org/document/9678347}}
DASFAA
Towards Unification of Statistical Reasoning, OLAP and Association Rule Mining: Semantics and Pragmatics
Over the last decades, various decision support technologies have gained massive ground in practice and theory. Among these technologies, statistical reasoning was used widely to elucidate insights from data. Later, we have seen the emergence of online analytical processing (OLAP) and association rule mining, which both come with specific rationales and objectives. Unfortunately, both OLAP and association rule mining have been introduced with their own specific formalizations and terminologies. This has made, and still makes, it hard to reuse results from one domain in another. In particular, it is not always easy to see the potential of statistical results in OLAP and association rule mining application scenarios. This paper aims to bridge the artificial gaps between the three decision support techniques, i.e., statistical reasoning, OLAP, and association rule mining, and contributes by elaborating the semantic correspondences between their foundations, i.e., probability theory, relational algebra, and the itemset apparatus. Based on the semantic correspondences, we argue that the unification of these techniques can serve as a foundation for designing next-generation multi-paradigm data mining tools.
@inproceedings{dasfaa2022,author={Sharma, Rahul and Kaushik, Minakshi and Peious, Sijo Arakkal and Shahin, Mahtab and Vidhyarthi, Ankit and Draheim, Dirk},title={Towards Unification of Statistical Reasoning, OLAP and Association Rule Mining: Semantics and Pragmatics},booktitle={Database Systems for Advanced Applications},year={2022},publisher={Springer International Publishing},address={Cham},selected={true},abbr={DASFAA},bibtex_show={true}}
DASFAA
Why Not to Trust Big Data: Discussing Statistical Paradoxes
Big data is driving the growth of businesses, data is the money, big data is the fuel of the twenty-first century, and there are many other claims about big data. Can we, however, rely on big data blindly? What happens if the training data set of a machine learning module is incorrect and contains a statistical paradox? Data, like fossil fuels, is valuable, but it must be refined carefully for the best results. Statistical paradoxes are difficult to observe in datasets, but they are important to analyse in every dataset, small or big. In this paper, we discuss the role of statistical paradoxes in big data. Mainly, we discuss the impact of Berkson’s paradox and Simpson’s paradox on different types of data and demonstrate how they affect big data. We show that statistical paradoxes are common across a variety of data and that they lead to wrong conclusions with potentially harmful consequences. Experiments on two real-world datasets and a case study indicate that statistical paradoxes are severely harmful to big data and automatic data analysis techniques.
@inproceedings{RhulA4,author={Sharma, Rahul and Kaushik, Minakshi and Peious, Sijo Arakkal and Shahin, Mahtab and Vidhyarthi, Ankit and Tiwari, Prayag and Draheim, Dirk},title={Why Not to Trust Big Data: Discussing Statistical Paradoxes},booktitle={Database Systems for Advanced Applications},year={2022},publisher={Springer International Publishing},address={Cham},selected={true},abbr={DASFAA},bibtex_show={true}}
Machine Learning Assisted Methodology for Multiclass Classification of Malignant Brain Tumors
The Internet of Things (IoT) is commonly employed to detect different kinds of diseases in the health sector. Presently, diseases are diagnosed using MRI images, X-rays, CT scans, and so on. The manual detection process is time-consuming and may result in detection errors that affect the diagnosis. Hence, there is a need for an automatic system, for which deep learning methods have attracted major interest. This motivates the idea of combining deep learning and disease prediction to predict diseases effectively. In this research, deep learning is combined with disease prediction, where the IoT network is employed to collect data from the patients. The proposed cuckoo-based deep convolutional long-short term memory (deep convLSTM) classifier is employed for disease prediction, where cuckoo search optimization is utilized for tuning the deep convLSTM classifier. The proposed method is compared with the conventional methods, and it achieved a training percentage of 97.591%, 95.874%, and 97.094%, respectively, for accuracy, sensitivity, and specificity. The comparative analysis showed that the proposed method obtained higher accuracy than other methods.
2021
Springer
A Systematic Assessment of Numerical Association Rule Mining Methods
In data mining, the classical association rule mining techniques deal with binary attributes; however, real-world data have a variety of attributes (numerical, categorical, Boolean). To deal with this variety of data attributes, the classical association rule mining technique was extended to numerical association rule mining. Initially, numerical association rule mining started with the discretization method, and later, many other methods, e.g., optimization and distribution, were proposed in the literature. Different authors have presented various algorithms for each numerical association rule mining method; therefore, it is hard to select a suitable algorithm for a numerical association rule mining task. In this article, we present a systematic assessment of various numerical association rule mining methods and we provide a meta-study of thirty numerical association rule mining algorithms. We investigate how far the discretization techniques have been used in the numerical association rule mining methods.
@article{Minakshi2,title={A Systematic Assessment of Numerical Association Rule Mining Methods},volume={2},issn={2661-8907},url={https://doi.org/10.1007/s42979-021-00725-2},doi={10.1007/s42979-021-00725-2},pages={348},number={5},journaltitle={SN Computer Science},author={Kaushik, Minakshi and Sharma, Rahul and Peious, Sijo Arakkal and Shahin, Mahtab and Yahia, Sadok Ben and Draheim, Dirk},date={2021-06-22},bibtex_show={true},html={https://doi.org/10.1007/s42979-021-00725-2},abbr={Springer}}
BDA
Impact-Driven Discretization of Numerical Factors: Case of Two- and Three-Partitioning
Kaushik, Minakshi, Sharma, Rahul, Peious, Sijo Arakkal, and Draheim, Dirk
Many real-world data sets contain a mix of various types of data, i.e., binary, numerical, and categorical; however, many data mining and machine learning (ML) algorithms, e.g., association rule mining, work only with discrete values. Therefore, the discretization process plays an essential role in data mining and ML. In state-of-the-art data mining and ML, different discretization techniques are used to convert numerical attributes into discrete attributes. However, existing discretization techniques do not best reflect the impact of the independent numerical factor on the dependent numerical target factor. This paper proposes and compares two novel measures for order-preserving partitioning of numerical factors that we call Least Squared Ordinate-Directed Impact Measure and Least Absolute-Difference Ordinate-Directed Impact Measure. The main aim of these measures is to optimally reflect the impact of a numerical factor on another numerical target factor. We implement the proposed measures for two-partitions and three-partitions. We evaluate the performance of the proposed measures by comparison with human-perceived cut-points. We use twelve synthetic data sets and one real-world data set for the evaluation, i.e., school teacher salaries from New Jersey (NJ). As a result, we find that the proposed measures are useful in finding the best cut-points perceived by humans.
@inproceedings{Minakshi3,author={Kaushik, Minakshi and Sharma, Rahul and Peious, Sijo Arakkal and Draheim, Dirk},editor={Srirama, Satish Narayana and Lin, Jerry Chun-Wei and Bhatnagar, Raj and Agarwal, Sonali and Reddy, P. Krishna},title={Impact-Driven Discretization of Numerical Factors: Case of Two- and Three-Partitioning},booktitle={Big Data Analytics},year={2021},publisher={Springer International Publishing},address={Cham},pages={244--260},isbn={978-3-030-93620-4},bibtex_show={true},abbr={BDA},url={https://link.springer.com/chapter/10.1007/978-3-030-93620-4_18},html={https://link.springer.com/chapter/10.1007/978-3-030-93620-4_18}}
BDA
Big Data Analytics in Association Rule Mining: A Systematic Literature Review
Shahin, Mahtab, Arakkal Peious, Sijo, Sharma, Rahul, Kaushik, Minakshi, Ben Yahia, Sadok, Shah, Syed Attique, and Draheim, Dirk
In 3rd International Conference on Big Data Engineering and Technology (BDET) Jan 2021
@inproceedings{Mahtab1,author={Shahin, Mahtab and Arakkal Peious, Sijo and Sharma, Rahul and Kaushik, Minakshi and Ben Yahia, Sadok and Shah, Syed Attique and Draheim, Dirk},title={Big Data Analytics in Association Rule Mining: A Systematic Literature Review},year={2021},isbn={9781450389280},publisher={Association for Computing Machinery},address={New York, NY, USA},doi={10.1145/3474944.3474951},abstract={Due to the rapid impact of IT technology, data across the globe is growing exponentially as compared to the last decade. Therefore, the efficient analysis and application of big data require special technologies. The present study performs a systematic literature review to synthesize recent research on the applicability of big data analytics in association rule mining (ARM). Our research strategy identified 4797 scientific articles, 27 of which were identified as primary papers relevant to our research. We have extracted data from these papers to identify various technologies and algorithms of using big data in association rule mining and identified their limitations in regards to the big data categories (volume, velocity, variety, and veracity).},booktitle={3rd International Conference on Big Data Engineering and Technology (BDET)},pages={40--49},numpages={10},bibtex_show={true},abbr={BDA},url={https://dl.acm.org/doi/abs/10.1145/3474944.3474951},html={https://dl.acm.org/doi/abs/10.1145/3474944.3474951}}
ICECube
Cluster-Based Association Rule Mining for an Intersection Accident Dataset
Shahin, Mahtab, Saeidi, Soheila, Shah, Syed Attique, Kaushik, Minakshi, Sharma, Rahul, Peious, Sijo A., and Draheim, Dirk
In International Conference on Computing, Electronic and Electrical Engineering (ICE Cube) Jan 2021
@inproceedings{Mahtab2,author={Shahin, Mahtab and Saeidi, Soheila and Shah, Syed Attique and Kaushik, Minakshi and Sharma, Rahul and Peious, Sijo A. and Draheim, Dirk},booktitle={International Conference on Computing, Electronic and Electrical Engineering (ICE Cube)},title={Cluster-Based Association Rule Mining for an Intersection Accident Dataset},year={2021},pages={1-6},doi={10.1109/ICECube53880.2021.9628206},bibtex_show={true},abbr={ICECube},url={https://ieeexplore.ieee.org/abstract/document/9628206},html={https://ieeexplore.ieee.org/abstract/document/9628206}}
2020
DaWaK
Expected vs. Unexpected: Selecting Right Measures of Interestingness
Measuring interestingness between data items is one of the key steps in association rule mining. Since the introduction of the classical measures (support, confidence and lift), over 40 different measures of interestingness have been published in the literature. Given the large variety of proposed measures, it is very difficult to select the appropriate measures in a concrete decision support scenario. In this paper, based on the diversity of measures proposed to date, we conduct a preliminary study to identify the most typical and useful roles of the measures of interestingness. The research on selecting useful measures of interestingness according to their roles will not only help to decide on optimal measures of interestingness, but can also be a key factor in proposing new measures of interestingness in association rule mining.
@inproceedings{RahulP1,author={Sharma, Rahul and Kaushik, Minakshi and Peious, Sijo Arakkal and Yahia, Sadok Ben and Draheim, Dirk},editor={Song, Min and Song, Il-Yeol and Kotsis, Gabriele and Tjoa, A. Min and Khalil, Ismail},title={Expected vs. Unexpected: Selecting Right Measures of Interestingness},booktitle={Big Data Analytics and Knowledge Discovery},year={2020},publisher={Springer International Publishing},address={Cham},pages={38--47},isbn={978-3-030-59065-9},pdf={dawak2020.pdf},selected={true},bibtex_show={true},abbr={DaWaK},url={https://dl.acm.org/doi/abs/10.1145/3474944.3474951},html={https://dl.acm.org/doi/abs/10.1145/3474944.3474951}}
DaWaK
Grand Reports: A Tool for Generalizing Association Rule Mining to Numeric Target Values
Arakkal Peious, Sijo, Sharma, Rahul, Kaushik, Minakshi, Shah, Syed Attique, and Yahia, Sadok Ben
In Big Data Analytics and Knowledge Discovery Jan 2020
Since its introduction in the 1990s, association rule mining (ARM) has proven to be one of the essential concepts in data mining, both in practice and in research. Discretization is the only means to deal with a numeric target column in today’s association rule mining tools. However, domain experts and decision-makers are used to arguing in terms of mean values when it comes to numeric target values. In this paper, we provide a tool that reports mean values of a chosen numeric target column for all possible combinations of influencing factors, so-called grand reports. We give an in-depth explanation of the functionalities of the proposed tool. Furthermore, we compare the capabilities of the tool with one of the leading association rule mining tools, i.e., RapidMiner. Moreover, the study delves into the motivation of grand reports and offers some useful insight into their theoretical foundation.
@inproceedings{sijo,author={Arakkal Peious, Sijo and Sharma, Rahul and Kaushik, Minakshi and Shah, Syed Attique and Yahia, Sadok Ben},editor={Song, Min and Song, Il-Yeol and Kotsis, Gabriele and Tjoa, A. Min and Khalil, Ismail},title={Grand Reports: A Tool for Generalizing Association Rule Mining to Numeric Target Values},booktitle={Big Data Analytics and Knowledge Discovery},year={2020},publisher={Springer International Publishing},address={Cham},pages={28--37},isbn={978-3-030-59065-9},bibtex_show={true},abbr={DaWaK}}
FDSE
On the Potential of Numerical Association Rule Mining
Kaushik, Minakshi, Sharma, Rahul, Peious, Sijo Arakkal, Shahin, Mahtab, Ben Yahia, Sadok, and Draheim, Dirk
In Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications Jan 2020
In association rule mining, both the classical algorithms and today’s available tools use either binary data items or discretized data. However, in real-world scenarios, data are available in many different forms (numerical, text), and these types of data items are not supported by the classical association rule mining algorithms. Some association rule mining algorithms have been proposed for numerical data items, but unfortunately, for working data scientists and decision makers, it is challenging to find concrete algorithms that best fit their purposes. Therefore, a study of the existing numerical association rule mining (NARM) algorithms is highly desirable. In this paper, we provide such a detailed study by thoroughly reviewing 24 NARM algorithms from different categories (optimization, discretization, distribution).
@inproceedings{Minakshi1,author={Kaushik, Minakshi and Sharma, Rahul and Peious, Sijo Arakkal and Shahin, Mahtab and Ben Yahia, Sadok and Draheim, Dirk},title={On the Potential of Numerical Association Rule Mining},booktitle={Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications},year={2020},publisher={Springer Singapore},address={Singapore},pages={3--20},isbn={978-981-33-4370-2},bibtex_show={true},abbr={FDSE},url={https://link.springer.com/chapter/10.1007/978-981-33-4370-2_1},html={https://link.springer.com/chapter/10.1007/978-981-33-4370-2_1}}