Improving Pipelining Tools for Pre-processing Data.

María Novo Lourés; Yeray Lage; Reyes Pavón; Rosalía Laza; David Ruano Ordás; José Ramón Méndez

doi:10.9781/ijimai.2021.10.004

Authors

María Novo Lourés Universidade de Vigo
Yeray Lage Universidade de Vigo
Reyes Pavón Universidade de Vigo
Rosalía Laza Universidade de Vigo
David Ruano Ordás Universidade de Vigo
José Ramón Méndez Universidade de Vigo

DOI:

https://doi.org/10.9781/ijimai.2021.10.004

Keywords:

Burst Processing, Data Pre-processing, Java, Pipeline Frameworks

Supporting Agencies

D. Ruano-Ordás was supported by a post-doctoral fellowship from Xunta de Galicia (ED481D-2021/024). Additionally, this work was funded by the project Semantic Knowledge Integration for Content-Based Spam Filtering [grant number TIN2017-84658-C2-1-R] from the Spanish Ministry of Economy, Industry and Competitiveness (SMEIC), State Research Agency (SRA) and the European Regional Development Fund (ERDF); and Consellería de Educación, Universidades e Formación Profesional (Xunta de Galicia) under the scope of the strategic funding of Competitive Reference Group [grant number ED431C2018/55-GRC]. The SING group thanks CITI (Centro de Investigación, Transferencia e Innovación) from the University of Vigo for hosting its IT infrastructure.

Abstract

Over the last several years, data mining has emerged and become a powerful tool that adds value to business and research. It makes it possible to explore and find unseen connections between variables and facts observed in different domains, helping us to better understand reality. The programming methods and frameworks used to analyse data have evolved over time. Pipelining schemes are currently the most reliable way of analysing data, and for this reason several important companies now offer this kind of service. Moreover, several frameworks compatible with different programming languages are available for the development of computational pipelines, and many research studies have addressed the optimization of data processing speed. However, as this study shows, early error detection techniques and developer support mechanisms are very limited in these frameworks. In this context, this study introduces several improvements, such as the design of different types of constraints for the early detection of errors, the creation of functions to facilitate debugging of concrete tasks included in a pipeline, the invalidation of erroneous instances, and the introduction of the burst-processing scheme. By adding these functionalities, we developed Big Data Pipelining for Java (BDP4J, https://github.com/sing-group/bdp4j), a fully functional new pipelining framework that shows the potential of these features.
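
The ideas named in the abstract can be illustrated with a small, self-contained Java sketch. It is not BDP4J's actual API: every class and method name below (PipelineSketch, Instance, Task, Pipeline, check, process) is hypothetical and chosen only for illustration. The sketch assumes that an "early constraint" can be modelled as a type-compatibility check between consecutive tasks, that an erroneous instance is one whose task throws a runtime exception, and that a batch is pushed through the pipeline one task at a time, which is only one loose reading of burst processing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class PipelineSketch {

    /** A unit of data flowing through the pipeline; it can be marked invalid. */
    static final class Instance {
        private Object data;
        private boolean valid = true;

        Instance(Object data) { this.data = data; }
        Object getData() { return data; }
        void setData(Object data) { this.data = data; }
        boolean isValid() { return valid; }
        void invalidate() { valid = false; }
    }

    /** A task that declares its input and output types so they can be checked early. */
    static final class Task {
        final String name;
        final Class<?> inputType;
        final Class<?> outputType;
        final Function<Object, Object> body;

        Task(String name, Class<?> inputType, Class<?> outputType,
             Function<Object, Object> body) {
            this.name = name;
            this.inputType = inputType;
            this.outputType = outputType;
            this.body = body;
        }
    }

    static final class Pipeline {
        private final List<Task> tasks = new ArrayList<>();

        Pipeline add(Task task) {
            tasks.add(task);
            return this;
        }

        /** Early check (assumed constraint): each task must accept the previous task's output type. */
        boolean check() {
            for (int i = 1; i < tasks.size(); i++) {
                Class<?> produced = tasks.get(i - 1).outputType;
                Class<?> expected = tasks.get(i).inputType;
                if (!expected.isAssignableFrom(produced)) {
                    return false;
                }
            }
            return true;
        }

        /** Runs the whole batch task by task and returns only the instances that stayed valid. */
        List<Instance> process(List<Instance> batch) {
            if (!check()) {
                throw new IllegalStateException("Type constraints violated; nothing was executed");
            }
            for (Task task : tasks) {
                for (Instance instance : batch) {
                    if (!instance.isValid()) {
                        continue; // skip instances invalidated by an earlier task
                    }
                    try {
                        instance.setData(task.body.apply(instance.getData()));
                    } catch (RuntimeException e) {
                        instance.invalidate(); // erroneous instance: drop it, keep the rest
                    }
                }
            }
            List<Instance> survivors = new ArrayList<>();
            for (Instance instance : batch) {
                if (instance.isValid()) {
                    survivors.add(instance);
                }
            }
            return survivors;
        }
    }

    public static void main(String[] args) {
        Pipeline pipeline = new Pipeline()
            .add(new Task("trim", String.class, String.class, s -> ((String) s).trim()))
            .add(new Task("parse", String.class, Integer.class, s -> Integer.parseInt((String) s)));

        List<Instance> batch = new ArrayList<>();
        batch.add(new Instance(" 1 "));
        batch.add(new Instance("not a number")); // will be invalidated by "parse"
        batch.add(new Instance("3"));

        System.out.println("Valid instances: " + pipeline.process(batch).size());
    }
}
```

In this sketch, a pipeline whose tasks are ordered incompatibly (for example, "parse" before "trim") is rejected by check() before any instance is touched, while a malformed input only invalidates its own instance and leaves the rest of the batch intact.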
