Open data for the development of machine learning applications in industry

The hidden goldmine?

Attachments (GitHub):

  1. Table review of data search engine usability for searching open data
  2. Data collection for the development of machine learning applications in industry

We are witnessing the breakthrough of data. It is so significant that I believe future historians will name the 21st century the “era of data (and the world’s sixth extinction)”. My recent work on open data made me wonder whether we are riding the crest of a wave or standing between a chasm and a mountain.

The IDC 2020 study shows the biggest challenges for companies on the path toward incorporating artificial intelligence: data and information management; regulatory change; cost and budget concerns; scarcity of talent in data science, engineering, and solution development; and challenges in security and privacy. Numerous companies still operate through separate processes, technologies, teams, and projects. This makes it difficult to solve challenges and to understand the value of investing in data, information management, and artificial intelligence. In AI projects, goals are often not set, or the projects do not scale. (Hamel 2021)

Why do companies face these challenges? In a white paper published by IDC (2017), David Reinsel, John Gantz, and John Rydning forecast the volume of data produced and consumed globally (Figure 1). The forecast has since grown because of the COVID-19 pandemic, and it seems clear that the amount of data will increase by a factor of 10 between 2014 and 2025. The amount of data will likely reach 180 zettabytes. (Holst 2021; Reinsel et al. 2017)
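To get a feeling for what a tenfold increase between 2014 and 2025 means, the forecast can be turned into an implied yearly growth rate. The following is a minimal sketch in Python. It assumes smooth compound growth over the whole period, which is my own simplification and not something the IDC paper states, and the function name is hypothetical.

```python
def implied_annual_growth(factor: float, years: int) -> float:
    """Return the constant yearly growth rate that multiplies a quantity
    by `factor` over `years` years (compound growth assumption)."""
    if factor <= 0 or years <= 0:
        raise ValueError("factor and years must be positive")
    return factor ** (1.0 / years) - 1.0


# Tenfold growth between 2014 and 2025, as cited in the text above.
rate = implied_annual_growth(10.0, 2025 - 2014)
print(f"Implied yearly growth: {rate:.1%}")
```

Under that assumption, the volume of data grows by a little over a fifth every single year, which helps explain why storage and processing capacity struggle to keep up.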

Figure 1. Globally produced and consumed data (Reinsel et al. 2017)

However, only a small fraction of this data is retained. According to IDC, global storage capacity in 2021 will be about 6.8 zettabytes. (IDC 2020; Bhat 2018; Vijesh et al. 2021)

Companies are facing a new challenge that can be described with another well-known example: industrial mass production. Climate-related discussion and a shortage of storage space have caused a change in lifestyles and consumption patterns. We have begun to question what is truly important. Nevertheless, we are still struggling with waste and unused goods. Data seems to behave similarly.

We need to understand what data is important. The company that is capable of producing and utilizing valuable data will stand out. The barriers revealed in the IDC study suggest that only a few have the resources for this (Hamel 2021).

When discussing valuable data, it must be clarified that valuable data does not by itself bring value to the company. Data analysis is needed to turn data into valuable information. The process is challenging, as data is extremely variable in form and properties. In addition, you must decide how to process the data and where and how to use the resulting information. Predictions from data are increasingly made with machine learning methods; however, data collection has become one of the bottlenecks in the development of machine learning. Collecting, selecting, and organizing data is the most time-consuming part of a project. As a result, fewer resources remain for generating valuable information from the data. (Roh et al. 2021; Ismail et al. 2019; Vijesh et al. 2021; Ogbuke et al. 2020; Azimi et al. 2020)

In addition, transferring methods developed in an academic environment to industry is challenging. The conditions of industrial production change, and each factory is unique. Long-term production cycles, individual processes, and data accuracy, format, and speed require solutions that fit the environment and can be retrained. In conclusion, locally produced solutions are difficult to reuse. (Azimi et al. 2020 p. 582; Zeiser et al. 2021 p. 599)

Companies are trying to solve their data challenges by hiring experts as if they were hiring home cleaners for their mansion. What does open data have to offer in this world’s biggest race?

Let’s proceed with the idea of industrial mass production and waste. Companies’ virtual storage and overflow storage are bursting with unused goods. One company hires one virtual cleaner, a second can hire hundreds of virtual cleaners, a third could hire an entire virtual cleaning company, and a few hire an army of virtual cleaning companies.

Does the hiring company need to understand how cleaning is done? What are these goods in the storage and how are they utilized? Do you think that the companies with only a few cleaners will disappear under the trash?

According to Jed Sundwall (2018), “People don’t want data, they just want answers.” This is probably true, but how can one cleaner be as fast as an entire cleaning company? How does this affect quality?

There is no shortcut to happiness. With a single cleaner, quality is poor and there is no way to be equally fast. For this reason, you must lure the neighborhood to help you: people who work with open or semi-open data.

I have examined open data that is suitable for developing machine learning applications in industry. The examination revealed that companies protect their data strictly, even though they experience a lack of resources. However, these resources are critical for understanding and utilizing the data. At the same time, there is a lack of high-quality training data in the development of machine learning (Roh et al. 2021). For example, if “perfect”, error-free data is used to train a machine learning model, transferring the model to a real and unstable industrial environment might become infeasible.
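One way to approach the problem of overly clean training data is to degrade it on purpose before training. The sketch below is only an illustration under my own assumptions: the function name, the noise level, and the share of dropped readings are hypothetical and not taken from any of the cited sources.

```python
import random
from typing import List, Optional


def degrade_readings(
    clean: List[float],
    noise_std: float,
    dropout_rate: float,
    seed: int = 0,
) -> List[Optional[float]]:
    """Add Gaussian noise to each reading and randomly drop some readings
    (replaced by None) to imitate unstable industrial sensors."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    degraded: List[Optional[float]] = []
    for value in clean:
        if rng.random() < dropout_rate:
            degraded.append(None)  # simulated missing measurement
        else:
            degraded.append(value + rng.gauss(0.0, noise_std))
    return degraded


# Hypothetical, perfectly smooth temperature curve.
clean_temperatures = [70.0 + 0.1 * i for i in range(50)]
noisy_temperatures = degrade_readings(
    clean_temperatures, noise_std=0.5, dropout_rate=0.1, seed=42
)
```

A model trained on the degraded series has at least seen noise and gaps before it meets them on the factory floor. Whether the simulated disturbances resemble the real ones still has to be checked with people who know the process.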

Companies are not willing to share data, but are they interested in cooperative utilization? What sort of data is more beneficial to competitors than to a company whose data is publicly researched, organized, and processed into information? Do applications need to be public? Does the company see who has applied the data and how? How likely is it that similar data will become available sooner or later? Companies are only interested in results. How do we ensure that publishing data is more attractive than improper use or destruction?

There is no open channel to meet the needs of industrial companies and researchers.

While data producers are responsible for data quality and data access rights, data dealers are important in terms of data usability. It appears that even reputable sources have significant shortcomings in enabling efficient data reuse. Therefore, even those who want to provide their data for further use and do the groundwork to a high standard face difficulties in passing the data on. There are services and several ongoing projects (for example GAIA-X and IDS) that enable safe data transfer. However, these services focus on smooth data transfer instead of motivating partners, researchers, and individuals to provide, search for, and utilize data across stakeholders. Hackathons, data search engines, data collections, and data banks try to offer solutions in these areas. Still, there seems to be a considerable mismatch between stakeholders’ demands and the services offered. If stakeholders are not fully served, the result is incomplete utilization of data or even the loss of potential data. Data marketing is essential if we want to maximize the potential of data or use open data systematically in research. It is unfortunate if the only mistake in near-perfect data production is that the results get trashed.

The growing amount of data, fundamental challenges, and inconsistencies in data discoverability indicate that the majority of the potential of open data is unused. I have been searching for data with three main methods: data search engines, data collections, and data banks. The sources I have used can be found here. You can also examine the reviewed features these sources provide for sorting and evaluating data. Usable data sources that I have collected can be found here. There is a wide range of data, as the needs of industry vary widely. The most interesting data sets are widely reusable (for example error detection, human movements). The goal was to find comprehensive data; thus, the potential of the data has mostly been assessed based on context and descriptions. The most important question remains:

Is the open data high quality?

Valuable data is context-specific: data quality refers to how well the data meets user requirements. However, this does not mean that there are only one or a few value-generating uses for the data. Pruning data properties at too early a stage has a detrimental effect on the reuse potential of the data. (Sundwall 2018; Zeiser et al. 2021)

According to Zeiser et al. (2021), data collected from processes is useful only when it contains both metadata (such as process description and input parameters) and process expertise. Table 1 shows the quality requirements for industrial machine learning models. It seems that the requirements of industrial machine learning models are based in many respects on the properties of the available data. (Azimi et al. 2020 p. 580; Zeiser et al. 2021 p. 599)
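A simple completeness check illustrates the idea that process data needs its context to be reusable. The field names below are hypothetical and chosen by me for illustration; Zeiser et al. (2021) do not prescribe a schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProcessDataset:
    """A data set together with the context that makes it reusable."""
    name: str
    process_description: str = ""
    input_parameters: Dict[str, str] = field(default_factory=dict)
    expert_notes: str = ""  # stands in for documented process expertise


def missing_context(dataset: ProcessDataset) -> List[str]:
    """Return the names of the context fields that are still empty."""
    missing: List[str] = []
    if not dataset.process_description.strip():
        missing.append("process_description")
    if not dataset.input_parameters:
        missing.append("input_parameters")
    if not dataset.expert_notes.strip():
        missing.append("expert_notes")
    return missing


# Hypothetical example: measurements published without any expert commentary.
welding_data = ProcessDataset(
    name="welding-line-3",
    process_description="Spot welding of sheet metal parts",
    input_parameters={"current": "kA", "weld_time": "ms"},
)
print(missing_context(welding_data))
```

A check like this says nothing about whether the descriptions are any good, but it makes visible which data sets arrive without the context a later user would need.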

Table 1. Requirements for industrial machine learning models (Zeiser et al. 2021 p. 599)

The lack of understanding of the quality requirements for industrial machine learning models is reflected in the varying quality of open data. However, the lack of quality also affects commercial data. For this reason, collecting, selecting, and organizing data takes plenty of time no matter the source. That doesn’t sound like an inspiring result, does it?

Do not forget the neighborhood. Always remember to be thankful for someone cleaning up the storage for you.


Azimi, S. & Pahl, C. (2020). A Layered Quality Framework for Machine Learning-driven Data and Information Models. Proceedings of the 22nd International Conference on Enterprise Information Systems (ICEIS 2020). Vol 1, pp. 579-587. ISSN: 2184-4992. Available: DOI: 10.5220/0009472305790587

Bhat, W.A. (2018). Bridging data-capacity gap in big data storage. Future Generation Computer Systems. Vol 87. pp. 538-548. ISSN 0167-739X. Available:

Hamel, J. (2021). AI Services Providers Bring the Future of Intelligence Into Focus. IDC. [Accessed 1.7.2021] Available:

Holst, A. (2021). Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2025. Statista. [Accessed 2.9.2021] Available:

IDC (2020). IDC’s Global StorageSphere Forecast Shows Continued Strong Growth in the World’s Installed Base of Storage Capacity. [Accessed 4.9.2021] Available:

Ismail, A., Truong, H.-L. & Kastner, W. (2019). Manufacturing process data analysis pipelines: a requirements analysis and survey. Journal of Big Data. Vol 6. Available:

Ogbuke, N., Yusuf, Y.Y., Dharma, K. & Mercangoz, B.A. (2020). Big data supply chain analytics: ethical, privacy and security challenges posed to business, industries and society. Production Planning & Control. Available:

Reinsel, D., Gantz, J. & Rydning, J. (2017). Data Age 2025: The Evolution of Data to Life-Critical. Don’t Focus on Big Data; Focus on the Data That’s Big. IDC White Paper. Available:

Roh, Y., Heo, G. & Whang, S. E. (2021). A Survey on Data Collection for Machine Learning: A Big Data – AI Integration Perspective. IEEE Transactions on Knowledge and Data Engineering. vol. 33, no. 4, pp. 1328-1347. Available: DOI: 10.1109/TKDE.2019.2946162

Sundwall, J. (2018). Webinar: Working with Open Data on AWS. AWS cloud. [Accessed 2.9.2021] Available:

Vijesh, J.C., Raj, J.S. & Smys, S. (2021). Big Data Analytics: Tools, Challenges, and Scope in Data-Driven Computing. In: Raj, J.S. (ed.) International Conference on Mobile Computing and Sustainable Informatics. ICMCSI 2020. EAI/Springer Innovations in Communication and Computing. Springer, Cham.

Zeiser, A., van Stein, B. & Bäck, T. (2021). Requirements towards optimizing analytics in industrial processes. Procedia Computer Science. Vol 184, pp. 597-605. ISSN: 1877-0509. Available: