Validating Big Data Workflows

Today, data is less a property of the application than a separate entity that the application interacts with.

For example, an application may have to ingest data streams from several different sources, structure the data, check its relevance, store it, process and filter it, apply an aggregating function for further analysis, and present the result as a generated report.

Testing software that uses Big data techniques is significantly more complex than testing more traditional data management applications.

To test Big data applications effectively, continuous validation at every transformation stage is recommended.

Different types of tests can be conducted to maintain the standard of data. Data quality has many dimensions that should be measured, such as accuracy, correctness, redundancy, readability, accessibility, consistency, usefulness, and trust. Accuracy refers to how close the results are to the values accepted as true, so it is usually measured by comparing the data across multiple data sources. We mainly focus on this factor when validating data in our work.
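As a rough illustration of measuring accuracy by comparison, the sketch below scores a candidate source against a reference source that is treated as the accepted truth. The function name, the tolerance parameter, and the sample records are all hypothetical and chosen only for the example; a real check would read from the actual systems involved.

```python
from typing import Dict, Hashable


def accuracy_against_reference(
    candidate: Dict[Hashable, float],
    reference: Dict[Hashable, float],
    tolerance: float = 1e-9,
) -> float:
    """Return the share of reference records that the candidate reproduces.

    A record counts as accurate when the candidate holds the same key and
    its value lies within `tolerance` of the reference value.
    """
    if not reference:
        raise ValueError("reference source is empty")
    matches = 0
    for key, expected in reference.items():
        actual = candidate.get(key)
        if actual is not None and abs(actual - expected) <= tolerance:
            matches += 1
    return matches / len(reference)


# Hypothetical sample data: the reference is assumed to hold the true values.
reference = {"order-1": 10.0, "order-2": 25.5, "order-3": 7.25, "order-4": 3.0}
candidate = {"order-1": 10.0, "order-2": 25.0, "order-3": 7.25}

score = accuracy_against_reference(candidate, reference)
print(f"accuracy: {score:.2f}")
```

Here one value differs and one record is missing, so only half of the reference records are reproduced. Missing records are counted as inaccurate by design; a stricter or looser rule may suit other data.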

The processing of Big data, and thus its validation, can be divided into three different stages:

  1. Data staging: Loading data from various external sources. Validation includes verifying that the needed data were extracted and retrieved correctly, then uploaded into the system without any corruption.
  2. Processing: Validating the results of parallelized jobs and similar Big data application processes, while ensuring the accuracy and correctness of the data.
  3. Output: Extracting the output results. Validation includes checking that the data have been loaded correctly into the target system for any further processing.
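The three stages above can be sketched as three small checks. This is a minimal, in-memory illustration under stated assumptions, not a production approach: the record layout (`customer`, `amount`, `total`), the function names, and the per-customer sum standing in for the parallelized job are all hypothetical. Staging and output are compared with a row count plus an order-independent digest, and processing is checked by recomputing the aggregate independently.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def fingerprint(records: Iterable[Record]) -> Tuple[int, int]:
    """Return (row count, order-independent digest) for a collection of records."""
    count = 0
    combined = 0
    for record in records:
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).digest()
        # Summing the per-record hashes makes the result independent of row order.
        combined = (combined + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, combined


def validate_staging(source: List[Record], staged: List[Record]) -> bool:
    """Stage 1: everything extracted from the source arrived uncorrupted."""
    return fingerprint(source) == fingerprint(staged)


def validate_processing(staged: List[Record], aggregated: Dict[str, int]) -> bool:
    """Stage 2: the job's per-customer totals match an independent recomputation."""
    expected: Dict[str, int] = {}
    for record in staged:
        customer = record["customer"]
        expected[customer] = expected.get(customer, 0) + record["amount"]
    return expected == aggregated


def validate_output(aggregated: Dict[str, int], loaded: List[Record]) -> bool:
    """Stage 3: the rows in the target system match the job output."""
    expected_rows = [{"customer": c, "total": t} for c, t in aggregated.items()]
    return fingerprint(expected_rows) == fingerprint(loaded)


# Hypothetical sample data for each stage.
source = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 5},
    {"customer": "a", "amount": 7},
]
staged = list(reversed(source))    # same rows, different order
aggregated = {"a": 17, "b": 5}     # assumed result of the processing job
loaded = [{"customer": "b", "total": 5}, {"customer": "a", "total": 17}]

print(validate_staging(source, staged))
print(validate_processing(staged, aggregated))
print(validate_output(aggregated, loaded))
```

On real volumes the same idea would normally run inside the processing framework itself, since loading all records into one process defeats the purpose; the sketch only shows what each stage compares.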

Challenges in Big Data Testing

Automation: Automating Big data tests requires someone with technical expertise. In addition, automated tools are not equipped to handle unexpected problems that arise during testing.

Virtualization: Virtualization is an integral part of testing. Virtual machine latency creates timing problems in real-time Big data testing, and managing virtual machine images adds further overhead.

Large Dataset:

  • Need to verify more data and need to do it faster
  • Need to automate the testing effort
  • Need to be able to test across different platforms

Performance testing challenges:

  • A diverse set of technologies: Each sub-component belongs to a different technology and requires testing in isolation
  • Unavailability of specific tools: No single tool can perform end-to-end testing. For example, a tool suited to NoSQL stores might not fit message queues
  • Test scripting: A high degree of scripting is needed to design test scenarios and test cases
  • Test environment: A special test environment is needed because of the large data size
  • Monitoring solution: Few solutions can monitor the entire environment
  • Diagnostic solution: A custom solution has to be developed to drill down into the performance bottleneck areas

The main problem in testing Big data applications, however, may be the lack of necessary expertise in the team:

  • Expertise with Big data management life cycle & Big data governance
  • Experience with data masking/obfuscation
  • Experience with data sub-setting in complex integrated environments
  • Implementation of data generation tools
  • Experience delivering Big data as a shared service
  • Expertise with data profiling & setup of Big data utilities
  • Experience with the definition of Big data management practices

Tenedo consultants will support your project with the necessary experts. They can help set up the environment, resolve technical issues, work out scenarios, and introduce new technologies into testing, or take on the testing of the application entirely.

