Data Mesh Tools: A Comprehensive Guide

When I first heard about the data mesh, I was sceptical. As I dug deeper into the topic, I realized it was a breath of fresh air where data architecture finally reunited with its long-lost relative: business domain ownership. —  Stan Christiaens, co-founder of Collibra

By Ayyuce Kizrak

This article provides a comprehensive guide to data mesh tools across the different stages of your data pipelines. In my previous article, I explained Data Fabric and Data Mesh, the two emerging data governance trends, covering their architecture, management, and boundaries partly through metaphor and partly through examples. I also described the difficulties in traditional methods that these two complementary approaches emerged to overcome.

In the intricate world of data governance, the Data Mesh concept emerges like an artist turning a tangled yarn into a harmonious mosaic, making each data strand shine in a complex yet seamless tapestry that propels our business forward. It transforms the daunting task of untangling our data ecosystem into crafting a masterpiece where every piece is vital, no matter how small. This approach not only simplifies management but also enriches our appreciation for the individual qualities of each data element, aligning them with our overarching business objectives.

As we delve into this reimagined data landscape, it becomes clear that equipping each domain team with the tools to manage and operate their data assets independently is paramount. This article aims to shed light on the tools that empower such autonomy and innovation. Yet, it’s crucial to understand that the selection of these tools will inevitably vary, as each team’s choice hinges on a multitude of factors, including the specific use case, existing technological infrastructure, and budgetary constraints.

Photo by Google DeepMind on Unsplash

1. Data Mesh Tools: Data Catalogue

A data catalogue is one of the most critical components of a data mesh architecture. It serves as a central inventory of all data assets available within the organization, and it is where data consumers across different business domains discover data assets and learn about their scope. In short, the data catalogue is the first place teams from other domains will visit when they want to use each other’s data products.

A suitable data catalogue tool should capture four basic features of each data asset: location, format, quality, and metadata. This information is what makes the data discoverable. Such a tool enables data producers to document, understand, and explain the context of their data assets, and a proper data catalogue also makes it possible to manage data effectively.
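To make this a bit more concrete, here is a minimal sketch of the kind of information a catalogue entry might hold. The `CatalogEntry` class and the `orders_daily` example are hypothetical and not tied to any particular catalogue product; they simply illustrate the location, format, quality, and metadata attributes discussed above.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class CatalogEntry:
    """Hypothetical record describing one data asset in a catalogue."""
    name: str                     # human-readable name of the data product
    location: str                 # where the asset lives (bucket, table, topic)
    format: str                   # e.g. "parquet", "csv", "relational table"
    quality_score: float          # e.g. share of rows passing validation checks
    metadata: Dict[str, str] = field(default_factory=dict)  # owner, domain, tags

# How a domain team might register one of its data products
orders_entry = CatalogEntry(
    name="orders_daily",
    location="s3://sales-domain/orders/daily/",
    format="parquet",
    quality_score=0.98,
    metadata={"owner": "sales-domain-team", "domain": "sales", "pii": "false"},
)
```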

When we do some market research, we come across many products that can serve as a data catalogue. The three that stand out, in my opinion, are the following:

  • Collibra: Although it is a data governance platform, its data catalogue features stand out. It allows users to manage metadata, including domain terms, definitions, and classifications, and it enables visualizing data lineage and tracking how data moves and is transformed within the organization. It is one of the products I recommend for facilitating collaboration between data stakeholders and promoting a data culture.

Strength — metadata management, data lineage, and collaboration features.

  • Informatica: It is named a Leader in the Gartner Magic Quadrant almost every year. Although it covers many of Collibra’s features, it also has sub-products, such as PowerCenter for data integration and Informatica Data Quality for ensuring data quality. It also allows using keywords in data profiles, making data discovery easier.

Strength — unified metadata management, easy integration and data quality.

  • Alation: Combined with standard data catalogue features, it helps users understand the impact and usage patterns of data assets across the organization. Keyword usage and smart search features are also available in this product.

Strength — smart search-driven data discovery, collaboration, and user-friendly interface.

These products are just a few of the options; there are dozens of different data cataloguing tools on the market. However, if it were up to me and I had no budget constraints or restrictions from the existing data ecosystem, I’d choose one of these three.

2. Data Mesh Tools: Data Storage

The second critical component of a data mesh is data storage, which is usually a data lake, a data warehouse, or a similar storage layer. In a data mesh architecture, we typically distribute data storage across the organization, and each team is responsible for its own storage needs. This has clear benefits, but it can also add complexity: because each team is free to choose its own storage, we may end up with many different storage technologies in use across domains. The upside is that domain teams can pick the storage solution that best suits their needs rather than being forced onto a single enterprise solution, which, if you remember, is one of the defining features of the data mesh.

For example, a team with unstructured data might choose one type of database, while a team that only works with structured data might choose a completely different relational database. The added benefit is that teams can scale their storage solutions independently, so we don’t have to run massive enterprise-level scaling programs that take months and consume a lot of resources. The flip side is that a storage solution serving only a single team’s needs can end up being more costly.

There are many more product options for data storage than data cataloguing. Our choice will depend on our usage scenario, company size, budget, etc.

  • Amazon S3: Provides an object storage service with high scalability. It integrates with AWS Identity and Access Management (IAM) for access control, allows versioning and tagging for metadata management, and provides server-side encryption for data security. It is suitable for storing large volumes of unstructured data, for data lakes, and as a storage solution in various AWS-based data architectures (see the short sketch after this list).
  • BigQuery: It is a serverless data warehouse. It scales automatically to handle large datasets and concurrent queries. Provides granular access control via IAM. Supports auditing and monitoring of query and table access. Easily integrates with Google Cloud’s Data Catalog for metadata management. It is suitable for organizations that need a serverless data warehouse with powerful analytics capabilities.
  • Snowflake: It is a cloud-based data warehouse. It offers flexible scalability to handle varying workloads. It implements Role-Based Access Control (RBAC) for security. Enables metadata management and data lineage tracking. It also supports native integration with various identity management solutions. Ideal for organizations looking for a cloud-based managed data warehouse that focuses on simplicity and scalability.
  • Apache Hadoop: Provides a distributed file system (Hadoop Distributed File System, HDFS) for structured and unstructured data and scales across clusters of commodity hardware. Hadoop is primarily a distributed processing framework, and data management features are often provided by additional tools (e.g., Apache Atlas). It supports authentication and authorization mechanisms and includes encryption and auditing options. It is well-suited for organizations with large-scale data processing needs and allows storing and processing various data types.
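To make the Amazon S3 option above a little more concrete, here is a minimal sketch of how a domain team might write a data product file to S3 with server-side encryption and object tags using boto3. The bucket name, object key, and tag values are hypothetical, and credentials are assumed to come from the environment or an IAM role.

```python
import boto3

s3 = boto3.client("s3")  # region and credentials come from the environment / IAM role

# Hypothetical bucket and key for a sales-domain data product
with open("orders_2024-01-31.parquet", "rb") as body:
    s3.put_object(
        Bucket="sales-domain-data-products",
        Key="orders/daily/2024-01-31.parquet",
        Body=body,
        ServerSideEncryption="AES256",                  # server-side encryption at rest
        Tagging="domain=sales&owner=sales-domain-team", # object tags for metadata management
    )
```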

Complexity: Amazon S3, BigQuery, and Snowflake are cloud-native solutions, which generally means easier management and scalability.

Ease of Use: BigQuery and Snowflake are known for their ease of use and require less infrastructure management than Hadoop.

Cost Model: Each platform has a different cost model, and organizations must consider their specific usage models to evaluate cost-effectiveness.

Given the characteristics of the data mesh, I think storage should be chosen locally at the domain level, but we also need technical leadership within the organization and a clear strategy for how we are going to combine all these different storage solutions. Of course, adopting the same data storage solution across all domains would be simpler to manage. However, mandating it goes against the data mesh methodology, so we should still allow different domains to choose their own storage tools.

3. Data Mesh Tools: Data Pipeline

Another thing we need to think about when it comes to tools is how data actually flows between systems. ETL refers to extracting, transforming, and loading data from source systems into a data lake or data warehouse. This is what allows domain teams to move data from source systems or locations into a central repository. But it doesn’t happen magically! For this, we need a data pipeline.

A well-planned data pipeline should allow us to scale horizontally to meet increasing data volumes and processing requirements. It should also make us more flexible, so we can easily change the pipeline to support new data sources, data formats, or other requirements. A proper pipeline ensures data quality, prevents data loss during transfer, provides error handling, optimizes data processing, minimizes delays, and ensures that data is available on time.
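As a minimal, tool-agnostic illustration of the extract-transform-load pattern described above, here is a sketch using pandas and SQLAlchemy. The file path, column names, connection string, and target table are all hypothetical.

```python
import pandas as pd
from sqlalchemy import create_engine

# Extract: read raw data exported from a (hypothetical) source system
raw = pd.read_csv("exports/orders_raw.csv")

# Transform: basic cleansing and type enforcement before loading
clean = (
    raw.dropna(subset=["order_id", "amount"])          # drop incomplete records
       .assign(amount=lambda df: df["amount"].astype(float))
       .query("amount > 0")                            # keep only valid orders
)

# Load: append the cleaned data to a (hypothetical) warehouse table
engine = create_engine("postgresql://user:password@warehouse-host/analytics")
clean.to_sql("orders_clean", engine, if_exists="append", index=False)
```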

  • Amazon S3: Often used for data lakes, backup and restore, application hosting, and static website hosting. It is well-suited for storing various data types, including images, videos, and log files. As a fully managed AWS service, it can serve as the backbone of a data pipeline when combined with other AWS services, supporting both streaming and batch processing along with data transformation and cleansing.
  • Apache NiFi: A data integration tool for orchestrating data flows. It enables the creation of scalable dataflow pipelines and is ideal for building pipelines and orchestrating ETL processes. It supports real-time data movement and integration between various systems and can be used with storage solutions like Amazon S3 to manage data streams.

Integration and Scalability: As a cloud-based object storage service focused on scalability and security, Amazon S3 provides an ideal platform for storing large datasets. Apache NiFi, on the other hand, is an open-source data integration tool that makes it much easier to manage data pipelines and simplify integration processes. Combining Apache NiFi’s flow management and orchestration capabilities with the secure, scalable storage provided by Amazon S3 creates a seamless synergy for data storage and integration needs.

The most crucial factor when choosing between them is the specific requirements and architecture of our data storage and management needs. In a data mesh, data pipelines are managed by the various domain teams rather than by a central structure, which means each pipeline and its components must be customized to the needs of its domain. It’s best to leave the choice of pipeline tools to the domain teams, as this supports data autonomy and eases the burden on centralized data management teams.

4. Data Mesh Tools: Data Quality Management

Data quality management is critical when it comes to the data mesh. Actually, it always is! But in a data mesh, many different data products will be created and consumed by many different teams across the organization, and individual domains no longer have their own dedicated data quality management teams. Therefore, ensuring the consistency and reliability of all these data products is only possible with a suitable data quality management tool.

This situation can lead to a variety of problems, such as inefficiency, confusion, and even legal issues. If we have experience using data quality management tools, we know that the key components are data profiling, data cleansing, data validation, and data monitoring. If we come from a larger organization, we probably already have a data quality management tool in place.
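To ground the validation component in particular, here is a minimal, tool-agnostic sketch of the kinds of checks a data quality tool automates at much larger scale. The column names and the checks themselves are hypothetical examples, not taken from any specific product.

```python
import pandas as pd

def validate_orders(df: pd.DataFrame) -> dict:
    """Run a few illustrative data quality checks and return pass/fail flags."""
    return {
        "no_missing_ids": df["order_id"].notna().all(),                          # completeness
        "unique_ids": df["order_id"].is_unique,                                  # uniqueness
        "positive_amounts": (df["amount"] > 0).all(),                            # validity
        "no_future_dates": (df["order_date"] <= pd.Timestamp.today()).all(),     # plausibility
    }

# Example usage with a tiny in-memory sample
sample = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [10.0, 25.5, 7.2],
    "order_date": pd.to_datetime(["2024-01-30", "2024-01-31", "2024-01-31"]),
})
print(validate_orders(sample))  # all checks should pass for this sample
```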

The first step here is to understand whether the tool already in use will meet the demands of a data mesh implementation. If it does not, it will be necessary to look for another tool that can better meet these needs. Informatica is my first choice among data quality management tools.

  • Informatica: Stands out with its extensive data integration capabilities and scalability. For example, Informatica Data Quality offers powerful tooling for enforcing data quality rules and cleansing data.

Weakness — It can be costly, and there may be a learning curve due to its complexity.

  • Talend: It offers a cost-effective and customizable solution with its open-source model and flexible structure. For example, Talend Data Quality can be used to profile data and clean up quality issues.

Weakness — Complexity may be challenging for some users, and large-scale applications may require additional infrastructure.

  • Collibra: Known as a robust data governance platform. It can be used effectively in data quality management to define data policies and determine workflows. For example, Collibra’s metadata management features provide essential information in terms of data quality.

Weakness — It is limited in data integration and can be costly.

  • Ataccama: Contributes to data quality management with advanced data profiling, automation, and MDM (Master Data Management) capabilities. For example, Ataccama ONE can be used effectively to automate data quality rules and cleanse data.

Weakness — There may be a learning curve for new users, and integration complexity may be apparent in some scenarios.

Finally, unlike the other components, it is wiser to choose a single standard tool for data quality management that will be used across all domains of the organization, even in a data mesh. Therefore, choosing something that will work for all the different domains is important.

5. Data Mesh Tools: Data Governance

Data governance is becoming more prominent, especially in the data mesh. Data governance ensures that data is managed according to regulatory requirements and previously agreed corporate policies. If you have the right data governance tool for your data mesh, it will be much easier for domain teams to implement data governance policies and standards.

I suggest using the same tool you chose for data quality management in the previous section. You don’t want separate tools for data governance and data quality because, in most cases, a single tool provides the necessary functionality for both. So, look for a tool that covers your data governance and data quality needs together. Again, my personal favourites here are Collibra and Informatica.

Rather than rushing to choose the right tool, understand your needs and evaluate two to five candidates. If you are already using an effective tool, stick with it; if it does not suit your business, analyze different tools and choose the one with the best return on investment.

6. Data Mesh Tools: APIs and Service Mesh

Next, you need to consider how data communication between different domain teams will happen. For this, data mesh best practices call for APIs and a service mesh. Now, let’s discuss what an API is and what a service mesh is.

API is the abbreviation for Application Programming Interface: a software interface that allows different applications and systems to share data with each other. In a data mesh, APIs allow domain teams to expose their data to other domains so that teams can interact with each other’s data more efficiently.
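As a rough sketch of what exposing a domain’s data product through an API can look like, here is a minimal example using FastAPI. The endpoint path, the in-memory `ORDERS` sample, and the response shape are hypothetical; a real data product API would read from the domain’s own storage and add authentication.

```python
from typing import Optional
from fastapi import FastAPI

app = FastAPI(title="Sales domain data products")

# Hypothetical in-memory stand-in for the domain's data product
ORDERS = [
    {"order_id": 1, "amount": 10.0, "status": "shipped"},
    {"order_id": 2, "amount": 25.5, "status": "pending"},
]

@app.get("/data-products/orders")
def list_orders(status: Optional[str] = None):
    """Return the orders data product, optionally filtered by status."""
    if status is None:
        return ORDERS
    return [order for order in ORDERS if order["status"] == status]

# Run locally with, e.g.: uvicorn sales_api:app --reload  (module name is hypothetical)
```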

Many people have heard of APIs, but far fewer have heard of a service mesh. A service mesh is an infrastructure layer that manages service-to-service communication within a distributed system. It provides the features and functions organizations need to manage a microservices-based architecture effectively, such as service discovery, load balancing, traffic routing, and security.

APIs and a service mesh are key components of the data mesh because they facilitate data exchange and communication between different domains. With APIs and a service mesh in place, we enable teams to access the right data securely and efficiently at the right time.
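As a companion to the producer-side sketch above, here is how a consuming domain might call that hypothetical endpoint over plain HTTP. In a service mesh deployment the call itself looks the same; a mesh such as Istio transparently handles routing, retries, and mutual TLS between the services. The internal hostname below is a made-up example in Kubernetes service DNS form.

```python
import requests

# Hypothetical in-cluster address of the sales domain's data product API
response = requests.get(
    "http://sales-api.sales-domain.svc.cluster.local/data-products/orders",
    params={"status": "shipped"},
    timeout=5,
)
response.raise_for_status()
orders = response.json()
print(f"Received {len(orders)} shipped orders from the sales domain")
```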

  • Kubernetes: A powerful tool for container orchestration. It provides an effective solution for scalability and service delivery in data mesh implementations.

Weakness — Kubernetes is not a direct API management tool and may need additional tools to manage complex structures.

  • Istio: Provides advanced service mesh features. It is a comprehensive solution for service mesh management, including traffic management, security, and monitoring.

Weakness — Installation and configuration complexity may be challenging for some users. It can also cause performance issues in high-traffic applications.

  • Kong: Powerful in API management and scales quickly. It is an effective solution for traffic management and security issues when used as an API gateway.

Weakness — It has deficiencies in service mesh features. In particular, it does not offer as wide a range of features as a comprehensive service mesh solution such as Istio.

Choosing the right technology here will reduce the complexity of managing data across multiple domains and make it much easier for domains to maintain and scale their data infrastructure. The tools I have listed for your data mesh implementation are just some of the most well-known options; of course, there are many others.

7. Data Mesh Tools: Data Visualization and Reporting

Another critical part of the data mesh that we must consider is data visualization and reporting. We need to think about the tools we use to present domain data as a product that every domain in the organization can easily consume. Using APIs and a service mesh to exchange data between domains is excellent, but remember that most data consumers are not technical people! Moreover, some of those consumers may be leaders in other business areas. So, we need to think of ways to visualize and structure our data to make life easier for everyone. That’s why we need to choose a data visualization and reporting product.

Some of the best-known options are Tableau, Power BI, and QlikView. There are also solutions like Looker and even Excel. Which one we choose will depend on the expertise we have in-house. For example, suppose we already have many Tableau experts and have committed to Tableau licenses for the next 18 months; in that case, there would be no point in switching to Power BI or anything else. However, if we are starting from scratch, or lack specific expertise in the visualization tool we are currently paying for, then it is a good idea to look at the different options and see which one meets our needs best.

  • Tableau: Stands out with its user-friendly interface and powerful visualization capabilities. It is an effective tool for understanding and visualizing complex data relationships within a data mesh, and it integrates easily with most major data sources.

Weakness — Tableau may experience some performance challenges when processing large data sets. It also lacks some advanced analytics features compared to Power BI and QlikView.

  • Power BI: Offers strong integration with the Microsoft ecosystem and makes it easier for users to access and analyze data. Helpful in combining and reporting data sets within Data Mesh. It also allows users to create fast and practical reports.

Weakness — Some of Power BI’s advanced analytics capabilities may be limited. It may also perform worse than competitors like Tableau when processing large-scale datasets.

  • QlikView: Stands out as a dedicated data exploration and visualization platform. It is powerful in building relational data models and making dynamic visualizations. It is an effective tool for understanding complex relationships within a Data Mesh.

Weakness — QlikView is less user-friendly than Tableau and Power BI in terms of interface and reporting features. Additionally, licensing costs can be high for certain usage scenarios.

So, in summary, data visualization is important because it allows different domains to communicate their data more effectively and allows different business units to make better data-driven business decisions.

8. Data Mesh Tools: Collaboration and Information Sharing

Collaboration and knowledge sharing are critical, especially for autonomous and scalable multidisciplinary organizations. This maximizes customer and revenue potential and highlights the need for effective communication between completely independent teams. Effective collaboration and smooth information flow can be supported using chat platforms like Slack and Microsoft Teams and wiki tools such as Confluence and Notion. Factors such as the choice of these tools, the way documents are shared, and the organization of meetings directly affect the effectiveness of intra-organizational collaboration.

In the process of transitioning to a data mesh implementation, the integration and use of these tools become one of the key elements supporting the organization’s overall success. Therefore, strategically deciding which technologies and methods to adopt is vital to ensuring the organization’s long-term success in managing and sharing knowledge.

Generated by DALL-E

Choosing the Proper Tools — Summary

Choosing the proper data mesh tools for our architecture is one of our most complex tasks. There are so many options on the market, covering everything from data storage and management to many other needs, that it is easy to face information overload. I recommend a six-step strategy to manage this complexity and choose the right tools. These steps will help us understand the process, evaluate the options, and ultimately determine the best tools.

  1. Defining Requirements: We must involve stakeholders from all levels of the organization, such as data analysts, data scientists, data engineers, business analysts, IT managers, and senior executives; the exact mix will vary with the organizational structure. These stakeholders will help us determine future scaling expectations by providing insight into the types of data they deal with and their needs.
  2. Exploring Options: After understanding the stakeholders’ requirements, we must explore the available tool options. In this process, we should not overlook the input of the teams themselves, as they are the experts who will manage the tools daily and may already have preferences.
  3. Evaluation: After listing the tools that can meet the needs of the various categories, we should start evaluating them. The people doing the reviews should be the ones who will use the tools daily, as they are best placed to judge which tool fits. The evaluation should also consider the budget.
  4. Testing: It is crucial to test the tools being evaluated. The everyday users should run tests with the company’s actual data, and free trial versions make it possible to do so.
  5. Other Considerations: It is important to consider what kind of support and training each tool requires. Users need to be trained effectively, especially if a new tool is being adopted. The vendor’s contract terms, pricing, and support flexibility should also be considered.
  6. Decision: Once the fifth step is completed, the tools for the data mesh implementation must be decided on. This requires selecting appropriate tools within the company’s budget.

As a result, our data mesh journey is like a large orchestra playing in harmony, with each tool contributing its unique voice. Choosing and integrating the right tools is like ensuring that different musical instruments create perfect harmony with each other. Our teams, each specialized in their own field, and the tools we choose create a fascinating symphony while maximizing the success and sustainability of our data mesh.

This orchestration increases efficiency while minimizing complexity in data management and enabling our organization to reach new heights in data-driven decision-making.


Ayyuce Kizrak
