A common misconception about data products is that they consist only of polished, prepared datasets that are ready to be used for analysis. In reality, a data product is more than just the data itself; it also includes all of the technologies involved in preparing that data.
At Polyseam, we understand that preparing data across all of your technologies can be complex. To make data products easier for you and your team to understand, we have come up with the following definition:
“A data product is a valuable data asset that your company uses to make decisions, along with all the technologies and metadata associated with that asset.”
Let’s look at an example based on a business’s sales data:
Continuing this example, imagine working at an enterprise that sells products both in a brick-and-mortar store and online. In this scenario, two different sources generate sales data. The above diagram shows how that sales data asset is created from beginning to end. Every step of this process is part of the data product.
To extract and prepare that data, we need an orchestration tool that can run ETL scripts, and that tool needs to be hosted on some type of infrastructure. In the example tech stack above, the orchestration tool is Apache Airflow, the scripts are written in Python, and the cloud infrastructure is AWS. The final sales dataset is then stored in an Amazon S3 bucket, and its metadata is registered in a search and discovery tool so that data consumers can easily find what they need.
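To make the steps above a little more concrete, here is a minimal sketch of what one of those Python ETL scripts might look like. It is not Polyseam's code or a real pipeline: the sample data, the column names, and the function names (extract, transform, load, describe) are all hypothetical, and it uses only the Python standard library. In a real stack, each function would likely run as a task in the orchestration tool, the load step would upload the file to the S3 bucket, and the metadata record would be sent to the search and discovery tool.

```python
import csv
import io
import json

# Hypothetical sample extracts. A real pipeline would pull these from the
# in-store point-of-sale system and the online store.
STORE_CSV = """order_id,sold_on,amount
S-1,2023-01-05,19.99
S-2,2023-01-05,5.00
"""

ONLINE_CSV = """order_id,sold_on,amount
W-1,2023-01-05,42.50
"""

COLUMNS = ["order_id", "sold_on", "amount", "channel"]


def extract(raw_csv, channel):
    """Parse one source's CSV export and tag each row with its sales channel."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        rows.append({
            "order_id": row["order_id"],
            "sold_on": row["sold_on"],
            "amount": float(row["amount"]),
            "channel": channel,
        })
    return rows


def transform(store_rows, online_rows):
    """Merge both channels into one dataset, sorted by date and then order id."""
    return sorted(store_rows + online_rows,
                  key=lambda r: (r["sold_on"], r["order_id"]))


def load(rows):
    """Serialize the dataset to CSV text (an upload to storage would go here)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def describe(rows, name):
    """Build a metadata record that a search and discovery tool could index."""
    return {
        "name": name,
        "row_count": len(rows),
        "columns": COLUMNS,
        "channels": sorted({r["channel"] for r in rows}),
    }


sales = transform(extract(STORE_CSV, "store"), extract(ONLINE_CSV, "online"))
sales_csv = load(sales)
metadata = describe(sales, "sales")
print(json.dumps(metadata, indent=2))
```

Notice that the script produces two things: the sales dataset itself and a metadata record describing it. Both of them, together with the tooling that runs the script, belong to the data product.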
Now you might be asking yourself: what is a repeatable, cheap, and easy way to create a data stack like the one above, with your own technologies, for your data product? This is where Polyseam’s open-source tool, CNDI Run, comes in. It’s a completely free deployment tool for data product stacks. To deploy your stack, you give CNDI Run your login information for your cloud provider, along with information about each technology in the stack that supports your data product.
Check out the README.md file in the GitHub repo for a full walkthrough on how to deploy your first data product!