We need open and vendor-neutral metadata services
As I spoke with friends leading up to Strata + Hadoop World NYC 2015, one topic continued to come up: metadata. It’s a topic that data engineers and data management researchers have long thought about because it has significant effects on the systems they maintain and the services they offer. I’ve also been having more and more conversations about applications made possible by metadata collection and analysis.
At the recent Strata + Hadoop World, U.C. Berkeley professor and Trifacta co-founder Joe Hellerstein outlined the reasons why the broader data industry should rally to develop open and vendor-neutral metadata services. He made the case that improvements in metadata collection and sharing can lead to interesting applications and capabilities within the industry.
Below are some of the reasons why Hellerstein believes the data industry should start focusing more on metadata:
Improved data analysis: metadata-on-use
You will never know your data better than when you are wrangling and analyzing it. — Joe Hellerstein
A few years ago, I observed that context-switching between multiple frameworks created a lag in productivity. Today’s tools have improved to the point that someone using a single framework like Apache Spark can get many of their data tasks done without having to employ other programming environments. But today’s tools still do a poor job of capturing how people interact and work with data: they do not track in detail the actions and choices analysts make, nor the rationales behind them.
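To make "metadata-on-use" concrete, here is a minimal sketch of what capturing an analyst's actions and rationales might look like. Everything here (the `WrangleLog` class and its fields) is hypothetical, not part of any tool Hellerstein described:

```python
import json
import time

class WrangleLog:
    """Hypothetical sketch: record each action an analyst takes on a
    dataset, along with an optional rationale, as metadata-on-use."""

    def __init__(self, dataset_name):
        self.dataset_name = dataset_name
        self.actions = []

    def record(self, action, rationale=None):
        # Each entry captures what was done, why, and when.
        self.actions.append({
            "dataset": self.dataset_name,
            "action": action,
            "rationale": rationale,
            "timestamp": time.time(),
        })

    def to_json(self):
        return json.dumps(self.actions, indent=2)

# The log travels with the analysis, not with any one tool.
log = WrangleLog("customer_events")
log.record("dropped rows with null user_id", rationale="ids required for joins")
log.record("parsed timestamp column as UTC")
print(log.to_json())
```

The point of the sketch is that the record outlives the session: a later analyst can see not just the final dataset but the choices, and reasons, that produced it.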
Enhanced interoperability: standards on use
If you’ve read the recent O’Reilly report Mapping Big Data or played with the accompanying demo, then you’ve seen the breadth of tools and platforms that data professionals have to contend with. Recreating a complex data pipeline means knowing the details (e.g., version, configuration parameters) of each component involved in a project. With a view to reproducibility, a persistent (stored) metadata protocol that cuts across vendors and frameworks would come in handy.
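As a sketch of what such a cross-vendor record might contain, consider describing each pipeline stage by component, version, and configuration, and persisting the whole thing in a plain format like JSON. The component names and settings below are illustrative, not a proposed standard:

```python
import json

# Hypothetical vendor-neutral record of one pipeline stage: enough
# detail (component, version, configuration) to recreate the stage.
def stage_record(component, version, config):
    return {"component": component, "version": version, "config": config}

pipeline = {
    "pipeline": "daily_clickstream",
    "stages": [
        stage_record("kafka", "0.8.2", {"partitions": 12}),
        stage_record("spark", "1.5.1", {"executor_memory": "4g"}),
    ],
}

# Persisting the record in a tool-agnostic format (JSON here) is what
# makes it portable across vendors and frameworks.
serialized = json.dumps(pipeline, sort_keys=True)
restored = json.loads(serialized)
assert restored == pipeline
```

Because the record is just data, any framework can read it back and know exactly which versions and settings to reinstantiate.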
Comprehensive interpretation of results
Behind every report and model (whether physical or quantitative) are assumptions, code, and parameters. The types of models used in a project determine what data will be gathered and, conversely, models depend heavily on the data that is used to build them. So, proper interpretation of results needs to be accompanied by metadata that focuses on factors that inform data collection and model building.
As I noted above, the settings (version, configuration parameters) of each tool involved in a project are essential to the reproducibility of complex data pipelines. In practice, though, teams usually document only the projects that yield a desired outcome. Using scientific research as an example, Hellerstein noted that having a comprehensive picture is often just as important: this entails gathering metadata for the settings and actions of projects that failed as well as those that succeeded.
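A small sketch of that idea: record the settings and outcome of every run, not just the successful ones, so failures remain queryable later. The field names are illustrative, not from any specific tool:

```python
# Record every run, success or failure, to keep a comprehensive picture.
runs = []

def record_run(settings, outcome):
    """Append a run record whether it succeeded or failed."""
    runs.append({"settings": settings, "outcome": outcome})

record_run({"spark_version": "1.5.1", "sample_rate": 0.1}, "failed")
record_run({"spark_version": "1.5.1", "sample_rate": 0.5}, "succeeded")

# Failed configurations stay queryable alongside the successes, so a
# later team can see which settings have already been tried.
failures = [r["settings"] for r in runs if r["outcome"] == "failed"]
```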
Data governance policies by the people, for the people
Governance usually refers to policies that govern important items including the access, availability, and security of data. Rather than adhering to policies that are dictated from above, metadata can be used to develop a governance policy that is based on consensus and collective intelligence. A “sandbox” where users can explore and annotate data could be used to develop a governance policy that is “fueled by observing, learning, and iterating.”
Time travel and simulations
Comprehensive metadata services lead to capabilities that many organizations aspire to have: the ability to quickly reproduce data pipelines opens the door to “what-if” scenarios. If the right metadata is collected and stored, then models and simulations can fill in any gaps where data was not captured, perform realistic recreations, and even explore “alternate histories” (recreations that use different settings).
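The “time travel” idea can be sketched in a few lines: if every run’s inputs and settings are stored as metadata, a pipeline can be re-executed with the original settings (a recreation) or with modified ones (an alternate history). The pipeline below is a stand-in, not a real system:

```python
# Hypothetical sketch: a trivial stand-in pipeline that scales values
# and drops those below a cutoff.
def run_pipeline(data, settings):
    scale, cutoff = settings["scale"], settings["cutoff"]
    return [x * scale for x in data if x * scale >= cutoff]

# Stored metadata from an earlier run: the inputs and the settings used.
recorded_run = {"data": [1, 2, 3, 4], "settings": {"scale": 10, "cutoff": 20}}

# Faithful recreation, driven entirely by the stored metadata.
original = run_pipeline(recorded_run["data"], recorded_run["settings"])

# "What-if" scenario: same data, different settings.
alternate = run_pipeline(recorded_run["data"], {"scale": 10, "cutoff": 30})

print(original)   # [20, 30, 40]
print(alternate)  # [30, 40]
```

The design point is that replay requires no access to the original environment, only to the metadata that described it.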