A Vision for Open Data
The internet is the largest repository of freely available data on earth. Yet the most valuable data is not freely available online. Instead, it is collected through covert means and stockpiled by large organizations, which use it to power their products, train AI models, and resell it to governments and other businesses.
Oz is creating a neutral protocol for the free sharing and use of specialized data. Our protocol will provide a framework for the creation of specialized data collectors and validators. Through these data collectors, anyone can sell verticalized data in an open economy. AI developers can use open data to fine-tune models to specialize in real use cases in any industry. Application developers can build enterprise analytics, reporting, and dashboarding to compete against incumbent data-centric businesses.
The AI revolution is well underway, and generalized foundation models are ubiquitous. What is missing is a provider of vertical-specific, specialized data to enable open development over traditionally siloed data.
The Era of Big Data
As Peter Thiel describes, the majority of innovation over the last 50 years has been in the world of bits, not in the world of atoms. The earliest digital innovation was focused on developing new software to replace or augment human work.
But shortly after, it became clear that the data generated by the day-to-day use of software was far more valuable than the software itself. The data told the true story of how we think and act in the digital world.
Once the value of these strings of bits became clear, an incentive was created to collect and hoard as much of them as possible. This was the start of the Big Data Era, which has largely run its course since it began around 2005. The winners of this era are the businesses and governments that were able to build scalable data collection mechanisms and effectively leverage the power of data within their products.
The most notable example is Google, a company whose products learn what we like, want, and do from our online activity, then use it to serve us pinpoint-accurate advertisements directly to our phones.
Every high-tech industry and niche has its winners and losers. Organizations with better data can capture more value and outcompete the rest.
FICO - Personal financial data
Mandiant - Cybersecurity threat intelligence data
Citadel - Trading data
Cerner - Healthcare data
To these organizations, data is a competitive moat. An upstart seeking to displace any of these incumbents in their current markets would be competing against decades of data points that give these businesses a powerful understanding of their customers and markets.
The Need for Neutral Data
In today’s software products and networks, we are both the user and the product. As we live, work, and play online, our interactions are cataloged.
Data collection occurs at every non-private stage of transmission. As soon as your data leaves your home or personal device, ISPs and telecom companies collect metadata about your behavior.
The data traverses the ISP’s network, eventually reaching the relevant application server in a data center. At that point, the application owner also collects metadata about your behavior, specific to the application itself.
These covert methods of data collection are not necessarily malicious. Data is only useful in high quantities. By concentrating and standardizing data, these intermediaries make it usable, either as an input to their core product or as a product in itself, as with data brokers.
Market Consequences
Businesses use data to make predictions about the future. The better the data, the better their predictions and hence the better their products. The result is that over time, incumbents become virtually impossible to displace.
Since their data is kept private, only internal teams can build products with it. New businesses wishing to compete with incumbents like Google in search, Mandiant in threat intelligence, or FICO in financial data will find it virtually impossible due to the competitive moat created by decades of private data collection.
Real World Consequences
More than merely economic consequences, data hoarding impacts our health and security as well.
Cybersecurity companies are exceptionally data dependent. Anomaly detection algorithms, threat intelligence repositories, and malware signatures each require large sets of user-collected data. The core business of organizations like CrowdStrike, Mandiant, and Palo Alto Networks is tied to the quality of their data, which is sourced primarily from their customers.
The more customers these businesses acquire, the better their products become. But since data is not shared, it is a zero-sum game. The incentive to compete can leave competitors’ customers falling victim to preventable cyberattacks. In OT and medical environments, these cyberattacks can lead to loss of life or severe property damage.
Neutral Data Applications
Two decades ago, software development was a specialized skill. Over time, writing code has become progressively easier and less expensive. As of 2024, generative AI tools have made the creation of bespoke software dramatically easier.
Some of the largest tech organizations - Salesforce, ServiceNow, Oracle - are built around legacy software that is expensive and disliked by their user base. Despite the desire to switch, the cost to build an in-house equivalent of their products has remained too high.
Now, the economics of the build-vs-buy equation have flipped. Organizations are choosing to replace legacy software tools with in-house solutions that precisely fit their internal workflows. But they still face an obstacle: the data powering these tools is not openly available.
Software is easy to write, but the distribution of data-collecting sensors remains centralized with the incumbent software organizations: Google with its apps, Palo Alto Networks with its firewalls, Cerner with its EMR software. These products deliver value, but they also collect impossible-to-replicate data sets.
Through Sense Agents, Oz fills in the gaps on the data side, allowing any organization to rip and replace legacy software with its own tools. We imagine a future where, whenever a business needs an internal software tool, it can simply select the required datasets from Oz and build the solution using generative AI.
Industry Applications
Cybersecurity
Data is the lifeblood of the cybersecurity industry. Every day, new vulnerabilities are identified, new malware is created, and new malicious domains are registered.
Without an open repository for cybersecurity data, incumbent security organizations are working to silo as much of this data as possible away from their competitors. This can cause the competition’s customers to be compromised by preventable cyberattacks.
Any security data provided to Oz would be completely permissionless, ensuring any company from anywhere in the world can build with it. That means brand new startups can create security products with equivalent predictive power to the largest existing organizations.
They can use this data to power features across cloud security, vulnerability management, compliance, malware detection, anomaly detection, and more. This kind of data can serve to completely upend the way security products are built today.
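As an illustration of the kind of feature such open data could power, here is a minimal, hypothetical sketch that matches connection logs against an open feed of malicious domains. The feed contents and field names are invented for the example, not drawn from any real Oz dataset.

```python
# Hypothetical sketch: flagging connections to known-bad domains using
# an open threat-intelligence feed. All data below is illustrative.

# Open feed of malicious domains (stand-in for a network-provided dataset).
malicious_domains = {"evil-updates.example", "cred-harvest.example"}

connection_log = [
    {"src": "10.0.0.5", "dst_domain": "docs.example.org"},
    {"src": "10.0.0.7", "dst_domain": "evil-updates.example"},
]

def flag_connections(log, feed):
    """Return log entries whose destination appears in the threat feed."""
    return [entry for entry in log if entry["dst_domain"] in feed]

alerts = flag_connections(connection_log, malicious_domains)
for alert in alerts:
    print(f"ALERT: {alert['src']} contacted {alert['dst_domain']}")
```

Because the feed is open, a new entrant could build this detection logic with the same predictive power as an incumbent whose feed is private.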
Research
Collecting data for scientific research can be challenging due to high costs and a lack of available infrastructure.
Scientists interested in studying a specific type of data can develop novel collectors and validators on top of Oz. While demand for a dataset remains low, the cost of acquiring it stays correspondingly low, and the network connects researchers with willing participants who opt in by deploying collectors in their homes or environments.
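As a sketch of what a researcher-built validator might look like, the snippet below accepts a submitted reading only if required fields are present and values fall in a plausible range. The field names and bounds are illustrative assumptions, not part of any specified Oz interface.

```python
# Hypothetical validator for a participant-deployed sensor collector:
# accept a reading only if it is structurally complete and plausible.
# Field names and bounds are illustrative.

REQUIRED_FIELDS = {"sensor_id", "temperature_c", "timestamp"}

def validate_reading(reading: dict) -> bool:
    """Return True if the reading has all required fields and a plausible value."""
    if not REQUIRED_FIELDS.issubset(reading):
        return False
    # Reject physically implausible ambient temperatures (illustrative bounds).
    return -60.0 <= reading["temperature_c"] <= 60.0

good = {"sensor_id": "s1", "temperature_c": 21.5, "timestamp": 1700000000}
bad = {"sensor_id": "s2", "temperature_c": 451.0, "timestamp": 1700000000}

print(validate_reading(good))  # True
print(validate_reading(bad))   # False
```

Validation of this kind is what lets strangers' collectors contribute to a shared dataset without the researcher trusting each participant individually.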
Insurance
Insurance companies use data to determine risk, set premiums, and detect fraud. The profitability of an insurance company rests on how well it can predict these values. If it charges too much, competitors will win its customers. If it charges too little, it will pay out more in claims than it earns.
Accurately predicting how much to charge for premiums based on the determined risk level is the core product of all insurance companies.
Determining risk is done using statistical methods, which require huge amounts of data. Today, this data is primarily collected via the covert means described above.
Existing insurance companies can use Oz to source valuable data to better determine customer risk and hence how much to charge. New insurance companies can better compete against incumbents by leveraging open data provided by Oz, giving them a jump-start on predicting risk where data may previously have been impossible to acquire or extremely expensive.
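The pricing logic described above can be sketched in a few lines: expected loss is claim probability times average claim severity, and the premium adds a loading for expenses and profit. The figures and the loading factor are illustrative assumptions, not real actuarial values.

```python
# Minimal sketch of risk-based premium pricing. Better data improves the
# claim_probability estimate, which is exactly where open datasets help.

def expected_loss(claim_probability: float, avg_claim_cost: float) -> float:
    """Expected annual payout for one policy."""
    return claim_probability * avg_claim_cost

def premium(claim_probability: float, avg_claim_cost: float,
            loading: float = 0.25) -> float:
    """Price the policy at expected loss plus a loading for expenses and profit."""
    return expected_loss(claim_probability, avg_claim_cost) * (1 + loading)

# A policyholder with a 2% annual claim probability and a $10,000 average claim:
p = premium(0.02, 10_000)
print(f"Annual premium: ${p:.2f}")  # → Annual premium: $250.00
```

An insurer with worse data must either over-estimate `claim_probability` (and lose customers) or under-estimate it (and lose money), which is why access to better risk data is decisive.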
Privacy
When imagining a better way to collect and use data, privacy is paramount. We believe the most important aspect of privacy is the control over what data is shared and what is not.
To preserve privacy, every data collector deployed on Oz must provide the following:
Users must have complete visibility into what data is shared, including sample records prior to their submission to the network
If any attribute is not required for the core dataset, users must be able to opt out of its collection
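A minimal sketch of how a collector might satisfy these two requirements: show the user exactly what would be shared, and strip any optional attribute they have opted out of. The attribute names and record shape are invented for the example; Oz does not specify this interface.

```python
# Hypothetical collector-side privacy gate: optional attributes the user
# opts out of are removed before anything leaves the device, and the
# resulting sample record is shown to the user prior to submission.

import json

OPTIONAL_ATTRIBUTES = {"location", "device_model"}

def prepare_record(raw: dict, opted_out: set) -> dict:
    """Strip opted-out optional attributes before submission."""
    return {k: v for k, v in raw.items()
            if k not in (opted_out & OPTIONAL_ATTRIBUTES)}

raw_record = {
    "event": "login",
    "timestamp": 1700000000,
    "location": "40.7,-74.0",
    "device_model": "Pixel 8",
}

# Show the user exactly what would be shared, given their opt-outs.
shareable = prepare_record(raw_record, opted_out={"location"})
print("Sample record to be submitted:")
print(json.dumps(shareable, indent=2))
```

Only attributes listed as optional can be dropped, so the core dataset stays usable while the user keeps control over everything else.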
Privacy is ultimately about control over what is disclosed, and with whom. When users choose to share data over Oz, they have full visibility over what is shared and control over its recipients.
The Collaborative Future
In order to build towards a future where our data is treated fairly, we need to imagine a viable alternative to today’s covert data collection methods. Without another path forward, today’s data monopolies will continue to profit and expand using information we create.
At Oz, we imagine a future where the largest, most valuable datasets are available for anyone to use. All participants retain full control over what is shared, and are fairly compensated for use.
Our architecture creates a collaborative environment for the collection and use of data, promoting the open sharing of data instead of hoarding, leading to better products and a safer world.