Data Quality

Strategies for ensuring data accuracy

As a permissionless protocol, we face unique challenges in ensuring data quality. Since anyone can submit data to the protocol, mechanisms must exist to validate each data point.

In this article, we outline the strategies employed at Oz to prevent spam and ensure the highest possible signal-to-noise ratio.

Validation Strategies

Cybersecurity Data

The first dataset supported by the protocol is cybersecurity risk signals. These user-sourced data points can be used to determine if a domain, wallet, or smart contract is safe or malicious.

Let's consider a few sources of low-quality data, and how we can prevent them from poisoning the overall quality of our dataset.

Spam Submissions

Spam submissions occur whenever a user submits a random response to the protocol. For the cybersecurity dataset, this can take the form of randomly labeling a page as safe or unsafe.

Validators counter this strategy through outlier detection. Most domains are either entirely malicious or entirely safe, so if a good actor submits data to the network, we can expect them to be right almost 100% of the time.

Data validators periodically scan the dataset to identify outliers.

Consider a dataset with n total entries, where k entries are labeled "unsafe", and the remaining n-k entries are labeled "safe".

If a spammer picks a record at random and reports it as unsafe, their probability of being correct is the probability of selecting one of the k unsafe records out of the total n records:

Probability of correct guess = k/n

For good actors, we expect the accuracy of their submissions to be well above the probability of randomly guessing.

If a user's submissions are consistently outliers, their reputation will be negatively affected, and eventually their data will be excluded from the dataset. Since users originally spent OZT to submit data to the protocol, there is a negative incentive to spam the network.
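The outlier check described above can be sketched roughly as follows. This is a minimal illustration, not the validator's actual logic: the function names, the `(user, target, label)` tuple layout, the consensus dictionary, and the 0.9 cutoff are all assumptions made for the example.

```python
from collections import defaultdict


def random_guess_baseline(n: int, k: int) -> float:
    """Chance that a randomly chosen record is one of the k unsafe records."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    return k / n


def agreement_rates(submissions, consensus):
    """Fraction of each user's labels that match the consensus label.

    submissions: iterable of (user, target, label) tuples (hypothetical layout)
    consensus:   dict mapping target -> agreed label (hypothetical input)
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for user, target, label in submissions:
        if target not in consensus:
            continue  # nothing to compare this submission against yet
        totals[user] += 1
        hits[user] += int(label == consensus[target])
    return {user: hits[user] / totals[user] for user in totals}


def flag_outliers(rates, min_rate=0.9):
    """Users whose agreement rate falls below the (hypothetical) cutoff."""
    return sorted(user for user, rate in rates.items() if rate < min_rate)


consensus = {
    "a.example": "safe",
    "b.example": "unsafe",
    "c.example": "safe",
    "d.example": "safe",
}
submissions = [
    ("alice", "a.example", "safe"), ("alice", "b.example", "unsafe"),
    ("alice", "c.example", "safe"), ("alice", "d.example", "safe"),
    ("mallory", "a.example", "unsafe"), ("mallory", "b.example", "unsafe"),
    ("mallory", "c.example", "unsafe"), ("mallory", "d.example", "safe"),
]

rates = agreement_rates(submissions, consensus)
print(random_guess_baseline(n=4, k=1))
print(rates)
print(flag_outliers(rates))
```

In this toy data the honest user agrees with every consensus label, while the random labeler agrees only half the time and is flagged. A real validator would presumably pick the cutoff from the observed distribution of agreement rates rather than hard-coding it.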

Advanced Spam

Most domains on the internet are not malicious. Intelligent data providers may realize this, and report thousands or millions of domains as safe as a method to earn tokens.

But the most valuable threat signals provided by a threat intelligence feed indicate which domains are harmful, rather than which domains are safe. When data is retrieved from the Oz threat intelligence feed, users may filter to exclude safe domains. For the majority of threat intelligence use cases, this will be the default query.

Because users earn OZT only when their data is accessed, and this type of data is unlikely to be used, there is no incentive for users to share it with the protocol.
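To make the incentive concrete, here is a small sketch of a default query that excludes safe domains, and of how many accesses each submitter would receive as a result. The record fields (`target`, `label`, `submitter`) and function names are hypothetical and are not taken from the Oz feed API.

```python
from collections import Counter

# Hypothetical records; the field names are assumptions for this example.
RECORDS = [
    {"target": "news.example", "label": "safe", "submitter": "bulk_reporter"},
    {"target": "shop.example", "label": "safe", "submitter": "bulk_reporter"},
    {"target": "phish.example", "label": "unsafe", "submitter": "analyst"},
    {"target": "drainer.example", "label": "unsafe", "submitter": "analyst"},
]


def default_query(records):
    """Return only the records that are not labeled safe."""
    return [r for r in records if r["label"] != "safe"]


def accesses_by_submitter(results):
    """Count how many returned records each submitter contributed."""
    return Counter(r["submitter"] for r in results)


results = default_query(RECORDS)
counts = accesses_by_submitter(results)
print(counts["analyst"], counts["bulk_reporter"])
```

Under these assumptions, the account that only reports safe domains never appears in the query results, so its data is never accessed.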

Data Poisoning Attacks

A data poisoning attack occurs when a Sybil attacker creates multiple accounts to collude in reporting the same data point. This type of attack avoids outlier detection, as the user has created a situation where honest reporters are the outliers.

To avoid this type of attack, the threat intelligence validator will perform collusion detection. By clustering users that vote together on submissions, it is trivial to identify and punish users for this behavior.
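One simple way to approximate this kind of detection is to count how many identical votes each pair of users has cast, and flag the pairs that meet a match threshold. The sketch below uses that pairwise-overlap check as a stand-in for clustering. The data layout, function names, and threshold value are assumptions for illustration, not the validator's actual algorithm.

```python
from itertools import combinations


def shared_votes(votes_by_user):
    """Map each pair of users to the set of identical votes they both cast.

    votes_by_user: dict mapping user -> set of (entry_id, label) tuples
    (a hypothetical layout chosen for this example).
    """
    pairs = {}
    for a, b in combinations(sorted(votes_by_user), 2):
        pairs[(a, b)] = votes_by_user[a] & votes_by_user[b]
    return pairs


def suspicious_pairs(votes_by_user, match_threshold=3):
    """Pairs of users whose identical votes reach the match threshold."""
    return [
        pair
        for pair, shared in shared_votes(votes_by_user).items()
        if len(shared) >= match_threshold
    ]


votes = {
    "sybil_1": {(101, "safe"), (205, "safe"), (309, "safe"), (412, "unsafe")},
    "sybil_2": {(101, "safe"), (205, "safe"), (309, "safe"), (550, "unsafe")},
    "honest_1": {(101, "unsafe"), (777, "unsafe"), (412, "unsafe")},
    "honest_2": {(205, "unsafe"), (888, "safe")},
}

print(suspicious_pairs(votes))
```

Comparing every pair is quadratic in the number of users, so a production system would likely need an index or a proper clustering step. The pairwise version is shown only because it is the easiest to read.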

The odds that a given group of users votes for the same entries multiple times are similar to the birthday paradox, in which we determine the probability that two or more people in a group share a birthday.

In this scenario, we can calculate the expected number of collisions as follows:

Define

  • N as the total number of elements in the dataset (the total number of threat signals)

  • k as the number of elements each user votes on

  • n as the match threshold

Now we can approximate p, the odds of two users providing the same signals for a given entry, with the birthday paradox equation.

If the dataset consists of just 10,000 total entries, and each user submits 5 votes, the probability of two users voting on the same 3 elements is 0.0009995, or about 0.1%.

As the dataset grows larger, the probability of this kind of collision occurring shrinks further, since it falls roughly in proportion to the size of the dataset. As a result, data poisoning attacks become even easier to detect. If a user is found to be participating in data poisoning, their reputation will be negatively impacted.
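The equation itself is not shown in the text above, so the sketch below assumes the standard birthday approximation, 1 - exp(-k(k-1) / (2N)), with N as the total number of entries and k as the number of votes per user. That assumed form gives a value of about 0.1% for the 10,000-entry example, but it does not use the match threshold, so it should be read as an illustration rather than the protocol's exact calculation. The function and parameter names are hypothetical.

```python
import math


def collision_probability(total_entries: int, votes_per_user: int) -> float:
    """Birthday-style approximation: 1 - exp(-k(k-1) / (2N)).

    This formula is an assumption; the source text does not show its equation.
    """
    if total_entries <= 0 or votes_per_user < 0:
        raise ValueError("need total_entries > 0 and votes_per_user >= 0")
    k = votes_per_user
    return 1.0 - math.exp(-k * (k - 1) / (2 * total_entries))


# The probability shrinks as the dataset grows while votes per user stay fixed.
for total in (10_000, 100_000, 1_000_000):
    print(f"N={total:>9,}  p={collision_probability(total, 5):.7f}")
```

With five votes per user, each tenfold increase in dataset size cuts the approximate collision probability by about a factor of ten.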

Reputation Scores

Reputation is maintained for every user within the Oz ecosystem. Reputation scores are improved by connecting third-party accounts and submitting data to the protocol.

As validators execute validation logic on the dataset, they also adjust the reputation of the users who submitted each data point.

If validators discover a user is participating in any kind of attack described above, the user's reputation score will be negatively impacted. Data from users with low reputation is never accessed, so attackers are not compensated for the data they provide.
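As a rough sketch of how reputation could gate data access, the example below keeps an integer score per user, lets validation adjust it, and drops records from users below a cutoff. The 0 to 100 scale, the starting score, the adjustment sizes, and the cutoff are all invented for illustration. The text above does not specify how Oz computes or thresholds reputation.

```python
from dataclasses import dataclass

MIN_SCORE = 30  # hypothetical cutoff below which data is never accessed


@dataclass
class Reputation:
    score: int = 50  # hypothetical starting score on a 0-100 scale

    def adjust(self, delta: int) -> None:
        """Apply a validator adjustment, clamped to the 0-100 range."""
        self.score = max(0, min(100, self.score + delta))


def usable_records(records, reputations, min_score=MIN_SCORE):
    """Keep only records whose submitter meets the reputation cutoff."""
    return [r for r in records if reputations[r["submitter"]].score >= min_score]


reputations = {"alice": Reputation(), "mallory": Reputation()}
reputations["alice"].adjust(+5)     # validated submission
reputations["mallory"].adjust(-15)  # flagged as an outlier
reputations["mallory"].adjust(-15)  # flagged for collusion

records = [
    {"target": "phish.example", "label": "unsafe", "submitter": "alice"},
    {"target": "bank.example", "label": "unsafe", "submitter": "mallory"},
]
print([r["target"] for r in usable_records(records, reputations)])
```

Because excluded records are never returned, the submitter behind them earns nothing for them, which is the disincentive the section describes.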

Collectors and Data Standardization

Every data point submitted to Oz goes through a standard collector, built using our collector SDK (available late 2024).

Collectors normalize data before it's submitted to the protocol, ensuring only correctly formatted values are received, and that every record in the dataset has the same structure and validation applied.
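The kind of normalization a collector performs might look something like the sketch below. It does not use the collector SDK, whose interface is not described here. The record fields, the allowed labels, and the function name are assumptions chosen to match the cybersecurity examples in this article.

```python
ALLOWED_LABELS = {"safe", "unsafe"}  # hypothetical label set


def normalize_record(raw: dict) -> dict:
    """Return a record with a canonical target and label, or raise ValueError."""
    # Trim whitespace, lowercase, and drop a trailing dot from the target.
    target = str(raw.get("target", "")).strip().lower().rstrip(".")
    label = str(raw.get("label", "")).strip().lower()
    if not target:
        raise ValueError("target is required")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unsupported label: {label!r}")
    # Every accepted record has exactly the same two fields.
    return {"target": target, "label": label}


print(normalize_record({"target": "  Example.COM. ", "label": "Unsafe"}))

try:
    normalize_record({"target": "example.com", "label": "maybe"})
except ValueError as err:
    print("rejected:", err)
```

Rejecting malformed input before submission means every record that does reach the dataset can be compared and validated the same way.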

Collectors are the most important tool used by Oz to ensure data quality. As of this writing, only the cybersecurity data collector is supported.

Additional collectors will be added to the network later this year, subject to community approval. Each collector has an accompanying dataset and set of validation logic to ensure data quality.

Sybil Attacks

Anywhere an incentive exists to own multiple accounts over a single account, Sybils will come.

As a data protocol, our objective is to design incentives that encourage the submission of real, high-quality data. Whether that data comes from one user or many users does not affect the result, provided the submissions are real and valid.

Oz prevents Sybil attacks through data validation strategies as described above.

Users pay OZT to submit data to the protocol and earn OZT when their data is accessed. Whether the data comes from one account or several does not affect the number of tokens earned. There is no incentive to create multiple accounts, and no harm to the protocol if multiple accounts are created.
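The arithmetic behind this claim can be checked with a toy calculation. The fee and reward values below are made up, and the model assumes a flat fee per submitted record and a flat reward per access, which the article does not specify.

```python
def net_tokens(accounts, fee_per_record, reward_per_access):
    """Net OZT across a set of accounts.

    accounts: list of (records_submitted, times_accessed) tuples, one per
    account (a hypothetical layout for this example).
    """
    submitted = sum(s for s, _ in accounts)
    accessed = sum(a for _, a in accounts)
    return accessed * reward_per_access - submitted * fee_per_record


# Hypothetical pricing: 1 OZT per submission, 3 OZT per access.
single_account = net_tokens([(100, 40)], fee_per_record=1, reward_per_access=3)
four_accounts = net_tokens([(25, 10)] * 4, fee_per_record=1, reward_per_access=3)
print(single_account, four_accounts)
```

Under these assumptions the same 100 records with the same 40 accesses net the same amount whether they come from one account or four, so splitting into Sybil accounts gains nothing.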
