
Data Points in the Void

As presented at Gathering of the Tribes (GoTT)

by Fire & Earth Erowid

Oct 2002

Citation: Erowid, Fire & Erowid, Earth. "Data Points in the Void". Erowid Extracts. Oct 2002;3:6-7.

We confront the bounds of human knowledge and the limitations of interpretation on a daily basis. It has become apparent in the Erowid Project, as we work to answer people's questions and add to the sum of information available about psychoactives, that many people assume the world holds far more definitive answers than it actually does. In the course of our work we interact with a wide variety of people, and this assumption is not unique to the field of psychedelic research.

In May 2002, we had the opportunity to speak at the Gathering of the Tribes (GoTT) in Los Angeles. GoTT is a non-commercial collaborative project that brings together members of the North American and European underground electronic dance music community for workshops and events that support arts and activism. For the conference we decided to touch briefly on the issue of data interpretation in one of our presentations. Our talk was formulated around a visualization we sometimes use to describe the state of the Erowid archive. Many aspects of it are oversimplified for brevity.

* * * * *

The amount of data that humankind has accumulated is truly staggering. Much that has been documented has been lost, and most of what remains is simply inaccessible to most individuals. Unique information fills millions of books and thousands of libraries around the world. With the digital revolution, server farms have cropped up: mammoth buildings filled with row after row of floor-to-ceiling racks of servers, many filled with unique collections of information. But with all this information, we are still left without clear answers to most questions.

Now imagine each piece of this data as a point of light in an infinite, 3-dimensional black space. Each point represents a single documented fact or idea. Relationships between data points are represented by lines or bonds of different strengths: perhaps two pieces of data were acquired during the same experiment, or maybe one piece of knowledge was only possible after another was proven to be true. Some of these connections are thick and ropey, others thin and almost invisible. Together, these points and lines form a vast webwork of information. While there are other layers that can be viewed, such as interpretations of data or commentary, for now we're looking solely at the data layer.

This network of points and lines stretches for as far as we can see in all directions. Yet, despite the vast quantity of information currently in the collective human archives, the data field is mostly dark, empty space between lines and points. There are large areas of the data field where nothing is known at all. There are areas where there are only a few points and a few tentative lines, where the data is thin and not much is known. There are also areas of knowledge that are quite densely populated with data points. In these areas, the relationships between the points are relatively well understood, agreed upon, and the connecting lines are thick and plentiful.

But overall, it's mostly empty space.

When we want to answer a question, we drill down, through the layers of interpretation and commentary, to see whether there is a data point that directly answers our question. In most cases, specific questions do not directly hit a data point but fall in the spaces in between; there may be pieces of data related to our question but nothing which directly addresses it.

In these cases, the best we can do is attempt to interpolate from the closest (most relevant) points. Sometimes this interpolation is easy, with nearby data points providing an obvious answer on which most people can agree. But when the relevant data points get further away, when the links become more tenuous, interpretation becomes difficult and controversial.
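The interpolation step can be made concrete with a rough sketch. This is our own illustration rather than anything used in the Erowid archive: the function name `interpolate`, the parameter `k`, and the choice of inverse-distance weighting are all assumptions, and treating a "fact" as a number at a position in space is a deliberate oversimplification. The sketch returns an estimate along with the distance to the nearest data point, as a crude signal of how tenuous the answer is.

```python
import math

def interpolate(question, data_points, k=3):
    """Estimate a value at `question` from the k nearest data points.

    `data_points` is a list of (position, value) pairs, where each
    position is an (x, y, z) tuple. Returns (estimate, nearest_distance).
    All names here are hypothetical, chosen only for illustration.
    """
    if not data_points:
        raise ValueError("no data points to interpolate from")

    # Rank every known point by its distance from the question.
    ranked = sorted(data_points, key=lambda p: math.dist(question, p[0]))
    nearest = ranked[:k]
    nearest_distance = math.dist(question, nearest[0][0])

    # The question lands directly on a data point: no interpolation needed.
    if nearest_distance == 0.0:
        return nearest[0][1], 0.0

    # Inverse-distance weighting: closer points count for more.
    weights = [1.0 / math.dist(question, pos) for pos, _ in nearest]
    estimate = sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)
    return estimate, nearest_distance
```

A large `nearest_distance` corresponds to the case where the relevant data points are far away: the function still returns a number, but that number rests on very little.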

There is no single agreed-upon way to interpret most data because the process of interpretation is defined by personal bias. Interpretations are abstractions of the data layer that form another mostly distinct layer of points and connections. Interpretations are shaped by the perceived relevance and reliability of selected points of data, by the assessed importance of relationships, by the views, biases, and opinions of the interpreter, and occasionally by fabricated or assumed data points. Data points that are considered spurious, inaccurate, or irrelevant are filtered out by the interpreter as the data is assembled into a sensible order. Interpretations are, in part, an attempt to fill in the empty space between the actual data points, so when we visualize the interpretation layer, the points and lines cover more space and their edges are less distinct--leaving fewer holes--than in the underlying data layer.

Because most answers are based on the interpretive layer, who answers a question is often as important as the underlying data itself.

When trying to answer the question "Is MDMA dangerous?" a physician working at a drug treatment program is more likely to heavily weigh reports of addiction and brain damage, while a transpersonal therapist might be more likely to value information about possible therapeutic uses and consider the relative danger of daily use of antidepressants.

Most people want simple answers, but the answers are seldom simple.

Imagine looking at the actual data points related to MDMA neurotoxicity. One point may be, "When a Sprague-Dawley breed rat, housed in a small plastic box at 75°F, is given 5 mg per kilogram of MDMA injected into its abdomen four times over the course of eight hours, it shows lasting reductions of serotonin and serotonin metabolites in several regions of its brain." For most people, this specific piece of data is of limited value. What do Sprague-Dawley breed rats have to do with humans? How would this change if the MDMA were given orally rather than injected? How does that dose compare to a normal human dose and do lower doses produce the same effects? What effect do reduced serotonin levels have? Are the changes permanent or do serotonin levels return to normal?

There are thousands of similar data points, written in ScienceSpeak, that only a small portion of the population can read, let alone really understand. There is more data available in most specific fields than any single person can keep in their head at one time. And there is strong pressure, when trying to inform people, to simplify answers, to draw conclusions where there are still mostly questions. This pressure leads to oversimplifications that further introduce errors into the interpretive layer.

The image becomes more complicated when we consider that each apparent data point may or may not actually be "true". It could be false information, a typo, a misstatement of fact; it could be based on a political or moral viewpoint that one does not agree with. The data field of psychoactive drug information is polluted by many apparent data points that are political, legal, or moral views masquerading as "fact layer" data. As these pseudo-facts spread into the interpretive layers, the pollution also spreads.

Because the basic interpretive process requires so much specific skill and knowledge, we are forced to rely on layers of interpretation and abstracted information. We are dependent on generalizations and summaries. It becomes imperative that we who accept answers from others consider what effect the funding, motivation, and viewpoint of those supplying answers might have on the conclusions they draw. We need to demand that answers come with ways to verify them independently.

One of the challenges we face, both as a species and with the Erowid Project, is to create better systems for improving the reliability and quality of complex information. One thing we believe will help improve the situation is for more people to offer interpretations of data that can easily be traced back to their anchor points in the vast web of data. Without being able to find the direct and explicit connections between interpretation and data, a view cannot easily be evaluated for its accuracy, even by those with expertise in the field.

We continue to work to provide people with more access to the web of data and the links between answers and underlying data layers.