I'm not sure it's worse than misinformation. In my field, bad data often has a more damaging impact than no data.
But I suppose it will depend on the circumstances, and I'd honestly be interested to hear your thoughts on why censorship is worse.
As for the inevitability of abuse? When it comes to corporate interests, that seems to be nearly axiomatic. The Verge's list of fascinating & horrifying exchanges at Apple about app approvals & secret deals makes for a great case study in this. [0]
Censorship is bad data, because it is selectively excluded data.
If gamma rays randomly excluded one post in a thousand, that would be missing data. Censors excluding one post in ten thousand is worrying because they have motivations of their own, which gamma rays do not.
It looks like we're going to get a massive test of largely misinformation (US) vs. largely censorship (China) writ large in the coming decades. Place your bets on the outcome.
From my understanding, China's model of media control focuses more on dilution and distraction than on overt censorship.
Both exist. But the larger effort is put into distraction.
The recent Russian model is more on bullshit and subverting notions of trust entirely.
American propaganda seems largely based on a) what sells, b) promoting platitudes and wishful thinking, and c) (at least historically) heart-warming (rather than overtly divisive) notions of nationalism.
The c) case is now trending more toward divisive and heat-worming.
Yes, censorship and propaganda go hand in hand. In 1922 Walter Lippmann wrote in his seminal work, Public Opinion,
> Without some form of censorship, propaganda in the strict sense of the word is impossible. In order to conduct a propaganda there must be some barrier between the public and the event. [1] [2]
This is 'some bad data' vs. 'systematically biased data', and the latter is much worse. Most datasets will contain some bad data, but it can be worked around because the errors are random.
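To make that concrete, here's a toy sketch (made-up numbers, plain Python) of why randomly lost data barely moves an estimate while selectively excluded data shifts it systematically:

  import random

  random.seed(0)
  truth = [random.gauss(100, 15) for _ in range(10_000)]   # hypothetical ground truth

  # Random loss ("gamma rays"): roughly 1 value in 1,000 vanishes at random.
  random_loss = [x for x in truth if random.random() > 0.001]

  # Selective exclusion: a censor drops the top 5% of values.
  cutoff = sorted(truth)[int(0.95 * len(truth))]
  censored = [x for x in truth if x < cutoff]

  def mean(xs):
      return sum(xs) / len(xs)

  print(f"true mean           : {mean(truth):.2f}")
  print(f"after random loss   : {mean(random_loss):.2f}")   # barely moves
  print(f"after selective loss: {mean(censored):.2f}")      # systematically biased low

The random loss only adds a little variance; the selective loss biases everything built on top of it.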
A statement of "I don't know" clearly indicates a lack of knowledge.
A statement of "I have no opinion" clearly indicates that the speaker has not formed an opinion.
In each case, a spurious generated response:
1. Is generally accepted as prima facie evidence of what it purports.
2. Must be specifically analysed and assessed.
3. Is itself subject to repetition and/or amplification, with empirical evidence suggesting that falsehoods outcompete truths, particularly on large networks operating at flows which overload rational assessment.
4. Competes for attention with other information, including the no-signal case specifically, which does very poorly against false claims as it is literally nothing competing against an often very loud something.
Yes: bad data is much, much, much, much worse than no data.
It's useful to note what is excluded. But you exclude bad data from the analysis.
Remember that what you're interested in is not the data but the ground truth that the data represent. This means that the full transmission chain must be reliable and its integrity assured: phenomenon, generated signal, transmission channel, receiver, sensor, interpretation, and recording.
Noise may enter at any point. And that noise has ... exceedingly little value.
Deliberately inserted noise is one of the most effective ways to thwart an accurate assessment of ground truths.
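As a toy illustration (entirely hypothetical stages and numbers): zero-centred noise anywhere in the chain averages out over enough observations, while a deliberately inserted offset never does:

  import random

  def observe(ground_truth, noise_by_stage):
      # Toy chain: phenomenon -> signal -> channel -> sensor -> interpretation -> record.
      # Each stage may contribute its own noise term.
      value = ground_truth
      for stage in ("signal", "channel", "sensor", "interpretation", "record"):
          value += noise_by_stage.get(stage, 0.0)
      return value

  random.seed(1)
  honest = [observe(42.0, {"sensor": random.gauss(0, 0.5)}) for _ in range(1000)]
  jammed = [observe(42.0, {"sensor": random.gauss(0, 0.5), "channel": 3.0})
            for _ in range(1000)]

  print(sum(honest) / len(honest))   # ~42: zero-centred sensor noise averages out
  print(sum(jammed) / len(jammed))   # ~45: a deliberately inserted offset never does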
Defining terms here is important, so let's avoid the word bad for a moment because it can be applied in different ways.
1) You can have an empty dataset.
2) You can have an incomplete dataset.
3) You can have a dataset where the data is wrong.
All of these situations, in some sense, are "bad".
What I'm saying is that, going into a situation, my preference would be #2 > #1 > #3.
Because I always assume a dataset could be incomplete, that it didn't capture everything. I can plan for it, look for evidence that something is missing, try to find it. If I suspect something is missing but can't find it, then I at least know that much, and maybe even the magnitude of the uncertainty it adds to the situation. Either way, I can work around it, understanding the limits of what I'm doing, or, if there's too much missing, make a judgement call and say that nothing useful can be done with it.
If I have what appears to be a dataset that I can work with, but the data is all incorrect, I may never even know it until things start to break or, before that if I'm lucky, I waste large amounts of time to find out that the results just don't make sense.
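A small illustrative sketch (hypothetical sensor readings) of the difference between #2 and #3:

  import math

  readings_incomplete = [20.1, 19.8, None, 20.4, None, 20.0]   # gaps announce themselves
  readings_wrong      = [20.1, 19.8, 35.7, 20.4, 35.9, 20.0]   # miscalibrated but plausible-looking

  # Case #2: incompleteness is detectable up front, so its impact can be bounded.
  missing = sum(1 for r in readings_incomplete if r is None)
  print(f"{missing}/{len(readings_incomplete)} readings missing -> known uncertainty")

  # Case #3: nothing flags the wrong values; they pass a naive sanity check
  # and quietly shift every downstream result.
  usable = [r for r in readings_wrong if r is not None and not math.isnan(r)]
  print(f"mean of the 'clean-looking' data: {sum(usable) / len(usable):.1f}")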
It's probably important to note that #2 and #3 are also not mutually exclusive. Getting out of the dry world of data analysis, if your job is propaganda & if you're good at your job, #2 and #3 combined is where you're at.
I'd argue Facebook's censorship leaves us with 2 and 3. They don't remove things because they're wrong; they remove them because they go against the current orthodoxy. Most things are wrong, so most things that go against the modern orthodoxy are wrong... but wrong things that go WITH the modern orthodoxy aren't removed.
It's like a scientist who removes the outliers that refute his ideas, but not the ones that support them.
Let's note that this thread has been shifting back and forth between information publicised over media and data used in research.
The two aren't entirely separate things, but they have both similarities and differences.
Data in research is used to confirm or deny models, that is, understandings of the world.
Data in operations is used to determine and shape actions (including possibly inaction), interacting with an environment.
Information in media ... shares some of this, but is more complex in that it both creates (or disproves) models, and has a very extensive behavioural component involving both individual and group psychology and sociology.
Media platform moderation plays several roles. In part, it operates in a context where the platforms are already performing their own selection and amplification, and where there's now experimental evidence that, even in the absence of any induced bias, disinformation tends to spread, especially in large and active social networks.
The situation is made worse when there's both intrinsic tooling of the system to boost sensationalism (a/k/a "high engagement" content), and deliberate introduction of false or provocative information.
TL;DR: moderation has to compensate for and overcome inherent biases toward misinformation, and take into consideration both causal and resultant behaviours and effects. At the same time, moderation itself is subject to many of the same biases as the information network as a whole (false and inflammatory reports tend to draw more reports and quicker actions), as well as spurious error rates (as I've described at length above).
All of which is to say that I don't find your own allegation of an intentional bias, offered without evidence or argument, credible.
An excellent distinction. In the world of research & operations data, I only very rarely deal with data that is intentionally biased. The cases can be counted on the fingers of one hand. Cherry-picked is more common, but intentionally wrong to present things in a different light, that's rare.
Well, it's rare that I know of. The nature of things is that I might never know. But most people that don't work with data as a profession also don't know how to create convincingly fake data, or even cherry pick without leaving the holes obvious. Saying "Yeah, so I actually need all of the data" isn't too uncommon. Most of the time it's not even deliberate, people just don't understand that their definition of "relevant data" isn't applicable. Especially when I'm using it to diagnose a problem with their organization/department/etc.
Propaganda... Well, as you said, there's some overlap in the principles. Though I still stand by my preference of #2 > #1 > #3. And #3 alone over #2 & #3 combined.
Does your research data include moderator actions? I imagine such data may be difficult to gather. On reddit it's easy since most groups are public and someone's already collected components for extracting such data [1].
I show some aggregated moderation history on reveddit.com e.g. r/worldnews [2]. Since moderators can remove things without users knowing [3], there is little oversight and bias naturally grows. I think there is less bias when users can more easily review the moderation. And, there is research that suggests if moderators provide removal explanations, it reduces the likelihood of that user having a post removed in the future [4]. Such research may have encouraged reddit to display post removal details [5] with some exceptions [6]. As far as I know, such research has not yet been published on comment removals.
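For anyone curious what an aggregate view can look like from a raw dump (this is just an illustration, not the tooling linked above): reddit marks a moderator-removed comment's body as "[removed]" and an author-deleted one as "[deleted]", so a local archive of comment objects can be tallied along those lines:

  import json
  from collections import Counter

  def removal_stats(path):
      # Tally removal markers in a newline-delimited JSON archive of comments.
      # Relies on reddit's convention: a body of "[removed]" indicates a
      # moderator/admin removal, "[deleted]" a removal by the author.
      counts = Counter()
      with open(path, encoding="utf-8") as fh:
          for line in fh:
              body = json.loads(line).get("body", "")
              if body == "[removed]":
                  counts["moderator_removed"] += 1
              elif body == "[deleted]":
                  counts["user_deleted"] += 1
              else:
                  counts["visible"] += 1
      return counts

  # print(removal_stats("worldnews_comments.ndjson"))  # hypothetical local archive

This only sees removals that are already reflected in the archive, so it undercounts; the point is just that the signal is extractable.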
Data reliability is highly dependent on the type of data you're working with, and the procedures, processes, and checks on that.
I've worked with scientific, engineering, survey, business, medical, financial, government, internet ("web traffic" and equivalents), and behavioural data (e.g., measured experiences / behaviour, not self-reported). Each has ... its interesting quirks.
Self-reported survey data is notoriously bad, and there's a huge set of tricks and assumptions that are used to scrub that. Those insisting on "uncensored" data would likely scream.
(TL;DR: multiple views on the same underlying phenomenon help a lot --- not necessarily from the same source. Some will lie, but they'll tend to lie differently and in somewhat predictable ways.)
Engineering and science data tend to suffer from pre-measurement assumptions (e.g., what you instrumented for vs. what you got). "Not great. Not terrible" from the series Chernobyl is a brilliant example of this (the instruments simply couldn't read the actual amount of radiation).
In online data, distinguishing "authentic" from all other traffic (users vs. bots) is the challenge. And that involves numerous dark arts.
Financial data tends to have strong incentives to provide something, but also a strong incentive to game the system.
I've seen field data where the interests of the field reporters outweighed the subsequent interest of analysts, resulting in wonderfully-specified databases with very little useful data.
Experiential data are great, but you're limited, again, to what you can quantify and measure (as well as having major privacy and surveillance concerns, and often other ethical considerations).
Government data are often quite excellent, at least within competent organisations. For some flavour of just how widely standards can vary, though, look at reports of Covid cases, hospitalisations, recoveries, and deaths from different jurisdictions. Some measures (especially excess deaths) are far more robust, though they also lag considerably from direct experience. (Cost, lag, number of datapoints, sampling concerns, etc., all become considerations.)
I've worked with a decent variety as well, though nothing close to engineering.
> Self-reported survey data is notoriously bad
This is my least favorite type of data to work with. It can be incorrect either deliberately or through poor survey design. When I have to work with surveys, I insist that they tell me what they want to know, and I design it. Sometimes people come to me when they already have survey results, and sometimes I have to tell them there's nothing reliable I can do with it. When I'm involved from the beginning, I have final veto. Even then I don't like it. Even a well-designed survey with proper phrasing, unbiased Likert scales, etc. can have issues. Many things don't collapse nicely to a one-dimensional scale. Then there is the selection bias inherent when, by definition, you only receive responses from people willing to fill out the survey. There are ways to deal with that, but they're far from perfect.
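One of those imperfect ways is post-stratification weighting, i.e., reweighting responses so each group counts in proportion to the population rather than to who answered. A toy sketch with made-up numbers:

  # Toy post-stratification: reweight so each age stratum counts in proportion
  # to the population, not to who happened to answer.  All numbers are made up.
  population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}
  respondents = {            # (n responses, mean score on some 1-5 item)
      "18-34": (400, 3.9),   # younger people over-respond in this toy example
      "35-54": (150, 3.1),
      "55+":   (50,  2.4),
  }

  n_total = sum(n for n, _ in respondents.values())
  raw_mean = sum(n * score for n, score in respondents.values()) / n_total
  weighted_mean = sum(population_share[g] * score
                      for g, (_, score) in respondents.items())

  print(f"raw mean      : {raw_mean:.2f}")       # dominated by who chose to answer
  print(f"weighted mean : {weighted_mean:.2f}")  # closer to the population mix

It fixes the composition of the sample, but does nothing about non-response bias within a group, which is part of why it's far from perfect.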
Bad data is often taken as good data, because sifting through it incurs 100x more friction than taking it at face value. When you ultimately get bad results you can just blame the bad data, and you still end up with a paycheck for the month(s) you wasted.
As a metaphor, you can imagine a blind person in the wilderness who has no idea what is in front of him. He will proceed cautiously, perhaps probing the ground with a stick or his foot. You could also imagine a delusional man in the same wilderness incorrectly believing he's in the middle of a foot race. The delusional man will just run forward at full speed. If the pair are in front of a cliff...
As the saying goes, it's not what you don't know that gets you into trouble. It's what you know for sure that just ain't so.
Not quite: if you have no data, you get new hires and new systems to collect and track it.
You may be ignorant, but you know it, and can deal with it. Let's call it starting from 0.
When you have bad data, you frequently don't know that you have bad data until things go very very wrong. You aren't starting from 0. 0 would be an improvement.
This seems like extending the "known knowns" concept to an additional dimension, involving truth.
In the known-knowns model, you have knowledge and metaknowledge (what you know, what you know you know):
                     What you know
                      K     U
  What you      K    KK    KU
  know you      U    UK    UU
  know
If we add truth to that, you end up with a four-dimensional array with dimensions of knowledge, knowledge of knowledge, truth-value, and knowledge-of-truth-value. Rather than four states, there are now 16:
            TT     TF     FT     FF     (truth-value & belief of truth-value)
           ----   ----   ----   ----
   KK  |   KKTT   KKTF   KKFT   KKFF
   KU  |   KUTT   KUTF   KUFT   KUFF
   UK  |   UKTT   UKTF   UKFT   UKFF
   UU  |   UUTT   UUTF   UUFT   UUFF
False information is the FT and FF columns.
In both the TF and FT columns, belief of the truth-value of data is incorrect.
In both the KU and UU rows, there is a lack of knowledge (e.g., ignorance), either known or unknown.
(I'm still thinking through what the implications of this are. Mapping it out helps structure the situation.)
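If it helps anyone else poke at it, the 16 cells can be enumerated mechanically. A small sketch, using one possible encoding (first two letters = row label, last two = column label from the table above):

  from itertools import product

  # Each state is row label + column label:
  # (knowledge-of-knowledge, knowledge, truth-value, belief-about-truth-value).
  states = ["".join(s) for s in product("KU", "KU", "TF", "TF")]
  assert len(states) == 16

  false_information = [s for s in states if s[2] == "F"]    # FT and FF columns
  mistaken_belief   = [s for s in states if s[2] != s[3]]   # TF and FT columns
  ignorance         = [s for s in states if s[1] == "U"]    # KU and UU rows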
[0] https://www.theverge.com/22611236/epic-v-apple-emails-projec...