The Scent of Knowledge

AI isn’t the end of knowledge. It’s the birth of meta-knowledge.

by Kurt Schiller // Illustration by Vin Tanner

Technology has always had an uneasy relationship with the truth.

The birth of the printing press was accompanied not only by an upswing in plagiarism, but also by a proliferation of the idea of plagiarism itself. The adoption of the post office, the newspaper, the telegraph, and the telephone was in each case accompanied by novel forms of scams and flimflammery that took advantage of evolving social norms and communication standards. As technology continues to evolve and our culture continues to adapt around it, each new complication is inevitably accompanied by novel misuses and misapplications, scams and schemes and half-measures that spread and proliferate with no less speed than the technology itself.

The recent and rapid adoption of generative AI (including large language models like ChatGPT and image generators like Stable Diffusion) has proven that this tendency is alive and well. As these new technologies seek footholds in the digital economy, we have already begun to see scam sites, high-profile slip-ups, get-rich-quick schemes, and other digital malfeasance springing up around them like mushrooms after a brisk rain.

Appropriately, a great deal of concern has already been heaped upon the subject of AI adoption, in particular by the online left and progressive journalists. But while this early coverage and skepticism has been fruitful in highlighting specific instances of AI’s misuse and misapplication, it has also failed to identify a larger and much more consequential issue—in part by taking the claims of AI boosters too seriously, and in part by not taking them seriously enough.

The truth is that generative AI is not a sham. Nor is it “merely” a tool to replace artists, authors, journalists, and other members of the creative and knowledge industries. Rather, it is nothing less than the birth of a new kind of knowledge—a meta-knowledge, holding many of the hallmarks and yet none of the substance of existing facts and data, and which, once loosed, will prove maddeningly (and cost-prohibitively) difficult to identify, track, and mitigate.

* * * * *

To understand AI, you must first understand the internet. On the one hand, it is the most powerful informational tool ever developed. But on the other—as anyone who has used it for more than a short while can attest—it is also frustratingly unreliable.

Early internet norms were defined by a recognition of this unreliability. The early denizens of this now-ubiquitous digital domain were ciphers, faceless visages jealously guarding their true identities or intentionally adopting hastily manufactured noms de programmé for purposes both legitimate and illicit. Individual websites were just as likely to contain exhaustively researched and authoritative knowledge as complete nonsense or, worse, intentional misdirection—an outgrowth of the egalitarian, non-hierarchical foundations of the internet itself, a global distributed information network built not by top-down organizations, but (up until the late 2000s) by the largely unrestrained and undirected activities of individual users.

It’s appropriate, then, that the rise of the internet neatly corresponded with the introduction, or perhaps reintroduction, of a novel idea: the wisdom of crowds. What if the individual contributions to this vast info network—blogs, social media posts, reviews, and so on—are less important than the cumulative effect they provide? What if sheer volume can make up for quality and reliability? What’s one bad review when there are 200 others on the way? What’s one or two pranksters in the face of the masses of well-meaning pedants that populate the forums, feeds, and front pages of our networked universe?

This underlying theory, that the foibles of the individual can be mitigated or erased through sheer volume, lay at the heart of some of the internet’s earliest and most profitable successes. Google, for one example, made its initial fortune sifting wheat from the chaff of online noise, and it did so in large part by abstracting top-down specificity—“What do we think are the best answers to a question, search, or need?”—through that same mass of human activity and data—“What answers do our users prefer?” Facebook, Reddit, Twitter, and many others operate around some variation of this principle: likes, thumbs-ups, shares, and retweets are all essentially devices for the refinement, abstraction, and above all curation of information of one sort or another. They are a mechanism to harness the chaotic behavior of users to self-direct the information that they’ll be shown—sometimes with benign if naive intent, sometimes with overtly sinister objectives.

A more refined version of this ideology—one that abandons the complex but still coherent abstraction of mass numbers for the more opaque complexities of pure data, which seeks not to drown out the noise but to transmute noise into something akin to truth—is also at the heart of the emerging technology of large language model-based generative AI. Google sought truth by the law of averages; machine learning, it is understood, digs deeper, not merely engaging with oceans of data in toto but embarking on what amount to statistical fishing expeditions, combing the interrelations of pure data for the grains of insight… such as they are.

And make no mistake: Large language models and cousin technologies like Stable Diffusion could not exist without the internet. No other human network or technology could have readily and all-but-instantly provided the vast sets of training data necessary to birth these systems in their modern form, and certainly not at the speed, specificity, and near-zero price point on which their initial construction depended. Philosophically if not mathematically, LLMs like ChatGPT might almost be understood as a sort of meta-internet; a rapid and generative index that contains not the information of the internet per se, but rather a generalized abstraction of it; a prediction of the information these data sets might have contained, or which might have been readily extrapolated from them.

This is a challenge, of course. The genius and the curse of modern LLMs is that they are explicitly non-symbolic, eschewing the representative meaning of our own cognition for the unfettered madness of pure data.

This is a distinction that has proven difficult to grasp for AI critics and advocates alike, leading to questions about what ChatGPT “thinks” a concept or idea “is”—when the truth is that not only do such models not “think” (at least, not as we understand it), but they completely lack the capacity for symbolic thought and understanding.

When a human thinks about “a lemon,” we understand that we are thinking about the idea of a lemon—a collection of attributes and relationships to other ideas, such as “juice”, “tree”, and “-heads”—but also that this idea represents something in the real world: an actual, physical lemon. This is not a distinction that LLMs like ChatGPT can make. They have no concept of what a lemon, a chicken, or France “is”, because the idea of “is” eludes them. At the same time, an LLM still “knows” what sort of statements the token “chicken” is likely to appear in, around, and absent; that a chicken is more likely to accompany words like yellow or white than purple or green; that lemons are often accompanied by words like sour, bitter, astringent, and beverages like lemonade or soda; which countries and concepts France is often mentioned alongside. Large language models and other similar tools can capture the complex interrelationships between words and ideas, but they lack the ability to see beyond the idea and conceive of the thing it represents. For them, the data—the meta-knowledge, the concept of “lemon” without the lemon itself—is all that exists.

And yet there is symbolism in their application, if not their function; much of the early buzz for generative AI lies in its function as a sort of meta-librarian, a human-usable front-end for the increasingly unnavigable straits of information that constitute our modern data seas, polluted and cluttered as they are. Rather than dive deep ourselves in search of the pearls of truth, as Google once encouraged us to do, LLMs encourage us to move beyond the data set itself. They do not search the existing internet, but surface possibility from a hypothetical that exists only as model: meta-data without the data itself.

* * * * *

As any student of semiotics could tell you, there is a fundamental difference between the symbolic representation of a thing and the thing itself—no matter how close that resemblance might appear at first glance, and no matter how useful that representation may be in practice. Search engines, social media, and large language models all rely on data abstractions, presenting us with a sanitized and more usable view of mass data. But while the mechanisms of the first two may appear superficially similar to LLMs, the reality is quite different. Earlier abstractions like algorithmically curated social media feeds or even Google’s increasingly reviled search results all possess a visible and discrete referent: they represent something that exists, and if you so desire you can examine the particular origins of each bit of data. On Google, you can click through to the original site and see if it is a scholarly resource or junk science, regardless of Google’s own opinion on its utility; on social media, you might look at the profile and posting history of an individual user and judge their trustworthiness. The surfaced data presented to us by these systems are data signifiers—intelligent abstractions of manual searches, albeit at many times the speed and complexity with which a human could hope to navigate it—but they are signifiers with a readily identifiable signified, a signpost which might be followed to a point of origin.

Not so with LLMs. Unlike Google, whose ultimate or at least original goal was to produce an index of existing knowledge, large language models like ChatGPT can deliver only meta-knowledge: a statistically probable semblance of knowledge, data, and communication that is likely to have occurred based upon a set of preconditions (be they a text prompt or some combination of prompt and rules/restrictions). You can ask a large language model for a source for its claims, and it will happily provide one—but, at least for the current generation of LLMs, these “origins” are nonsense. Unlike Google, which scours the datascape for seemingly reliable sources and the information they contain, contemporary LLMs lack any such one-to-one relationships with particular claims or sources. Their source, in a very real sense, is everything they contain. There is no origin point. There is only abstraction, synthesis, and recombination—all without the benefit or ability to see the bones of the soup, as it were.

Do you want to know the origin of ChatGPT’s useful information, the meta-knowledge that so many companies have so quickly identified as so valuable? Tough shit. It is, quite simply, impossible to know in any real sense, at least with the current generation of technology. And while developments such as self-checking and self-revising AI with the ability to seek out and compare secondary sources are already underway, they cannot address the underlying issue that discrete knowledge or data simply does not truly exist within a tool like ChatGPT. And even more troublingly, the speed at which this meta-knowledge is delivered opens up new and unique hazards when translated to a human scale.

The shortcomings of human abstraction are well-known and understood, and human fallibility is a precondition for designing any system. If you were to hire a random person off the street to perform some sort of essential data task—say, looking up the addresses of thousands of different stores—you would presumably check their work at some point to verify whether it was accurate or just nonsense. The same sort of verification could in theory be applied to the output of LLMs, but it would be a slow, labor-intensive process—and it would be even more so when conducted on generative AI output, which cannot be directly traced to any individual source or claim. Given that the chief selling point of large language models is eliminating human labor and increasing speed, we must immediately face the question of what strange logic could possibly compel AI adopters to turn around and reintroduce the very factors they sought to eliminate. The goodness of their hearts? Absent some sort of legal or regulatory requirement, we should assume that any AI output presented to an end user has received only the barest and most cursory sort of verification—at best, the exact minimum that’s needed for the application, in fact. (And in practice, companies will be compelled to push even that minimal limit; after all, the less human labor required, the better.)

And there are other dangers above and beyond mere shoddy data. Generative AI’s on-demand and all but instantaneous nature opens up the possibility of new and entirely novel forms of untruth; consider the prospect of “designer” lies and disinfo campaigns (whose intent could be marketing, political, criminal, etc.) with a target list of one. Savvy internet users are well acquainted with the most common forms of online deception, but how will they fare against techniques specifically designed to deceive them and them alone? Less tech-knowledgeable users already struggle to identify digital imposters pretending to be friends, family, and coworkers. Will they be able to identify AI-enabled imposters? Digital fraud is already an $8 billion industry—and given that criminals are often the first to adopt new technology, we should expect that number to grow.

What we are left with, then, is a technology that is simultaneously an index and the work itself. But it is a work that exists only in potentia and which can never itself be viewed—a sort of Schrödinger’s encyclopedia closely guarded by a deranged librarian with an infrequent, yet irresistible propensity to lie with remarkable speed and aplomb. And while there is research showing that there are distinct mechanical differences between an LLM that is lying—“I know this is probably not correct, but I am saying it anyway”—and one experiencing what AI researchers have taken to calling a hallucination—“I believe this to be correct, but I am mistaken”—it remains an open question how thoroughly such errors can be mitigated, and one whose existence has tellingly failed to slow the constant drumbeat of technology hype and boosterism.

Critics have already extrapolated this tendency to lie, prevaricate, and misrepresent into ominous predictions that AI-generated web content will quickly overwhelm real, human search results, leading to search results composed entirely of an impenetrable soup of half-truths, un-data, and algorithmic meta-knowledge, as even Google and other knowledge sifters lose the ability to tell the difference.

This is a possibility, but evidence suggests not a very likely one. Google is already no stranger to counter-optimization and malicious quasi-knowledge. Indeed, they have devoted (and continue to devote) enormous efforts to refining their search results, screening out the most execrable of the internet’s bullshit. It’s easy to dismiss these efforts, especially as their own search results have nevertheless become increasingly dominated by a glut of highly optimized SEO spam. But we should not take this laggardly rear-guard action as evidence of Google’s inability to act—it speaks, rather, to a fundamental tension at the heart of any knowledge business. Search engines do not serve one set of customers; they serve many groups, and many conflicting interests. While end-users might wish to see an end to WikiHow, Quora, Pinterest, and other such search-cluttering bêtes noires of the infosphere, from Google’s point of view such organizations are equally important to the business of search. You are one customer, true—but Google is just as beholden to the content farms, legitimate and otherwise, that make up an equally important half of their business, and for which you yourself are the most valuable product.

Furthermore, the data lords of the modern economy—Google, yes, but also Microsoft, Facebook, Apple, and indeed practically every other large tech company—require actual data to function, and it is in their best interests to retain the ability to sift SEO spam and other faux data (along with whatever future manifestations such distractions may adopt through the supercharging capabilities of AI) from the human-generated data on which their empires were built to begin with. We already know that many large language models degrade when trained on the output of other LLMs, and businesses can run afoul of algorithmic half-knowledge just as easily as individual humans. The idea that this will somehow lead to the downfall of the current knowledge economy presumes that our ability to identify AI content will remain exactly as it is now—and yet, this is already starting to change.

While humans may not be able to readily distinguish fact from algorithmic meta-fact, it increasingly appears that other large language models can—whether through the intrinsic hallmarks of their output, or through the intentional inclusion of machine-readable data tell-tales called “watermarks”. Existing “AI content detectors” are unreliable at best, but this exact capability will soon form an underlying requirement of doing any kind of knowledge business online. Google, after all, needs at least a minimum of real data to keep its digital ad and search business afloat. The same goes for Microsoft and its business intelligence offerings, along with a host of other multinational mega-conglomerates whose businesses rely on their ability to provide factual data to other businesses. Law firms, GIS companies, academic content aggregators, publishing houses—a vast swath of the modern economy depends at some level on the existence of real, actual data, and for this reason the detection and flagging of AI content is unsurprisingly one of the most closely watched areas of this newfound economy.

The pressing question, then, is not “How will we be able to tell AI content from real human data?” but “Who will own and control the tools that will make it possible, and who will have access to them?” Given the centralized model that the nascent AI industry has held to—companies like OpenAI have increasingly eschewed academically minded transparency for corporate secrecy—the most obvious answer is that it will be the very companies who are so eagerly rushing to implement their opposite numbers: Google, Microsoft, OpenAI, and the like.

What we are witnessing, then, is not the complete death of internet knowledge, but the bifurcation of data and certainty itself. Most of us will have access to a “free” tier—our existing reservoir of human knowledge, but now shot through with the statistical prognostication and algorithmic droppings of AI models in all the various forms their output might take—while those who need it will shell out in one form or another for “premium” knowledge—knowledge which can be traced back to real sources and citations, and which has been cleansed of AI output. The ability to tell human activity from AI output is already big business, and provided AI adoption continues to grow, there is every reason to believe it will remain so; and absent some sort of government-mandated AI “right to know” law, we should not expect such a service to come cheap.

* * * * *

Most reactions to the proliferation of LLMs and their propensity for half-knowledge have fallen into one of two broad categories: the boosters, who assert that any and all issues with AI will be resolved by its rapid improvement and growing maturity; and the skeptics, who assert that every flaw will remain more or less as it is, and that the continued adoption of this technology will either doom the companies implementing it or cause the market for it to collapse completely, much like the recent collapse of cryptocurrencies and NFTs.

The reality is more complicated. To begin with, and contrary to the claims of the most extreme skeptics, large language models do work, and extremely well… at least, for a given value of “work”.

At my first full-time writing job, I spent an inordinate amount of time rewriting and condensing press releases into short, two-to-three sentence snippets, accompanied by a punchy headline. This task—which I performed for between 10 and 12 hours a week, and for which I was paid roughly $12 an hour—could easily and reliably have been performed by any of the current generation of LLMs, with nobody the wiser.

AI skeptics have tended to hand-wave away such ignoble content tasks as irrelevant or worthless. And yet for me (and my employer!), it had a very obvious and immediate worth: roughly $6,200 a year, measured in hourly wages. Equally “useless” content tasks exist throughout the modern business world, from writing short summaries of larger documents, to copying down customer service notes, to generating FAQs and documentation that is contractually mandated to exist but never actually meant for human consumption. (I also, as it happens, once worked for a government contractor writing long FAQs with an intended audience of precisely one: the state auditor who would verify we had satisfied our contractual obligation to provide the FAQ in question.)

As easy as it is to dismiss such tasks as the make-work jobs of modern capital, they have a genuine (and sizable) material impact on our world; and since AI excels at these tasks, cheaply and quickly, it beggars belief to continue asserting that it won’t be widely adopted for these exact purposes. It is comforting to fall back on the idea that AI “simply does not work”, but reality in this case has not kept pace with AI’s own most strident critics. The ability to eliminate human labor at extremely low cost is something of a magic money tree for capitalism; we should not assume that no one will avail themselves of its fruit just because it is difficult and painful to climb.

But we also should not assume, as the AI boosters do, that AI will only be deployed in places where it makes sense, or that new capabilities and improvements will address the issues that are already beginning to plague end users.

We have already seen search engines’ response to the un-data of SEO spam, and it has not been to ruthlessly eliminate it: rather, it has been to mitigate it, often through features and capabilities that address only part of the shortcomings. Google and Bing already generate “intelligent” snippets that seek to surface relevant information from the soup of nonsense that has suffused the larger corpus of search results. But it is only that: a mitigation, and one dictated by the needs of Google itself—snippets don’t really need to be correct, they only need to be at least plausible most of the time. (And indeed, they frequently already contain the exact sort of plausible but incorrect half-truths that AI excels at—even without the benefit of actual AI.) The end result of the endless homeostatic pressure between user and capital is not “good”, but merely “good enough”—good enough for Google to keep making money, and good enough that most users will at the very least tolerate its performance.

Data businesses’ adoption of AI has followed the same path and logic. Rather than using large language models to improve search results by identifying and eliminating SEO spam (something that is very definitely within their capabilities, and for which a vast data set already exists), Bing and Google have opted to short-circuit search entirely, seemingly accepting that search results will always be bad and that the most valuable innovation would be the abandonment of human-legible search entirely. “Search,” their logic seems to say, “is not and never will be for regular humans. Let the robots handle that one.”

We should expect the same behavior from other parts of the economy. The ability to eliminate human labor at a cost of near-zero is far too valuable to pass up for the ruthless optimizers tasked with putting the “profit above all else” logic of capital into practice, and we have already seen fumbling attempts to unwisely eliminate humans from a variety of processes entirely.

That many of these early attempts have visibly and spectacularly failed should not be taken as evidence that capital will back away from further AI adoption; rather, they should be understood as putting forth tendrils and feelers, identifying limits and opportunities, ruthlessly refining and probing. Capital has a habit of walking itself right up to the edge and peering over it; and if a few million humans and a few enterprises tumble over into the abyss, so be it. Perhaps they’ll move the outer limits a few millimeters closer next time—but they will never retreat entirely.

The underlying logic of capitalism dictates that most corporations mostly serve most of their own needs most of the time. Capital stumbles, yes, and spectacularly so; but it also seeks to persist, to adapt, to persevere—if not in its current form, then in some new variation. AI will inevitably be adopted to the exact limit at which it begins to self-destruct, and we shall ever more find ourselves up against that limit, struggling to perceive it.

And it is another hallmark of capitalism that any and all value must be inevitably and irrevocably monetized. Within the next few years, the ability to provide genuine, human, AI-free information will be an essential capability of any knowledge business. If the human noosphere becomes polluted with AI trash, why, we must only build a better filter—and then charge for it, of course. Such “AI-free” information will come at a cost, as well as at the discretion of its gatekeepers. Businesses, multinational organizations, government, and other large enterprises will be more than willing to pay a high premium for “real” data, while the rest of us will be left with whatever scraps they see fit to provide, all watched over by machines of senseless effrontery. Truth for me, but not for thee.

We may soon find knowledge itself divided into defined and well-monetized grades, like so much maple syrup. Businesses will ask themselves, “Exactly how pure and uncut do I need my knowledge to be?” And you, the end user—who does not get to choose, who does not get to demand real data and real human interaction, who has no real say over the implementation and particulars of these vast and immane systems, and who is merely along for the ride—probably won’t like the answer.

The early and absurd stumbles of AI adoption may delight skeptics, but we shouldn’t forget that the mega-businesses of our modern economy are nothing if not persistent. Digital consolidation may have looked ludicrous in the Pets.com era, with its high-profile collapses and facile attempts at “innovation”; but that has not prevented subsequent collapses, subsequent market capture, and subsequent transformation of the world by the very same mechanisms that led to those early embarrassments. Thirty years ago, grocery store self-checkouts seemed like a recipe for shoplifting—and they are. But as it turns out, the cost savings of eliminating humans more than make up for their very obvious shortcomings. Sound familiar?

In the end, the only real question about the adoption of AI “knowledge” is how far and how fast the degradation of human knowledge will go. That LLMs work as well as they do is itself a wonder, albeit an ominous one fraught with ethical, moral, and indeed societal complication. That they stumble and lurch awkwardly through the economy and its accompanying noosphere—and that, to those of us who are of a materialist stripe, they only appear to heighten the very contradictions of capital that they are meant to resolve—is not novel or unique. They are hardly the first transformative technology to do so, and they won’t be the last.

It may be that humans really do require a considerable degree of actual knowledge to function, and for Google, Microsoft, and the rest to make money, and that the limits of knowledge automation lie closer than they seem. Or it may be that abstracted quasi-knowledge is more than sufficient for the status quo to re-emerge in a new and marginally updated form, and that the majority of humans will soon be expected to persist in a de-facto “post-knowledge” world, where true certainty is a luxury afforded only to those who can afford it. Either outcome would be a dramatic and marked transformation from our present status quo, where the free and unfettered proliferation of high-quality data is a prerequisite for every aspect of society, from jobs to research to socialization to simply finding how late the store around the corner is open.

The question, then, is not “Is this the end of knowledge?” Instead, it’s “What sort of knowledge will become the new norm for the average person?” Will most of us remain in the realm of the tangible, the referential, and the ontologically verifiable? Or will we sink forever into a soup of meta-knowledge, accepting that “good enough” is sufficient?

Perhaps appropriately, we simply have no way of knowing.