"You're the most beautiful girl in the world" and Wittgensteinian Language Games

12d

Wittgenstein argues that we shouldn't understand language by piecing together the dictionary meaning of each individual word in a sentence, but rather that language should be understood in context as a move in a language game.

Consider the phrase, "You're the most beautiful girl in the world". Many rationalists might shy away from such a statement, deeming it statistically improbable. However, while this strict adherence to truth is commendable, I honestly feel it is misguided.

It's honestly kind of absurd to expect your words to be taken literally in these kinds of circumstances. The recipient of such a compliment will almost certainly understand it as hyperbole intended to express fondness and desire, rather than as a literal factual assertion. Further, by invoking a phrase that plays a certain role...

(See More – 41 more words)

Odd anon19m10

This isn't that complicated. The halo effect is real and can go to extremes when romantic relationships are involved, and most people take their sense data at face value most of the time. The sentence is meant completely literally.

The formal goal is a pointer

Pi Rogers

When I introduce people to plans like QACI, they often have objections like "How is an AI going to do all of the simulating necessary to calculate this?" or "If our technology is good enough to calculate this with any level of precision, we can probably just upload some humans." or just "That's not computable."

I think these kinds of objections are missing the point of formal goal alignment and maybe even outer alignment in general.

To formally align an ASI to human (or your) values, we do not need to actually know those values. We only need to strongly point to them.

AI will figure out our values. Whether it's aligned or not, a recursively self-improving AI will eventually get a very good model of our values, as part...

(See More – 155 more words)

1Pi Rogers1h

I'm 60% confident that SBF and Mao Zedong (and just about everyone) would converge to nearly the same values (which we call "human values") if they were rational enough and had good enough decision theory. If I'm wrong, (1) is a huge problem and the only surefire way to solve it is to actually be the human whose values get extrapolated. Luckily the de-facto nominees for this position are alignment researchers, who pretty strongly self-select for having cosmopolitan altruistic values. I think (2) is a very human problem. Due to very weird selection pressure, humans ended up really smart but also really irrational. I think most human evil is caused by a combination of overconfidence wrt our own values and lack of knowledge of things like the unilateralist's curse. An AGI (at least, one that comes from something like RL rather than being conjured in a simulation or something else weird) will probably end up with a way higher rationality:intelligence ratio, and so it will be much less likely to destroy everything we value than an empowered human. (Also 60% confident. I would not want to stake the fate of the universe on this claim) I agree that moral uncertainty is a very hard problem, but I don't think we humans can do any better on it than an ASI. As long as we give it the right pointer, I think it will handle the rest much better than any human could. Decision theory is a bit different, since you have to put that into the utility function. Dealing with moral uncertainty is just part of expected utility maximization. To solve (2), I think we should try to adapt something like the Hippocratic principle to work for QACI, without requiring direct reference to a human's values and beliefs (the sidestepping of which is QACI's big advantage over PreDCA). I wonder if Tammy has thought about this.

Wei Dai23m20

Luckily the de-facto nominees for this position are alignment researchers, who pretty strongly self-select for having cosmopolitan altruistic values.

But we could have said the same thing of SBF, before the disaster happened.

Due to very weird selection pressure, humans ended up really smart but also really irrational. [...] An AGI (at least, one that comes from something like RL rather than being conjured in a simulation or something else weird) will probably end up with a way higher rationality:intelligence ratio, and so it will be much less likely to

... (read more)

1quetzal_rainbow8h

I think the endorsed answer is "QACI as self-contained field of research is seeking which goal is safe, not how to get AI pursue this goal in robust way". Also, if you can create AI which makes correct guesses about galaxy-brained universe simulations, you can also create AI which makes correct guesses about nanotech design, which is kinda exfohazardous.

Ironing Out the Squiggles

136

Zack_M_Davis

Adversarial Examples: A Problem

The apparent successes of the deep learning revolution conceal a dark underbelly. It may seem that we now know how to get computers to (say) check whether a photo is of a bird, but this façade of seemingly good performance is belied by the existence of adversarial examples—specially prepared data that looks ordinary to humans, but is seen radically differently by machine learning models.

The differentiable nature of neural networks, which make them possible to be trained at all, are also responsible for their downfall at the hands of an adversary. Deep learning models are fit using stochastic gradient descent (SGD) to approximate the function between expected inputs and outputs. Given an input, an expected output, and a loss function (which measures "how bad" it...

(Continue Reading – 3116 more words)

2Wei Dai6h

Do you know if it is happening naturally from increased scale, or only correlated with scale (people are intentionally trying to correct the "misalignment" between ML and humans of shape vs texture bias by changing aspects of the ML system like its training and architecture, and simultaneously increasing scale)? I somewhat suspect the latter due the existence of a benchmark that the paper seems to target ("humans are at 96% shape / 4% texture bias and ViT-22B-384 achieves a previously unseen 87% shape bias / 13% texture bias"). In either case, it seems kind of bad that it has taken a decade or two to get to this point from when adversarial examples were first noticed, and it's unclear whether other adversarial examples or "misalignment" remain in the vision transformer. If the first transformative AIs don't quite learn the right values due to having a different inductive bias from humans, it may not matter much that 10 years later the problem would be solved.

1gallabytes3h

adversarial examples definitely still exist but they'll look less weird to you because of the shape bias. anyway this is a random visual model, raw perception without any kind of reflective error correction loop, I'm not sure what you expect it to do differently, or what conclusion you're trying to draw from how it does behave? the inductive bias doesn't precisely match human vision, so it has different mistakes, but as you scale both architectures they become more similar. that's exactly what you'd expect for any approximately Bayesian setup. the shape bias increasing with scale was definitely conjectured long before it was tested. ML scaling is very recent though,and this experiment was quite expensive. Remember when GPT-2 came out and everyone thought that was a big model? This is an image classifier which is over 10x larger than that. They needed a giant image classification dataset which I don't think even existed 5 years ago.

Wei Dai39m20

the inductive bias doesn’t precisely match human vision, so it has different mistakes, but as you scale both architectures they become more similar. that’s exactly what you’d expect for any approximately Bayesian setup.

I can certainly understand that as you scale both architectures, they both make less mistakes on distribution. But do they also generalize out of training distribution more similarly? If so, why? Can you explain this more? (I'm not getting your point from just "approximately Bayesian setup".)

They needed a giant image classification data

... (read more)

3Carl Feynman9h

An interesting question! I looked in “Towards Deep Learning Models Resistant to Adversarial Attacks” to see what they had to say on the question. If I’m interpreting their Figure 6 correctly, there’s a negligible increase in error rate as epsilon increases, and then at some point the error rate starts swooping up toward 100%. The transition seems to be about where the perturbed images start to be able to fool humans. (Or perhaps slightly before.). So you can’t really blame the model for being fooled, in that case. If I had to pick an epsilon to train with, I would pick one just below the transition point, where robustness is maximized without getting into the crazy zone. All this is the result of a cursory inspection of a couple of papers. There’s about a 30% chance I’ve misunderstood.

Transformers Represent Belief State Geometry in their Residual Stream

332

Adam Shai

Ω 13015d

Produced while being an affiliate at PIBBSS^[1]. The work was done initially with funding from a Lightspeed Grant, and then continued while at PIBBSS. Work done in collaboration with @Paul Riechers, @Lucas Teixeira, @Alexander Gietelink Oldenziel, and Sarah Marzen. Paul was a MATS scholar during some portion of this work. Thanks to Paul, Lucas, Alexander, Sarah, and @Guillaume Corlouer for suggestions on this writeup.

Introduction

What computational structure are we building into LLMs when we train them on next-token prediction? In this post we present evidence that this structure is given by the meta-dynamics of belief updating over hidden states of the data-generating process. We'll explain exactly what this means in the post. We are excited by these results because

We have a formalism that relates training data to internal

...

(Continue Reading – 3335 more words)

snewman6h20

I am trying to wrap my head around the high-level implications of this statement. I can come up with two interpretations:

What LLMs are doing is similar to what people do as they go about their day. When I walk down the street, I am simultaneously using visual and other input to assess the state of the world around me ("that looks like a car"), running a world model based on that assessment ("the car is coming this way"), and then using some other internal mechanism to decide what to do ("I'd better move to the sidewalk").
What LLMs are doing is harder than

... (read more)

9habryka10h

Promoted to curated: Formalizing what it means for transformers to learn "the underlying world model" when engaging in next-token prediction tasks seems pretty useful, in that it's an abstraction that I see used all the time when discussing risks from models where the vast majority of the compute was spent in pre-training, where the details usually get handwaived. It seems useful to understand what exactly we mean by that in more detail. I have not done a thorough review of this kind of work, but it seems to me that also others thought the basic ideas in the work hold up, and I thought reading this post gave me crisper abstractions to talk about this kind of stuff in the future.

Tamsin Leake's Shortform

Tamsin Leake

Ω 31y

Pi Rogers1h10

What about the following:

My utility function is pretty much just my own happiness (in a fun-theoretic rather than purely hedonistic sense). However, my decision theory is updateless with respect to which sentient being I ended up as, so once you factor that in, I'm a multiverse-wide realityfluid-weighted average utilitarian.

I'm not sure how correct this is, but it's possible.

Why I'm doing PauseAI

Joseph Miller

GPT-5 training is probably starting around now. It seems very unlikely that GPT-5 will cause the end of the world. But it’s hard to be sure. I would guess that GPT-5 is more likely to kill me than an asteroid, a supervolcano, a plane crash or a brain tumor. We can predict fairly well what the cross-entropy loss will be, but pretty much nothing else.

Maybe we will suddenly discover that the difference between GPT-4 and superhuman level is actually quite small. Maybe GPT-5 will be extremely good at interpretability, such that it can recursively self improve by rewriting its own weights.

Hopefully model evaluations can catch catastrophic risks before wide deployment, but again, it’s hard to be sure. GPT-5 could plausibly be devious enough to circumvent all of...

(See More – 955 more words)

1Prometheus1h

My birds are singing the same tune.

1Odd anon5h

Sam Altman confirmed (paywalled, sorry) in November that GPT-5 was already under development. (Interestingly, the confirmation was almost exactly six months after Altman told a senate hearing (under oath) that "We are not currently training what will be GPT-5; we don't have plans to do it in the next 6 months.")

Prometheus1h10

It probably began training in January and finished around early April. And they're now doing evals.

1MrCheeze3h

"Under development" and "currently training" I interpret as having significantly different meanings.

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)

David Gross's Shortform

David Gross

David Gross1h20

And then today I read this: “We yearn for the transcendent, for God, for something divine and good and pure, but in picturing the transcendent we transform it into idols which we then realize to be contingent particulars, just things among others here below. If we destroy these idols in order to reach something untainted and pure, what we really need, the thing itself, we render the Divine ineffable, and as such in peril of being judged non-existent. Then the sense of the Divine vanishes in the attempt to preserve it.” (Iris Murdoch, Metaphysics as a Guide to Morals)

We are headed into an extreme compute overhang

devrandom

If we achieve AGI-level performance using an LLM-like approach, the training hardware will be capable of running ~1,000,000s concurrent instances of the model.

Definitions

Although there is some debate about the definition of compute overhang, I believe that the AI Impacts definition matches the original use, and I prefer it: "enough computing hardware to run many powerful AI systems already exists by the time the software to run such systems is developed". A large compute overhang leads to additional risk due to faster takeoff.

I use the types of superintelligence defined in Bostrom's Superintelligence book (summary here).

I use the definition of AGI in this Metaculus question. The adversarial Turing test portion of the definition is not very relevant to this post.

Thesis

Due to practical reasons, the compute requirements for training LLMs...

(See More – 408 more words)

8gwern6h

You know 'finetunes are composable', because a finetune is just a gradient descent step on a batch of data and a parameter update, and if you train on more than one GPU and share updates, DL training still works {{citation needed}}. If you can train asynchronously on a thousand, or 20,000, or 100,000 GPUs, that is what you are doing; this is especially true in DRL, where you might be, say, training across 170,000 CPU-cores. This works because you don't insist on everything being up to date every moment and you accept that there will be degrees of inconsistency/outdatedness. (You are certainly not accumulating the gradient across the entire cluster by waiting for every single node, pausing everything, calculating a single global step, and pushing it out, and only then resuming, as if it were a single GPU! Really, you don't even want to do that on a single GPU for DRL if you gotta go fast.) This works so well that people will casually talk about training "an" AlphaZero, even though they actually mean something more like "the 512 separate instances of AlphaZero we are composing finetunes of" (or more).* You do have issues with stale gradients and off-policyness of updates and how to best optimize throughput of all of the actors vs training nodes and push out model updates efficiently so nodes stop executing outdated parameters as quickly as possible, and DeepMind & OpenAI etc have done a lot of work on that - but at that point, as in the joke, you have conceded that finetunes are composable and you can keep a very large number of replicas in sync, and it is merely a matter of haggling over how much efficiency you lose. Also note that it takes a lot less compute to keep a model up to date doing simple online learning on new data than it does to train it from scratch on all historical data summed together (obviously), so what devrandom is talking about is actually a lot easier than creating the model in the first place. A better model to imagine is not "somehow finet

2faul_sname5h

I think we may be using words differently. By "task" I mean something more like "predict the next token in a nucleotide sequence" and less like "predict the next token in this one batch of training data that is drawn from the same distribution as all the other batches of training data that the parallel instances are currently training on". It's not an argument that you can't train a little bit on a whole bunch of different data sources, it's an argument that running 1.2M identical instances of the same model is leaving a lot of predictive power on the table as compared by having those models specialize. For example, 70B model trained on next-token prediction only on the entire 20TB GenBank dataset will have better performance at next-nucleotide prediction than a 70B model that has been trained both on the 20TB GenBank dataset and on all 14TB of code on Github. Once you have a bunch of specialized models "the weights are identical" and "a fine tune can be applied to all members" no longer holds.

6gwern3h

I don't believe that's obvious, and to the extent that it's true, I think it's largely irrelevant (and part of the general prejudice against scaling & Bitter Lesson thinking, where everyone is desperate to find an excuse for small specialist models with complicated structures & fancy inductive biases because that feels right). Nor do I see how this is relevant to your original claim. If you have lots of task-specialist models, how does this refute the claim that those will be able to coordinate? Of course they will. They will just share weight updates in exactly the way I just outlined, which works so well in practice. You may not be able to share parameter-updates across your protein-only and your Python-only LLMs, but they will be able to share updates within that model family and the original claim ("AGIs derived from the same model are likely to collaborate more effectively than humans because their weights are identical. Any fine-tune can be applied to all members, and text produced by one can be understood by all members.") remains true, no matter how you swap out your definition of 'model'. DL models are fantastically good at collaborating and updating each other, in many ways completely impossible for humans, whether you are talking about AGI models or narrow specialist models.

faul_sname2h20

I don't believe that's obvious, and to the extent that it's true, I think it's largely irrelevant (and part of the general prejudice against scaling & Bitter Lesson thinking, where everyone is desperate to find an excuse for small specialist models with complicated structures & fancy inductive biases because that feels right).

Man, that Li et al paper has pretty wild implications if it generalizes. I'm not sure how to square those results with the Chinchilla paper though (I'm assuming it wasn't something dumb like "wall-clock time was better with... (read more)

tlevin's Shortform

tlevin

9Akash13h

Agree with lots of this– a few misc thoughts [hastily written]: 1. I think the Overton Window frame ends up getting people to focus too much on the dimension "how radical is my ask"– in practice, things are usually much more complicated than this. In my opinion, a preferable frame is something like "who is my target audience and what might they find helpful." If you're talking to someone who makes it clear that they will not support X, it's silly to keep on talking about X. But I think the "target audience first" approach ends up helping people reason in a more sophisticated way about what kinds of ideas are worth bringing up. As an example, in my experience so far, many policymakers are curious to learn more about intelligence explosion scenarios and misalignment scenarios (the more "radical" and "speculative" threat models). 2. I don't think it's clear that the more effective actors in DC tend to be those who look for small wins. Outside of the AIS community, there sure do seem to be a lot of successful organizations that take hard-line positions and (presumably) get a lot of their power/influence from the ideological purity that they possess & communicate. Whether or not these organizations end up having more or less influence than the more "centrist" groups is, in my view, not a settled question & probably varies a lot by domain. In AI safety in particular, I think my main claim is something like "pretty much no group– whether radical or centrist– has had tangible wins. When I look at the small set of tangible wins, it seems like the groups involved were across the spectrum of "reasonableness." 3. The more I interact with policymakers, the more I'm updating toward something like "poisoning the well doesn't come from having radical beliefs– poisoning the well comes from lamer things like being dumb or uninformed, wasting peoples' time, not understanding how the political process works, not having tangible things you want someone to do, explaining ideas poorl

tlevin2h3-2

Quick reactions:

Re: how over-emphasis on "how radical is my ask" vs "what my target audience might find helpful" and generally the importance of making your case well regardless of how radical it is, that makes sense. Though notably the more radical your proposal is (or more unfamiliar your threat models are), the higher the bar for explaining it well, so these do seem related.
Re: more effective actors looking for small wins, I agree that it's not clear, but yeah, seems like we are likely to get into some reference class tennis here. "A lot of successful o

... (read more)

4trevor13h

Recently, John Wentworth wrote: And I think this makes sense (e.g. Simler's Social Status: Down the Rabbit Hole which you've probably read), if you define "AI Safety" as "people who think that superintelligence is serious business or will be some day". The psych dynamic that I find helpful to point out here is Yud's Is That Your True Rejection post from ~16 years ago. A person who hears about superintelligence for the first time will often respond to their double-take at the concept by spamming random justifications for why that's not a problem (which, notably, feels like legitimate reasoning to that person, even though it's not). An AI-safety-minded person becomes wary of being effectively attacked by high-status people immediately turning into what is basically a weaponized justification machine, and develops a deep drive wanting that not to happen. Then justifications ensue for wanting that to happen less frequently in the world, because deep down humans really don't want their social status to be put at risk (via denunciation) on a regular basis like that. These sorts of deep drives are pretty opaque to us humans but their real world consequences are very strong. Something that seems more helpful than playing whack-a-mole whenever this issue comes up is having more people in AI policy putting more time into improving perspective. I don't see shorter paths to increasing the number of people-prepared-to-handle-unexpected-complexity than giving people a broader and more general thinking capacity for thoughtfully reacting to the sorts of complex curveballs that you get in the real world. Rationalist fiction like HPMOR is great for this, as well as others e.g. Three Worlds Collide, Unsong, Worth the Candle, Worm (list of top rated ones here). With the caveat, of course, that doing well in the real world is less like the bite-sized easy-to-understand events in ratfic, and more like spotting errors in the methodology section of a study or making money playing poker.

LESSWRONG
LW

Quick Takes

Popular Comments

Recent Discussion

Adversarial Examples: A Problem

Introduction

Definitions

Thesis

LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA