
    Avoidable and Unavoidable Randomness in GPT-4o

    By FinanceStarGate | March 3, 2025 | 21 min read


    Of course there is randomness in GPT-4o’s outputs. After all, the model samples from a probability distribution when choosing each token. But what I didn’t understand was that those very probabilities themselves are not deterministic. Even with consistent prompts, fixed seeds, and temperature set to zero, GPT-4o still introduces subtle, frustrating randomness.

    There’s no fix for this, and it might not even be something OpenAI could fix if they wanted to, just so we’re clear up front about where this article is headed. Along the way, we’ll examine all the sources of randomness in GPT-4o output, which will require us to break down the sampling process to a low level. We’ll point at the issue (the probabilities vary) and critically examine OpenAI’s official guidance on determinism.

    First, though, let’s talk about why determinism matters. Determinism means that the same input always produces the same output, like a mathematical function. While LLM creativity is often desirable, determinism serves crucial purposes: researchers need it for reproducible experiments, developers for verifying reported results, and prompt engineers for debugging their changes. Without it, you’re left wondering whether different outputs stem from your tweaks or just the random number generator’s mood swings.

    Flipping a coin

    We’re going to keep things extremely simple here and prompt the most recent version of GPT-4o (gpt-4o-2024-08-06 in the API) with this:

     Flip a coin. Return Heads or Tails solely.

    Flipping a coin with LLMs is a fascinating topic in itself (see for example Van Koevering & Kleinberg, 2024 in the references), but here, we’ll use it as a simple binary question with which to explore determinism, or the lack thereof.

    This is our first attempt.

    import os
    from openai import OpenAI

    # Read the API key from the environment rather than hard-coding it.
    client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

    prompt = 'Flip a coin. Return Heads or Tails only.'

    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        messages=[{'role': 'user', 'content': prompt}],
    )

    # The reply text lives on the first (and only) choice.
    print(response.choices[0].message.content)

    Running the code gave me Heads. Maybe you’ll get Tails, or if you’re really lucky, something far more interesting.

    The code first initializes an OpenAI client with an API key set in the environment variable OPENAI_API_KEY (to avoid sharing billing credentials here). The main action happens with client.chat.completions.create, where we specify the model to use and send the prompt (as part of a very simple conversation named messages) to the server. We get an object called response back from the server. This object contains a lot of information, as shown below, so we need to dig into it to extract GPT-4o’s actual response to the message, which is response.choices[0].message.content.

    >>> response
    ChatCompletion(id='chatcmpl-B48EqZBLfUWtp9H7cwnchGTJbBDwr', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Heads', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1740324680, model='gpt-4o-2024-08-06', object='chat.completion', service_tier='default', system_fingerprint='fp_eb9dce56a8', usage=CompletionUsage(completion_tokens=2, prompt_tokens=18, total_tokens=20, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))

    Now let’s flip the coin ten times. If this were a real, fair coin, of course, we would expect roughly equal heads and tails over time thanks to the law of large numbers. But GPT-4o’s coin doesn’t work quite like that.

    import os
    from openai import OpenAI
    client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

    prompt = 'Flip a coin. Return Heads or Tails only.'

    # Send the identical request ten times and print each answer.
    for _ in range(10):
        response = client.chat.completions.create(
            model='gpt-4o-2024-08-06',
            messages=[{'role': 'user', 'content': prompt}],
        )
        print(response.choices[0].message.content)

    Running this code gave me the following output, although you might get different output, of course.

    Heads
    Heads
    Heads
    Heads
    Heads
    Heads
    Tails
    Heads
    Heads
    Heads

    GPT-4o’s coin is clearly biased, but so are humans. Bar-Hillel, Peer, and Acquisti (2014) found that people flipping imaginary coins choose “heads” 80% of the time. Maybe GPT-4o learned that from us. But whatever the reason, we’re just using this simple example to explore determinism.
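To put a rough number on “clearly biased”, here is a minimal sketch that computes how often a genuinely fair coin would show nine or more heads in ten flips, the result seen above. The helper name prob_at_least is only illustrative, and ten flips is of course a small sample.

```python
from math import comb

def prob_at_least(k_min: int, n: int, p: float = 0.5) -> float:
    """Probability of at least k_min successes in n independent trials."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

# Chance that a fair coin shows 9 or more heads in 10 flips.
fair_tail = prob_at_least(9, 10)
print(f"P(at least 9 heads in 10 fair flips) = {fair_tail:.4f}")
```

This works out to 11/1024, or about 1%, so a fair coin would rarely produce a run this lopsided.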

    Just how biased is GPT-4o’s coin?

    Let’s say we wanted to know precisely what percentage of GPT-4o coin flips land Heads.

    Rather than the obvious (but expensive) approach of flipping it a million times, there’s a smarter way. For classification tasks with a small set of possible answers, we can extract token probabilities instead of generating full responses. With the right prompt, the first token carries all the necessary information, making these API calls incredibly cheap: around 30,000 calls per dollar, since each requires just 18 (cached) input tokens and 1 output token.
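The “around 30,000 calls per dollar” figure can be checked with a little arithmetic. The sketch below assumes per-token prices that are not stated in this article (1.25 USD per million cached input tokens and 10 USD per million output tokens), so treat the constants as placeholders and substitute current pricing.

```python
# ASSUMED prices in USD per million tokens; these values are not from the article.
CACHED_INPUT_PER_M = 1.25
OUTPUT_PER_M = 10.00

def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request under the assumed prices above."""
    return (input_tokens * CACHED_INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000

# 18 cached input tokens and 1 output token, as described in the text.
calls_per_dollar = 1 / cost_per_call(input_tokens=18, output_tokens=1)
print(f"about {calls_per_dollar:,.0f} calls per dollar")
```

Under those assumed prices, the single output token accounts for roughly a third of each call’s cost.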

    OpenAI gives us (natural) log probabilities. These are called logprobs in the code, and we convert them to regular probabilities by exponentiation. (We’ll discuss temperature soon, but note that exponentiating logprobs directly like this corresponds to a temperature setting of 1.0, and is how we calculate probabilities throughout this article.) OpenAI lets us request logprobs for the top 20 most likely tokens, so we do that.

    import os
    import math
    from openai import OpenAI
    from tabulate import tabulate

    client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

    prompt = 'Flip a coin. Return Heads or Tails only.'

    # Ask for a single output token plus the logprobs of the 20 likeliest candidates.
    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,
        messages=[{'role': 'user', 'content': prompt}],
    )

    logprobs_list = response.choices[0].logprobs.content[0].top_logprobs

    data = []
    total_pct = 0.0

    for logprob_entry in logprobs_list:
        token = logprob_entry.token
        logprob = logprob_entry.logprob
        pct = math.exp(logprob) * 100  # Convert logprob to a percentage
        total_pct += pct
        data.append([token, logprob, pct])

    print(
        tabulate(
            data,
            headers=["Token", "Log Probability", "Percentage (%)"],
            tablefmt="github",
            floatfmt=("s", ".10f", ".10f")
        )
    )
    print(f"\nTotal probabilities: {total_pct:.6f}%")

    If you run this, you’ll get something like the following output, but actual numbers will vary.

    | Token     |   Log Probability |   Percentage (%) |
    |-----------|-------------------|------------------|
    | Heads     |     -0.0380541235 |    96.2660836887 |
    | T         |     -3.2880542278 |     3.7326407467 |
    | Sure      |    -12.5380544662 |     0.0003587502 |
    | Head      |    -12.7880544662 |     0.0002793949 |
    | Tail      |    -13.2880544662 |     0.0001694616 |
    | Certainly |    -13.5380544662 |     0.0001319768 |
    | “T        |    -14.2880544662 |     0.0000623414 |
    | I’m       |    -14.5380544662 |     0.0000485516 |
    | heads     |    -14.5380544662 |     0.0000485516 |
    | Heads     |    -14.9130544662 |     0.0000333690 |
    | ”         |    -15.1630544662 |     0.0000259878 |
    | _heads    |    -15.1630544662 |     0.0000259878 |
    | tails     |    -15.5380544662 |     0.0000178611 |
    | HEAD      |    -15.7880544662 |     0.0000139103 |
    | TAIL      |    -16.2880535126 |     0.0000084370 |
    | T         |    -16.7880535126 |     0.0000051173 |
    | ```       |    -16.7880535126 |     0.0000051173 |
    | Here’s    |    -16.9130535126 |     0.0000045160 |
    | I         |    -17.2880535126 |     0.0000031038 |
    | As        |    -17.2880535126 |     0.0000031038 |

    Total probabilities: 99.999970%
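As a quick sanity check on the conversion, this minimal sketch exponentiates the two largest logprobs from the table above. The helper name logprob_to_pct is only illustrative.

```python
import math

def logprob_to_pct(logprob: float) -> float:
    """Convert a natural-log probability into a percentage."""
    return math.exp(logprob) * 100

heads_pct = logprob_to_pct(-0.0380541235)  # "Heads" row in the table
t_pct = logprob_to_pct(-3.2880542278)      # "T" row in the table
print(f"Heads: {heads_pct:.4f}%  T: {t_pct:.4f}%  together: {heads_pct + t_pct:.4f}%")
```

The two leading tokens together account for nearly all of the probability mass, which is why the remaining 18 rows barely register.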

    Looking at these probabilities, we see Heads at ≈96% and T at ≈4%. Our prompt is doing pretty well at constraining the model’s responses. Why T and not Tails? This is the tokenizer splitting Tails into T + ails, while keeping Heads as one piece, as we can see in this Python session:

    >>> import tiktoken
    >>> encoding = tiktoken.encoding_for_model("gpt-4o-2024-08-06")
    >>> encoding.encode('Tails')
    [51, 2196]
    >>> encoding.decode([51])
    'T'
    >>> encoding.encode('Heads')
    [181043]

    These probabilities are not deterministic

    Run the code to display the probabilities for the top 20 tokens again, and you’ll likely get different numbers. Here’s what I got on a second run.

    | Token     |   Log Probability |   Percentage (%) |
    |-----------|-------------------|------------------|
    | Heads     |     -0.0110520627 |    98.9008786933 |
    | T         |     -4.5110521317 |     1.0986894433 |
    | Certainly |    -14.0110521317 |     0.0000822389 |
    | Head      |    -14.2610521317 |     0.0000640477 |
    | Sure      |    -14.2610521317 |     0.0000640477 |
    | Tail      |    -14.3860521317 |     0.0000565219 |
    | heads     |    -15.3860521317 |     0.0000207933 |
    | Heads     |    -15.5110521317 |     0.0000183500 |
    | ```       |    -15.5110521317 |     0.0000183500 |
    | _heads    |    -15.6360521317 |     0.0000161938 |
    | tails     |    -15.6360521317 |     0.0000161938 |
    | I’m       |    -15.8860521317 |     0.0000126117 |
    | “T        |    -15.8860521317 |     0.0000126117 |
    | As        |    -16.3860511780 |     0.0000076494 |
    | ”         |    -16.5110511780 |     0.0000067506 |
    | HEAD      |    -16.6360511780 |     0.0000059574 |
    | TAIL      |    -16.7610511780 |     0.0000052574 |
    | Here’s    |    -16.7610511780 |     0.0000052574 |
    | “         |    -17.1360511780 |     0.0000036133 |
    | T         |    -17.6360511780 |     0.0000021916 |

    Total probabilities: 99.999987%

    In their cookbook, OpenAI offers the following advice on receiving “mostly identical” outputs:

    If the seed, request parameters, and system_fingerprint all match across your requests, then model outputs will mostly be identical. There is a small chance that responses differ even when request parameters and system_fingerprint match, due to the inherent non-determinism of our models.

    They also give “mostly identical” advice in the reproducible outputs section of their documentation.

    The request parameters that could affect randomness are temperature and seed. OpenAI also suggests we track system_fingerprint, because differences here might cause differences in output. We’ll examine each of these below, but spoiler: none of them will fix or even explain this non-determinism.

    Temperature, and why it won’t fix this

    Temperature controls how random the model’s responses are. Low temperatures make the output more predictable, while high temperatures (above 1.5) produce gibberish. Temperature is often called the “creativity parameter”, but this is an oversimplification. In their analysis, Peeperkorn, Kouwenhoven, Brown, and Jordanous (2024) evaluated LLM outputs across four dimensions of creativity: novelty (originality), coherence (logical consistency), cohesion (how well the text flows), and typicality (how well it fits expected patterns). They observed that:

    temperature is weakly correlated with novelty, and unsurprisingly, moderately correlated with incoherence, but there is no relationship with either cohesion or typicality.

    But this is irrelevant for coin flipping. Under the hood, the log probabilities are divided by the temperature, then exponentiated and renormalized to be converted to probabilities. This creates a non-linear effect: temperature=0.5 squares the probabilities (before renormalization), making likely tokens dominate, while temperature=2.0 applies a square root, flattening the distribution.
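The scaling described above can be sketched in a few lines. This is a minimal illustration, not OpenAI’s implementation: apply_temperature is a hypothetical helper, and for simplicity it renormalizes over just the two leading tokens (Heads and T) from the first table rather than the full vocabulary.

```python
import math

def apply_temperature(logprobs, temperature):
    """Divide logprobs by the temperature, exponentiate, and renormalize."""
    scaled = [lp / temperature for lp in logprobs]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Logprobs for "Heads" and "T" taken from the first table (two-token simplification).
logprobs = [-0.0380541235, -3.2880542278]

for temperature in (0.5, 1.0, 2.0):
    heads, t = apply_temperature(logprobs, temperature)
    print(f"temperature={temperature}: Heads={heads:.4f}, T={t:.4f}")
```

Note that the function has no valid result at temperature 0.0 (division by zero), which is why that setting has to be handled as a special case.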

    What about temperature=0.0? Instead of breaking math by dividing by zero, the model simply picks the highest-probability token. Sounds deterministic, right? Not quite. Here’s the catch: temperature only comes into play after the log probabilities are computed, when we convert them to probabilities.

    In summary: if the logprobs are not deterministic, setting temperature to 0.0 won’t make the model deterministic.

    In fact, since we’re just asking the model for the raw logprobs directly rather than generating full responses, the temperature setting doesn’t come into play in our code at all.

    Seeds, and why they won’t fix this

    After temperature is used to compute probabilities, the model samples from these probabilities to pick the next token. OpenAI gives us a little control over the sampling process by letting us set the seed parameter for the random number generator. In an ideal world, setting a seed would give us determinism at any temperature. But seeds only affect sampling, not the log probabilities before sampling.

    In summary: if the logprobs are not deterministic, setting a seed won’t make the model deterministic.

    In fact, seed only matters with non-zero temperatures. With temperature=0.0, the model always chooses the highest-probability token regardless of the seed. Again, since we’re just asking the model for the raw logprobs directly rather than sampling, neither of these settings can help us achieve determinism.
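A toy simulation makes the limits of a seed concrete. This sketch uses Python’s own random module as a stand-in for the server-side sampler, which is an assumption made purely for illustration. sample_tokens is a hypothetical helper, and the two probabilities are Heads values observed elsewhere in this article.

```python
import random

def sample_tokens(p_heads, seed, n=1000):
    """Draw n tokens from a two-outcome distribution using a seeded generator."""
    rng = random.Random(seed)
    return ["Heads" if rng.random() < p_heads else "T" for _ in range(n)]

# Same seed, same probabilities: the sampled sequence repeats exactly.
run_a = sample_tokens(0.962660836887, seed=42)
run_b = sample_tokens(0.962660836887, seed=42)

# Same seed, but the underlying probability has shifted between calls.
run_c = sample_tokens(0.851854886858, seed=42)

print("same seed, same probabilities ->", run_a == run_b)
print("same seed, shifted probabilities ->", run_a == run_c)
```

The seed pins down the random draws, but each draw is compared against whatever probability the model produced on that call, so a shifted probability can still change the sampled tokens.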

    System fingerprints, our last hope

    The system_fingerprint identifies the current combination of model weights, infrastructure, and configuration options in OpenAI’s backend. At least, that’s what OpenAI tells us. Variations in system fingerprints might indeed explain variations in logprobs. Except that they don’t, as we’ll verify below.

    Nothing can get you determinism

    Let’s confirm what we’ve been building towards. We’ll run the same request 10 times with every safeguard in place. Even though neither of these parameters should matter for what we’re doing, you can never be too safe, so we’ll set temperature=0.0 and seed=42. And to see if infrastructure differences explain our varying logprobs, we’ll print system_fingerprint. Here’s the code:

    import os
    import math
    from openai import OpenAI
    from tabulate import tabulate
    from tqdm import tqdm

    client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

    prompt = 'Flip a coin. Return Heads or Tails only.'

    data = []

    for _ in tqdm(range(10), desc='Generating responses'):
        # Identical request every time, with temperature and seed pinned.
        response = client.chat.completions.create(
            model='gpt-4o-2024-08-06',
            temperature=0.0,
            seed=42,
            max_tokens=1,
            logprobs=True,
            top_logprobs=20,
            messages=[{'role': 'user', 'content': prompt}],
        )

        fingerprint = response.system_fingerprint
        logprobs_list = response.choices[0].logprobs.content[0].top_logprobs
        # Pull out the logprob of the first entry whose token is exactly 'Heads'.
        heads_logprob = next(
            entry.logprob for entry in logprobs_list if entry.token == 'Heads'
        )
        pct = math.exp(heads_logprob) * 100
        data.append([fingerprint, heads_logprob, f"{pct:.10f}%"])

    headers = ["Fingerprint", "Logprob", "Probability"]
    print(tabulate(data, headers=headers, tablefmt="pipe"))

    Running this 10 times, here are the logprobs and probabilities for the token Heads:

    | Fingerprint   |    Logprob | Probability    |
    |---------------|------------|----------------|
    | fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
    | fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
    | fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
    | fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
    | fp_f9f4fb6dbf | -0.160339  | 85.1854886858% |
    | fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
    | fp_f9f4fb6dbf | -0.0110521 | 98.9008786933% |
    | fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
    | fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
    | fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
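One way to read the table above is to group the observed logprobs by fingerprint and see whether a single fingerprint ever yields more than one value. The sketch below does this with the ten rows copied from the table. The variable names are only illustrative.

```python
from collections import defaultdict

# (system_fingerprint, Heads logprob) pairs copied from the ten rows above.
observations = [
    ("fp_f9f4fb6dbf", -0.0380541),
    ("fp_f9f4fb6dbf", -0.0380541),
    ("fp_f9f4fb6dbf", -0.0380541),
    ("fp_f9f4fb6dbf", -0.0380541),
    ("fp_f9f4fb6dbf", -0.160339),
    ("fp_f9f4fb6dbf", -0.0380541),
    ("fp_f9f4fb6dbf", -0.0110521),
    ("fp_f9f4fb6dbf", -0.0380541),
    ("fp_f9f4fb6dbf", -0.0380541),
    ("fp_f9f4fb6dbf", -0.0380541),
]

# Collect the distinct logprobs seen under each fingerprint.
logprobs_by_fingerprint = defaultdict(set)
for fingerprint, logprob in observations:
    logprobs_by_fingerprint[fingerprint].add(logprob)

for fingerprint, values in logprobs_by_fingerprint.items():
    print(f"{fingerprint}: {len(values)} distinct logprobs -> {sorted(values)}")
```

A single fingerprint maps to three different logprobs here, so in this sample the fingerprint cannot be what distinguishes the results.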

    Mixture-of-experts makes determinism impossible

    OpenAI is decidedly not open about the architecture behind GPT-4o. However, it’s widely believed that GPT-4o uses a mixture-of-experts (MoE) architecture with either 8 or 16 experts.

    According to a paper by Google DeepMind researchers Puigcerver, Riquelme, Mustafa, and Houlsby (hat tip to user elmstedt on the OpenAI forum), mixture-of-experts architectures may add an unavoidable level of non-determinism:

    Under capacity constraints, all Sparse MoE approaches route tokens in groups of a fixed size and enforce (or encourage) balance within the group. When groups contain tokens from different sequences or inputs, these tokens compete for available spots in expert buffers. Therefore, the model is no longer deterministic at the sequence-level, but only at the batch-level.

    In other words, when your prompt (a sequence of tokens, in the quote above) reaches OpenAI’s servers, it gets batched with a group of other prompts (OpenAI isn’t open about how many other prompts). Each prompt in the batch is then routed to an “expert” within the model. However, since only so many prompts can be routed to the same expert, the expert your prompt gets routed to will depend on all the other prompts in the batch.

    This “competition” for experts introduces a real-world randomness completely beyond our control.
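The competition for expert slots can be illustrated with a deliberately tiny toy. This is not how OpenAI or any production MoE router works. The greedy first-come routing, the route_batch helper, the preference lists, and the capacity of 1 are all assumptions chosen only to show how the same input can land on a different expert depending on its batch-mates.

```python
def route_batch(preferences, capacity):
    """Greedy capacity-limited routing (toy model).

    Each item lists expert ids from most to least preferred and takes the
    first expert that still has a free slot.
    """
    load = {}
    assignment = []
    for ranked_experts in preferences:
        for expert in ranked_experts:
            if load.get(expert, 0) < capacity:
                load[expert] = load.get(expert, 0) + 1
                assignment.append(expert)
                break
        else:
            assignment.append(None)  # every listed expert was already full
    return assignment

our_prompt = [0, 1]  # prefers expert 0, falls back to expert 1

# Batch 1: the other prompt prefers expert 1, so expert 0 stays free for us.
quiet = route_batch([[1, 0], our_prompt], capacity=1)

# Batch 2: the other prompt takes expert 0 first, pushing us onto expert 1.
busy = route_batch([[0, 1], our_prompt], capacity=1)

print("our expert in batch 1:", quiet[-1])
print("our expert in batch 2:", busy[-1])
```

Our prompt is byte-for-byte identical in both batches, yet it is processed by a different expert, which is the sequence-level non-determinism the quoted paper describes.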

    Non-determinism beyond mixture-of-experts

    While non-determinism may be inherent to real-world mixture-of-experts models, that does not seem to be the only source of non-determinism in OpenAI’s models.

    Making a few changes to our code above (switching to gpt-3.5-turbo-0125, looking for the token He since GPT-3.5’s tokenizer splits “Heads” differently, and ignoring system_fingerprint because this model doesn’t have it) reveals that GPT-3.5-turbo also exhibits non-deterministic logprobs:

    |     Logprob | Probability    |
    |-------------|----------------|
    | -0.00278289 | 99.7220983436% |
    | -0.00415331 | 99.5855302068% |
    | -0.00258838 | 99.7414961980% |
    | -0.00204034 | 99.7961735289% |
    | -0.00240277 | 99.7600117933% |
    | -0.00204034 | 99.7961735289% |
    | -0.00204034 | 99.7961735289% |
    | -0.00258838 | 99.7414961980% |
    | -0.00351419 | 99.6491976144% |
    | -0.00201214 | 99.7989878007% |

    No one is claiming that GPT-3.5-turbo uses a mixture-of-experts architecture. Thus, there must be additional factors beyond mixture-of-experts contributing to this non-determinism.

    What 10,000 GPT-4o coin flip probabilities tell us

    To better understand the patterns and magnitude of this non-determinism, I conducted a more extensive experiment with GPT-4o, performing 10,000 “coin flips” while recording the probability assigned to “Heads” in each case.

    The results reveal something fascinating. Across 10,000 API calls with identical parameters, GPT-4o produced not just a few different probability values, but 42 distinct probabilities. If the mixture-of-experts hypothesis were the complete explanation for non-determinism in GPT-4o, we might expect to see one distinct probability for each expert. But GPT-4o is believed to have either 8 or 16 experts, not 42.

    In the output below, I clustered these probabilities, ensuring that each cluster was separated from the others by 0.01 (as a raw percentage). This groups the output into 12 clusters.

    Probability          Count           Fingerprints
    ------------------------------------------------------------------
    85.1854379113%       5               fp_eb9dce56a8, fp_f9f4fb6dbf
    85.1854455275%       74              fp_eb9dce56a8, fp_f9f4fb6dbf
    85.1854886858%       180             fp_eb9dce56a8, fp_f9f4fb6dbf
    ------------------------------------------------------------------
    88.0662448207%       31              fp_eb9dce56a8, fp_f9f4fb6dbf
    88.0678628883%       2               fp_f9f4fb6dbf
    ------------------------------------------------------------------
    92.3997629747%       1               fp_eb9dce56a8
    92.3997733012%       4               fp_eb9dce56a8
    92.3997836277%       3               fp_eb9dce56a8
    ------------------------------------------------------------------
    92.4128943690%       1               fp_f9f4fb6dbf
    92.4129143363%       21              fp_eb9dce56a8, fp_f9f4fb6dbf
    92.4129246643%       8               fp_eb9dce56a8, fp_f9f4fb6dbf
    ------------------------------------------------------------------
    93.9906837191%       4               fp_eb9dce56a8
    ------------------------------------------------------------------
    95.2569999350%       36              fp_eb9dce56a8
    ------------------------------------------------------------------
    96.2660836887%       3391            fp_eb9dce56a8, fp_f9f4fb6dbf
    96.2661285161%       2636            fp_eb9dce56a8, fp_f9f4fb6dbf
    ------------------------------------------------------------------
    97.0674551052%       1               fp_eb9dce56a8
    97.0674778863%       3               fp_eb9dce56a8
    97.0675003058%       4               fp_eb9dce56a8
    97.0675116963%       1               fp_eb9dce56a8
    97.0680739932%       19              fp_eb9dce56a8, fp_f9f4fb6dbf
    97.0681293191%       6               fp_eb9dce56a8, fp_f9f4fb6dbf
    97.0681521003%       74              fp_eb9dce56a8, fp_f9f4fb6dbf
    97.0682421405%       4               fp_eb9dce56a8
    ------------------------------------------------------------------
    97.7008960695%       1               fp_f9f4fb6dbf
    97.7011122645%       3               fp_eb9dce56a8
    97.7011462953%       3               fp_eb9dce56a8
    97.7018178132%       1               fp_eb9dce56a8
    ------------------------------------------------------------------
    98.2006069902%       426             fp_eb9dce56a8, fp_f9f4fb6dbf
    98.2006876548%       6               fp_f9f4fb6dbf
    98.2007107019%       1               fp_eb9dce56a8
    98.2009525133%       5               fp_eb9dce56a8
    98.2009751945%       1               fp_eb9dce56a8
    98.2009867181%       1               fp_eb9dce56a8
    ------------------------------------------------------------------
    98.5930987656%       3               fp_eb9dce56a8, fp_f9f4fb6dbf
    98.5931104270%       235             fp_eb9dce56a8, fp_f9f4fb6dbf
    98.5931222721%       4               fp_eb9dce56a8, fp_f9f4fb6dbf
    98.5931340253%       9               fp_eb9dce56a8
    98.5931571644%       159             fp_eb9dce56a8, fp_f9f4fb6dbf
    98.5931805790%       384             fp_eb9dce56a8
    ------------------------------------------------------------------
    98.9008436920%       95              fp_eb9dce56a8, fp_f9f4fb6dbf
    98.9008550214%       362             fp_eb9dce56a8, fp_f9f4fb6dbf
    98.9008786933%       1792            fp_eb9dce56a8, fp_f9f4fb6dbf

    (With a threshold of 0.001 there are 13 clusters, and with a threshold of 0.0001 there are 17 clusters.)
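The clustering procedure is not spelled out in detail, so the following is a sketch under an assumption: sort the distinct probabilities and start a new cluster whenever the gap to the previous value exceeds the threshold. The count_clusters helper is hypothetical, and the values are copied from the table above.

```python
def count_clusters(values, threshold):
    """Count groups of sorted values whose neighbouring gaps never exceed threshold."""
    ordered = sorted(values)
    clusters = 1
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > threshold:
            clusters += 1  # gap too large: a new cluster starts here
    return clusters

# The distinct Heads probabilities (in percent) from the table above.
probabilities = [
    85.1854379113, 85.1854455275, 85.1854886858,
    88.0662448207, 88.0678628883,
    92.3997629747, 92.3997733012, 92.3997836277,
    92.4128943690, 92.4129143363, 92.4129246643,
    93.9906837191,
    95.2569999350,
    96.2660836887, 96.2661285161,
    97.0674551052, 97.0674778863, 97.0675003058, 97.0675116963,
    97.0680739932, 97.0681293191, 97.0681521003, 97.0682421405,
    97.7008960695, 97.7011122645, 97.7011462953, 97.7018178132,
    98.2006069902, 98.2006876548, 98.2007107019,
    98.2009525133, 98.2009751945, 98.2009867181,
    98.5930987656, 98.5931104270, 98.5931222721,
    98.5931340253, 98.5931571644, 98.5931805790,
    98.9008436920, 98.9008550214, 98.9008786933,
]

print("distinct probabilities:", len(set(probabilities)))
for threshold in (0.01, 0.001, 0.0001):
    print(f"threshold {threshold}: {count_clusters(probabilities, threshold)} clusters")
```

On these values the gap rule gives 12, 13, and 17 clusters for the three thresholds, matching the counts quoted above, although the original analysis may have used a different method.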

    As the table above demonstrates, this multitude of results cannot be explained by system_fingerprint values. Across all 10,000 calls, I received only two different system fingerprints: 4488 results with fp_f9f4fb6dbf and 5512 with fp_eb9dce56a8, and for the most part the two system fingerprints returned the same sets of probabilities, rather than each fingerprint producing its own distinct set of probabilities.

    It could be that these 12 clusters of probabilities represent 12 different experts. Even assuming that, the variations within the clusters remain puzzling. These don’t seem likely to be simple rounding errors, because they are too systematic and consistent. Take the giant cluster at around 96.266% with two distinct probabilities representing over half of our coin flips. The difference between these two probabilities, 0.0000448274%, is tiny but persistent.

    Conclusion: Non-determinism is baked in

    There is an underlying randomness in the log probabilities returned by all currently available non-thinking OpenAI models: GPT-4o, GPT-4o-mini, and the two flavors of GPT-3.5-turbo. Because this non-determinism is baked into the log probabilities, there’s no way for a user to get around it. Temperature and seed values have no effect, and system fingerprints don’t explain it.

    While mixture-of-experts architectures inherently introduce some randomness in the competition for experts, the non-determinism in GPT-4o seems to go far beyond this, and the non-determinism in GPT-3.5-turbo can’t be explained by this at all, because GPT-3.5-turbo isn’t a mixture-of-experts model.

    While we can’t verify this claim any more because the model isn’t being served, this behaviour wasn’t seen with GPT-3, according to user _j on the OpenAI forum:

    It’s a symptom that was not seen on prior GPT-3 AI models where across hundreds of trials to investigate sampling, you never had to doubt that logprobs would be the same. Even if you found a top-2 answer that returned exactly the same logprob value via the API, you would never see them change position or return different values.

    This suggests that whatever is causing this randomness first emerged in either GPT-3.5 or GPT-3.5-turbo.

    But regardless of when it emerged, this non-determinism is a serious obstacle to understanding these models. If you want to study a model (how it generalizes, how it biases responses, how it assigns probabilities to different tokens) you need consistency. But as we’ve seen, even when we lock down every knob OpenAI lets us touch, we still can’t get an answer to the simplest possible question: “what is the probability that GPT-4o says a coin lands heads?”

    Worse, while mixture-of-experts explains some of this non-determinism, there are clearly other, hidden sources of randomness that we can’t see, control, or understand. In an ideal world, the API would provide more transparency by telling us which expert processed our request or by offering additional parameters to control this routing process. Without such visibility, we’re left guessing at the true nature of the variability.

    References

    Bar-Hillel, M., Peer, E., & Acquisti, A. (2014). “Heads or tails?” – A reachability bias in binary choice. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(6), 1656–1663. https://doi.org/10.1037/xlm0000005.

    Peeperkorn, M., Kouwenhoven, T., Brown, D., & Jordanous, A. (2024). Is temperature the creativity parameter of Large Language Models?. In The 15th International Conference on Computational Creativity (ICCC’24). arXiv:2405.00492.

    Puigcerver, J., Riquelme, C., Mustafa, B., & Houlsby, N. (2024). From sparse to soft mixtures of experts. In The Twelfth International Conference on Learning Representations (ICLR 2024). https://openreview.net/forum?id=jxpsAj7ltE. arXiv:2308.00951.

    Van Koevering, K., & Kleinberg, J. (2024). How random is random? Evaluating the Randomness and humanness of LLMs’ coin flips. arXiv:2406.00092.


