Hacker News

No, that's not how training works. It's not just about having an example in a given language, but also about how many examples there are and their ratio relative to other languages. English hugely eclipses every other language in most US models' training data, and that's why performance in other languages is subpar compared to performance in English.


There's actually research showing that LLMs are more accurate when questions are in Polish: https://arxiv.org/pdf/2503.01996


My first impulse is to say that some languages have a better SNR on the internet (less autogenerated garbage or SEO content relative to useful information).


I have never noticed any major difference in performance of ChatGPT between English and Spanish. The truth is that as long as the amount of training data of a given language is above some threshold, knowledge transfers between languages.


The issue starts when an LLM transfers knowledge between languages even though that knowledge is not correct for the target language. I have seen this with ChatGPT answers regarding laws, for example, where it refers to US laws when asked in German, which are obviously not relevant.


> The issue starts when an LLM transfers knowledge between languages even though that knowledge is not correct for the target language. I have seen this with ChatGPT answers regarding laws, for example, where it refers to US laws when asked in German, which are obviously not relevant.

There is no necessary correlation between language and the correct set of laws to reference. The language of the question (or of the answer, if for some reason they are not the same) is orthogonal to the intended jurisdiction. There is no reason US laws couldn't be relevant to a question asked in German (and, conversely, no reason US laws couldn't be wrong for a question asked in English, even if it was specifically and distinguishably US English).


When you ask an LLM (in German) without further clarifying your location, I expect it to refer to German (or Austrian/Swiss) laws.

For most questions it does this pretty well (e.g. asking for the legal drinking age). However, once the answer becomes more complex it starts to hallucinate very quickly. The fact that some of the hallucinations are just translated US laws makes me think that the knowledge transfer between languages is probably not helping in instances like this.


Ratio/quantity is important, but quality is even more so.

In recent LLMs, filtered internet text is at the low end of the quality spectrum. The higher end is curated scientific papers, synthetic and rephrased text, RLHF conversations, reasoning CoTs, etc. English/Chinese/Python/JavaScript dominate here.

The issue is that when there's a difference in training data quality between languages, LLMs likely associate that quality difference with the languages themselves unless it is explicitly compensated for.

IMO it would be far more impactful for current model trainers to generate and publish high-quality data in minority languages than to train new models that are simply enriched with a higher percentage of low-quality internet scrapings in those languages.



