<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://research.moraleconomy.au/index.php?action=history&amp;feed=atom&amp;title=Research%3AThe_LLM_Olympics</id>
	<title>Research:The LLM Olympics - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://research.moraleconomy.au/index.php?action=history&amp;feed=atom&amp;title=Research%3AThe_LLM_Olympics"/>
	<link rel="alternate" type="text/html" href="https://research.moraleconomy.au/index.php?title=Research:The_LLM_Olympics&amp;action=history"/>
	<updated>2026-04-24T17:49:11Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.43.0</generator>
	<entry>
		<id>https://research.moraleconomy.au/index.php?title=Research:The_LLM_Olympics&amp;diff=25273&amp;oldid=prev</id>
		<title>Reversedragon: the test will not stop</title>
		<link rel="alternate" type="text/html" href="https://research.moraleconomy.au/index.php?title=Research:The_LLM_Olympics&amp;diff=25273&amp;oldid=prev"/>
		<updated>2026-02-16T05:07:16Z</updated>

		<summary type="html">&lt;p&gt;the test will not stop&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 05:07, 16 February 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This is a collection of tests designed to solidly demonstrate the reasons LLMs shouldn&#039;t be used to answer questions, do formal logic, write papers, script videos, or even make alchemical combination games. Many tech demos focus on clumsily attempting to demonstrate that LLMs {{em|can}} do things, and many opinion pieces go on and on about the purely philosophical reasons they &quot;definitely&quot; couldn&#039;t capture the unique human spirit the author purports to exist, but it isn&#039;t as common to put tasks in front of LLMs that {{em|should be reasonable}} and keep guiding them onward and onward in good faith toward achieving those tasks until they absolutely break. (With the exception of &quot;jailbreaking&quot; research, of course.)&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;This is a collection of tests designed to solidly demonstrate the &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;actual &lt;/ins&gt;reasons LLMs shouldn&#039;t be used to answer questions, do formal logic, write papers, script videos, or even make alchemical combination games. 
Many tech demos focus on clumsily attempting to demonstrate that LLMs {{em|can}} do things, and many opinion pieces go on and on about the purely philosophical reasons they &quot;definitely&quot; couldn&#039;t capture the unique human spirit the author purports to exist, but it isn&#039;t as common to put tasks in front of LLMs that {{em|should be reasonable}} and keep guiding them onward and onward in good faith toward achieving those tasks until they absolutely break. (With the exception of &quot;jailbreaking&quot; research, of course.)&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;All these tests were run with [https://docs.ollama.com/linux ollama], an offline LLM runtime operated from a terminal window, on a standard consumer computer using less than 2 GiB of RAM. Further technical specifications will be given on the individual test pages.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;All these tests were run with [https://docs.ollama.com/linux ollama], an offline LLM runtime operated from a terminal window, on a standard consumer computer using less than 2 GiB of RAM. Further technical specifications will be given on the individual test pages.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l7&quot;&gt;Line 7:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 7:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Three very important rules will be followed in all of these tests:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Three very important rules will be followed in all of these tests:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;1) Absolutely no online models will be used, only models that can be run entirely offline. This is mainly for the ethical concern of making sure that running the models does not use more computing power or rack space than a regular computer program. However, it also has the benefit of creating the simplest test cases with no external variables. If there is only 1 gigabyte of model or less and not 10 more gigabytes of model hiding out of view, it is easier to know the full range of behaviors of the model, and if nobody else is running the model, there will not be any external actions &quot;the company&quot; can take at the same time the test is running such as datamining conversations or inserting ads. All the causes and effects inside the test will be in one place.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;strong&amp;gt;&lt;/ins&gt;1)&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/strong&amp;gt; &lt;/ins&gt;Absolutely no online models will be used, only models that can be run entirely offline. This is mainly for the ethical concern of making sure that running the models does not use more computing power or rack space than a regular computer program. However, it also has the benefit of creating the simplest test cases with no external variables. 
If there is only 1 gigabyte of model or less and not 10 more gigabytes of model hiding out of view, it is easier to know the full range of behaviors of the model, and if nobody else is running the model, there will not be any external actions &quot;the company&quot; can take at the same time the test is running such as datamining conversations or inserting ads. All the causes and effects inside the test will be in one place.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;2) No generated sentences will be directly copied onto any page. All the text on these pages is created manually. The longest quotations of generated text on these pages will be approximately three words long.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;strong&amp;gt;&lt;/ins&gt;2)&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/strong&amp;gt; &lt;/ins&gt;No generated sentences will be directly copied onto any page. All the text on these pages is created manually. The longest quotations of generated text on these pages will be approximately three words long.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;3) The LLM must not be given an unreasonable task, only tasks which fit within the boundaries of its known programming, bugs, and quirks. Each task will include several steps of &quot;testing understanding&quot; to make sure the LLM is getting the intended answers at every single step before then giving it harder questions requiring inference and not directly explained in the text. Unless &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;it &lt;/del&gt;proves to be impossible, the test will &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;continue &lt;/del&gt;until the LLM actually completes the task.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;strong&amp;gt;&lt;/ins&gt;3)&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;/strong&amp;gt; &lt;/ins&gt;The LLM must not be given an unreasonable task, only tasks which fit within the boundaries of its known programming, bugs, and quirks. Each task will include several steps of &quot;testing understanding&quot; to make sure the LLM is getting the intended answers at every single step before then giving it harder questions requiring inference and not directly explained in the text. 
Unless &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;the task &lt;/ins&gt;proves to be &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;truly &lt;/ins&gt;impossible, the test will &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;not stop &lt;/ins&gt;until the LLM actually completes the task.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Tests ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Tests ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Reversedragon</name></author>
	</entry>
	<entry>
		<id>https://research.moraleconomy.au/index.php?title=Research:The_LLM_Olympics&amp;diff=25272&amp;oldid=prev</id>
		<title>Reversedragon: background</title>
		<link rel="alternate" type="text/html" href="https://research.moraleconomy.au/index.php?title=Research:The_LLM_Olympics&amp;diff=25272&amp;oldid=prev"/>
		<updated>2026-02-16T05:01:36Z</updated>

		<summary type="html">&lt;p&gt;background&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;This is a collection of tests designed to solidly demonstrate the reasons LLMs shouldn&amp;#039;t be used to answer questions, do formal logic, write papers, script videos, or even make alchemical combination games. Many tech demos focus on clumsily attempting to demonstrate that LLMs {{em|can}} do things, and many opinion pieces go on and on about the purely philosophical reasons they &amp;quot;definitely&amp;quot; couldn&amp;#039;t capture the unique human spirit the author purports to exist, but it isn&amp;#039;t as common to put tasks in front of LLMs that {{em|should be reasonable}} and keep guiding them onward and onward in good faith toward achieving those tasks until they absolutely break. (With the exception of &amp;quot;jailbreaking&amp;quot; research, of course.)&lt;br /&gt;
&lt;br /&gt;
All these tests were run with [https://docs.ollama.com/linux ollama], an offline LLM runtime operated from a terminal window, on a standard consumer computer using less than 2 GiB of RAM. Further technical specifications will be given on the individual test pages.&lt;br /&gt;
&lt;br /&gt;
== Rules ==&lt;br /&gt;
&lt;br /&gt;
Three very important rules will be followed in all of these tests:&lt;br /&gt;
&lt;br /&gt;
1) Absolutely no online models will be used, only models that can be run entirely offline. This is mainly for the ethical concern of making sure that running the models does not use more computing power or rack space than a regular computer program. However, it also has the benefit of creating the simplest test cases with no external variables. If there is only 1 gigabyte of model or less and not 10 more gigabytes of model hiding out of view, it is easier to know the full range of behaviors of the model, and if nobody else is running the model, there will not be any external actions &amp;quot;the company&amp;quot; can take at the same time the test is running such as datamining conversations or inserting ads. All the causes and effects inside the test will be in one place.&lt;br /&gt;
&lt;br /&gt;
2) No generated sentences will be directly copied onto any page. All the text on these pages is created manually. The longest quotations of generated text on these pages will be approximately three words long.&lt;br /&gt;
&lt;br /&gt;
3) The LLM must not be given an unreasonable task, only tasks which fit within the boundaries of its known programming, bugs, and quirks. Each task will include several steps of &amp;quot;testing understanding&amp;quot; to make sure the LLM is getting the intended answers at every single step before then giving it harder questions requiring inference and not directly explained in the text. Unless it proves to be impossible, the test will continue until the LLM actually completes the task.&lt;br /&gt;
&lt;br /&gt;
== Tests ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Context window test&lt;br /&gt;
* AI badly solves Deltarune&lt;br /&gt;
* Explaining wave machines&lt;br /&gt;
* Wavebuilder combinations test - make sure it is getting the same combinations, then start pushing it&lt;br /&gt;
* Real proposition test - Is or isn&amp;#039;t Deng Xiaoping Thought an anarchism?&lt;br /&gt;
---&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:Thesis portals]] [[Category:LLM Olympics (RD)]]&lt;/div&gt;</summary>
		<author><name>Reversedragon</name></author>
	</entry>
</feed>