HTLM: Hyper-Text Pre-Training and Prompting of Language Models

Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, Luke Zettlemoyer

We introduce HTLM, a hyper-text language model trained on a large-scale web crawl. Modeling hyper-text has a number of advantages: (1) it is easily gathered at scale, (2) it provides rich document-level and end-task-adjacent supervision (e.g. 'class' and 'id' attributes often encode document category information), and (3) it allows for new structured prompting that follows the established semantics of HTML (e.g. to do zero-shot summarization by infilling <title> tags for a webpage that contains the input text). We show that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels. HTLM matches or exceeds the performance of comparably sized text-only LMs for zero-shot prompting and fine-tuning for classification benchmarks, while also setting new state-of-the-art performance levels for zero-shot summarization. We also find that hyper-text prompts provide more value to HTLM, in terms of data efficiency, than plain text prompts do for existing LMs, and that HTLM is highly effective at auto-prompting itself, by simply generating the most likely hyper-text formatting for any available training data. We will release all code and models to support future HTLM research.
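As a rough illustration of the structured prompting described above, the sketch below wraps an input text in a minimal HTML page and leaves the <title> element as a placeholder for the model to infill. It only builds the prompt string and does not call any model. The function name `build_summarization_prompt`, the `<mask>` placeholder, and the exact page layout are assumptions made for this example, not the authors' released code or prompt format.

```python
from html import escape

# Assumed placeholder token; the actual mask token used by the model may differ.
MASK = "<mask>"


def build_summarization_prompt(article_text: str, mask_token: str = MASK) -> str:
    """Wrap article text in a minimal HTML page whose <title> is left to be infilled.

    Hypothetical helper for illustration only.
    """
    # Escape &, <, > and quotes so the article text cannot break the markup.
    body = escape(article_text.strip())
    return (
        "<html>\n"
        "<head>\n"
        f"<title>{mask_token}</title>\n"
        "</head>\n"
        "<body>\n"
        f"<p>{body}</p>\n"
        "</body>\n"
        "</html>"
    )


prompt = build_summarization_prompt("Rain & wind closed two bridges on Monday.")
print(prompt)
```

A model trained to fill masked spans in HTML would then be asked to generate the text that replaces the placeholder, and that generated title serves as the summary.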