Potential harms of large language models can be mitigated by watermarking
model output, i.e., embedding signals into generated text that are invisible to
humans but algorithmically detectable from a short span of tokens.
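As an illustration of how such a signal might be detected, here is a minimal sketch of the z-test used by green-list watermark schemes of this kind (the parameter names and example numbers are my own, not the paper's reference implementation):

    import math

    def watermark_z_score(green_count, total_tokens, gamma=0.5):
        """z-statistic against the null hypothesis 'text is unwatermarked'.

        green_count:  generated tokens that landed in the green list
        total_tokens: number of tokens scored
        gamma:        fraction of the vocabulary placed on the green list
        """
        expected = gamma * total_tokens
        variance = total_tokens * gamma * (1 - gamma)
        return (green_count - expected) / math.sqrt(variance)

    # 74 of 100 tokens green gives z = 4.8: strong evidence of a watermark.
    print(watermark_z_score(74, 100))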
Language models (LMs) have demonstrated remarkable performance on downstream
tasks, using in-context exemplars or human instructions. Recent works have
shown that chain-of-thought (CoT) prompting can elicit multi-step reasoning by
having the model generate intermediate reasoning steps before committing to a
final answer.
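Concretely, a CoT prompt prepends worked examples whose answers spell out the intermediate reasoning (a minimal sketch; the exemplar wording is illustrative):

    # Few-shot chain-of-thought prompt; the model is expected to imitate
    # the step-by-step answer format before stating its final answer.
    COT_PROMPT = (
        "Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
        "How many balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
        "5 + 6 = 11. The answer is 11.\n\n"
        "Q: {question}\nA:"
    )

    prompt = COT_PROMPT.format(
        question="A farm has 3 pens with 4 sheep each. How many sheep in total?"
    )
    # `prompt` is then sent to the LM, which continues after "A:".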
Autoformalization is the process of automatically translating from natural language mathematics to formal specifications and proofs.
We show that large language models (LLMs) can translate a significant portion of mathematical competition problems perfectly into formal specifications in Isabelle/HOL.
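For a flavor of the target, here is a toy informal-to-formal pair, written in Lean 4 with Mathlib rather than Isabelle/HOL (my own example, not a competition problem from the paper):

    import Mathlib

    -- Informal statement: "For all real numbers a and b,
    -- a^2 + b^2 is at least 2ab."
    theorem sq_sum_ge_two_mul (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
      nlinarith [sq_nonneg (a - b)]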
We live in an era of rapid progress in artificial intelligence, both within the field and in the public sphere.
The more adept large language models become at mimicking human language, the more vulnerable we become to anthropomorphism, to seeing the systems in which they are embedded as more human-like than they really are.
Large language models, i.e., large neural networks trained on a simple predictive objective over a massive corpus of natural language, have reinvigorated debate over whether human cognitive capacities might emerge in such generic models given sufficient training data.
Of particular interest is the ability of these models to reason about novel problems zero-shot, without any direct training on those problems.
Large language models (LLMs) have shown exceptional performance on a variety
of natural language tasks. Yet, their capabilities for HTML understanding
(i.e., parsing the raw HTML of a webpage, with applications to automation of
web-based tasks, crawling, and browser-assisted retrieval) have not been fully
explored.
Large Language Models (LLMs) are powerful tools, capable of leveraging their
training on natural language to write stories, generate code, and answer
questions. But can they generate functional video game levels?
We show how large language models can be made to perform faithful multi-step reasoning via a process whose causal structure mirrors the underlying logical structure of the problem.
Our approach works by chaining together reasoning steps, where each step results from calls to two fine-tuned large language models, one for selection and one for inference, to produce a valid reasoning trace.
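A minimal sketch of such a selection-inference loop (the callables and the halting check below are assumptions of mine; the paper trains a separate "halter" component rather than using a string test):

    def reason(question, facts, select_lm, infer_lm, max_steps=10):
        """Chain reasoning steps: a selection LM picks premises and an
        inference LM derives one new fact, which joins the context."""
        trace = []
        for _ in range(max_steps):
            premises = select_lm(question, facts)  # choose relevant facts
            new_fact = infer_lm(premises)          # derive one conclusion
            trace.append((premises, new_fact))
            facts = facts + [new_fact]
            if halts(question, new_fact):          # stand-in halting check
                break
        return trace

    def halts(question, fact):
        # Placeholder: stop once the derived fact addresses the question;
        # a learned halter model replaces this in the paper's setup.
        return question.lower().rstrip("?") in fact.lower()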
This paper explores the limits of the current generation of large language
models for program synthesis in general purpose programming languages. We
evaluate a collection of such models (with between 244M and 137B parameters)
on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and
fine-tuning regimes.
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high.
Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks.
Large language models (LLMs) are increasingly able to answer factual questions accurately, but that ability does not necessarily imply a general understanding of the concepts relevant to the question asked (the anchor query).
We propose conceptual consistency to measure an LLM's understanding of relevant concepts.
We undertake a joint study of controllability and robustness in the context of large language models (LLMs).
We demonstrate that state-of-the-art T5 and PaLM models (both pretrained and finetuned) can exhibit poor controllability and robustness, neither of which improves with increasing model size.
The dominant paradigm of natural language processing consists of large-scale
pre-training on general domain data and adaptation to particular tasks or
domains. As we pre-train larger models, conventional fine-tuning, which
retrains all model parameters, becomes less feasible.
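One parameter-efficient alternative this motivates is a low-rank update in the style of LoRA: freeze the pretrained weight and learn only a rank-r correction. A minimal NumPy sketch under my own assumptions about shapes and initialization:

    import numpy as np

    d, r = 768, 8                       # hidden size; adapter rank, r << d
    rng = np.random.default_rng(0)

    W = rng.normal(size=(d, d))         # frozen pretrained weight
    A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
    B = np.zeros((d, r))                # trainable up-projection; zero init
                                        # keeps the adapted model identical
                                        # to the base model before training

    def forward(x):
        # Only A and B are trained: 2*d*r parameters instead of d*d.
        return W @ x + B @ (A @ x)

    x = rng.normal(size=(d,))
    print(np.allclose(forward(x), W @ x))  # True before any training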
Large Language Models (LLMs) have limited performance when solving arithmetic
reasoning tasks and often provide incorrect answers. Unlike natural language
understanding, math problems typically have a single correct answer, making
the task of generating accurate solutions more challenging for LLMs.
We investigate the optimal model size and number of tokens for training a
transformer language model under a given compute budget. We find that current
large language models are significantly undertrained, a consequence of the
recent focus on scaling language models whilst keeping the amount of training
data constant.
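A back-of-the-envelope version of the compute-optimal split, using the common approximations of C ~ 6ND training FLOPs and roughly 20 tokens per parameter (round numbers, not the paper's fitted constants):

    def compute_optimal(compute_flops, tokens_per_param=20.0):
        """Split budget C ~ 6*N*D between parameters N and tokens D,
        assuming the optimal ratio D/N is roughly constant."""
        n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
        n_tokens = tokens_per_param * n_params
        return n_params, n_tokens

    # A Chinchilla-scale budget of ~5.8e23 FLOPs lands near
    # 70B parameters and 1.4T tokens.
    n, d = compute_optimal(5.76e23)
    print(f"params = {n:.2e}, tokens = {d:.2e}")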
Sentence Simplification aims to rephrase complex sentences into simpler
sentences while retaining the original meaning. Large language models (LLMs)
have demonstrated the ability to perform a variety of natural language
processing tasks.
Large Language Models (LLMs) have achieved excellent performance in various
tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on
the other hand, may improve their reasoning abilities by self-thinking without
external inputs.