This work proposes a novel persona-driven data synthesis methodology that leverages the diverse perspectives within an LLM to create diverse synthetic data. To fully exploit this technique at scale, Persona Hub is introduced: a collection of 1 billion diverse personas (~13% of the world's total population) automatically curated from web data.
The project is available at GitHub.
The dataset is available at HuggingFace.
Two scalable approaches are proposed to derive diverse personas from massive web data to construct Persona Hub: Text-to-Persona and Persona-to-Persona.
Text-to-Persona
A person with specific professional experiences and cultural backgrounds will have unique interests in reading and writing. Therefore, from a given text, a specific persona who is likely to read, write, like, or dislike that text can be inferred. Given that text data on the web is virtually unlimited and all-encompassing, a wide-ranging collection of personas can be obtained simply by prompting an LLM with these web texts.
In practice, LLMs are asked to output persona descriptions as specifically as possible. The granularity of persona descriptions can be controlled by specifying it in the prompt, and the input texts themselves also influence that granularity.
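Below is a minimal sketch of what a Text-to-Persona call might look like; the prompt template and the generic `complete` callback are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of Text-to-Persona. The prompt template and the `complete`
# callback (any function that sends a prompt to an LLM and returns its reply)
# are illustrative assumptions, not the paper's exact prompt.

TEXT_TO_PERSONA_PROMPT = """\
Who is likely to read, write, like, or dislike the following text?
Describe this persona as specifically as possible.

Text:
{text}
"""

def text_to_persona(web_text: str, complete) -> str:
    return complete(TEXT_TO_PERSONA_PROMPT.format(text=web_text))
```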
Persona-to-Persona
To supplement the personas that Text-to-Persona can hardly reach, Persona-to-Persona is proposed. It derives personas with interpersonal relationships from those obtained via Text-to-Persona. This can be easily achieved by prompting the LLM with “Who is in close relationship with the given persona?”
Following the six degrees of separation theory (any two people on Earth can be connected through a chain of no more than five intermediaries, or six steps in total), six iterations of persona relationship expansion are performed for each persona obtained through Text-to-Persona, enriching the persona collection even further.
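A sketch of this expansion loop, reusing the same hypothetical `complete` helper as above:

```python
# Sketch of Persona-to-Persona: six rounds of relationship expansion,
# following the six degrees of separation theory. The prompt wording
# and the `complete` helper are illustrative assumptions.

RELATION_PROMPT = "Who is in close relationship with the given persona?\n\nPersona: {persona}"

def expand_personas(seeds: list[str], complete, rounds: int = 6) -> list[str]:
    collected = list(seeds)
    frontier = seeds
    for _ in range(rounds):
        # Derive one related persona per persona in the current frontier.
        frontier = [complete(RELATION_PROMPT.format(persona=p)) for p in frontier]
        collected.extend(frontier)
    return collected
```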
Deduplication
First, Text-to-Persona is run on the RedPajama v2 dataset, and then Persona-to-Persona is performed. To ensure the diversity of Persona Hub, billions of personas are deduplicated in two ways (see the sketch after this list):
- MinHash-based Deduplication: MinHash is applied on 1-grams with a signature size of 128, and deduplication is performed at a similarity threshold of 0.9.
- Embedding-based Deduplication: a text embedding model (e.g., OpenAI's text-embedding-3-small) is used to compute an embedding for each persona, and personas with a cosine similarity greater than 0.9 are filtered out.
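A sketch of both passes, assuming the `datasketch` library for MinHash LSH and precomputed persona embeddings (e.g., from text-embedding-3-small):

```python
# Sketch of the two deduplication passes. Parameters follow the paper:
# 1-gram (word-level) shingles, 128-permutation signatures, 0.9 threshold.
# Assumes the `datasketch` library and precomputed persona embeddings.
import numpy as np
from datasketch import MinHash, MinHashLSH

def minhash_dedup(personas: list[str]) -> list[str]:
    lsh = MinHashLSH(threshold=0.9, num_perm=128)
    kept = []
    for i, persona in enumerate(personas):
        m = MinHash(num_perm=128)
        for token in persona.split():   # 1-gram shingles
            m.update(token.encode("utf8"))
        if not lsh.query(m):            # no near-duplicate indexed yet
            lsh.insert(str(i), m)
            kept.append(persona)
    return kept

def embedding_dedup(personas: list[str], embs: np.ndarray) -> list[str]:
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(personas)):
        # Drop persona i if it is >0.9 cosine-similar to any kept persona.
        if not kept or np.max(embs[kept] @ embs[i]) <= 0.9:
            kept.append(i)
    return [personas[i] for i in kept]
```

The greedy quadratic scan in the embedding pass is for illustration only; at billion-persona scale it would need an approximate nearest-neighbor index.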
After deduplication and simple heuristic filtering of low-quality persona descriptions, a total of 1,015,863,523 personas remain to form Persona Hub.
Just as zero-shot or few-shot methods can be used to prompt an LLM, the persona-driven methodology is also flexible and compatible with various forms of prompts to create synthetic data. Three persona-driven data synthesis prompting methods are proposed (sketched after this list):
- Zero-shot prompting does not leverage any existing examples (i.e., demonstrations), thereby fully exploiting the model's creativity without being constrained by specific examples.
- Few-shot prompting can better ensure that the synthesized data meets the requirements by providing some demonstrations.
- Persona-enhanced few-shot prompting is more effective at enhancing the LLM's persona-driven data synthesis capabilities. However, its drawback is that it requires deriving the corresponding persona for each demonstration in the few-shot prompt beforehand.
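The three styles differ only in what the prompt contains; the templates below are illustrative assumptions, not the paper's exact prompts.

```python
# Illustrative templates for the three prompting styles; the wording is an
# assumption. {persona} is a Persona Hub entry, {task} names the data to
# synthesize (e.g., "a math problem"), and {demos} are existing examples.

ZERO_SHOT = "Create {task} with the following persona:\n{persona}"

FEW_SHOT = """\
Here are some examples of {task}:
{demos}

Create {task} with the following persona:
{persona}
"""

# Persona-enhanced few-shot: each demonstration is paired with the persona
# inferred for it beforehand (the preprocessing cost noted above).
PERSONA_ENHANCED_FEW_SHOT = """\
Persona: {demo_persona}
Example: {demo}

Create {task} with the following persona:
{persona}
"""
```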
The persona-driven approach is versatile, adapting to different data synthesis scenarios simply by adjusting the data synthesis prompt.
Math Problem Synthesis
- Adding a persona to a math problem creation prompt leads the LLM to generate problems related to that persona.
- The prompt's flexibility is not hindered; focus and difficulty can still be specified, as the sketch after this list shows.
- Using personas of math professionals results in more challenging problems requiring advanced mathematical knowledge.
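A sketch of such a prompt, with hypothetical focus and difficulty fields (wording assumed):

```python
# Hypothetical persona-driven math prompt with focus/difficulty controls;
# the template wording is an assumption.
MATH_PROMPT = "Create a {difficulty} math problem about {topic} with the following persona:\n{persona}"

prompt = MATH_PROMPT.format(
    difficulty="challenging",
    topic="linear algebra",
    persona="a researcher in computational fluid dynamics",  # a math professional
)
```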
Logical Reasoning Problems
- Logical reasoning problems can be synthesized using the persona-driven methodology.
- Ruozhiba-style logical reasoning problems can also be created with personas.
Instructions (User Prompts):
- Persona Hub can simulate users to understand their requests for LLM assistance, resulting in diverse instructions.
- Zero-shot and persona-enhanced few-shot prompting methods can be used.
- The persona-enhanced few-shot method involves inferring personas from existing instruction datasets.
- Simulated user-LLM conversations can be generated to enhance instruction-following and conversational abilities.
Knowledge-rich Texts:
- The persona-driven methodology can create knowledge-rich plain text for the pre-training and post-training of LLMs.
- LLMs can be prompted to write Quora articles using personas.
Game NPCs:
- Persona Hub can create diverse NPCs for games by projecting personas onto characters within the game's world.
Tool (Function) Development:
- Persona Hub can predict the tools users might need, allowing these tools to be pre-built.
- LLMs can call these pre-built tools to return results without building them from scratch.
- Interface definitions can be converted into code implementations, as sketched after this list.
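A sketch of this flow under the same assumptions as earlier (hypothetical prompt wording and `complete` helper):

```python
# Sketch of tool development: infer the interface a persona would need,
# then convert that interface definition into a code implementation.
# Prompt wording and the `complete` helper are illustrative assumptions.

INTERFACE_PROMPT = """\
What tool (function) would the following user need from an LLM assistant?
Define its interface (name, description, parameters) in JSON:
{persona}
"""

IMPLEMENT_PROMPT = "Implement the following tool interface as a Python function:\n{interface}"

def develop_tool(persona: str, complete) -> str:
    interface = complete(INTERFACE_PROMPT.format(persona=persona))
    return complete(IMPLEMENT_PROMPT.format(interface=interface))
```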
Scaling Synthetic Data Creation with 1,000,000,000 Personas (arXiv: 2406.20094)