WildChat
The WildChat Dataset is a corpus of 1 million real-world user-ChatGPT interactions, characterized by a wide range of languages and a diversity of user prompts. It was constructed by offering free access to ChatGPT and GPT-4 in exchange for consensual chat history collection. Using this dataset, we finetuned Meta's Llama-2 and created WildLlama-7b-user-assistant, a chatbot which is able to predict both user prompts and assistant responses.
To learn more: dataset / model / paper / interactive search tool
FAQ
Q: Which versions of GPT are used, and why do some responses to inquiries about the GPT version indicate "GPT-3"?
A: To collect the dataset, we employ two models: ChatGPT (GPT-3.5 Turbo) and GPT-4. During most of our data collection period, GPT-4 was trained with information available up to September 2021, preceding its release. Consequently, the chatbot might self-identify as GPT-3 due to the training data cut-off. However, a notable observation is that when interacting through the ChatGPT web interface, GPT-4 accurately recognizes and states its version. This accuracy could be attributed to a system prompt or a hidden context underlying the web interface.
Q: How was the WildChat dataset created?
A: To create the WildChat dataset, we deployed two chatbot services, one utilizing the GPT-3.5 Turbo API and another leveraging the GPT-4 API. Both services were hosted on Hugging Face Spaces and made available to the public. Importantly, users were not required to create an account or provide personal information to access our services. The dataset currently encompasses interactions from April 9, 2023, to April 30, 2024.
Q: What are some limitations of the WildChat dataset and the WildLlama model?
A:
- User Demographics: Our chatbot service, hosted on Hugging Face Spaces, predominantly attracts developers or individuals connected to the IT sector. Consequently, the user base may not accurately represent the broader population. This specificity influences the dataset, often reflecting conversations with a technological tilt, such as coding queries.
- Toxicity and Selection Bias: The anonymity offered by our chatbot service might encourage users to engage in conversations they would avoid on platforms requiring registration.
- Bias in WildLlama: The WildLlama model, finetuned on machine-generated data from ChatGPT and GPT-4, may inherit biases inherent in its parent models.
- Usage Constraints for WildLlama: WildLlama is designed for research purposes and not for practical advice or applications involving direct human interaction. Its use should be confined to academic and exploratory contexts.