In the rapidly evolving field of artificial intelligence, the quality and quantity of training data play a crucial role in developing powerful AI models. This article explores MIZU's strategies for building and maintaining high-quality open-source data: generating synthetic data, implementing a universal tagging system, and ensuring data freshness through continuous updates.
1. Synthetic Data Generation
While large foundation models such as those behind ChatGPT are trained primarily on human-written internet data, this approach faces significant challenges:
- Data scarcity: Human-generated data is finite, and model training consumes data far faster than humans produce it.
- Privacy and access barriers: Data in certain domains is inaccessible due to privacy regulations or proprietary restrictions.
- High cost: High-quality data often requires expensive human annotation.
To address these challenges, MIZU is utilizing various data synthesis approaches that recent research has verified as effective. The power of synthetic data in improving AI performance is evident in recent developments across the industry. For instance, OpenAI's o1 models, trained on synthesized chain-of-thought (CoT) data, demonstrated greatly improved reasoning in math and coding, as well as stronger resistance to jailbreaking.
Specifically, the following approaches have been verified effective for synthetic data generation:
- Agentic Framework for Data Synthesis: Recent research, such as Microsoft Research's AgentInstruct, has shown that using agents to synthesize data is a promising approach. This method leverages iterative refinement, tool use, and multi-agent collaboration to produce diverse, high-quality synthetic data.
- Persona-Driven Data Synthesis: Work by institutions like Tencent AI Lab has demonstrated the effectiveness of using mined personas during the data synthesis process. By simulating diverse human perspectives, this approach can elicit a wider range of knowledge and capabilities from large language models (LLMs).
- Chain-of-Thought (CoT) Data Synthesis: Recent publications from institutions such as Google DeepMind and Stanford University have explored sophisticated CoT data synthesis processes. These involve breaking down complex questions into sub-problems and resolving them step by step, often using a reward model or verifier to optimize the CoT path.
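As a rough illustration of the persona-driven approach above, the sketch below pairs sampled personas with topics to diversify generation prompts. The persona list, the topics, and the `generate` stub are placeholder assumptions: a real pipeline would mine personas from large corpora and call an actual LLM API.

```python
import random

PERSONAS = [
    "a high-school physics teacher",
    "a freelance data journalist",
    "an embedded-systems engineer",
]

TOPICS = ["probability", "unit conversion", "estimation"]

def generate(prompt: str) -> str:
    """Stand-in for an LLM call; echoes the prompt it would send."""
    return f"[LLM output for: {prompt}]"

def synthesize(n: int, seed: int = 0) -> list[dict]:
    """Create n records, each conditioned on a randomly sampled persona."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        persona = rng.choice(PERSONAS)
        topic = rng.choice(TOPICS)
        prompt = (f"Write a problem that {persona} might pose "
                  f"about {topic}, then solve it step by step.")
        records.append({"persona": persona, "topic": topic,
                        "text": generate(prompt)})
    return records

data = synthesize(3)
for r in data:
    print(r["persona"], "->", r["topic"])
```

Varying the persona while holding the task fixed is what drives diversity here: each persona pulls the model toward a different slice of its knowledge.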
By leveraging these synthetic data generation techniques, MIZU is creating diverse, high-quality datasets that enhance the capabilities of AI models while overcoming data scarcity, privacy concerns, and domain-specific barriers.
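The verifier-guided CoT synthesis described above can be sketched in miniature: sample several candidate reasoning chains and keep only those whose final answer a checker accepts. The arithmetic task, the hand-rolled chains, and the exact-answer `verify` function are illustrative stand-ins for LLM sampling and a learned reward model.

```python
def candidate_chains(a: int, b: int):
    """Yield (steps, answer) candidates for computing a*b + b; one is flawed."""
    yield ([f"{a}*{b} = {a*b}", f"{a*b} + {b} = {a*b + b}"], a*b + b)
    # A chain with an arithmetic slip in its first step:
    yield ([f"{a}*{b} = {a*b + 1}",
            f"{a*b + 1} + {b} = {a*b + 1 + b}"], a*b + 1 + b)

def verify(a: int, b: int, answer: int) -> bool:
    """Ground-truth checker standing in for a reward model or verifier."""
    return answer == a * b + b

def synthesize_cot(a: int, b: int) -> list[list[str]]:
    """Retain only the reasoning chains whose answers pass verification."""
    return [steps for steps, ans in candidate_chains(a, b) if verify(a, b, ans)]

kept = synthesize_cot(3, 4)
print(len(kept))  # only the correct chain survives
```

The surviving chains become training examples; discarding unverified paths is what keeps the synthesized CoT data clean.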
2. Community-Driven Universal Data Tagging
Data tagging is a crucial step in processing data for model training. Conventional data-processing pipelines vary by developer, and most LLM developers keep the specifics confidential, but they generally include:
- Deduplication: Removing redundancy in the data.
- Data filtering: Eliminating low-quality or toxic data, as well as data containing personal information.
- Data tagging: Assigning tags to each data example (e.g., domains, categories, language) for use in selecting specific data mixes during training.