Project Structure
src
└─llmkit_data
  ├─cli (entry point for scripts)
  ├─converter (converts standard datasets to framework-specific formats)
  ├─std (tools for preprocessing standard datasets)
  └─utils

llmkit is designed to be used both as a library and as a command-line tool.
- Entry Points: Only Python files under cli can have an entry point like:

      if __name__ == "__main__":
          pass

- Function Modules: Other Python modules should be collections of functions. This allows easy importation in other projects or scripts, such as:

      from llmkit_data.utils.json import read_jsonl, write_jsonl

- Function Scope: Functions defined in cli should only be used within the script and should not be imported by others. If a function might be imported in the future, place it under cli temporarily and refactor it later.
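The split between function modules and entry points can be sketched as follows. This is a minimal illustration, not the project's actual code: the script name `prep_example.py`, the `drop_empty_text` helper, and the `--input`/`--output` options are hypothetical, and `read_jsonl`/`write_jsonl` are local stand-ins whose signatures are assumed rather than taken from `llmkit_data.utils.json`.

```python
"""Sketch of a hypothetical cli/prep_example.py (all names are illustrative)."""
import argparse
import json


# Local stand-ins for llmkit_data.utils.json.read_jsonl / write_jsonl so the
# sketch runs on its own; the real signatures are an assumption.
def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(records, path):
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Reusable logic: in the layout above this would live outside cli
# (for example under std or utils) so that other projects can import it.
def drop_empty_text(records):
    return [r for r in records if r.get("text")]


# Script-only glue: argument parsing stays inside the cli file.
def main(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical preprocessing script")
    parser.add_argument("--input", help="source JSONL file")
    parser.add_argument("--output", help="destination JSONL file")
    args = parser.parse_args(argv)
    if not args.input or not args.output:
        parser.print_usage()
        return
    write_jsonl(drop_empty_text(read_jsonl(args.input)), args.output)


# Only files under cli carry this guard.
if __name__ == "__main__":
    main()
```

Keeping `main` free of business logic means that moving `drop_empty_text` into a function module later only requires changing an import.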
Adding a New Dataset
To add a new dataset, create an entry named prep_{dataset_name}.py under cli. Users can then use it via:
    python -m llmkit_data.cli.prep_{dataset_name}

Adding a New Converter (Support for a New Framework)
- Research: Read the new framework’s documentation to understand its data formats and how it works.
- Converter Implementation:
  - Add a new converter in converter/{framework}.py. Most functions should be placed here for easy importation.
  - Create an entry point in cli/convert_to_{framework}.py.
Note: This command-line tool typically supports various dataset formats. Refer to Dataset Formats for more details.
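The two converter steps above might look roughly like the sketch below, shown in a single block so it can run on its own. Everything specific here is an assumption made for illustration: the input fields `question` and `answer`, the output fields `prompt` and `response`, and the `--input`/`--output` options are hypothetical and do not describe any real framework or the formats listed under Dataset Formats.

```python
import argparse
import json


# --- would live in converter/{framework}.py: plain, importable functions ---
def convert_record(record):
    """Map one standard record to a hypothetical framework layout."""
    return {"prompt": record["question"], "response": record["answer"]}


def convert_records(records):
    return [convert_record(r) for r in records]


# --- would live in cli/convert_to_{framework}.py: thin entry point ---
def main(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical converter script")
    parser.add_argument("--input", help="standard dataset in JSONL form")
    parser.add_argument("--output", help="framework-specific JSONL file")
    args = parser.parse_args(argv)
    if not args.input or not args.output:
        parser.print_usage()
        return
    # Read one JSON object per line, skipping blank lines.
    with open(args.input, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    with open(args.output, "w", encoding="utf-8") as f:
        for row in convert_records(records):
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    main()
```

Because the conversion logic sits in plain functions, another project could call `convert_records` directly without going through the command line.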