We are seeking a highly skilled and motivated candidate with expertise in programming, problem-solving, and machine learning/artificial intelligence (ML/AI). The ideal candidate will have strong programming skills, with a particular focus on Python, and experience using key data manipulation libraries such as Pandas and NumPy.
Key Requirements:
1. Proficiency in Python and its data manipulation libraries (e.g., Pandas, NumPy, Hugging Face Datasets).
2. Demonstrable experience designing and building scalable data pipelines for collecting, cleaning, transforming, and versioning large-scale datasets (text, code, structured) for ML model training and applications (an illustrative sketch follows this list).
3. Hands-on experience in preparing and formatting diverse datasets into specific structures required for ML model training.
4. Experience in curating, cleaning, and structuring datasets for ML model evaluation, ensuring data quality and relevance for various benchmarks.
5. Familiarity with common challenges in training data preparation, such as bias detection/mitigation, data distribution analysis, and data augmentation techniques.
6. Solid understanding of data engineering best practices, including data quality, versioning, and efficient data processing.
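For illustration only, below is a minimal sketch (not a prescribed approach) of the kind of cleaning and versioning step described in point 2, using Pandas. The file names and the "text" column are hypothetical and not part of the role description:

    import pandas as pd

    # Hypothetical raw text corpus with a "text" column.
    raw = pd.read_csv("raw_corpus.csv")

    # Basic cleaning: strip whitespace, drop empty rows, remove exact duplicates.
    raw["text"] = raw["text"].astype(str).str.strip()
    cleaned = (
        raw[raw["text"].str.len() > 0]
        .drop_duplicates(subset="text")
        .reset_index(drop=True)
    )

    # Persist a versioned snapshot for downstream training jobs.
    cleaned.to_parquet("corpus_v1.parquet", index=False)

In practice the role involves far larger datasets and more elaborate pipelines; this only indicates the flavour of the day-to-day work.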
Desirable skills:
1. LLM fine-tuning.
2. Dataset preparation for LLMs, including prompt-completion, instruction-following, and chat formats (an illustrative sketch follows this list).
3. Familiarity with LLM evaluation strategies.
4. Familiarity with common challenges in LLM data preparation, such as bias detection/mitigation, data distribution analysis, and data augmentation techniques.
5. Familiarity with the Hugging Face Transformers and Datasets libraries.
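Again for illustration only, a minimal sketch of the dataset preparation mentioned in points 2 and 5: converting hypothetical question/answer records into a chat-style format with the Hugging Face Datasets library. The field names and message convention here are assumptions, not a required format:

    from datasets import Dataset

    # Hypothetical question/answer records to be converted into a chat format.
    records = [
        {"question": "What is overfitting?", "answer": "When a model memorises its training data."},
        {"question": "What does Pandas provide?", "answer": "DataFrame-based data manipulation."},
    ]

    def to_chat(example):
        # One common convention: a list of role/content messages per example.
        return {
            "messages": [
                {"role": "user", "content": example["question"]},
                {"role": "assistant", "content": example["answer"]},
            ]
        }

    dataset = Dataset.from_list(records).map(to_chat, remove_columns=["question", "answer"])
    print(dataset[0]["messages"])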
In addition to technical expertise, the ideal candidate will have experience with Git and GitHub for version control and a proven ability to collaborate effectively in a team environment, particularly when working on shared codebases and remote projects. Strong data management and manipulation skills are crucial, as is experience working on remote servers to develop and deploy machine learning models.