Manuel Romero

Mula, Región de Murcia, España
4K followers · 500+ connections

About

I would say I am a Senior Back-End developer working with the stack: Node.js / Express /…

Experience

  • Maisa

    Madrid, Comunidad de Madrid, España

  • -

    Madrid, Comunidad de Madrid, España

  • -

    Madrid, Comunidad de Madrid, España

  • -

    Mula, Región de Murcia, España

Education

  • Universidad de Murcia

    -

Publications

  • SantaCoder: don’t reach for the stars!

    Over the last two years, we have witnessed tremendous progress in the development of code
    generating AI assistants (Chen et al., 2021; Chowdhery et al., 2022; Nijkamp et al., 2022;
    Fried et al., 2022; Li et al., 2022; Athiwaratkun et al., 2022). Machine learning models are
    now capable of assisting professional developers through the synthesis of novel code snippets,
    not only from surrounding code fragments, but also from natural language instructions. The
    models powering these code completion systems are usually referred to as Large Language
    Models for Code—or code LLMs—and are created by training large transformer neural
    networks (Vaswani et al., 2017) on big corpora of source code. However, there is a lack of
    transparency in the research community on the development of these models due to their
    commercial value and the legal uncertainty around distributing training data and models.
    Some groups have released model weights (Fried et al., 2022; Nijkamp et al., 2022) or
    provided access to the model through a paid API service (Chen et al., 2021; Athiwaratkun
    et al., 2022), but these papers did not release the full training data or the preprocessing
    methods that were used.

    See publication (a minimal code-completion sketch follows this publications list)
  • BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling (2022)

    arXiv

    The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pre-training sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a novel data-centric technique which we name perplexity sampling that enables the pre-training of language models in roughly half the amount of steps and using one fifth of the data. The resulting models are comparable to the current state-of-the-art, and even achieve better results for certain tasks. Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget.

    See publication (a minimal sketch of the perplexity-sampling idea follows this publications list)
  • The BigScience Corpus: A 1.6TB Composite Multilingual Dataset

    -

    As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual language model (BLOOM). We further release a large initial subset of the corpus and analyses thereof, and hope to empower further large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research into studying this large multilingual corpus.

    See publication
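
The SantaCoder abstract above describes code LLMs that synthesize code from surrounding fragments or natural-language instructions. The sketch below only illustrates that kind of completion with the Hugging Face transformers API; it assumes the released checkpoint is published on the Hub as bigcode/santacoder, and the prompt and generation settings are arbitrary choices, not the paper's evaluation setup.

    # Minimal code-completion sketch with a code LLM (assumed checkpoint id).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "bigcode/santacoder"  # assumption: public Hub id of the release

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # trust_remote_code may be needed for the model's custom attention implementation.
    model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

    # The prompt is an ordinary code fragment; the model continues it.
    prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Greedy decoding is used here only to keep the example deterministic; real completion assistants typically sample.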
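The BERTIN abstract introduces perplexity sampling: documents from the Spanish mC4 are subsampled according to the perplexity a cheap language model assigns them, favouring mid-perplexity text over boilerplate and noise. The sketch below is only an illustration of that idea under stated assumptions; the scorer, the Gaussian weighting, and the mu/sigma parameters are placeholders, not the paper's exact procedure.

    # Illustrative perplexity-sampling sketch (not the BERTIN implementation).
    import math
    import random

    def gaussian_keep_probability(log_ppl: float, mu: float, sigma: float) -> float:
        # Weight a document by how close its log-perplexity is to a target region:
        # very low perplexity tends to be repetitive boilerplate, very high
        # perplexity tends to be noise. mu and sigma are placeholders that would
        # be estimated from perplexity quantiles on a sample of the corpus.
        return math.exp(-((log_ppl - mu) ** 2) / (2.0 * sigma ** 2))

    def perplexity_sample(docs, perplexity, mu, sigma, seed=0):
        # `perplexity(doc)` stands in for any cheap language-model scorer.
        rng = random.Random(seed)
        for doc in docs:
            p = gaussian_keep_probability(math.log(perplexity(doc)), mu, sigma)
            if rng.random() < p:
                yield doc

    # Toy usage with a fake scorer; a real run would score mC4 documents.
    docs = ["buy now buy now buy now", "El gato duerme al sol.", "zxcv 9 9 9 qwerty"]
    fake_ppl = {docs[0]: 15.0, docs[1]: 120.0, docs[2]: 900.0}.get
    print(list(perplexity_sample(docs, fake_ppl, mu=math.log(120.0), sigma=0.7)))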

Courses

  • Machine Learning Crash Course

    -

Languages

  • English

    -

  • French

    -

  • Spanish (Castilian)

    -
