Hi there!

Welcome to Lay’s blog! 👋

Spoiler: it’s not gonna be about chips, but about data and AI. 😎

Journey into building a custom RAG (Day 2)

Now that I have a basic RAG working, I ran a few experiments to assess how accurately it retrieves relevant documents. In one of them, I took a part of the Hugging Face TRL library that is supposed to be included in my embeddings, regarding DPO (Direct Preference Optimization). Then I asked my model questions about it, but it wouldn’t get them right, and answered the way ChatGPT would, without context or knowledge about DPO....

February 9, 2024 · Cyril LAY

Journey into building a custom RAG (Day 1)

I am currently building a RAG (Retrieval Augmented Generation) LLM application, and recently decided to share my journey on this blog. I’ll skip the intro and jump straight to my current status. Here’s where I am: Starting Point: To get started with all the boilerplate code, I used the Metaflow RAG demo repository, along with this Anyscale blog post for insights and a second point of view. Metaflow’s repo includes a Streamlit chat webapp, utilities for scraping and parsing markdown documentation files, etc....

February 8, 2024 · Cyril LAY

Develop a Retrieval Augmented Generation (RAG) LLM Application

This article explains how to customize a large language model for your specific needs. There are many approaches to adapting a Large Language Model (LLM) to your data, with the goal of making its responses more relevant. Prompt engineering: you spend time working on a good prompt, possibly a templated prompt in which you can include your data. This is the fastest way, but it quickly becomes limiting, as the context size (the maximum input you can give to a model) is usually in the range of 4K to 16K tokens (roughly 3K to 11K words), and it is shared between question and response....

February 6, 2024 · Cyril LAY

Detect Deep Fakes in 30 minutes with Computer Vision

In the realm of digital content, the rise of deep fakes has presented a unique set of challenges. I joined the Kaggle Deep Fake Detection Challenge (DFDC) and challenged myself to submit a working solution in two days maximum. I describe my approach below. Code is here. The Challenge at a Glance: The dataset size is a challenge in itself, at 500GB of video content. Approach and Methodology: Frame Extraction: Capture a few frames from each video to immediately reduce the dataset size by 90%, and obtain something small to iterate with....

December 1, 2018 · Cyril LAY

Apache Spark Best Practices Megalist

Navigating through the complexities of Apache Spark job optimization can be challenging. Based on recent consulting experiences with various companies, I have compiled a list of best practices for optimizing Apache Spark applications. These strategies have proven to be effective in enhancing performance and cost-efficiency. Partitioning: Partition wisely: It’s important to balance the size of partitions. Avoid too many small partitions (big coordination overhead) and too few large ones (no parallelism)....

May 22, 2017 · Cyril LAY

Detecting bot traffic on a webserver

TL;DR: I did a basic bot detection POC/analysis on the HTTP logs of a website, for an interview with a company in 2017. Here are the results: Among the 68,000 unique IPs, 10% were classified as potential bots by our algorithm. Among the 5 million requests, 80% were classified as bot traffic. According to this article, 40% of the overall web traffic comes from bots. Given the simplicity of our algorithm, it most likely flags a lot of false positives, as we will see very soon....

February 26, 2017 · Cyril LAY