Welcome everyone. I would like to share my experience of building my own LLM from scratch; this article walks through the details of the LLM architecture. I followed a great book: Build a Large Language Model (From Scratch) by Sebastian Raschka.
In short, I built the GPT architecture piece by piece, layer by layer. Once built, the model was far too large to pre-train on the CPU/GPU I have locally, so I loaded the GPT-2 weights, which OpenAI has made publicly available. As the last part of the process, I also fine-tuned the model to solve classification problems like spam detection. The series will span 2-3 articles, from building the GPT architecture to pre-training to fine-tuning.
The whole experience enriched my understanding of the deep inner workings of Large Language Models.
I will be adding some code snippets that give a glimpse of the aspects covered. They are optional and only serve to deepen understanding, so feel free to skip them.
LLMs use a Decoder to predict the next word
Broadly speaking, the GPT architecture predicts the next word, which happens repeatedly. This iterative process generates entire new sentences, paragraphs, and even pages.
GPT-2 is an autoregressive, decoder-only model. Autoregressive models incorporate their previous outputs as inputs for future predictions.
How does it predict? The architecture will be explained step by step in the next sections.
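To make the "predict, append, repeat" loop concrete, here is a minimal sketch of autoregressive (greedy) decoding. The model and token_ids names are illustrative placeholders, not the exact code from the book:

import torch

def generate(model, token_ids, max_new_tokens):
    # token_ids: tensor of shape (1, seq_len) holding the prompt's token IDs
    for _ in range(max_new_tokens):
        logits = model(token_ids)                                 # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy: most likely next token
        token_ids = torch.cat([token_ids, next_id], dim=1)        # feed the prediction back in
    return token_ids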
Building text tokenisation
Together with the embedding layer that follows, this stage maps discrete objects (here, text) to points in a continuous vector space. Tokenisation itself simply converts every piece of text into IDs, which are integer numbers. The tiktoken library ships the byte-pair-encoding vocabulary used by GPT-2, and I used it to create the tokenising layer. A total of 50,257 tokens are used in GPT-2.
You can learn more about how byte-pair encoding is used to generate token IDs here: https://www.geeksforgeeks.org/nlp/byte-pair-encoding-bpe-in-nlp/
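As a quick illustration, here is a minimal sketch using the tiktoken library (the sample sentence is just an example):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)                        # 50257 tokens in the GPT-2 vocabulary
ids = enc.encode("I am building an LLM from scratch.")
print(ids)                                # a list of integer token IDs
print(enc.decode(ids))                    # converts the IDs back into text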
Building text embeddings
An embedding layer is created that converts the token IDs into embeddings, each of 768 dimensions. This involves two steps.
Step 1: A torch embedding layer is created with the input dimension equal to vocabulary size and output size of 768 (for GPT-2 small). This neural network layer is trained during pre-training (via backpropagation).
Step 2: A positional embedding is calculated and added to the token embedding to get the final vector embedding. The positional embedding has the same dimension as the token embedding. This is calculated by taking values [0,1,2,3…] and embedding them using a torch embedding layer.
import torch

vocab_size, output_dim, context_len = 50257, 768, 1024   # GPT-2 small settings
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
positional_embedding_layer = torch.nn.Embedding(context_len, output_dim)

# token_ids: tensor of IDs from the tokeniser, shape (context_len,)
token_embedding = token_embedding_layer(token_ids)
positional_embedding = positional_embedding_layer(torch.arange(context_len))
# Calculating the final vector embedding
input_embedding = token_embedding + positional_embedding
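Here token_embedding, positional_embedding, and input_embedding all have shape (1024, 768); for a batch of sequences, the positional embedding is simply broadcast across the batch dimension.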
Building the attention mechanism
The attention mechanism is very important in LLMs. It allows each position in the input to consider the relevance of all other positions in a sequence.
You can learn more about attention mechanism basics here: https://www.ibm.com/think/topics/attention-mechanism.
In this implementation, a multi-head attention mechanism is coded. Multiple instances of self-attention are created, each with its own set of weights. The outputs are then combined. Multi-head attention is computationally expensive but very important for recognizing complex patterns.
Each attention head computes what is known as scaled dot-product attention: the query-key dot products are scaled by the square root of the head dimension before the softmax is applied.
Multi-head attention mechanism implementation
You can read more about multi-head attention here: https://www.geeksforgeeks.org/nlp/multi-head-attention-mechanism/.
In GPT-2 small, there are 12 attention heads, so each head's output vector has a dimension of output_dim (768) / num_heads (12) = 64. The three weight matrices, W_query, W_key, and W_value, are trained later via backpropagation.
Queries, keys, and values are obtained by multiplying the input embeddings (which already include positional embeddings) by W_query, W_key, and W_value, and the results are split across the heads. Each head then computes the scaled dot product between its queries and keys, a causal mask is applied to these scores, and the softmax turns them into attention weights. Finally, the outputs of all heads are concatenated.
From the above diagram, you can see that attention scores for future positions are masked, so the model never peeks ahead and learns to predict each token from only the tokens that come before it.
The softmax function below converts the masked attention scores into attention weights:
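In standard notation, for a vector of scores x:

\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}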
And this is how each head computes its attention output from its queries, keys, and values:
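Written out in the standard scaled dot-product form (with d_k the per-head dimension, 64 here):

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V

The causal mask sets the scores for future positions to negative infinity before the softmax, so their attention weights become zero.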
Finally, the outputs of all heads are concatenated and passed through a linear output projection to obtain the final context vectors.
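Putting the pieces together, here is a condensed sketch of a causal multi-head attention module with GPT-2 small's hyperparameters. The class and variable names are illustrative, not the book's exact code:

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=768, num_heads=12, context_len=1024):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, d_model // num_heads
        self.W_query = nn.Linear(d_model, d_model, bias=False)
        self.W_key = nn.Linear(d_model, d_model, bias=False)
        self.W_value = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model)
        # Causal mask: marks "future" positions so they can be hidden
        mask = torch.triu(torch.ones(context_len, context_len), diagonal=1).bool()
        self.register_buffer("mask", mask)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        # Project the inputs, then split the last dimension into (num_heads, head_dim)
        q = self.W_query(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_key(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_value(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5        # scaled dot product
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))  # hide future tokens
        weights = torch.softmax(scores, dim=-1)                        # attention weights
        context = (weights @ v).transpose(1, 2).reshape(b, t, d)       # concatenate heads
        return self.out_proj(context)

Calling this module on an input of shape (batch, 1024, 768) returns a tensor of the same shape, one context vector per position.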
We have now created the multi-head attention mechanism. This is a core part of the GPT model.
In the next article, we will implement the transformer block and output layers before attempting to pre-train the model.
Here’s a summary of what we’ve done so far and what’s upcoming:
See you in the next article.