<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AI and fintech notes]]></title><description><![CDATA[Sharing my thoughts on AI and fintech topics. ]]></description><link>https://milenkraev.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!A-zC!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c56ee2-9c47-43db-b896-e49e18ba9085_144x144.png</url><title>AI and fintech notes</title><link>https://milenkraev.substack.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 07 May 2026 19:24:41 GMT</lastBuildDate><atom:link href="https://milenkraev.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Milen Kraev]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[milenkraev@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[milenkraev@substack.com]]></itunes:email><itunes:name><![CDATA[Milen Kraev]]></itunes:name></itunes:owner><itunes:author><![CDATA[Milen Kraev]]></itunes:author><googleplay:owner><![CDATA[milenkraev@substack.com]]></googleplay:owner><googleplay:email><![CDATA[milenkraev@substack.com]]></googleplay:email><googleplay:author><![CDATA[Milen Kraev]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Understanding Python List Comprehensions]]></title><description><![CDATA[A Simple Guide]]></description><link>https://milenkraev.substack.com/p/understanding-python-list-comprehensions</link><guid isPermaLink="false">https://milenkraev.substack.com/p/understanding-python-list-comprehensions</guid><dc:creator><![CDATA[Milen Kraev]]></dc:creator><pubDate>Wed, 16 Oct 2024 12:15:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!A-zC!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c56ee2-9c47-43db-b896-e49e18ba9085_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>List comprehensions are one of Python's most elegant features, allowing you to create lists in a concise and readable way. Rather than using multiple lines of code with loops and conditionals to build a list, a list comprehension lets you do the same thing in just one line. This can make your code not only shorter but also more efficient.</p><p>At its core, a list comprehension is a way to generate a new list by applying an expression to each item in an existing list, range, or any other iterable. The basic idea is to take a structure you&#8217;re looping over and apply some transformation or filtering criteria to create a new list. The resulting syntax is simple and clean, often replacing what might take several lines of code with a single, readable statement.</p><p>Imagine you want to create a list of squared numbers for a given range, say from 1 to 5. 
Using a traditional <code>for</code> loop, you would write something like this:</p><p><code>squares = []</code></p><p><code>for x in range(1, 6):</code></p><p><code>  squares.append(x ** 2)</code></p><p>This loop goes through each number in the range, squares it, and appends the result to the list.</p><p>With list comprehensions, this process is reduced to a single line:</p><p><code>squares = [x ** 2 for x in range(1, 6)]</code></p><p>In this case, <code>x ** 2</code> is the expression being applied to each item, and <code>for x in range(1, 6)</code> is the loop that iterates over the numbers from 1 to 5. The list comprehension simplifies the process while achieving the same result, giving you a list of squared numbers: <code>[1, 4, 9, 16, 25]</code>.</p><p>This feature shines especially when you need to apply a quick transformation or filter some data. For example, if you only want even numbers from a sequence, you can add a simple condition to the list comprehension. Instead of writing extra lines to check each element and decide whether to include it, you simply add an <code>if</code> clause at the end. The result is a filtered and transformed list with minimal code.</p><p>Let&#8217;s look at an example to demonstrate this concept.</p><p>Suppose you want to keep only the even numbers from a range of numbers, say from 1 to 10. With a traditional <code>for</code> loop, you&#8217;d need to check each number and decide whether to append it to the list:</p><p><code>even_numbers = []</code></p><p><code>for x in range(1, 11):</code></p><p><code>  if x % 2 == 0:</code></p><p><code>    even_numbers.append(x)</code></p><p>Here, you&#8217;re using an <code>if</code> statement inside the loop to filter out the odd numbers.</p><p>With a list comprehension, this filtering process is much simpler. You can add the condition directly at the end of the comprehension:</p><p><code>even_numbers = [x for x in range(1, 11) if x % 2 == 0]</code></p><p>In this version, <code>x for x in range(1, 11)</code> iterates over the numbers from 1 to 10, and the condition <code>if x % 2 == 0</code> keeps only the even numbers. The result is a concise, single-line expression that gives you a list of even numbers: <code>[2, 4, 6, 8, 10]</code>.</p><p>This makes your code both more compact and easier to read.</p><p>Though list comprehensions can be very powerful, it&#8217;s important to balance efficiency with readability. The simplicity and speed they offer can make them tempting to use for everything. But if your transformation logic becomes too complex, it may be better to use a traditional loop. Sometimes, fitting everything into one line can make the code harder to understand, especially for someone reading it for the first time.</p><p>In general, list comprehensions are ideal for tasks that involve basic transformations or filtering. They let you focus on <em>what</em> you want the result to be rather than <em>how</em> to achieve it, which is a key part of writing Pythonic code.
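</p><p>The two ideas also combine naturally: a transformation and an <code>if</code> filter can live in the same comprehension. As a quick sketch (the variable name is invented for the example), here are the squares of just the even numbers:</p><p><code>even_squares = [x ** 2 for x in range(1, 11) if x % 2 == 0]</code></p><p><code># Result: [4, 16, 36, 64, 100]</code></p><p>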
However, for more complicated logic or operations involving multiple steps, sticking to a more explicit approach using loops can make your code clearer.</p>]]></content:encoded></item><item><title><![CDATA[AutoTokenizer]]></title><description><![CDATA[Hugging Face and what they are doing]]></description><link>https://milenkraev.substack.com/p/autotokenizer</link><guid isPermaLink="false">https://milenkraev.substack.com/p/autotokenizer</guid><dc:creator><![CDATA[Milen Kraev]]></dc:creator><pubDate>Tue, 27 Aug 2024 13:52:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!A-zC!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c56ee2-9c47-43db-b896-e49e18ba9085_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hugging Face is a leading company in artificial intelligence (AI) and natural language processing (NLP) that has become synonymous with open-source tools and community-driven AI development. Founded in 2016, Hugging Face began as a chatbot app aimed at casual conversation but quickly shifted focus to become a central hub for AI research, particularly in language models.</p><p>At the heart of Hugging Face&#8217;s offerings is the Transformers library, an open-source project that has dramatically transformed the use of deep learning models in NLP. This library provides pre-trained models that can be easily fine-tuned for a variety of tasks, such as text classification, translation, and question answering. By doing so, Hugging Face has made cutting-edge NLP technologies more accessible, allowing developers and researchers to utilize powerful models without needing extensive computational resources or deep machine learning expertise.</p><p>A key feature of Hugging Face&#8217;s platform is the Model Hub, a repository where users can share, discover, and use models created by the community. This collaborative approach has nurtured a vibrant ecosystem where researchers and developers contribute models trained on diverse datasets, covering a wide range of languages and applications. Hugging Face has also expanded its support beyond NLP to include models for tasks in computer vision and audio processing.</p><p>Hugging Face has developed various tools and frameworks that simplify the deployment and scaling of machine learning models in production environments. The Hugging Face Hub, for instance, offers a cloud-based platform for hosting models, making it easy for users to integrate AI into their applications through APIs.
The company also provides services like the Inference API, which allows developers to run models at scale without worrying about infrastructure management.</p><p>Hugging Face's influence extends deeply into the research community, where their tools have been instrumental in pushing the boundaries of AI. The company regularly collaborates with academic institutions and research labs, and their models frequently set benchmarks in NLP tasks. Hugging Face is also committed to promoting ethical AI, encouraging conversations about biases in models and the responsible use of AI technologies.</p><p>Hugging Face is a trailblazer in AI and NLP, building a rich ecosystem of open-source tools, community-driven model sharing, and accessible AI services. Their contributions have significantly lowered the barriers to entry in AI, empowering developers, researchers, and organizations to harness the potential of machine learning in innovative and impactful ways.</p><p>The <code>AutoTokenizer</code> is an integral component of the Hugging Face Transformers library, designed to simplify the process of preparing text data for use with different transformer models. Tokenization, the process of converting text into a format that a machine learning model can understand, is crucial in natural language processing (NLP). Each transformer model, whether it be BERT, GPT, or others, has its own specific way of tokenizing text. The <code>AutoTokenizer</code> class streamlines this process by automatically selecting the correct tokenizer based on the model you are using.</p><p>When working with transformer models, the text must first be broken down into smaller units called tokens, often words or subwords, which are then converted into the numerical IDs that the model can process. Tokenizers also handle various other tasks such as adding special tokens required by the models (like <code>[CLS]</code> and <code>[SEP]</code> for BERT), padding sequences to the same length, and managing the attention masks that tell the model which parts of the input it should focus on. The <code>AutoTokenizer</code> automates all of these tasks, allowing users to seamlessly transition between different models without needing to worry about the specifics of each tokenizer.</p><p>To use the <code>AutoTokenizer</code>, one simply needs to specify the name of the model they intend to use, and the class will automatically load the appropriate tokenizer. This is particularly useful because it abstracts away the complexities involved in tokenization, making it easier for developers and researchers to experiment with different models. For example, if you switch from BERT to GPT-2, the <code>AutoTokenizer</code> will automatically adjust to use the correct tokenization method for GPT-2, including the distinct special tokens that GPT-2 requires.</p><p>Additionally, the <code>AutoTokenizer</code> is designed to be highly flexible and customizable. While it typically selects the default tokenizer associated with a given model, users can fine-tune the tokenizer settings according to their specific needs. This might include adjusting the vocabulary, changing the tokenization strategy (such as switching from word-level to subword-level tokenization), or modifying the special tokens used.
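</p><p>For instance, here is a minimal sketch of loading a tokenizer by model name and registering an extra special token (the model name and the <code>[ACCOUNT]</code> token are illustrative choices for this example, not anything the library prescribes):</p><p><code>from transformers import AutoTokenizer</code></p><p><code>tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")</code></p><p><code># Register a custom special token; a paired model would need its</code></p><p><code># embedding matrix resized to match the enlarged vocabulary.</code></p><p><code>tokenizer.add_special_tokens({"additional_special_tokens": ["[ACCOUNT]"]})</code></p><p><code>print(tokenizer.tokenize("Balance check for [ACCOUNT] today"))</code></p><p>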
Despite this flexibility, the <code>AutoTokenizer</code> maintains a user-friendly interface, making these adjustments straightforward.</p><p>In summary, the <code>AutoTokenizer</code> is a powerful tool within the Hugging Face Transformers library that simplifies the tokenization process for a wide range of transformer models. By automatically selecting the appropriate tokenizer and managing the intricacies of tokenization, it allows users to focus on developing and fine-tuning models without getting bogged down in the technical details of how each model processes text. This automation and ease of use make the <code>AutoTokenizer</code> an essential component for anyone working with transformer models in NLP.</p>]]></content:encoded></item><item><title><![CDATA[How your model is fed]]></title><description><![CDATA[Tokenization]]></description><link>https://milenkraev.substack.com/p/how-your-model-is-fed</link><guid isPermaLink="false">https://milenkraev.substack.com/p/how-your-model-is-fed</guid><dc:creator><![CDATA[Milen Kraev]]></dc:creator><pubDate>Fri, 23 Aug 2024 12:11:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!A-zC!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c56ee2-9c47-43db-b896-e49e18ba9085_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A tokenizer in natural language processing (NLP) plays a key role in preparing raw text for machine learning models. It starts by taking the input text, which can be anything from a single sentence to an entire document, and breaking it down into smaller units called tokens. Depending on the tokenizer's design and the model it's paired with, these tokens might be words, subwords, or even individual characters.</p><p>After tokenizing the text, the tokenizer converts each token into a numerical identifier, known as a token ID, using a predefined vocabulary. This vocabulary is essentially a list of all the tokens the model has been trained to recognize. By mapping tokens to their corresponding IDs, the tokenizer transforms the text into a sequence of numbers that the machine learning model can process.</p>
<p>Beyond simple tokenization, the tokenizer also handles a range of preprocessing tasks. It might add special tokens to mark the start or end of a sentence, indicate the separation between different parts of the input, or provide clues for sentence segmentation. The tokenizer can also pad sequences to ensure they all have the same length, which is necessary for batch processing in models that require fixed-size inputs.</p><p>If the text is too long, the tokenizer can truncate it to fit within a specified maximum length, ensuring it works smoothly with the model's architecture. In tasks that involve text pairs, like question-answering, the tokenizer encodes the pairs in a way that lets the model understand their relationship.</p><p>Once the tokenizer has done its work, it outputs a dictionary of tensors&#8212;arrays of numbers that represent the token IDs and other necessary information, such as attention masks. This output is then fed into a machine learning model for tasks like text classification, translation, or summarization. In essence, the tokenizer transforms human language into a format that models can understand and work with effectively.</p><p>The <code>AutoTokenizer</code> in NLP is a versatile tool that simplifies the process of preparing text data for machine learning models. It automatically selects the appropriate tokenizer class based on the model you're using, making it easier to work with different models without needing to manually configure the tokenizer.</p><p>When you load an <code>AutoTokenizer</code> with a specific model name, it initializes the tokenizer that is best suited for that model. For instance, if you load the tokenizer for a BERT model, it will choose the <code>BertTokenizer</code>, and if you're using a T5 model, it will pick the <code>T5Tokenizer</code>. This flexibility allows you to switch between models seamlessly while ensuring that the text processing is always optimized for the particular architecture of the model you're using.</p><p>The <code>AutoTokenizer</code> handles all the usual tasks of a tokenizer, such as splitting text into tokens, converting those tokens into numerical IDs based on the model&#8217;s vocabulary, and adding any special tokens required by the model. It also manages tasks like padding sequences to a uniform length, truncating overly long inputs, and handling paired inputs for tasks like question-answering or text similarity.</p><p>One of the biggest advantages of using <code>AutoTokenizer</code> is its ability to abstract away the complexities of different tokenizer implementations. You don't need to worry about the specific details of each model's tokenizer; the <code>AutoTokenizer</code> automatically configures itself to work with the model you've chosen, ensuring that your input text is processed correctly.</p><p>Once the text has been tokenized and processed by the <code>AutoTokenizer</code>, it outputs a set of tensors&#8212;numeric representations of the text that the machine learning model can then use for various tasks, such as classification, translation, or text generation.
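</p><p>To make that concrete, here is a small sketch of the returned structure (the model name is an arbitrary example, the exact IDs depend on its vocabulary, and <code>return_tensors="tf"</code> assumes TensorFlow is installed):</p><p><code>from transformers import AutoTokenizer</code></p><p><code>tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")</code></p><p><code>batch = tokenizer(["A short sentence.", "A slightly longer example sentence."], padding=True, return_tensors="tf")</code></p><p><code># batch behaves like a dict holding "input_ids" and "attention_mask"</code></p><p><code># (and "token_type_ids" for BERT), padded to one common length.</code></p><p><code>print(batch["input_ids"].shape)</code></p><p>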
This automation and flexibility make the <code>AutoTokenizer</code> an essential tool for efficiently preparing text data in NLP workflows.</p><p>There are several alternatives to <code>AutoTokenizer</code> that you can consider depending on your specific needs in natural language processing (NLP).</p><p>One alternative is to use model-specific tokenizers, such as <code>BertTokenizer</code>, <code>GPT2Tokenizer</code>, or <code>T5Tokenizer</code>. These tokenizers are tailored to the architecture of their respective models. Instead of relying on <code>AutoTokenizer</code>, you can directly choose the tokenizer associated with the model you're using. This approach provides more control and allows access to features unique to each model&#8217;s tokenizer.</p><p>Another option is manual tokenization using general-purpose NLP libraries. For example, SpaCy is a powerful NLP library that offers comprehensive tokenization along with other text processing capabilities like part-of-speech tagging, named entity recognition, and dependency parsing. SpaCy&#8217;s tokenization is highly customizable and can be a good choice if you need more fine-grained control over how text is processed.</p><p>You could also consider the <code>nltk</code> library, which is another well-established tool in the NLP community. <code>nltk</code> provides basic tokenization methods, such as word and sentence tokenizers, and can be useful for simpler tokenization tasks or when working within traditional NLP pipelines.</p><p>Another option is the <code>SentencePiece</code> tokenizer, which is often used for training tokenizers on custom datasets. SentencePiece underlies the tokenizers of models such as T5 and ALBERT, and it&#8217;s particularly useful when you need to create subword tokenization from scratch for a new language or domain-specific vocabulary.</p><p>Finally, the <code>Tokenizer</code> class from the <code>Keras</code> library can be used for tokenizing text when working in TensorFlow-based environments. This tokenizer is simple and effective for many standard machine learning tasks, especially when training neural networks for text classification or sequence modeling.</p>]]></content:encoded></item><item><title><![CDATA[Train your first mini model
]]></title><description><![CDATA[Working example.]]></description><link>https://milenkraev.substack.com/p/train-your-first-mini-model</link><guid isPermaLink="false">https://milenkraev.substack.com/p/train-your-first-mini-model</guid><dc:creator><![CDATA[Milen Kraev]]></dc:creator><pubDate>Tue, 20 Aug 2024 12:34:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!A-zC!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c56ee2-9c47-43db-b896-e49e18ba9085_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the upcoming posts, I&#8217;ll be diving into several concepts related to machine learning. At first, some of these ideas might seem disconnected, but don't worry&#8212;I&#8217;ve got you covered. To set the stage, I'm sharing a simple model example here. I&#8217;ll walk you through the different approaches used in this example step by step. If it feels a bit overwhelming at first, don&#8217;t lose heart&#8212;every line will be explained in due course.</p><p><strong>import tensorflow as tf</strong><br><strong>from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer</strong><br><strong>from sklearn.model_selection import train_test_split</strong><br><strong>import pandas as pd</strong></p><p>This code snippet sets up the necessary libraries for a machine learning project focused on natural language processing (NLP). It begins by importing TensorFlow, a powerful framework for building and training deep learning models. Next, it brings in components from the transformers library, specifically tools for working with sequence-to-sequence models and tokenizing text. Additionally, the train_test_split function from Scikit-learn is imported, which is essential for splitting data into training and testing sets. Finally, the pandas library is imported, providing tools for data manipulation and analysis, particularly when working with structured data in tables. Together, these imports prepare the environment for developing, training, and evaluating a machine learning model in the context of NLP tasks.</p><p><strong>training_data = pd.read_csv(r'yourpath.yourfile.csv')</strong></p><p>This line of code reads a CSV file containing training data into a pandas DataFrame. The pd.read_csv() function is used to load the data from the specified file path (r'yourpath.yourfile.csv'). The raw string notation (r'') ensures that any backslashes in the file path are treated correctly.
Once loaded, the data is stored in the training_data variable, which can then be used for further processing, analysis, or model training.</p><p><strong>X = training_data.drop(columns=['Work notes (Internal)']).applymap(str)</strong></p><p><strong>X_combined = X.apply(lambda row: ' '.join(row), axis=1)</strong></p><p><strong>y = training_data['Work notes (Internal)'].tolist()</strong></p><p><strong>train_texts, val_texts, train_summaries, val_summaries = train_test_split(</strong><br><strong>    X_combined.tolist(), y, test_size=0.2, random_state=42)</strong></p><p>This code processes the training data by first removing the target column, 'Work notes (Internal)', so the inputs never contain the text the model is supposed to produce, and converting the remaining data to strings. The next step combines all the remaining columns in each row into a single string, creating a text representation for each data entry. The 'Work notes (Internal)' column is extracted separately as the target variable. Finally, the code splits the combined text data and the corresponding target values into training and validation sets using an 80-20 split. This prepares the data for model training by separating it into inputs (the combined text) and outputs (the 'Work notes (Internal)' summaries) while also reserving a portion of the data for validation.</p><p><strong>model_name = "t5-small"</strong><br><strong>tokenizer = AutoTokenizer.from_pretrained(model_name)</strong><br><strong>model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name)</strong></p><p>This code initializes a sequence-to-sequence model using the "t5-small" architecture. First, the model name "t5-small" is defined. Then, the AutoTokenizer is loaded with the pre-trained tokenizer specific to this model, which handles converting text into tokens that the model can understand. Finally, the TFAutoModelForSeq2SeqLM is loaded with the pre-trained "t5-small" model, which is ready to be fine-tuned or used for tasks like text summarization, translation, or other sequence-to-sequence tasks.</p><p><strong>def tokenize_data(texts, summaries):</strong><br><strong>    inputs = tokenizer(texts, max_length=512, truncation=True, padding='max_length', return_tensors="tf")</strong><br><strong>    targets = tokenizer(summaries, max_length=150, truncation=True, padding='max_length', return_tensors="tf")</strong><br><strong>    return inputs, targets</strong></p><p><strong>train_inputs, train_targets = tokenize_data(train_texts, train_summaries)</strong><br><strong>val_inputs, val_targets = tokenize_data(val_texts, val_summaries)</strong></p><p>This code defines a function called tokenize_data that tokenizes both the input texts and summaries using the pre-trained tokenizer. The function processes the texts by truncating them to a maximum length (512 for inputs and 150 for summaries) and padding them to ensure consistent input sizes. The tokenized outputs are returned as TensorFlow tensors. After defining the function, the code tokenizes the training and validation data by passing the respective texts and summaries through the tokenize_data function, resulting in tokenized inputs and targets for both the training and validation sets.</p><p><strong>train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_inputs), train_targets['input_ids']))</strong><br><strong>val_dataset = tf.data.Dataset.from_tensor_slices((dict(val_inputs), val_targets['input_ids']))</strong></p><p>This code snippet converts the tokenized data into TensorFlow Datasets.
Specifically, it uses tf.data.Dataset.from_tensor_slices to create datasets from the tokenized input and target data. For train_dataset, it takes the tokenized input data (train_inputs) and the target input IDs from train_targets and combines them into a dataset. Similarly, val_dataset is created from the tokenized validation inputs (val_inputs) and the target input IDs from val_targets. The result is two TensorFlow Datasets, one for training and one for validation, which can be used for model training and evaluation.</p><p><strong>batch_size = 8</strong><br><strong>train_dataset = train_dataset.shuffle(len(train_texts)).batch(batch_size)</strong><br><strong>val_dataset = val_dataset.batch(batch_size)</strong></p><p>This code snippet prepares the TensorFlow Datasets for training and validation by shuffling and batching the data. For train_dataset, the data is first shuffled using the shuffle method, which randomizes the order of the samples to improve training performance. The shuffle buffer size is set to the length of the training texts to ensure thorough mixing. Then, the data is batched with a batch_size of 8, meaning each batch will contain 8 samples. The val_dataset is batched with the same batch size of 8, but it is not shuffled, as shuffling is typically not necessary for validation data. This setup is essential for efficient training and evaluation of the model.</p><p><strong>model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),</strong><br><strong>              loss=model.compute_loss)  # Use the model's built-in loss function</strong></p><p><strong>model.fit(train_dataset, validation_data=val_dataset, epochs=3)</strong></p><p><strong>model.save_pretrained("fine_tuned_t5")</strong></p><p>This code snippet configures and trains the model, then saves the fine-tuned version. The model.compile function sets up the training process with the Adam optimizer and a learning rate of 3&#215;10<sup>&#8722;5</sup>. It uses the model's built-in loss function for training. The model.fit function trains the model using the prepared train_dataset, validates it with val_dataset, and runs for 3 epochs. Finally, model.save_pretrained("fine_tuned_t5") saves the trained model to the specified directory, "fine_tuned_t5", for later use or deployment.</p><p><strong>new_incident_data = pd.read_csv(r'yourpath.yourfile.csv')</strong><br><strong>new_incident_text = ' '.join(new_incident_data.applymap(str).iloc[0])</strong></p><p>This code snippet reads a new CSV file containing incident data into a pandas DataFrame. The pd.read_csv() function loads the data from the specified path (r'yourpath.yourfile.csv'). Then, it combines all the columns in the first row of the DataFrame into a single string. The applymap(str) converts all values to strings, and iloc[0] selects the first row.
The join method concatenates these string values into one long text string, representing the combined data of the first incident.</p><p><strong>input_text = "summarize: " + new_incident_text</strong><br><strong>input_ids = tokenizer.encode(input_text, return_tensors="tf", max_length=512, truncation=True)</strong></p><p><strong>summary_ids = model.generate(input_ids, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)</strong><br><strong>summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)</strong></p><p><strong>print("Original Text:", new_incident_text)</strong><br><strong>print("Generated Summary:", summary)</strong></p><p>This code snippet generates a summary for a new incident description using the fine-tuned model. It starts by prepending the text "summarize: " to the new incident data, creating a prompt for the model. The tokenizer.encode function converts this input text into token IDs, ensuring that it fits within the maximum length of 512 tokens.</p><p>The model.generate function then produces a summary of the input text. It specifies parameters such as a maximum summary length of 150 tokens, a minimum length of 40 tokens, and a length penalty of 2.0 (which, in beam search, actually favors longer candidate sequences). It uses beam search with 4 beams to improve the quality of the generated summary. The early_stopping=True parameter ends beam search as soon as enough complete candidate sequences have been found.</p><p>Finally, tokenizer.decode converts the generated summary token IDs back into a readable string, and both the original text and the generated summary are printed.</p><p>This example demonstrates a model trained to summarize incident reports from ServiceNow in CSV format. The core idea is to provide a summary of an incident. While it's a simple task, it's a solid foundation for understanding the overall process. Feel free to experiment with the code and run it yourself. Try to grasp what each line does, as these small steps will eventually lead you to train your own models one day.</p>]]></content:encoded></item><item><title><![CDATA[Clone yourself with an AI
]]></title><description><![CDATA[How to start from zero.]]></description><link>https://milenkraev.substack.com/p/clone-yourself-with-an-ai</link><guid isPermaLink="false">https://milenkraev.substack.com/p/clone-yourself-with-an-ai</guid><dc:creator><![CDATA[Milen Kraev]]></dc:creator><pubDate>Fri, 09 Aug 2024 14:18:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!A-zC!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c56ee2-9c47-43db-b896-e49e18ba9085_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The rapid advancement of AI technology is raising safety concerns, and everyone is asking, "Where is the limit?" But let's stay positive. Isn&#8217;t it great to have a virtual twin that handles tedious tasks for you? The ultimate goal of corporations is to automate everything and maximize their profits. This should also apply to individuals. Imagine if some AI models could learn exactly what you know and work on your behalf while you still earn the money. You could focus on your health, schedule vacations, and tailor your activities to your needs.</p><p>I believe the future of AI belongs not just to big tech companies or a select few but to everyone. We can start taking small steps towards this inclusive future right now.</p><p>So, what are the basics you can start learning to build your own AI model? Even if you have very limited programming experience, it's a great way to gain some skills. Try to make it a hobby: read articles like this one, and then use ChatGPT to ask about any words or concepts you don&#8217;t understand.</p><p>Let's get started with the following basics:</p><p>Python</p><p>Python is a versatile and beginner-friendly programming language known for its simple, easy-to-read syntax. It's a great starting point for anyone new to coding because it emphasizes readability and reduces the complexity of writing code. Python can be used for a wide range of applications, from web development to data analysis and artificial intelligence. Its large and supportive community also means you'll have plenty of resources and libraries at your disposal to help you learn and build projects. Whether you&#8217;re interested in creating websites, analyzing data, or automating tasks, Python provides the tools and simplicity to get you started on the right foot.</p><p>Pandas</p><p>Pandas is a powerful and user-friendly library in Python that makes working with data much easier. It&#8217;s especially helpful for beginners who want to manipulate and analyze data without getting bogged down by complex coding.
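</p><p>Here&#8217;s a taste of how little code that takes (the table and the column names are invented purely for illustration):</p><p><code>import pandas as pd</code></p><p><code># A tiny made-up table of transactions</code></p><p><code>df = pd.DataFrame({"merchant": ["cafe", "grocer", "cafe"], "amount": [3.50, 42.00, 4.25]})</code></p><p><code># Total spent per merchant, computed in one line</code></p><p><code>print(df.groupby("merchant")["amount"].sum())</code></p><p>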
With Pandas, you can quickly organize your data into tables, perform calculations, and generate reports with just a few lines of code. It&#8217;s perfect for tasks like cleaning up messy data, combining datasets, and visualizing information. Thanks to its intuitive design and extensive documentation, Pandas allows you to focus on extracting insights from your data rather than struggling with technical details.</p><p>Numpy</p><p>NumPy is a fundamental library in Python for scientific computing, and it's a fantastic tool for beginners interested in working with numerical data. It provides powerful data structures, like arrays, that let you perform complex mathematical operations efficiently. Unlike regular lists in Python, NumPy arrays can handle large amounts of data quickly and let operations like addition and multiplication run across an entire array in a single line of code (see the short sketch at the end of this post). It also offers a range of functions for tasks such as linear algebra, statistical analysis, and random number generation. With its straightforward syntax and broad functionality, NumPy is an essential tool for anyone diving into data science or numerical analysis.</p><p>TensorFlow</p><p>TensorFlow is a popular library developed by Google for building and training machine learning models. It&#8217;s especially useful for beginners who want to dive into artificial intelligence and deep learning. TensorFlow simplifies complex processes by providing high-level tools and flexible frameworks that let you create neural networks and perform tasks like image recognition or language translation. Its user-friendly design helps you experiment with and optimize models while handling the heavy lifting of computations. Whether you're just starting out or looking to build sophisticated AI systems, TensorFlow offers the resources and support you need to get started and advance in the world of machine learning.</p><p>These three libraries are a great starting point. Dive into any related material and make sure to ask ChatGPT if you have any questions along the way.</p>
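<p>And, as promised above, here is the kind of one-line, whole-array arithmetic the NumPy section describes (the numbers are invented for illustration):</p><p><code>import numpy as np</code></p><p><code>prices = np.array([10.0, 20.0, 30.0])</code></p><p><code>quantities = np.array([2, 1, 4])</code></p><p><code># Element-wise multiply, then sum, with no explicit loop</code></p><p><code>print((prices * quantities).sum())  # 160.0</code></p>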
]]></content:encoded></item><item><title><![CDATA[Coming soon]]></title><description><![CDATA[This is AI and fintech notes.]]></description><link>https://milenkraev.substack.com/p/coming-soon</link><guid isPermaLink="false">https://milenkraev.substack.com/p/coming-soon</guid><dc:creator><![CDATA[Milen Kraev]]></dc:creator><pubDate>Mon, 10 Apr 2023 17:48:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!A-zC!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c56ee2-9c47-43db-b896-e49e18ba9085_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is AI and fintech notes.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://milenkraev.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://milenkraev.substack.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item></channel></rss>