Apa ide dibalik proyek ini ?
Sebenarnya ide ini muncul ketika saya sedang mengembangkan proyek Machine Learning yang dapat memprediksi harga saham/forex. Buat yang ingin belajar, silahkan menuju tautan dibawah ini.
Ide itu muncul karena saya melihat peluang yang sangat besar bagi siapa saja yang tertarik di dunia pasar keuangan, khususnya para investor atau trader, bisa meraup keuntungan yang besar dengan memanfaatkan machine learning ini, ditambah dengan adanya ‘News Sentiment’. Karena, hubungan antara berita (NEWS) dan pergerakan harga saham atau forex sangatlah erat. mereka memiliki korelasi yang sangat tinggi. Berita-berita ekonomi, politik, keuangan, atau faktor-faktor lain yang memengaruhi kondisi makroekonomi suatu negara atau industri dapat memiliki dampak yang signifikan terhadap harga saham atau nilai tukar mata uang di pasar forex. Disamping itu, tantangan dalam News Trading adalah Volatilitas tinggi, artinya harga dapat bergerak dengan cepat dan tidak terduga setelah rilis berita, dan meningkatkan risiko perdagangan. Juga terjadi ‘Slippage’, kesulitan dalam mendapatkan harga yang diinginkan karena perubahan harga yang cepat setelah rilis berita. Serta reaksi (sentimen) pasar yang (mungkin) berlawanan. Terkadang pasar bereaksi berlawanan dengan ekspektasi yang diharapkan dari berita, menyebabkan perdagangan menjadi sulit diprediksi. Oleh karena itu, saya memiliki ide untuk membuat sebuah Machine Learning untuk memprediksi sentimen berita terhadap harga pasar.
Pada bagian ini, saya akan menjelaskan dengan bantuan bahasa pemrograman Python (Google Colab) agar lebih mudah dipahami.
PROJECT : NEWS SENTIMENT ANALYSIS¶
Start : 12-Oct-2023 \ Target End : End of January 2024
# Import Libraries
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from textblob import TextBlob
# Load Data
file_data = 'https://abdusy.troi-z.com/wp-content/uploads/2023/10/allnews_2007_2017.csv'
data = pd.read_csv(file_data) #, index_col = 'Date', parse_dates=['Date'])
# Preview Data
data.shape
(47182, 10)
As we can see, we have 47182 rows.
data.describe
<bound method NDFrame.describe of Date Time Currency Impact Description \ 0 2007/01/01 00:01 CAD N Bank Holiday [All Day] 1 2007/01/01 00:01 CHF N Bank Holiday [All Day] 2 2007/01/01 00:01 EUR N French Bank Holiday [All Day] 3 2007/01/01 00:01 EUR N German Bank Holiday [All Day] 4 2007/01/01 00:01 EUR N Italian Bank Holiday [All Day] ... ... ... ... ... ... 47177 2017/09/01 10:00 USD H ISM Manufacturing PMI 47178 2017/09/01 10:00 USD L Construction Spending m/m 47179 2017/09/01 10:00 USD L ISM Manufacturing Prices 47180 2017/09/01 10:00 USD L Revised UoM Inflation Expectations 47181 2017/09/01 10:00 USD M Revised UoM Consumer Sentiment Actual Forecast Previous NOT_USED Graph 0 NaN NaN NaN NaN 12245 1 NaN NaN NaN NaN 12036 2 NaN NaN NaN NaN 12214 3 NaN NaN NaN NaN 12186 4 NaN NaN NaN NaN 12217 ... ... ... ... ... ... 47177 58.8 56.5 56.3 NaN 64927 47178 -0.6% 0.5% -1.4% -1.3% 66723 47179 62.0 61.9 62.0 NaN 64926 47180 2.6% NaN 2.6% NaN 64986 47181 96.8 97.4 97.6 NaN 64988 [47182 rows x 10 columns]>
data.head(10)
Date | Time | Currency | Impact | Description | Actual | Forecast | Previous | NOT_USED | Graph | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2007/01/01 | 00:01 | CAD | N | Bank Holiday [All Day] | NaN | NaN | NaN | NaN | 12245 |
1 | 2007/01/01 | 00:01 | CHF | N | Bank Holiday [All Day] | NaN | NaN | NaN | NaN | 12036 |
2 | 2007/01/01 | 00:01 | EUR | N | French Bank Holiday [All Day] | NaN | NaN | NaN | NaN | 12214 |
3 | 2007/01/01 | 00:01 | EUR | N | German Bank Holiday [All Day] | NaN | NaN | NaN | NaN | 12186 |
4 | 2007/01/01 | 00:01 | EUR | N | Italian Bank Holiday [All Day] | NaN | NaN | NaN | NaN | 12217 |
5 | 2007/01/01 | 00:01 | GBP | N | Bank Holiday [All Day] | NaN | NaN | NaN | NaN | 12031 |
6 | 2007/01/01 | 00:01 | JPY | N | Bank Holiday [All Day] | NaN | NaN | NaN | NaN | 12086 |
7 | 2007/01/01 | 00:01 | NZD | N | Bank Holiday [All Day] | NaN | NaN | NaN | NaN | 12092 |
8 | 2007/01/01 | 00:01 | USD | N | Bank Holiday [All Day] | NaN | NaN | NaN | NaN | 12004 |
9 | 2007/01/01 | 18:30 | AUD | L | AIG Manufacturing Index | 52.4 | NaN | 54.4 | NaN | 3232 |
data.tail(10)
Date | Time | Currency | Impact | Description | Actual | Forecast | Previous | NOT_USED | Graph | |
---|---|---|---|---|---|---|---|---|---|---|
47172 | 2017/09/01 | 08:30 | USD | H | Average Hourly Earnings m/m | 0.1% | 0.2% | 0.3% | NaN | 64389 |
47173 | 2017/09/01 | 08:30 | USD | H | Non-Farm Employment Change | 156K | 180K | 189K | 209K | 64397 |
47174 | 2017/09/01 | 08:30 | USD | H | Unemployment Rate | 4.4% | 4.3% | 4.3% | NaN | 64393 |
47175 | 2017/09/01 | 09:30 | CAD | L | Manufacturing PMI | 54.6 | NaN | 55.5 | NaN | 65218 |
47176 | 2017/09/01 | 09:45 | USD | L | Final Manufacturing PMI | 52.8 | 52.5 | 52.5 | NaN | 64779 |
47177 | 2017/09/01 | 10:00 | USD | H | ISM Manufacturing PMI | 58.8 | 56.5 | 56.3 | NaN | 64927 |
47178 | 2017/09/01 | 10:00 | USD | L | Construction Spending m/m | -0.6% | 0.5% | -1.4% | -1.3% | 66723 |
47179 | 2017/09/01 | 10:00 | USD | L | ISM Manufacturing Prices | 62.0 | 61.9 | 62.0 | NaN | 64926 |
47180 | 2017/09/01 | 10:00 | USD | L | Revised UoM Inflation Expectations | 2.6% | NaN | 2.6% | NaN | 64986 |
47181 | 2017/09/01 | 10:00 | USD | M | Revised UoM Consumer Sentiment | 96.8 | 97.4 | 97.6 | NaN | 64988 |
Data is collected start from 01-January-2007 till 01-Sept-2017. There are NaNs and unused columns. Please Note that we need to clean the dataset.
EXPLORING & CLEANING DATA¶
data['Description'].value_counts()
Trade Balance 1132 Bank Holiday [All Day] 944 Unemployment Rate 941 Retail Sales m/m 690 Natural Gas Storage 557 ... Doha Oil Summit [All Day] 1 CB Leading Index m/m [Dec Data] 1 CB Leading Index m/m [Feb Data] 1 CB Leading Index m/m [Jan Data] 1 RBA Assist Gov Ellis Speaks 1 Name: Description, Length: 586, dtype: int64
# Count the Impact Distribution
cnt_impact = data['Impact'].value_counts()
print(cnt_impact)
M 17170 L 16956 H 11674 N 1382 Name: Impact, dtype: int64
# Create a bar plot
cnt_impact.plot(kind='bar', color=['orange', 'green', 'red', 'grey'])
plt.title('Impact Distribution')
plt.xlabel('Impact')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()
# Show the HIGH Impact ONLY
High_Impact = data[data['Impact']=='H'][['Date', 'Time', 'Description', 'Actual', 'Forecast']]
print(High_Impact)
Date Time Description Actual Forecast 26 2007/01/03 11:00 ISM Manufacturing PMI 51.4 51.0 29 2007/01/03 15:00 FOMC Meeting Minutes NaN NaN 30 2007/01/03 17:45 Trade Balance -785M -976M 32 2007/01/04 02:45 CPI m/m 0.0% 0.1% 48 2007/01/04 11:00 ISM Non-Manufacturing PMI 57.1 58.0 ... ... ... ... ... ... 47171 2017/09/01 04:30 Manufacturing PMI 56.9 55.0 47172 2017/09/01 08:30 Average Hourly Earnings m/m 0.1% 0.2% 47173 2017/09/01 08:30 Non-Farm Employment Change 156K 180K 47174 2017/09/01 08:30 Unemployment Rate 4.4% 4.3% 47177 2017/09/01 10:00 ISM Manufacturing PMI 58.8 56.5 [11674 rows x 5 columns]
# Show the MEDIUM Impact ONLY
Medium_Impact = data[data['Impact']=='M'][['Date', 'Time', 'Description', 'Actual', 'Forecast', 'Previous']]
print(Medium_Impact)
Date Time Description Actual Forecast \ 11 2007/01/01 22:30 Caixin Final Manufacturing PMI 52.4 NaN 14 2007/01/02 01:30 Commodity Prices y/y 11.0% 13.9% 17 2007/01/02 05:00 Final Manufacturing PMI 56.5 56.5 18 2007/01/02 05:30 Manufacturing PMI 51.9 53.0 21 2007/01/03 04:30 Manufacturing PMI 65.0 66.0 ... ... ... ... ... ... 47153 2017/08/31 09:45 Chicago PMI 58.9 58.7 47154 2017/08/31 10:00 Pending Home Sales m/m -0.8% 0.4% 47160 2017/08/31 21:45 Caixin Manufacturing PMI 51.6 50.9 47165 2017/09/01 03:15 Spanish Manufacturing PMI 52.4 54.4 47181 2017/09/01 10:00 Revised UoM Consumer Sentiment 96.8 97.4 Previous 11 53.0 14 13.5% 17 56.6 18 52.5 21 67.0 ... ... 47153 58.9 47154 1.3% 47160 51.1 47165 54.0 47181 97.6 [17170 rows x 6 columns]
# Delete columns 'NOT_USED' and 'Graph'
data_clean = data.drop(['NOT_USED', 'Graph'], axis=1)
# Keep only the rows with at least 3 non-NA values.
data_clean.dropna(subset=['Actual', 'Forecast', 'Previous'], inplace=True)
# Check for NaN in the entire DataFrame
nan_check = data_clean.isnull().values.any()
# Count NaN values in each column
nan_count_per_column = data_clean.isnull().sum()
print(nan_count_per_column)
Date 0 Time 0 Currency 0 Impact 0 Description 0 Actual 0 Forecast 0 Previous 0 dtype: int64
Now, the dataset is clean.
In order to perform the operation, we need to remove (clean) the symbols such as ‘%’, ‘M’,’K’, etc, and convert the values as float for columns [Actual, Forecast, and Previous]. Special case for the News “MPC Official Bank Rate Votes”. Interest rates are a major driving force in forex markets, as they indicate a country’s economic health and future outlook. Higher interest rates tend to attract foreign investors, increasing the demand for the currency and subsequently, its value.
If the majority of the MPC votes for a rate hike, it signals a bullish sentiment for the GBP. Traders might anticipate this and buy GBP before the announcement, causing its value to rise.
If the MPC votes are split, it may cause uncertainty, leading to potential market volatility. Traders often adopt a wait-and-see approach in such scenarios.
If most MPC members vote to cut rates, it may cause the GBP to depreciate as it indicates potential economic challenges ahead.
## Function to clean and convert values from string to float.
def clean_and_convert(value):
if isinstance(value, str):
if '%' in value:
value = value.replace('<', '') # Removing '<'
value = value.replace('>', '') # Removing '>'
value = value.replace('%', '') # Removing '%'
#value = value.replace('-', '') # Removing '-'
return float(value)
elif 'M' in value:
return float(value.replace('M', '')) * 1 # Assuming 'M' means million 1_000_000
elif 'K' in value:
return float(value.replace('K', '')) * 1 # Assuming 'K' means mille 1_000
elif 'B' in value:
return float(value.replace('B', '')) * 1 # Assuming 'B' means billion 1_000_000_000
elif 'T' in value:
return float(value.replace('T', '')) * 1 # Assuming 'T' means trillion 1_000_000_000_000
elif '|' in value:
split_values = value.split('|')
return float(split_values[0]) if '|' in value else float(value)
#elif '-' in value:
# return float(value.replace('-',''))
elif value == '':
return 0.0
else:
return value # If it's already a float, return as is
# Applying the function to the DataFrame columns
data_clean['Actual'] = data_clean['Actual'].apply(clean_and_convert)
data_clean['Forecast'] = data_clean['Forecast'].apply(clean_and_convert)
data_clean['Previous'] = data_clean['Previous'].apply(clean_and_convert)
data_clean.tail(5)
Date | Time | Currency | Impact | Description | Actual | Forecast | Previous | |
---|---|---|---|---|---|---|---|---|
47176 | 2017/09/01 | 09:45 | USD | L | Final Manufacturing PMI | NaN | NaN | NaN |
47177 | 2017/09/01 | 10:00 | USD | H | ISM Manufacturing PMI | NaN | NaN | NaN |
47178 | 2017/09/01 | 10:00 | USD | L | Construction Spending m/m | -0.6 | 0.5 | -1.4 |
47179 | 2017/09/01 | 10:00 | USD | L | ISM Manufacturing Prices | NaN | NaN | NaN |
47181 | 2017/09/01 | 10:00 | USD | M | Revised UoM Consumer Sentiment | NaN | NaN | NaN |
DATA PREPROCESSING¶
Create a function to define whetever the News is ‘Positive’, ‘Negative’ or ‘Neutral’.
To do so, we need to import TextBlob library
- A polarity score of -1 represents a negative sentiment or a very negative tone.
- A polarity score of 0 represents a neutral sentiment.
- A polarity score of 1 represents a positive sentiment or a very positive tone.
Update: We do not need polarity score since we are nor going to give sentiment analysis and scoring based on the News title.
# This is a Function that define the sentiment of the 'News'
#def analyze_sentiment(text):
# analysis = TextBlob(text)
# if (analysis.sentiment.polarity > 0)
# return 'positive'
# elif (analysis.sentiment.polarity < 0)
# return 'negative'
# else:
# return 'neutral'
#data_clean['Sentiment'] = data_clean['Description'].apply(analyze_sentiment)
ANALYZE SENTIMENT: \
- VERY Positive : if ‘Actual’ > ‘Forecast’ and Impact = H
Positive : if ‘Actual’ > ‘Forecast’ and Impact = M or L
VERY Negative : if ‘Actual’ < ‘Forecast’ and Impact = H
- Negative : if ‘Actual’ < ‘Forecast’ and Impact = M or L
# This is a Function that define the sentiment of the 'News'
def new_analyze_sentiment(data):
if ((data['Actual'] > data['Forecast']) & ((data['Impact'] == 'H') | (data['Impact'] == 'M'))):
return 'positive'
elif (data['Actual'] < data['Forecast']) & ((data['Impact'] == 'H') | (data['Impact'] == 'M')):
return 'negative'
else:
return 'neutral'
# Apply the function into a new column
data_clean['Sentiment'] = data_clean.apply(new_analyze_sentiment, axis=1)
data_clean['Sentiment'].tail(10)
47170 neutral 47171 neutral 47172 negative 47173 negative 47174 positive 47176 neutral 47177 neutral 47178 neutral 47179 neutral 47181 neutral Name: Sentiment, dtype: object
# Count the results to know how many for Neutral, Postive dan Negative sentiments.
cnt_sentiment = data_clean['Sentiment'].value_counts()
print(cnt_sentiment)
neutral 17614 negative 7106 positive 6136 Name: Sentiment, dtype: int64
Now the dataset shows that there are 10928 neutrals, 6136 positives, and 6136 negatives. For a better understanding the distribution, lets create a bar plot.
# Create a bar plot
import matplotlib.pyplot as plt
cnt_sentiment.plot(kind='bar', color=['black', 'red', 'green'])
plt.title('SENTIMENT DISTRIBUTION')
plt.xlabel('SENTIMENT')
plt.ylabel('COUNT')
plt.xticks(rotation=0)
plt.show()
As we can see that the Sentiment Distribution looks fair, especially the ‘Postive’ and ‘Negative’ sentiments, since we focus on those two sentiments.
# Show the Positive sentiments ONLY
positive_sentiment = data_clean[data_clean['Sentiment']=='positive'][['Description', 'Actual', 'Forecast', 'Sentiment']]
print(positive_sentiment)
Description Actual Forecast Sentiment 22 German Unemployment Change -96.0 -110.0 positive 30 Trade Balance -785.0 -976.0 positive 60 Employment Change 62.0 17.0 positive 62 Non-Farm Employment Change 167.0 115.0 positive 64 Average Hourly Earnings m/m 0.5 0.3 positive ... ... ... ... ... 47126 Prelim GDP q/q 3.0 2.7 positive 47135 Private Capital Expenditure q/q 0.8 0.2 positive 47146 CPI Flash Estimate y/y 1.5 1.4 positive 47148 GDP m/m 0.3 0.1 positive 47174 Unemployment Rate 4.4 4.3 positive [6136 rows x 4 columns]
# Show the negative sentiments ONLY
negative_sentiment = data_clean[data_clean['Sentiment']=='negative'][['Description', 'Actual', 'Forecast', 'Sentiment']]
print(negative_sentiment)
Description Actual Forecast Sentiment 14 Commodity Prices y/y 11.0 13.9 negative 25 ADP Non-Farm Employment Change -40.0 120.0 negative 32 CPI m/m 0.0 0.1 negative 46 RMPI m/m 0.9 1.0 negative 49 Pending Home Sales m/m -0.5 0.0 negative ... ... ... ... ... 47149 Unemployment Claims 236.0 237.0 negative 47152 Personal Spending m/m 0.3 0.4 negative 47154 Pending Home Sales m/m -0.8 0.4 negative 47172 Average Hourly Earnings m/m 0.1 0.2 negative 47173 Non-Farm Employment Change 156.0 180.0 negative [7106 rows x 4 columns]
MACHINE LEARNING¶
We create a machine learning based on some existing models.
data_clean.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Actual | 24171.0 | 19.048218 | 102.942078 | -1436.0 | -0.2 | 0.6 | 4.3200 | 2510.0 |
Forecast | 24171.0 | 19.106358 | 99.100445 | -1125.0 | 0.1 | 0.6 | 4.0000 | 2440.0 |
Previous | 24164.0 | 19.071725 | 103.600409 | -1394.0 | -0.2 | 0.6 | 4.3825 | 2510.0 |
data_clean['Description'].value_counts()
Trade Balance 1130 Unemployment Rate 941 Retail Sales m/m 689 Natural Gas Storage 557 Crude Oil Inventories 557 ... Building Permits [Sep Data] 1 Factory Orders m/m [Aug Data] 1 Employment Level [Q2 Data] 1 Irish Lisbon Treaty Vote [All Day] 1 French Prelim Private Payrolls q/q 1 Name: Description, Length: 267, dtype: int64
# Keep and Re-check for NaNs values.
data_clean.dropna(subset=['Actual', 'Forecast', 'Previous'], inplace=True)
# Check for NaN in the entire DataFrame
nan_check = data_clean.isnull().values.any()
# Count NaN values in each column
nan_count_per_column = data_clean.isnull().sum()
print(nan_count_per_column)
Date 0 Time 0 Currency 0 Impact 0 Description 0 Actual 0 Forecast 0 Previous 0 Sentiment 0 dtype: int64
#Define the features
X = data_clean[['Actual', 'Forecast', 'Previous']]
X.columns = X.columns.astype(str)
#Define Target variable
y = data_clean['Sentiment']
# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
#Data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
print('X_train : ', X_train.shape)
print('X_test : ', X_test.shape)
print('y_train : ', y_train.shape)
print('y_test : ', y_test.shape)
X_train : (21747, 3) X_test : (2417, 3) y_train : (21747,) y_test : (2417,)
CREATE MACHINE LEARNING MODEL(S)¶
Logistic Regression Model
# Define a model and fitting the model by using LOGISTIC REGRESSION model
from sklearn.linear_model import LogisticRegression
model_LR = LogisticRegression(max_iter=2000)
model_LR.fit(X_train, y_train) # --- Training Process
# Making predictions on the test set
y_pred_LR = model_LR.predict(X_test) # -- Testing Process
# Model evaluation
accuracy_LR = accuracy_score(y_test, y_pred_LR)
print(f"Accuracy LR: {accuracy_LR:.2f}")
# Classification report showing precision, recall, F1-score, etc.
print("Classification Report LR:")
print(classification_report(y_test, y_pred_LR))
Accuracy LR: 0.52 Classification Report LR: precision recall f1-score support negative 0.71 0.21 0.32 676 neutral 0.48 0.93 0.63 1095 positive 0.81 0.14 0.24 646 accuracy 0.52 2417 macro avg 0.67 0.43 0.40 2417 weighted avg 0.63 0.52 0.44 2417
# Visualize the classification report
# Create a classification report
report_LR = classification_report(y_test, y_pred_LR, output_dict=True)
# Extract precision, recall, and F1-score from the report
classes_LR = list(report_LR.keys())[:-3] # Extract class names, excluding 'accuracy', 'macro avg', 'weighted avg'
precision_LR = [report_LR[c]['precision'] for c in classes_LR]
recall_LR = [report_LR[c]['recall'] for c in classes_LR]
f1_score_LR = [report_LR[c]['f1-score'] for c in classes_LR]
# Create a bar plot for precision, recall, and F1-score
x = np.arange(len(classes_LR)) # X-axis locations for the bars
width = 0.3 # Width of the bars
fig, ax = plt.subplots(figsize=(10, 6))
# Plot precision, recall, and F1-score for each class
ax.bar(x - width, precision_LR, width, label='Precision LR')
ax.bar(x, recall_LR, width, label='Recall LR')
ax.bar(x + width, f1_score_LR, width, label='F1-score LR')
ax.set_xticks(x)
ax.set_xticklabels(classes_LR)
ax.set_xlabel('Classes')
ax.set_ylabel('Score')
ax.set_title('Classification Report Metrics - Logistic Regression Model')
ax.legend()
plt.tight_layout()
plt.show()
Random Forest Model
# Creating and fitting the RANDOM FOREST
from sklearn.ensemble import RandomForestClassifier
model_RF = RandomForestClassifier(n_estimators=100, random_state=42)
model_RF.fit(X_train, y_train) # Training the model
# Making predictions on the test set
y_pred_RF = model_RF.predict(X_test)
# Model evaluation
accuracy_RF = accuracy_score(y_test, y_pred_RF)
print(f"Accuracy RF: {accuracy_RF:.2f}")
# Classification report showing precision, recall, F1-score, etc.
print("Classification Report:")
print(classification_report(y_test, y_pred_RF))
Accuracy RF: 0.69 Classification Report: precision recall f1-score support negative 0.67 0.74 0.71 676 neutral 0.69 0.62 0.65 1095 positive 0.70 0.76 0.73 646 accuracy 0.69 2417 macro avg 0.69 0.70 0.70 2417 weighted avg 0.69 0.69 0.69 2417
# Create a classification report
report_RF = classification_report(y_test, y_pred_RF, output_dict=True)
# Extract precision, recall, and F1-score from the report
classes_RF = list(report_RF.keys())[:-3] # Extract class names, excluding 'accuracy', 'macro avg', 'weighted avg'
precision_RF = [report_RF[c]['precision'] for c in classes_RF]
recall_RF = [report_RF[c]['recall'] for c in classes_RF]
f1_score_RF = [report_RF[c]['f1-score'] for c in classes_RF]
# Create a bar plot for precision, recall, and F1-score
x = np.arange(len(classes_RF)) # X-axis locations for the bars
width = 0.3 # Width of the bars
fig, ax = plt.subplots(figsize=(10, 6))
# Plot precision, recall, and F1-score for each class
ax.bar(x - width, precision_RF, width, label='Precision RF')
ax.bar(x, recall_RF, width, label='Recall RF')
ax.bar(x + width, f1_score_RF, width, label='F1-score RF')
ax.set_xticks(x)
ax.set_xticklabels(classes_RF)
ax.set_xlabel('Classes')
ax.set_ylabel('Score')
ax.set_title('Classification Report Metrics - Random Forest Model')
ax.legend()
plt.tight_layout()
plt.show()
PREDICT¶
To predict, we need Actual, Forecast, and Previous values as the inputs.
# Test to predict
input_data = {
'Actual':[-16.8], #input Actual value here
'Forecast': [-15], #input Forecast value here
'Previous': [-18.6] #input Previous value here
}
df_input = pd.DataFrame(input_data)
# predict with RF model
pred_sentiment_RF = model_RF.predict(df_input)
print(pred_sentiment_RF)
['negative']
# predict with LR model
pred_sentiment_LR = model_LR.predict(df_input)
print(pred_sentiment_LR)
['neutral']
Until this step, we can predict a news sentiment by providing values as input data such as ‘Actual’, ‘Forecast’ and ‘Previous’ which we can take from https://www.forexfactory.com/calendar
PIPELINE MODEL
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
# Splitting the data into features and target
X = data_clean.drop('Sentiment', axis=1) # Features
y = data_clean['Sentiment'] # Target
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Preprocessing for numerical features
numeric_features = ['Actual', 'Forecast', 'Previous']
numeric_transformer = StandardScaler()
# Preprocessing for text features
text_features = 'Description'
text_transformer = CountVectorizer()
# Combine preprocessing for numerical and text features
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('text', text_transformer, text_features)
])
# Create the pipeline with preprocessing and model
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', RandomForestClassifier())])
# Fit the model
pipeline.fit(X_train, y_train)
# Predictions
predictions = pipeline.predict(X_test)
# Evaluate the model
from sklearn.metrics import accuracy_score, classification_report
accuracy_pipeline = accuracy_score(y_test, predictions)
report_pipeline = classification_report(y_test, predictions)
print(f"Accuracy: {accuracy_pipeline}")
print(f"Classification Report:\n{report_pipeline}")
Accuracy: 0.8127457066004552 Classification Report: precision recall f1-score support negative 0.79 0.80 0.80 1408 neutral 0.85 0.84 0.85 2182 positive 0.78 0.77 0.78 1243 accuracy 0.81 4833 macro avg 0.80 0.81 0.81 4833 weighted avg 0.81 0.81 0.81 4833
# Create a classification report
report_pipeline = classification_report(y_test, predictions, output_dict=True)
# Extract precision, recall, and F1-score from the report
classes_pipeline = list(report_pipeline.keys())[:-3] # Extract class names, excluding 'accuracy', 'macro avg', 'weighted avg'
precision_pipeline = [report_pipeline[c]['precision'] for c in classes_pipeline]
recall_pipeline = [report_pipeline[c]['recall'] for c in classes_pipeline]
f1_score_pipeline = [report_pipeline[c]['f1-score'] for c in classes_pipeline]
# Create a bar plot for precision, recall, and F1-score
x = np.arange(len(classes_pipeline)) # X-axis locations for the bars
width = 0.3 # Width of the bars
fig, ax = plt.subplots(figsize=(10, 6))
# Plot precision, recall, and F1-score for each class
ax.bar(x - width, precision_pipeline, width, label='Precision pipeline')
ax.bar(x, recall_pipeline, width, label='Recall pipeline')
ax.bar(x + width, f1_score_pipeline, width, label='F1-score pipeline')
ax.set_xticks(x)
ax.set_xticklabels(classes_pipeline)
ax.set_xlabel('Classes')
ax.set_ylabel('Score')
ax.set_title('Classification Report Metrics - Pipeline Model')
ax.legend()
plt.tight_layout()
plt.show()
# Test to predict with the pipeline model
input_data = {
'Description':"French Flash Manufacturing PMI", #input Description
'Actual':[42], #input Actual value here
'Forecast': [43.3], #input Forecast value here
'Previous': [42.9] #input Previous value here
}
df_input = pd.DataFrame(input_data)
# predict with pipeline model
pred_sentiment_pipeline = pipeline.predict(df_input)
print(pred_sentiment_pipeline)
['neutral']
PIPELINE METHODE, but this time we include the ‘DATE’, to see the pattern of the NEWS.
data_clean
Date | Time | Currency | Impact | Description | Actual | Forecast | Previous | Sentiment | |
---|---|---|---|---|---|---|---|---|---|
14 | 2007/01/02 | 01:30 | AUD | M | Commodity Prices y/y | 11.0 | 13.9 | 13.5 | negative |
19 | 2007/01/03 | 00:01 | USD | L | Total Vehicle Sales [All Day] | 16.7 | 16.5 | 16.1 | neutral |
22 | 2007/01/03 | 04:55 | EUR | M | German Unemployment Change | -96.0 | -110.0 | -90.0 | positive |
25 | 2007/01/03 | 09:15 | USD | M | ADP Non-Farm Employment Change | -40.0 | 120.0 | 230.0 | negative |
27 | 2007/01/03 | 11:00 | USD | L | Construction Spending m/m | -0.2 | -0.6 | -1.0 | neutral |
… | … | … | … | … | … | … | … | … | … |
47164 | 2017/09/01 | 03:15 | CHF | L | Retail Sales y/y | -0.7 | 1.7 | 1.7 | neutral |
47172 | 2017/09/01 | 08:30 | USD | H | Average Hourly Earnings m/m | 0.1 | 0.2 | 0.3 | negative |
47173 | 2017/09/01 | 08:30 | USD | H | Non-Farm Employment Change | 156.0 | 180.0 | 189.0 | negative |
47174 | 2017/09/01 | 08:30 | USD | H | Unemployment Rate | 4.4 | 4.3 | 4.3 | positive |
47178 | 2017/09/01 | 10:00 | USD | L | Construction Spending m/m | -0.6 | 0.5 | -1.4 | neutral |
24164 rows × 9 columns
Distributions
Categorical distributions
2-d distributions
Time series
Values
2-d categorical distributions
Faceted distributions
data_clean.plot(kind='scatter', x='Forecast', y='Previous', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)
def _plot_series(series, series_name, series_index=0):
from matplotlib import pyplot as plt
import seaborn as sns
palette = list(sns.palettes.mpl_palette('Dark2'))
xs = series['Date']
ys = series['Actual']
plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])
fig, ax = plt.subplots(figsize=(10, 5.2), layout='constrained')
df_sorted = data_clean.sort_values('Date', ascending=True)
for i, (series_name, series) in enumerate(df_sorted.groupby('Sentiment')):
_plot_series(series, series_name, i)
fig.legend(title='Sentiment', bbox_to_anchor=(1, 1), loc='upper left')
sns.despine(fig=fig, ax=ax)
plt.xlabel('Date')
_ = plt.ylabel('Actual')
data_clean.groupby('Impact').size().plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)
# Splitting the data into features and target
X = data_clean.drop('Sentiment', axis=1) # Features
y = data_clean['Sentiment'] # Target
# Engineering time-based features from Date column
X['DayOfWeek'] = pd.to_datetime(X['Date']).dt.dayofweek # Extracting day of the week
X['Month'] = pd.to_datetime(X['Date']).dt.month # Extracting month
X['Year'] = pd.to_datetime(X['Date']).dt.year # Extracting year
X['Hour'] = pd.to_datetime(X['Time'], format='%H:%M').dt.hour #Extracting hour
X['Minute'] = pd.to_datetime(X['Time'], format='%H:%M').dt.minute #Extracting minute
# Drop unnecessary columns (Date, Description) for now
X = X.drop(['Date', 'Description'], axis=1)
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Preprocessing for numerical features
numeric_features = ['Actual', 'Forecast', 'Previous', 'DayOfWeek', 'Month', 'Hour', 'Minute'] # Include time-based features
numeric_transformer = StandardScaler()
# Preprocessing for text features (if Description is still used)
text_features = [] # Assuming Description column has been dropped
text_transformer = CountVectorizer()
# Combine preprocessing for numerical and text features
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('text', text_transformer, text_features) # Add text feature transformation here if needed
])
# Create the pipeline with preprocessing and model
pipeline_v2 = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', RandomForestClassifier())]) # You can change the classifier as needed
# Fit the model
pipeline_v2.fit(X_train, y_train)
# Predictions
predictions_v2 = pipeline_v2.predict(X_test)
# Evaluate the model
accuracy_pipeline_v2 = accuracy_score(y_test, predictions_v2)
report_pipeline_v2 = classification_report(y_test, predictions_v2, output_dict=True)
print(f"Accuracy: {accuracy_pipeline_v2}")
print(f"Classification Report:\n{report_pipeline_v2}")
Accuracy: 0.7438444030622802 Classification Report: {'negative': {'precision': 0.7184604419101924, 'recall': 0.7159090909090909, 'f1-score': 0.7171824973319103, 'support': 1408}, 'neutral': {'precision': 0.7602468047598061, 'recall': 0.7905591200733272, 'f1-score': 0.7751067175915526, 'support': 2182}, 'positive': {'precision': 0.7424633936261843, 'recall': 0.6934835076427996, 'f1-score': 0.7171381031613977, 'support': 1243}, 'accuracy': 0.7438444030622802, 'macro avg': {'precision': 0.740390213432061, 'recall': 0.7333172395417392, 'f1-score': 0.7364757726949535, 'support': 4833}, 'weighted avg': {'precision': 0.7434994472321115, 'recall': 0.7438444030622802, 'f1-score': 0.7433226725134936, 'support': 4833}}
Viola….
Saya belum sempat memberikan deskripsi atau penjelasan dari masing-masing script Python. Jika ada yang perlu ditanyakan, silahkan tinggalkan komentar Anda pada kolom di bawah.
Colmar, 19 Des 2023, Winter.