Apa ide dibalik proyek ini ?

Sebenarnya ide ini muncul ketika saya sedang mengembangkan proyek Machine Learning yang dapat memprediksi harga saham/forex. Buat yang ingin belajar, silahkan menuju tautan dibawah ini.

[Belajar] Membuat Machine Learning untuk Memprediksi Harga Saham/Forex/Crypto – Part #1

Ide itu muncul karena saya melihat peluang yang sangat besar bagi siapa saja yang tertarik di dunia pasar keuangan, khususnya para investor atau trader, bisa meraup keuntungan yang besar dengan memanfaatkan machine learning ini, ditambah dengan adanya ‘News Sentiment’. Karena, hubungan antara berita (NEWS) dan pergerakan harga saham atau forex sangatlah erat. mereka memiliki korelasi yang sangat tinggi. Berita-berita ekonomi, politik, keuangan, atau faktor-faktor lain yang memengaruhi kondisi makroekonomi suatu negara atau industri dapat memiliki dampak yang signifikan terhadap harga saham atau nilai tukar mata uang di pasar forex. Disamping itu, tantangan dalam News Trading adalah Volatilitas tinggi, artinya harga dapat bergerak dengan cepat dan tidak terduga setelah rilis berita, dan meningkatkan risiko perdagangan. Juga terjadi ‘Slippage’, kesulitan dalam mendapatkan harga yang diinginkan karena perubahan harga yang cepat setelah rilis berita. Serta reaksi (sentimen) pasar yang (mungkin) berlawanan. Terkadang pasar bereaksi berlawanan dengan ekspektasi yang diharapkan dari berita, menyebabkan perdagangan menjadi sulit diprediksi. Oleh karena itu, saya memiliki ide untuk membuat sebuah Machine Learning untuk memprediksi sentimen berita terhadap harga pasar.

Pada bagian ini, saya akan menjelaskan dengan bantuan bahasa pemrograman Python (Google Colab) agar lebih mudah dipahami.

News_Sentiment_Analysis_Abdusy

PROJECT : NEWS SENTIMENT ANALYSIS¶

Start : 12-Oct-2023 \ Target End : End of January 2024

In [ ]:

# Import Libraries
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from textblob import TextBlob

In [ ]:

# Load Data
file_data = 'https://abdusy.troi-z.com/wp-content/uploads/2023/10/allnews_2007_2017.csv'
data = pd.read_csv(file_data) #, index_col = 'Date', parse_dates=['Date'])

In [ ]:

# Preview Data
data.shape

Out[ ]:

(47182, 10)

As we can see, we have 47182 rows.

In [ ]:

data.describe

Out[ ]:

<bound method NDFrame.describe of              Date   Time Currency Impact                         Description  \
0      2007/01/01  00:01      CAD      N              Bank Holiday [All Day]
1      2007/01/01  00:01      CHF      N              Bank Holiday [All Day]
2      2007/01/01  00:01      EUR      N       French Bank Holiday [All Day]
3      2007/01/01  00:01      EUR      N       German Bank Holiday [All Day]
4      2007/01/01  00:01      EUR      N      Italian Bank Holiday [All Day]
...           ...    ...      ...    ...                                 ...
47177  2017/09/01  10:00      USD      H               ISM Manufacturing PMI
47178  2017/09/01  10:00      USD      L           Construction Spending m/m
47179  2017/09/01  10:00      USD      L            ISM Manufacturing Prices
47180  2017/09/01  10:00      USD      L  Revised UoM Inflation Expectations
47181  2017/09/01  10:00      USD      M      Revised UoM Consumer Sentiment

      Actual Forecast Previous NOT_USED  Graph
0        NaN      NaN      NaN      NaN  12245
1        NaN      NaN      NaN      NaN  12036
2        NaN      NaN      NaN      NaN  12214
3        NaN      NaN      NaN      NaN  12186
4        NaN      NaN      NaN      NaN  12217
...      ...      ...      ...      ...    ...
47177   58.8     56.5     56.3      NaN  64927
47178  -0.6%     0.5%    -1.4%    -1.3%  66723
47179   62.0     61.9     62.0      NaN  64926
47180   2.6%      NaN     2.6%      NaN  64986
47181   96.8     97.4     97.6      NaN  64988

[47182 rows x 10 columns]>

In [ ]:

data.head(10)

Out[ ]:

	Date	Time	Currency	Impact	Description	Actual	Forecast	Previous	NOT_USED	Graph
0	2007/01/01	00:01	CAD	N	Bank Holiday [All Day]	NaN	NaN	NaN	NaN	12245
1	2007/01/01	00:01	CHF	N	Bank Holiday [All Day]	NaN	NaN	NaN	NaN	12036
2	2007/01/01	00:01	EUR	N	French Bank Holiday [All Day]	NaN	NaN	NaN	NaN	12214
3	2007/01/01	00:01	EUR	N	German Bank Holiday [All Day]	NaN	NaN	NaN	NaN	12186
4	2007/01/01	00:01	EUR	N	Italian Bank Holiday [All Day]	NaN	NaN	NaN	NaN	12217
5	2007/01/01	00:01	GBP	N	Bank Holiday [All Day]	NaN	NaN	NaN	NaN	12031
6	2007/01/01	00:01	JPY	N	Bank Holiday [All Day]	NaN	NaN	NaN	NaN	12086
7	2007/01/01	00:01	NZD	N	Bank Holiday [All Day]	NaN	NaN	NaN	NaN	12092
8	2007/01/01	00:01	USD	N	Bank Holiday [All Day]	NaN	NaN	NaN	NaN	12004
9	2007/01/01	18:30	AUD	L	AIG Manufacturing Index	52.4	NaN	54.4	NaN	3232

In [ ]:

data.tail(10)

Out[ ]:

	Date	Time	Currency	Impact	Description	Actual	Forecast	Previous	NOT_USED	Graph
47172	2017/09/01	08:30	USD	H	Average Hourly Earnings m/m	0.1%	0.2%	0.3%	NaN	64389
47173	2017/09/01	08:30	USD	H	Non-Farm Employment Change	156K	180K	189K	209K	64397
47174	2017/09/01	08:30	USD	H	Unemployment Rate	4.4%	4.3%	4.3%	NaN	64393
47175	2017/09/01	09:30	CAD	L	Manufacturing PMI	54.6	NaN	55.5	NaN	65218
47176	2017/09/01	09:45	USD	L	Final Manufacturing PMI	52.8	52.5	52.5	NaN	64779
47177	2017/09/01	10:00	USD	H	ISM Manufacturing PMI	58.8	56.5	56.3	NaN	64927
47178	2017/09/01	10:00	USD	L	Construction Spending m/m	-0.6%	0.5%	-1.4%	-1.3%	66723
47179	2017/09/01	10:00	USD	L	ISM Manufacturing Prices	62.0	61.9	62.0	NaN	64926
47180	2017/09/01	10:00	USD	L	Revised UoM Inflation Expectations	2.6%	NaN	2.6%	NaN	64986
47181	2017/09/01	10:00	USD	M	Revised UoM Consumer Sentiment	96.8	97.4	97.6	NaN	64988

Data is collected start from 01-January-2007 till 01-Sept-2017. There are NaNs and unused columns. Please Note that we need to clean the dataset.

EXPLORING & CLEANING DATA¶

In [ ]:

data['Description'].value_counts()

Out[ ]:

Trade Balance                      1132
Bank Holiday [All Day]              944
Unemployment Rate                   941
Retail Sales m/m                    690
Natural Gas Storage                 557
                                   ...
Doha Oil Summit [All Day]             1
CB Leading Index m/m [Dec Data]       1
CB Leading Index m/m [Feb Data]       1
CB Leading Index m/m [Jan Data]       1
RBA Assist Gov Ellis Speaks           1
Name: Description, Length: 586, dtype: int64

In [ ]:

# Count the Impact Distribution
cnt_impact = data['Impact'].value_counts()
print(cnt_impact)

M    17170
L    16956
H    11674
N     1382
Name: Impact, dtype: int64

In [ ]:

# Create a bar plot
cnt_impact.plot(kind='bar', color=['orange', 'green', 'red', 'grey'])
plt.title('Impact Distribution')
plt.xlabel('Impact')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

In [ ]:

# Show the HIGH Impact ONLY
High_Impact = data[data['Impact']=='H'][['Date', 'Time', 'Description', 'Actual', 'Forecast']]
print(High_Impact)

             Date   Time                  Description Actual Forecast
26     2007/01/03  11:00        ISM Manufacturing PMI   51.4     51.0
29     2007/01/03  15:00         FOMC Meeting Minutes    NaN      NaN
30     2007/01/03  17:45                Trade Balance  -785M    -976M
32     2007/01/04  02:45                      CPI m/m   0.0%     0.1%
48     2007/01/04  11:00    ISM Non-Manufacturing PMI   57.1     58.0
...           ...    ...                          ...    ...      ...
47171  2017/09/01  04:30            Manufacturing PMI   56.9     55.0
47172  2017/09/01  08:30  Average Hourly Earnings m/m   0.1%     0.2%
47173  2017/09/01  08:30   Non-Farm Employment Change   156K     180K
47174  2017/09/01  08:30            Unemployment Rate   4.4%     4.3%
47177  2017/09/01  10:00        ISM Manufacturing PMI   58.8     56.5

[11674 rows x 5 columns]

In [ ]:

# Show the MEDIUM Impact ONLY
Medium_Impact = data[data['Impact']=='M'][['Date', 'Time', 'Description', 'Actual', 'Forecast', 'Previous']]
print(Medium_Impact)

             Date   Time                     Description Actual Forecast  \
11     2007/01/01  22:30  Caixin Final Manufacturing PMI   52.4      NaN
14     2007/01/02  01:30            Commodity Prices y/y  11.0%    13.9%
17     2007/01/02  05:00         Final Manufacturing PMI   56.5     56.5
18     2007/01/02  05:30               Manufacturing PMI   51.9     53.0
21     2007/01/03  04:30               Manufacturing PMI   65.0     66.0
...           ...    ...                             ...    ...      ...
47153  2017/08/31  09:45                     Chicago PMI   58.9     58.7
47154  2017/08/31  10:00          Pending Home Sales m/m  -0.8%     0.4%
47160  2017/08/31  21:45        Caixin Manufacturing PMI   51.6     50.9
47165  2017/09/01  03:15       Spanish Manufacturing PMI   52.4     54.4
47181  2017/09/01  10:00  Revised UoM Consumer Sentiment   96.8     97.4

      Previous
11        53.0
14       13.5%
17        56.6
18        52.5
21        67.0
...        ...
47153     58.9
47154     1.3%
47160     51.1
47165     54.0
47181     97.6

[17170 rows x 6 columns]

In [ ]:

# Delete columns 'NOT_USED' and 'Graph'
data_clean = data.drop(['NOT_USED', 'Graph'], axis=1)

In [ ]:

# Keep only the rows with at least 3 non-NA values.
data_clean.dropna(subset=['Actual', 'Forecast', 'Previous'], inplace=True)

In [ ]:

# Check for NaN in the entire DataFrame
nan_check = data_clean.isnull().values.any()
# Count NaN values in each column
nan_count_per_column = data_clean.isnull().sum()
print(nan_count_per_column)

Date           0
Time           0
Currency       0
Impact         0
Description    0
Actual         0
Forecast       0
Previous       0
dtype: int64

Now, the dataset is clean.

In order to perform the operation, we need to remove (clean) the symbols such as ‘%’, ‘M’,’K’, etc, and convert the values as float for columns [Actual, Forecast, and Previous]. Special case for the News “MPC Official Bank Rate Votes”. Interest rates are a major driving force in forex markets, as they indicate a country’s economic health and future outlook. Higher interest rates tend to attract foreign investors, increasing the demand for the currency and subsequently, its value.

If the majority of the MPC votes for a rate hike, it signals a bullish sentiment for the GBP. Traders might anticipate this and buy GBP before the announcement, causing its value to rise.
If the MPC votes are split, it may cause uncertainty, leading to potential market volatility. Traders often adopt a wait-and-see approach in such scenarios.
If most MPC members vote to cut rates, it may cause the GBP to depreciate as it indicates potential economic challenges ahead.

In [ ]:

## Function to clean and convert values from string to float.
def clean_and_convert(value):
    if isinstance(value, str):
        if '%' in value:
            value = value.replace('<', '')  # Removing '<'
            value = value.replace('>', '')  # Removing '>'
            value = value.replace('%', '')  # Removing '%'
            #value = value.replace('-', '')  # Removing '-'
            return float(value)
        elif 'M' in value:
            return float(value.replace('M', '')) * 1  # Assuming 'M' means million 1_000_000
        elif 'K' in value:
            return float(value.replace('K', '')) * 1  # Assuming 'K' means mille 1_000
        elif 'B' in value:
            return float(value.replace('B', '')) * 1  # Assuming 'B' means billion 1_000_000_000
        elif 'T' in value:
            return float(value.replace('T', '')) * 1  # Assuming 'T' means trillion 1_000_000_000_000
        elif '|' in value:
            split_values = value.split('|')
            return float(split_values[0]) if '|' in value else float(value)
        #elif '-' in value:
        #    return float(value.replace('-',''))
        elif value == '':
            return 0.0

    else:
        return value  # If it's already a float, return as is

In [ ]:

# Applying the function to the DataFrame columns
data_clean['Actual'] = data_clean['Actual'].apply(clean_and_convert)
data_clean['Forecast'] = data_clean['Forecast'].apply(clean_and_convert)
data_clean['Previous'] = data_clean['Previous'].apply(clean_and_convert)

In [ ]:

data_clean.tail(5)

Out[ ]:

	Date	Time	Currency	Impact	Description	Actual	Forecast	Previous
47176	2017/09/01	09:45	USD	L	Final Manufacturing PMI	NaN	NaN	NaN
47177	2017/09/01	10:00	USD	H	ISM Manufacturing PMI	NaN	NaN	NaN
47178	2017/09/01	10:00	USD	L	Construction Spending m/m	-0.6	0.5	-1.4
47179	2017/09/01	10:00	USD	L	ISM Manufacturing Prices	NaN	NaN	NaN
47181	2017/09/01	10:00	USD	M	Revised UoM Consumer Sentiment	NaN	NaN	NaN

DATA PREPROCESSING¶

Create a function to define whetever the News is ‘Positive’, ‘Negative’ or ‘Neutral’.

To do so, we need to import TextBlob library

A polarity score of -1 represents a negative sentiment or a very negative tone.
A polarity score of 0 represents a neutral sentiment.
A polarity score of 1 represents a positive sentiment or a very positive tone.

Update: We do not need polarity score since we are nor going to give sentiment analysis and scoring based on the News title.

In [ ]:

# This is a Function that define the sentiment of the 'News'
#def analyze_sentiment(text):
#    analysis = TextBlob(text)
#    if (analysis.sentiment.polarity > 0)
#        return 'positive'
#    elif (analysis.sentiment.polarity < 0)
#        return 'negative'
#    else:
#        return 'neutral'

In [ ]:

#data_clean['Sentiment'] = data_clean['Description'].apply(analyze_sentiment)

ANALYZE SENTIMENT: \

VERY Positive : if ‘Actual’ > ‘Forecast’ and Impact = H
Positive : if ‘Actual’ > ‘Forecast’ and Impact = M or L
VERY Negative : if ‘Actual’ < ‘Forecast’ and Impact = H
Negative : if ‘Actual’ < ‘Forecast’ and Impact = M or L

In [ ]:

# This is a Function that define the sentiment of the 'News'
def new_analyze_sentiment(data):
    if ((data['Actual'] > data['Forecast']) & ((data['Impact'] == 'H') | (data['Impact'] == 'M'))):
        return 'positive'
    elif (data['Actual'] < data['Forecast']) & ((data['Impact'] == 'H') | (data['Impact'] == 'M')):
        return 'negative'
    else:
        return 'neutral'

In [ ]:

# Apply the function into a new column
data_clean['Sentiment'] = data_clean.apply(new_analyze_sentiment, axis=1)

In [ ]:

data_clean['Sentiment'].tail(10)

Out[ ]:

47170     neutral
47171     neutral
47172    negative
47173    negative
47174    positive
47176     neutral
47177     neutral
47178     neutral
47179     neutral
47181     neutral
Name: Sentiment, dtype: object

In [ ]:

# Count the results to know how many for Neutral, Postive dan Negative sentiments.
cnt_sentiment = data_clean['Sentiment'].value_counts()
print(cnt_sentiment)

neutral     17614
negative     7106
positive     6136
Name: Sentiment, dtype: int64

Now the dataset shows that there are 10928 neutrals, 6136 positives, and 6136 negatives. For a better understanding the distribution, lets create a bar plot.

In [ ]:

# Create a bar plot
import matplotlib.pyplot as plt

cnt_sentiment.plot(kind='bar', color=['black', 'red', 'green'])
plt.title('SENTIMENT DISTRIBUTION')
plt.xlabel('SENTIMENT')
plt.ylabel('COUNT')
plt.xticks(rotation=0)
plt.show()

As we can see that the Sentiment Distribution looks fair, especially the ‘Postive’ and ‘Negative’ sentiments, since we focus on those two sentiments.

In [ ]:

# Show the Positive sentiments ONLY
positive_sentiment = data_clean[data_clean['Sentiment']=='positive'][['Description', 'Actual', 'Forecast', 'Sentiment']]
print(positive_sentiment)

                           Description  Actual  Forecast Sentiment
22          German Unemployment Change   -96.0    -110.0  positive
30                       Trade Balance  -785.0    -976.0  positive
60                   Employment Change    62.0      17.0  positive
62          Non-Farm Employment Change   167.0     115.0  positive
64         Average Hourly Earnings m/m     0.5       0.3  positive
...                                ...     ...       ...       ...
47126                   Prelim GDP q/q     3.0       2.7  positive
47135  Private Capital Expenditure q/q     0.8       0.2  positive
47146           CPI Flash Estimate y/y     1.5       1.4  positive
47148                          GDP m/m     0.3       0.1  positive
47174                Unemployment Rate     4.4       4.3  positive

[6136 rows x 4 columns]

In [ ]:

# Show the negative sentiments ONLY
negative_sentiment = data_clean[data_clean['Sentiment']=='negative'][['Description', 'Actual', 'Forecast', 'Sentiment']]
print(negative_sentiment)

                          Description  Actual  Forecast Sentiment
14               Commodity Prices y/y    11.0      13.9  negative
25     ADP Non-Farm Employment Change   -40.0     120.0  negative
32                            CPI m/m     0.0       0.1  negative
46                           RMPI m/m     0.9       1.0  negative
49             Pending Home Sales m/m    -0.5       0.0  negative
...                               ...     ...       ...       ...
47149             Unemployment Claims   236.0     237.0  negative
47152           Personal Spending m/m     0.3       0.4  negative
47154          Pending Home Sales m/m    -0.8       0.4  negative
47172     Average Hourly Earnings m/m     0.1       0.2  negative
47173      Non-Farm Employment Change   156.0     180.0  negative

[7106 rows x 4 columns]

MACHINE LEARNING¶

We create a machine learning based on some existing models.

In [ ]:

data_clean.describe().T

Out[ ]:

	count	mean	std	min	25%	50%	75%	max
Actual	24171.0	19.048218	102.942078	-1436.0	-0.2	0.6	4.3200	2510.0
Forecast	24171.0	19.106358	99.100445	-1125.0	0.1	0.6	4.0000	2440.0
Previous	24164.0	19.071725	103.600409	-1394.0	-0.2	0.6	4.3825	2510.0

In [ ]:

data_clean['Description'].value_counts()

Out[ ]:

Trade Balance                         1130
Unemployment Rate                      941
Retail Sales m/m                       689
Natural Gas Storage                    557
Crude Oil Inventories                  557
                                      ...
Building Permits [Sep Data]              1
Factory Orders m/m [Aug Data]            1
Employment Level [Q2 Data]               1
Irish Lisbon Treaty Vote [All Day]       1
French Prelim Private Payrolls q/q       1
Name: Description, Length: 267, dtype: int64

In [ ]:

# Keep and Re-check for NaNs values.
data_clean.dropna(subset=['Actual', 'Forecast', 'Previous'], inplace=True)
# Check for NaN in the entire DataFrame
nan_check = data_clean.isnull().values.any()
# Count NaN values in each column
nan_count_per_column = data_clean.isnull().sum()
print(nan_count_per_column)

Date           0
Time           0
Currency       0
Impact         0
Description    0
Actual         0
Forecast       0
Previous       0
Sentiment      0
dtype: int64

In [ ]:

#Define the features
X = data_clean[['Actual', 'Forecast', 'Previous']]
X.columns = X.columns.astype(str)

In [ ]:

#Define Target variable
y = data_clean['Sentiment']

In [ ]:

# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report

In [ ]:

#Data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

print('X_train : ', X_train.shape)
print('X_test : ', X_test.shape)
print('y_train : ', y_train.shape)
print('y_test : ', y_test.shape)

X_train :  (21747, 3)
X_test :  (2417, 3)
y_train :  (21747,)
y_test :  (2417,)

CREATE MACHINE LEARNING MODEL(S)¶

Logistic Regression Model

In [ ]:

# Define a model and fitting the model by using LOGISTIC REGRESSION model
from sklearn.linear_model import LogisticRegression

model_LR = LogisticRegression(max_iter=2000)
model_LR.fit(X_train, y_train) # --- Training Process

# Making predictions on the test set
y_pred_LR = model_LR.predict(X_test) # -- Testing Process

# Model evaluation
accuracy_LR = accuracy_score(y_test, y_pred_LR)
print(f"Accuracy LR: {accuracy_LR:.2f}")

# Classification report showing precision, recall, F1-score, etc.
print("Classification Report LR:")
print(classification_report(y_test, y_pred_LR))

Accuracy LR: 0.52
Classification Report LR:
              precision    recall  f1-score   support

    negative       0.71      0.21      0.32       676
     neutral       0.48      0.93      0.63      1095
    positive       0.81      0.14      0.24       646

    accuracy                           0.52      2417
   macro avg       0.67      0.43      0.40      2417
weighted avg       0.63      0.52      0.44      2417

In [ ]:

# Visualize the classification report

# Create a classification report
report_LR = classification_report(y_test, y_pred_LR, output_dict=True)

# Extract precision, recall, and F1-score from the report
classes_LR = list(report_LR.keys())[:-3]  # Extract class names, excluding 'accuracy', 'macro avg', 'weighted avg'

precision_LR = [report_LR[c]['precision'] for c in classes_LR]
recall_LR = [report_LR[c]['recall'] for c in classes_LR]
f1_score_LR = [report_LR[c]['f1-score'] for c in classes_LR]

# Create a bar plot for precision, recall, and F1-score
x = np.arange(len(classes_LR))  # X-axis locations for the bars
width = 0.3  # Width of the bars

fig, ax = plt.subplots(figsize=(10, 6))

# Plot precision, recall, and F1-score for each class
ax.bar(x - width, precision_LR, width, label='Precision LR')
ax.bar(x, recall_LR, width, label='Recall LR')
ax.bar(x + width, f1_score_LR, width, label='F1-score LR')

ax.set_xticks(x)
ax.set_xticklabels(classes_LR)
ax.set_xlabel('Classes')
ax.set_ylabel('Score')
ax.set_title('Classification Report Metrics - Logistic Regression Model')
ax.legend()

plt.tight_layout()
plt.show()

Random Forest Model

In [ ]:

# Creating and fitting the RANDOM FOREST
from sklearn.ensemble import RandomForestClassifier

model_RF = RandomForestClassifier(n_estimators=100, random_state=42)
model_RF.fit(X_train, y_train)  # Training the model

# Making predictions on the test set
y_pred_RF = model_RF.predict(X_test)

# Model evaluation
accuracy_RF = accuracy_score(y_test, y_pred_RF)
print(f"Accuracy RF: {accuracy_RF:.2f}")

# Classification report showing precision, recall, F1-score, etc.
print("Classification Report:")
print(classification_report(y_test, y_pred_RF))

Accuracy RF: 0.69
Classification Report:
              precision    recall  f1-score   support

    negative       0.67      0.74      0.71       676
     neutral       0.69      0.62      0.65      1095
    positive       0.70      0.76      0.73       646

    accuracy                           0.69      2417
   macro avg       0.69      0.70      0.70      2417
weighted avg       0.69      0.69      0.69      2417

In [ ]:

# Create a classification report
report_RF = classification_report(y_test, y_pred_RF, output_dict=True)

# Extract precision, recall, and F1-score from the report
classes_RF = list(report_RF.keys())[:-3]  # Extract class names, excluding 'accuracy', 'macro avg', 'weighted avg'

precision_RF = [report_RF[c]['precision'] for c in classes_RF]
recall_RF = [report_RF[c]['recall'] for c in classes_RF]
f1_score_RF = [report_RF[c]['f1-score'] for c in classes_RF]

# Create a bar plot for precision, recall, and F1-score
x = np.arange(len(classes_RF))  # X-axis locations for the bars
width = 0.3  # Width of the bars

fig, ax = plt.subplots(figsize=(10, 6))

# Plot precision, recall, and F1-score for each class
ax.bar(x - width, precision_RF, width, label='Precision RF')
ax.bar(x, recall_RF, width, label='Recall RF')
ax.bar(x + width, f1_score_RF, width, label='F1-score RF')

ax.set_xticks(x)
ax.set_xticklabels(classes_RF)
ax.set_xlabel('Classes')
ax.set_ylabel('Score')
ax.set_title('Classification Report Metrics - Random Forest Model')
ax.legend()

plt.tight_layout()
plt.show()

PREDICT¶

To predict, we need Actual, Forecast, and Previous values as the inputs.

In [ ]:

# Test to predict
input_data = {
    'Actual':[-16.8],    #input Actual value here
    'Forecast': [-15], #input Forecast value here
    'Previous': [-18.6]  #input Previous value here
}
df_input = pd.DataFrame(input_data)

In [ ]:

# predict with RF model
pred_sentiment_RF = model_RF.predict(df_input)
print(pred_sentiment_RF)

['negative']

In [ ]:

# predict with LR model
pred_sentiment_LR = model_LR.predict(df_input)
print(pred_sentiment_LR)

['neutral']

Until this step, we can predict a news sentiment by providing values as input data such as ‘Actual’, ‘Forecast’ and ‘Previous’ which we can take from https://www.forexfactory.com/calendar

PIPELINE MODEL

In [ ]:

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Splitting the data into features and target
X = data_clean.drop('Sentiment', axis=1)  # Features
y = data_clean['Sentiment']  # Target

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing for numerical features
numeric_features = ['Actual', 'Forecast', 'Previous']
numeric_transformer = StandardScaler()

# Preprocessing for text features
text_features = 'Description'
text_transformer = CountVectorizer()

# Combine preprocessing for numerical and text features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('text', text_transformer, text_features)
    ])

# Create the pipeline with preprocessing and model
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', RandomForestClassifier())])

# Fit the model
pipeline.fit(X_train, y_train)

# Predictions
predictions = pipeline.predict(X_test)

# Evaluate the model
from sklearn.metrics import accuracy_score, classification_report

accuracy_pipeline = accuracy_score(y_test, predictions)
report_pipeline = classification_report(y_test, predictions)

print(f"Accuracy: {accuracy_pipeline}")
print(f"Classification Report:\n{report_pipeline}")

Accuracy: 0.8127457066004552
Classification Report:
              precision    recall  f1-score   support

    negative       0.79      0.80      0.80      1408
     neutral       0.85      0.84      0.85      2182
    positive       0.78      0.77      0.78      1243

    accuracy                           0.81      4833
   macro avg       0.80      0.81      0.81      4833
weighted avg       0.81      0.81      0.81      4833

In [ ]:

# Create a classification report
report_pipeline = classification_report(y_test, predictions, output_dict=True)

# Extract precision, recall, and F1-score from the report
classes_pipeline = list(report_pipeline.keys())[:-3]  # Extract class names, excluding 'accuracy', 'macro avg', 'weighted avg'

precision_pipeline = [report_pipeline[c]['precision'] for c in classes_pipeline]
recall_pipeline = [report_pipeline[c]['recall'] for c in classes_pipeline]
f1_score_pipeline = [report_pipeline[c]['f1-score'] for c in classes_pipeline]

# Create a bar plot for precision, recall, and F1-score
x = np.arange(len(classes_pipeline))  # X-axis locations for the bars
width = 0.3  # Width of the bars

fig, ax = plt.subplots(figsize=(10, 6))

# Plot precision, recall, and F1-score for each class
ax.bar(x - width, precision_pipeline, width, label='Precision pipeline')
ax.bar(x, recall_pipeline, width, label='Recall pipeline')
ax.bar(x + width, f1_score_pipeline, width, label='F1-score pipeline')

ax.set_xticks(x)
ax.set_xticklabels(classes_pipeline)
ax.set_xlabel('Classes')
ax.set_ylabel('Score')
ax.set_title('Classification Report Metrics - Pipeline Model')
ax.legend()

plt.tight_layout()
plt.show()

In [ ]:

# Test to predict with the pipeline model
input_data = {
    'Description':"French Flash Manufacturing PMI", #input Description
    'Actual':[42],    #input Actual value here
    'Forecast': [43.3], #input Forecast value here
    'Previous': [42.9]  #input Previous value here
}
df_input = pd.DataFrame(input_data)

In [ ]:

# predict with pipeline model
pred_sentiment_pipeline = pipeline.predict(df_input)
print(pred_sentiment_pipeline)

['neutral']

PIPELINE METHODE, but this time we include the ‘DATE’, to see the pattern of the NEWS.

In [ ]:

data_clean

Out[ ]:

	Date	Time	Currency	Impact	Description	Actual	Forecast	Previous	Sentiment
14	2007/01/02	01:30	AUD	M	Commodity Prices y/y	11.0	13.9	13.5	negative
19	2007/01/03	00:01	USD	L	Total Vehicle Sales [All Day]	16.7	16.5	16.1	neutral
22	2007/01/03	04:55	EUR	M	German Unemployment Change	-96.0	-110.0	-90.0	positive
25	2007/01/03	09:15	USD	M	ADP Non-Farm Employment Change	-40.0	120.0	230.0	negative
27	2007/01/03	11:00	USD	L	Construction Spending m/m	-0.2	-0.6	-1.0	neutral
…	…	…	…	…	…	…	…	…	…
47164	2017/09/01	03:15	CHF	L	Retail Sales y/y	-0.7	1.7	1.7	neutral
47172	2017/09/01	08:30	USD	H	Average Hourly Earnings m/m	0.1	0.2	0.3	negative
47173	2017/09/01	08:30	USD	H	Non-Farm Employment Change	156.0	180.0	189.0	negative
47174	2017/09/01	08:30	USD	H	Unemployment Rate	4.4	4.3	4.3	positive
47178	2017/09/01	10:00	USD	L	Construction Spending m/m	-0.6	0.5	-1.4	neutral

24164 rows × 9 columns

Distributions

Categorical distributions

2-d distributions

Time series

Values

2-d categorical distributions

Faceted distributions

In [ ]:

data_clean.plot(kind='scatter', x='Forecast', y='Previous', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

In [ ]:

def _plot_series(series, series_name, series_index=0):
  from matplotlib import pyplot as plt
  import seaborn as sns
  palette = list(sns.palettes.mpl_palette('Dark2'))
  xs = series['Date']
  ys = series['Actual']

  plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])

fig, ax = plt.subplots(figsize=(10, 5.2), layout='constrained')
df_sorted = data_clean.sort_values('Date', ascending=True)
for i, (series_name, series) in enumerate(df_sorted.groupby('Sentiment')):
  _plot_series(series, series_name, i)
  fig.legend(title='Sentiment', bbox_to_anchor=(1, 1), loc='upper left')
sns.despine(fig=fig, ax=ax)
plt.xlabel('Date')
_ = plt.ylabel('Actual')

In [ ]:

data_clean.groupby('Impact').size().plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)

In [ ]:

# Splitting the data into features and target
X = data_clean.drop('Sentiment', axis=1)  # Features
y = data_clean['Sentiment']  # Target

# Engineering time-based features from Date column
X['DayOfWeek'] = pd.to_datetime(X['Date']).dt.dayofweek  # Extracting day of the week
X['Month'] = pd.to_datetime(X['Date']).dt.month  # Extracting month
X['Year'] = pd.to_datetime(X['Date']).dt.year  # Extracting year
X['Hour'] = pd.to_datetime(X['Time'], format='%H:%M').dt.hour #Extracting hour
X['Minute'] = pd.to_datetime(X['Time'], format='%H:%M').dt.minute #Extracting minute


# Drop unnecessary columns (Date, Description) for now
X = X.drop(['Date', 'Description'], axis=1)

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing for numerical features
numeric_features = ['Actual', 'Forecast', 'Previous', 'DayOfWeek', 'Month', 'Hour', 'Minute']  # Include time-based features
numeric_transformer = StandardScaler()

# Preprocessing for text features (if Description is still used)
text_features = []  # Assuming Description column has been dropped
text_transformer = CountVectorizer()

# Combine preprocessing for numerical and text features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('text', text_transformer, text_features)  # Add text feature transformation here if needed
    ])

# Create the pipeline with preprocessing and model
pipeline_v2 = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', RandomForestClassifier())])  # You can change the classifier as needed

# Fit the model
pipeline_v2.fit(X_train, y_train)

# Predictions
predictions_v2 = pipeline_v2.predict(X_test)

# Evaluate the model
accuracy_pipeline_v2 = accuracy_score(y_test, predictions_v2)
report_pipeline_v2 = classification_report(y_test, predictions_v2, output_dict=True)

print(f"Accuracy: {accuracy_pipeline_v2}")
print(f"Classification Report:\n{report_pipeline_v2}")

Accuracy: 0.7438444030622802
Classification Report:
{'negative': {'precision': 0.7184604419101924, 'recall': 0.7159090909090909, 'f1-score': 0.7171824973319103, 'support': 1408}, 'neutral': {'precision': 0.7602468047598061, 'recall': 0.7905591200733272, 'f1-score': 0.7751067175915526, 'support': 2182}, 'positive': {'precision': 0.7424633936261843, 'recall': 0.6934835076427996, 'f1-score': 0.7171381031613977, 'support': 1243}, 'accuracy': 0.7438444030622802, 'macro avg': {'precision': 0.740390213432061, 'recall': 0.7333172395417392, 'f1-score': 0.7364757726949535, 'support': 4833}, 'weighted avg': {'precision': 0.7434994472321115, 'recall': 0.7438444030622802, 'f1-score': 0.7433226725134936, 'support': 4833}}

In [ ]:

Viola….

Saya belum sempat memberikan deskripsi atau penjelasan dari masing-masing script Python. Jika ada yang perlu ditanyakan, silahkan tinggalkan komentar Anda pada kolom di bawah.

Part #3

Colmar, 19 Des 2023, Winter.

Mengungkap Pasar Keuangan (Forex/Stock) dengan News Sentiment Analysis Menggunakan Python – Part #1

1 thought on “Mengungkap Pasar Keuangan (Forex/Stock) dengan News Sentiment Analysis Menggunakan Python – Part #2”

https://rb-str.ru/ July 7, 2025 at 09:47


Hey there! Do you know if they make any plugins to safeguard against hackers? I’m kinda paranoid about losing everything I’ve worked hard on. Any tips?

Distributions

Categorical distributions

2-d distributions

Time series

Values

2-d categorical distributions

Faceted distributions

Pengantar Machine Learning

Kecerdasan Buatan

Sempurnakan Keahlian Anda dengan Python

Jelajahi Library Python untuk Data Science

Mengungkap Pasar Keuangan (Forex/Stock) dengan News Sentiment Analysis Menggunakan Python – Part #2

PROJECT : NEWS SENTIMENT ANALYSIS¶

EXPLORING & CLEANING DATA¶

DATA PREPROCESSING¶

MACHINE LEARNING¶

CREATE MACHINE LEARNING MODEL(S)¶

PREDICT¶

Distributions

Categorical distributions

2-d distributions

Time series

Values

2-d categorical distributions

Faceted distributions

1 thought on “Mengungkap Pasar Keuangan (Forex/Stock) dengan News Sentiment Analysis Menggunakan Python – Part #2”

Leave a Reply Cancel reply