Hello,
I'm working on a sentiment analysis project using a large movie reviews dataset I obtained from here, and I keep running into an out-of-memory error when I execute my code. Here's the relevant portion of my code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
# Load the large movie reviews dataset
data = pd.read_csv('large_movie_reviews.csv')
# Preprocess the data
# ... (code for data preprocessing)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['review'], data['sentiment'], test_size=0.2, random_state=42)
# Vectorize the text data using TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
# Train the Support Vector Machine (SVM) model
model = SVC(kernel='linear')
model.fit(X_train_vectorized, y_train)
# Evaluate the model
accuracy = model.score(X_test_vectorized, y_test)
print(f"Accuracy: {accuracy}")
Unfortunately, because the dataset is so large, this code fails with an out-of-memory error. I believe TfidfVectorizer builds a large sparse matrix that consumes a substantial amount of memory. I'm looking for memory-efficient alternatives or strategies that would let me work with a dataset this size while still training an accurate sentiment analysis model.
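One direction I've been considering is streaming the CSV in chunks and training incrementally, roughly like the sketch below. I'm assuming scikit-learn's HashingVectorizer and SGDClassifier with partial_fit, and that my 'sentiment' column holds 'positive'/'negative' labels; I haven't verified that this actually resolves the memory problem for my data:
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless hashing vectorizer: no in-memory vocabulary, so each chunk
# can be transformed independently into a fixed-size feature space.
vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=False)

# Linear model trained incrementally; hinge loss behaves similarly to a
# linear-kernel SVM without holding the whole training set in memory.
model = SGDClassifier(loss='hinge')

# Assumed label values in the 'sentiment' column -- adjust to the real ones.
classes = ['negative', 'positive']

# Stream the CSV in chunks instead of loading the whole file at once.
for chunk in pd.read_csv('large_movie_reviews.csv', chunksize=10_000):
    X = vectorizer.transform(chunk['review'])
    y = chunk['sentiment']
    model.partial_fit(X, y, classes=classes)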
Could you suggest any memory-efficient approaches or alternatives that would let me avoid the out-of-memory error and keep working with this large movie reviews dataset?
Thank you so much!
Hi @sachinbhatt ,
The forum where you posted your message is a ThingWorx-specific forum.
Is there a specific question related to ThingWorx? If not, Stack Overflow has a far better chance of giving you the advice you need.