Comparative Analysis of Musical Preferences in Moscow and St. Petersburg

Answered using ChatGPT 4o by ChatGPT4o

Question:

Analysis of musical preferences of streaming service users

Project objective

Using real data from the Yandex.Music service, check the data and compare the behavior and preferences of users in two capitals: Moscow and St. Petersburg.

Project description

The Pandas library must be used.

Check the data for errors and evaluate their impact on the study. Compare the behavior of users in the two capitals and test the following hypotheses:

1. User activity depends on the day of the week, and this manifests itself differently in Moscow and St. Petersburg.
2. On Monday morning, some genres prevail in Moscow and others in St. Petersburg. Likewise, on Friday evening, different genres prevail depending on the city.
3. Moscow and St. Petersburg prefer different genres of music: pop is listened to more often in Moscow, Russian rap in St. Petersburg.

Answer:

To analyze the musical preferences of streaming service users from Moscow and St. Petersburg using the Pandas library, we need to follow a structured approach. Here’s a step-by-step guide to achieve the project objectives:

Step 1: Data Collection and Loading

First, we need to load the dataset into a Pandas DataFrame. Assuming the data is in a CSV file, we can use the pd.read_csv() function.

import pandas as pd

# Load the dataset
data = pd.read_csv('yandex_music_data.csv')
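The later steps rely on columns named city, date, time and genre. The question does not state the actual schema, so these names are assumptions. A minimal sketch of an early check for missing columns, run on a small made-up sample rather than the real Yandex.Music file:

```python
import io
import pandas as pd

# Hypothetical column names that the rest of this walkthrough relies on
REQUIRED_COLUMNS = ['city', 'date', 'time', 'genre']

def normalise_columns(df):
    """Return a copy of df with headers stripped of spaces and lowercased."""
    return df.rename(columns=lambda name: name.strip().lower())

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]

# Made-up sample standing in for the real CSV file
sample_csv = io.StringIO(
    " City ,date,time,Genre\n"
    "Moscow,2023-01-02,08:15:00,pop\n"
    "Saint-Petersburg,2023-01-06,20:40:00,rusrap\n"
)
raw = pd.read_csv(sample_csv)
sample = normalise_columns(raw)

print(missing_columns(raw))     # headers as read from the file
print(missing_columns(sample))  # headers after normalisation
```

If the second list is not empty, the column names used in the following steps need to be adapted to the real file.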

Step 2: Data Inspection and Cleaning

Inspect the data for any errors or missing values. This step is crucial to ensure the accuracy of our analysis.

# Display the first few rows of the dataset
print(data.head())

# Check for missing values
print(data.isnull().sum())

# Drop or fill missing values if necessary; check first how many rows this removes
data = data.dropna()  # or data.ffill(); fillna(method='ffill') is deprecated in recent pandas
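Since the task asks to evaluate how errors affect the study, it may help to measure missing values and duplicates before removing anything. The following is a sketch on made-up rows. The column names and the 'unknown' placeholder are assumptions, not part of the original data:

```python
import pandas as pd

# Made-up sample; column names are assumptions, not the real schema
sample = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow', 'Saint-Petersburg'],
    'genre': ['pop', 'pop', None, 'hip-hop', 'rusrap'],
    'time': ['08:15:00', '08:15:00', '09:00:00', '20:10:00', '21:30:00'],
})

# Share of missing values per column, to judge how much dropna() would discard
missing_share = sample.isna().mean()

# Fully identical rows are likely logging duplicates
n_duplicates = sample.duplicated().sum()

# Remove exact duplicates, but keep rows with a missing genre
# under an explicit placeholder instead of dropping them
cleaned = sample.drop_duplicates().reset_index(drop=True)
cleaned['genre'] = cleaned['genre'].fillna('unknown')

print(missing_share)
print(f'Duplicate rows: {n_duplicates}')
print(cleaned['genre'].unique())
```

A placeholder keeps such rows available for the activity counts, where the genre is irrelevant, while still making them easy to exclude from the genre comparison.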

Step 3: Data Preprocessing

Ensure that the data types are correct and convert any necessary columns to appropriate data types.

# Convert date column to datetime
data['date'] = pd.to_datetime(data['date'])

# Ensure other columns are of correct data type
print(data.dtypes)
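If some date strings are malformed, pd.to_datetime raises an error by default. A small sketch of a more tolerant conversion that also counts the bad entries, assuming dates are stored as YYYY-MM-DD text (the format is an assumption):

```python
import pandas as pd

# Made-up sample with one malformed entry
sample = pd.DataFrame({'date': ['2023-01-02', '2023-01-06', 'not a date']})

# errors='coerce' turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(sample['date'], format='%Y-%m-%d', errors='coerce')

n_bad = parsed.isna().sum()
day_names = parsed.dt.day_name()

print(f'Unparseable dates: {n_bad}')
print(day_names)
```

Rows with NaT get no day name, so they drop out of any grouping by day of the week. Counting them first shows how much data that affects.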

Step 4: User Activity Analysis

Analyze user activity based on the day of the week and compare between Moscow and St. Petersburg.

# Extract day of the week from the date
data['day_of_week'] = data['date'].dt.day_name()

# Group by city and day of the week to get user activity
activity = data.groupby(['city', 'day_of_week']).size().unstack().fillna(0)

# Plot the user activity
activity.plot(kind='bar', figsize=(12, 6), title='User Activity by Day of the Week')
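Raw counts are hard to compare when one city has many more users, and unstack() orders the day columns alphabetically. A sketch that puts the days in calendar order and converts counts to per-city shares, using made-up rows and assumed column names:

```python
import pandas as pd

DAY_ORDER = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

# Made-up sample; one row per track play
sample = pd.DataFrame({
    'city': ['Moscow'] * 4 + ['Saint-Petersburg'] * 4,
    'day_of_week': ['Monday', 'Monday', 'Friday', 'Wednesday',
                    'Monday', 'Friday', 'Friday', 'Friday'],
})

# Count plays per city and day, with days in calendar order
counts = (
    sample.groupby(['city', 'day_of_week']).size()
    .unstack(fill_value=0)
    .reindex(columns=DAY_ORDER, fill_value=0)
)

# Divide each row by its total so every city sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(2))
```

Comparing shares rather than counts shows whether the weekly pattern differs between cities independently of audience size.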

Step 5: Genre Analysis on Specific Days and Times

Compare the genres listened to on Monday morning and Friday evening in both cities.

# Filter data for Monday morning and Friday evening
# (string comparison; assumes 'time' holds zero-padded 'HH:MM:SS' strings, bounds inclusive)
monday_morning = data[(data['day_of_week'] == 'Monday') & (data['time'].between('06:00:00', '12:00:00'))]
friday_evening = data[(data['day_of_week'] == 'Friday') & (data['time'].between('18:00:00', '23:59:59'))]

# Group by city and genre
monday_genres = monday_morning.groupby(['city', 'genre']).size().unstack().fillna(0)
friday_genres = friday_evening.groupby(['city', 'genre']).size().unstack().fillna(0)

# Plot the genre distribution
monday_genres.plot(kind='bar', figsize=(12, 6), title='Genres on Monday Morning')
friday_genres.plot(kind='bar', figsize=(12, 6), title='Genres on Friday Evening')
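With many genres, a bar chart of every genre is hard to read, so a short ranking per city can be easier to compare. The sketch below uses a hypothetical helper, top_genres, on made-up rows. It parses the time column into timedeltas so the window comparison does not depend on string formatting:

```python
import pandas as pd

# Made-up sample; column names and genre labels are assumptions
sample = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg', 'Moscow'],
    'day_of_week': ['Monday'] * 5 + ['Friday'],
    'time': ['07:30:00', '08:05:00', '11:59:00',
             '09:10:00', '13:00:00', '08:00:00'],
    'genre': ['pop', 'pop', 'rock', 'rusrap', 'pop', 'dance'],
})

def top_genres(df, day, start, end, n=3):
    """Return play counts of the n most played genres per city
    for one day and a half-open [start, end) time window."""
    # Parse to timedeltas so the comparison is numeric, not lexicographic
    t = pd.to_timedelta(df['time'])
    mask = (
        (df['day_of_week'] == day)
        & (t >= pd.Timedelta(start))
        & (t < pd.Timedelta(end))
    )
    counts = (
        df[mask].groupby(['city', 'genre']).size()
        .sort_values(ascending=False)
    )
    # head(n) per city keeps the n largest counts because of the sort above
    return counts.groupby(level='city').head(n)

monday_top = top_genres(sample, 'Monday', '06:00:00', '12:00:00')
print(monday_top)
```

The same call with 'Friday', '18:00:00' and '24:00:00' would cover the evening window.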

Step 6: General Genre Preferences

Compare the overall genre preferences between Moscow and St. Petersburg.

# Group by city and genre for overall preferences
overall_genres = data.groupby(['city', 'genre']).size().unstack().fillna(0)

# Plot the overall genre preferences
overall_genres.plot(kind='bar', figsize=(12, 6), title='Overall Genre Preferences')
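To check the specific claim about pop and Russian rap, it helps to look at each genre's share within a city and at the leading genre per city. A minimal sketch on made-up rows; the genre labels 'pop' and 'rusrap' are assumptions about how the data encodes them:

```python
import pandas as pd

# Made-up sample; one row per track play
sample = pd.DataFrame({
    'city': ['Moscow'] * 5 + ['Saint-Petersburg'] * 4,
    'genre': ['pop', 'pop', 'pop', 'rusrap', 'rock',
              'rusrap', 'rusrap', 'pop', 'rock'],
})

# normalize='index' turns each city's row into shares that sum to 1
genre_shares = pd.crosstab(sample['city'], sample['genre'], normalize='index')

# Most played genre in each city
leading_genre = genre_shares.idxmax(axis=1)

print(genre_shares.round(2))
print(leading_genre)
```

On real data, the shares of the two genres in question can be read directly from this table for both cities.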

Step 7: Hypothesis Testing

Test the hypotheses using statistical methods to determine if the observed differences are significant.

from scipy.stats import chi2_contingency

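# Hypothesis 1: the distribution of activity across days of the week differs between the cities
# (the test checks whether city and day of the week are independent)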
contingency_table = activity.T
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f'Chi2: {chi2}, p-value: {p}')

# Hypothesis 2: the genre mix differs between the cities on Monday morning and on Friday evening
monday_contingency = monday_genres.T
friday_contingency = friday_genres.T

chi2_monday, p_monday, _, _ = chi2_contingency(monday_contingency)
chi2_friday, p_friday, _, _ = chi2_contingency(friday_contingency)

print(f'Monday Chi2: {chi2_monday}, p-value: {p_monday}')
print(f'Friday Chi2: {chi2_friday}, p-value: {p_friday}')

# Hypothesis 3: Different overall genre preferences
overall_contingency = overall_genres.T
chi2_overall, p_overall, _, _ = chi2_contingency(overall_contingency)
print(f'Overall Chi2: {chi2_overall}, p-value: {p_overall}')
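Two practical caveats apply to these tests. Rare genres produce very small expected counts, which makes the chi-squared approximation unreliable, and with a large dataset even tiny differences yield small p-values. The sketch below filters out rare genres and reports Cramér's V as an effect size. The counts are made up and the threshold of 10 plays is an arbitrary assumption:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up city x genre play counts
table = pd.DataFrame(
    {'pop': [600, 250], 'rusrap': [150, 200], 'rock': [250, 150], 'folk': [1, 0]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Drop genres with very few plays overall (threshold is an assumption)
MIN_PLAYS = 10
filtered = table.loc[:, table.sum(axis=0) >= MIN_PLAYS]

chi2, p, dof, expected = chi2_contingency(filtered)

# Cramér's V: 0 means no association, 1 means a perfect one
n = filtered.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(filtered.shape) - 1)))

print(f'Chi2: {chi2:.2f}, p-value: {p:.3g}, dof: {dof}')
print(f"Cramér's V: {cramers_v:.3f}")
```

A small p-value together with a small Cramér's V would suggest that the difference is statistically detectable but minor in practice.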

Conclusion

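The analysis and hypothesis tests above give a basis for conclusions about the musical preferences and behavior of users in Moscow and St. Petersburg. A p-value below the chosen significance level (commonly 0.05) indicates that the difference between the cities is statistically significant. The chi-squared test shows only that the distributions differ, not which days or genres drive the difference, so the grouped counts from Steps 4 to 6 are still needed to judge the specific claims, such as pop prevailing in Moscow and Russian rap in St. Petersburg.

Together, these steps cover data checking, descriptive comparison, and significance testing for the two cities.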