Comparative Analysis of Musical Preferences in Moscow and St. Petersburg

Answer:

To compare the musical preferences of streaming service users in Moscow and St. Petersburg with the Pandas library, we can work through the data in a fixed order: load it, clean it, and then test each hypothesis. Here’s a step-by-step guide:

Step 1: Data Collection and Loading

First, we need to load the dataset into a Pandas DataFrame. Assuming the data is in a CSV file, we can use the pd.read_csv() function.

import pandas as pd

# Load the dataset
data = pd.read_csv('yandex_music_data.csv')
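
Before going further, it can help to confirm that the file contains the columns the later steps rely on. The following is a minimal sketch that uses a small in-memory sample instead of the real file; the column names (user_id, track, genre, city, date, time) are assumptions taken from the code in this answer, not a confirmed schema of the dataset.

```python
import io
import pandas as pd

# Hypothetical sample standing in for 'yandex_music_data.csv';
# the column names are assumptions that mirror the later steps.
csv_text = """user_id,track,genre,city,date,time
u1,Song A,rock,Moscow,2023-01-02,08:15:00
u2,Song B,pop,Saint-Petersburg,2023-01-02,09:40:00
u3,Song C,,Moscow,2023-01-06,20:05:00
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Fail early if a column used by the analysis is absent
required = {"city", "genre", "date", "time"}
missing = required - set(sample.columns)
if missing:
    raise ValueError(f"Missing expected columns: {sorted(missing)}")

print(sample.shape)
```

With a real file, the same check can be run on data right after pd.read_csv().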

Step 2: Data Inspection and Cleaning

Inspect the data for any errors or missing values. This step is crucial to ensure the accuracy of our analysis.

# Display the first few rows of the dataset
print(data.head())

# Check for missing values
print(data.isnull().sum())

# Drop rows with missing values, or fill them instead
# (data.ffill() replaces the deprecated data.fillna(method='ffill'))
data = data.dropna()
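
Dropping every incomplete row can discard plays that are still useful for the activity counts. As an alternative, here is a small sketch that keeps such rows by labelling the missing genre, normalizes genre spelling, and removes exact duplicates. The sample frame and the 'unknown' placeholder are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample with a missing genre, inconsistent spelling and a duplicate row
df = pd.DataFrame({
    "city": ["Moscow", "Moscow", "Saint-Petersburg", "Moscow"],
    "genre": ["rock", None, "Pop ", "rock"],
    "track": ["A", "B", "C", "A"],
})

# Keep rows with a missing genre by labelling them instead of dropping them
df["genre"] = df["genre"].fillna("unknown")

# Normalize spelling so that 'Pop ' and 'pop' count as one genre
df["genre"] = df["genre"].str.strip().str.lower()

# Remove exact duplicate rows and renumber the index
df = df.drop_duplicates().reset_index(drop=True)

print(df)
```

Whether to drop or label missing values depends on the question: labelled rows still count toward activity, but they should be excluded from genre comparisons.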

Step 3: Data Preprocessing

Ensure that the data types are correct and convert any necessary columns to appropriate data types.

# Convert date column to datetime
data['date'] = pd.to_datetime(data['date'])

# Ensure other columns are of correct data type
print(data.dtypes)
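
If some date values are malformed, pd.to_datetime() raises an error by default. The sketch below shows one way to handle that, assuming dates are stored as 'YYYY-MM-DD' strings and times as 'HH:MM:SS' strings (both are assumptions about the file).

```python
import pandas as pd

# Hypothetical sample with one unparseable date
df = pd.DataFrame({
    "date": ["2023-01-02", "not a date", "2023-01-06"],
    "time": ["08:15:00", "09:40:00", "20:05:00"],
})

# Unparseable values become NaT instead of raising an exception
df["date"] = pd.to_datetime(df["date"], errors="coerce")
n_bad = int(df["date"].isna().sum())
print(f"Rows with an invalid date: {n_bad}")

# Parse the time column with an explicit format and keep the hour for later filtering
df["hour"] = pd.to_datetime(df["time"], format="%H:%M:%S").dt.hour
```

Counting the coerced values first makes it clear how many rows would be lost before they are dropped.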

Step 4: User Activity Analysis

Analyze user activity based on the day of the week and compare between Moscow and St. Petersburg.

# Extract day of the week from the date
data['day_of_week'] = data['date'].dt.day_name()

# Group by city and day of the week to get user activity
activity = data.groupby(['city', 'day_of_week']).size().unstack().fillna(0)

# Plot the user activity
activity.plot(kind='bar', figsize=(12, 6), title='User Activity by Day of the Week')
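
Two details are worth handling in this table: unstack() sorts the weekday columns alphabetically, and raw counts favor the larger city. A possible sketch, using a made-up sample, puts the weekdays that are present into calendar order and converts counts to shares per city.

```python
import pandas as pd

# Hypothetical sample: 2023-01-02 is a Monday, 01-04 a Wednesday, 01-06 a Friday
df = pd.DataFrame({
    "city": ["Moscow", "Moscow", "Saint-Petersburg", "Moscow", "Saint-Petersburg"],
    "date": pd.to_datetime(
        ["2023-01-02", "2023-01-04", "2023-01-02", "2023-01-06", "2023-01-06"]
    ),
})
df["day_of_week"] = df["date"].dt.day_name()

week_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"]

counts = df.groupby(["city", "day_of_week"]).size().unstack(fill_value=0)

# Keep only the weekdays that occur in the data, in calendar order
present = [day for day in week_order if day in counts.columns]
counts = counts[present]

# Share of each city's plays per weekday, so cities of different size are comparable
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

Only weekdays that occur in the data are kept, so the table has no all-zero columns, which would make a later chi-squared test fail.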

Step 5: Genre Analysis on Specific Days and Times

Compare the genres listened to on Monday morning and Friday evening in both cities.

# Filter data for Monday morning and Friday evening
# (assumes 'time' holds zero-padded 'HH:MM:SS' strings; between() includes both endpoints)
monday_morning = data[(data['day_of_week'] == 'Monday') & (data['time'].between('06:00:00', '12:00:00'))]
friday_evening = data[(data['day_of_week'] == 'Friday') & (data['time'].between('18:00:00', '23:59:59'))]

# Group by city and genre
monday_genres = monday_morning.groupby(['city', 'genre']).size().unstack().fillna(0)
friday_genres = friday_evening.groupby(['city', 'genre']).size().unstack().fillna(0)

# Plot the genre distribution
monday_genres.plot(kind='bar', figsize=(12, 6), title='Genres on Monday Morning')
friday_genres.plot(kind='bar', figsize=(12, 6), title='Genres on Friday Evening')
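
With many genres, a bar chart of every genre is hard to read, so a short ranking per city may be more useful. The helper below, top_genres, is a hypothetical function written for this illustration; it assumes the 'time' column holds zero-padded 'HH:MM:SS' strings and uses a half-open window [start, end).

```python
import pandas as pd

# Hypothetical sample of plays
df = pd.DataFrame({
    "city": ["Moscow", "Moscow", "Moscow", "Saint-Petersburg", "Saint-Petersburg"],
    "genre": ["pop", "pop", "rock", "rock", "pop"],
    "day_of_week": ["Monday", "Monday", "Monday", "Monday", "Friday"],
    "time": ["07:30:00", "11:59:59", "13:00:00", "08:00:00", "19:00:00"],
})

def top_genres(frame, day, start, end, n=3):
    """Return the n most played genres per city for one day and time window."""
    # Zero-padded HH:MM:SS strings compare in chronological order
    in_window = (
        (frame["day_of_week"] == day)
        & (frame["time"] >= start)
        & (frame["time"] < end)
    )
    counts = (
        frame[in_window]
        .groupby(["city", "genre"])
        .size()
        .rename("plays")
        .reset_index()
    )
    # Sort by plays within each city, then keep the first n rows per city
    counts = counts.sort_values(["city", "plays"], ascending=[True, False])
    return counts.groupby("city").head(n).reset_index(drop=True)

monday_top = top_genres(df, "Monday", "06:00:00", "12:00:00")
print(monday_top)
```

The same call with "Friday", "18:00:00" and "24:00:00" would cover the evening window, since every valid time string sorts before "24:00:00".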

Step 6: General Genre Preferences

Compare the overall genre preferences between Moscow and St. Petersburg.

# Group by city and genre for overall preferences
overall_genres = data.groupby(['city', 'genre']).size().unstack().fillna(0)

# Plot the overall genre preferences
overall_genres.plot(kind='bar', figsize=(12, 6), title='Overall Genre Preferences')
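
Because the two cities may have very different numbers of plays, comparing shares rather than raw counts is usually more informative. A minimal sketch with pd.crosstab() and a made-up sample:

```python
import pandas as pd

# Hypothetical sample: four plays in Moscow, two in Saint-Petersburg
df = pd.DataFrame({
    "city": ["Moscow"] * 4 + ["Saint-Petersburg"] * 2,
    "genre": ["pop", "pop", "rock", "dance", "pop", "rock"],
})

# Row-normalized table: each row sums to 1, so cities are compared by share
genre_share = pd.crosstab(df["city"], df["genre"], normalize="index")

# Difference in share (Moscow minus Saint-Petersburg), largest gaps first
gap = (
    genre_share.loc["Moscow"] - genre_share.loc["Saint-Petersburg"]
).sort_values(key=abs, ascending=False)

print(genre_share)
print(gap)
```

Genres with the largest absolute gap are the natural candidates to discuss when describing how the cities differ.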

Step 7: Hypothesis Testing

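Use chi-squared tests of independence to check whether the differences observed between the two cities are statistically significant. Each test asks whether the distribution of plays (across weekdays or across genres) is independent of the city.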

from scipy.stats import chi2_contingency

# Hypothesis 1: the distribution of activity across weekdays differs between the cities
contingency_table = activity.T
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f'Chi2: {chi2}, p-value: {p}')

# Hypothesis 2: genre preferences differ between the cities on Monday morning
# and on Friday evening (one test per time slot)
monday_contingency = monday_genres.T
friday_contingency = friday_genres.T

chi2_monday, p_monday, _, _ = chi2_contingency(monday_contingency)
chi2_friday, p_friday, _, _ = chi2_contingency(friday_contingency)

print(f'Monday Chi2: {chi2_monday}, p-value: {p_monday}')
print(f'Friday Chi2: {chi2_friday}, p-value: {p_friday}')

# Hypothesis 3: overall genre preferences differ between the cities
overall_contingency = overall_genres.T
chi2_overall, p_overall, _, _ = chi2_contingency(overall_contingency)
print(f'Overall Chi2: {chi2_overall}, p-value: {p_overall}')
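
Two practical points apply to these tests: chi2_contingency() raises an error when a row or column of the table is entirely zero, and with large samples even tiny differences produce small p-values. The sketch below, on made-up counts, drops empty rows and columns first and adds Cramér's V as an effect size. The effect size is an addition for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are cities, columns are genres
table = pd.DataFrame(
    {"pop": [120, 60], "rock": [80, 70], "dance": [50, 20], "folk": [0, 0]},
    index=["Moscow", "Saint-Petersburg"],
)

# An all-zero row or column gives zero expected frequencies, which
# chi2_contingency rejects with a ValueError, so remove them first
table = table.loc[table.sum(axis=1) > 0, table.sum(axis=0) > 0]

chi2, p, dof, expected = chi2_contingency(table)

# Cramér's V: effect size between 0 (no association) and 1 (perfect association)
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}, Cramér's V={cramers_v:.3f}")
```

A small p-value combined with a small Cramér's V suggests a difference that is statistically detectable but minor in practice.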

Conclusion

Based on the analysis and hypothesis testing, we can draw conclusions about the musical preferences and behavior of users in Moscow and St. Petersburg. For each chi-squared test, a p-value below the chosen significance level (commonly 0.05) indicates that the distribution of plays differs between the two cities; a larger p-value means the data do not show a significant difference.
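
Since four tests are run on the same data, one cautious option is to apply a Bonferroni correction before declaring a result significant. The p-values in this sketch are placeholders for illustration only, not results from the dataset.

```python
alpha = 0.05

# Placeholder p-values for illustration only, not results from the dataset
p_values = {
    "activity by weekday": 0.001,
    "Monday morning genres": 0.03,
    "Friday evening genres": 0.20,
    "overall genres": 0.004,
}

# Bonferroni correction: divide the significance level by the number of tests
threshold = alpha / len(p_values)

decisions = {name: p < threshold for name, p in p_values.items()}

for name, significant in decisions.items():
    verdict = "significant" if significant else "not significant"
    print(f"{name}: p={p_values[name]} -> {verdict} at threshold {threshold}")
```

The correction is conservative; with only a handful of tests it mainly guards against over-reading a single borderline p-value.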

Together, these steps cover data preparation, weekday activity, time-specific and overall genre preferences, and significance testing for the two cities.