Data Preparation

Load Data

Loading each CSV file separately, then concatenating the pandas DataFrames.
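
A minimal sketch of this loading step. The in-memory CSV text and the column values are made-up stand-ins (assumptions, not the real files); with the actual data, the file paths would be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two CSV files; real file paths would go here.
csv_a = io.StringIO("age,goout,Dalc,Walc\n15,2,1,1\n16,4,2,3\n")
csv_b = io.StringIO("age,goout,Dalc,Walc\n17,3,1,2\n18,5,3,4\n")

# Read each file into its own DataFrame.
frames = [pd.read_csv(f) for f in (csv_a, csv_b)]

# Stack the rows; ignore_index=True rebuilds a clean 0..n-1 index
# instead of repeating each file's own row numbers.
students = pd.concat(frames, ignore_index=True)
print(students.shape)
```

Without `ignore_index=True` the combined frame would keep duplicate index labels from each file, which can cause surprises in later positional or label-based lookups.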

Quick Look at Dataset

I looked at histograms, the format of the attributes, and other summary information. I added Talc, the total alcohol consumption through the week, by summing Dalc and Walc, and then dropped Dalc and Walc.
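
The Talc step could look roughly like the following. The tiny DataFrame is invented for illustration; only the column names Dalc, Walc, Talc, and goout come from the notes above.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up).
students = pd.DataFrame({
    "Dalc": [1, 2, 1, 3],
    "Walc": [1, 3, 2, 4],
    "goout": [2, 4, 3, 5],
})

# Total consumption is the sum of the two original consumption columns.
students["Talc"] = students["Dalc"] + students["Walc"]

# Drop the originals so they cannot leak the target into the features.
students = students.drop(columns=["Dalc", "Walc"])
print(students)
```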

Create Test Set

I used a stratified shuffle split on the 'goout' attribute, as it had the strongest correlation with the 'Talc' attribute.
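
A sketch of such a split with scikit-learn's `StratifiedShuffleSplit`. The synthetic data, the 20% test size, and the random seed are assumptions chosen for the example, not values taken from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic data: 20 students for each 'goout' level from 1 to 5.
rng = np.random.default_rng(42)
students = pd.DataFrame({
    "goout": np.repeat([1, 2, 3, 4, 5], 20),
    "Talc": rng.integers(2, 11, size=100),
})

# One split, holding out 20% of the rows (assumed size).
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# split() yields positional indices; stratify on the 'goout' column.
train_idx, test_idx = next(splitter.split(students, students["goout"]))
train_set = students.iloc[train_idx]
test_set = students.iloc[test_idx]

# Each 'goout' level keeps the same share in the test set as in the full data.
print(test_set["goout"].value_counts(normalize=True).sort_index())
```

Stratifying keeps the distribution of 'goout' in the test set close to that of the full dataset, which a purely random split does not guarantee on a small sample.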

Gaining Insights on the Data

Talc (Total Alcohol Consumption) was most correlated with the attributes 'goout' and 'studytime'.
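
One way to rank attributes by their correlation with Talc is sketched below. The data is invented, and sorting by absolute value is an assumption made here so that strong negative correlations are not pushed to the bottom of the list.

```python
import pandas as pd

# Toy data (made up); only the column names follow the notes above.
students = pd.DataFrame({
    "goout": [1, 2, 3, 4, 5, 5],
    "studytime": [4, 3, 3, 2, 1, 1],
    "age": [15, 16, 15, 17, 16, 18],
    "Talc": [2, 3, 4, 6, 8, 9],
})

# Pearson correlation of every numeric column with the target.
corr_with_talc = students.select_dtypes("number").corr()["Talc"].drop("Talc")

# Order by strength of the relationship, keeping the sign for reading.
order = corr_with_talc.abs().sort_values(ascending=False).index
ranked = corr_with_talc.reindex(order)
print(ranked)
```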

Experimenting with Attribute Combinations

I couldn't find any combinations that had a significant correlation with the Total Alcohol Consumption, so I did not add any attributes.

Data Cleaning

There are no missing values and no other cleaning is needed.

Text and Categorical Attributes

I will be using the OneHotEncoder to encode the categories.
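
A small sketch of one-hot encoding with scikit-learn. The columns 'sex' and 'school' and their values are hypothetical examples of categorical attributes, and `handle_unknown="ignore"` is an assumed setting rather than one stated in the notes.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical columns for illustration.
cats = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "school": ["GP", "MS", "GP"],
})

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a SciPy sparse matrix by default.
encoded = encoder.fit_transform(cats)

print(encoder.categories_)  # categories learned for each column
print(encoded.toarray())    # dense view, one binary column per category
```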

Creation of Pipelines

Training

Evaluating on Test Set

Conclusion

The correlation matrix suggests that how students spend their time is the main factor in how much they drink. The feature importances show that how often a student goes out, along with their age, final grade, and absences, are the strongest predictors of how much the student will drink.

The random forest regression model with the parameters {'bootstrap': False, 'max_features': 4, 'n_estimators': 30} was the best model tested for predicting a student's alcohol consumption.
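
The parameter dictionary above has the shape of a grid search result, so the sketch below assumes the model was tuned with `GridSearchCV`; the notes do not confirm this. The synthetic data, the grid values, the 3-fold cross-validation, and the scoring metric are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data: 6 features, the first two carry the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=120)

# Two sub-grids: default bootstrapping, and bootstrap turned off.
param_grid = [
    {"n_estimators": [10, 30], "max_features": [2, 4]},
    {"bootstrap": [False], "n_estimators": [10, 30], "max_features": [2, 4]},
]

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

# Best combination found, in the same dictionary form as quoted above.
print(search.best_params_)

# Importances of the refitted best model, one value per input feature.
importances = search.best_estimator_.feature_importances_
print(importances)
```

Feature importances from a random forest describe how much each attribute helped this particular model, so they are best read as a ranking of predictive usefulness rather than as proof of cause.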

These conclusions assume that the data accurately represents the population; otherwise, they hold only for this dataset.