Price Prediction of Avocados

Data Preparation

Quick Look at Dataset

A first pass over the data: looking at histograms, the format of each attribute, and other summary checks.
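A first look of this kind can be sketched with pandas. This is only an illustration on a tiny made-up frame: the "Date" and "type" column names and all values are assumptions, not taken from the actual dataset.

```python
import pandas as pd

# Tiny stand-in frame. The values are invented, and the "Date" and "type"
# column names are assumptions made for illustration only.
df = pd.DataFrame({
    "Date": ["2017-01-01", "2017-06-04", "2017-12-31"],
    "AveragePrice": [1.10, 1.45, 0.98],
    "Total Volume": [1000.0, 2500.0, 1800.0],
    "type": ["conventional", "organic", "conventional"],
})

print(df.head())           # first rows
df.info()                  # dtype and non-null count per attribute
summary = df.describe()    # count, mean, std, quartiles for numeric attributes
print(summary)

type_counts = df["type"].value_counts()  # category frequencies
print(type_counts)

# With matplotlib installed, df.hist(bins=50) would draw one histogram
# per numeric attribute.
```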

Create Train and Test Sets

I shuffled the dataset and split it into two parts, a training set and a test set, according to fixed train and test ratios.
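A shuffled split like this could be sketched with sklearn's train_test_split. The 80/20 ratio, the random_state, and the stand-in data are assumptions for illustration; the report does not state the ratios that were used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with 100 rows; values are made up.
df = pd.DataFrame({
    "AveragePrice": np.linspace(0.5, 2.5, 100),
    "Total Volume": np.arange(100, dtype=float),
})

# test_size=0.2 and random_state=42 are assumed values, not from the report.
train_set, test_set = train_test_split(
    df, test_size=0.2, shuffle=True, random_state=42
)

print(len(train_set), len(test_set))
```

Fixing random_state keeps the split reproducible between runs, so the test set stays unseen while the models are tuned.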

Gaining Insights on the Data

Total Volume is the sum of the three PLU attributes: 4046, 4225, and 4770. For each PLU attribute, I created a new attribute giving its share of Total Volume. The new 4046_ratio attribute turned out to be the attribute most strongly correlated with AveragePrice.
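The ratio attributes and the correlation check could look roughly like the sketch below. The numbers are invented so that the example runs on its own; the 4225_ratio and 4770_ratio names are assumed by analogy with 4046_ratio, and the toy data says nothing about the real correlations.

```python
import pandas as pd

# Made-up rows in which Total Volume equals 4046 + 4225 + 4770.
df = pd.DataFrame({
    "AveragePrice": [1.60, 1.40, 1.20, 1.00],
    "Total Volume": [1000.0, 2000.0, 1500.0, 3000.0],
    "4046": [100.0, 400.0, 450.0, 1200.0],
    "4225": [500.0, 1100.0, 525.0, 1200.0],
    "4770": [400.0, 500.0, 525.0, 600.0],
})

# Share of the total volume contributed by each PLU code.
for plu in ["4046", "4225", "4770"]:
    df[f"{plu}_ratio"] = df[plu] / df["Total Volume"]

# Correlation of every attribute with the target, ranked by absolute strength.
corr = df.corr()["AveragePrice"].drop("AveragePrice")
ranked = corr.abs().sort_values(ascending=False)
print(ranked)
```

Ranking by absolute value matters here because a strong negative correlation is just as informative as a strong positive one.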

Data Cleaning

Missing and null values are replaced with the mean value of their attribute.
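Mean imputation of this kind can be sketched with sklearn's SimpleImputer. Whether the original work used SimpleImputer or plain pandas is not stated, so treat the choice of tool, along with the sample values, as an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numeric attributes with one missing value each.
df_num = pd.DataFrame({
    "Total Volume": [1000.0, np.nan, 3000.0],
    "4046": [100.0, 200.0, np.nan],
})

# Learn each column's mean and use it to fill that column's missing values.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(
    imputer.fit_transform(df_num),
    columns=df_num.columns,
    index=df_num.index,
)

print(filled)
```

Fitting the imputer on the training set only, then reusing it on the test set, keeps test-set statistics out of the training process.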

I converted the date to a month to get a more accurate representation of which season each purchase was made in.
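Extracting the month from a date could be done as in this small sketch. The "Date" column name, the new "month" column name, and the date format are assumptions.

```python
import pandas as pd

# Made-up dates; the "Date" column name and format are assumed.
df = pd.DataFrame({"Date": ["2017-01-01", "2017-06-04", "2017-12-31"]})

# Parse the strings as datetimes and keep only the month number (1 to 12).
df["month"] = pd.to_datetime(df["Date"]).dt.month

print(df)
```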

Text and Categorical Attributes

I use sklearn's OneHotEncoder to encode the categorical attributes.
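As a minimal sketch of one-hot encoding, assuming a single categorical column named "type" with made-up values:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical attribute; the "type" column name is an assumption.
df_cat = pd.DataFrame({"type": ["conventional", "organic", "conventional"]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df_cat)  # sparse matrix by default
dense = encoded.toarray()

print(encoder.categories_)
print(dense)
```

Each category becomes its own 0/1 column, which avoids implying an order between categories the way integer codes would.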

Scaling and Pipeline

I scaled all the values with sklearn's StandardScaler.

I also created a numerical (num) pipeline and a categorical (cat) pipeline, along with a full pipeline combining the two.
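One way such a setup could be arranged is sketched below, using sklearn's Pipeline and ColumnTransformer. The report does not say how the two pipelines were combined, so ColumnTransformer, the step names, the column lists, and the sample data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data; the "type" column name is assumed.
df = pd.DataFrame({
    "Total Volume": [1000.0, np.nan, 3000.0, 2000.0],
    "4046_ratio": [0.1, 0.2, 0.3, 0.4],
    "type": ["conventional", "organic", "conventional", "organic"],
})
num_attribs = ["Total Volume", "4046_ratio"]
cat_attribs = ["type"]

# Numerical pipeline: fill missing values with the mean, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Categorical pipeline: one-hot encode the categories.
cat_pipeline = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Full pipeline: route each group of columns to its own pipeline.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])

prepared = full_pipeline.fit_transform(df)
print(prepared.shape)
```

Keeping every preprocessing step inside one object means the exact same transformations can be applied to the test set with a single transform call.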

Training

The random forest regression model had the lowest RMSE, so I moved forward with that model.
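A comparison of models by RMSE could be sketched as follows. The report names only the random forest, so the other two candidates (linear regression and a decision tree), the synthetic data, and the hold-out evaluation are assumptions chosen for illustration; the ranking on this toy data is not meant to reproduce the result above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the prepared avocado features.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 3))
y = 1.0 + np.sin(4 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.05, size=200)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Candidate models; only the random forest is named in the report.
models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model and record its root mean squared error on held-out data.
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, preds)))

best = min(rmse, key=rmse.get)
print(rmse)
print("lowest RMSE:", best)
```

RMSE is in the same units as the target, so on the real data it reads directly as a typical error in AveragePrice.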

Conclusion

The random forest regression model worked best with this dataset. The main factors (in order of importance) that go into the price of an avocado are:

All conclusions drawn in this report rely on the assumption that this dataset is representative of all large orders of avocados. If that is not true, the conclusions should be treated with caution.