Photo by Markus Spiske on Unsplash

I recently completed a Kaggle competition with Gary Lu. We studied a dataset of Boston housing prices and built prediction models. Our result placed within the top 9%.

Soon after getting started, I noticed that the dataset has many features (80 in total: 37 continuous and 43 categorical). Making sense of all of them definitely paves a concrete path to feature engineering and model building. However, there are many ways to do that, and I was confused about the ‘right’ way for a while…

After some trial and error I feel that though there’s no absolute…


I am interested in this topic too. Before the class, I had not realized that an interaction term can be applied between continuous and categorical variables.

It seems that by doing this, we free up the continuous variable (minute) here. In practice, if there are multiple categorical variables, should we apply multiple interaction terms? I am not quite sure about the impact... looking forward to learning more :)
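To make the idea of an interaction between a continuous and a categorical variable concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the two plans "A" and "B", the noise-free cost numbers, and the helper names `solve` and `fit_ols` are all hypothetical, and only the variable name `minute` comes from the comment above. It fits y = b0 + b1*minute + b2*d + b3*(minute*d), where d is a 0/1 dummy, and shows that each category ends up with its own slope.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical, noise-free data: each plan has its own intercept and slope.
def true_cost(minute, plan):
    return 5 + 0.10 * minute if plan == "A" else 8 + 0.25 * minute


data = [(minute, plan) for plan in ("A", "B") for minute in (10, 20, 30, 40, 50)]
y = [true_cost(minute, plan) for minute, plan in data]

# Design matrix columns: intercept, minute, dummy d (1 for plan B), minute * d.
rows = []
for minute, plan in data:
    d = 1.0 if plan == "B" else 0.0
    rows.append([1.0, float(minute), d, minute * d])

b0, b1, b2, b3 = fit_ols(rows, y)
slope_a = b1        # slope of minute for the baseline category (plan A)
slope_b = b1 + b3   # the interaction coefficient shifts the slope for plan B
print(round(slope_a, 4), round(slope_b, 4))
```

With more than one categorical variable, the same pattern would repeat: one extra `minute * dummy` column per non-baseline level. Whether all of those terms are worth including is a modeling judgment that this sketch does not settle.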


Photo by Andrew Bain on Unsplash

Business owners commonly understand that it is exhausting to treat all customers exactly the same; instead, it is wise to focus on star customers. This is true, as star customers are the ones who bring the biggest chunk of profit to the business. This raises a question: who are they?

Before getting to know them, we need to know what makes a customer a ‘star’. So first let us familiarize ourselves with the concept of ‘CLV’: Customer Lifetime Value, which serves as the ‘star classifier’. …


Photo by Hush Naidoo on Unsplash

Recently, we built a staffing assignment model for a hypothetical clinic. The goal is to give clinic managers a tool for assigning staff to each shift at the lowest cost.

The model is built with the Excel-based OpenSolver. I found that although the Excel model is easy to read and operate, it lacks flexibility, especially when we want to add new staff or adjust the current staff count.

So I couldn’t help asking: can this be achieved with code? Google quickly pointed me to the Gurobipy Python package:

https://www.gurobi.com/documentation/9.1/quickstart_mac/cs_grbpy_the_gurobi_python.html

Users can set up a Gurobi model by adding…
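As a rough illustration of why a coded model can be more flexible than a spreadsheet, here is a minimal sketch in plain Python. It deliberately does not use Gurobi or OpenSolver: it enumerates every assignment by brute force, which only works for tiny examples. The staff names, costs, availabilities, shift names, and the helper `cheapest_assignment` are all hypothetical, and the rule that each person works at most one shift is an assumption, not the clinic model described above.

```python
from itertools import product

# Hypothetical input data: cost per shift and the shifts each person can work.
staff = {
    "Ann": {"cost": 30, "can_work": ["morning", "evening"]},
    "Bob": {"cost": 25, "can_work": ["morning"]},
    "Cat": {"cost": 40, "can_work": ["evening"]},
    "Dan": {"cost": 20, "can_work": ["morning", "evening"]},
}
required = {"morning": 1, "evening": 1}  # minimum headcount per shift


def cheapest_assignment(staff, required):
    """Try every assignment (each person works one shift or none), keep the cheapest feasible one."""
    names = list(staff)
    options = [[None] + staff[name]["can_work"] for name in names]
    best_cost, best_plan = None, None
    for choice in product(*options):
        counts = {shift: 0 for shift in required}
        for shift in choice:
            if shift is not None:
                counts[shift] += 1
        if any(counts[shift] < required[shift] for shift in required):
            continue  # some shift is under-staffed
        cost = sum(staff[n]["cost"] for n, s in zip(names, choice) if s is not None)
        if best_cost is None or cost < best_cost:
            best_cost = cost
            best_plan = {n: s for n, s in zip(names, choice) if s is not None}
    return best_cost, best_plan


base_cost, base_plan = cheapest_assignment(staff, required)

# Adding a new staff member is a one-line data change, with no model rewiring.
bigger_staff = dict(staff, Eve={"cost": 10, "can_work": ["evening"]})
new_cost, new_plan = cheapest_assignment(bigger_staff, required)
print(base_cost, base_plan)
print(new_cost, new_plan)
```

A real solver replaces the enumeration loop with decision variables, constraints, and an objective, but the input data can stay in the same dictionary form.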


In a recent school project, we were asked to predict the ‘demand’ of the Airbnb market against changes in listing prices with a scatter plot, namely, to show how much percentage change in demand can be expected for every 10% increase in Airbnb prices.

With the 2009–2019 Airbnb listing data (each row is a unique listing) at hand, our study group came up with 3 ideas, one of which was to count the number of listings at each price and then make a scatter plot (# of listings vs. price). This may not be a perfect representation of demand but…
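The counting idea can be sketched in a few lines of standard-library Python. The listing rows below are made up for illustration, and the field names `id` and `price` as well as the helper `listings_per_price` are hypothetical; the real dataset's columns may differ. The output is the list of (price, number of listings) pairs that would feed the scatter plot.

```python
from collections import Counter

# Hypothetical sample: each row is one unique listing.
listings = [
    {"id": 1, "price": 100},
    {"id": 2, "price": 100},
    {"id": 3, "price": 120},
    {"id": 4, "price": 100},
    {"id": 5, "price": 150},
    {"id": 6, "price": 120},
]


def listings_per_price(listings):
    """Return (price, number_of_listings) pairs sorted by price."""
    counts = Counter(row["price"] for row in listings)
    return sorted(counts.items())


points = listings_per_price(listings)
print(points)  # [(100, 3), (120, 2), (150, 1)]
```

Each pair becomes one point on the plot, with price on one axis and the listing count on the other.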


This is a summary of some visualization methods that can be used for different types of data. I tried out multiple visualization practices during my first two EDA attempts, so here I summarize my trials for future reference.

Here are my two attempts:
Who, Where and What — 2019 NYC Airbnb Analysis

https://www.kaggle.com/juliayyy/who-where-and-what-2019-nyc-airb-b-analysis

EDA, Visualization & NLP on US Data Analyst Jobs

https://www.kaggle.com/juliayyy/eda-visualization-nlp-on-us-data-analyst-jobs

I mainly used the 3 packages below:

  1. Matplotlib
  2. Seaborn
  3. plotly.express

In general, I felt that the core of visualization is always about making sense. Though a pretty look makes a report more readable, we’d better also be cautious…

Julia Yang

Master of Business Analytics of UBC. www.linkedin.com/in/julia-yuxiao-yang/
