Directed Study: EDM and LA: Classification

Much of my focus this week has been on methods of classification. The methods I have reviewed so far include two that are based in regression (step regression and logistic regression), plus decision trees.

First, a disclaimer: I am a stats beginner! A lot of this stuff is going over my head. I'm trying to get the gist of these different statistical analyses by turning to other resources in addition to the EDM MOOC I am following. I feel like I am getting a decent understanding of things, but I still have questions about particulars, and I am sure I've missed something here and there. Take my explanations of this week's analyses below with a grain of salt!

Classification is good when you want to sort your data into groups that share patterns. For example, as you study the data, you notice that those who succeed in class have shared data patterns (similar number of discussion posts, similar amount of time spent on assignments, etc.), while those who tend to drop out have their own shared data patterns. You could set up a classification algorithm built from a previous data set to sort data in a new data set into the two groups (those who succeed and those who tend to drop out). Those with X or more page views would go into one group while those with fewer than X would go into the other. You could then check your algorithm's accuracy rate against what actually happened to see if it is good enough to use on new data sets, or if it needs to be adjusted (like changing the number of page views required to be classified as a successful student).
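
To make the page-view idea concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the student numbers, the cut-off values, and the function names `classify` and `accuracy` are my own assumptions, not anything from the MOOC.

```python
# Hypothetical past data: (page_views, actually_succeeded) for six students.
past_students = [
    (120, True), (95, True), (30, False),
    (60, False), (80, True), (45, True),
]

def classify(page_views, cutoff):
    """Predict success (True) when page views reach the cut-off."""
    return page_views >= cutoff

def accuracy(students, cutoff):
    """Fraction of students whose predicted group matches what really happened."""
    correct = sum(
        1 for views, succeeded in students
        if classify(views, cutoff) == succeeded
    )
    return correct / len(students)

# Try a few cut-offs to see which one sorts the old data best.
for cutoff in (50, 70, 90):
    print(cutoff, accuracy(past_students, cutoff))
```

With this toy data, the middle cut-off sorts the most students correctly, which is the kind of adjustment described above.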

Remember kindergarten, where you are given a shape and you have to find all the other shapes that match it? When building an algorithm, it's like you are making a model shape that the data for a specific category or group should look like. Maybe your successful group's data should look like a star, while the data for those likely to drop out looks like a moon. When you use statistical analysis, it's like you are taking the data shape a student makes and comparing it to your two groups' models, asking "Is this student's data shape closer to a star or closer to a moon?"

Step regression and logistic regression are two types of analysis that determine whether one student's data matches better to one group's predetermined value or another group's. Both are binary types of analysis (you are dealing with two groups). With step regression (not step-wise regression!), you set a cut-off value that determines whether a data point will be classified as a 0 or 1 (0 means it belongs in one group and 1 means it belongs in the other). The data point is made up of all the variables you are analyzing. You assign a weight to each variable. Maybe you want to give more precedence to time on task for a week than to how many emails the student sent in a week. You give a higher weight value to time on task, and a lesser value to emails sent. In the end, you would build an algorithm that looks something like this: Y = 0.2a + 0.5b + 0.3c - 0.1d, where Y is the combined value of all the variables in a data point, a, b, c, and d are the values of your different variables, and 0.2, 0.5, 0.3, and -0.1 are your assigned weight values. After applying the algorithm to the data point, you take the score of Y and compare it to the cut-off value you set. If Y is greater than the cut-off, the data point goes in one group. If it is less, it goes in the other.
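
Here is a rough sketch of that weighted-sum-plus-cut-off idea in Python. The weights come from the example formula above; the cut-off of 0.5, the sample student values, and the names `score` and `step_classify` are assumptions I made up for illustration.

```python
# Weights from the example formula Y = 0.2a + 0.5b + 0.3c - 0.1d.
WEIGHTS = {"a": 0.2, "b": 0.5, "c": 0.3, "d": -0.1}
CUTOFF = 0.5  # assumed cut-off value, chosen only for this example

def score(point):
    """Combine the variables of one data point into a single value Y."""
    return sum(WEIGHTS[name] * point[name] for name in WEIGHTS)

def step_classify(point, cutoff=CUTOFF):
    """Return 1 if Y is above the cut-off, otherwise 0."""
    return 1 if score(point) > cutoff else 0

# Hypothetical student: the values for a, b, c, d are invented.
student = {"a": 1.0, "b": 0.8, "c": 0.5, "d": 2.0}
print(score(student), step_classify(student))
```

For this invented student, Y works out to roughly 0.55, which is above the assumed cut-off, so the student lands in group 1.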

With logistic regression, you determine the frequency or odds of a specific value of a dependent variable. Rather than a simple multiplication and addition problem like step regression, logistic regression does something more complex: p(m) = 1/(1 + e^(-m)). I don't even know what the function means, but there it is. By applying the function to a value m computed from your variables, you'd be able to compute whether the data point belongs to one group or another (even a trip to other resources, like Wikipedia, couldn't explain this better for me).
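
Typing the function into Python at least shows what it does to a number. This is only a sketch: the names `logistic` and `logistic_classify` and the 0.5 cut-off are my own assumptions, and I am assuming m is a combined score like Y in step regression.

```python
import math

def logistic(m):
    """p(m) = 1 / (1 + e^(-m)): squeezes any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-m))

def logistic_classify(m, cutoff=0.5):
    """Assumed rule: call it group 1 when the probability reaches the cut-off."""
    return 1 if logistic(m) >= cutoff else 0

# Large negative m gives a value near 0, m = 0 gives 0.5, large positive m gives a value near 1.
for m in (-4, -1, 0, 1, 4):
    print(m, round(logistic(m), 3), logistic_classify(m))
```

So whatever m is, the output always lands between 0 and 1, which is why it can be read as a probability of belonging to one group.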

One problem with either of these approaches is that they do not take into account interaction effects. Baker gives this example to explain the problem: You have an algorithm made up of a value for the effectiveness of a piece of educational software and a value for how on task a student is. Bad educational software is bad and being off task isn't any good, but maybe being off task while you are supposed to be using bad educational software is a good thing if you are doing something more productive. The combination of certain values of your variables could lead to a need for more than two groups in a case like the one Baker describes. This is where decision trees can be handy.

Decision trees are another classification approach, one that allows you to apply different rules to your data set based on the values of your variables, so you can obtain a variety of outcomes. For instance (another Baker example), you are trying to predict if a student will get quiz answers right or wrong. If a student taking the quiz doesn't have a lot of knowledge on a topic, but they spend little time answering the question, they tend to get the answer right. This is one group or outcome. If they don't have a lot of knowledge on the quiz topic and spend a lot of time trying to answer, they tend to get the answer wrong. On the flip side, if the student has a lot of knowledge on the quiz topic, maybe time spent isn't an appropriate variable to look at, but rather how many actions are taken to complete an answer. Perhaps a certain range of number of actions tends to lead to correct answers while a different range tends to lead to wrong answers. Using decision trees, you can have different paths that lead to your categories based on the values of specific variables.
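
The quiz example can be written out as a small set of nested if-statements, which is basically what a decision tree is. In this sketch the thresholds (0.5 for knowledge, 6 seconds, 4 actions) and the function name `predict_quiz_answer` are placeholders I invented; a real tree would learn its split points from data.

```python
def predict_quiz_answer(knowledge, seconds_spent, num_actions):
    """Follow one path through a hand-written tree and return 'right' or 'wrong'."""
    if knowledge < 0.5:  # assumed threshold for "not a lot of knowledge"
        # Low-knowledge branch: time spent decides the outcome.
        return "right" if seconds_spent < 6 else "wrong"
    # High-knowledge branch: time is ignored, number of actions decides.
    return "right" if num_actions < 4 else "wrong"

print(predict_quiz_answer(0.3, 4, 10))   # low knowledge, quick answer
print(predict_quiz_answer(0.3, 20, 1))   # low knowledge, slow answer
print(predict_quiz_answer(0.9, 20, 2))   # high knowledge, few actions
print(predict_quiz_answer(0.9, 4, 7))    # high knowledge, many actions
```

Notice that each branch looks at a different variable, which is how a tree can pick up on interaction effects that a single weighted sum would miss.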

With regard to the data analysis project I am working on, I can see classification methods being very important to what I am doing. I am gathering a number of different types of data that can be used to build algorithms that determine whether a student is engaged or disengaged. I don't really see interaction effects having much bearing on the data I am collecting, but it is hard to say until I see the data. I would probably start with step regression or logistic regression (if I figure out how to use it!). If the different combinations of variables turn out to signify greater complexity, I might try decision trees.

Tuesday, January 28, 2014

Classification
