Text Data

There are three labels in the text data: 'text1', 'text2', and 'text3'.

Decision Tree 1

In this tree, if the parent node's value is [2,2,1] and the frequency of the word stem in column 103, 'concern', is not more than 0.5, the node has value [2,1,0]. If the parent node's value is [2,2,1] and the frequency of the word stem 'concern' is more than 0.5, the node has value [0,2,1]. If the parent node's value is [0,2,1] and the frequency of the word stem 'level' is not more than 0.5, the node has value [1,1,0]. If the parent node's value is [1,1,2] and the frequency of the word stem 'level' is more than 0.5, the node has value [1,0,0].

Decision Tree 2

In this tree, if the parent node's value is [1,2,2] and the frequency of the word stem 'undergradu' is not more than 0.5, the node has value [1,0,1]. If the parent node's value is [1,2,2] and the frequency of the word stem 'undergradu' is more than 0.5, the node has value [2,0,1]. If the parent node's value is [2,0,1] and the frequency of the word stem 'analyt' is not more than 0.5, the node has value [0,1,1]. If the parent node's value is [2,0,1] and the frequency of the word stem 'analyt' is more than 0.5, the node has value [0,1,0].

A decision tree predicts the label of the testing data based on what it learned from the training data. For the text data, the trees indicate the most likely label of a randomly chosen article based on the word frequencies in the other two articles.
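As a rough illustration of how the node values described above arise, the sketch below splits a small set of labeled samples on the frequency of one word stem with a threshold of 0.5 and counts the samples per label on each side. The toy rows, the `LABELS` order, and the helper names `class_counts` and `split_node` are assumptions made for this example; they are not the data or the code used for the trees on this page.

```python
from collections import Counter

# Assumed label order for the value vectors: [text1, text2, text3].
LABELS = ["text1", "text2", "text3"]


def class_counts(rows):
    """Return the number of samples per label, in the order of LABELS."""
    counts = Counter(label for _, label in rows)
    return [counts[label] for label in LABELS]


def split_node(rows, stem, threshold=0.5):
    """Split rows on the frequency of one word stem.

    Rows with frequency <= threshold go to the left child,
    all other rows go to the right child.
    """
    left = [r for r in rows if r[0].get(stem, 0) <= threshold]
    right = [r for r in rows if r[0].get(stem, 0) > threshold]
    return left, right


# Hypothetical toy samples: (stem frequencies, label).
rows = [
    ({"concern": 0, "level": 1}, "text1"),
    ({"concern": 0, "level": 0}, "text1"),
    ({"concern": 0, "level": 0}, "text2"),
    ({"concern": 2, "level": 1}, "text2"),
    ({"concern": 1, "level": 0}, "text3"),
]

left, right = split_node(rows, "concern")
parent_value = class_counts(rows)
left_value = class_counts(left)
right_value = class_counts(right)
print(parent_value, left_value, right_value)
```

In this sketch the two child value vectors always add up to the parent's value vector, because every sample in the parent goes to exactly one child.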

Random Forest

The random forest contains 5 individual decision trees. Every decision tree provides a prediction based on the training data. The value with the most votes is [1,0,0]. Therefore, the article is most likely to be text1.

A random forest, on the other hand, is generally a more accurate prediction tool than a single tree. It contains several uncorrelated individual decision trees, and each tree predicts a label. The label with the majority of the votes becomes the prediction of the random forest, which helps reduce predictive errors. In the code, the result shows that the value with the most votes is [1,0,0]. So the random forest's most likely predicted label is the one with value [1,0,0].
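The majority vote can be sketched in a few lines. The five votes below are hypothetical, since only the winning value is reported here, and `majority_vote` is a helper name made up for this example.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most trees and its vote count."""
    label, votes = Counter(predictions).most_common(1)[0]
    return label, votes


# Hypothetical predictions from five trees (not the actual tree outputs).
tree_predictions = ["text1", "text2", "text1", "text1", "text3"]
winner, votes = majority_vote(tree_predictions)
print(winner, votes)
```

If two labels tie, `Counter.most_common` returns the one that was seen first, so a real implementation may want an explicit tie-breaking rule.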

Confusion Matrix for Tree 1

Confusion Matrix for Tree 2

Important Features

Top ten most important features (most frequent word stems) in the text: 'Look', 'Produc', 'countri', 'factor', 'mani', 'thi', 'add', 'fact', 'suspend', 'map'.

Python Code

Download Code

Text Data

Text 1 Text 2 Text 3

Record Data

Decision Tree 1

There are two labels in the record data: 'yes' and 'no', which indicate whether the university is expensive for most people in the US.

In this decision tree, when the accept rate of a university is more than 0.5, it is not an expensive university. If the accept rate is less than 0.5 and the top 10 percentage is more than 0.53, it is an expensive university. If the accept rate is less than 0.5, the top 10 percentage is less than 0.53, and the enroll rate is less than 0.35, then the university is not expensive. If the top 10 percentage is less than 0.53 and the accept rate is more than 0.43, then the university is not expensive. If the accept rate is less than 0.5, the top 10 percentage is less than 0.53, and the enroll rate is less than 0.42, it is not an expensive university. However, if the accept rate is less than 0.5, the top 10 percentage is less than 0.53, and the enroll rate is not less than 0.42, the university is expensive.

Decision Tree 2

In this decision tree, when the accept rate is not less than 0.5, the university is not expensive. If the accept rate is less than 0.5 and the university is private, then it is an expensive university. If the accept rate is less than 0.5 but higher than 0.39, and the university is public, then it is not an expensive university. If the accept rate is less than 0.39 and it is a public university, it is an expensive university.
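The rules of this second tree can be written out as a small function. This is only a sketch of the rules as stated in the paragraph above, not the fitted R model; the function name `predict_expensive` and its parameters are hypothetical, and the handling of an accept rate of exactly 0.39 is an assumption because the text does not specify it.

```python
def predict_expensive(accept_rate, is_private):
    """Return 'yes' if the rules predict an expensive university, else 'no'."""
    # Accept rate not less than 0.5: not expensive, regardless of type.
    if accept_rate >= 0.5:
        return "no"
    # Accept rate below 0.5 and private: expensive.
    if is_private:
        return "yes"
    # Public university with an accept rate below 0.5.
    # Assumption: a rate of exactly 0.39 is treated as not expensive.
    return "no" if accept_rate >= 0.39 else "yes"


# Example calls with made-up universities.
print(predict_expensive(0.60, is_private=True))
print(predict_expensive(0.45, is_private=True))
print(predict_expensive(0.45, is_private=False))
print(predict_expensive(0.30, is_private=False))
```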

In the record data, the first tree predicts whether a university is expensive based on its accept rate, enroll rate, and top ten percentage. The second tree predicts whether a university is expensive based on its type (private or public) and accept rate. The confusion matrices show only one predictive error for decision tree 1 and two errors for decision tree 2. Overall, the decision trees generated by the R code are fairly accurate, so their predictions can be trusted.
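The number of predictive errors can be read off a confusion matrix as the sum of its off-diagonal cells. The sketch below shows that calculation; the matrix is hypothetical (chosen only so that it contains one error) because the test-set size is not stated here, and `count_errors` is a helper name made up for this example.

```python
def count_errors(matrix):
    """Return (number of errors, accuracy) for a square confusion matrix.

    Rows are actual labels and columns are predicted labels, so the
    diagonal holds the correct predictions.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return total - correct, correct / total


# Hypothetical matrix for labels ('no', 'yes'); not the actual results.
example_matrix = [
    [14, 1],  # actual 'no':  14 predicted 'no', 1 predicted 'yes'
    [0, 5],   # actual 'yes':  0 predicted 'no', 5 predicted 'yes'
]
errors, accuracy = count_errors(example_matrix)
print(errors, accuracy)
```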

Decision Tree 3

The decision tree suggests that universities with high tuition are more likely to offer better education quality, because people who graduated from an 'expensive' university are more likely to receive 'high' pay. About 88% of students who graduated from expensive universities receive 'high' pay, but only about 33% of students who graduated from non-expensive universities do.

Confusion Matrix for Tree 1

Confusion Matrix for Tree 2

Confusion Matrix for Tree 3

R Code

Download Code

R Code (for decision tree 3)

Download Code

Record Data

Types of Universities Accept Rate and Enroll Rate Dataset (tree 3) Label(if the university is expensive or not?) Label(if students can get high payments? -- for tree 3)

Conclusion

There is a negative relation between tuition and the accept rate of universities. Universities with a low accept rate are more likely to be expensive. If a university has an accept rate of more than 0.5, it is not an expensive university. However, the relation between the top ten percent and tuition is positive. Generally, universities with high ranks are more expensive than those without, and the data analysis confirms that top universities are expensive: universities with a top ten perc of more than 0.53 are expensive. Similarly, the relation between enroll rate and tuition is positive. Universities with a high enroll rate are more likely to be expensive. However, enroll rate is a less significant factor than the accept rate and top ten perc for predicting whether a university is expensive. For example, predicting whether a university is expensive from its enroll rate also requires its accept rate and top ten perc.

Also, nearly all private universities are expensive. If a public university's accept rate is less than 0.5 but more than 0.38, it is more likely to be predicted as a non-expensive university. On the other hand, if a public university's accept rate is less than 0.38, it is an expensive university. If the university is a private school with an accept rate below 0.5, the tree predicts it to be expensive. 92% of universities have a high accept rate, which is higher than 0.5. Nearly 88% of universities with an accept rate of less than 0.5 are expensive. However, among all national universities, only about 7% are expensive. But considering how many universities there are, expensive universities are not rare. In conclusion, there are many expensive universities. Normally, expensive universities have a high enroll rate and a high top 10 perc but a low accept rate. On the other hand, most non-expensive universities are public schools with a high accept rate but a comparatively low enroll rate and a low rank among all national universities.