I’m sorry it took a while for me to circle back to answering this question in written form.
1) The answer to this question is highly data-dependent. The rule of thumb for training data is: the more, the better. Of all the data you have, you would then set aside 70-80% for training and keep the rest for testing.
If there are clear patterns hidden in your data, then not much data is needed for training (as clearly seen in our sample dataset from the lectures). If you find that even a large amount of data does not produce satisfactory decision trees, perhaps the right attributes for prediction are not present in your data, or perhaps they are present but confounding effects are at play, in which case preprocessing the data might lead to better results.
The same idea applies to the appropriate number of attributes. If you have the “right” attributes, then you will not need many of them to predict the data. But, of course, real-life data is usually not perfect. Another thing to consider is that the more attributes you feed into the algorithm, the longer it takes to run and the more complex the resulting decision trees will be.
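As a rough illustration of the 70-80% training split from point 1, here is a minimal sketch in R. It assumes the built-in iris dataset as a stand-in (not our lecture data), a 75% training share, and an arbitrary seed; the names train_idx, train_set, and test_set are just ones I picked for this example.

```r
# Hold-out split sketch: 75% of the rows for training, the rest for testing.
library(rpart)  # the decision tree package we use in class

set.seed(42)    # arbitrary seed so the random split is reproducible
n <- nrow(iris) # iris is a built-in example dataset (assumption: stand-in data)

# Randomly pick 75% of the row indices for training.
train_idx <- sample(seq_len(n), size = floor(0.75 * n))

train_set <- iris[train_idx, ]   # rows used to grow the tree
test_set  <- iris[-train_idx, ]  # held-out rows, used only for evaluation

# Grow a classification tree on the training rows only.
fit <- rpart(Species ~ ., data = train_set, method = "class")

# Predict the held-out rows and measure the share predicted correctly.
pred <- predict(fit, newdata = test_set, type = "class")
accuracy <- mean(pred == test_set$Species)
print(accuracy)
```

If the accuracy on the held-out rows stays poor no matter how much data goes into training, that is usually the hint that the problem lies with the attributes rather than with the amount of data.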
2) The decision tree algorithm that we use in class in R (rpart) should be able to handle this. For other tools, programming languages, or software, you will need to read the documentation of that specific tool, language, or software to see whether it has this ability built in.