Recent research in the field of automatic music generation lacks rigorous and comprehensive evaluation methods, which creates plagiarism risks and leaves generation performance only partially understood. To contribute to evaluation methodology in this field, I first introduce the originality report for measuring the extent to which an algorithm copies from the input music. It starts by constructing a baseline that determines the extent to which human composers borrow from themselves and each other in an existing music corpus. I then apply the same analysis to the musical outputs of runs of the MAIA Markov and Music Transformer generation algorithms, and compare the results to the baseline. Results indicate that the originality of Music Transformer's output is below the 95% confidence interval of the baseline, while MAIA Markov stays within that interval. Second, I conduct a listening study to comparatively evaluate music generation systems along six musical dimensions: stylistic success, aesthetic pleasure, repetition or self-reference, melody, harmony, and rhythm. A range of models is used to generate 30-second excerpts in the style of Classical string quartets and classical piano improvisations. Fifty participants with relatively high musical knowledge rate unlabelled samples of computer-generated and human-composed excerpts. I use non-parametric Bayesian hypothesis testing to interpret the results. The results show that the strongest deep learning method, Music Transformer, performs equivalently to a non-deep learning method, MAIA Markov, and that a significant gap remains between any algorithmic method and human-composed excerpts. Third, I introduce six musical features (statistical complexity, transitional complexity, arc score, tonality ambiguity, time intervals, and onset jitters) and investigate their correlations with the collected ratings.
The results show that human-composed music remains at a consistent level of statistical complexity, whereas the computer-generated excerpts have either lower or higher statistical complexity and receive lower ratings. This thesis contributes to the evaluation methodology of automatic music generation by filling gaps in originality reporting, comparative evaluation, and musicological analysis.