Modify to use proper multinomial sampling, with a temperature parameter to control diversity: instead of always taking the most likely character, the next character is drawn from the temperature-rescaled predicted distribution. This seems to generate qualitatively better results and is technically more correct.
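A minimal sketch of temperature-scaled multinomial sampling with NumPy; the function name `sample`, the epsilon added before the log, and the example probabilities are assumptions for illustration, not from the source:

```python
import numpy as np

def sample(preds, temperature=1.0):
    """Draw one index from `preds` after rescaling by `temperature`.

    Lower temperature sharpens the distribution (closer to argmax);
    higher temperature flattens it (more diverse samples).
    """
    preds = np.asarray(preds, dtype="float64")
    # Rescale in log space, then renormalize with a softmax.
    preds = np.log(preds + 1e-12) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # One draw from the multinomial; argmax recovers the sampled index.
    probas = np.random.multinomial(1, preds, 1)
    return int(np.argmax(probas))
```

With a very low temperature the draw is almost always the argmax; at temperature 1.0 the original predicted probabilities are used unchanged.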
Both the training features and labels can be represented as numpy
booleans instead of float32 / float64. This lets standard low-RAM
machines scale up to large datasets. It is especially important if you
have many characters (e.g. the full ASCII range), long sequences, or a
large dataset.
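A sketch of the memory savings from boolean one-hot arrays; the sizes below (10,000 sequences of length 40 over a 60-character vocabulary) are hypothetical examples, not from the source:

```python
import numpy as np

# Hypothetical dataset dimensions for illustration.
n_sequences, maxlen, n_chars = 10_000, 40, 60

# One-hot encoded inputs and targets as booleans: 1 byte per element,
# versus 4 bytes for float32 or 8 bytes for float64.
x = np.zeros((n_sequences, maxlen, n_chars), dtype=np.bool_)
y = np.zeros((n_sequences, n_chars), dtype=np.bool_)

bool_bytes = x.nbytes                      # 1 byte per element
float32_bytes = bool_bytes * 4             # what float32 would need
float64_bytes = bool_bytes * 8             # what float64 would need
```

At these sizes the boolean array takes about 24 MB where float64 would take about 192 MB; the ratio grows linearly with sequence count, sequence length, and vocabulary size.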