Generating music using AI
You may have been following recent trends in machine learning and generative AI and started incorporating them into your own workflows. I didn't realize how useful an AI assistant was until I had one of sufficient quality; the productivity boost is impressive. On the creative side, I have been exploring various ideas in the algorithmic space: machine learning, algorithmic music composition, and, in particular, text-to-music, which essentially means taking an instruction like "relaxing armchair jazz bebop from Mars" and getting back a .wav or .mp3 within a reasonable amount of time.
The actual sound quality isn't especially high fidelity, but it's good enough, and future models may improve the fidelity dramatically. For now, I have used the generated audio sometimes by itself and sometimes layered with other material. The range of musical output is constrained by the training data used, but interestingly, we are free to fine-tune a model on our own training data: there are examples of musicgen being fine-tuned on Replicate for a more specific subgenre of music.
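If you want to try fine-tuning yourself, the Replicate Python client exposes a trainings API. The sketch below is only a rough outline: the trainer model name, version hash, and the dataset input name are placeholders I've made up, so check the actual fine-tuning guide and trainer page on replicate.com for the real values.

```python
# pip install replicate; export REPLICATE_API_TOKEN=<your token>
import replicate

# NOTE: the trainer name, version hash, and input key below are
# placeholders -- copy the real ones from the trainer's page on Replicate.
training = replicate.trainings.create(
    version="some-user/musicgen-fine-tuner:VERSION_HASH",  # hypothetical trainer
    input={
        "dataset": "https://example.com/my_subgenre_clips.zip",  # assumed input name
    },
    destination="your-username/my-musicgen",  # where the tuned model will live
)
print(training.status)
```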
I think that prompt engineering, fine-tuning models on your own datasets, and using machine learning models to categorize a large corpus of audio could be a fruitful avenue of experimentation, especially combined with APIs, local machine learning models, and other wilder ideas. There could even be a genetic algorithm or reinforcement learning layer that chooses new audio segments. I have yet to really try out these ideas, but I can't wait to give them a shot at some point. This is also related to the concept of a music agent system, where virtual performers, each with some AI "brain", listen to the other virtual performers to collectively derive a new piece of music, either offline or in real time. Some experience with Pure Data or SuperCollider might help in realizing ideas like this.
I have also successfully used the continuation feature of musicgen by providing a 15-second audio sample and configuring the continuation to run from 15 to 29 seconds. It can be tricky to work with because you need to keep the duration below 30 seconds in continuation mode, so you might want to automate the process to create an extended piece of music. As far as I can tell from my experimentation, the generated music can be directed with a prompt. You could even feed in famous audio and see what the AI makes of it... This article will probably be out of date by the time you read it.
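As a rough illustration, here is how that continuation call might look with the Replicate Python client. The input names (input_audio, continuation, duration) are assumptions based on my usage; deployments differ, so check the model's API schema before copying this.

```python
# pip install replicate; export REPLICATE_API_TOKEN=<your token>
import replicate

# Input names below are assumptions -- verify them against the model's
# schema on replicate.com. Community models may also need a version
# hash appended, e.g. "aussielabs/musicgen:VERSION_HASH".
output = replicate.run(
    "aussielabs/musicgen",
    input={
        "prompt": "relaxing armchair jazz bebop from Mars",
        "input_audio": open("seed_15s.wav", "rb"),  # the 15-second sample
        "continuation": True,
        "duration": 29,  # stay below 30 seconds in continuation mode
    },
)
print(output)  # typically a URL to the generated audio
```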
Companies are already beginning to commoditize, form cults/religious movements around, and generally cash in on this new AI boom. Not naming names, but some companies are trying to emulate Apple's cult following with slick keynotes and hype. The commercial FUD around packaging machine learning up as enterprise products and services has already begun; the snake oil, marketing hype, fear, and paranoia are rampant.
For artists, there is a chance to hold a mirror up to society and use this frontier tech both to generate new content and works at an unimaginable pace and to incorporate machine learning into artistic works, whether in the cloud or offline on your own computers.
What can generative AI do?
A large language model or chat-based AI assistant will typically accept a text prompt like this...
"what is the meaning of life, the universe and everything?"
The LLM will then respond with a well-formed answer. It can follow instructions and work on text, code, images, and soon any conceivable data type.
The response can vary from accurate to wildly incorrect. The LLM uses predictive word modelling to generate sentences based on the likelihood of the next word or sentence. It's far more complex than that under the hood, and a product may package up three or more separate models, but that is what it's doing in a nutshell.
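To make the "likelihood of the next word" idea concrete, here is a toy sketch. The probability table is entirely made up, and a real LLM learns billions of weights over tokens rather than a word lookup table, but the sampling loop is the same idea.

```python
import random

# A made-up table: given the previous word, the probability of each
# possible next word. Real models learn this over tokens, not words.
next_word = {
    "the":     {"meaning": 0.5, "universe": 0.3, "answer": 0.2},
    "meaning": {"of": 1.0},
    "of":      {"life": 0.7, "everything": 0.3},
}

words = ["the"]
while words[-1] in next_word:
    options = next_word[words[-1]]
    # Sample the next word in proportion to its likelihood.
    words.append(random.choices(list(options), weights=options.values())[0])

print(" ".join(words))  # e.g. "the meaning of life"
```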
You can use generative AI to generate code, images, videos, text, and more. You will find companies offering different AI products and machine learning models of varying quality.
Running in the cloud
Depending on what you're trying to do, you may benefit from using a cloud-based interface or service to do the actual number crunching. The obvious reason is that setting up your own machine learning computer is expensive and non-trivial.
Your data may be used by the third-party company, so don't include any personal or sensitive information when using cloud-based AI systems, unless you feel comfortable doing so and it's worth the trade-off.
Your own computer may become old hat quickly, and it is cumbersome to keep up with the Joneses. When renting servers in the cloud (VPS, virtual private servers), you can get the biggest and best GPUs, at a price.
With platforms like huggingface.co and Replicate, you are charged by the minute or second of server usage while the model runs.
I will show you how to use Replicate to make music, video, pictures, and more later on.
Running locally
Running locally gives you more control over your project because your data never leaves your computer, and the machine learning models available can have less guardrailing (which I personally think is a needless restriction).
Moderation and guardrails might be needed in some commercial applications to avoid lawsuits. But we are venturing into the uncharted frontiers of AI research, and having the computer "say no" like in the Little Britain sketch, or refuse with "I can't do that, Dave" as in 2001: A Space Odyssey, is a needless restriction.
You will need to configure your own very beefy computer with a decent GPU (probably NVIDIA, for compatibility with most machine learning tooling); think 32+ GB of RAM. I'm running 7B (7-billion-parameter) models locally using ollama.ai on Linux with a pretty modest NVIDIA gaming GPU and 8 GB of RAM. That works well enough for 7B models but is far too slow for 13B or higher.
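Ollama exposes a simple REST API on localhost, so once a model is pulled (e.g. with `ollama pull llama2`) you can drive it from Python with nothing but the standard library. A minimal sketch, assuming the default port and a model named llama2:

```python
# Assumes the Ollama server is running locally on its default port
# and that a model has already been pulled, e.g.: ollama pull llama2
import json
import urllib.request

payload = json.dumps({
    "model": "llama2",  # any local model you have pulled
    "prompt": "Suggest three text prompts for generating ambient music.",
    "stream": False,    # return a single JSON object rather than a stream
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```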
Replicate

I have racked up some bills on huggingface.co, Replicate, and plenty of other platforms during my testing. My suggestion to anyone who wants to have a go at generating music, audio, images, or other media is to first get a GitHub account, then sign up for Replicate.
They offer a nice API for lots of languages and an ever-increasing list of models to try out. You can try a model from the website to get a feel for it before automating it with the API.
You can find documentation for using the Replicate API, and you can ask ChatGPT to generate code for you if you're not a strong software developer.
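To give a flavour of the API, here is roughly what a single call looks like with the official Python client, using an image model as an example. Model versions change, so copy the current identifier (including the version hash) from the model's page rather than from here; the hash below is a placeholder.

```python
# pip install replicate; export REPLICATE_API_TOKEN=<your token>
import replicate

# Copy the full "owner/model:VERSION_HASH" identifier from the model
# page on replicate.com; VERSION_HASH below is a placeholder.
output = replicate.run(
    "stability-ai/sdxl:VERSION_HASH",
    input={"prompt": "a synthesizer on the surface of Mars, photorealistic"},
)
print(output)  # typically one or more URLs to the generated images
```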
Models & prompt engineering
Prompt engineering is the process of trying different text prompts to get different results from a machine learning model. For artists like us, that means experimenting with different inputs to get the output we desire.
When generating images, music, or other output, you will need to play around with your text prompts to inch closer to your desired result. This iterative process got the name prompt engineering among people generating code, but it equally applies to us.
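One low-tech way to make that iteration systematic is to expand a template into every combination of a few word lists and work through them, noting which phrasings land closest to what you want. A trivial sketch (the word lists are just examples of mine):

```python
from itertools import product

# Example word lists of my own -- swap in whatever axes matter to you.
moods = ["relaxing", "frantic", "melancholy"]
genres = ["armchair jazz bebop", "funky house", "drum and bass"]
places = ["from Mars", "in a rainstorm", "at 3am"]

# Print every combination as a candidate prompt to try.
for mood, genre, place in product(moods, genres, places):
    print(f"{mood} {genre} {place}")
```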
Musicgen on Replicate
I've been using the musicgen model aussielabs/musicgen to generate music, which I then edit together with some audio I recorded off my radio and played around with a bit in Audacity.
Musicgen is an open source machine learning model from Meta (Facebook) that generates audio/music from text prompts like...
"Drill evolves into Trap, which then merges into EDM, transforming seamlessly into Drum and Bass, and finally culminates in Funky House. This progression is marked by an increasing tempo and a shift from gritty, street-inspired beats to more polished electronic rhythms. The journey from Drill's raw, loop-heavy beats to Trap's characteristic hi-hats and 808s, then to EDM's pulsating electronic melodies, Drum and Bass's rapid breakbeats, and ending in Funky House's upbeat, disco-infused grooves, showcases a diverse spectrum of contemporary music genres."
At the time of writing, the aussielabs/musicgen model on Replicate accepts an output duration of more than 30 seconds (unlike other musicgen models on replicate.com), and it is my recommendation for trying out prompt engineering for music making.
Another technique I've tried is calling the API with randomly generated prompts and stitching the resulting audio together from the command line using the Linux audio editing tool sox.
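The stitching step itself is a one-liner, since `sox in1.wav in2.wav out.wav` concatenates its inputs in order. Here is a sketch of the whole idea in Python; it assumes sox is installed and that you have already generated and downloaded one clip per prompt as clip_0.wav, clip_1.wav, and so on (the word lists are made up):

```python
import random
import subprocess

# Build a few random prompts from made-up word lists.
moods = ["dreamy", "gritty", "triumphant"]
genres = ["trap", "EDM", "funky house"]
prompts = [f"{random.choice(moods)} {random.choice(genres)}" for _ in range(3)]
print(prompts)  # generate one clip per prompt via the API and download them

# Concatenate the downloaded clips end to end with sox:
# "sox a.wav b.wav c.wav out.wav" joins them in order.
clips = [f"clip_{i}.wav" for i in range(len(prompts))]
subprocess.run(["sox", *clips, "stitched.wav"], check=True)
```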
Without sounding facetious, you can use an LLM to reproduce or try out something similar, so beyond the sketches above I won't give you detailed instructions yet, for fear of them being out of date by the time you read this.
m-onz : cool_folder (audio example)
This is an example of music I made using musicgen via Replicate, Audacity, and the command line.
Conclusion
The pace of innovation and change in AI is astounding. It offers huge opportunities for enterprise, everyday people, and artists working at the intersection of art and technology.
I have been dabbling with new artistic processes and have ultimately settled on using local machine learning models and Replicate, with cloud-based LLMs sometimes too. The benefit of using Replicate is that you will have some hope of replicating this work, whereas other commercial websites, products, and machine learning services often break, charge too much money, or impose silly copyright restrictions on you.