README setup results in 500 error
I followed the README instructions on the llm branch and ran the example curl request:

```shell
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "hermes-2-pro-mistral",
  "prompt": "A long time ago in a galaxy far, far away",
  "temperature": 0.7
}'
```
I got the following response:

```json
{"error":{"code":500,"message":"rpc error: code = Unknown desc = unimplemented","type":""}}
```
In the Docker logs I see:

```
6:54PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:Hermes-2-Pro-Mistral-7B.Q6_K.gguf ContextSize:4096 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:90 MainGPU: TensorSplit: Threads:8 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/Hermes-2-Pro-Mistral-7B.Q6_K.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
6:54PM INF [stablediffusion] Loads OK
[172.17.0.1]:55404 500 - POST /v1/completions
```
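For context on what I've tried: the log shows the loader searching `/tmp/localai/backend_data/backend-assets/gpt4all`, which makes me suspect the request is being routed to a backend that doesn't implement completions for this GGUF model. One workaround I'm considering is pinning the backend explicitly in a model config file. The sketch below follows LocalAI's documented model-config YAML format, but the filename, the `backend: llama` choice, and the parameter values are my assumptions based on the request and log above, not something confirmed to fix this:

```yaml
# Hypothetical model config, e.g. models/hermes-2-pro-mistral.yaml.
# Field names follow LocalAI's documented model-config format; the
# values are guesses taken from the failing request and the log.
name: hermes-2-pro-mistral
backend: llama            # assumption: force the llama.cpp backend
parameters:
  model: Hermes-2-Pro-Mistral-7B.Q6_K.gguf
  temperature: 0.7
context_size: 4096
f16: true
```

With a file like this in the models directory, the `"model": "hermes-2-pro-mistral"` name in the curl request should resolve to this config instead of letting LocalAI guess the backend.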