Has anyone replaced Claude/GPT with a local model for daily coding?
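Source: news.ycombinator.com. On June 15, a Hacker News user started a discussion asking whether anyone has replaced Claude or GPT with a local model for day-to-day coding work, and invited people to share their practical experience.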
Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?
1224 points by cloudking 1 day ago | 515 comments

Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g. tok/s).

Greenpants 1 day ago
I have! I care about data privacy and LLMs being free. I'm using the Pi coding harness but containerized and sandboxed, to make sure it's running completely offline. On my Mac Studio with 128GB RAM (or MacBook with 36GB RAM) I'm using Qwen3.6 35b, with only 3b active parameters so that it runs really fast. I've done a complete redesign for my website's homepage and blog with Django + Wagtail. The latter is interesting, because Wagtail is a bit less well-known, so the agent, without giving it internet access, doesn't always know how to develop for Wagtail. I've used Qwen3.5 122b for when things get more complex. At 10b active parameters, it's significantly slower though.

I've noticed a few things compared to large models like Claude. For starters, you really need to know what you're asking, and be precise; it doesn't do much thinking for you. Any assumptions left open, and it'll take the easiest route to reach the goal (e.g. CSS in HTML), often not the best in terms of architecture.

It gets into loops quite often, and surprisingly often gets the edit tool call wrong, after which it will spend lots of thinking tokens and re-read files instead of retrying (despite the system prompt suggesting so).

Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture. If Opus gives a 15x speedup, local and fully offline Qwen gives a 5x speedup.
Which, given that it's completely free, is still mind-boggling to me :)

lambda 1 day ago
This is very similar to my setup. Pi in a container (I do let it have network access, just no access to creds or anything, only the one directory that I'm working on at the time and my ~/.pi directory), talking to llama.cpp in another container. I'm on a Strix Halo 128 GiB unified memory laptop.

I've never used the frontier models in earnest, I don't believe in using proprietary tools for my programming, so I can't really compare.

And I'm still a AI skeptic, so I'm doing more testing and kicking the tires than I am actually using it. That means I spend a lot of time trying to break various models, probe them for strengths and weaknesses, etc.

But I find that when I do try to use it for real for agentic coding, Qwen 3.6 35B-A3B is definitely the one I reach for the most often.

For other chat tasks and translation, I'll frequently use Gemma 4 31B.

For audio, I'll use Gemma 4 12B.

I keep a bunch of other models around to try out every once in a while (Qwen 3.5 122B-A10B, Qwen 3.6 27B, Nemotron 3 Super 122B-A12B, Step 3.7 Flash and Minimax M2.7 both at somewhat more aggressive quants, and GPT-OSS 120B if I want super fast but not terribly smart), but so far Qwen 3.6 35B-A3B is really the sweet spot for coding on a setup like this.

chakspak 1 day ago
Hopefully this isn't off-topic, but your setup sounds just like mine, Strix Halo and (I'm assuming) llama.cpp on ROCm, and I'm finding that the Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?

lambda 23 hours ago
I use Vulkan mostly instead of ROCm. Vulkan is actually a bit faster, paradoxically.
I do switch out and try them both out, and it's not a huge difference, but I've been mostly staying on Vulkan.

The re-processing context every turn problem is definitely something I've hit. Some of the causes have been solved upstream in llama.cpp; make sure you're up to date.

But another cause of the issue that has a big effect is that older Qwen models didn't support preserving thinking. This means that each time you have a long sequence of tool calls with interleaved thinking, as soon as you had your next turn in the chat, it would have to re-process all of that as it would drop all of the reasoning.

Qwen 3.6, however, now supports preserving thinking. This can use a bit more context, because you're not dropping the thinking every turn, but it re-uses the cache better, not causing you to have to reprocess a whole turn at a time each time.

In my models.ini, I have this for the Qwen3.6 models:

    chat-template-kwargs = {"preserve_thinking": true}

There are still occasional issues I hit where it will have to re-process, but getting up to date and enabling preserve_thinking has helped a ton.

ndom91 23 hours ago
+1 using llama.cpp Vulkan releases with the Qwen models - runs much better than the ROCm releases.

I'll have to give the preserve_thinking a shot.

jderekw 20 hours ago
Thanks for sharing. I have been running ROCm primarily with Qwen 3.6 and Qwen Coder. On the "runs much better" statement: is that stability, performance, or some other capability you're experiencing?

thefroh 14 hours ago
I'm a little surprised that preserve_thinking would matter here for cache purposes.
for actual capabilities/intelligence, yes, I'd imagine it helps to have past reasoning traces in multi-turn setups.

but for caching, all you are doing is leaving off a fraction of the most recent assistant message generation, which will have little/no impact on cache hit rate.

stymaar 13 hours ago
> all you are doing is leaving off a fraction of the most recent assistant message generation

True, but not a tiny fraction, qwen is very verbose in its thinking traces. And it basically means that for every (nonthinking) generated token you have to compute the KV twice (once as tg, the second one as pp).

havfo 12 hours ago
I was able to solve this for my setup, 7900XTX and llama.cpp on ROCm in the oh-my-pi fork of pi.dev harness. I documented my setup on github, check under my username/omp-config, but the important thing is making sure the context is strictly append-only, and starting llama.cpp with --chat-template-kwargs '{"preserve_thinking":true}'

anaisbetts 10 hours ago
If you're hitting this you have a bug, this is not related to the model. Either your harness is editing the messages between turns incorrectly (i.e. it is not append-only), or sometimes this is because of llama.cpp bugs, but bet on the former. Setting up something like Tailscale's Aperture will let you capture the requests and then you can diff them.

dnautics 22 hours ago
> Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?

Isn't this the nature of how LLMs work? Or do you mean that it recalculates the entire KV cache instead of saving the old KV cache, in which case the problem is likely in your executor (llama.cpp, vllm, e.g.)
configuration or capabilities?

lambda 21 hours ago
So, one of the ways that this problem manifests is that most local models aren't trained on preserving the full reasoning between turns. Every turn, they skip passing the reasoning trace from previous turns to the LLM. So if on one turn you have a long interleaved chain of reasoning and tool calls, then it responds to you, and then you give a new prompt to fix something, it has to re-process all of those tool calls now with the reasoning stripped out.

Qwen 3.6 has finally been trained both with and without preserving thinking, so you can optionally enable preserving thinking. This will use up a bit more context, but it will avoid having to do this re-processing of long agentic turns, and also the preserved thinking can avoid having to re-do some of the same reasoning over again in later turns.

Besides that, modern LLMs don't only use full attention (apparently, attention is not all you need). Full attention is very expensive to compute and store (O(n^2)). But additionally, full attention is actually bad at certain kinds of reasoning; keeping track of some value that gets replaced over the course of time, for example. So most models these days use various forms of local attention which is fixed length and gets updated as you go; sliding window attention, Mamba-2 state space models, etc.

But one advantage of attention is that you can go back and reprocess by truncating the KV cache and starting over. You can't do that with other forms of local attention; you've lost the state earlier in the sequence.

So to allow you to go back without fully recomputing the cache all over again, your engine will save snapshots of the local attention state at various times, so if you need to go back to recompute the cache, you can start from the last snapshot.
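A rough sketch of the snapshot idea described above, in Python. This is a toy model under stated assumptions, not how llama.cpp actually implements it: `ToyCache`, `snapshot_every`, and the fixed snapshot interval are hypothetical names and choices made up for illustration, and plain token IDs stand in for real model state.

```python
from dataclasses import dataclass, field


def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class ToyCache:
    """Toy prompt cache for a model whose state cannot be rewound freely."""
    tokens: list[int] = field(default_factory=list)     # tokens already processed
    snapshots: list[int] = field(default_factory=list)  # positions with a saved state
    snapshot_every: int = 4                              # hypothetical snapshot interval

    def process(self, prompt: list[int]) -> int:
        """Process a new prompt; return how many tokens had to be (re)computed."""
        keep = common_prefix_len(self.tokens, prompt)
        candidates = list(self.snapshots)
        if keep == len(self.tokens):
            # Pure append: the live state at the end of the cache is still valid.
            candidates.append(keep)
        # Otherwise fall back to the latest snapshot at or before the divergence.
        resume = max((p for p in candidates if p <= keep), default=0)
        self.snapshots = [p for p in self.snapshots if p <= resume]
        for pos in range(resume, len(prompt)):
            # A real engine would run the model on prompt[pos] here.
            if (pos + 1) % self.snapshot_every == 0:
                self.snapshots.append(pos + 1)
        self.tokens = list(prompt)
        return len(prompt) - resume


cache = ToyCache()
first = cache.process(list(range(10)))                       # cold start
appended = cache.process(list(range(14)))                    # append-only turn
diverged = cache.process(list(range(6)) + [100, 101, 102])   # earlier tokens changed
```

In this toy, the append-only turn only computes the new tokens, while the turn whose earlier tokens changed (for example because reasoning was stripped from the rendered prompt) has to resume from the snapshot before the divergence point and recompute everything after it.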
However, these snapshots can get large, you can't keep too many of these, so sometimes you need to go back quite far to get to one, or they're all past the point you need to go back to and you need to start over again from the beginning.

There have been particular bugs in llama.cpp that have caused this to be triggered more often than it should; for instance, it wouldn't take snapshots before turns that included images at one point, so if you had an image heavy agentic workflow, that issue plus the lack of preserving thinking would mean you would frequently have to go back and start over from scratch.

Some of these issues have been fixed, some are addressed by preserving thinking. There are still some issues sometimes; for instance, one that's hard to fix is that the tokens generated autoregressively don't always parse the same when doing prefill. For instance, you could generate something as two tokens "pre" and "fill", but it turns out that "prefill" is also a single token so the tokenizer will use that, so when you send that back again on the next turn, it will see a divergence and have to recompute from that point. It might be possible to ignore that and use the not fully greedy tokenization that's in the cache, but I've definitely seen llama.cpp have to do some cache recomputation due to that.

carterschonwald 19 hours ago
thats a harness issue not a model issue. eg i have my own reasoning harness that forced persisted cot

lambda 2 hours ago
Not a harness issue. The harness (pi in my case) passes back the cot for all previous turns.

The jinja template is what renders the openai-format request sent by the harness, into the actual string of text that will be tokenized and fed to the model. For models without preserve thinking support, the jinja template drops the reasoning from all but the current turn.

Here is the default jinja for Gemma 4: https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_...
    {#- Render reasoning/reasoning_content as thinking channel -#}
    {%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
    {%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
        {{- 'thought\n' + thinking_text + '\n' -}}
    {%- endif -%}

You see that it only preserves the thinking for indexes that are later than the last user message; thinking is only preserved for a single turn (which can include a lot of interleaved thinking and tool calls), once it goes back to the user and the user replies, it will replay the tool calls but not the thinking between them.

Here's Qwen 3.6 by comparison: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/chat_t...

    {%- if (preserve_thinking is defined and preserve_thinking is true) or (loop.index0 > ns.last_query_index) %}
        {{- '' + message.role + '\n\n' + reasoning_content + '\n\n\n' + content }}
    {%- else %}
        {{- '' + message.role + '\n' + content }}
    {%- endif %}

It additionally has a preserve_thinking flag that you can set. If that's set, it will include all turns' thinking in the text passed to the model. But you do have to set that, it's not the default.

It's possible to modify the jinja file that you're using with a model. Some people do that with models that haven't been specifically trained for it, and report good results; but some report that because it wasn't trained for that, they get worse results if they include thinking from previous turns.

So for models like Gemma, you would have to modify the default jinja to enable this. For Qwen, you can just set the preserve_thinking flag to get this behavior; and apparently they have trained in this mode so you get better results than models that have not trained this way.

thefossguy69 13 hours ago
Would you mind sharing your harness for reasoning?

dnautics 20 hours ago
wait do sota models use mamba-like SSMs?
this is the first im hearing this

nl 19 hours ago
Qwen 3.5 and above use Gated DeltaNet which alternate attention and SSM layers: https://sebastianraschka.com/llms-from-scratch/ch04/08_delta...

LoganDark 23 hours ago
What harness are you using? Some of them (e.g. OpenCode) mutate the system prompt every turn, and therefore can't work with a KV cache.

I've had the best luck with Pi so far, but it comes without some bells and whistles you might be used to (e.g. plan mode, subagents, MCP client support)

mbitai 9 hours ago
I've also had good results with Pi, and I got used to the new workflows without subagents, MCP, etc.

verdverm 15 hours ago
There is a bug in llama-cpp for qwen/gemma models, use vLLM instead

pdyc 14 hours ago
what bug and it affects what?

verdverm 4 hours ago
it's a prompt cache invalidation bug that causes all input to be reprocessed instead of getting preloaded

There are other reasons to prefer vllm to llama-cpp as well

fjdjshsh 18 hours ago
> I'm still a AI skeptic

What does this mean in June 2026 wrt coding?

To me it sounds like being a "rice cooker skeptic". Some people don't like using rice cookers, some do.

svantana 7 hours ago
I'm a housekeeper skeptic. While I concede that a professional housekeeper would probably do a better job than me on most domestic tasks, I still think everyone should clean their own home, cook their own dinner, and write their own code.

femto113 16 hours ago
For me the distinction is that your rice only needs to be edible once, while your code may need to last for decades.
Using AI to code anything I could comfortably throw away if needed is a lot less fraught than letting it make choices that I and anybody who inherits the code is gonna have to live with, especially if by outsourcing those choices I reduce my understanding of the implications of those choices.

deadeye 6 hours ago
I don't let the AI make any choices. I have a lot of instructions and sample code for it to follow. It is basically a glorified code generator at that point.

luipugs 13 hours ago
Don't you read through all the output of the agent before committing them?

secult 12 hours ago
That's not the way how human brain works.

luipugs 8 hours ago
I'm not getting it. OP said they are wary of letting the agent make choices for them, and outsourcing those choices lessens their understanding of them. They could interrogate the agent on why those choices were made until they have sufficient understanding, and they can also change the solution if they want to.

incrudible 10 hours ago
I think the idea that code should last decades is now questionable, if not problematic. If we can now produce code at 10x the rate, that means we can have 10x more code (probably not desirable) or we can have 10x as many revisions. Whoever inherits the code can have it rewritten to their liking and understanding. Nothing helps better in understanding a system than to rebuild it, even if just by handholding an LLM.

bluGill 6 hours ago
Only for simple problems.
As the problem becomes complex you can't remember all the requirements to prompt the AI with.

incrudible 3 hours ago
As the problem becomes complex, you can't remember all the requirements, period.

bluGill 3 hours ago
Exactly, but if I start from working code with a lot of tests I don't need to remember the requirements. I just need to know my current requirement and figure out the ones I'm changing with my new requirement. It doesn't catch everything, but in most cases if I break some other requirement I find out about it and can figure out just that one more requirement and not the millions of others that still work.

sfn42 8 hours ago
The thing about this is that you can choose how high level you go.

For example you can just tell it to make a website for a business with a webshop and it'll just generate thousands of lines of code and you have no control over anything. Or you can spend hours/days writing the specification and then have it generate it.

Or you can do what I do and work iteratively one feature at a time making sure everything is exactly the way you want it. I generally solve the problem myself then tell it what to do, or if I'm not sure what the best solution is I might discuss with the AI until we agree on a plan and then have it execute it. Often this leads to me learning useful things, like it will suggest a tool/feature that I didn't know about that's perfect for my usecase or it will identify a problem in my plan that I wouldn't have found until after spending hours on the implementation.

I've always been very detail oriented and I care a lot about code quality, I want my solutions to be clean, consistent and as simple as possible while solving the problem. To me, AI tools let me do that more quickly and better, it's not a compromise it's just flat out better in every dimension.
It's about how you use it.

A lot of people seem to think that it's a binary choice, either hand craft a high quality bespoke solution or just vibe code a pile of trash. There's a whole spectrum in between those two, and I think there's a sweet spot where you still maintain control and understanding, it's just much faster and the result is actually better because it's not just you and the knowledge in your brain it's also the AI that practically knows everything - it will teach you things and suggest solutions you wouldn't have thought about, it makes you a better developer. It's a force multiplier and the smarter you are the better you will be at using it.

It's not a replacement it's an enhancement. It's like imagine a developer with Google vs one without, obviously the one with Google will be better because they have access to more information. The AI is like automatic google that just googles everything all the time, things you wouldn't have even thought to Google or things you couldn't possibly formulate a good search term for. With AI you can just show it a screenshot or describe an issue in detail and get a really solid answer a lot of the time. It's like having an expert on standby all the time, sure it's sometimes wrong but most of the time it's not and if you're smart you'll recognize when it isn't.

I'd say anyone who isn't using AI today isn't using their full potential. I don't see how anyone could possibly perform better without this tool than with it. I do see how someone who doesn't care could produce a lot of slop, but the people who refuse to use it aren't that guy. That guy has been using it to produce slop for years already. You can use it to produce top quality code if you choose to.

HWR_14 16 hours ago
I assume it means they are not sure it gives them a speed up.
Which, since I don't know what they are trying to do, may be reasonable.

Iolaum 11 hours ago
Haven't used for actual coding but was testing locally - for example running some swebench instances - whether qwen-3.6-35b-a3b@Q8 was better than qwen-3.5-122b-a10b@Q4. With MTP the former runs at around 55t/s and the latter at around 30t/s meaning the latter is also usable. It looked like qwen-3.5-122b-a10b@Q4 performed a bit better.

mahadevank 14 hours ago
Thanks a lot for your comment. I was using Qwen3 but wasn't aware of the A3B Mixture-of-experts model. Works much better, thanks

adyavanapalli 1 day ago
For the edit tool, you should consider implementing a hash-based approach where each line of code is hashed and referenced by it when doing replacements. You can read up on the approach here: https://blog.can.ac/2026/02/12/the-harness-problem/

I didn't do much benchmarking, but anecdotally, I found it to be making fewer edit errors. YMMV

pieterk 20 hours ago
Yup, I used this for a while and IME it may get you a few percentages more of useful context initially, so quality feels a bit higher, but things start breaking down in funnier ways when you do run out of that quality for any reason later, so definitely caveat emptor.

ojr 19 hours ago
I can use Gemini 3 Flash with the harness I built for around 8 years and still not exceed the cost of a Mac Studio with 128GB, the price for privacy is very high.
Agentic flows that get stuck can be worked around but I prefer developer velocity.

ClikeX 9 hours ago
> the price for privacy is very high

Not sure if you intended this to be this philosophical, but this is basically the slogan for modern life now.

gwerbin 7 hours ago
Yeah but the price for, say, private email is a lot less.

ihateolives 6 hours ago
Sure, but Gemini subscription gives you just that - Gemini subscription, but new computer allows you to do other stuff with it as well. When you're upgrading anyway for other reasons then it's not fair to compare full Studio price to just one subscription.

disqard 19 hours ago
Under-rated take, thanks for stating this!

Not everyone can plough $$$$ into hardware right now (more power to those who can), so choosing to rent is an A-Ok strategy.

tpm 15 hours ago
It's ok if you can send your code and data to the provider. Some of us can't.

_zoltan_ 13 hours ago
We're discussing home use.

You can. You just don't want to. Huge difference.

monooso 7 hours ago
> We're discussing home use.

You may be, but the topic of discussion is whether anyone is using a local model as their main coding tool.

_zoltan_ 7 hours ago
for corporate use it's a mistake not to use a frontier model.

tpm 4 hours ago
Well plenty of people work from home.

For corporate use, if the corporation would break the law sending anything to the open internet or to the US, then you can't use any model that's not hosted in house.
And there are many such cases.

danans 13 hours ago
> I can use Gemini 3 Flash with the harness I built for around 8 years and still not exceed the cost of a Mac Studio with 128GB

And sounds like you haven't factored in the cost of electricity to run that Mac Studio as an LLM machine. Probably get a few more years.

electronsoup 23 hours ago
> It gets into loops quite often, and surprisingly often gets the edit tool call wrong

I find that running better quantization, like Q8, tends to prevent this. Even though it's a bit slower to run, it saves overall time with less churn.

Using 3.6-27b is even slower again than 3.6-35b, but I find the accuracy really pays off

girvo 20 hours ago
Right. Tokens/s decode isn't the most important thing to me: wall clock time for task completion is. And tracking all of that, on my GB10-based Asus box, Step 3.7 Flash at IQ4_XS beats Qwen 3.6 27B despite the latter having MTP, on all of my actual coding task evaluations in real codebases.

Qwen seems better at one-shotting things based on vague prompts to an acceptable degree, but that's literally not what I use these things for!

One thing if people do play with it, is it seems very very sensitive to quantisation of the K part of the KV cache. F16 K and Q8 V got rid of a lot of the loops that it was otherwise hitting.

There's also a regression in llama.cpp wrt. Step Flash, where quantisation is getting worse KLD and Perplexity than it otherwise was previously, for the exact same quants. Very odd, but it's being looked into at least!

gwerbin 7 hours ago
Do you think the choice of quantization matters that much for other models? I've seen a lot of discussion about different quantization and FP formats but I feel totally unequipped to make an informed decision about what to try.

What's your evaluation setup like?
It sounds like maybe the best thing to do is have a realistic evaluation that resembles your actual intended workload and workflow, and then just try everything.

rhdunn 5 hours ago
I use promptfoo for evaluation. I'm experimenting with tests for my workflow/use cases.

I have a custom assert for loop/repeat detection that works well:

    from typing import Any

    def count_repeats(text: str, length: int) -> int:
        # Count how many times the last `length` characters repeat back-to-back
        # at the end of `text`.
        n = len(text)
        pattern = text[n - length : n]
        count = 1  # Include the end of the string as matching the substring.
        text = text[: -length]
        while text.endswith(pattern):
            text = text[: -length]
            count = count + 1
        return count

    def repeats(output: str, context: dict[str, Any]) -> dict[str, Any]:
        threshold = context.get('config', {}).get('threshold', 3)
        count = 0
        length = 0
        # Try every candidate pattern length and keep the most-repeated one.
        for n in range(1, (len(output) // 2) + 1):
            n_count = count_repeats(output, n)
            if n_count > count:
                count = n_count
                length = n
        if count >= threshold:
            return {
                'pass': True,
                'score': 1.0,
                'reason': f'Output repeats {count} times with length {length}.'
            }
        else:
            return {
                'pass': False,
                'score': 0.0,
                'reason': f'Output doesn\'t repeat {threshold} or more times.'
            }

    def no_repeats(output: str, context: dict[str, Any]) -> dict[str, Any]:
        # Invert `repeats`: pass only when the output is not stuck in a loop.
        result = repeats(output, context)
        result['pass'] = not result['pass']
        result['score'] = 1.0 - result['score']
        return result

Just add it to your promptfooconfig.yaml:

    defaultTest:
      assert:
        - # ----- The output doesn't repeat/get stuck in a loop.
          type: python
          value: file://asserts/repeat.py:no_repeats

ttoinou 10 hours ago
I tried Step 3.7 Flash on my mac 128GB and it seemed very dumb. antirez ds4 flash is much better!

girvo 7 hours ago
It isn’t though, I’ve run both through a bunch of coding evals.
You nearly certainly didn’t have the right sampling parameters or quantised the KV cache?

Ds4 is impressive for what it is, but it loops and over thinks even more, burning massive wall clock time to not even get great outcomes. It’s also limited to a slow speed on my Spark

ttoinou 7 hours ago
I tried a bunch of stuff with step 3.5 and step 3.7 maybe not as much as you. Could you tell me what parameters and launched you’re using? Antirez ds4 flash q2-q4 works almost out of the box for me

kristopolous 13 hours ago
I've got a tool that sits in between the harness and inference engine called petsitter. It is a middleman validator to avoid just these kinds of issues. You can stack the fixes as needed (they're called tricks in the petsitter parlance)

It's what I use. Fixes the problem

https://github.com/day50-dev/petsitter

stared 10 hours ago
While I do like Qwen 3.6 35B A3B, I have found that the improvement with Qwen 3.6 27B is massive. Sure, it is 3x slower (https://github.com/stared/benching-local-llms-on-apple-silic...), but in terms of total development time it still felt that 27B gets to the goal faster.

Is it different in your case?

robertlagrant 10 hours ago
How are you sandboxing your Pi coding harness? Directly only mounting certain folders, using capabilities to kill the network and not giving it all your shell env vars, that sort of thing? Or do you use a tool?

throw10920 4 hours ago
And, is the sandboxing for security (avoid RCE on the host) or merely guardrails for the models?

I've wanted the latter quite a bit for Pi, because weaker models like Deepseek V4 have extreme issues with obeying prompts (e.g.
I'll instruct it to find a bug but not fix it, and it'll "helpfully" try to fix it anyway), so having a "read-only mode" actually backed by the OS would be very useful.

ltononro 23 hours ago
What kind of coding do you do? Do you keep track of frontier models to vibe check the differences and re-evaluate constantly or are you ok with having a nerfed model forever? (not being judgmental, just really want to know your framework here)

Greenpants 23 hours ago
Some of the work I do, I do for an (EU) organisation that doesn't have clear rules or guidelines on the use of AI yet. Though I have seen colleague-developers blatantly putting source code into external Claude-like models, I stay true to my principles and don't. I know for certain that everything that I run through my local, offline Pi Container Sandbox cannot leave the machine, and thus can't result in a data breach. I do this for the peace of mind.

I do (unscientifically) experiment whenever a new capable local LLM (

Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture.

that's why i use the frontier models because its a senior co-worker vs a junior. if you use the junior for the sake of privacy i think you're missing out on the best insights for a specific task.

physix 18 hours ago
The dilemma I am facing is cost.

Consumer-grade subscriptions of the frontier models give you superb capabilities per dollar, them being heavily subsidized. But if you're working in an enterprise setting, that won't work. You need to upgrade, and that gets significantly more expensive.

Furthermore, basing the SDLC on leveraging the bargain subscriptions risks falling apart in the future, both from a cost perspective as well as the question of availability (e.g.
Mythos).

So from a strategic perspective, going local on the LLM and still achieving great results with the right approach is very relevant.

willisrocks 13 hours ago
Or you can get the best of both worlds--use frontier models to build a spec/plan, and use cheap models (open source or not) for implementation. Your max or team plan can go a lot further this way without giving up much for quality. Play with something like Superpowers to make this really approachable.

bxk76 17 hours ago
Best insights can be overrated due to the bandwidth limitation of the brain. Even if Einstein is sitting next to you the whole day and helping out, the Theory of Bounded Rationality applies.

pieterk 20 hours ago
Yup, it's fantastically useful.

Maybe even more useful than Opus when I have all the constraints to an issue. There is less "knowledge" in the model (I get by with 48GB of RAM allocated to an 8b quant), so it has fewer things to hallucinate about.

I've been getting to know its limits pretty well over the last few weeks and would say it's an excellent code search/replacement/generation* engine.

It's got the "in-context script generation" flow down as well, so it will easily help automate tasks that you describe with text and perhaps example commands, or tools, or skills* that you provide.

*Think of it + Pi as an NLP abstraction layer over grep, or a shell, rather than a jack of all trades + world knowledge all-in-one.

gwerbin 18 hours ago
I've noticed the same about the edit tool, in both Gemma and Qwen. Maybe I'm not running them with the right sampler settings, but I'm happy to hear I'm not the only one.
Lots of mismatched whitespace and stuff, the model ends up doing hex dumps and maybe 5 or 6 attempts at editing a 5-line function into a 250-line Python file.

All of these models also seem to get stuck in long thinking loops, sometimes tripling the tokens of a frontier closed model, which is really painful when inference is already on the slow side (on my MacBook).

0xbadcafebee, 1 day ago:
The harness and the LLM parameters are pretty essential to getting better results and reducing loops. Tweak the parameters and you can mostly eliminate loops without negatively affecting performance (it's a bit complex, but ask a SOTA AI to guide you and it's not hard). The harness should also react more intelligently to failures; it can do things like return additional context or hints as it tracks error rates and avg duration of calls. Pi can be easily extended, and the author suggests you modify it to perform better for your use case.

hparadiz, 23 hours ago:
I am right there with you. Mind-boggling. It's an indistinguishable-from-magic technology!! I tried running some basic tasks through Qwen with Opencode on a 10 year old dual Xeon server for shits and giggles. I gave it a simple task like "use ffprobe first but convert this webm to mp4" and it was able to complete the task with zero network calls outside my network. On 10 year old hardware. It took about 3 minutes to complete the task. Now you may be saying 3 minutes? Pfft. But I dare you to do it yourself. You're gonna be googling the CLI switches for at least 10 minutes and setting up your command. I had it actually optimize all the switches on the fly for me based on an initial ffprobe to see what is optimal.

bluerooibos, 21 hours ago:
> 10 year old dual Xeon server...On 10 year old hardware.

Hold on, what are the specs of your rig? 
How much RAM?

I've been considering getting an old refurbished 2018 Mac Mini with 64GB of DDR4 RAM, but everything I've read suggests this will be way slower than my 16GB M1 Pro MacBook.

hparadiz, 21 hours ago:
I inherited a box with dual Xeons and 256 GB of DDR4. I then ran several tests and benchmarks of the hardware with several models.

I've been meaning to write a blog post but well whatever here's the md.

https://gist.github.com/hparadiz/f3596d00a62d8ebb2dadcc46ee5...

Qwen3.5 9B performed best.

You can absolutely still use this to do some basic stuff like tell opencode to convert a video file from one format to another. But frankly you're better off getting two AMD GPUs. Say, a dual 7900XT would get way better performance.

bandrami, 12 hours ago:
> You're gonna be googling the CLI switches for at least 10 minutes

So there's this really amazing program called "man"

gmac, 9 hours ago:
Which is generally slower than Googling, because it's paged content in a terminal which can search only for literal strings?

hparadiz, 10 hours ago:
Yea there's something called a phone book too.

bandrami, 10 hours ago:
And that would be a much better source for a phone number than Googling. Similarly, the docs that ship with software are a better source for command line switches for that software than a search engine or LLM.

hparadiz, 9 hours ago:
My lived experience right now is a lot of super talented people around me using these tools all day every day to build awesome things and then there's the randos like you on HN who think they know better. 
Protip: You don't know squat.

cruffle_duffle, 6 hours ago:
The docs that ship with it are a great source for the LLM that will be running the command and monitoring its output, fixing or adjusting whatever in order to complete my goal. Why on earth would I be calling it by hand?

ololobus, 6 hours ago:
You are right, but I think you miss the whole point of the agentic workflows that are being discussed in this post's comments.

Yes, you surely can read man, docs, whatever, then DIY. The point is that in many areas people don’t really want to become an expert, like in ffmpeg CLI arguments; they just want the work to be done. Above is an example of an agent being able to do it locally, and I think it’s great.

dotancohen, 22 hours ago:
> you really need to know what you're asking, and be precise

Any chance that you could share some recent prompts to give other HNers a head start on how to approach Qwen? If you are uncomfortable posting them here, my Gmail username is the same as my HN username.

Thank you.

Greenpants, 22 hours ago:
I'm glad you're asking. I already started writing a blog post on how to best make use of local models. I'll share it as soon as I have a complete enough list. If anyone else reading this would like to chime in with their tips & tricks, let us know!

For the time being, off the top of my head, I'd say:

- Prompt Engineering tips & tricks apply here (like being complete in the relevant context you provide in your question, and the specific task(s) the agent should do like reasoning, modifying one file, or trying to fix a complex task all at once (not recommended)).
- If you already know which files the agent should look into, mention them to save time and potentially context.
- In my personal workflow, I write down lots of atomic TODOs needed to solve a problem. 
As I write it down, I'll notice assumptions I'm making, or the fact that the TODO could still be decomposed further into (atomic) subtasks.
- It's best to get a feeling yourself for how Qwen handles your repository. I noticed if I don't specify an architecture for development, it'll make quick & dirty fixes. If I don't tell it to remove debug statements, it won't. This is what was meant with "be precise" – Claude Opus might think for you and act in your best interest. Smaller Qwen models will just do what you ask them to, and no more. They have design knowledge, but you have to explicitly ask them to "activate" that part of their knowledge.

thefossguy69, 4 hours ago:
Is there a way to be notified of your blog post on this?

dotancohen, 8 hours ago:
Thank you, that was extraordinarily helpful.

I look forward to that blog post!

tsss, 5 hours ago:
But if you have to write everything down in such detail, isn't it faster to just do the task yourself?

jmuguy, 1 day ago:
Given your knowledge on this - do you think we'll see an open source model with Opus levels of capability? IMO if/when this happens - I would 100% stop using Anthropic.

Greenpants, 1 day ago:
Let me put it like this. I started with local LLMs when ChatGPT still used GPT-3.5. I was amazed how my MacBook with 8GB RAM could run openhermes2.5-mistral: a 7b parameter model that could generate short stories that sort of made sense. Incredible!

Two years later, and I'm running Qwen3.6 35b agentically to develop the start of a repository and automatically run tests to then improve on itself. I never thought we'd get here so quickly with LLMs back then.

I'm pretty sure in two years we'll have current Opus-like quality in the 30-100b parameter model range. 
But at that point, Opus 6.3 will reason along for us so much better still, that we'll still look at those models in awe. It's great to look ahead, but let's not forget to appreciate how effective the current local models already are :)

jmuguy, 1 day ago:
Haha well I ask because I don't really want/need anything beyond Opus most of the time. And I'm paranoid that Anthropic is going to be forced to charge the true cost of all this before too long.

Greenpants, 23 hours ago:
The other upside of running local LLMs is that there's no cloud provider to suddenly charge more for the same, or even less, model use.

It's personal, but I prefer CapEx over OpEx for this. If you can purchase a device upfront that runs a decent local LLM, you get the peace of mind that your setup won't suddenly change over time and can only get better.

lambda, 1 day ago:
If you believe the benchmarks, Qwen 3.6 35B-A3B already outperforms Claude 4 Opus.

Now, there's a bit of a degree to which some of the open source models do some benchmaxxing, and bigger models with more params may always feel like they have more depth. But anyhow, right now you have something that is arguably comparable to Claude 4 Opus on your laptop. I can't really compare myself because I never used it. It looks like Claude 4 Opus is still available on OpenRouter, so you could try it out and compare yourself if you're interested.

It will likely always be the case that there are proprietary cloud models that are more powerful than what you can run on a laptop. You can just do a whole lot more with terabytes of VRAM on multi-GPU clusters than you can do on a laptop. 
So for folks who must have the most capable, you're probably not going to want to leave Anthropic.

But right now, the models you can run on your laptop are comparable to the cloud models that were popular when vibecoding and Claude Code first took off.

MrScruff, 23 hours ago:
You really need to take the benchmarks with a massive pinch of salt. I’ve been testing local LLMs since the original llama and there’s nothing I’ve tried that is in the same category as Opus.

lambda, 23 hours ago:
Which Opus? They certainly outperform Claude 3 Opus.

Anyhow, feel free to try them out head to head on OpenRouter. I'd love to see someone write up their results, of a modern local sized open source model vs. frontier models from ~a year ago, on something other than the standard benchmarks.

mapontosevenths, 22 hours ago:
There's a guy on YouTube named Bijan Bowen who tests all the models (open and frontier) on a series of one/few shot programming exercises and has been for a long while now. You can pretty much watch him compare the results for any two models you're likely to be interested in.

I'm not affiliated, I just like his style and have found it handy. 
I know it's not very rigorous, but it's good enough for me and I've found his examples to pretty closely match the results I see in real life.

lambda, 21 hours ago:
OK, it looks like he did a browser OS test with both Claude 4 Opus and Qwen 3.6 35B-A3B.

Claude 4 Opus: https://youtu.be/J7omabtqnBM?t=193

Qwen 3.6 35B A3B: https://youtu.be/gVU-DQeqkI0?t=215

Qwen 3.6 produced far more working functionality than Claude 4 Opus did.

Obviously, just one test of a single one-shot prompt of a silly toy OS, but yeah, this particular test shows Qwen 3.6 running locally dramatically outperforming Claude 4 Opus, which was a frontier model a year ago.

MrScruff, 23 hours ago:
I’m normally comparing frontier open/cheap models against frontier closed source. I use deepseek/glm regularly, they’re fine and you can get real work done with them but it’s super obvious when you switch back to opus or even sonnet. A 3B active param MoE model is not comparable.

lambda, 21 hours ago:
Yeah. I was pointing out that local 3b active models outperform frontier models from a year ago.

Will this trend continue? Who knows. Both the frontier and local models will probably continue to get better. Which one will hit the top of the S-curve first? Hard to say, really. But what you can do right now locally is better than what you could do a year ago on the frontier, and lots of people were already using it pretty heavily a year ago.

However, November is when most folks agree that the frontier models got good enough for much of their work. Local models aren't quite there yet (where by "local" I mean "can run at reasonable speed and quant on a system less than $10,000 with today's RAM and GPU prices"). 
The biggest open weights models are getting there, but those require something like an 8x H100 server to reasonably run.

It's likely that there will always be a gap between frontier and local if you're comparing models at the same time; you can just do a lot more with terabytes of HBM than gigabytes of DDR. But will local models get good enough to be usable for useful work? For many folks, they already are.

shimman, 16 hours ago:
Agreed, but at their current prices Deepseek + GLM are clear winners in my book. This weekend I spent $5 between the two, whereas I'd probably have to pay $20-30 to Anthropic (and that's still with the massive VC subsidies).

For web development (or anything else with an extreme amount of training data) it's number one for sure. You can't beat it at its costs. US companies will not be able to compete on a competitive market, which is why they rely on so much US government protection + corporate welfare.

zozbot234, 1 day ago:
People can't seem to agree on what "Opus class" even means (the latest Opus is apparently pretty weak) but DeepSeek Pro, Kimi and GLM all are quite capable.

computerex, 23 hours ago:
Nothing compares to Opus when it comes to "taste" in web design in my experience. Nothing compares to Opus in very difficult HPC/model inference development. I worked on this with Opus: https://github.com/computerex/dlgo

OpenAI was offering 2x usage at one point and I still used Opus just because it's so much more effective.

lambda, 20 hours ago:
Which Opus?

Anthropic has been releasing models named Opus since 2024 with Claude 3 Opus.

Opus has gotten vastly more capable since then.

Local models far surpass Opus 3. They even surpass Opus 4 on most benchmarks.

Sure, if you compare to the latest Opus 4.8 or even 4.6, they're not there yet. 
But there's a huge difference in performance between 4 and 4.8.

jkells, 19 hours ago:
Can't speak for anyone else but there was a step change in frontier models last November. Opus 4.5 and GPT 5.2 I think.

When I colloquially say Opus level I really mean Opus 4.5 or later.

lambda, 19 hours ago:
Right. Local models haven't quite hit that level yet. The biggest open models, which you need tens of thousands of dollars of hardware to run at reasonable speed, have pretty much hit that level of capability, but most models you can reasonably run at home aren't quite there yet. But given the gap, if local models keep improving, you'd expect to maybe see that level by this November.

zozbot234, 13 hours ago:
My understanding is that we could in fact run the largest models on "reasonable" home hardware by focusing on throughput rather than raw speed and having them do unattended inference in large batches. The big proprietary suppliers have no interest in this because their own incentive is to fill all the physical space available with top-performing hardware and do huge amounts of inference as quickly as possible. A home user with limited hardware investment has very different constraints.

rvnx, 23 hours ago:
To me totally yes, even further, if they keep their existing route, over time people will stop using Anthropic.

More and more specialized and ultra-performant chips are going to flood the consumer market. 
Especially once new hardware foundries start producing (well, if we don't die from WW3 in the interval).

In 10 years from now, when even basic computers will have 128 GB of memory, and phones will have super optimized tuned models, then what will be the point of Anthropic?

Just use Gemma/Gemini/Siri or whatever.

Pornography and uncensored models are also pushing toward local models.

It's not like the needs of people grow exponentially; the needs follow an asymptote instead (they are capped).

The real revolution is offline robots and self-driving cars, but LLMs are already quite maxed.

For programmers, now, what Anthropic offers is like a 3% improvement on a known test (like this pelican riding a bicycle), or on questions leaked from benchmark insiders.

It's ok but not like revolutionary (Fable was better but it was unusable, easy 20 minutes per one prompt due to overthinking).

spullara, 20 hours ago:
This is the only setup that I think is reasonable to use locally right now. I had an agent set it up for me from this guy's recipe:

https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent...

One thing I did change was the context length to 256k rather than 64k.

nicman23, 13 hours ago:
About the edit tool: it is almost always trailing white spaces. If you give it a skill with a sed 's/( )*$//g' or something like that, it speeds things up.

awllau 17 hours ago | parent | pr