cchianel's comments

That depends; do you want the optimal solution?

If so, I agree it is impossible for a fully general problem solver to find the optimal solution to a problem in a reasonable amount of time (unless P = NP, which is unlikely).

However, if a "good enough" solution that is only 1% worse than optimal works, then a fully general solver can do the job in a reasonable amount of time.

One example of a fully general solver is Timefold: you express your constraints using plain old Java objects, so in theory you can do whatever you want in your constraint functions (you can even make network calls, though that is extremely ill-advised, since it will drastically slow down score calculation).

Disclosure: I work for Timefold.


No; guaranteeing that a solution to a general computational puzzle, found in a finite amount of time, is within some percentage of optimality is impossible. You must be talking about a restricted class of problems that enjoy some kind of tractability guarantee.


By 1% of optimal, I was giving an example percentage to clarify that solutions exist that are almost as good as optimal and that can be found in a reasonable amount of time.

There cannot be a guarantee of finding a solution within a given percentage of optimal for a fully general problem, since you would need to know the optimum to give such a guarantee (and since the problem is fully general, you cannot use the structure of the problem to reduce it).

Most constraint problems have many feasible solutions, and have a way to judge how much worse or better one solution is than another.

There are good and bad ways to write constraints.

One bad way to write constraints is score traps, where a clearly better solution has the same score as a clearly worse solution.

For example, for shift scheduling, a solution with only 1 overlapping shift with the same employee is better than a solution with 2 overlapping shifts with the same employee.

A bad score function would penalize both solutions by 1, meaning the solver has no idea which of the two solutions is better.

A good score function would penalize the schedule with 1 overlapping shift with the same employee by 1, and the schedule with 2 overlapping shifts with the same employee by 2.
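
To make the difference concrete, here is a minimal sketch (has_any_overlap and count_overlapping_pairs are hypothetical helpers, not Timefold API):

    def bad_score(schedule):
        # Score trap: every schedule with at least one overlap gets the same
        # score, so the solver cannot tell 1 overlapping shift from 2.
        return -1 if has_any_overlap(schedule) else 0

    def good_score(schedule):
        # Penalize each overlapping pair individually, so removing a single
        # overlap always improves the score.
        return -count_overlapping_pairs(schedule)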

The class of problems I am talking about is the class of problems where you can assign a score to a possible solution, with limited score traps.

Timefold has no guarantees about finding a solution in reasonable time (but unless you've done something terribly wrong or have a truly massive dataset, it finds a good solution really quickly 99.99% of the time). Instead, you set the termination condition of the solver; it could be time-based (say 60 minutes), based on unimproved time spent (solve until no better solution has been found for 60 minutes), or the first feasible solution (there are other termination conditions that can be set).


I think from what you say that you are not familiar with the content of the No Free Lunch theorem.

It says that, averaging over all possible objective functions, no optimization algorithm is better than the median.

If you don't know anything about the objective, then the probability that Timefold is better than a randomly chosen optimization algorithm is 50%.

The key is the point about averaging over all possible objective functions. You don't presumably expect to be even close to successful on such a general set.


I thought "No Free Lunch Theorem" was a joke from its name (although I should know better since "Hairy Ball Theorem" exists).

So if I understand correctly, it states that every optimization algorithm must choose an order in which to evaluate solutions, and the sooner it evaluates the "best" solution, the "better" the algorithm.

For example, say the search space is {"a", "b", "c"} and the best solution is "c". Then there are 3! = 6 orders in which to evaluate all the solutions:

"a", "b", "c"

"a", "c", "b"

"b", "a", "c"

"b", "c", "a"

"c", "a", "b"

"c", "b", "a"

(of course, in a real problem, the search space is much, MUCH larger).

The way Timefold tackles this is move selectors. In particular, you can design custom moves that use knowledge of your particular problem, which in turn affects when certain solutions are evaluated. Additionally, you can design custom phases that run your own algorithm and then pass the solution found to local search. One of the projects we are currently working on is the neighborhood API, to make it much easier to write your own custom moves. That being said, for most problems that humans deal with, you don't need to configure Timefold at all and can just work with the defaults (which are best-fit construction followed by a late acceptance local search with change and swap moves).

For what it's worth: metaheuristic solvers tend to be better than linear solvers for shift scheduling and vehicle routing, but worse for bin packing. Even then, they still work, and are still fast.


Some additional optimization resources (for metaheuristics, where you only have the objective/score function and no derivative):

- "Essentials of Metaheuristics" by Sean Luke https://cs.gmu.edu/~sean/book/metaheuristics/

- "Clever Algorithms" by Jason Brownlee https://cleveralgorithms.com/

Timefold uses the metaheuristic algorithms in these books (Tabu Search, Late Acceptance, Simulated Annealing, etc.) to find near-optimal solutions quickly from a score function (typically defined in a Java stream-like/SQL-like syntax so scores can be calculated incrementally, which improves score calculation speed).

You can see simplified diagrams of these algorithms in action in Timefold's docs: https://docs.timefold.ai/timefold-solver/latest/optimization....

Disclosure: I work for Timefold.


Timefold looks very interesting. This might be irrelevant but have you looked at stuff like InfoBax [1]?

[1] https://willieneis.github.io/bax-website/


I haven't; from a quick reading, InfoBax is for when you have an expensive function and want to do limited evaluations. Timefold works with cheap functions and does many evaluations, and it keeps those evaluations cheap via Constraint Streams: where a function like

    var score = 0;
    for (var shiftA : solution.getShifts()) {
        for (var shiftB : solution.getShifts()) {
            if (shiftA != shiftB && shiftA.getEmployee() == shiftB.getEmployee() && shiftA.overlaps(shiftB)) {
                score -= 1;
            }
        }
    }
    return score;
would usually take |shifts| * |shifts| evaluations of overlaps, Constraint Streams only re-checks the shifts affected by a change (turning it from O(N^2) to O(1), usually).
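
For comparison, a sketch of what that constraint might look like in Timefold's Constraint Streams (using the Python API for brevity, and assuming a Shift class with employee, start and end attributes):

    from timefold.solver.score import (ConstraintFactory, HardSoftScore,
                                       Joiners, constraint_provider)

    @constraint_provider
    def define_constraints(cf: ConstraintFactory):
        return [
            # Only the pairs matched by the joiners are re-evaluated when a
            # shift changes, which is what makes scoring incremental.
            cf.for_each_unique_pair(
                    Shift,
                    Joiners.equal(lambda shift: shift.employee),
                    Joiners.overlapping(lambda shift: shift.start,
                                        lambda shift: shift.end))
              .penalize(HardSoftScore.ONE_SOFT)
              .as_constraint('Overlapping shifts'),
        ]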

That being said, InfoBax might be useful for a move selector. I need to give it a more in-depth reading.


Thanks for the example. Yes, true, this is for expensive functions - to be precise, functions that depend on data that is hard to gather, so you interleave the process of computing the value of the function with gathering, strategically, just as much data as is needed to compute the function value. The video on their page [1] is quite illustrative: calculate the shortest path on a graph where the edge weights are expensive to obtain. Note how the edge weights they end up obtaining form a narrow band around the shortest path they find.

[1] https://willieneis.github.io/bax-website/


Games on old consoles such as the Nintendo 64 and the PlayStation use a variety of tricks to be playable on limited hardware. For the specific example of Final Fantasy VII, there is one trick that allows you to skip almost the entirety of Disc 1 once you reach the open world. In particular, the trick involves a use-after-free in the world collision cache (fun fact: a lot of speedrun techniques are security flaws!). This video explains it without spoilers: https://www.youtube.com/watch?v=llQnrUHS0yA ; a better video is https://www.youtube.com/watch?v=JHn6jpKsYCk, but that might have minor spoilers IIRC (Note: this trick is relatively new, and probably not used in that video).


> the trick involves a use-after-free in the world collision cache (fun fact: a lot of speedrun techniques are security flaws!)

In other words:

> Hey, you know those super difficult parts from games of your childhood that you're probably curious to see how we, the supposed speediest of gamers, swiftly navigate? Well, guess what? We just skip them altogether by glitching! Any% new record!

That's why no one cares about this. These people will play the same game or level thousands of times to trigger 1/1,000,000 glitches to skip huge swaths of games, then claim some made-up "record."

This is not skill. Even the non-glitch speedruns are more down to persistence (if not mental illness) than skill--which is a result of deliberate practice--like one sees in competitive shooters or fighting games. And these people throw tantrums constantly, like children, and curse the "RNG" (an admission this isn't about skill) the way a primitive tribesman might damn the gods for a prolonged drought.


If only speed running communities existed, where they made different categories like any%, any% No major glitches/skips, all with varying prestige and competitiveness, as well as agreed upon which skips are so soul crushingly shit to do that they're excluded from every category aside from any% :(

You might consider therapy, if playing games fast makes you angry.


They are being genuine, no need to be toxic


I appreciate that calling a hobby "no one" (read: many people) enjoy a mental illness is genuine, but I think I find the grandparent much less toxic than the comment it replies to. There's no need to shame people for things they find interesting.


> If only speed running communities existed, where they made different categories like any%, any% No major glitches/skips, all with varying prestige and competitiveness, as well as agreed upon which skips are so soul crushingly shit to do that they're excluded from every category aside from any% :(

Arbitrarily delineating good vs. bad glitches is dumb, and even glitch-free speedruns are more about persistence, just trying over and over again until you get the exact right sequence of inputs (or "RNG") to claim the record, rather than repeatable skill. This isn't like some high-level CS:GO player who can consistently headshot fast-moving targets or a high-level SF player who's mastered techniques like hit confirms.

> You might consider therapy, if playing games fast makes you angry.

I'm not the one downvoting comments, and unlike speedrunners, I don't throw tantrums over games.


> This isn't like some high-level CS:GO player who can consistently headshot fast-moving targets

Saying this about the one competitive shooter that has random first shot inaccuracy is an incredibly funny take.

>just trying over and over again until you get the exact right sequence of inputs (or "RNG") to claim the record, rather than repeatable skill.

There are pretty much zero active speedrun games that require "the right RNG" to win. There are some games with explicitly bad random things that are guaranteed to fuck up a run, like RE4, but skilled players all have backup strategies, and most top runs did not get perfect RNG everywhere. Sums of best on most games are 20% lower than the record time. Do you know why? Because runs that end up requiring a 1/100000 strategy to win get bumped into different categories, otherwise the category dies. Sometimes, a reliable way to do that skip (like, say, Mi'ihen skip in FFX) is found, and what was once a cosmic bit flip is now a regular route that everyone knows and gets done.

Some players reset after 3 minutes, because they cannot accept the idea of having a suboptimal start, but runners like Siglemic or ZFG, who are responsible for the growth of speedrunning as a watchable thing, got there because they push runs to the end, no matter what. As for "no repeatable skills": runners learned to manual superswim in The Wind Waker (requiring frame-perfect input, sometimes without pause buffering), have game knowledge to the point where a randomizer ALTTP run is a fun live speedrun challenge, can set up ACE on games like Super Mario World, and so, so many others.

How many failed BLJs and 70-star stairs skips did it take you to get so angry at people playing games fast?


It reads like you said "Arbitrarily delineating good vs. bad glitches is dumb", then immediately proceeded to arbitrarily delineate good vs. bad skill.

People who don't like CS:GO-type games won't find any value or interest in some "player who can consistently headshot fast-moving targets". Some people just like seeing their favorite game break in unexpected ways. It's a bit like going into someone else's home in the same building with the same floor plan, or just the mirror tracks in a racing game: everything is uncannily familiar, and this uncanny feeling is enjoyed. Maybe you just don't see the appeal, which is fine as well.


Related: Archipelago (https://archipelago.gg/), a framework that randomizes multiple games, across multiple clients. For example, Player A is playing Mario 64, and Player B is playing Pokemon Red. Some items from Player B's Pokemon Red (such as the Rock Smash HM) would be in Player A's Mario 64 world (for example, from collecting a Star). As a result, the two players need to coordinate and help each other progress.

Archipelago GitHub: https://github.com/ArchipelagoMW/Archipelago


N-Queens can be solved (find a single valid solution) in polynomial time [1]. That being said, Local Search is a powerful technique that can solve a lot of problems such as employee scheduling, vehicle routing, or maintenance scheduling. You might want to take a look at Timefold [2], which is a local search solver that can be configured to use simulated annealing, tabu search, late acceptance and other local search algorithms. Note: I work for Timefold.

[1] https://stackoverflow.com/a/13109557

[2] https://solver.timefold.ai/
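
For a flavor of what local search looks like, here is a minimal min-conflicts solver for N-Queens (a textbook local search sketch, unrelated to Timefold's implementation):

    import random

    def min_conflicts_nqueens(n, max_steps=100_000):
        # queens[col] is the row of the queen in that column
        queens = [random.randrange(n) for _ in range(n)]

        def conflicts(col, row):
            return sum(1 for other in range(n)
                       if other != col
                       and (queens[other] == row
                            or abs(queens[other] - row) == abs(other - col)))

        for _ in range(max_steps):
            conflicted = [c for c in range(n) if conflicts(c, queens[c]) > 0]
            if not conflicted:
                return queens
            col = random.choice(conflicted)
            # Move the queen to the row with the fewest conflicts in its column.
            queens[col] = min(range(n), key=lambda row: conflicts(col, row))
        return None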


The primary reason, in my opinion, is that the vast majority of Python libraries lack type annotations (this includes the standard library). Without type annotations, there is very little for a non-JIT compiler to optimize, since:

- The vast majority of code generation would have to be dynamic dispatches, which would not be too different from CPython's bytecode.

- Types are dynamic; the methods on a type can change at runtime due to monkey patching. As a result, the compiler must be able to "recompile" a type at runtime (and thus, you cannot ship optimized target files).

- There are multiple ways every single operation in Python might be executed; for instance, `a.b` either does a __dict__ lookup or a descriptor lookup, and you don't know which is used unless you know the type (and if that type is monkey patched, then the method that is called might change); see the sketch below.
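
A small illustration of why `a.b` is ambiguous at compile time:

    class Widget:
        def size(self):
            return 1              # resolved via the descriptor protocol on the class

    w = Widget()
    print(w.size())               # 1

    w.size = lambda: 2            # instance __dict__ now shadows the class attribute
    print(w.size())               # 2

    Widget.size = lambda self: 3  # monkey patching changes every other instance
    print(Widget().size())        # 3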

A JIT compiler might be able to optimize some of these cases (by observing the actual types used), and unlike an ahead-of-time compiler, a JIT compiler can work from the source file directly or be included in the CPython interpreter.


You make a great point — type information is definitely a huge part of the challenge.

I'd add that even beyond types, late binding is fundamental to Python’s dynamism: Variables, functions, and classes are often only bound at runtime, and can be reassigned or modified dynamically.

So even if every object had a type annotation, you would still need to deal with names and behaviors changing during execution — which makes traditional static compilation very hard.

That’s why PyXL focuses more on efficient dynamic execution rather than trying to statically "lock down" Python like C++.


Solved by the Smalltalk, Self, and Lisp JITs, which are at the genesis of JIT technology; some of that work landed in HotSpot and V8.


Python, starting with 3.13, also has a JIT available.


Kind of: you have to compile it yourself, and it is rather basic; still early days.

PyPy and GraalPy are where the fun is; however, they are largely ignored outside their language research communities.


"Addressed" or "mitigated" perhaps. Not "solved." Just "made less painful" or "enough less painful that we don't need to run screaming from the room."


Versus what most folks do with CPython, it is indeed solved.

We are very far from having a full single user graphics workstation in CPython, even if those JITs aren't perfect.

Yes, there are a couple of ongoing attempts, while most in the community would rather write C extensions.


Is "single user graphics workstation" even still a goal? Great target in the Early to Mid Ethernetian when Xerox Dorados and Dandelions, Symbolics, and Liliths roamed the Earth. Doesn't feel like a modern goal or standard of comparison.

I used those workstations back in the day—then rinsed and repeated with JITs and GCs for Self, Java, and on to finally Python in PyPy. They're fantastic! Love having them on-board. Many blessings to Deutsch, Ungar, et al. But for 40 years JIT's value has always been to optimize away the worst gaps, getting "close enough" to native to preserve "it's OK to use the highest level abstractions" for an interesting set of workloads. A solid success, but side by side with AOT compilation of closer-to-the-machine code? AOT regularly wins, then and now.

"Solved" should imply performance isn't a reason to utterly switch languages and abstractions. Yet witness the enthusiasm around Julia and Rust e.g. specifically to get more native-like performance. YMMV, but from this vantage, seeing so much intentional down-shift in abstraction level and ecosystem maturity "for performance" feels like JIT reduced but hardly eliminated the gap.


"Single-user graphical workstation" may not be a great goal anymore, but it's at least a sobering milestone to keep failing to reach.

AFAIK there isn't an AOT compiler from JVM bytecode to native code that's competitive with either HotSpot or Graal, which are JIT compilers. But the JVM semantics are much less dynamic than Python or JS, whose JIT compilers don't perform nearly as well. Even Jython compiled to JVM bytecode and JITted with HotSpot is pretty slow.

However, LuaJIT does seem to be competitive with AOT-compiled C and with HotSpot, despite Lua being just as dynamic as Python and more so than JS.


It is solved to the point that users in those communities are not writing extensions in C all the time to compensate for the interpreter implementation.

AOT winning over JITs on microbenchmarks hardly matters in a meaningful way for most business applications, especially when JIT caches and PGO data sharing across runs are part of the picture.

Sure, there are always going to be use cases that require AOT, but in most of them it is due to deployment constraints more than anything else.

Most mainstream devs don't even know how to use PGO tooling correctly from their AOT toolchains.

Heck, how many Electron apps do you have running right now?


> We are very far from having a full single user graphics workstation in CPython, even if those JITs aren't perfect.

Some years ago there was an attempt to create a Linux distribution including a Python userspace, called Snakeware. But the project has since gone inactive. See https://github.com/joshiemoore/snakeware


I fail to find anything there related to having good enough performance for a desktop system written in Python.


Sugar is built with Python

https://github.com/sugarlabs/sugar


> The primary reason, in my opinion, is the vast majority of Python libraries lack type annotations (this includes the standard library).

When type annotations are available, it's already possible to compile Python to improve performance, using Mypyc. See for example https://blog.glyph.im/2022/04/you-should-compile-your-python...
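
As a quick illustration, Mypyc benefits most from fully annotated code with no dynamic tricks, e.g.:

    # fib.py: compile with `mypyc fib.py`; `import fib` then loads the C extension
    def fib(n: int) -> int:
        return n if n < 2 else fib(n - 1) + fib(n - 2)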


As someone who did a CPython Bytecode → Java bytecode translator (https://timefold.ai/blog/java-vs-python-speed), I strongly recommend against the CPython Bytecode → PySM Assembly step:

- CPython Bytecode is far from stable; it changes every version, sometimes changing the behaviour of existing bytecodes. As a result, you are pinned to a specific version of Python unless you make multiple translators.

- CPython Bytecode is poorly documented, with some descriptions being misleading/incorrect.

- CPython Bytecode requires restoring the stack on exception, since it keeps a loop iterator on the stack instead of in a local variable.

I recommend instead doing CPython AST → PySM Assembly. CPython AST is significantly more stable.
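
For illustration, the AST route might look like this (the emitted mnemonics here are made up; they are not real PySM instructions):

    import ast

    class ToyTranslator(ast.NodeVisitor):
        def visit_BinOp(self, node):
            self.visit(node.left)
            self.visit(node.right)
            if isinstance(node.op, ast.Add):
                print('BINARY_ADD')     # hypothetical target instruction

        def visit_Name(self, node):
            print(f'LOAD {node.id}')

    # Prints: LOAD a / LOAD b / BINARY_ADD
    ToyTranslator().visit(ast.parse('a + b', mode='eval'))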


Thanks — really appreciate your insights.

You're absolutely right that CPython bytecode changes over time and isn’t perfectly documented — I’ve also had to read the CPython source directly at times because of unclear docs.

That said, I intentionally chose to target bytecode instead of AST at this stage. Adhering to the AST would actually make me more vulnerable to changes in the Python language itself (new syntax, new constructs), whereas bytecode changes are usually contained to VM-level behavior. It also made it much easier early on, because the PyXL compiler behaves more like a simple transpiler — taking known bytecode and mapping it directly to PySM instructions — which made validation and iteration faster.

Either way, some adaptation will always be needed when Python evolves — but my goal is to eventually get to a point where only the compiler (the software part of PyXL) needs updates, while keeping the hardware stable.


CPython bytecode changes behaviour for no reason and very suddenly, so you will be vulnerable to changes in Python language versions anyway. A few off the top of my head:

- In Python 3.11, jumps changed from absolute indices to relative indices

- In Python 3.11, cell variables index is calculated differently for cell variables corresponding to parameters and cell variables corresponding to local variables

- In Python 3.11, MAKE_FUNCTION has the code object at the TOS instead of the qualified name of the function

For what it's worth, I created a detailed behaviour of each opcode (along with example Python sources) here: https://github.com/TimefoldAI/timefold-solver/blob/main/pyth... (for up to Python 3.11).
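
You can see this instability for yourself: disassemble the same function on different versions and compare the output (for instance, the names and encodings of the conditional jump opcodes changed between 3.10 and 3.11):

    import dis

    def f(x):
        if x:
            return 1
        return 2

    dis.dis(f)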


This was my first thought as well. They will be stuck at a certain Python version.


(Disclosure: I work for Timefold) OR-Tools is a MIP/SAT solver, while Timefold is a metaheuristic solver. As a result, the developer experience is quite different:

- In OR Tools, you need to write your constraints as mathematical equations. You can only provide your own function in a few scenarios, such as calculating the distance between two points. This allows OR Tools to eliminate symmetries and calculate gradients to improve its solution search. In Timefold, your constraints are treated like a black box, so your constraints can use any function and call any library you like (although you shouldn't do things like IO in them). However, since said constraints are a black box, Timefold is unable to eliminate symmetries or calculate gradients.

- In OR Tools, you have very little control over how it does its search. At most, you can select what algorithm/strategy it uses. In Timefold, you have a ton of control over how it does its search: from what moves it tries to adding your own custom phases. If none are configured, it will use sensible defaults. That being said, certain problems strongly benefit from custom moves.

- In OR Tools, you do not need to create custom classes; you create variables using methods on their domain model classes. In Timefold, you need to define your domain model by creating your own classes. This adds initial complexity, but it makes the code more readable: instead of your variable being an int, it is an Employee or a Visit. That being said, it can be difficult for someone to design an initial model, since it requires an understanding of the problem they are solving.

All in all, when using Timefold, you need to think in a more declarative way (think Prolog, SQL, etc.). For example, let's say you have a room assignment problem where a teacher can only teach in a single room at a time, and you want to minimize room changes. In Timefold, it may look like this:

    from dataclasses import dataclass, field
    from typing import Annotated

    from timefold.solver import SolverFactory
    from timefold.solver.config import (Duration, ScoreDirectorFactoryConfig,
                                        SolverConfig, TerminationConfig)
    from timefold.solver.domain import (PlanningEntityCollectionProperty,
                                        PlanningId, PlanningScore,
                                        PlanningVariable, ValueRangeProvider,
                                        planning_entity, planning_solution)
    from timefold.solver.score import (ConstraintFactory, HardSoftScore,
                                       Joiners, constraint_provider)

    # Typically would be stored in a parameterization object, but a global var
    # is used for brevity
    teacher_count = 3
    
    @planning_entity
    @dataclass
    class Room:
        id: Annotated[int, PlanningId]
        date: int
        name: str
        # This can also be a Teacher object, but str is used for brevity here
        teacher: Annotated[str, PlanningVariable] = field(default=None)
    
    
    @planning_solution
    @dataclass
    class Timetable:
        rooms: Annotated[list[Room], PlanningEntityCollectionProperty]
        teachers: Annotated[list[str], ValueRangeProvider]
        score: Annotated[HardSoftScore, PlanningScore] = field(default=None)


    @constraint_provider
    def timetable_constraints(cf: ConstraintFactory):
        return [
            cf.for_each_unique_pair(Room, Joiners.equal(lambda room: room.date), Joiners.equal(lambda room: room.teacher))
              .penalize(HardSoftScore.ONE_HARD)
              .as_constraint('Teacher time conflict'),
            cf.for_each_unique_pair(Room, Joiners.equal(lambda room: room.teacher))
              .filter(lambda a, b: a.name != b.name)
              .penalize(HardSoftScore.ONE_SOFT)
              .as_constraint('Minimize room change')
        ]


    solver_config = SolverConfig(
        solution_class=Timetable,
        entity_class_list=[Room],
        score_director_factory_config=ScoreDirectorFactoryConfig(
            constraint_provider_function=timetable_constraints
        ),
        termination_config=TerminationConfig(
            spent_limit=Duration(seconds=5)
        )
    )

    solver_factory = SolverFactory.create(solver_config)
    solver = solver_factory.build_solver()
    problem: Timetable = build_problem()
    solver.solve(problem)
Compared to OR Tools:

    from ortools.sat.python import cp_model

    teacher_count = 3
    problem: Timetable = build_problem()
    model = cp_model.CpModel()
    objective = []
    # One integer variable per room slot, holding the index of the assigned teacher.
    room_vars = [model.new_int_var(0, teacher_count - 1, f'room_assignment_{i}')
                 for i in range(len(problem.rooms))]
    room_vars_by_date = {date: [room_vars[i] for i in range(len(problem.rooms))
                                if problem.rooms[i].date == date]
                         for date in {room.date for room in problem.rooms}}
    room_vars_by_name = {name: [room_vars[i] for i in range(len(problem.rooms))
                                if problem.rooms[i].name == name]
                         for name in {room.name for room in problem.rooms}}
    # Hard constraint: all rooms on the same date must have different teachers.
    for date, date_vars in room_vars_by_date.items():
        model.add_all_different(date_vars)
    # Soft constraint: an inequality cannot be summed directly, so it is
    # reified onto a bool var that the objective can count.
    for name, name_vars in room_vars_by_name.items():
        for i, assignment_1 in enumerate(name_vars):
            for j in range(i + 1, len(name_vars)):
                assignment_2 = name_vars[j]
                changed = model.new_bool_var(f'changed_{name}_{i}_{j}')
                model.add(assignment_1 != assignment_2).only_enforce_if(changed)
                model.add(assignment_1 == assignment_2).only_enforce_if(changed.Not())
                objective.append(changed)

    model.minimize(sum(objective))
    solver = cp_model.CpSolver()
    status = solver.solve(model)


I had to deal with a lot of FFI to enable a Java Constraint Solver (Timefold) to call functions defined in CPython. In my experience, most of the performance problems from FFI come from using proxies to communicate between the host and foreign language.

A direct FFI call using JNI or the new foreign interface is fast, and has roughly the same speed as calling a Java method directly. Alas, the CPython and Java garbage collectors do not play nice, and require black magic in order to keep them in sync.

On the other hand, using proxies (such as in JPype or GraalPy) causes a significant performance overhead, since the parameters and return values need to be converted, and might cause additional FFI calls (in the other direction). The fun thing is, if you pass a CPython object to Java, Java has a proxy to the CPython object. And if you pass that proxy back to CPython, a proxy to that proxy is created instead of unwrapping it. The result: JPype proxies are 1402% slower than calling CPython directly using FFI, and GraalPy proxies are 453% slower than calling CPython directly using FFI.
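
For a feel of what the proxy route looks like, here is a minimal JPype callback sketch (the interface choice is arbitrary, and JVM setup details are elided):

    import jpype
    import jpype.imports

    jpype.startJVM()  # assumes a JVM is discoverable

    from java.util.function import Function

    @jpype.JImplements(Function)
    class ScoreFunction:
        @jpype.JOverride
        def apply(self, solution):
            # Called from Java: the argument arrives wrapped in a proxy, and
            # the return value is converted back; that is where the overhead lives.
            return solution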

What I ultimately ended up doing was translating CPython bytecode into Java bytecode, and generating Java data structures corresponding to the CPython classes used. As a result, I got a 100x speedup compared to using proxies. (Side note: if you are thinking about translating/reading CPython bytecode, don't; it is highly unstable, poorly documented, and its VM has several quirks that make it hard to map directly to other bytecodes).

For more details, you can see my blog post on the subject: https://timefold.ai/blog/java-vs-python-speed


Speaking from zero experience, the FFI stories of both Python and Java to C seem much better. Wouldn't connecting them via a little C bridge be a general solution?


JNI/the new Foreign FFI communicate with CPython via CPython's C API. The primary issue is getting the garbage collectors to work with each other. The Java solver works by repeatedly calling user-defined functions when calculating the score. As a result:

- The Java side needs to store opaque Python pointers which may have no references on the CPython side.

- The CPython side needs to store generated proxies for some Java objects (the results of constraint collectors, which are basically aggregations of a solution's data).

Solving runs a long time, typically at least an hour (although you can modify how long it runs for). If we don't free memory (by releasing the opaque Python pointer return values), we will quickly run out of memory after a couple of minutes. The only way to free memory on the Java side is to close the arena holding the opaque Python pointer. However, when that arena is closed, its memory is zeroed out to prevent use-after-free. As a result, if CPython hasn't garbage collected that pointer yet, it will cause a segmentation fault on the next CPython garbage collection cycle.
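
One common (if crude) way to keep CPython from collecting objects the foreign side still points at is a pin registry; this is just a sketch, not Timefold's actual mechanism:

    # Pin objects while a foreign runtime holds only an opaque handle to them.
    _pinned: dict[int, object] = {}

    def pin(obj) -> int:
        handle = id(obj)
        _pinned[handle] = obj   # CPython's GC can no longer collect obj
        return handle           # opaque handle to hand across the FFI boundary

    def release(handle: int) -> None:
        del _pinned[handle]     # allow collection once the foreign side is done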

JPype (a CPython -> Java bridge) does dark magic to link the JVM's and CPython's garbage collectors, but has performance issues when calling a CPython function inside a Java function, since its proxies have to do a lot of work. Even GraalPy, where Python is run inside a JVM, has performance issues when Python calls Java code which calls Python code.


Also see: cgo is not Go

  Go code and C code have to agree on how resources like address space, signal handlers, and thread TLS slots are to be shared — and when I say agree, I actually mean Go has to work around the C code's assumption. C code that can assume it always runs on one thread, or blithely be unprepared to work in a multi threaded environment at all.

  C doesn't know anything about Go's calling convention or growable stacks, so a call down to C code must record all the details of the goroutine stack, switch to the C stack, and run C code which has no knowledge of how it was invoked, or the larger Go runtime in charge of the program.

  It doesn't matter which language you’re writing bindings or wrapping C code with; Python, Java with JNI, some language using libFFI, or Go via cgo; it is C's world, you're just living in it.
https://dave.cheney.net/2016/01/18/cgo-is-not-go / https://archive.vn/GZoMK


How would IPC methods fit such cases?

Like talking over some queue, file, HTTP, etc.


IPC methods were actually used when constructing the foreign API prototype, since if you do not use JPype, the JVM must be launched in its own process. The IPC methods were used on the API level, with the JVM starting its own CPython interpreter, with CPython and Java using `cloudpickle` to send each other functions/objects.

Using IPC for all internal calls would probably incur significant overhead; the user functions are typically small (think `lambda shift: shift.date in employee.unavailable_dates` or `lambda lesson: lesson.teacher`). Depending on how many constraints you have and how complicated your domain model is, there could potentially be hundreds of context switches for a single score calculation. It might be worth prototyping, though.
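
For reference, the cloudpickle round trip mentioned above is as simple as this (the transport channel is elided):

    import cloudpickle

    payload = cloudpickle.dumps(lambda lesson: lesson.teacher)
    # ...send payload over the IPC channel of your choice...
    fn = cloudpickle.loads(payload)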


I had the misfortune of translating CPython bytecode to Java bytecode, and I do not wish that experience on anyone:

- CPython's bytecode is extremely unstable. Not only are new opcodes added/removed each release; the meaning of existing opcodes can change. For instance, the meaning of the argument to JUMP_IF_FALSE_OR_POP changes depending on the CPython version: in CPython 3.10 and below, it is an absolute address; in CPython 3.11 and above, it is a relative address.

- The documentation for the bytecode in dis tends to be outdated or outright wrong. I often had to analyze the generated bytecode to figure out what each opcode means (and then submit a corresponding PR to update said documentation). Moreover, it assumes you know the inner details of how CPython works, from the descriptor protocol to how binary operations are implemented (each of which is about a 30-line function when written in Python).

- CPython's bytecode is extremely atypical. For instance, a for-loop keeps its iterator on the stack instead of storing it in a synthetic variable. As a result, when an exception occurs inside a for-loop, instead of the stack containing only the exception, it will also contain the for-loop iterator.
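
You can see the iterator-on-the-stack behaviour in the disassembly of any for-loop:

    import dis

    def f(items):
        for item in items:
            print(item)

    dis.dis(f)
    # Note the GET_ITER / FOR_ITER pair: the iterator lives on the value
    # stack for the duration of the loop, not in a named local variable.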

As for why I did this, I have Java calling CPython in a hot loop. Although direct FFI is fast, it causes a memory leak due to Java's and Python's Garbage collectors needing to track each other's objects. When using JPype or GraalPy, the overhead of calling Python in a Java hot-loop is massive; I got a 100x speedup from translating the CPython bytecode to Java bytecode with identical behaviour (details can be found in my blog post: https://timefold.ai/blog/java-vs-python-speed).

I strongly recommend using the AST instead (although there are no backward compatibility guarantees for the AST either, it is far less likely to break between versions).

