Comments:
What's the link for the range join optimization reference?
Watched it
The best talk about Spark optimizations on YouTube by far, thanks man!
Awesome awesome awesome
Are all pandas UDFs vectorized?
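All pandas UDF flavors receive data in pandas batches over Arrow, which is what "vectorized" refers to, whereas a plain Python udf is called once per row. A minimal sketch for comparison (the column and function names are made up):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).selectExpr("cast(id % 40 as double) as celsius")

# pandas UDF: invoked with whole pd.Series batches (Arrow), i.e. vectorized
@pandas_udf(DoubleType())
def to_fahrenheit_vec(c: pd.Series) -> pd.Series:
    return c * 9 / 5 + 32

# plain Python UDF: invoked once per row, not vectorized
@udf(DoubleType())
def to_fahrenheit_row(c):
    return c * 9 / 5 + 32

df.select(to_fahrenheit_vec("celsius"), to_fahrenheit_row("celsius")).show(3)
```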
Could you share the slides?
If a system is built so that you have to tweak it and understand it this much to get good performance, I would say it should be redesigned.
This video is pure gold.
Thanks, Daniel, great talk. Please share the ebook link.
Good insights, really helpful.
How do you set spark.sql.shuffle.partitions from a variable instead of a constant? Meaning, if the shuffle input data size is small, the job should automatically choose fewer shuffle partitions, and if a stage's shuffle input is large, the job should programmatically determine the right partition count rather than being given a constant value.
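One way to do that (a sketch; the size estimate, the ~200 MB per-partition target, and the variable names are assumptions, not an official formula) is to derive spark.sql.shuffle.partitions from the expected shuffle stage input and set it before the wide transformation runs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed: the shuffle stage input size is known or estimated, e.g. from the
# Spark UI of a previous run or from the size of the source data.
shuffle_input_bytes = 210 * 1024**3        # hypothetical 210 GB shuffle input
target_partition_bytes = 200 * 1024**2     # ~200 MB per shuffle partition (rule of thumb)
total_cores = spark.sparkContext.defaultParallelism

partitions = max(total_cores, shuffle_input_bytes // target_partition_bytes)
spark.conf.set("spark.sql.shuffle.partitions", str(partitions))

# ...now run the wide transformation (join / groupBy) that triggers the shuffle.
```

On Spark 3.x, adaptive query execution (spark.sql.adaptive.enabled together with spark.sql.adaptive.coalescePartitions.enabled) can coalesce shuffle partitions toward a target size automatically.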
What command do we use to use all 96 cores while writing, instead of only 10?
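The number of write tasks equals the number of partitions of the DataFrame being saved, so the usual fix is to repartition right before the write. A sketch, with the data and output path as placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100_000_000).toDF("id")               # placeholder data

total_cores = spark.sparkContext.defaultParallelism    # e.g. 96

(df.repartition(total_cores)       # one write task per core (or a multiple of it)
   .write
   .mode("overwrite")
   .parquet("/tmp/output"))        # placeholder path
```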
Good job, Daniel Tomes. It helps a lot.
Great talk! Many thanks.
Btw, where is the book, Daniel? :)
God Level
How did you come up with the 16 MB maxPartitionBytes? Is there a general formula for it?
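A common heuristic (this is a reconstruction, with numbers chosen only to reproduce 16 MB) is to size spark.sql.files.maxPartitionBytes so the scan produces roughly one read partition per core, i.e. total input size divided by a multiple of the core count:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical numbers to show the arithmetic: 1.5 GB of input files read
# on 96 cores gives 16 MB per read partition.
input_bytes = 1536 * 1024**2      # assumed total size of the input files
total_cores = 96                  # assumed cluster size
partitions_wanted = total_cores   # one read task per core

max_partition_bytes = input_bytes // partitions_wanted   # = 16 MB here

spark.conf.set("spark.sql.files.maxPartitionBytes", str(max_partition_bytes))
```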
Excellent explanations.
Cleared so many wrong concepts of mine.
Thanks man!!!!
Hi... Great presentation for understanding Spark optimizations. Are the presentation slides available somewhere? In the video it's a little difficult to read those numbers.
Why is the first example a valid comparison? You reduced the size of the data you are working with, so obviously it will run faster. What if you actually need to process all years instead of just two?
What do you mean by saying "have an array of table names and parallelize it"? What does "parallelize" mean here?
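This most likely refers to driver-side parallelism: instead of looping over tables one at a time, you submit independent Spark jobs for several tables at once (in Scala that is typically done with a parallel collection). A rough PySpark equivalent using a thread pool, with made-up table names and paths:

```python
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
tables = ["sales", "customers", "orders", "inventory"]    # hypothetical table names

def process(table):
    # each call submits its own Spark job; the scheduler runs their tasks concurrently
    (spark.table(table)
          .write.mode("overwrite")
          .parquet(f"/tmp/out/{table}"))                   # hypothetical target path

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(process, tables))
```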
If you are adding a "salt" column in the groupBy, wouldn't it give wrong results if we actually need the groupBy aggregation results?
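It stays correct if you aggregate in two steps: first on (key, salt) to spread the skewed key across partitions, then on the key alone to merge the partial results. A sketch with made-up columns and a salt range of 32:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy skewed data: a handful of keys, values to aggregate (columns are made up).
df = (spark.range(1_000_000)
           .withColumn("key", (F.col("id") % 3).cast("string"))
           .withColumn("amount", F.rand()))

salted = df.withColumn("salt", (F.rand() * 32).cast("int"))

# Stage 1: partial aggregation on (key, salt) breaks up the skewed key.
partial = salted.groupBy("key", "salt").agg(F.sum("amount").alias("partial_sum"))

# Stage 2: final aggregation on key alone restores the correct totals.
result = partial.groupBy("key").agg(F.sum("partial_sum").alias("total_amount"))

result.show()
```

This two-step rewrite is exact for re-aggregatable functions such as sum and count; for something like avg you carry the sum and the count through stage 1 and divide at the end.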
Thank you so much for explaining salt addition clearly.
@45 min, why does the broadcast have 4 times 12 = 48? It should be 3 times 12 = 36, right, since we have 3 executors?
@22 min, where did you get the Stage 21 shuffle input size?
In my Spark 2.4.3 job, after all my transformations, computations and joins, I am writing the final dataframe to S3 in Parquet format.
But irrespective of my core count, the job takes a fixed amount of time to complete the save action.
For core counts of 8, 16 and 24, the write action time stays fixed at 8 minutes.
Because of this my solution does not scale.
How should I make it scale so that the overall job execution time becomes proportional to the cores used?
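If the save time stays at 8 minutes no matter how many cores are added, the write stage is most likely limited by its task count rather than by the cluster, since the number of write tasks equals the number of partitions of the DataFrame being saved. A quick diagnostic sketch (the DataFrame and bucket are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
final_df = spark.range(50_000_000).toDF("id")        # stand-in for the real final dataframe

n = final_df.rdd.getNumPartitions()
total_cores = spark.sparkContext.defaultParallelism
print(f"write stage will run {n} tasks on {total_cores} cores")

if n < total_cores:
    final_df = final_df.repartition(total_cores)

final_df.write.mode("overwrite").parquet("s3a://my-bucket/output/")   # hypothetical bucket
```

If the task count already scales with the cores and the time still doesn't, the fixed cost may be the S3 commit/rename phase rather than the tasks themselves, which extra cores won't help with.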
I can never do all of it
Can we get a link to the slides? There are tons of small details on them that will be easier to go through with the slides rather than pausing the video every time. :)
Super talk, Daniel, and great insights; still waiting for the ebook though :)
Excellent talk, Daniel 👍 I wish I had seen it when I started with Spark :-) How's it looking with the mentioned e-book?
Is lazy loading just a matter of adding a filter?
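Roughly, yes, with the caveat that the filter must come before the first action and, ideally, hit a partition column so Spark can prune files. A minimal sketch, assuming a dataset partitioned by a year column (path and column are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Transformations are lazy: nothing is read until the action below runs,
# so the filter can be pushed down and used for partition pruning.
df = (spark.read.parquet("/data/events/")      # hypothetical path, partitioned by year
           .filter("year IN (2000, 2001)"))

df.groupBy("year").count().show()              # only the 2000-2001 files get scanned
```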
Is there a GitHub repo or some other place with the data and code used for this exercise?
Great talk, with a lot of takeaways!! Are there any references to the notebooks and datasets so I can recreate some of the optimizations?
He mentioned that the slide deck would be available. Does anyone know where to find it?
Hello folks, thanks for all the support. Sorry for the delay on the ebook; it's still coming, it was just delayed. I will share it here as soon as it's available. I'm hoping Q1 this year. :)
It is very helpful! Can someone share the ebook?
In the lazy-loading example, he filtered the years to 2000-2001; what if the calculation needs to be done for all the years? You can't use a filter in that case, right?
Is the e-book available?
Great one
Is the ebook available?
Great talk! Really learned a lot, looking forward to the book!
Could you share the slides for this talk?